GitHub Actions f4ef79def3 chore: repo cleanup by archiving plans / reports

2026-02-19 16:34:10 +00:00

CI Test Failures Remediation Plan

Date: 2026-02-16
Status: READY FOR IMPLEMENTATION
Encryption Key Issue: ✅ RESOLVED
Remaining Issue: 9 Test Failures

✅ CONFIRMED: Encryption Key Issue Resolved

Verification:

No "invalid key length" errors in CI logs
No "CHARON_ENCRYPTION_KEY is required" warnings
Secret format validated: r2Xfh5PUagXEVG1Qhg9Hq3ELfMdtQZx5gX0kvE23BHQ= decodes to exactly 32 bytes
Coverage: 85.1% ✅ (meets 85% requirement)

🔴 9 FAILING TESTS TO FIX

Test Failure Summary

FAILED TEST SUMMARY:
--- FAIL: TestMain_DefaultStartupGracefulShutdown_Subprocess (0.55s)
--- FAIL: TestGetAcquisitionConfig (0.00s)
--- FAIL: TestEnsureBouncerRegistration_ConcurrentCalls (0.01s)
--- FAIL: TestPluginHandler_ReloadPlugins_WithErrors (0.00s)
--- FAIL: TestFetchIndexFallbackHTTP (0.00s)
--- FAIL: TestRunScheduledBackup_CleanupFails (0.01s)
--- FAIL: TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite (0.03s)
--- FAIL: TestCredentialService_Delete (0.01s)
--- FAIL: TestCredentialService_GetCredentialForDomain_WildcardMatch (0.01s)

Context

These failures appear to be pre-existing issues unrelated to the encryption key work:

The user previously noted, while waiting on CI, that "we need to fix the other failing tests"
Backend Dev reported fixing some of these tests locally, but they still fail in CI
That gap suggests environment differences between local and CI runners

📋 REMEDIATION STRATEGY

Phase 1: Local Reproduction (Priority: HIGH)

Goal: Reproduce each failure locally to understand root causes.

Actions:

Run each failing test individually with verbose output
Document exact failure messages and stack traces
Identify patterns (timeout, race condition, environment dependency, etc.)
Check if failures are deterministic or flaky

Commands:

# Run each failing test on its own with verbose output.
# go test takes a package path here: passing a single _test.go file compiles
# that file in isolation and fails if it depends on the rest of its package.
# -count=1 bypasses the test cache; ^...$ stops -run matching other tests by prefix.
cd backend

# Test 1
go test -v -count=1 -run '^TestMain_DefaultStartupGracefulShutdown_Subprocess$' ./cmd/

# Test 2
go test -v -count=1 -run '^TestGetAcquisitionConfig$' ./internal/handlers/

# Test 3
go test -v -count=1 -run '^TestEnsureBouncerRegistration_ConcurrentCalls$' ./internal/services/

# Test 4
go test -v -count=1 -run '^TestPluginHandler_ReloadPlugins_WithErrors$' ./internal/api/handlers/

# Test 5
go test -v -count=1 -run '^TestFetchIndexFallbackHTTP$' ./internal/crowdsec/

# Test 6
go test -v -count=1 -run '^TestRunScheduledBackup_CleanupFails$' ./internal/services/

# Test 7
go test -v -count=1 -run '^TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite$' ./internal/services/

# Test 8
go test -v -count=1 -run '^TestCredentialService_Delete$' ./internal/services/

# Test 9
go test -v -count=1 -run '^TestCredentialService_GetCredentialForDomain_WildcardMatch$' ./internal/services/

Phase 2: Root Cause Analysis (Priority: HIGH)

Expected Failure Patterns:

Subprocess/Concurrency Tests (3 tests):

- TestMain_DefaultStartupGracefulShutdown_Subprocess
- TestEnsureBouncerRegistration_ConcurrentCalls
- TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite
- Hypothesis: Race conditions or timing issues in the CI environment
- Check: How CI CPU and timing characteristics differ from local runs

File System Tests (2 tests):

- TestRunScheduledBackup_CleanupFails
- TestFetchIndexFallbackHTTP
- Hypothesis: Permission issues or missing temp directories in CI
- Check: Temp directory creation and cleanup

Database/Service Tests (2 tests):

- TestCredentialService_Delete
- TestCredentialService_GetCredentialForDomain_WildcardMatch
- Hypothesis: Now that encryption is working, the tests may be revealing actual bugs
- Check: Encryption/decryption logic with valid keys

Handler Tests (2 tests):

- TestGetAcquisitionConfig
- TestPluginHandler_ReloadPlugins_WithErrors
- Hypothesis: Mock expectations do not match actual behavior
- Check: Mock setup against real handler behavior

Phase 3: Fix Implementation (Priority: HIGH)

Fix Patterns by Category:

Pattern A: Race Condition Fixes

// Add proper synchronization
var mu sync.Mutex
mu.Lock()
defer mu.Unlock()

// Or increase timeouts for CI
timeout := 1 * time.Second
if os.Getenv("CI") == "true" {
    timeout = 5 * time.Second
}

Pattern B: Temp Directory Fixes

// Ensure proper cleanup
t.Cleanup(func() {
    os.RemoveAll(tempDir)
})

// Check directory exists before operations
if _, err := os.Stat(dir); os.IsNotExist(err) {
    t.Skip("Directory not accessible in CI")
}

Pattern C: Timing/Async Fixes

// Use Eventually pattern for async operations
require.Eventually(t, func() bool {
    // Check condition
    return result == expected
}, 5*time.Second, 100*time.Millisecond)

Pattern D: Database/Encryption Fixes

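// Set the encryption key for the test; the value must be base64 that decodes to exactly 32 bytes
t.Setenv("CHARON_ENCRYPTION_KEY", "<base64-encoded 32-byte test key>")

// Verify service initialization (err is the error returned when constructing the service)
require.NoError(t, err, "Service should initialize with valid key")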

Phase 4: Validation (Priority: CRITICAL)

Local Validation:

# Run the nine tests repeatedly to catch flaky behavior.
# -count=1 bypasses the test cache so each iteration really re-runs the tests.
for i in {1..5}; do
    echo "Run $i"
    go test -v -count=1 -run '^(TestMain_DefaultStartupGracefulShutdown_Subprocess|TestGetAcquisitionConfig|TestEnsureBouncerRegistration_ConcurrentCalls|TestPluginHandler_ReloadPlugins_WithErrors|TestFetchIndexFallbackHTTP|TestRunScheduledBackup_CleanupFails|TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite|TestCredentialService_Delete|TestCredentialService_GetCredentialForDomain_WildcardMatch)$' ./...
done

CI Validation:

Push fixes to feature branch
Monitor CI workflow runs
Verify all 9 tests pass in CI
Verify coverage remains >=85%

🎯 DECISION POINT

BEFORE proceeding with fixes, we need to decide:

Option A: Fix Now (Blocking PR #666)

Investigate and fix all 9 tests
Estimated time: 2-4 hours
Blocks returning to original coverage goal
Pros: Clean slate, no test debt
Cons: Delays coverage work further

Option B: Skip for Now (Unblock Coverage Work)

Mark failing tests with t.Skip() and TODO comments
Return to original coverage goal
Fix tests in separate PR after coverage work complete
Pros: Unblocks progress on original goal
Cons: Accumulates technical debt
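If Option B is chosen, keeping the skips in one place makes the debt visible and easy to reverse. The sketch below is one way to do that; `knownFailures`, `skipReason`, and `skipIfKnownFailure` are hypothetical names, and the reason text is a placeholder since no tracking issue is named in this plan.

```go
package main

import "testing"

// ciFailureReason is placeholder text; replace it with a real tracking
// reference once one exists.
const ciFailureReason = "fails in CI; fix planned in a follow-up PR"

// knownFailures lists the nine tests from the failure summary in this plan.
var knownFailures = map[string]string{
	"TestMain_DefaultStartupGracefulShutdown_Subprocess":           ciFailureReason,
	"TestGetAcquisitionConfig":                                     ciFailureReason,
	"TestEnsureBouncerRegistration_ConcurrentCalls":                ciFailureReason,
	"TestPluginHandler_ReloadPlugins_WithErrors":                   ciFailureReason,
	"TestFetchIndexFallbackHTTP":                                   ciFailureReason,
	"TestRunScheduledBackup_CleanupFails":                          ciFailureReason,
	"TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite": ciFailureReason,
	"TestCredentialService_Delete":                                 ciFailureReason,
	"TestCredentialService_GetCredentialForDomain_WildcardMatch":   ciFailureReason,
}

// skipReason returns the recorded reason for a test name, if there is one.
func skipReason(name string) (string, bool) {
	reason, ok := knownFailures[name]
	return reason, ok
}

// skipIfKnownFailure skips the calling test when its name is on the list.
// Call it as the first statement of each affected test.
func skipIfKnownFailure(t testing.TB) {
	t.Helper()
	if reason, ok := skipReason(t.Name()); ok {
		t.Skipf("TODO: %s", reason)
	}
}
```

Removing an entry from the map re-enables that test, so the follow-up PR can shrink the list one fix at a time.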

Option C: Parallel Work

Backend Dev fixes tests while other work continues
Pros: Maximizes throughput
Cons: Coordination overhead

RECOMMENDATION: Need user input on priority:

If coverage patch is time-sensitive → Option B
If CI stability is critical → Option A
If team has bandwidth → Option C

📊 RISK ASSESSMENT

Continuing with failing tests:

❌ PR #666 cannot merge (CI failures blocking)
❌ Coverage validation workflow unreliable
❌ May mask new test failures
⚠️ Could indicate actual bugs in production code

Fixing tests first:

✅ Clean CI before coverage expansion
✅ Higher confidence in code quality
✅ Easier to isolate coverage-related issues
⏳ Delays return to original coverage objective

👤 USER ACTION REQUIRED

Please decide:

Fix all 9 tests NOW before coverage work? (Option A)
Skip/TODO tests to unblock coverage goal? (Option B)
Parallel: Some agent fixes tests while coverage work proceeds? (Option C)

Also confirm:

Are these tests known to be flaky or is this the first failure?
Were these tests passing before the encryption key changes?
Are there any other blocking issues to address first?

📝 NOTES

Encryption key validation verified locally: echo "r2Xfh5PUagXEVG1Qhg9Hq3ELfMdtQZx5gX0kvE23BHQ=" | base64 -d | wc -c → 32 bytes ✅
CI logs show no RotationService warnings ✅
Coverage at 85.1% meets requirement ✅
Original conversation goal was: "The final part to a green CI is the codecov patch. We're missing 578 lines of coverage"

🔄 NEXT STEPS (PENDING USER DECISION)

User decides on Option A, B, or C
If Option A: Begin Phase 1 reproduction
If Option B: Skip tests with TODO comments, proceed to coverage
If Option C: Split work: Backend Dev fixes tests, Planning starts coverage plan
