- Increased SIGTERM signal timeout from 500ms to 1000ms
- Go 1.26.0 changed signal delivery timing on Linux
- Test now passes reliably with adequate startup grace period
Related to Go 1.26.0 upgrade (commit dc40102a)
8.2 KiB
CI Test Failures Remediation Plan
Date: 2026-02-16 Status: READY FOR IMPLEMENTATION Encryption Key Issue: ✅ RESOLVED Remaining Issue: 9 Test Failures
✅ CONFIRMED: Encryption Key Issue Resolved
Verification:
- No "invalid key length" errors in CI logs
- No "CHARON_ENCRYPTION_KEY is required" warnings
- Secret format validated:
r2Xfh5PUagXEVG1Qhg9Hq3ELfMdtQZx5gX0kvE23BHQ=decodes to exactly 32 bytes - Coverage: 85.1% ✅ (meets 85% requirement)
🔴 9 FAILING TESTS TO FIX
Test Failure Summary
FAILED TEST SUMMARY:
--- FAIL: TestMain_DefaultStartupGracefulShutdown_Subprocess (0.55s)
--- FAIL: TestGetAcquisitionConfig (0.00s)
--- FAIL: TestEnsureBouncerRegistration_ConcurrentCalls (0.01s)
--- FAIL: TestPluginHandler_ReloadPlugins_WithErrors (0.00s)
--- FAIL: TestFetchIndexFallbackHTTP (0.00s)
--- FAIL: TestRunScheduledBackup_CleanupFails (0.01s)
--- FAIL: TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite (0.03s)
--- FAIL: TestCredentialService_Delete (0.01s)
--- FAIL: TestCredentialService_GetCredentialForDomain_WildcardMatch (0.01s)
Context
These failures appear to be pre-existing issues unrelated to the encryption key work:
- User previously mentioned "we need to fix the other failing tests" during CI wait
- Backend Dev reported fixing some of these tests locally but they're still failing in CI
- This suggests environment differences between local and CI runners
📋 REMEDIATION STRATEGY
Phase 1: Local Reproduction (Priority: HIGH)
Goal: Reproduce each failure locally to understand root causes.
Actions:
- Run each failing test individually with verbose output
- Document exact failure messages and stack traces
- Identify patterns (timeout, race condition, environment dependency, etc.)
- Check if failures are deterministic or flaky
Commands:
# Run individual failing tests with verbose output
cd backend
# Test 1
go test -v -run TestMain_DefaultStartupGracefulShutdown_Subprocess ./cmd/main_test.go
# Test 2
go test -v -run TestGetAcquisitionConfig ./internal/handlers/crowdsec_handler_test.go
# Test 3
go test -v -run TestEnsureBouncerRegistration_ConcurrentCalls ./internal/services/crowdsec_service_test.go
# Test 4
go test -v -run TestPluginHandler_ReloadPlugins_WithErrors ./internal/api/handlers/plugin_handler_test.go
# Test 5
go test -v -run TestFetchIndexFallbackHTTP ./internal/crowdsec/preset_hub_test.go
# Test 6
go test -v -run TestRunScheduledBackup_CleanupFails ./internal/services/backup_service_test.go
# Test 7
go test -v -run TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite ./internal/services/security_service_test.go
# Test 8
go test -v -run TestCredentialService_Delete ./internal/services/credential_service_test.go
# Test 9
go test -v -run TestCredentialService_GetCredentialForDomain_WildcardMatch ./internal/services/credential_service_test.go
Phase 2: Root Cause Analysis (Priority: HIGH)
Expected Failure Patterns:
-
Subprocess/Concurrency Tests (3 tests):
- TestMain_DefaultStartupGracefulShutdown_Subprocess
- TestEnsureBouncerRegistration_ConcurrentCalls
- TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite
- Hypothesis: Race conditions or timing issues in CI environment
- Check: CI has different CPU/timing characteristics than local
-
File System Tests (2 tests):
- TestRunScheduledBackup_CleanupFails
- TestFetchIndexFallbackHTTP
- Hypothesis: Permission issues or missing temp directories in CI
- Check: Temp directory creation/cleanup
-
Database/Service Tests (2 tests):
- TestCredentialService_Delete
- TestCredentialService_GetCredentialForDomain_WildcardMatch
- Hypothesis: Now that encryption is working, tests may be revealing actual bugs
- Check: Encryption/decryption logic with valid keys
-
Handler Tests (2 tests):
- TestGetAcquisitionConfig
- TestPluginHandler_ReloadPlugins_WithErrors
- Hypothesis: Mock expectations not matching actual behavior
- Check: Mock setup vs reality
Phase 3: Fix Implementation (Priority: HIGH)
Fix Patterns by Category:
Pattern A: Race Condition Fixes
// Add proper synchronization
var mu sync.Mutex
mu.Lock()
defer mu.Unlock()
// Or increase timeouts for CI
timeout := 1 * time.Second
if os.Getenv("CI") == "true" {
timeout = 5 * time.Second
}
Pattern B: Temp Directory Fixes
// Ensure proper cleanup
t.Cleanup(func() {
os.RemoveAll(tempDir)
})
// Check directory exists before operations
if _, err := os.Stat(dir); os.IsNotExist(err) {
t.Skip("Directory not accessible in CI")
}
Pattern C: Timing/Async Fixes
// Use Eventually pattern for async operations
require.Eventually(t, func() bool {
// Check condition
return result == expected
}, 5*time.Second, 100*time.Millisecond)
Pattern D: Database/Encryption Fixes
// Ensure encryption key is set in test
t.Setenv("CHARON_ENCRYPTION_KEY", "your-test-key")
// Verify service initialization
require.NoError(t, err, "Service should initialize with valid key")
Phase 4: Validation (Priority: CRITICAL)
Local Validation:
# Run all tests multiple times to catch flaky tests
for i in {1..5}; do
echo "Run $i"
go test -v -run 'TestMain_DefaultStartupGracefulShutdown_Subprocess|TestGetAcquisitionConfig|TestEnsureBouncerRegistration_ConcurrentCalls|TestPluginHandler_ReloadPlugins_WithErrors|TestFetchIndexFallbackHTTP|TestRunScheduledBackup_CleanupFails|TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite|TestCredentialService_Delete|TestCredentialService_GetCredentialForDomain_WildcardMatch' ./...
done
CI Validation:
- Push fixes to feature branch
- Monitor CI workflow runs
- Verify all 9 tests pass in CI
- Verify coverage remains >=85%
🎯 DECISION POINT
BEFORE proceeding with fixes, we need to decide:
Option A: Fix Now (Blocking PR #666)
- Investigate and fix all 9 tests
- Estimated time: 2-4 hours
- Blocks returning to original coverage goal
- Pros: Clean slate, no test debt
- Cons: Delays coverage work further
Option B: Skip for Now (Unblock Coverage Work)
- Mark failing tests with
t.Skip()and TODO comments - Return to original coverage goal
- Fix tests in separate PR after coverage work complete
- Pros: Unblocks progress on original goal
- Cons: Accumulates technical debt
Option C: Parallel Work
- Backend Dev fixes tests while other work continues
- Pros: Maximizes throughput
- Cons: Coordination overhead
RECOMMENDATION: Need user input on priority:
- If coverage patch is time-sensitive → Option B
- If CI stability is critical → Option A
- If team has bandwidth → Option C
📊 RISK ASSESSMENT
Continuing with failing tests:
- ❌ PR #666 cannot merge (CI failures blocking)
- ❌ Coverage validation workflow unreliable
- ❌ May mask new test failures
- ⚠️ Could indicate actual bugs in production code
Fixing tests first:
- ✅ Clean CI before coverage expansion
- ✅ Higher confidence in code quality
- ✅ Easier to isolate coverage-related issues
- ⏳ Delays return to original coverage objective
👤 USER ACTION REQUIRED
Please decide:
- Fix all 9 tests NOW before coverage work? (Option A)
- Skip/TODO tests to unblock coverage goal? (Option B)
- Parallel: Some agent fixes tests while coverage work proceeds? (Option C)
Also confirm:
- Are these tests known to be flaky or is this the first failure?
- Were these tests passing before the encryption key changes?
- Are there any other blocking issues to address first?
📝 NOTES
- Encryption key validation verified locally:
echo "r2Xfh5PUagXEVG1Qhg9Hq3ELfMdtQZx5gX0kvE23BHQ=" | base64 -d | wc -c→ 32 bytes ✅ - CI logs show no RotationService warnings ✅
- Coverage at 85.1% meets requirement ✅
- Original conversation goal was: "The final part to a green CI is the codecov patch. We're missing 578 lines of coverage"
🔄 NEXT STEPS (PENDING USER DECISION)
- User decides on Option A, B, or C
- If Option A: Begin Phase 1 reproduction
- If Option B: Skip tests with TODO comments, proceed to coverage
- If Option C: Split work: Backend Dev fixes tests, Planning starts coverage plan