264 lines
8.2 KiB
Markdown
264 lines
8.2 KiB
Markdown
# CI Test Failures Remediation Plan
|
|
|
|
**Date:** 2026-02-16
|
|
**Status:** READY FOR IMPLEMENTATION
|
|
**Encryption Key Issue:** ✅ RESOLVED
|
|
**Remaining Issue:** 9 Test Failures
|
|
|
|
---
|
|
|
|
## ✅ CONFIRMED: Encryption Key Issue Resolved
|
|
|
|
**Verification:**
|
|
- No "invalid key length" errors in CI logs
|
|
- No "CHARON_ENCRYPTION_KEY is required" warnings
|
|
- Secret format validated: `r2Xfh5PUagXEVG1Qhg9Hq3ELfMdtQZx5gX0kvE23BHQ=` decodes to exactly 32 bytes
|
|
- Coverage: 85.1% ✅ (meets 85% requirement)
|
|
|
|
---
|
|
|
|
## 🔴 9 FAILING TESTS TO FIX
|
|
|
|
### Test Failure Summary
|
|
|
|
```
|
|
FAILED TEST SUMMARY:
|
|
--- FAIL: TestMain_DefaultStartupGracefulShutdown_Subprocess (0.55s)
|
|
--- FAIL: TestGetAcquisitionConfig (0.00s)
|
|
--- FAIL: TestEnsureBouncerRegistration_ConcurrentCalls (0.01s)
|
|
--- FAIL: TestPluginHandler_ReloadPlugins_WithErrors (0.00s)
|
|
--- FAIL: TestFetchIndexFallbackHTTP (0.00s)
|
|
--- FAIL: TestRunScheduledBackup_CleanupFails (0.01s)
|
|
--- FAIL: TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite (0.03s)
|
|
--- FAIL: TestCredentialService_Delete (0.01s)
|
|
--- FAIL: TestCredentialService_GetCredentialForDomain_WildcardMatch (0.01s)
|
|
```
|
|
|
|
### Context
|
|
|
|
These failures appear to be **pre-existing issues** unrelated to the encryption key work:
|
|
- User previously mentioned "we need to fix the other failing tests" during CI wait
|
|
- Backend Dev reported fixing some of these tests locally but they're still failing in CI
|
|
- This suggests environment differences between local and CI runners
|
|
|
|
---
|
|
|
|
## 📋 REMEDIATION STRATEGY
|
|
|
|
### Phase 1: Local Reproduction (Priority: HIGH)
|
|
|
|
**Goal:** Reproduce each failure locally to understand root causes.
|
|
|
|
**Actions:**
|
|
1. Run each failing test individually with verbose output
|
|
2. Document exact failure messages and stack traces
|
|
3. Identify patterns (timeout, race condition, environment dependency, etc.)
|
|
4. Check if failures are deterministic or flaky
|
|
|
|
**Commands:**
|
|
```bash
|
|
# Run individual failing tests with verbose output
|
|
cd backend
|
|
|
|
# Test 1
|
|
go test -v -run TestMain_DefaultStartupGracefulShutdown_Subprocess ./cmd/main_test.go
|
|
|
|
# Test 2
|
|
go test -v -run TestGetAcquisitionConfig ./internal/handlers/crowdsec_handler_test.go
|
|
|
|
# Test 3
|
|
go test -v -run TestEnsureBouncerRegistration_ConcurrentCalls ./internal/services/crowdsec_service_test.go
|
|
|
|
# Test 4
|
|
go test -v -run TestPluginHandler_ReloadPlugins_WithErrors ./internal/api/handlers/plugin_handler_test.go
|
|
|
|
# Test 5
|
|
go test -v -run TestFetchIndexFallbackHTTP ./internal/crowdsec/preset_hub_test.go
|
|
|
|
# Test 6
|
|
go test -v -run TestRunScheduledBackup_CleanupFails ./internal/services/backup_service_test.go
|
|
|
|
# Test 7
|
|
go test -v -run TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite ./internal/services/security_service_test.go
|
|
|
|
# Test 8
|
|
go test -v -run TestCredentialService_Delete ./internal/services/credential_service_test.go
|
|
|
|
# Test 9
|
|
go test -v -run TestCredentialService_GetCredentialForDomain_WildcardMatch ./internal/services/credential_service_test.go
|
|
```
|
|
|
|
### Phase 2: Root Cause Analysis (Priority: HIGH)
|
|
|
|
**Expected Failure Patterns:**
|
|
|
|
1. **Subprocess/Concurrency Tests** (3 tests):
|
|
- TestMain_DefaultStartupGracefulShutdown_Subprocess
|
|
- TestEnsureBouncerRegistration_ConcurrentCalls
|
|
- TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite
|
|
- **Hypothesis:** Race conditions or timing issues in CI environment
|
|
- **Check:** CI has different CPU/timing characteristics than local
|
|
|
|
2. **File System Tests** (2 tests):
|
|
- TestRunScheduledBackup_CleanupFails
|
|
- TestFetchIndexFallbackHTTP
|
|
- **Hypothesis:** Permission issues or missing temp directories in CI
|
|
- **Check:** Temp directory creation/cleanup
|
|
|
|
3. **Database/Service Tests** (2 tests):
|
|
- TestCredentialService_Delete
|
|
- TestCredentialService_GetCredentialForDomain_WildcardMatch
|
|
- **Hypothesis:** Now that encryption is working, tests may be revealing actual bugs
|
|
- **Check:** Encryption/decryption logic with valid keys
|
|
|
|
4. **Handler Tests** (2 tests):
|
|
- TestGetAcquisitionConfig
|
|
- TestPluginHandler_ReloadPlugins_WithErrors
|
|
- **Hypothesis:** Mock expectations not matching actual behavior
|
|
- **Check:** Mock setup vs reality
|
|
|
|
### Phase 3: Fix Implementation (Priority: HIGH)
|
|
|
|
**Fix Patterns by Category:**
|
|
|
|
#### Pattern A: Race Condition Fixes
|
|
```go
|
|
// Add proper synchronization
|
|
var mu sync.Mutex
|
|
mu.Lock()
|
|
defer mu.Unlock()
|
|
|
|
// Or increase timeouts for CI
|
|
timeout := 1 * time.Second
|
|
if os.Getenv("CI") == "true" {
|
|
timeout = 5 * time.Second
|
|
}
|
|
```
|
|
|
|
#### Pattern B: Temp Directory Fixes
|
|
```go
|
|
// Ensure proper cleanup
|
|
t.Cleanup(func() {
|
|
os.RemoveAll(tempDir)
|
|
})
|
|
|
|
// Check directory exists before operations
|
|
if _, err := os.Stat(dir); os.IsNotExist(err) {
|
|
t.Skip("Directory not accessible in CI")
|
|
}
|
|
```
|
|
|
|
#### Pattern C: Timing/Async Fixes
|
|
```go
|
|
// Use Eventually pattern for async operations
|
|
require.Eventually(t, func() bool {
|
|
// Check condition
|
|
return result == expected
|
|
}, 5*time.Second, 100*time.Millisecond)
|
|
```
|
|
|
|
#### Pattern D: Database/Encryption Fixes
|
|
```go
|
|
// Ensure encryption key is set in test
|
|
t.Setenv("CHARON_ENCRYPTION_KEY", "your-test-key")
|
|
|
|
// Verify service initialization
|
|
require.NoError(t, err, "Service should initialize with valid key")
|
|
```
|
|
|
|
### Phase 4: Validation (Priority: CRITICAL)
|
|
|
|
**Local Validation:**
|
|
```bash
|
|
# Run all tests multiple times to catch flaky tests
|
|
for i in {1..5}; do
|
|
echo "Run $i"
|
|
go test -v -run 'TestMain_DefaultStartupGracefulShutdown_Subprocess|TestGetAcquisitionConfig|TestEnsureBouncerRegistration_ConcurrentCalls|TestPluginHandler_ReloadPlugins_WithErrors|TestFetchIndexFallbackHTTP|TestRunScheduledBackup_CleanupFails|TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite|TestCredentialService_Delete|TestCredentialService_GetCredentialForDomain_WildcardMatch' ./...
|
|
done
|
|
```
|
|
|
|
**CI Validation:**
|
|
1. Push fixes to feature branch
|
|
2. Monitor CI workflow runs
|
|
3. Verify all 9 tests pass in CI
|
|
4. Verify coverage remains >=85%
|
|
|
|
---
|
|
|
|
## 🎯 DECISION POINT
|
|
|
|
**BEFORE proceeding with fixes, we need to decide:**
|
|
|
|
**Option A: Fix Now (Blocking PR #666)**
|
|
- Investigate and fix all 9 tests
|
|
- Estimated time: 2-4 hours
|
|
- Blocks returning to original coverage goal
|
|
- **Pros:** Clean slate, no test debt
|
|
- **Cons:** Delays coverage work further
|
|
|
|
**Option B: Skip for Now (Unblock Coverage Work)**
|
|
- Mark failing tests with `t.Skip()` and TODO comments
|
|
- Return to original coverage goal
|
|
- Fix tests in separate PR after coverage work complete
|
|
- **Pros:** Unblocks progress on original goal
|
|
- **Cons:** Accumulates technical debt
|
|
|
|
**Option C: Parallel Work**
|
|
- Backend Dev fixes tests while other work continues
|
|
- **Pros:** Maximizes throughput
|
|
- **Cons:** Coordination overhead
|
|
|
|
**RECOMMENDATION:** Need user input on priority:
|
|
- If coverage patch is time-sensitive → Option B
|
|
- If CI stability is critical → Option A
|
|
- If team has bandwidth → Option C
|
|
|
|
---
|
|
|
|
## 📊 RISK ASSESSMENT
|
|
|
|
**Continuing with failing tests:**
|
|
- ❌ PR #666 cannot merge (CI failures blocking)
|
|
- ❌ Coverage validation workflow unreliable
|
|
- ❌ May mask new test failures
|
|
- ⚠️ Could indicate actual bugs in production code
|
|
|
|
**Fixing tests first:**
|
|
- ✅ Clean CI before coverage expansion
|
|
- ✅ Higher confidence in code quality
|
|
- ✅ Easier to isolate coverage-related issues
|
|
- ⏳ Delays return to original coverage objective
|
|
|
|
---
|
|
|
|
## 👤 USER ACTION REQUIRED
|
|
|
|
**Please decide:**
|
|
|
|
1. **Fix all 9 tests NOW before coverage work?** (Option A)
|
|
2. **Skip/TODO tests to unblock coverage goal?** (Option B)
|
|
3. **Parallel: Some agent fixes tests while coverage work proceeds?** (Option C)
|
|
|
|
**Also confirm:**
|
|
- Are these tests known to be flaky or is this the first failure?
|
|
- Were these tests passing before the encryption key changes?
|
|
- Are there any other blocking issues to address first?
|
|
|
|
---
|
|
|
|
## 📝 NOTES
|
|
|
|
- Encryption key validation verified locally: `echo "r2Xfh5PUagXEVG1Qhg9Hq3ELfMdtQZx5gX0kvE23BHQ=" | base64 -d | wc -c` → 32 bytes ✅
|
|
- CI logs show no RotationService warnings ✅
|
|
- Coverage at 85.1% meets requirement ✅
|
|
- Original conversation goal was: "The final part to a green CI is the codecov patch. We're missing 578 lines of coverage"
|
|
|
|
---
|
|
|
|
## 🔄 NEXT STEPS (PENDING USER DECISION)
|
|
|
|
1. User decides on Option A, B, or C
|
|
2. If Option A: Begin Phase 1 reproduction
|
|
3. If Option B: Skip tests with TODO comments, proceed to coverage
|
|
4. If Option C: Split work: Backend Dev fixes tests, Planning starts coverage plan
|