Charon/docs/plans/archive/CI_TEST_FAILURES_REMEDIATION.md

# CI Test Failures Remediation Plan

**Date:** 2026-02-16
**Status:** READY FOR IMPLEMENTATION
**Encryption Key Issue:** ✅ RESOLVED
**Remaining Issue:** 9 Test Failures

---

## ✅ CONFIRMED: Encryption Key Issue Resolved

**Verification:**
- No "invalid key length" errors in CI logs
- No "CHARON_ENCRYPTION_KEY is required" warnings
- Secret format validated: `r2Xfh5PUagXEVG1Qhg9Hq3ELfMdtQZx5gX0kvE23BHQ=` decodes to exactly 32 bytes
- Coverage: 85.1% ✅ (meets 85% requirement)

---

## 🔴 9 FAILING TESTS TO FIX

### Test Failure Summary

```
FAILED TEST SUMMARY:
--- FAIL: TestMain_DefaultStartupGracefulShutdown_Subprocess (0.55s)
--- FAIL: TestGetAcquisitionConfig (0.00s)
--- FAIL: TestEnsureBouncerRegistration_ConcurrentCalls (0.01s)
--- FAIL: TestPluginHandler_ReloadPlugins_WithErrors (0.00s)
--- FAIL: TestFetchIndexFallbackHTTP (0.00s)
--- FAIL: TestRunScheduledBackup_CleanupFails (0.01s)
--- FAIL: TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite (0.03s)
--- FAIL: TestCredentialService_Delete (0.01s)
--- FAIL: TestCredentialService_GetCredentialForDomain_WildcardMatch (0.01s)
```

### Context

These failures appear to be **pre-existing issues** unrelated to the encryption key work:
- User previously mentioned "we need to fix the other failing tests" during CI wait
- Backend Dev reported fixing some of these tests locally but they're still failing in CI
- This suggests environment differences between local and CI runners

---

## 📋 REMEDIATION STRATEGY

### Phase 1: Local Reproduction (Priority: HIGH)

**Goal:** Reproduce each failure locally to understand root causes.

**Actions:**
1. Run each failing test individually with verbose output
2. Document exact failure messages and stack traces
3. Identify patterns (timeout, race condition, environment dependency, etc.)
4. Check if failures are deterministic or flaky

**Commands:**
```bash
# Run individual failing tests with verbose output
cd backend

# Test 1
go test -v -run TestMain_DefaultStartupGracefulShutdown_Subprocess ./cmd/main_test.go

# Test 2
go test -v -run TestGetAcquisitionConfig ./internal/handlers/crowdsec_handler_test.go

# Test 3
go test -v -run TestEnsureBouncerRegistration_ConcurrentCalls ./internal/services/crowdsec_service_test.go

# Test 4
go test -v -run TestPluginHandler_ReloadPlugins_WithErrors ./internal/api/handlers/plugin_handler_test.go

# Test 5
go test -v -run TestFetchIndexFallbackHTTP ./internal/crowdsec/preset_hub_test.go

# Test 6
go test -v -run TestRunScheduledBackup_CleanupFails ./internal/services/backup_service_test.go

# Test 7
go test -v -run TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite ./internal/services/security_service_test.go

# Test 8
go test -v -run TestCredentialService_Delete ./internal/services/credential_service_test.go

# Test 9
go test -v -run TestCredentialService_GetCredentialForDomain_WildcardMatch ./internal/services/credential_service_test.go
```

### Phase 2: Root Cause Analysis (Priority: HIGH)

**Expected Failure Patterns:**

1. **Subprocess/Concurrency Tests** (3 tests):
   - TestMain_DefaultStartupGracefulShutdown_Subprocess
   - TestEnsureBouncerRegistration_ConcurrentCalls
   - TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite
   - **Hypothesis:** Race conditions or timing issues in CI environment
   - **Check:** CI has different CPU/timing characteristics than local

2. **File System Tests** (2 tests):
   - TestRunScheduledBackup_CleanupFails
   - TestFetchIndexFallbackHTTP
   - **Hypothesis:** Permission issues or missing temp directories in CI
   - **Check:** Temp directory creation/cleanup

3. **Database/Service Tests** (2 tests):
   - TestCredentialService_Delete
   - TestCredentialService_GetCredentialForDomain_WildcardMatch
   - **Hypothesis:** Now that encryption is working, tests may be revealing actual bugs
   - **Check:** Encryption/decryption logic with valid keys

4. **Handler Tests** (2 tests):
   - TestGetAcquisitionConfig
   - TestPluginHandler_ReloadPlugins_WithErrors
   - **Hypothesis:** Mock expectations not matching actual behavior
   - **Check:** Mock setup vs reality

### Phase 3: Fix Implementation (Priority: HIGH)

**Fix Patterns by Category:**

#### Pattern A: Race Condition Fixes
```go
// Add proper synchronization
var mu sync.Mutex
mu.Lock()
defer mu.Unlock()

// Or increase timeouts for CI
timeout := 1 * time.Second
if os.Getenv("CI") == "true" {
    timeout = 5 * time.Second
}
```

#### Pattern B: Temp Directory Fixes
```go
// Ensure proper cleanup
t.Cleanup(func() {
    os.RemoveAll(tempDir)
})

// Check directory exists before operations
if _, err := os.Stat(dir); os.IsNotExist(err) {
    t.Skip("Directory not accessible in CI")
}
```

#### Pattern C: Timing/Async Fixes
```go
// Use Eventually pattern for async operations
require.Eventually(t, func() bool {
    // Check condition
    return result == expected
}, 5*time.Second, 100*time.Millisecond)
```

#### Pattern D: Database/Encryption Fixes
```go
// Ensure encryption key is set in test
t.Setenv("CHARON_ENCRYPTION_KEY", "your-test-key")

// Verify service initialization
require.NoError(t, err, "Service should initialize with valid key")
```

### Phase 4: Validation (Priority: CRITICAL)

**Local Validation:**
```bash
# Run all tests multiple times to catch flaky tests
for i in {1..5}; do
    echo "Run $i"
    go test -v -run 'TestMain_DefaultStartupGracefulShutdown_Subprocess|TestGetAcquisitionConfig|TestEnsureBouncerRegistration_ConcurrentCalls|TestPluginHandler_ReloadPlugins_WithErrors|TestFetchIndexFallbackHTTP|TestRunScheduledBackup_CleanupFails|TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite|TestCredentialService_Delete|TestCredentialService_GetCredentialForDomain_WildcardMatch' ./...
done
```

**CI Validation:**
1. Push fixes to feature branch
2. Monitor CI workflow runs
3. Verify all 9 tests pass in CI
4. Verify coverage remains >=85%

---

## 🎯 DECISION POINT

**BEFORE proceeding with fixes, we need to decide:**

**Option A: Fix Now (Blocking PR #666)**
- Investigate and fix all 9 tests
- Estimated time: 2-4 hours
- Blocks returning to original coverage goal
- **Pros:** Clean slate, no test debt
- **Cons:** Delays coverage work further

**Option B: Skip for Now (Unblock Coverage Work)**
- Mark failing tests with `t.Skip()` and TODO comments
- Return to original coverage goal
- Fix tests in separate PR after coverage work complete
- **Pros:** Unblocks progress on original goal
- **Cons:** Accumulates technical debt

**Option C: Parallel Work**
- Backend Dev fixes tests while other work continues
- **Pros:** Maximizes throughput
- **Cons:** Coordination overhead

**RECOMMENDATION:** Need user input on priority:
- If coverage patch is time-sensitive → Option B
- If CI stability is critical → Option A
- If team has bandwidth → Option C

---

## 📊 RISK ASSESSMENT

**Continuing with failing tests:**
- ❌ PR #666 cannot merge (CI failures blocking)
- ❌ Coverage validation workflow unreliable
- ❌ May mask new test failures
- ⚠️ Could indicate actual bugs in production code

**Fixing tests first:**
- ✅ Clean CI before coverage expansion
- ✅ Higher confidence in code quality
- ✅ Easier to isolate coverage-related issues
- ⏳ Delays return to original coverage objective

---

## 👤 USER ACTION REQUIRED

**Please decide:**

1. **Fix all 9 tests NOW before coverage work?** (Option A)
2. **Skip/TODO tests to unblock coverage goal?** (Option B)
3. **Parallel: Some agent fixes tests while coverage work proceeds?** (Option C)

**Also confirm:**
- Are these tests known to be flaky or is this the first failure?
- Were these tests passing before the encryption key changes?
- Are there any other blocking issues to address first?

---

## 📝 NOTES

- Encryption key validation verified locally: `echo "r2Xfh5PUagXEVG1Qhg9Hq3ELfMdtQZx5gX0kvE23BHQ=" | base64 -d | wc -c` → 32 bytes ✅
- CI logs show no RotationService warnings ✅
- Coverage at 85.1% meets requirement ✅
- Original conversation goal was: "The final part to a green CI is the codecov patch. We're missing 578 lines of coverage"

---

## 🔄 NEXT STEPS (PENDING USER DECISION)

1. User decides on Option A, B, or C
2. If Option A: Begin Phase 1 reproduction
3. If Option B: Skip tests with TODO comments, proceed to coverage
4. If Option C: Split work: Backend Dev fixes tests, Planning starts coverage plan