Charon/docs/plans/CI_TEST_FAILURES_DETAILED_REMEDIATION.md

# CI Test Failures Detailed Remediation Plan

**Date:** 2026-02-16
**Workflow Run:** 22079827893 (codecov-upload.yml)
**Branch:** feature/beta-release
**Status:** 🔴 BLOCKING - 9+ tests failing

## Executive Summary

**CRITICAL DISCOVERY:** The test failures are **NOT** related to the `CHARON_ENCRYPTION_KEY` environment variable. The encryption key is properly set and working in CI. The failures stem from test-specific issues, including unexpected HTTP status codes, timing, concurrency, and database state.

**Evidence:**
- CI logs show NO warnings about "CHARON_ENCRYPTION_KEY is required"
- CI logs show NO errors about "invalid key length"
- Services initialize successfully with encryption
- Coverage at 85.1% (meets requirement)

**Actual Root Cause:** Individual test logic, timing, or environmental differences between local and CI execution.

---

## Failed Tests with Actual Errors

### 1. TestMain_DefaultStartupGracefulShutdown_Subprocess
**File:** `backend/cmd/api/main_test.go`
**Error:** Process terminated with `signal: terminated` after 0.57s
**Observation:** Subprocess starts successfully (logs show server initialization) but then receives termination signal
**Root Cause Hypothesis:**
- Subprocess doesn't terminate gracefully within expected time
- Missing or delayed signal handling in test
- Race condition between parent sending signal and subprocess responding

**Local vs CI:** May pass locally due to faster execution or different signal handling
**Priority:** 🔴 HIGH - Main server startup flow must work
**Fix Complexity:** MEDIUM (2-3 hours)
**Remediation:**
- Read test to understand subprocess lifecycle
- Check timeout values and signal handling
- Verify the test waits for the server to bind its port before it sends the termination signal

---

### 2. TestGetAcquisitionConfig
**File:** `backend/internal/handlers/crowdsec_handler_test.go` (assumed)
**Error:** `Should not be: 404` - Getting unexpected 404 HTTP status
**Root Cause Hypothesis:**
- CrowdSec config endpoint returns 404 when SecurityConfig table missing
- Test expects config to exist but CI database doesn't have it migrated
- Local database might have lingering state from previous runs

**Local vs CI:** Local database persistence vs fresh CI database
**Priority:** 🟡 MEDIUM - CrowdSec feature must work
**Fix Complexity:** EASY (30 minutes)
**Remediation:**
- Ensure test migrates SecurityConfig table before testing
- Or adjust expected behavior when config doesn't exist

---

### 3. TestEnsureBouncerRegistration_ConcurrentCalls
**File:** `backend/internal/services/crowdsec_lapi_service_test.go` (assumed)
**Error:** `Not equal: expected: 1` - Count assertion failing
**Root Cause Hypothesis:**
- Race condition in concurrent bouncer registration
- Test expects exactly 1 bouncer but gets 0 or >1 due to timing
- A slower CI environment widens the race window or triggers a timeout

**Local vs CI:** Different CPU cores or timing characteristics
**Priority:** 🟡 MEDIUM - Concurrency safety important
**Fix Complexity:** HARD (3-4 hours)
**Remediation:**
- Add explicit synchronization or retries in test
- Increase timeout for concurrent operations
- Use eventually assertions instead of immediate checks

---

### 4. TestPluginHandler_ReloadPlugins_WithErrors
**File:** `backend/internal/api/handlers/plugin_handler_test.go`
**Error:** `Not equal: expected: 200` - HTTP status code not 200
**Root Cause Hypothesis:**
- Plugin reload returns error status (likely 500 or 400) instead of 200
- Test expects reload to succeed even with errors (bad plugin files)
- Endpoint behavior might differ when plugin directory doesn't exist

**Local vs CI:** Local might have plugin directory setup, CI starts fresh
**Priority:** 🟢 LOW - Plugin system edge case
**Fix Complexity:** EASY (1 hour)
**Remediation:**
- Read test to understand expected behavior with errors
- Adjust expectation or ensure test setup creates proper plugin state

---

### 5. TestFetchIndexFallbackHTTP
**File:** `backend/internal/services/crowdsec_preset_service_test.go` (assumed)
**Error:** `Received unexpected error:` - Some error occurred during HTTP fallback
**Root Cause Hypothesis:**
- HTTP fallback mechanism fails when primary fetch method unavailable
- Network request in test might be blocked in CI
- Missing mock or test fixture for HTTP response

**Local vs CI:** CI network restrictions or missing test server
**Priority:** 🟢 LOW - Fallback mechanism edge case
**Fix Complexity:** MEDIUM (1-2 hours)
**Remediation:**
- Ensure test uses mock HTTP server, not real network
- Check if test fixture files exist in CI
- Verify fallback logic handles all error cases
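
To keep the test off the real network, the fallback can be pointed at a local `httptest` server. This is a sketch under assumptions: `fetchIndex` is a hypothetical stand-in for the service's HTTP fallback path, `newIndexServer` is an illustrative helper, and the JSON body is made up for the example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newIndexServer starts a local server that always serves the given body.
// The caller is responsible for calling Close on the returned server.
func newIndexServer(body string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, body)
	}))
}

// fetchIndex is a hypothetical stand-in for the HTTP fallback path.
func fetchIndex(client *http.Client, url string) ([]byte, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}
```

A test would create the server, pass `srv.URL` and `srv.Client()` to the code under test, and close the server when done, so the result no longer depends on CI network access.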

---

### 6. TestRunScheduledBackup_CleanupFails
**File:** `backend/internal/services/backup_service_test.go`
**Error:** `"0" is not greater than or equal to "1"` - Cleanup count assertion
**Root Cause Hypothesis:**
- Test simulates cleanup failure but checks for at least 1 deletion
- Cleanup function doesn't attempt deletion when it should
- Race condition or timing issue preventing cleanup execution

**Local vs CI:** Filesystem timing or goroutine scheduling
**Priority:** 🟡 MEDIUM - Backup reliability important
**Fix Complexity:** MEDIUM (1-2 hours)
**Remediation:**
- Read test to understand cleanup failure scenario
- Verify test assertion matches expected behavior
- Add debug logging to see what cleanup actually does

---

### 7. TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite
**File:** `backend/internal/services/security_service_test.go`
**Error:** `Not equal: expected: "sync-fallback"` - Audit log type mismatch
**Root Cause Hypothesis:**
- Test fills audit channel to trigger sync fallback
- Fallback not triggered or audit records wrong log type
- Timing issue - channel drains before fallback needed

**Local vs CI:** Goroutine scheduling or channel buffer behavior
**Priority:** 🟡 MEDIUM - Audit reliability important
**Fix Complexity:** MEDIUM (2 hours)
**Remediation:**
- Verify channel size and fill logic in test
- Check if fallback logic correctly sets log type
- Add explicit synchronization to ensure channel full before write
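
A deterministic way to exercise the fallback is to fill a buffered channel that has no consumer, so the next non-blocking send must take the fallback branch. The sketch below assumes the service uses a `select` with a `default` case; `auditEvent` and `logAudit` are hypothetical names, and the `"async"` label is made up (only `"sync-fallback"` appears in the failing assertion).

```go
package main

// auditEvent is a hypothetical, minimal audit record.
type auditEvent struct {
	Action string
}

// logAudit attempts a non-blocking send on ch. If the buffer is full it
// writes synchronously through syncWrite and reports "sync-fallback".
func logAudit(ch chan auditEvent, ev auditEvent, syncWrite func(auditEvent)) string {
	select {
	case ch <- ev:
		return "async"
	default:
		syncWrite(ev)
		return "sync-fallback"
	}
}
```

With a buffer of size 2 and no goroutine draining it, the first two calls return `"async"` and the third returns `"sync-fallback"`. If the real service starts a background consumer, the test would need to stop or block it first, otherwise the channel can drain before the overflow write.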

---

### 8. TestCredentialService_GetCredentialForDomain_ExactMatch
**File:** `backend/internal/services/credential_service_test.go`
**Error:** `Received unexpected error:` - Method returns error instead of success
**Root Cause Hypothesis:**
- Credential lookup fails due to missing data or encryption issue
- Database state corrupted or incomplete in test
- Service initialization error (though encryption key IS present)

**Local vs CI:** Unknown - need to read test implementation
**Priority:** 🔴 HIGH - Core credential management feature
**Fix Complexity:** UNKNOWN (needs investigation)
**Remediation:**
- Read test file starting at line 265
- Check test setup creates proper credential with zone filter
- Verify encryption service initialized correctly in test
- Add debug logging to see actual error message

---

### 9. TestCredentialService_GetCredentialForDomain_WildcardMatch
**File:** `backend/internal/services/credential_service_test.go`
**Error:** `Received unexpected error:` - Method returns error instead of success
**Root Cause Hypothesis:**
- Similar to ExactMatch test - credential lookup fails
- Wildcard matching logic has bug or missing data
- Zone filter parsing error

**Local vs CI:** Unknown - need to read test implementation
**Priority:** 🔴 HIGH - Core credential management feature
**Fix Complexity:** UNKNOWN (needs investigation)
**Remediation:**
- Read test file starting at line 297
- Check wildcard zone filter setup (e.g., "*.example.com")
- Verify wildcard matching algorithm
- Add debug logging to see actual error message
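
When checking the wildcard algorithm, it can help to compare it against a small reference matcher. The sketch below is not Charon's implementation. It assumes that matching is case-insensitive, that a trailing dot is ignored, and that `*.example.com` covers any subdomain of `example.com` but not the apex itself. `matchesZone` is a hypothetical name.

```go
package main

import "strings"

// matchesZone reports whether domain is covered by pattern.
// Assumed semantics: exact patterns must equal the domain; "*.example.com"
// matches any subdomain of example.com, but not example.com itself.
func matchesZone(pattern, domain string) bool {
	pattern = strings.ToLower(strings.TrimSuffix(strings.TrimSpace(pattern), "."))
	domain = strings.ToLower(strings.TrimSuffix(strings.TrimSpace(domain), "."))
	if pattern == "" || domain == "" {
		return false
	}
	if strings.HasPrefix(pattern, "*.") {
		suffix := pattern[1:] // ".example.com", keeps the leading dot
		return len(domain) > len(suffix) && strings.HasSuffix(domain, suffix)
	}
	return pattern == domain
}
```

Keeping the leading dot in the suffix prevents `notexample.com` from matching `*.example.com`, a common off-by-one in wildcard checks.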

---

### 10. TestDeleteCertificate_CreatesBackup ⚠️
**File:** `backend/internal/services/certificate_service_test.go`
**Error:** `no such table: proxy_hosts` (database query error)
**Note:** Similar tests with same error PASS (e.g., TestDeleteCertificate_UsageCheckError)
**Root Cause Hypothesis:**
- Test database missing proxy_hosts table migration
- Similar tests expect this error and handle it, but this test does not
- The proxy_hosts usage check runs before backup creation, so the test fails before its backup assertion is reached

**Local vs CI:** Local database might have full schema
**Priority:** 🟢 LOW - May be expected behavior
**Fix Complexity:** EASY (30 minutes)
**Remediation:**
- Read test to see what it actually expects
- Either add proxy_hosts to test database migration
- Or adjust test to expect "table not found" error

---

## Remediation Options

### Option A: Fix All Now (Recommended for Blocking Quality Gate)
**Time Estimate:** 8-14 hours (1-2 days)
**Pros:**
- Comprehensive fix, no technical debt
- High confidence in test suite
- Unblocks CI completely

**Cons:**
- Delays coverage patch work
- Some fixes may be complex (concurrency tests)

**Implementation Plan:**
1. **Phase 1: High Priority** (4-6 hours)
   - TestMain_DefaultStartupGracefulShutdown_Subprocess
   - TestCredentialService_GetCredentialForDomain_ExactMatch
   - TestCredentialService_GetCredentialForDomain_WildcardMatch
2. **Phase 2: Medium Priority** (2-4 hours)
   - TestGetAcquisitionConfig
   - TestEnsureBouncerRegistration_ConcurrentCalls
   - TestRunScheduledBackup_CleanupFails
   - TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite
3. **Phase 3: Low Priority** (2-4 hours)
   - TestPluginHandler_ReloadPlugins_WithErrors
   - TestFetchIndexFallbackHTTP
   - TestDeleteCertificate_CreatesBackup

---

### Option B: Skip Non-Critical Tests (Fastest)
**Time Estimate:** 1-2 hours
**Pros:**
- Fastest path to green CI
- Focus on coverage patch work immediately

**Cons:**
- Technical debt accumulates
- May mask real bugs
- Need to track TODOs

**Implementation:**
- Add `t.Skip("CI environment test - tracked in issue #XXX")` to low/medium priority tests
- Keep HIGH priority tests (Main server startup, credential service)
- Create GitHub issues for each skipped test
- Fix during next sprint

---

### Option C: Parallel Work (Balanced)
**Time Estimate:** 4-6 hours first pass, then monitor
**Pros:**
- Unblock critical paths quickly
- Comprehensive fix in parallel

**Cons:**
- More context switching
- Risk of merge conflicts

**Implementation:**
1. Skip low-priority tests immediately (TestPlugin*, TestFetchIndex, TestDeleteCert)
2. Fix HIGH priority tests in parallel with coverage work
3. Tackle MEDIUM priority tests after coverage patch merged

---

## Decision Matrix

| Criteria | Option A | Option B | Option C |
|----------|----------|----------|----------|
| Time to Green CI | 1-2 days | 1-2 hours | 4-6 hours |
| Technical Debt | None | High | Medium |
| Risk of Masking Bugs | Low | High | Medium |
| Coverage Patch Delay | High | None | Low |
| Long-term Quality | Best | Worst | Good |

---

## Recommended Approach

**OPTION A - Fix All Now**

**Reasoning:**
1. Test failures indicate real issues in application logic or test environment
2. Skipping tests hides potential bugs that could affect production
3. The 9 failures represent core features (server startup, credentials, security auditing, backups)
4. Encryption key issue was a red herring - actual fixes should be straightforward
5. Better to have stable CI before moving to coverage patch work

**Next Steps:**
1. User approves Option A
2. Delegate to `Backend_Dev` agent: "Fix test failures following Phase 1 → Phase 2 → Phase 3 order"
3. For each test:
   - Read test file
   - Understand expected behavior
   - Reproduce locally if possible (with fresh database)
   - Fix root cause
   - Verify fix locally
   - Commit with descriptive message
4. Push all fixes together as a single logical change set
5. Monitor CI workflows for green status
6. Return to coverage patch work

---

## Confidence Assessment

**Root Cause Identified:** ✅ YES - Not encryption key, but individual test issues
**Fix Complexity:** 🟡 MEDIUM - Mix of easy and hard fixes
**Upstream Blockers:** ❌ NONE - All fixes are local test changes
**Risk of Regression:** 🟢 LOW - Tests are isolated, fixes won't affect production code

---

## Notes for Implementation

- All fixes should be in test files only (`*_test.go`)
- Production code should NOT need changes (except if real bugs found)
- Add comments explaining CI-specific behavior if needed
- Use `t.Logf()` for debug output during investigation
- Commit frequently with descriptive messages per fix group
- Run `go test -v -run TestName ./path/` to test individually

---

## Final Recommendation

**DO NOT** skip tests. These failures represent real issues that need fixing:
- Server graceful shutdown
- Credential domain matching (core feature)
- Security audit logging (compliance requirement)
- CrowdSec bouncer registration (security feature)
- Backup cleanup (data integrity)

Proceed with **Option A: Fix All Now**.