CI Test Failures Detailed Remediation Plan
Date: 2026-02-16
Workflow Run: 22079827893 (codecov-upload.yml)
Branch: feature/beta-release
Status: 🔴 BLOCKING - 9+ tests failing
Executive Summary
CRITICAL DISCOVERY: The test failures are NOT related to the CHARON_ENCRYPTION_KEY environment variable. The encryption key is set correctly and works in CI. The failures stem from test-specific issues: unexpected HTTP status codes, timing, concurrency, and database state.
Evidence:
- CI logs show NO warnings about "CHARON_ENCRYPTION_KEY is required"
- CI logs show NO errors about "invalid key length"
- Services initialize successfully with encryption
- Coverage at 85.1% (meets requirement)
Actual Root Cause: Individual test logic, timing, or environmental differences between local and CI execution.
## Failed Tests with Actual Errors
1. TestMain_DefaultStartupGracefulShutdown_Subprocess
File: backend/cmd/api/main_test.go
Error: Process terminated with signal: terminated after 0.57s
Observation: Subprocess starts successfully (logs show server initialization) but then receives termination signal
Root Cause Hypothesis:
- Subprocess doesn't terminate gracefully within expected time
- Missing or delayed signal handling in test
- Race condition between parent sending signal and subprocess responding
Local vs CI: May pass locally due to faster execution or different signal handling
Priority: 🔴 HIGH - Main server startup flow must work
Fix Complexity: MEDIUM (2-3 hours)
Remediation:
- Read test to understand subprocess lifecycle
- Check timeout values and signal handling
- Verify graceful shutdown logic waits for server to bind port before terminating
2. TestGetAcquisitionConfig
File: backend/internal/handlers/crowdsec_handler_test.go (assumed)
Error: Should not be: 404 - Getting unexpected 404 HTTP status
Root Cause Hypothesis:
- CrowdSec config endpoint returns 404 when SecurityConfig table missing
- Test expects config to exist but CI database doesn't have it migrated
- Local database might have lingering state from previous runs
Local vs CI: Local database persistence vs fresh CI database
Priority: 🟡 MEDIUM - CrowdSec feature must work
Fix Complexity: EASY (30 minutes)
Remediation:
- Ensure test migrates SecurityConfig table before testing
- Or adjust expected behavior when config doesn't exist
3. TestEnsureBouncerRegistration_ConcurrentCalls
File: backend/internal/services/crowdsec_lapi_service_test.go (assumed)
Error: Not equal: expected: 1 - Count assertion failing
Root Cause Hypothesis:
- Race condition in concurrent bouncer registration
- Test expects exactly 1 bouncer but gets 0 or >1 due to timing
- CI environment slower causing timeout or race window
Local vs CI: Different CPU cores or timing characteristics
Priority: 🟡 MEDIUM - Concurrency safety important
Fix Complexity: HARD (3-4 hours)
Remediation:
- Add explicit synchronization or retries in test
- Increase timeout for concurrent operations
- Use eventually assertions instead of immediate checks
4. TestPluginHandler_ReloadPlugins_WithErrors
File: backend/internal/api/handlers/plugin_handler_test.go
Error: Not equal: expected: 200 - HTTP status code not 200
Root Cause Hypothesis:
- Plugin reload returns error status (likely 500 or 400) instead of 200
- Test expects reload to succeed even with errors (bad plugin files)
- Endpoint behavior might differ when plugin directory doesn't exist
Local vs CI: Local might have plugin directory setup, CI starts fresh
Priority: 🟢 LOW - Plugin system edge case
Fix Complexity: EASY (1 hour)
Remediation:
- Read test to understand expected behavior with errors
- Adjust expectation or ensure test setup creates proper plugin state
5. TestFetchIndexFallbackHTTP
File: backend/internal/services/crowdsec_preset_service_test.go (assumed)
Error: Received unexpected error: - Some error occurred during HTTP fallback
Root Cause Hypothesis:
- HTTP fallback mechanism fails when primary fetch method unavailable
- Network request in test might be blocked in CI
- Missing mock or test fixture for HTTP response
Local vs CI: CI network restrictions or missing test server
Priority: 🟢 LOW - Fallback mechanism edge case
Fix Complexity: MEDIUM (1-2 hours)
Remediation:
- Ensure test uses mock HTTP server, not real network
- Check if test fixture files exist in CI
- Verify fallback logic handles all error cases
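If the test currently reaches the real network, a local `httptest` server is one way to make the HTTP fallback deterministic in CI. This is a sketch under assumptions: `newIndexServer` and `fetchIndex` are hypothetical names, and the real preset service may build its request differently.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newIndexServer starts a local server that always returns body.
// The caller is responsible for calling Close on the result.
func newIndexServer(body string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, body)
	}))
}

// fetchIndex is a hypothetical stand-in for the service's HTTP fallback.
func fetchIndex(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	b, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

The test would point the service at `srv.URL` and use `srv.Client()`, so no request leaves the CI runner.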
6. TestRunScheduledBackup_CleanupFails
File: backend/internal/services/backup_service_test.go
Error: "0" is not greater than or equal to "1" - Cleanup count assertion
Root Cause Hypothesis:
- Test simulates cleanup failure but checks for at least 1 deletion
- Cleanup function doesn't attempt deletion when it should
- Race condition or timing issue preventing cleanup execution
Local vs CI: Filesystem timing or goroutine scheduling
Priority: 🟡 MEDIUM - Backup reliability important
Fix Complexity: MEDIUM (1-2 hours)
Remediation:
- Read test to understand cleanup failure scenario
- Verify test assertion matches expected behavior
- Add debug logging to see what cleanup actually does
7. TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite
File: backend/internal/services/security_service_test.go
Error: Not equal: expected: "sync-fallback" - Audit log type mismatch
Root Cause Hypothesis:
- Test fills audit channel to trigger sync fallback
- Fallback not triggered or audit records wrong log type
- Timing issue - channel drains before fallback needed
Local vs CI: Goroutine scheduling or channel buffer behavior
Priority: 🟡 MEDIUM - Audit reliability important
Fix Complexity: MEDIUM (2 hours)
Remediation:
- Verify channel size and fill logic in test
- Check if fallback logic correctly sets log type
- Add explicit synchronization to ensure channel full before write
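The fallback path can be exercised deterministically if nothing drains the channel while the test fills it. The sketch below models the pattern with a non-blocking send. `auditWriter` and its fields are hypothetical and do not reflect the real SecurityService; the "async" label is also an assumption, and only "sync-fallback" comes from the failing assertion.

```go
package main

// auditWriter is a hypothetical model of asynchronous audit logging with a
// synchronous fallback. It is not safe for concurrent use as written.
type auditWriter struct {
	queue  chan string // buffered channel normally drained by a worker
	synced []string    // entries written on the synchronous path
}

// log enqueues entry when the buffer has room and otherwise writes it
// synchronously. It returns the path that was taken.
func (w *auditWriter) log(entry string) string {
	select {
	case w.queue <- entry:
		return "async"
	default:
		// Buffer is full: fall back to a synchronous write.
		w.synced = append(w.synced, entry)
		return "sync-fallback"
	}
}
```

With a buffer of size N and no consumer running, the first N calls take the asynchronous path and call N+1 must take the fallback. If the real test starts the worker goroutine before filling the channel, the buffer may drain first, which would match the hypothesis above.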
8. TestCredentialService_GetCredentialForDomain_ExactMatch
File: backend/internal/services/credential_service_test.go
Error: Received unexpected error: - Method returns error instead of success
Root Cause Hypothesis:
- Credential lookup fails due to missing data or encryption issue
- Database state corrupted or incomplete in test
- Service initialization error (though encryption key IS present)
Local vs CI: Unknown - need to read test implementation
Priority: 🔴 HIGH - Core credential management feature
Fix Complexity: UNKNOWN (needs investigation)
Remediation:
- Read test file starting at line 265
- Check test setup creates proper credential with zone filter
- Verify encryption service initialized correctly in test
- Add debug logging to see actual error message
9. TestCredentialService_GetCredentialForDomain_WildcardMatch
File: backend/internal/services/credential_service_test.go
Error: Received unexpected error: - Method returns error instead of success
Root Cause Hypothesis:
- Similar to ExactMatch test - credential lookup fails
- Wildcard matching logic has bug or missing data
- Zone filter parsing error
Local vs CI: Unknown - need to read test implementation
Priority: 🔴 HIGH - Core credential management feature
Fix Complexity: UNKNOWN (needs investigation)
Remediation:
- Read test file starting at line 297
- Check wildcard zone filter setup (e.g., "*.example.com")
- Verify wildcard matching algorithm
- Add debug logging to see actual error message
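When verifying the matching algorithm, it may help to compare it against a small reference implementation. The function below is a hypothetical sketch, not the service's code. It assumes that a wildcard filter matches subdomains at any depth but not the apex domain, and that comparison is case-insensitive. The real service may define these rules differently.

```go
package main

import "strings"

// matchesZone (hypothetical) reports whether domain matches filter.
// A filter such as "*.example.com" matches any subdomain of example.com
// but not example.com itself; any other filter must match exactly.
func matchesZone(filter, domain string) bool {
	filter = strings.ToLower(strings.TrimSpace(filter))
	// Drop a trailing dot so fully qualified names compare equal.
	domain = strings.ToLower(strings.TrimSuffix(strings.TrimSpace(domain), "."))
	if strings.HasPrefix(filter, "*.") {
		suffix := filter[1:] // e.g. ".example.com"
		return len(domain) > len(suffix) && strings.HasSuffix(domain, suffix)
	}
	return filter == domain
}
```

Under these assumptions, `matchesZone("*.example.com", "api.example.com")` is true, while `matchesZone("*.example.com", "example.com")` is false.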
10. TestDeleteCertificate_CreatesBackup ⚠️
File: backend/internal/services/certificate_service_test.go
Error: no such table: proxy_hosts (database query error)
Note: Similar tests with same error PASS (e.g., TestDeleteCertificate_UsageCheckError)
Root Cause Hypothesis:
- Test database missing proxy_hosts table migration
- Test expects error and handles it, but THIS specific test doesn't
- Test assertion checks backup creation AFTER checking proxy_hosts (fails early)
Local vs CI: Local database might have full schema
Priority: 🟢 LOW - May be expected behavior
Fix Complexity: EASY (30 minutes)
Remediation:
- Read test to see what it actually expects
- Either add proxy_hosts to test database migration
- Or adjust test to expect "table not found" error
Remediation Options
Option A: Fix All Now (Recommended for Blocking Quality Gate)
Time Estimate: 8-14 hours (1-2 days)
Pros:
- Comprehensive fix, no technical debt
- High confidence in test suite
- Unblocks CI completely
Cons:
- Delays coverage patch work
- Some fixes may be complex (concurrency tests)
Implementation Plan:
- Phase 1: High Priority (4-6 hours)
  - TestMain_DefaultStartupGracefulShutdown_Subprocess
  - TestCredentialService ExactMatch & WildcardMatch
- Phase 2: Medium Priority (2-4 hours)
  - TestGetAcquisitionConfig
  - TestEnsureBouncerRegistration_ConcurrentCalls
  - TestRunScheduledBackup_CleanupFails
  - TestSecurityService_LogAudit
- Phase 3: Low Priority (2-4 hours)
  - TestPluginHandler_ReloadPlugins_WithErrors
  - TestFetchIndexFallbackHTTP
  - TestDeleteCertificate_CreatesBackup
Option B: Skip Non-Critical Tests (Fastest)
Time Estimate: 1-2 hours
Pros:
- Fastest path to green CI
- Focus on coverage patch work immediately
Cons:
- Technical debt accumulates
- May mask real bugs
- Need to track TODOs
Implementation:
- Add `t.Skip("CI environment test - tracked in issue #XXX")` to low/medium priority tests
- Keep HIGH priority tests (Main server startup, credential service)
- Create GitHub issues for each skipped test
- Fix during next sprint
Option C: Parallel Work (Balanced)
Time Estimate: 4-6 hours first pass, then monitor
Pros:
- Unblock critical paths quickly
- Comprehensive fix in parallel
Cons:
- More context switching
- Risk of merge conflicts
Implementation:
- Skip low-priority tests immediately (TestPlugin*, TestFetchIndex, TestDeleteCert)
- Fix HIGH priority tests in parallel with coverage work
- Tackle MEDIUM priority tests after coverage patch merged
Decision Matrix
| Criteria | Option A | Option B | Option C |
|---|---|---|---|
| Time to Green CI | 1-2 days | 1-2 hours | 4-6 hours |
| Technical Debt | None | High | Medium |
| Risk of Masking Bugs | Low | High | Medium |
| Coverage Patch Delay | High | None | Low |
| Long-term Quality | Best | Worst | Good |
Recommended Approach
OPTION A - Fix All Now
Reasoning:
- Test failures indicate real issues in application logic or test environment
- Skipping tests hides potential bugs that could affect production
- The 9 failures represent core features (server startup, credentials, security auditing, backups)
- Encryption key issue was a red herring - actual fixes should be straightforward
- Better to have stable CI before moving to coverage patch work
Next Steps:
- User approves Option A
- Delegate to `Backend_Dev` agent: "Fix test failures following Phase 1 → Phase 2 → Phase 3 order"
- For each test:
  - Read test file
  - Understand expected behavior
  - Reproduce locally if possible (with fresh database)
  - Fix root cause
  - Verify fix locally
  - Commit with descriptive message
- Push all fixes as single logical commit
- Monitor CI workflows for green status
- Return to coverage patch work
Confidence Assessment
Root Cause Identified: ✅ YES - Not encryption key, but individual test issues
Fix Complexity: 🟡 MEDIUM - Mix of easy and hard fixes
Upstream Blockers: ❌ NONE - All fixes are local test changes
Risk of Regression: 🟢 LOW - Tests are isolated, fixes won't affect production code
Notes for Implementation
- All fixes should be in test files only (`*_test.go`)
- Production code should NOT need changes (except if real bugs found)
- Add comments explaining CI-specific behavior if needed
- Use `t.Logf()` for debug output during investigation
- Commit frequently with descriptive messages per fix group
- Run `go test -v -run TestName ./path/` to test individually
Final Recommendation
DO NOT skip tests. These failures represent real issues that need fixing:
- Server graceful shutdown
- Credential domain matching (core feature)
- Security audit logging (compliance requirement)
- CrowdSec bouncer registration (security feature)
- Backup cleanup (data integrity)
Proceed with Option A: Fix All Now.