
CI Test Failures Detailed Remediation Plan

Date: 2026-02-16
Workflow Run: 22079827893 (codecov-upload.yml)
Branch: feature/beta-release
Status: 🔴 BLOCKING - 9+ tests failing

Executive Summary

CRITICAL DISCOVERY: The test failures are NOT related to the CHARON_ENCRYPTION_KEY environment variable. The encryption key is properly set and working in CI. The failures stem from test-specific issues involving HTTP status codes, timing, concurrency, and database state.

Evidence:

  • CI logs show NO warnings about "CHARON_ENCRYPTION_KEY is required"
  • CI logs show NO errors about "invalid key length"
  • Services initialize successfully with encryption
  • Coverage at 85.1% (meets requirement)

Actual Root Cause: Individual test logic, timing, or environmental differences between local and CI execution.


Failed Tests with Actual Errors

1. TestMain_DefaultStartupGracefulShutdown_Subprocess

File: backend/cmd/api/main_test.go
Error: Process terminated with signal: terminated after 0.57s
Observation: Subprocess starts successfully (logs show server initialization) but then receives a termination signal
Root Cause Hypothesis:

  • Subprocess doesn't terminate gracefully within expected time
  • Missing or delayed signal handling in test
  • Race condition between parent sending signal and subprocess responding

Local vs CI: May pass locally due to faster execution or different signal handling
Priority: 🔴 HIGH - Main server startup flow must work
Fix Complexity: MEDIUM (2-3 hours)
Remediation:

  • Read test to understand subprocess lifecycle
  • Check timeout values and signal handling
  • Verify graceful shutdown logic waits for server to bind port before terminating
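
One way to close the race between the parent sending its signal and the subprocess becoming ready is to have the test poll the listen address before signalling. The following is a minimal sketch of that idea, not code from main_test.go: `waitForPort` and `demoReadiness` are hypothetical helpers, and the timeouts are placeholder values.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// waitForPort polls addr until a TCP connection succeeds or the timeout
// elapses. It returns nil as soon as something is accepting connections.
func waitForPort(addr string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		conn, err := net.DialTimeout("tcp", addr, 100*time.Millisecond)
		if err == nil {
			conn.Close()
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("port %s not ready after %s: %w", addr, timeout, err)
		}
		time.Sleep(20 * time.Millisecond)
	}
}

// demoReadiness uses an in-process listener as a stand-in for the subprocess.
// It checks the address once while the listener is bound and once after it
// has been closed.
func demoReadiness() (ready error, closed error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err, nil
	}
	addr := ln.Addr().String()

	ready = waitForPort(addr, 2*time.Second)

	ln.Close()
	closed = waitForPort(addr, 200*time.Millisecond)
	return ready, closed
}

func main() {
	ready, closed := demoReadiness()
	fmt.Println("bound port reported ready:", ready == nil)
	fmt.Println("closed port timed out:", closed != nil)
}
```

In the actual test, the parent would call a helper like this against the subprocess's address and send the termination signal only after it returns nil, so the shutdown path is exercised on a fully started server.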

2. TestGetAcquisitionConfig

File: backend/internal/handlers/crowdsec_handler_test.go (assumed)
Error: Should not be: 404 - Getting unexpected 404 HTTP status
Root Cause Hypothesis:

  • CrowdSec config endpoint returns 404 when SecurityConfig table missing
  • Test expects config to exist but CI database doesn't have it migrated
  • Local database might have lingering state from previous runs

Local vs CI: Local database persistence vs fresh CI database
Priority: 🟡 MEDIUM - CrowdSec feature must work
Fix Complexity: EASY (30 minutes)
Remediation:

  • Ensure test migrates SecurityConfig table before testing
  • Or adjust expected behavior when config doesn't exist

3. TestEnsureBouncerRegistration_ConcurrentCalls

File: backend/internal/services/crowdsec_lapi_service_test.go (assumed)
Error: Not equal: expected: 1 - Count assertion failing
Root Cause Hypothesis:

  • Race condition in concurrent bouncer registration
  • Test expects exactly 1 bouncer but gets 0 or >1 due to timing
  • CI environment slower causing timeout or race window

Local vs CI: Different CPU cores or timing characteristics
Priority: 🟡 MEDIUM - Concurrency safety important
Fix Complexity: HARD (3-4 hours)
Remediation:

  • Add explicit synchronization or retries in test
  • Increase timeout for concurrent operations
  • Use eventually assertions instead of immediate checks
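
The "explicit synchronization" idea can be illustrated with a small sketch. It assumes a hypothetical `bouncerRegistry` type standing in for the real service (whose API is not shown in this plan): registration does its check and insert under one lock, and the caller waits on a `sync.WaitGroup` before asserting the count, so the assertion cannot run while goroutines are still in flight.

```go
package main

import (
	"fmt"
	"sync"
)

// bouncerRegistry is a hypothetical stand-in for the service under test.
type bouncerRegistry struct {
	mu       sync.Mutex
	bouncers []string
}

// ensureRegistered is idempotent: the existence check and the append happen
// under the same lock, so concurrent callers cannot both insert.
func (r *bouncerRegistry) ensureRegistered(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, existing := range r.bouncers {
		if existing == name {
			return
		}
	}
	r.bouncers = append(r.bouncers, name)
}

// count returns the number of registered bouncers.
func (r *bouncerRegistry) count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.bouncers)
}

// registerConcurrently fires the given number of concurrent registrations
// for the same bouncer and returns the count once all of them have finished.
func registerConcurrently(callers int) int {
	reg := &bouncerRegistry{}
	var wg sync.WaitGroup
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			reg.ensureRegistered("example-bouncer")
		}()
	}
	wg.Wait() // assert only after every goroutine has returned
	return reg.count()
}

func main() {
	fmt.Println("registered bouncers:", registerConcurrently(50))
}
```

If the real test asserts before its goroutines finish, it could observe 0; if the service's check and insert are not atomic, it could observe more than 1. Running the test with `go test -race` may help tell the two cases apart.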

4. TestPluginHandler_ReloadPlugins_WithErrors

File: backend/internal/api/handlers/plugin_handler_test.go
Error: Not equal: expected: 200 - HTTP status code not 200
Root Cause Hypothesis:

  • Plugin reload returns error status (likely 500 or 400) instead of 200
  • Test expects reload to succeed even with errors (bad plugin files)
  • Endpoint behavior might differ when plugin directory doesn't exist

Local vs CI: Local might have plugin directory setup, CI starts fresh
Priority: 🟢 LOW - Plugin system edge case
Fix Complexity: EASY (1 hour)
Remediation:

  • Read test to understand expected behavior with errors
  • Adjust expectation or ensure test setup creates proper plugin state

5. TestFetchIndexFallbackHTTP

File: backend/internal/services/crowdsec_preset_service_test.go (assumed)
Error: Received unexpected error: - Some error occurred during HTTP fallback
Root Cause Hypothesis:

  • HTTP fallback mechanism fails when primary fetch method unavailable
  • Network request in test might be blocked in CI
  • Missing mock or test fixture for HTTP response

Local vs CI: CI network restrictions or missing test server
Priority: 🟢 LOW - Fallback mechanism edge case
Fix Complexity: MEDIUM (1-2 hours)
Remediation:

  • Ensure test uses mock HTTP server, not real network
  • Check if test fixture files exist in CI
  • Verify fallback logic handles all error cases
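
A test that never touches the real network can be built on `net/http/httptest`. The sketch below is an assumption-laden illustration rather than the project's code: `fetchWithFallback` is a hypothetical helper that tries URLs in order, and the JSON body is made up for the example. A failing primary server and a healthy fallback server are both started in-process.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchWithFallback (hypothetical) tries each URL in order and returns the
// first body that comes back with HTTP 200.
func fetchWithFallback(client *http.Client, urls ...string) ([]byte, error) {
	var lastErr error
	for _, u := range urls {
		resp, err := client.Get(u)
		if err != nil {
			lastErr = err
			continue
		}
		body, readErr := io.ReadAll(resp.Body)
		resp.Body.Close()
		if readErr != nil {
			lastErr = readErr
			continue
		}
		if resp.StatusCode != http.StatusOK {
			lastErr = fmt.Errorf("GET %s: unexpected status %d", u, resp.StatusCode)
			continue
		}
		return body, nil
	}
	if lastErr == nil {
		lastErr = errors.New("no URLs provided")
	}
	return nil, lastErr
}

// fetchFromMockServers starts a failing primary and a working fallback
// server locally, then fetches through both.
func fetchFromMockServers() (string, error) {
	primary := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "primary unavailable", http.StatusServiceUnavailable)
	}))
	defer primary.Close()

	fallback := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		io.WriteString(w, `{"presets":[]}`)
	}))
	defer fallback.Close()

	client := &http.Client{Timeout: 2 * time.Second}
	body, err := fetchWithFallback(client, primary.URL, fallback.URL)
	return string(body), err
}

func main() {
	body, err := fetchFromMockServers()
	fmt.Println("body:", body, "err:", err)
}
```

Because both servers bind to loopback addresses chosen at runtime, a test written this way should behave the same locally and in a network-restricted CI runner.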

6. TestRunScheduledBackup_CleanupFails

File: backend/internal/services/backup_service_test.go
Error: "0" is not greater than or equal to "1" - Cleanup count assertion
Root Cause Hypothesis:

  • Test simulates cleanup failure but checks for at least 1 deletion
  • Cleanup function doesn't attempt deletion when it should
  • Race condition or timing issue preventing cleanup execution

Local vs CI: Filesystem timing or goroutine scheduling
Priority: 🟡 MEDIUM - Backup reliability important
Fix Complexity: MEDIUM (1-2 hours)
Remediation:

  • Read test to understand cleanup failure scenario
  • Verify test assertion matches expected behavior
  • Add debug logging to see what cleanup actually does

7. TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite

File: backend/internal/services/security_service_test.go
Error: Not equal: expected: "sync-fallback" - Audit log type mismatch
Root Cause Hypothesis:

  • Test fills audit channel to trigger sync fallback
  • Fallback not triggered or audit records wrong log type
  • Timing issue - channel drains before fallback needed

Local vs CI: Goroutine scheduling or channel buffer behavior
Priority: 🟡 MEDIUM - Audit reliability important
Fix Complexity: MEDIUM (2 hours)
Remediation:

  • Verify channel size and fill logic in test
  • Check if fallback logic correctly sets log type
  • Add explicit synchronization to ensure channel full before write
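
The timing hypothesis can be made concrete with a sketch of a non-blocking send. This is an illustration under assumptions, not the SecurityService implementation: `auditLogger` and `logAudit` are hypothetical, and the returned strings simply name the path taken. The key point is that no consumer goroutine runs while the buffer is being filled, so the channel cannot drain before the final write.

```go
package main

import "fmt"

// auditLogger is a hypothetical stand-in with a buffered queue and no
// consumer, so the queue stays full once it has been filled.
type auditLogger struct {
	queue chan string
}

// logAudit enqueues the event if the buffer has room; otherwise it takes the
// synchronous path. It returns the name of the path it used.
func (l *auditLogger) logAudit(event string) string {
	select {
	case l.queue <- event:
		return "async"
	default:
		// A real service would write the record directly here.
		return "sync-fallback"
	}
}

// fillThenLog logs capacity+1 events and records which path each one took.
func fillThenLog(capacity int) []string {
	l := &auditLogger{queue: make(chan string, capacity)}
	paths := make([]string, 0, capacity+1)
	for i := 0; i <= capacity; i++ {
		paths = append(paths, l.logAudit(fmt.Sprintf("event-%d", i)))
	}
	return paths
}

func main() {
	fmt.Println(fillThenLog(2))
}
```

If the real service starts its consumer goroutine at construction time, the test may need a way to pause or block that consumer before filling the channel; otherwise the fallback may or may not trigger depending on scheduling.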

8. TestCredentialService_GetCredentialForDomain_ExactMatch

File: backend/internal/services/credential_service_test.go
Error: Received unexpected error: - Method returns error instead of success
Root Cause Hypothesis:

  • Credential lookup fails due to missing data or encryption issue
  • Database state corrupted or incomplete in test
  • Service initialization error (though encryption key IS present)

Local vs CI: Unknown - need to read test implementation
Priority: 🔴 HIGH - Core credential management feature
Fix Complexity: UNKNOWN (needs investigation)
Remediation:

  • Read test file starting at line 265
  • Check test setup creates proper credential with zone filter
  • Verify encryption service initialized correctly in test
  • Add debug logging to see actual error message

9. TestCredentialService_GetCredentialForDomain_WildcardMatch

File: backend/internal/services/credential_service_test.go
Error: Received unexpected error: - Method returns error instead of success
Root Cause Hypothesis:

  • Similar to ExactMatch test - credential lookup fails
  • Wildcard matching logic has bug or missing data
  • Zone filter parsing error

Local vs CI: Unknown - need to read test implementation
Priority: 🔴 HIGH - Core credential management feature
Fix Complexity: UNKNOWN (needs investigation)
Remediation:

  • Read test file starting at line 297
  • Check wildcard zone filter setup (e.g., "*.example.com")
  • Verify wildcard matching algorithm
  • Add debug logging to see actual error message

10. TestDeleteCertificate_CreatesBackup ⚠️

File: backend/internal/services/certificate_service_test.go
Error: no such table: proxy_hosts (database query error)
Note: Similar tests with same error PASS (e.g., TestDeleteCertificate_UsageCheckError)
Root Cause Hypothesis:

  • Test database missing proxy_hosts table migration
  • Test expects error and handles it, but THIS specific test doesn't
  • Test assertion checks backup creation AFTER checking proxy_hosts (fails early)

Local vs CI: Local database might have full schema
Priority: 🟢 LOW - May be expected behavior
Fix Complexity: EASY (30 minutes)
Remediation:

  • Read test to see what it actually expects
  • Either add proxy_hosts to test database migration
  • Or adjust test to expect "table not found" error

Remediation Options

Option A: Fix All Now

Time Estimate: 8-14 hours (1-2 days)
Pros:

  • Comprehensive fix, no technical debt
  • High confidence in test suite
  • Unblocks CI completely

Cons:

  • Delays coverage patch work
  • Some fixes may be complex (concurrency tests)

Implementation Plan:

  1. Phase 1: High Priority (4-6 hours)
    • TestMain_DefaultStartupGracefulShutdown_Subprocess
    • TestCredentialService ExactMatch & WildcardMatch
  2. Phase 2: Medium Priority (2-4 hours)
    • TestGetAcquisitionConfig
    • TestEnsureBouncerRegistration_ConcurrentCalls
    • TestRunScheduledBackup_CleanupFails
    • TestSecurityService_LogAudit
  3. Phase 3: Low Priority (2-4 hours)
    • TestPluginHandler_ReloadPlugins_WithErrors
    • TestFetchIndexFallbackHTTP
    • TestDeleteCertificate_CreatesBackup

Option B: Skip Non-Critical Tests (Fastest)

Time Estimate: 1-2 hours
Pros:

  • Fastest path to green CI
  • Focus on coverage patch work immediately

Cons:

  • Technical debt accumulates
  • May mask real bugs
  • Need to track TODOs

Implementation:

  • Add t.Skip("CI environment test - tracked in issue #XXX") to low/medium priority tests
  • Keep HIGH priority tests (Main server startup, credential service)
  • Create GitHub issues for each skipped test
  • Fix during next sprint

Option C: Parallel Work (Balanced)

Time Estimate: 4-6 hours first pass, then monitor
Pros:

  • Unblock critical paths quickly
  • Comprehensive fix in parallel

Cons:

  • More context switching
  • Risk of merge conflicts

Implementation:

  1. Skip low-priority tests immediately (TestPlugin*, TestFetchIndex, TestDeleteCert)
  2. Fix HIGH priority tests in parallel with coverage work
  3. Tackle MEDIUM priority tests after coverage patch merged

Decision Matrix

| Criteria | Option A | Option B | Option C |
|----------|----------|----------|----------|
| Time to Green CI | 1-2 days | 1-2 hours | 4-6 hours |
| Technical Debt | None | High | Medium |
| Risk of Masking Bugs | Low | High | Medium |
| Coverage Patch Delay | High | None | Low |
| Long-term Quality | Best | Worst | Good |

Recommendation: OPTION A - Fix All Now

Reasoning:

  1. Test failures indicate real issues in application logic or test environment
  2. Skipping tests hides potential bugs that could affect production
  3. The 9 failures represent core features (server startup, credentials, security auditing, backups)
  4. Encryption key issue was a red herring - actual fixes should be straightforward
  5. Better to have stable CI before moving to coverage patch work

Next Steps:

  1. User approves Option A
  2. Delegate to Backend_Dev agent: "Fix test failures following Phase 1 → Phase 2 → Phase 3 order"
  3. For each test:
    • Read test file
    • Understand expected behavior
    • Reproduce locally if possible (with fresh database)
    • Fix root cause
    • Verify fix locally
    • Commit with descriptive message
  4. Push all fixes as a single logical commit
  5. Monitor CI workflows for green status
  6. Return to coverage patch work

Confidence Assessment

Root Cause Identified: YES - Not encryption key, but individual test issues
Fix Complexity: 🟡 MEDIUM - Mix of easy and hard fixes
Upstream Blockers: NONE - All fixes are local test changes
Risk of Regression: 🟢 LOW - Tests are isolated, fixes won't affect production code


Notes for Implementation

  • All fixes should be in test files only (*_test.go)
  • Production code should NOT need changes (unless real bugs are found)
  • Add comments explaining CI-specific behavior if needed
  • Use t.Logf() for debug output during investigation
  • Commit frequently with descriptive messages per fix group
  • Run go test -v -run TestName ./path/ to test individually

Final Recommendation

DO NOT skip tests. These failures represent real issues that need fixing:

  • Server graceful shutdown
  • Credential domain matching (core feature)
  • Security audit logging (compliance requirement)
  • CrowdSec bouncer registration (security feature)
  • Backup cleanup (data integrity)

Proceed with Option A: Fix All Now.