Charon/docs/plans/CI_TEST_FAILURES_DETAILED_REMEDIATION.md
GitHub Actions 66cb95275d fix(tests): adapt TestMain_DefaultStartupGracefulShutdown_Subprocess to Go 1.26.0 signal handling
- Increased SIGTERM signal timeout from 500ms to 1000ms
- Go 1.26.0 changed signal delivery timing on Linux
- Test now passes reliably with adequate startup grace period

Related to Go 1.26.0 upgrade (commit dc40102a)
2026-02-16 23:53:30 +00:00

# CI Test Failures Detailed Remediation Plan
**Date:** 2026-02-16
**Workflow Run:** 22079827893 (codecov-upload.yml)
**Branch:** feature/beta-release
**Status:** 🔴 BLOCKING - 9+ tests failing
## Executive Summary
**CRITICAL DISCOVERY:** The test failures are **NOT** related to the `CHARON_ENCRYPTION_KEY` environment variable. The encryption key is properly set and working in CI. The failures stem from test-specific issues: unexpected HTTP status codes, timing, concurrency, and database state.
**Evidence:**
- CI logs show NO warnings about "CHARON_ENCRYPTION_KEY is required"
- CI logs show NO errors about "invalid key length"
- Services initialize successfully with encryption
- Coverage at 85.1% (meets requirement)
**Actual Root Cause:** Individual test logic, timing, or environmental differences between local and CI execution.
---
## Failed Tests with Actual Errors
### 1. TestMain_DefaultStartupGracefulShutdown_Subprocess
**File:** `backend/cmd/api/main_test.go`
**Error:** Process terminated with `signal: terminated` after 0.57s
**Observation:** Subprocess starts successfully (logs show server initialization) but then receives termination signal
**Root Cause Hypothesis:**
- Subprocess doesn't terminate gracefully within expected time
- Missing or delayed signal handling in test
- Race condition between parent sending signal and subprocess responding
**Local vs CI:** May pass locally due to faster execution or different signal handling
**Priority:** 🔴 HIGH - Main server startup flow must work
**Fix Complexity:** MEDIUM (2-3 hours)
**Remediation:**
- Read test to understand subprocess lifecycle
- Check timeout values and signal handling
- Verify the test waits for the server to bind its port before sending the termination signal
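
One way to remove the race between the parent sending the signal and the subprocess finishing startup is to poll the listening port before signalling. The sketch below is a minimal illustration, not the project's code: `waitForPort`, `checkWaitForPort`, the polling interval, and the timeouts are all hypothetical. In the real test the parent would call something like `waitForPort` after `cmd.Start()` and only then call `cmd.Process.Signal(syscall.SIGTERM)`.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// waitForPort polls addr until a TCP connection succeeds or the timeout
// elapses. Calling it before sending SIGTERM means the signal cannot
// arrive while the server is still starting up.
// (Hypothetical helper, not part of Charon.)
func waitForPort(addr string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		conn, err := net.DialTimeout("tcp", addr, 100*time.Millisecond)
		if err == nil {
			conn.Close()
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("port %s not ready after %s: %w", addr, timeout, err)
		}
		time.Sleep(25 * time.Millisecond)
	}
}

// checkWaitForPort exercises waitForPort against a throwaway listener
// bound to an OS-assigned loopback port.
func checkWaitForPort() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()
	return waitForPort(ln.Addr().String(), time.Second)
}
```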
---
### 2. TestGetAcquisitionConfig
**File:** `backend/internal/handlers/crowdsec_handler_test.go` (assumed)
**Error:** `Should not be: 404` - Getting unexpected 404 HTTP status
**Root Cause Hypothesis:**
- CrowdSec config endpoint returns 404 when SecurityConfig table missing
- Test expects config to exist but CI database doesn't have it migrated
- Local database might have lingering state from previous runs
**Local vs CI:** Local database persistence vs fresh CI database
**Priority:** 🟡 MEDIUM - CrowdSec feature must work
**Fix Complexity:** EASY (30 minutes)
**Remediation:**
- Ensure test migrates SecurityConfig table before testing
- Or adjust expected behavior when config doesn't exist
---
### 3. TestEnsureBouncerRegistration_ConcurrentCalls
**File:** `backend/internal/services/crowdsec_lapi_service_test.go` (assumed)
**Error:** `Not equal: expected: 1` - Count assertion failing
**Root Cause Hypothesis:**
- Race condition in concurrent bouncer registration
- Test expects exactly 1 bouncer but gets 0 or >1 due to timing
- CI environment slower causing timeout or race window
**Local vs CI:** Different CPU cores or timing characteristics
**Priority:** 🟡 MEDIUM - Concurrency safety important
**Fix Complexity:** HARD (3-4 hours)
**Remediation:**
- Add explicit synchronization or retries in test
- Increase timeout for concurrent operations
- Use eventually-style assertions (poll until the condition holds or a timeout expires) instead of immediate checks
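
The "explicit synchronization" idea can be sketched as follows, assuming the registration is a check-then-insert operation. `bouncerRegistry`, `ensureRegistered`, and `registerConcurrently` are hypothetical stand-ins, not the actual service API. The two points illustrated are that the check and the insert share one lock, and that the count is read only after every goroutine has finished.

```go
package main

import "sync"

// bouncerRegistry is a hypothetical stand-in for the service under test.
type bouncerRegistry struct {
	mu    sync.Mutex
	names []string
}

// ensureRegistered is idempotent: the lookup and the append happen under
// the same lock, so two concurrent callers cannot both see "missing" and
// register the same bouncer twice.
func (r *bouncerRegistry) ensureRegistered(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, n := range r.names {
		if n == name {
			return
		}
	}
	r.names = append(r.names, name)
}

// count returns the number of registered bouncers.
func (r *bouncerRegistry) count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.names)
}

// registerConcurrently starts n goroutines that all register the same
// name, waits for every one of them, and only then reads the count.
func registerConcurrently(r *bouncerRegistry, name string, n int) int {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.ensureRegistered(name)
		}()
	}
	wg.Wait()
	return r.count()
}
```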
---
### 4. TestPluginHandler_ReloadPlugins_WithErrors
**File:** `backend/internal/api/handlers/plugin_handler_test.go`
**Error:** `Not equal: expected: 200` - HTTP status code not 200
**Root Cause Hypothesis:**
- Plugin reload returns error status (likely 500 or 400) instead of 200
- Test expects reload to succeed even with errors (bad plugin files)
- Endpoint behavior might differ when plugin directory doesn't exist
**Local vs CI:** Local might have plugin directory setup, CI starts fresh
**Priority:** 🟢 LOW - Plugin system edge case
**Fix Complexity:** EASY (1 hour)
**Remediation:**
- Read test to understand expected behavior with errors
- Adjust expectation or ensure test setup creates proper plugin state
---
### 5. TestFetchIndexFallbackHTTP
**File:** `backend/internal/services/crowdsec_preset_service_test.go` (assumed)
**Error:** `Received unexpected error:` - Some error occurred during HTTP fallback
**Root Cause Hypothesis:**
- HTTP fallback mechanism fails when primary fetch method unavailable
- Network request in test might be blocked in CI
- Missing mock or test fixture for HTTP response
**Local vs CI:** CI network restrictions or missing test server
**Priority:** 🟢 LOW - Fallback mechanism edge case
**Fix Complexity:** MEDIUM (1-2 hours)
**Remediation:**
- Ensure test uses mock HTTP server, not real network
- Check if test fixture files exist in CI
- Verify fallback logic handles all error cases
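
To keep the fallback test off the real network, the HTTP response can be served from an in-process server using the standard library's `net/http/httptest` package. The sketch below assumes the fallback is a plain GET that returns the body. `fetchIndex` and `fetchFromMock` are hypothetical names, and the response body is whatever placeholder the caller passes in, not the real index format.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchIndex is a hypothetical stand-in for the HTTP fallback: it GETs
// url and returns the body, treating any non-200 status as an error.
func fetchIndex(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// fetchFromMock serves a canned body from an in-process test server and
// fetches it through fetchIndex, so no external network access is needed.
func fetchFromMock(body string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, body)
	}))
	defer srv.Close()
	return fetchIndex(srv.Client(), srv.URL)
}
```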
---
### 6. TestRunScheduledBackup_CleanupFails
**File:** `backend/internal/services/backup_service_test.go`
**Error:** `"0" is not greater than or equal to "1"` - Cleanup count assertion
**Root Cause Hypothesis:**
- Test simulates cleanup failure but checks for at least 1 deletion
- Cleanup function doesn't attempt deletion when it should
- Race condition or timing issue preventing cleanup execution
**Local vs CI:** Filesystem timing or goroutine scheduling
**Priority:** 🟡 MEDIUM - Backup reliability important
**Fix Complexity:** MEDIUM (1-2 hours)
**Remediation:**
- Read test to understand cleanup failure scenario
- Verify test assertion matches expected behavior
- Add debug logging to see what cleanup actually does
---
### 7. TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite
**File:** `backend/internal/services/security_service_test.go`
**Error:** `Not equal: expected: "sync-fallback"` - Audit log type mismatch
**Root Cause Hypothesis:**
- Test fills audit channel to trigger sync fallback
- Fallback not triggered or audit records wrong log type
- Timing issue - channel drains before fallback needed
**Local vs CI:** Goroutine scheduling or channel buffer behavior
**Priority:** 🟡 MEDIUM - Audit reliability important
**Fix Complexity:** MEDIUM (2 hours)
**Remediation:**
- Verify channel size and fill logic in test
- Check if fallback logic correctly sets log type
- Add explicit synchronization to ensure channel full before write
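
The "channel drains before fallback needed" hypothesis can be ruled out by making the full-channel precondition deterministic. The following is a simplified model, not the real `SecurityService`: `auditLogger`, `newAuditLogger`, `log`, and `fillQueue` are hypothetical. It assumes the production code uses a non-blocking send and falls back to a synchronous write when the buffer is full. With no consumer running, filling the buffer guarantees the next write takes the fallback path.

```go
package main

// auditLogger is a hypothetical, simplified model of an audit service
// with an async queue and a synchronous fallback.
type auditLogger struct {
	queue      chan string // async path; no consumer drains it in this sketch
	syncWrites []string    // entries written via the fallback path
}

func newAuditLogger(buffer int) *auditLogger {
	return &auditLogger{queue: make(chan string, buffer)}
}

// log attempts a non-blocking send. It returns "async" when the entry was
// queued and "sync-fallback" when the queue was full and the entry was
// written directly.
func (l *auditLogger) log(entry string) string {
	select {
	case l.queue <- entry:
		return "async"
	default:
		l.syncWrites = append(l.syncWrites, entry)
		return "sync-fallback"
	}
}

// fillQueue enqueues filler entries until the buffer is at capacity.
// Because nothing is consuming, the channel stays full afterwards.
func (l *auditLogger) fillQueue() {
	for len(l.queue) < cap(l.queue) {
		l.queue <- "filler"
	}
}
```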
---
### 8. TestCredentialService_GetCredentialForDomain_ExactMatch
**File:** `backend/internal/services/credential_service_test.go`
**Error:** `Received unexpected error:` - Method returns error instead of success
**Root Cause Hypothesis:**
- Credential lookup fails due to missing data or encryption issue
- Database state corrupted or incomplete in test
- Service initialization error (though encryption key IS present)
**Local vs CI:** Unknown - need to read test implementation
**Priority:** 🔴 HIGH - Core credential management feature
**Fix Complexity:** UNKNOWN (needs investigation)
**Remediation:**
- Read test file starting at line 265
- Check test setup creates proper credential with zone filter
- Verify encryption service initialized correctly in test
- Add debug logging to see actual error message
---
### 9. TestCredentialService_GetCredentialForDomain_WildcardMatch
**File:** `backend/internal/services/credential_service_test.go`
**Error:** `Received unexpected error:` - Method returns error instead of success
**Root Cause Hypothesis:**
- Similar to ExactMatch test - credential lookup fails
- Wildcard matching logic has bug or missing data
- Zone filter parsing error
**Local vs CI:** Unknown - need to read test implementation
**Priority:** 🔴 HIGH - Core credential management feature
**Fix Complexity:** UNKNOWN (needs investigation)
**Remediation:**
- Read test file starting at line 297
- Check wildcard zone filter setup (e.g., "*.example.com")
- Verify wildcard matching algorithm
- Add debug logging to see actual error message
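
When checking the wildcard matching algorithm, it can help to compare the service's behavior against a small reference matcher. The sketch below is only that: `matchZone` is a hypothetical function, and its semantics are assumptions (case-insensitive comparison, `*.example.com` matching any subdomain depth but not `example.com` itself). The real service may define zone filters differently.

```go
package main

import "strings"

// matchZone reports whether domain matches filter under assumed rules:
//   - a filter without a wildcard must equal the domain (case-insensitive)
//   - a "*.suffix" filter matches any subdomain of suffix, at any depth,
//     but not the bare suffix itself
// These rules are illustrative assumptions, not Charon's documented behavior.
func matchZone(filter, domain string) bool {
	filter = strings.ToLower(strings.TrimSpace(filter))
	domain = strings.ToLower(strings.TrimSuffix(strings.TrimSpace(domain), "."))
	if strings.HasPrefix(filter, "*.") {
		suffix := filter[1:] // keep the leading dot, e.g. ".example.com"
		return len(domain) > len(suffix) && strings.HasSuffix(domain, suffix)
	}
	return filter == domain
}
```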
---
### 10. TestDeleteCertificate_CreatesBackup ⚠️
**File:** `backend/internal/services/certificate_service_test.go`
**Error:** `no such table: proxy_hosts` (database query error)
**Note:** Similar tests with same error PASS (e.g., TestDeleteCertificate_UsageCheckError)
**Root Cause Hypothesis:**
- Test database missing proxy_hosts table migration
- Similar tests expect this error and handle it, but this specific test does not
- Test assertion checks backup creation AFTER checking proxy_hosts (fails early)
**Local vs CI:** Local database might have full schema
**Priority:** 🟢 LOW - May be expected behavior
**Fix Complexity:** EASY (30 minutes)
**Remediation:**
- Read test to see what it actually expects
- Either add proxy_hosts to test database migration
- Or adjust test to expect "table not found" error
---
## Remediation Options
### Option A: Fix All Now (Recommended for Blocking Quality Gate)
**Time Estimate:** 8-14 hours (1-2 days)
**Pros:**
- Comprehensive fix, no technical debt
- High confidence in test suite
- Unblocks CI completely
**Cons:**
- Delays coverage patch work
- Some fixes may be complex (concurrency tests)
**Implementation Plan:**
1. **Phase 1: High Priority** (4-6 hours)
   - TestMain_DefaultStartupGracefulShutdown_Subprocess
   - TestCredentialService ExactMatch & WildcardMatch
2. **Phase 2: Medium Priority** (2-4 hours)
   - TestGetAcquisitionConfig
   - TestEnsureBouncerRegistration_ConcurrentCalls
   - TestRunScheduledBackup_CleanupFails
   - TestSecurityService_LogAudit
3. **Phase 3: Low Priority** (2-4 hours)
   - TestPluginHandler_ReloadPlugins_WithErrors
   - TestFetchIndexFallbackHTTP
   - TestDeleteCertificate_CreatesBackup
---
### Option B: Skip Non-Critical Tests (Fastest)
**Time Estimate:** 1-2 hours
**Pros:**
- Fastest path to green CI
- Focus on coverage patch work immediately
**Cons:**
- Technical debt accumulates
- May mask real bugs
- Need to track TODOs
**Implementation:**
- Add `t.Skip("CI environment test - tracked in issue #XXX")` to low/medium priority tests
- Keep HIGH priority tests (Main server startup, credential service)
- Create GitHub issues for each skipped test
- Fix during next sprint
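
If Option B is chosen, one possible refinement is to skip only when running in CI, so the tests keep running locally. The sketch below assumes the `CI` environment variable is set in the CI environment (GitHub Actions sets `CI=true`). `skipReason` is a hypothetical helper, and the issue reference is left as the same placeholder used above. Inside a test it would be used as `if reason, ok := skipReason(os.Getenv, "issue #XXX"); ok { t.Skip(reason) }`.

```go
package main

// skipReason decides whether a test should be skipped in CI. It takes the
// environment lookup as a function (normally os.Getenv) so the decision
// can be checked without touching the real environment.
// (Hypothetical helper; the "CI" variable is an assumption about the runner.)
func skipReason(getenv func(string) string, issue string) (string, bool) {
	if getenv("CI") == "" {
		return "", false // not in CI: run the test normally
	}
	return "CI environment test - tracked in " + issue, true
}
```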
---
### Option C: Parallel Work (Balanced)
**Time Estimate:** 4-6 hours first pass, then monitor
**Pros:**
- Unblock critical paths quickly
- Comprehensive fix in parallel
**Cons:**
- More context switching
- Risk of merge conflicts
**Implementation:**
1. Skip low-priority tests immediately (TestPlugin*, TestFetchIndex, TestDeleteCert)
2. Fix HIGH priority tests in parallel with coverage work
3. Tackle MEDIUM priority tests after coverage patch merged
---
## Decision Matrix
| Criteria | Option A | Option B | Option C |
|----------|----------|----------|----------|
| Time to Green CI | 1-2 days | 1-2 hours | 4-6 hours |
| Technical Debt | None | High | Medium |
| Risk of Masking Bugs | Low | High | Medium |
| Coverage Patch Delay | High | None | Low |
| Long-term Quality | Best | Worst | Good |
---
## Recommended Approach
**OPTION A - Fix All Now**
**Reasoning:**
1. Test failures indicate real issues in application logic or test environment
2. Skipping tests hides potential bugs that could affect production
3. The failures affect core features (server startup, credentials, security auditing, backups)
4. Encryption key issue was a red herring - actual fixes should be straightforward
5. Better to have stable CI before moving to coverage patch work
**Next Steps:**
1. User approves Option A
2. Delegate to `Backend_Dev` agent: "Fix test failures following Phase 1 → Phase 2 → Phase 3 order"
3. For each test:
   - Read test file
   - Understand expected behavior
   - Reproduce locally if possible (with fresh database)
   - Fix root cause
   - Verify fix locally
   - Commit with descriptive message
4. Push all fixes together as one logical change set (one commit per fix group)
5. Monitor CI workflows for green status
6. Return to coverage patch work
---
## Confidence Assessment
**Root Cause Identified:** ✅ YES - Not encryption key, but individual test issues
**Fix Complexity:** 🟡 MEDIUM - Mix of easy and hard fixes
**Upstream Blockers:** ❌ NONE - All fixes are local test changes
**Risk of Regression:** 🟢 LOW - Tests are isolated, so the fixes should not affect production code
---
## Notes for Implementation
- All fixes should be in test files only (`*_test.go`)
- Production code should NOT need changes (unless real bugs are found)
- Add comments explaining CI-specific behavior if needed
- Use `t.Logf()` for debug output during investigation
- Commit frequently with descriptive messages per fix group
- Run `go test -v -run TestName ./path/` to test individually
---
## Final Recommendation
**DO NOT** skip tests. These failures represent real issues that need fixing:
- Server graceful shutdown
- Credential domain matching (core feature)
- Security audit logging (compliance requirement)
- CrowdSec bouncer registration (security feature)
- Backup cleanup (data integrity)
Proceed with **Option A: Fix All Now**.