chore: git cache cleanup
docs/plans/archive/CI_TEST_FAILURES_DETAILED_REMEDIATION.md
# CI Test Failures Detailed Remediation Plan

**Date:** 2026-02-16
**Workflow Run:** 22079827893 (codecov-upload.yml)
**Branch:** feature/beta-release
**Status:** 🔴 BLOCKING - 9+ tests failing

## Executive Summary

**CRITICAL DISCOVERY:** The test failures are **NOT** related to the `CHARON_ENCRYPTION_KEY` environment variable. The encryption key is properly set and working in CI. The failures are due to various test-specific issues including HTTP status codes, timing, concurrency, and database state.

**Evidence:**
- CI logs show NO warnings about "CHARON_ENCRYPTION_KEY is required"
- CI logs show NO errors about "invalid key length"
- Services initialize successfully with encryption
- Coverage at 85.1% (meets requirement)

**Actual Root Cause:** Individual test logic, timing, or environmental differences between local and CI execution.

---

## Failed Tests with Actual Errors

### 1. TestMain_DefaultStartupGracefulShutdown_Subprocess
**File:** `backend/cmd/api/main_test.go`
**Error:** Process terminated with `signal: terminated` after 0.57s
**Observation:** Subprocess starts successfully (logs show server initialization) but then receives a termination signal
**Root Cause Hypothesis:**
- Subprocess doesn't terminate gracefully within the expected time
- Missing or delayed signal handling in test
- Race condition between parent sending signal and subprocess responding

**Local vs CI:** May pass locally due to faster execution or different signal handling
**Priority:** 🔴 HIGH - Main server startup flow must work
**Fix Complexity:** MEDIUM (2-3 hours)
**Remediation:**
- Read test to understand subprocess lifecycle
- Check timeout values and signal handling
- Verify graceful shutdown logic waits for server to bind port before terminating

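One way to remove the race between "port bound" and "stop requested" is to have the server report readiness explicitly before the parent asks it to stop. The sketch below is a minimal illustration under assumptions, not the project's actual startup code: `serve` and `runOnce` are hypothetical names, and a cancellable context stands in for the termination signal (in a real `main`, `signal.NotifyContext` could supply that context).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

// serve binds the listener first, reports the bound address on ready, and
// only then waits for ctx to be cancelled before shutting down gracefully.
// The ready channel should be buffered so the send cannot block.
func serve(ctx context.Context, ready chan<- string) error {
	ln, err := net.Listen("tcp", "127.0.0.1:0") // port 0: the OS picks a free port
	if err != nil {
		return err
	}
	srv := &http.Server{Handler: http.NotFoundHandler()}

	errCh := make(chan error, 1)
	go func() { errCh <- srv.Serve(ln) }()

	ready <- ln.Addr().String() // the port is bound at this point

	select {
	case err := <-errCh: // the server failed before any stop request
		return err
	case <-ctx.Done(): // stop requested (stands in for SIGTERM)
	}

	shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		return err
	}
	// After Shutdown, Serve returns http.ErrServerClosed, which is the clean case.
	if err := <-errCh; !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}

// runOnce plays the parent's role: wait for readiness, then request a stop.
func runOnce() error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	ready := make(chan string, 1)
	done := make(chan error, 1)
	go func() { done <- serve(ctx, ready) }()

	select {
	case <-ready: // only now is it safe to ask for shutdown
	case err := <-done:
		return err
	}
	cancel()
	return <-done
}

func main() {
	fmt.Println("shutdown error:", runOnce())
}
```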
---

### 2. TestGetAcquisitionConfig
**File:** `backend/internal/handlers/crowdsec_handler_test.go` (assumed)
**Error:** `Should not be: 404` - Getting unexpected 404 HTTP status
**Root Cause Hypothesis:**
- CrowdSec config endpoint returns 404 when SecurityConfig table missing
- Test expects config to exist but CI database doesn't have it migrated
- Local database might have lingering state from previous runs

**Local vs CI:** Local database persistence vs fresh CI database
**Priority:** 🟡 MEDIUM - CrowdSec feature must work
**Fix Complexity:** EASY (30 minutes)
**Remediation:**
- Ensure test migrates SecurityConfig table before testing
- Or adjust expected behavior when config doesn't exist

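The difference between a fresh CI database and a local one with leftover state can be reproduced without a database at all. The following sketch is illustrative only: `configStore`, `acquisitionHandler`, and the `/acquisition` path are hypothetical stand-ins, and a map takes the place of the SecurityConfig table. It shows why a test that does not seed its own state sees a 404, while the same handler returns 200 once the state is created in setup.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// configStore is a hypothetical stand-in for the SecurityConfig table.
type configStore map[string]string

// acquisitionHandler returns 404 when no config entry exists, 200 otherwise.
func acquisitionHandler(store configStore) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		cfg, ok := store["acquisition"]
		if !ok {
			http.Error(w, "config not found", http.StatusNotFound)
			return
		}
		fmt.Fprint(w, cfg)
	}
}

// statusFor issues a GET against the handler and reports the status code.
func statusFor(store configStore) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/acquisition", nil)
	acquisitionHandler(store)(rec, req)
	return rec.Code
}

func main() {
	// Fresh state, as on a new CI runner.
	fmt.Println("empty store: ", statusFor(configStore{}))
	// State seeded explicitly by the test setup.
	fmt.Println("seeded store:", statusFor(configStore{"acquisition": "source: file"}))
}
```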
---

### 3. TestEnsureBouncerRegistration_ConcurrentCalls
**File:** `backend/internal/services/crowdsec_lapi_service_test.go` (assumed)
**Error:** `Not equal: expected: 1` - Count assertion failing
**Root Cause Hypothesis:**
- Race condition in concurrent bouncer registration
- Test expects exactly 1 bouncer but gets 0 or >1 due to timing
- CI environment slower causing timeout or race window

**Local vs CI:** Different CPU cores or timing characteristics
**Priority:** 🟡 MEDIUM - Concurrency safety important
**Fix Complexity:** HARD (3-4 hours)
**Remediation:**
- Add explicit synchronization or retries in test
- Increase timeout for concurrent operations
- Use eventually-style assertions (poll until a deadline) instead of immediate checks

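If the count assertion is meant to prove idempotency, the check-then-create step has to be atomic and the assertion has to run only after every caller has finished. The sketch below illustrates both points under assumptions: `registry` is a hypothetical in-memory stand-in for the bouncer list, and the bouncer name is made up. A mutex guards the check-and-create, and a `sync.WaitGroup` replaces sleeps or timeouts before the count is read.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical in-memory stand-in for the bouncer list.
type registry struct {
	mu       sync.Mutex
	bouncers map[string]bool
	creates  int // how many times a bouncer was actually created
}

// ensure registers the bouncer once; the check and the create happen
// under the same lock, so concurrent callers cannot both create it.
func (r *registry) ensure(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.bouncers[name] {
		return
	}
	r.bouncers[name] = true
	r.creates++
}

// registerConcurrently calls ensure from n goroutines and returns the
// number of creations after all of them have finished.
func registerConcurrently(n int) int {
	r := &registry{bouncers: make(map[string]bool)}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.ensure("example-bouncer")
		}()
	}
	wg.Wait() // assert only after every caller has returned
	return r.creates
}

func main() {
	fmt.Println("creations:", registerConcurrently(50))
}
```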
---

### 4. TestPluginHandler_ReloadPlugins_WithErrors
**File:** `backend/internal/api/handlers/plugin_handler_test.go`
**Error:** `Not equal: expected: 200` - HTTP status code not 200
**Root Cause Hypothesis:**
- Plugin reload returns error status (likely 500 or 400) instead of 200
- Test expects reload to succeed even with errors (bad plugin files)
- Endpoint behavior might differ when plugin directory doesn't exist

**Local vs CI:** Local might have plugin directory setup, CI starts fresh
**Priority:** 🟢 LOW - Plugin system edge case
**Fix Complexity:** EASY (1 hour)
**Remediation:**
- Read test to understand expected behavior with errors
- Adjust expectation or ensure test setup creates proper plugin state

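The hypotheses above hinge on two different failure modes: a plugin directory that does not exist versus a directory that exists but contains bad files. The sketch below separates the two. It is purely illustrative: `reloadPlugins` and `setupPluginDir` are hypothetical, and "an empty file is a bad plugin" is a toy rule chosen only to make the example runnable. The point is that explicit setup with a temporary directory makes the result independent of whatever happens to exist on the machine.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// reloadPlugins is a hypothetical loader: a missing directory is a hard
// error, while a bad plugin file is only recorded and skipped.
func reloadPlugins(dir string) (loaded int, failed []string, err error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, nil, err
	}
	for _, e := range entries {
		info, statErr := e.Info()
		if statErr != nil || info.Size() == 0 { // toy rule: empty file is a bad plugin
			failed = append(failed, e.Name())
			continue
		}
		loaded++
	}
	return loaded, failed, nil
}

// setupPluginDir mimics explicit test setup: one good file, one bad file.
func setupPluginDir() (string, error) {
	dir, err := os.MkdirTemp("", "plugins-*")
	if err != nil {
		return "", err
	}
	if err := os.WriteFile(filepath.Join(dir, "good.so"), []byte("x"), 0o600); err != nil {
		return "", err
	}
	if err := os.WriteFile(filepath.Join(dir, "bad.so"), nil, 0o600); err != nil {
		return "", err
	}
	return dir, nil
}

func main() {
	dir, err := setupPluginDir()
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	loaded, failed, err := reloadPlugins(dir)
	fmt.Println("loaded:", loaded, "failed:", failed, "err:", err)

	_, _, err = reloadPlugins(filepath.Join(dir, "missing"))
	fmt.Println("missing directory is an error:", err != nil)
}
```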
---

### 5. TestFetchIndexFallbackHTTP
**File:** `backend/internal/services/crowdsec_preset_service_test.go` (assumed)
**Error:** `Received unexpected error:` - Some error occurred during HTTP fallback
**Root Cause Hypothesis:**
- HTTP fallback mechanism fails when primary fetch method unavailable
- Network request in test might be blocked in CI
- Missing mock or test fixture for HTTP response

**Local vs CI:** CI network restrictions or missing test server
**Priority:** 🟢 LOW - Fallback mechanism edge case
**Fix Complexity:** MEDIUM (1-2 hours)
**Remediation:**
- Ensure test uses mock HTTP server, not real network
- Check if test fixture files exist in CI
- Verify fallback logic handles all error cases

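A local mock server keeps the fallback path off the real network entirely, which removes CI network restrictions as a variable. The sketch below uses the standard library's `net/http/httptest`; `fetchIndex` and `fetchFromMock` are hypothetical names for illustration and make no claim about how the preset service is actually structured.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchIndex is a hypothetical fetcher: GET the URL, require 200, return the body.
func fetchIndex(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// fetchFromMock serves a canned payload from a local test server, so the
// request never leaves the machine.
func fetchFromMock(payload string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, payload)
	}))
	defer srv.Close()
	return fetchIndex(srv.Client(), srv.URL)
}

func main() {
	body, err := fetchFromMock(`{"items":[]}`)
	fmt.Println("body:", body, "err:", err)
}
```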
---

### 6. TestRunScheduledBackup_CleanupFails
**File:** `backend/internal/services/backup_service_test.go`
**Error:** `"0" is not greater than or equal to "1"` - Cleanup count assertion
**Root Cause Hypothesis:**
- Test simulates cleanup failure but checks for at least 1 deletion
- Cleanup function doesn't attempt deletion when it should
- Race condition or timing issue preventing cleanup execution

**Local vs CI:** Filesystem timing or goroutine scheduling
**Priority:** 🟡 MEDIUM - Backup reliability important
**Fix Complexity:** MEDIUM (1-2 hours)
**Remediation:**
- Read test to understand cleanup failure scenario
- Verify test assertion matches expected behavior
- Add debug logging to see what cleanup actually does

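The first hypothesis is easiest to reason about when "attempted" and "deleted" are counted separately: with a simulated failure, the number of successful deletions is legitimately 0 while the number of attempts is not. The sketch below is a toy model, not the backup service's code; `cleanupOld`, `failingRemove`, and the "keep the newest N" retention rule are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errRemove is the simulated failure used by failingRemove.
var errRemove = errors.New("remove failed")

// failingRemove always fails, mimicking a cleanup-failure test scenario.
func failingRemove(string) error { return errRemove }

// cleanupOld keeps the first `keep` names (assumed sorted newest first) and
// tries to remove the rest. It reports attempts and successes separately
// and returns the first removal error, if any.
func cleanupOld(names []string, keep int, remove func(string) error) (attempted, deleted int, err error) {
	if len(names) <= keep {
		return 0, 0, nil
	}
	for _, name := range names[keep:] {
		attempted++
		if rmErr := remove(name); rmErr != nil {
			if err == nil {
				err = rmErr
			}
			continue
		}
		deleted++
	}
	return attempted, deleted, err
}

func main() {
	backups := []string{"backup-3.zip", "backup-2.zip", "backup-1.zip"}
	attempted, deleted, err := cleanupOld(backups, 1, failingRemove)
	fmt.Println("attempted:", attempted, "deleted:", deleted, "err:", err)
}
```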
---

### 7. TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite
**File:** `backend/internal/services/security_service_test.go`
**Error:** `Not equal: expected: "sync-fallback"` - Audit log type mismatch
**Root Cause Hypothesis:**
- Test fills audit channel to trigger sync fallback
- Fallback not triggered or audit records wrong log type
- Timing issue - channel drains before fallback needed

**Local vs CI:** Goroutine scheduling or channel buffer behavior
**Priority:** 🟡 MEDIUM - Audit reliability important
**Fix Complexity:** MEDIUM (2 hours)
**Remediation:**
- Verify channel size and fill logic in test
- Check if fallback logic correctly sets log type
- Add explicit synchronization to ensure channel full before write

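The timing hypothesis disappears if nothing can drain the channel while the test fills it. The sketch below shows the usual non-blocking send pattern (`select` with a `default` branch) and a deterministic way to exercise it: a buffered channel with no consumer. `auditLogger` and its fields are hypothetical; only the `"sync-fallback"` label is taken from the failing assertion.

```go
package main

import "fmt"

// auditLogger is a hypothetical logger with an async queue and a
// synchronous fallback. It is not safe for concurrent use as written;
// the sketch only demonstrates the fallback decision.
type auditLogger struct {
	queue  chan string
	synced []string // entries written via the fallback path
}

// log tries a non-blocking send; when the queue is full it writes
// synchronously and reports which path was taken.
func (l *auditLogger) log(entry string) string {
	select {
	case l.queue <- entry:
		return "async"
	default:
		l.synced = append(l.synced, entry)
		return "sync-fallback"
	}
}

// newUnconsumedLogger returns a logger whose queue is never drained, so a
// full channel stays full for the duration of a test.
func newUnconsumedLogger(capacity int) *auditLogger {
	return &auditLogger{queue: make(chan string, capacity)}
}

func main() {
	l := newUnconsumedLogger(1)
	fmt.Println(l.log("first"))  // fits in the buffer
	fmt.Println(l.log("second")) // buffer full, falls back
}
```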
---

### 8. TestCredentialService_GetCredentialForDomain_ExactMatch
**File:** `backend/internal/services/credential_service_test.go`
**Error:** `Received unexpected error:` - Method returns error instead of success
**Root Cause Hypothesis:**
- Credential lookup fails due to missing data or encryption issue
- Database state corrupted or incomplete in test
- Service initialization error (though encryption key IS present)

**Local vs CI:** Unknown - need to read test implementation
**Priority:** 🔴 HIGH - Core credential management feature
**Fix Complexity:** UNKNOWN (needs investigation)
**Remediation:**
- Read test file starting at line 265
- Check test setup creates proper credential with zone filter
- Verify encryption service initialized correctly in test
- Add debug logging to see actual error message

---

### 9. TestCredentialService_GetCredentialForDomain_WildcardMatch
**File:** `backend/internal/services/credential_service_test.go`
**Error:** `Received unexpected error:` - Method returns error instead of success
**Root Cause Hypothesis:**
- Similar to ExactMatch test - credential lookup fails
- Wildcard matching logic has bug or missing data
- Zone filter parsing error

**Local vs CI:** Unknown - need to read test implementation
**Priority:** 🔴 HIGH - Core credential management feature
**Fix Complexity:** UNKNOWN (needs investigation)
**Remediation:**
- Read test file starting at line 297
- Check wildcard zone filter setup (e.g., "*.example.com")
- Verify wildcard matching algorithm
- Add debug logging to see actual error message

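When verifying the exact and wildcard cases from tests 8 and 9, it can help to have a small reference matcher to compare against. The sketch below is one plausible rule set, not the credential service's algorithm: `matchesZone` is a hypothetical function, and it assumes that `*.example.com` matches subdomains at any depth but not the apex `example.com`, with case-insensitive comparison. The real semantics may differ and should be confirmed against the service code.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesZone reports whether domain satisfies a zone filter.
// Assumed rules: exact filters must equal the domain; "*.zone" filters
// match any subdomain of zone (any depth) but not the zone itself.
func matchesZone(filter, domain string) bool {
	filter = strings.ToLower(strings.TrimSpace(filter))
	domain = strings.ToLower(strings.TrimSuffix(strings.TrimSpace(domain), "."))

	if strings.HasPrefix(filter, "*.") {
		suffix := filter[1:] // "*.example.com" -> ".example.com"
		return len(domain) > len(suffix) && strings.HasSuffix(domain, suffix)
	}
	return filter == domain
}

func main() {
	fmt.Println(matchesZone("example.com", "example.com"))       // exact match
	fmt.Println(matchesZone("*.example.com", "app.example.com")) // wildcard match
	fmt.Println(matchesZone("*.example.com", "example.com"))     // apex, not matched under these rules
}
```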
---

### 10. TestDeleteCertificate_CreatesBackup ⚠️
**File:** `backend/internal/services/certificate_service_test.go`
**Error:** `no such table: proxy_hosts` (database query error)
**Note:** Similar tests with same error PASS (e.g., TestDeleteCertificate_UsageCheckError)
**Root Cause Hypothesis:**
- Test database missing proxy_hosts table migration
- Test expects error and handles it, but THIS specific test doesn't
- Test assertion checks backup creation AFTER checking proxy_hosts (fails early)

**Local vs CI:** Local database might have full schema
**Priority:** 🟢 LOW - May be expected behavior
**Fix Complexity:** EASY (30 minutes)
**Remediation:**
- Read test to see what it actually expects
- Either add proxy_hosts to test database migration
- Or adjust test to expect "table not found" error

---

## Remediation Options

### Option A: Fix All Now (Recommended for Blocking Quality Gate)
**Time Estimate:** 8-14 hours (1-2 days)
**Pros:**
- Comprehensive fix, no technical debt
- High confidence in test suite
- Unblocks CI completely
**Cons:**
- Delays coverage patch work
- Some fixes may be complex (concurrency tests)

**Implementation Plan:**
1. **Phase 1: High Priority** (4-6 hours)
   - TestMain_DefaultStartupGracefulShutdown_Subprocess
   - TestCredentialService ExactMatch & WildcardMatch
2. **Phase 2: Medium Priority** (2-4 hours)
   - TestGetAcquisitionConfig
   - TestEnsureBouncerRegistration_ConcurrentCalls
   - TestRunScheduledBackup_CleanupFails
   - TestSecurityService_LogAudit
3. **Phase 3: Low Priority** (2-4 hours)
   - TestPluginHandler_ReloadPlugins_WithErrors
   - TestFetchIndexFallbackHTTP
   - TestDeleteCertificate_CreatesBackup

---

### Option B: Skip Non-Critical Tests (Fastest)
**Time Estimate:** 1-2 hours
**Pros:**
- Fastest path to green CI
- Focus on coverage patch work immediately
**Cons:**
- Technical debt accumulates
- May mask real bugs
- Need to track TODOs

**Implementation:**
- Add `t.Skip("CI environment test - tracked in issue #XXX")` to low/medium priority tests
- Keep HIGH priority tests (Main server startup, credential service)
- Create GitHub issues for each skipped test
- Fix during next sprint

---

### Option C: Parallel Work (Balanced)
**Time Estimate:** 4-6 hours first pass, then monitor
**Pros:**
- Unblock critical paths quickly
- Comprehensive fix in parallel
**Cons:**
- More context switching
- Risk of merge conflicts

**Implementation:**
1. Skip low-priority tests immediately (TestPlugin*, TestFetchIndex, TestDeleteCert)
2. Fix HIGH priority tests in parallel with coverage work
3. Tackle MEDIUM priority tests after coverage patch merged

---

## Decision Matrix

| Criteria | Option A | Option B | Option C |
|----------|----------|----------|----------|
| Time to Green CI | 1-2 days | 1-2 hours | 4-6 hours |
| Technical Debt | None | High | Medium |
| Risk of Masking Bugs | Low | High | Medium |
| Coverage Patch Delay | High | None | Low |
| Long-term Quality | Best | Worst | Good |

---

## Recommended Approach

**OPTION A - Fix All Now**

**Reasoning:**
1. Test failures indicate real issues in application logic or test environment
2. Skipping tests hides potential bugs that could affect production
3. The 9 failures represent core features (server startup, credentials, security auditing, backups)
4. Encryption key issue was a red herring - actual fixes should be straightforward
5. Better to have stable CI before moving to coverage patch work

**Next Steps:**
1. User approves Option A
2. Delegate to `Backend_Dev` agent: "Fix test failures following Phase 1 → Phase 2 → Phase 3 order"
3. For each test:
   - Read test file
   - Understand expected behavior
   - Reproduce locally if possible (with fresh database)
   - Fix root cause
   - Verify fix locally
   - Commit with descriptive message
4. Push all fixes as a single logical commit
5. Monitor CI workflows for green status
6. Return to coverage patch work

---

## Confidence Assessment

**Root Cause Identified:** ✅ YES - Not encryption key, but individual test issues
**Fix Complexity:** 🟡 MEDIUM - Mix of easy and hard fixes
**Upstream Blockers:** ❌ NONE - All fixes are local test changes
**Risk of Regression:** 🟢 LOW - Tests are isolated, fixes won't affect production code

---

## Notes for Implementation

- All fixes should be in test files only (`*_test.go`)
- Production code should NOT need changes (except if real bugs found)
- Add comments explaining CI-specific behavior if needed
- Use `t.Logf()` for debug output during investigation
- Commit frequently with descriptive messages per fix group
- Run `go test -v -run TestName ./path/` to test individually

---

## Final Recommendation

**DO NOT** skip tests. These failures represent real issues that need fixing:
- Server graceful shutdown
- Credential domain matching (core feature)
- Security audit logging (compliance requirement)
- CrowdSec bouncer registration (security feature)
- Backup cleanup (data integrity)

Proceed with **Option A: Fix All Now**.