chore: git cache cleanup

2026-03-04 18:34:49 +00:00
parent c32cce2a88
commit 27c252600a
2001 changed files with 683185 additions and 0 deletions
--- a/docs/plans/archive/CI_TEST_FAILURES_DETAILED_REMEDIATION.md
+++ b/docs/plans/archive/CI_TEST_FAILURES_DETAILED_REMEDIATION.md
@@ -0,0 +1,336 @@
+# CI Test Failures Detailed Remediation Plan
+
+**Date:** 2026-02-16
+**Workflow Run:** 22079827893 (codecov-upload.yml)
+**Branch:** feature/beta-release
+**Status:** 🔴 BLOCKING - 9+ tests failing
+
+## Executive Summary
+
+**CRITICAL DISCOVERY:** The test failures are **NOT** related to the `CHARON_ENCRYPTION_KEY` environment variable. The encryption key is properly set and working in CI. The failures are due to various test-specific issues, including HTTP status codes, timing, concurrency, and database state.
+
+**Evidence:**
+- CI logs show NO warnings about "CHARON_ENCRYPTION_KEY is required"
+- CI logs show NO errors about "invalid key length"
+- Services initialize successfully with encryption
+- Coverage at 85.1% (meets requirement)
+
+**Actual Root Cause:** Individual test logic, timing, or environmental differences between local and CI execution.
+
+---
+
+## Failed Tests with Actual Errors
+
+### 1. TestMain_DefaultStartupGracefulShutdown_Subprocess
+**File:** `backend/cmd/api/main_test.go`
+**Error:** Process terminated with `signal: terminated` after 0.57s
+**Observation:** Subprocess starts successfully (logs show server initialization) but then receives a termination signal
+**Root Cause Hypothesis:**
+- Subprocess doesn't terminate gracefully within expected time
+- Missing or delayed signal handling in test
+- Race condition between parent sending signal and subprocess responding
+
+**Local vs CI:** May pass locally due to faster execution or different signal handling
+**Priority:** 🔴 HIGH - Main server startup flow must work
+**Fix Complexity:** MEDIUM (2-3 hours)
+**Remediation:**
+- Read test to understand subprocess lifecycle
+- Check timeout values and signal handling
+- Verify graceful shutdown logic waits for server to bind port before terminating
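Reviewer sketch (not part of the commit): one way to remove the race between the parent sending the signal and the subprocess being ready is to poll the listen address before signalling. The helper names `waitForPort` and `demoWait` are hypothetical, and a local listener stands in for the real subprocess; the actual test in `main_test.go` may be structured differently.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// waitForPort polls addr until a TCP connection succeeds or timeout elapses.
// Hypothetical helper, not taken from the repository.
func waitForPort(addr string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		conn, err := net.DialTimeout("tcp", addr, 100*time.Millisecond)
		if err == nil {
			conn.Close()
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("%s not accepting connections after %s: %w", addr, timeout, err)
		}
		time.Sleep(20 * time.Millisecond)
	}
}

// demoWait stands in for the test body: a local listener plays the role of
// the subprocess, and the "parent" waits for it before signalling.
func demoWait() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()
	return waitForPort(ln.Addr().String(), 2*time.Second)
}

func main() {
	if err := demoWait(); err != nil {
		fmt.Println("server never became ready:", err)
		return
	}
	fmt.Println("port is bound; the parent can now send the termination signal")
}
```

Polling with a deadline avoids fixed sleeps, which tend to be too short on slow CI runners and needlessly long on fast local machines.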
+
+---
+
+### 2. TestGetAcquisitionConfig
+**File:** `backend/internal/handlers/crowdsec_handler_test.go` (assumed)
+**Error:** `Should not be: 404` - Getting unexpected 404 HTTP status
+**Root Cause Hypothesis:**
+- CrowdSec config endpoint returns 404 when the SecurityConfig table is missing
+- Test expects the config to exist, but the CI database has not migrated it
+- Local database might have lingering state from previous runs
+
+**Local vs CI:** Local database persistence vs fresh CI database
+**Priority:** 🟡 MEDIUM - CrowdSec feature must work
+**Fix Complexity:** EASY (30 minutes)
+**Remediation:**
+- Ensure test migrates SecurityConfig table before testing
+- Or adjust expected behavior when config doesn't exist
+
+---
+
+### 3. TestEnsureBouncerRegistration_ConcurrentCalls
+**File:** `backend/internal/services/crowdsec_lapi_service_test.go` (assumed)
+**Error:** `Not equal: expected: 1` - Count assertion failing
+**Root Cause Hypothesis:**
+- Race condition in concurrent bouncer registration
+- Test expects exactly 1 bouncer but gets 0 or >1 due to timing
+- Slower CI environment triggers a timeout or widens the race window
+
+**Local vs CI:** Different CPU cores or timing characteristics
+**Priority:** 🟡 MEDIUM - Concurrency safety important
+**Fix Complexity:** HARD (3-4 hours)
+**Remediation:**
+- Add explicit synchronization or retries in test
+- Increase timeout for concurrent operations
+- Use eventually-style (polling) assertions instead of immediate checks
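Reviewer sketch (not part of the commit): the remediation ideas for the concurrent registration test can be illustrated with the standard library alone. The `registrar` type, `eventually` helper, and `runConcurrent` function are hypothetical stand-ins; the real service talks to CrowdSec and will look different. The sketch assumes the intended invariant is "exactly one bouncer regardless of how many callers race".

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// registrar is a hypothetical stand-in for the bouncer registration service.
type registrar struct {
	mu            sync.Mutex
	registered    bool
	registrations int // number of times a bouncer was actually created
}

// ensure creates the bouncer only if none exists. The check and the insert
// share one lock, so concurrent callers cannot both pass the check.
func (r *registrar) ensure() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.registered {
		r.registered = true
		r.registrations++
	}
}

// count returns how many bouncers were created.
func (r *registrar) count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.registrations
}

// eventually polls cond every tick until it returns true or timeout elapses.
func eventually(cond func() bool, timeout, tick time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}

// runConcurrent fires n concurrent ensure calls and reports the bouncer count
// once all of them have finished, or -1 if they did not finish in time.
func runConcurrent(n int) int {
	r := &registrar{}
	var finished int64
	for i := 0; i < n; i++ {
		go func() {
			r.ensure()
			atomic.AddInt64(&finished, 1)
		}()
	}
	allDone := func() bool { return atomic.LoadInt64(&finished) == int64(n) }
	if !eventually(allDone, 2*time.Second, 5*time.Millisecond) {
		return -1
	}
	return r.count()
}

func main() {
	fmt.Println("bouncers registered:", runConcurrent(50))
}
```

The count is asserted only after every caller has finished, so the check no longer depends on how quickly the CI runner schedules goroutines.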
+
+---
+
+### 4. TestPluginHandler_ReloadPlugins_WithErrors
+**File:** `backend/internal/api/handlers/plugin_handler_test.go`
+**Error:** `Not equal: expected: 200` - HTTP status code not 200
+**Root Cause Hypothesis:**
+- Plugin reload returns error status (likely 500 or 400) instead of 200
+- Test expects reload to succeed even with errors (bad plugin files)
+- Endpoint behavior might differ when plugin directory doesn't exist
+
+**Local vs CI:** Local might have plugin directory setup, CI starts fresh
+**Priority:** 🟢 LOW - Plugin system edge case
+**Fix Complexity:** EASY (1 hour)
+**Remediation:**
+- Read test to understand expected behavior with errors
+- Adjust expectation or ensure test setup creates proper plugin state
+
+---
+
+### 5. TestFetchIndexFallbackHTTP
+**File:** `backend/internal/services/crowdsec_preset_service_test.go` (assumed)
+**Error:** `Received unexpected error:` - Some error occurred during HTTP fallback
+**Root Cause Hypothesis:**
+- HTTP fallback mechanism fails when primary fetch method unavailable
+- Network request in test might be blocked in CI
+- Missing mock or test fixture for HTTP response
+
+**Local vs CI:** CI network restrictions or missing test server
+**Priority:** 🟢 LOW - Fallback mechanism edge case
+**Fix Complexity:** MEDIUM (1-2 hours)
+**Remediation:**
+- Ensure test uses mock HTTP server, not real network
+- Check if test fixture files exist in CI
+- Verify fallback logic handles all error cases
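Reviewer sketch (not part of the commit): a test that must not touch the real network can point the fetch at an in-process server from `net/http/httptest`. The functions `fetchIndex` and `fetchFromMock`, and the JSON payload, are hypothetical; the real preset service's fallback function and response format are not shown in this plan.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchIndex is a hypothetical stand-in for the HTTP fallback fetch.
func fetchIndex(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// fetchFromMock serves a canned index from a local test server, so the
// fetch never leaves the machine.
func fetchFromMock() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"presets":[]}`)
	}))
	defer srv.Close()
	return fetchIndex(srv.Client(), srv.URL)
}

func main() {
	body, err := fetchFromMock()
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println(body)
}
```

Passing the base URL and client into the function under test is what makes the substitution possible; a hard-coded URL inside the service would need to become configurable first.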
+
+---
+
+### 6. TestRunScheduledBackup_CleanupFails
+**File:** `backend/internal/services/backup_service_test.go`
+**Error:** `"0" is not greater than or equal to "1"` - Cleanup count assertion
+**Root Cause Hypothesis:**
+- Test simulates cleanup failure but checks for at least 1 deletion
+- Cleanup function doesn't attempt deletion when it should
+- Race condition or timing issue preventing cleanup execution
+
+**Local vs CI:** Filesystem timing or goroutine scheduling
+**Priority:** 🟡 MEDIUM - Backup reliability important
+**Fix Complexity:** MEDIUM (1-2 hours)
+**Remediation:**
+- Read test to understand cleanup failure scenario
+- Verify test assertion matches expected behavior
+- Add debug logging to see what cleanup actually does
+
+---
+
+### 7. TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite
+**File:** `backend/internal/services/security_service_test.go`
+**Error:** `Not equal: expected: "sync-fallback"` - Audit log type mismatch
+**Root Cause Hypothesis:**
+- Test fills audit channel to trigger sync fallback
+- Fallback not triggered or audit records wrong log type
+- Timing issue - channel drains before fallback needed
+
+**Local vs CI:** Goroutine scheduling or channel buffer behavior
+**Priority:** 🟡 MEDIUM - Audit reliability important
+**Fix Complexity:** MEDIUM (2 hours)
+**Remediation:**
+- Verify channel size and fill logic in test
+- Check if fallback logic correctly sets log type
+- Add explicit synchronization to ensure channel full before write
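Reviewer sketch (not part of the commit): the usual Go pattern for "fall back to a synchronous write when the channel is full" is a non-blocking send via `select` with a `default` branch. The `auditLogger` type and `fillThenLog` function are hypothetical; only the `"sync-fallback"` value comes from the failing assertion. The sketch assumes the test can construct the service without starting its consumer goroutine, which makes the full-channel state deterministic.

```go
package main

import "fmt"

// auditLogger is a hypothetical stand-in for the security service's
// asynchronous audit pipeline.
type auditLogger struct {
	queue  chan string
	synced []string // entries written on the synchronous fallback path
}

// log tries a non-blocking enqueue and reports which path was taken.
func (a *auditLogger) log(entry string) string {
	select {
	case a.queue <- entry:
		return "async"
	default:
		// Buffer is full: write synchronously instead of blocking or dropping.
		a.synced = append(a.synced, entry)
		return "sync-fallback"
	}
}

// fillThenLog fills the buffer completely and then logs one more entry.
// No consumer goroutine is running, so the buffer cannot drain in between.
func fillThenLog(capacity int) string {
	a := &auditLogger{queue: make(chan string, capacity)}
	for i := 0; i < capacity; i++ {
		a.queue <- fmt.Sprintf("filler-%d", i)
	}
	return a.log("overflow")
}

func main() {
	fmt.Println(fillThenLog(3))
}
```

If the real service always starts its consumer, the test would need another way to hold the consumer still (for example, a blocking hook) before asserting on the fallback.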
+
+---
+
+### 8. TestCredentialService_GetCredentialForDomain_ExactMatch
+**File:** `backend/internal/services/credential_service_test.go`
+**Error:** `Received unexpected error:` - Method returns error instead of success
+**Root Cause Hypothesis:**
+- Credential lookup fails due to missing data or encryption issue
+- Database state corrupted or incomplete in test
+- Service initialization error (though encryption key IS present)
+
+**Local vs CI:** Unknown - need to read test implementation
+**Priority:** 🔴 HIGH - Core credential management feature
+**Fix Complexity:** UNKNOWN (needs investigation)
+**Remediation:**
+- Read test file starting at line 265
+- Check test setup creates proper credential with zone filter
+- Verify encryption service initialized correctly in test
+- Add debug logging to see actual error message
+
+---
+
+### 9. TestCredentialService_GetCredentialForDomain_WildcardMatch
+**File:** `backend/internal/services/credential_service_test.go`
+**Error:** `Received unexpected error:` - Method returns error instead of success
+**Root Cause Hypothesis:**
+- Similar to ExactMatch test - credential lookup fails
+- Wildcard matching logic has bug or missing data
+- Zone filter parsing error
+
+**Local vs CI:** Unknown - need to read test implementation
+**Priority:** 🔴 HIGH - Core credential management feature
+**Fix Complexity:** UNKNOWN (needs investigation)
+**Remediation:**
+- Read test file starting at line 297
+- Check wildcard zone filter setup (e.g., "*.example.com")
+- Verify wildcard matching algorithm
+- Add debug logging to see actual error message
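Reviewer sketch (not part of the commit): when verifying the wildcard matching algorithm, it helps to write down the intended semantics as a small pure function and compare the service against it. `matchesZone` is hypothetical, and the rule that `*.example.com` covers any subdomain depth but not the apex `example.com` is an assumption; the credential service may define wildcards differently.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesZone reports whether domain is covered by a zone filter.
// Assumed semantics: "*.example.com" matches any subdomain of example.com
// but not example.com itself; any other filter must match exactly.
func matchesZone(filter, domain string) bool {
	filter = strings.ToLower(strings.TrimSpace(filter))
	domain = strings.ToLower(strings.TrimSuffix(strings.TrimSpace(domain), "."))
	if strings.HasPrefix(filter, "*.") {
		suffix := filter[1:] // ".example.com"
		return len(domain) > len(suffix) && strings.HasSuffix(domain, suffix)
	}
	return filter == domain
}

func main() {
	cases := []struct{ filter, domain string }{
		{"*.example.com", "app.example.com"},
		{"*.example.com", "example.com"},
		{"*.example.com", "badexample.com"},
		{"example.com", "example.com"},
	}
	for _, c := range cases {
		fmt.Printf("%-15s %-17s %v\n", c.filter, c.domain, matchesZone(c.filter, c.domain))
	}
}
```

Keeping the leading dot in the suffix is what stops `badexample.com` from matching `*.example.com`.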
+
+---
+
+### 10. TestDeleteCertificate_CreatesBackup ⚠️
+**File:** `backend/internal/services/certificate_service_test.go`
+**Error:** `no such table: proxy_hosts` (database query error)
+**Note:** Similar tests with same error PASS (e.g., TestDeleteCertificate_UsageCheckError)
+**Root Cause Hypothesis:**
+- Test database missing proxy_hosts table migration
+- Similar tests expect this error and handle it, but this specific test does not
+- Test assertion checks backup creation AFTER checking proxy_hosts (fails early)
+
+**Local vs CI:** Local database might have full schema
+**Priority:** 🟢 LOW - May be expected behavior
+**Fix Complexity:** EASY (30 minutes)
+**Remediation:**
+- Read test to see what it actually expects
+- Either add proxy_hosts to test database migration
+- Or adjust test to expect "table not found" error
+
+---
+
+## Remediation Options
+
+### Option A: Fix All Now (Recommended for Blocking Quality Gate)
+**Time Estimate:** 8-14 hours (1-2 days)
+**Pros:**
+- Comprehensive fix, no technical debt
+- High confidence in test suite
+- Unblocks CI completely
+
+**Cons:**
+- Delays coverage patch work
+- Some fixes may be complex (concurrency tests)
+
+**Implementation Plan:**
+1. **Phase 1: High Priority** (4-6 hours)
+   - TestMain_DefaultStartupGracefulShutdown_Subprocess
+   - TestCredentialService_GetCredentialForDomain_ExactMatch and TestCredentialService_GetCredentialForDomain_WildcardMatch
+2. **Phase 2: Medium Priority** (2-4 hours)
+   - TestGetAcquisitionConfig
+   - TestEnsureBouncerRegistration_ConcurrentCalls
+   - TestRunScheduledBackup_CleanupFails
+   - TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite
+3. **Phase 3: Low Priority** (2-4 hours)
+   - TestPluginHandler_ReloadPlugins_WithErrors
+   - TestFetchIndexFallbackHTTP
+   - TestDeleteCertificate_CreatesBackup
+
+---
+
+### Option B: Skip Non-Critical Tests (Fastest)
+**Time Estimate:** 1-2 hours
+**Pros:**
+- Fastest path to green CI
+- Focus on coverage patch work immediately
+
+**Cons:**
+- Technical debt accumulates
+- May mask real bugs
+- Need to track TODOs
+
+**Implementation:**
+- Add `t.Skip("CI environment test - tracked in issue #XXX")` to low/medium priority tests
+- Keep HIGH priority tests (Main server startup, credential service)
+- Create GitHub issues for each skipped test
+- Fix during next sprint
+
+---
+
+### Option C: Parallel Work (Balanced)
+**Time Estimate:** 4-6 hours first pass, then monitor
+**Pros:**
+- Unblock critical paths quickly
+- Comprehensive fix in parallel
+
+**Cons:**
+- More context switching
+- Risk of merge conflicts
+
+**Implementation:**
+1. Skip low-priority tests immediately (TestPlugin*, TestFetchIndex, TestDeleteCert)
+2. Fix HIGH priority tests in parallel with coverage work
+3. Tackle MEDIUM priority tests after coverage patch merged
+
+---
+
+## Decision Matrix
+
+| Criteria | Option A | Option B | Option C |
+|----------|----------|----------|----------|
+| Time to Green CI | 1-2 days | 1-2 hours | 4-6 hours |
+| Technical Debt | None | High | Medium |
+| Risk of Masking Bugs | Low | High | Medium |
+| Coverage Patch Delay | High | None | Low |
+| Long-term Quality | Best | Worst | Good |
+
+---
+
+## Recommended Approach
+
+**OPTION A - Fix All Now**
+
+**Reasoning:**
+1. Test failures indicate real issues in application logic or test environment
+2. Skipping tests hides potential bugs that could affect production
+3. The 9 failures represent core features (server startup, credentials, security auditing, backups)
+4. Encryption key issue was a red herring - actual fixes should be straightforward
+5. Better to have stable CI before moving to coverage patch work
+
+**Next Steps:**
+1. User approves Option A
+2. Delegate to `Backend_Dev` agent: "Fix test failures following Phase 1 → Phase 2 → Phase 3 order"
+3. For each test:
+   - Read test file
+   - Understand expected behavior
+   - Reproduce locally if possible (with fresh database)
+   - Fix root cause
+   - Verify fix locally
+   - Commit with descriptive message
+4. Push the fix commits together as one logical change set
+5. Monitor CI workflows for green status
+6. Return to coverage patch work
+
+---
+
+## Confidence Assessment
+
+**Root Cause Identified:** ✅ YES - Not encryption key, but individual test issues
+**Fix Complexity:** 🟡 MEDIUM - Mix of easy and hard fixes
+**Upstream Blockers:** ❌ NONE - All fixes are local test changes
+**Risk of Regression:** 🟢 LOW - Tests are isolated, so fixes should not affect production code
+
+---
+
+## Notes for Implementation
+
+- All fixes should be in test files only (`*_test.go`)
+- Production code should NOT need changes (unless real bugs are found)
+- Add comments explaining CI-specific behavior if needed
+- Use `t.Logf()` for debug output during investigation
+- Commit frequently with descriptive messages per fix group
+- Run `go test -v -run TestName ./path/` to test individually
+
+---
+
+## Final Recommendation
+
+**DO NOT** skip tests. These failures represent real issues that need fixing:
+- Server graceful shutdown
+- Credential domain matching (core feature)
+- Security audit logging (compliance requirement)
+- CrowdSec bouncer registration (security feature)
+- Backup cleanup (data integrity)
+
+Proceed with **Option A: Fix All Now**.