Files
Charon/docs/plans/ci_codecov_backend_failure_remediation.md
GitHub Actions 4d00af75b6 fix(tests): enhance database setup for unit tests and address CI failures
- Implemented a function to create a valid SQLite database for testing in db_health_handler_test.go.
- Replaced dummy database file creation with a proper SQLite setup to ensure tests run against a valid database.
- Set CHARON_ENCRYPTION_KEY environment variable in dns_provider_service_test.go to prevent RotationService initialization warnings.
- Added detailed remediation plan for CI Codecov backend test failures, addressing encryption key requirements and database record not found errors.
2026-02-16 21:20:34 +00:00

1185 lines
38 KiB
Markdown

# CI Codecov Backend Test Failures - Remediation Plan
**Date:** 2026-02-16
**Status:** Investigation Complete - Ready for Implementation
**Priority:** CRITICAL CI BLOCKER
**Workflow:** `.github/workflows/codecov-upload.yml``backend-codecov` job
---
## Executive Summary
**CRITICAL: Multiple CI workflows are failing** with the same root cause. Investigation reveals these failures affect 3 workflows, not just codecov-upload.
### Affected Workflows
| Workflow File | Purpose | Job Name(s) | Test Command | Status | Priority |
|---------------|---------|-------------|--------------|--------|----------|
| `codecov-upload.yml` | Coverage upload to Codecov | `backend-codecov` | `go-test-coverage.sh` | ❌ Failing | **CRITICAL** |
| `quality-checks.yml` | PR quality gates | `backend-quality` | `go-test-coverage.sh` + `go test -run TestPerf` | ❌ Failing | **CRITICAL** |
| `benchmark.yml` | Performance regression checks | `benchmark` | `go test -bench` + `go test -run TestPerf` | ⚠️ At Risk | **HIGH** |
**Other Workflows Analyzed (NOT affected):**
-`e2e-tests-split.yml` - Already has `CHARON_ENCRYPTION_KEY` configured (6+ locations)
-`cerberus-integration.yml` - Runs integration scripts, not Go unit tests
-`crowdsec-integration.yml` - Runs integration scripts, not Go unit tests
- ✅ All other workflows - Do not run backend Go tests
### Root Cause Issues
1. **RotationService Initialization Warnings** (Non-blocking but pollutes logs)
- Multiple services print: "Warning: RotationService initialization failed, using basic encryption: CHARON_ENCRYPTION_KEY is required"
- Root cause: Missing `CHARON_ENCRYPTION_KEY` environment variable in ALL 3 affected workflows
- Impact: Services fall back to basic encryption (no test failures, but warnings appear)
2. **GORM "record not found" Errors** (Blocking failures)
- Source: `backend/internal/services/proxyhost_service.go:194`
- Root cause: Tests calling `GetByID()` without proper test data setup
- Impact: Tests expecting proxy host records fail with `gorm.ErrRecordNotFound`
---
## Investigation Findings
### 1. Encryption Key Requirements
#### File Analysis: `.github/workflows/codecov-upload.yml`
**Path:** `/projects/Charon/.github/workflows/codecov-upload.yml`
**Lines:** 43-53 (backend-codecov job)
**Current Environment Variables:**
```yaml
env:
CGO_ENABLED: 1
```
**Missing Variables:**
- `CHARON_ENCRYPTION_KEY` (required for RotationService)
#### File Analysis: `backend/internal/crypto/rotation_service.go`
**Path:** `/projects/Charon/backend/internal/crypto/rotation_service.go`
**Lines:** 63-75
**Error Trigger:**
```go
func NewRotationService(db *gorm.DB) (*RotationService, error) {
// Load current key (required)
currentKeyB64 := os.Getenv("CHARON_ENCRYPTION_KEY")
if currentKeyB64 == "" {
return nil, fmt.Errorf("CHARON_ENCRYPTION_KEY is required")
}
// ...
}
```
#### File Analysis: Service Dependencies
**Affected Services:**
- `backend/internal/services/dns_provider_service.go:145` - Calls `crypto.NewRotationService(db)`
- `backend/internal/services/credential_service.go:72` - Calls `crypto.NewRotationService(db)`
**Fallback Behavior:**
```go
rotationService, err := crypto.NewRotationService(db)
if err != nil {
// Fallback to non-rotation mode
fmt.Printf("Warning: RotationService initialization failed, using basic encryption: %v\n", err)
}
```
**Test Setup Comparison:**
| Test File | Sets CHARON_ENCRYPTION_KEY? | Uses RotationService? |
|-----------|----------------------------|-----------------------|
| `rotation_service_test.go` | ✅ Yes (via `setupTestKeys()`) | ✅ Yes |
| `dns_provider_service_test.go` | ❌ No (hardcoded test key) | ⚠️ Tries but falls back |
| `credential_service_test.go` | ❌ No (hardcoded test key) | ⚠️ Tries but falls back |
#### Example: How Tests Set Encryption Keys
**File:** `backend/internal/crypto/rotation_service_test.go:28-41`
```go
func setupTestKeys(t *testing.T) (currentKey, nextKey, legacyKey string) {
currentKey, err := GenerateNewKey()
require.NoError(t, err)
_ = os.Setenv("CHARON_ENCRYPTION_KEY", currentKey)
t.Cleanup(func() { _ = os.Unsetenv("CHARON_ENCRYPTION_KEY") })
return currentKey, nextKey, legacyKey
}
```
**File:** `backend/internal/services/dns_provider_service_test.go:62`
```go
// Does NOT set CHARON_ENCRYPTION_KEY
encryptor, err := crypto.NewEncryptionService("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=")
```
### 2. ProxyHost "Record Not Found" Errors
#### File Analysis: `backend/internal/services/proxyhost_service.go`
**Path:** `/projects/Charon/backend/internal/services/proxyhost_service.go`
**Lines:** 192-197
**Error Source:**
```go
func (s *ProxyHostService) GetByID(id uint) (*models.ProxyHost, error) {
var host models.ProxyHost
if err := s.db.Where("id = ?", id).First(&host).Error; err != nil {
return nil, err // Returns gorm.ErrRecordNotFound if no record
}
return &host, nil
}
```
**GORM Error Type:** `gorm.ErrRecordNotFound` (not explicitly handled in ProxyHostService)
#### Test Pattern Analysis
**File:** `backend/internal/services/proxyhost_service_test.go:73-102`
**Working Test Pattern:**
```go
func TestProxyHostService_CRUD(t *testing.T) {
db := setupProxyHostTestDB(t)
service := NewProxyHostService(db)
// Create test data BEFORE calling GetByID
host := &models.ProxyHost{
UUID: "uuid-1",
DomainNames: "test.example.com",
ForwardHost: "127.0.0.1",
ForwardPort: 8080,
}
err := service.Create(host) // Creates record in DB
assert.NoError(t, err)
assert.NotZero(t, host.ID)
// Now GetByID works because record exists
fetched, err := service.GetByID(host.ID)
assert.NoError(t, err)
assert.Equal(t, host.DomainNames, fetched.DomainNames)
}
```
**File:** `backend/internal/api/handlers/proxy_host_handler_update_test.go:50-60`
**Helper Function Pattern:**
```go
func createTestProxyHost(t *testing.T, db *gorm.DB, name string) models.ProxyHost {
host := models.ProxyHost{
UUID: uuid.NewString(),
Name: name,
DomainNames: name + ".test.com",
ForwardScheme: "http",
ForwardHost: "localhost",
ForwardPort: 8080,
Enabled: true,
}
require.NoError(t, db.Create(&host).Error)
return host
}
```
#### Likely Failure Scenario
**Hypothesis:** Some tests are calling `GetByID()` with a hardcoded ID (e.g., `GetByID(1)`) expecting a record to exist, but:
- SQLite in-memory DB is empty at test start
- Test doesn't create the record before calling `GetByID()`
- Test previously relied on global seeding that no longer runs
**To Identify Failing Tests:**
```bash
# Search for tests calling GetByID without creating the record first
grep -r "GetByID(" backend/**/*_test.go
```
---
## Root Cause Analysis
### Why Were Tests Passing Before?
**Encryption Key Warnings:**
- Tests have ALWAYS printed these warnings (not a recent regression)
- Warnings are to stderr, don't fail tests
- This is "noise" that should be cleaned up
**ProxyHost Errors:**
- **Likely Recent Change:**
- A test was recently modified to call `GetByID()` without proper setup
- A global test fixture/seed was removed
- Test database setup order changed
- **Verification Needed:** Check recent commits to `*_test.go` files
### CI vs. Local Test Differences
**CI Environment (`codecov-upload.yml`):**
- No environment variables set beyond `CGO_ENABLED=1`
- Fresh test database for each test run
- No `.env` file loaded
**Local Environment:**
- May have `.env` file with `CHARON_ENCRYPTION_KEY` set
- Test setup may differ from CI
- Local runs might have different test execution order
**Key Files Checked:**
- `.env.example` - Shows `CHARON_ENCRYPTION_KEY=` (empty, requires generation)
- `scripts/go-test-coverage.sh` - Does NOT set `CHARON_ENCRYPTION_KEY`
- `scripts/setup-e2e-env.sh` - Generates key for E2E tests (NOT unit tests)
---
## Remediation Plan
### Phase 1: Environment Variable Configuration (WARNING ELIMINATION)
**Objective:** Eliminate RotationService initialization warnings in CI logs across ALL affected workflows
#### Implementation Strategy
**Single Secret for All Workflows:**
- Use one GitHub Secret: `CHARON_ENCRYPTION_KEY_TEST`
- Apply to all 3 workflows consistently
- Same security model across all test runs
#### Option A: Set in GitHub Actions (RECOMMENDED)
**Security:** Use GitHub Repository Secrets for production-like CI
**Implementation:**
1. **Generate Test Key:**
```bash
# Local execution to generate key
openssl rand -base64 32
```
2. **Add to GitHub Secrets:**
- Navigate to: Repository → Settings → Secrets → Actions
- Create new secret: `CHARON_ENCRYPTION_KEY_TEST`
- Value: Generated base64 key from step 1
3. **Update ALL 3 Workflows:**
**Workflow 1: codecov-upload.yml**
**File:** `.github/workflows/codecov-upload.yml`
**Location:** Line 53-60 (backend-codecov job, "Run Go tests with coverage" step)
```yaml
- name: Run Go tests with coverage
working-directory: ${{ github.workspace }}
env:
CGO_ENABLED: 1
CHARON_ENCRYPTION_KEY: ${{ secrets.CHARON_ENCRYPTION_KEY_TEST }} # ADD THIS LINE
run: |
bash scripts/go-test-coverage.sh 2>&1 | tee backend/test-output.txt
exit "${PIPESTATUS[0]}"
```
**Workflow 2: quality-checks.yml (Test Coverage Step)**
**File:** `.github/workflows/quality-checks.yml`
**Location:** Line 37-45 (backend-quality job, "Run Go tests" step)
```yaml
- name: Run Go tests
id: go-tests
working-directory: ${{ github.workspace }}
env:
CGO_ENABLED: 1
CHARON_ENCRYPTION_KEY: ${{ secrets.CHARON_ENCRYPTION_KEY_TEST }} # ADD THIS LINE
run: |
bash "scripts/go-test-coverage.sh" 2>&1 | tee backend/test-output.txt
exit "${PIPESTATUS[0]}"
```
**Workflow 2: quality-checks.yml (Perf Tests Step)**
**File:** `.github/workflows/quality-checks.yml`
**Location:** Line 115-124 (backend-quality job, "Run Perf Asserts" step)
```yaml
- name: Run Perf Asserts
working-directory: backend
env:
# Conservative defaults to avoid flakiness on CI; tune as necessary
PERF_MAX_MS_GETSTATUS_P95: 500ms
PERF_MAX_MS_GETSTATUS_P95_PARALLEL: 1500ms
PERF_MAX_MS_LISTDECISIONS_P95: 2000ms
CHARON_ENCRYPTION_KEY: ${{ secrets.CHARON_ENCRYPTION_KEY_TEST }} # ADD THIS LINE
run: |
{
echo "## 🔍 Running performance assertions (TestPerf)"
go test -run TestPerf -v ./internal/api/handlers -count=1 | tee perf-output.txt
} >> "$GITHUB_STEP_SUMMARY"
exit "${PIPESTATUS[0]}"
```
**Workflow 3: benchmark.yml (Benchmark Step)**
**File:** `.github/workflows/benchmark.yml`
**Location:** Line 44 (benchmark job, "Run Benchmark" step)
```yaml
- name: Run Benchmark
working-directory: backend
env:
CHARON_ENCRYPTION_KEY: ${{ secrets.CHARON_ENCRYPTION_KEY_TEST }} # ADD THIS LINE
run: go test -bench=. -benchmem -run='^$' ./... | tee output.txt
```
**Workflow 3: benchmark.yml (Perf Asserts Step)**
**File:** `.github/workflows/benchmark.yml`
**Location:** Line 74 (benchmark job, "Run Perf Asserts" step)
```yaml
- name: Run Perf Asserts
working-directory: backend
env:
PERF_MAX_MS_GETSTATUS_P95: 500ms
PERF_MAX_MS_GETSTATUS_P95_PARALLEL: 1500ms
PERF_MAX_MS_LISTDECISIONS_P95: 2000ms
CHARON_ENCRYPTION_KEY: ${{ secrets.CHARON_ENCRYPTION_KEY_TEST }} # ADD THIS LINE
run: |
echo "## 🔍 Running performance assertions (TestPerf)" >> "$GITHUB_STEP_SUMMARY"
go test -run TestPerf -v ./internal/api/handlers -count=1 | tee perf-output.txt
exit "${PIPESTATUS[0]}"
```
**Summary of Changes:**
- **3 workflow files** to modify
- **5 env sections** to update (2 in quality-checks, 2 in benchmark, 1 in codecov-upload)
- **1 GitHub Secret** to create
**Pros:**
- Secrets are encrypted at rest
- Key never appears in logs
- Matches production security model
- Consistent across all workflows
**Cons:**
- Requires GitHub repository admin access
- Key rotation requires updating secret (but affects all workflows at once)
#### Option B: Generate Ephemeral Key (ALTERNATIVE)
**Security:** Generate temporary key for each CI run
**Implementation:**
Apply this pattern to all 3 workflows. Each workflow generates its own ephemeral key.
**Workflow 1: codecov-upload.yml**
**File:** `.github/workflows/codecov-upload.yml`
**Location:** Before "Run Go tests with coverage" step (after "Set up Go")
```yaml
- name: Generate test encryption key
id: test-key
run: |
TEST_KEY=$(openssl rand -base64 32)
echo "::add-mask::${TEST_KEY}"
echo "CHARON_ENCRYPTION_KEY=${TEST_KEY}" >> $GITHUB_ENV
- name: Run Go tests with coverage
working-directory: ${{ github.workspace }}
env:
CGO_ENABLED: 1
# CHARON_ENCRYPTION_KEY inherited from $GITHUB_ENV
run: |
bash scripts/go-test-coverage.sh 2>&1 | tee backend/test-output.txt
exit "${PIPESTATUS[0]}"
```
**Workflow 2: quality-checks.yml**
**File:** `.github/workflows/quality-checks.yml`
**Location:** Before "Run Go tests" step (after "Repo health check")
```yaml
- name: Generate test encryption key
id: test-key
run: |
TEST_KEY=$(openssl rand -base64 32)
echo "::add-mask::${TEST_KEY}"
echo "CHARON_ENCRYPTION_KEY=${TEST_KEY}" >> $GITHUB_ENV
- name: Run Go tests
id: go-tests
working-directory: ${{ github.workspace }}
env:
CGO_ENABLED: 1
# CHARON_ENCRYPTION_KEY inherited from $GITHUB_ENV
run: |
bash "scripts/go-test-coverage.sh" 2>&1 | tee backend/test-output.txt
exit "${PIPESTATUS[0]}"
# ... later in the same job ...
- name: Run Perf Asserts
working-directory: backend
env:
PERF_MAX_MS_GETSTATUS_P95: 500ms
PERF_MAX_MS_GETSTATUS_P95_PARALLEL: 1500ms
PERF_MAX_MS_LISTDECISIONS_P95: 2000ms
# CHARON_ENCRYPTION_KEY inherited from $GITHUB_ENV
run: |
{
echo "## 🔍 Running performance assertions (TestPerf)"
go test -run TestPerf -v ./internal/api/handlers -count=1 | tee perf-output.txt
} >> "$GITHUB_STEP_SUMMARY"
exit "${PIPESTATUS[0]}"
```
**Workflow 3: benchmark.yml**
**File:** `.github/workflows/benchmark.yml`
**Location:** Before "Run Benchmark" step (after "Set up Go")
```yaml
- name: Generate test encryption key
id: test-key
run: |
TEST_KEY=$(openssl rand -base64 32)
echo "::add-mask::${TEST_KEY}"
echo "CHARON_ENCRYPTION_KEY=${TEST_KEY}" >> $GITHUB_ENV
- name: Run Benchmark
working-directory: backend
env:
# CHARON_ENCRYPTION_KEY inherited from $GITHUB_ENV
run: go test -bench=. -benchmem -run='^$' ./... | tee output.txt
# ... later in the same job ...
- name: Run Perf Asserts
working-directory: backend
env:
PERF_MAX_MS_GETSTATUS_P95: 500ms
PERF_MAX_MS_GETSTATUS_P95_PARALLEL: 1500ms
PERF_MAX_MS_LISTDECISIONS_P95: 2000ms
# CHARON_ENCRYPTION_KEY inherited from $GITHUB_ENV
run: |
echo "## 🔍 Running performance assertions (TestPerf)" >> "$GITHUB_STEP_SUMMARY"
go test -run TestPerf -v ./internal/api/handlers -count=1 | tee perf-output.txt
exit "${PIPESTATUS[0]}"
```
**Pros:**
- No secrets management needed
- Key is ephemeral (discarded after run)
- Simpler to implement
- Each workflow run gets its own unique key
**Cons:**
- Generates new key on every run (minimal overhead ~0.1s)
- Doesn't test key persistence scenarios
#### Option C: Inline Test Key (NOT RECOMMENDED)
**Security:** Hardcode a test-only key in workflow
**Implementation:**
Apply same hardcoded key to all 3 workflows:
```yaml
- name: Run Go tests with coverage # or Run Benchmark, or Run Perf Asserts
working-directory: ${{ github.workspace }}
env:
CGO_ENABLED: 1
CHARON_ENCRYPTION_KEY: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=" # Hardcoded test key
run: |
bash scripts/go-test-coverage.sh 2>&1 | tee backend/test-output.txt
exit "${PIPESTATUS[0]}"
```
**Apply to:**
- `.github/workflows/codecov-upload.yml` - Line 53 env block
- `.github/workflows/quality-checks.yml` - Lines 37 and 115 env blocks
- `.github/workflows/benchmark.yml` - Lines 44 and 74 env blocks
**Pros:**
- Simplest to implement (just add one line per env block)
- No secrets management
- No key generation overhead
**Cons:**
- ⚠️ Key visible in workflow file and logs
- ⚠️ Security audit will flag this
- ⚠️ Doesn't test real key loading from environment
- ⚠️ Not recommended for repos with security compliance requirements
**Recommendation:** Use **Option A** (GitHub Secrets) for production readiness and security compliance, or **Option B** (Ephemeral) for simplicity without security concerns. Avoid Option C unless this is a demo/test repository.
---
### Phase 2: Database Seeding/Test Setup (ERROR ELIMINATION)
**Objective:** Fix ProxyHost "record not found" failures
#### Step 1: Identify Failing Tests
**Action:** Run tests locally and capture failures
```bash
cd backend
go test -v ./... 2>&1 | tee test-output.txt
grep -i "record not found" test-output.txt
```
**Expected Output:**
```
--- FAIL: TestSomeFunction (0.00s)
service_test.go:123: Error getting proxy host: record not found
```
#### Step 2: Classify Failures
For each failing test, determine:
1. **Test calls `GetByID()` without creating record?**
- Fix: Add `createTestProxyHost()` call before `GetByID()`
2. **Test expects a specific ID (e.g., ID=1)?**
- Fix: Store the returned ID from `Create()` and use it in `GetByID()`
3. **Test relies on global seed data?**
- Fix: Add explicit test data creation in test setup
#### Step 3: Apply Fixes
**Pattern 1: Missing Test Data Creation**
**Before (Broken):**
```go
func TestSomeFunction(t *testing.T) {
db := setupTestDB(t)
service := NewProxyHostService(db)
// Assumes ID=1 exists (WRONG)
host, err := service.GetByID(1)
require.NoError(t, err)
}
```
**After (Fixed):**
```go
func TestSomeFunction(t *testing.T) {
db := setupTestDB(t)
service := NewProxyHostService(db)
// Create test data first
testHost := &models.ProxyHost{
UUID: "test-uuid",
DomainNames: "test.example.com",
ForwardHost: "localhost",
ForwardPort: 8080,
}
require.NoError(t, service.Create(testHost))
// Now fetch it by the auto-assigned ID
host, err := service.GetByID(testHost.ID)
require.NoError(t, err)
assert.Equal(t, "test.example.com", host.DomainNames)
}
```
**Pattern 2: Expecting Specific Error**
**Option A: Handle gorm.ErrRecordNotFound**
```go
func TestGetByID_NotFound(t *testing.T) {
db := setupTestDB(t)
service := NewProxyHostService(db)
// Test error handling for non-existent ID
_, err := service.GetByID(999)
require.Error(t, err)
assert.True(t, errors.Is(err, gorm.ErrRecordNotFound))
}
```
**Option B: Wrap Error in Service (BETTER)**
Modify `ProxyHostService.GetByID()` to return a domain-specific error:
**File:** `backend/internal/services/proxyhost_service.go:192-197`
```go
var ErrProxyHostNotFound = errors.New("proxy host not found")
func (s *ProxyHostService) GetByID(id uint) (*models.ProxyHost, error) {
var host models.ProxyHost
if err := s.db.Where("id = ?", id).First(&host).Error; err != nil {
if errors.Is(err, gorm.ErrRecordNotFound) {
return nil, ErrProxyHostNotFound
}
return nil, err
}
return &host, nil
}
```
**Then tests become:**
```go
func TestGetByID_NotFound(t *testing.T) {
db := setupTestDB(t)
service := NewProxyHostService(db)
_, err := service.GetByID(999)
require.Error(t, err)
assert.True(t, errors.Is(err, services.ErrProxyHostNotFound))
}
```
#### Step 4: Add Missing Test Utilities
**Create Shared Test Helper:**
**File:** `backend/internal/services/testutil/proxyhost_fixtures.go` (NEW FILE)
```go
package testutil
import (
"testing"
"github.com/Wikid82/charon/backend/internal/models"
"github.com/google/uuid"
"github.com/stretchr/testify/require"
"gorm.io/gorm"
)
// CreateTestProxyHost creates a proxy host with sensible defaults for testing.
func CreateTestProxyHost(t *testing.T, db *gorm.DB, overrides ...func(*models.ProxyHost)) *models.ProxyHost {
t.Helper()
host := &models.ProxyHost{
UUID: uuid.NewString(),
Name: "Test Proxy",
DomainNames: "test.example.com",
ForwardScheme: "http",
ForwardHost: "localhost",
ForwardPort: 8080,
Enabled: true,
}
// Apply overrides
for _, override := range overrides {
override(host)
}
require.NoError(t, db.Create(host).Error)
return host
}
```
**Usage in Tests:**
```go
import "github.com/Wikid82/charon/backend/internal/services/testutil"
func TestSomeFunction(t *testing.T) {
db := setupTestDB(t)
service := NewProxyHostService(db)
// Create test data with defaults
host1 := testutil.CreateTestProxyHost(t, db)
// Create test data with custom values
host2 := testutil.CreateTestProxyHost(t, db, func(h *models.ProxyHost) {
h.Name = "Custom Name"
h.ForwardPort = 9000
})
// Now use them
fetched, err := service.GetByID(host1.ID)
require.NoError(t, err)
}
```
---
## Phase 3: Validation
### Consolidated Implementation Checklist
**Phase 1: Multi-Workflow Environment Variable Fix**
- [ ] **Generate or configure secret:**
- Option A: Generate key with `openssl rand -base64 32`, add to GitHub Secrets as `CHARON_ENCRYPTION_KEY_TEST`
- Option B: Add key generation step to each workflow (ephemeral keys)
- Option C: Use hardcoded test key (not recommended)
- [ ] **Update Workflow 1 (Priority: CRITICAL):**
- **File:** `.github/workflows/quality-checks.yml`
- **Location 1:** Line 37-45 - Add `CHARON_ENCRYPTION_KEY` to "Run Go tests" step
- **Location 2:** Line 115-124 - Add `CHARON_ENCRYPTION_KEY` to "Run Perf Asserts" step
- **Verification:** Both test steps have the env var
- [ ] **Update Workflow 2 (Priority: HIGH):**
- **File:** `.github/workflows/codecov-upload.yml`
- **Location:** Line 53-60 - Add `CHARON_ENCRYPTION_KEY` to "Run Go tests with coverage" step
- **Verification:** Test step has the env var
- [ ] **Update Workflow 3 (Priority: MEDIUM):**
- **File:** `.github/workflows/benchmark.yml`
- **Location 1:** Line 44 - Add `CHARON_ENCRYPTION_KEY` to "Run Benchmark" step
- **Location 2:** Line 74 - Add `CHARON_ENCRYPTION_KEY` to "Run Perf Asserts" step
- **Verification:** Both test steps have the env var
- [ ] **Total changes:** 3 files, 5 env blocks updated
**Phase 2: Test Data Setup Fixes**
- [ ] Identify failing tests with "record not found" errors
- [ ] Fix each test by adding proper test data creation
- [ ] Add `testutil.CreateTestProxyHost()` helper if needed
- [ ] Verify all tests pass locally
**Phase 3: Multi-Workflow Validation**
- [ ] Local validation (all tests pass with encryption key set)
- [ ] Push to feature branch
- [ ] Monitor **all 3 workflow runs** in GitHub Actions
- [ ] Verify each workflow:
- ✅ quality-checks.yml - No warnings, tests pass
- ✅ codecov-upload.yml - No warnings, tests pass, coverage uploaded
- ✅ benchmark.yml - No warnings, benchmarks complete
---
## Phase 3: Validation (Detailed Procedures)
### Step 1: Local Validation
**Execute Before Pushing:**
```bash
# 1. Set encryption key locally (matches CI)
export CHARON_ENCRYPTION_KEY=$(openssl rand -base64 32)
# 2. Run backend tests
cd /projects/Charon
.github/skills/scripts/skill-runner.sh test-backend-coverage
# 3. Verify no warnings in output
# Look for: "Warning: RotationService initialization failed"
# Expected: No warnings
# 4. Verify coverage pass
# Expected: "Coverage requirement met"
# 5. Check for test failures
# Expected: All tests pass
```
**Success Criteria:**
- ✅ No "RotationService initialization failed" warnings
- ✅ No "record not found" errors
- ✅ Coverage >= 85%
- ✅ All tests pass
### Step 2: CI Validation
**Push to Branch and Monitor:**
```bash
git checkout -b fix/ci-backend-test-failures
git add .github/workflows/codecov-upload.yml
git add .github/workflows/quality-checks.yml
git add .github/workflows/benchmark.yml
git add backend/internal/services/proxyhost_service.go # If modified
git add backend/internal/services/*_test.go # Any test fixes
git commit -m "fix(ci): resolve backend test failures across all workflows
- Add CHARON_ENCRYPTION_KEY to quality-checks, codecov-upload, and benchmark workflows
- Fix ProxyHost test data setup in service tests
- Eliminate RotationService initialization warnings
Affected workflows:
- quality-checks.yml (CRITICAL: PR blocker)
- codecov-upload.yml (HIGH: coverage tracking)
- benchmark.yml (MEDIUM: performance regression)
Resolves: backend test job failures across 3 CI workflows"
git push origin fix/ci-backend-test-failures
```
**Monitor All 3 CI Workflows:**
1. Navigate to GitHub Actions → Your PR
2. Verify these workflow runs appear:
- ✅ **Quality Checks** (most critical)
- ✅ **Upload Coverage to Codecov**
- ✅ **Go Benchmark** (may run later via workflow_run trigger)
3. **For each workflow, verify:**
- No stderr warnings in test execution steps
- Test output shows all tests passing
- No "RotationService initialization failed" messages
- No "record not found" errors
4. **Quality Checks specific checks:**
- "Run Go tests" step succeeds
- "Run Perf Asserts" step succeeds
- GORM Security Scanner passes
- Frontend tests pass (unrelated but monitored)
5. **Codecov Upload specific checks:**
- Backend tests pass
- Coverage upload succeeds
- Coverage report appears on PR
6. **Benchmark specific checks:**
- Benchmarks complete without errors
- Performance assertions pass
- (Note: Results may only store on main branch pushes)
**Expected Duration:**
- quality-checks.yml: ~3-5 minutes
- codecov-upload.yml: ~3-5 minutes
- benchmark.yml: ~4-6 minutes
**Success Criteria - ALL workflows must:**
- ✅ Complete without failures
- ✅ Show no encryption key warnings
- ✅ Show no database record errors
- ✅ Maintain or improve coverage/performance baselines
---
## Dependencies & Risks
### Dependencies
**Internal:**
- GitHub repository secrets access (for Option A)
- Ability to modify 3 workflow files: `.github/workflows/{codecov-upload,quality-checks,benchmark}.yml`
- Go test environment (local and CI)
**External:**
- Codecov service (for coverage upload)
- GitHub Actions runner availability
### Risks
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| Tests fail after adding encryption key | Low | Medium | Test locally first with same env var |
| New test failures introduced by fixes | Medium | Medium | Validate each test fix individually |
| Coverage drops below 85% | Low | High | Add tests alongside fixes, not after |
| Codecov upload still fails | Low | High | Verify Codecov token is valid |
| Breaking other tests by modifying ProxyHostService | Low | High | Only add error wrapping, don't change logic |
| **Missing affected workflows (incomplete fix)** | **Low** | **Critical** | Verified all workflows via grep search; only 3 run Go tests |
| **Workflow fixes out of sync** | **Medium** | **High** | Use same env var name (`CHARON_ENCRYPTION_KEY`) across all workflows |
| **Quality checks workflow more critical than codecov** | **N/A** | **Critical** | Prioritize quality-checks.yml - it blocks PR merges |
| **Benchmark workflow fails silently** | **Low** | **Medium** | Add same fix proactively even if not currently failing |
### Multi-Workflow Coordination
**Critical Insight:** The `quality-checks.yml` workflow is MORE important than `codecov-upload.yml` because:
- Quality checks run on every PR and block merges
- Codecov upload is informational and doesn't block merges
- Quality checks includes multiple test types (unit tests + perf tests)
**Implementation Priority:**
1. **FIRST:** Fix `quality-checks.yml` (most critical - PR blocker)
2. **SECOND:** Fix `codecov-upload.yml` (high priority - coverage tracking)
3. **THIRD:** Fix `benchmark.yml` (proactive - prevent future issues)
**Consistency Requirements:**
- All workflows MUST use the same environment variable name: `CHARON_ENCRYPTION_KEY`
- If using Option A (GitHub Secrets), all workflows MUST reference the same secret: `CHARON_ENCRYPTION_KEY_TEST`
- If using Option B (Ephemeral), all workflows MUST generate keys the same way for consistency
### Technical Debt Created
1. **Test Helper Utilities:**
- New `testutil` package should be documented
- Consider creating similar helpers for other models
2. **Error Handling Consistency:**
- If wrapping `gorm.ErrRecordNotFound`, apply same pattern to all services
- Document error handling conventions
3. **Environment Variable Documentation:**
- Update `docs/development.md` with required CI env vars
- Document test key generation process
---
## Stop/Go Rules
### Stop Conditions
**Phase 1 (Environment Variables):**
- STOP if: Local tests fail after setting `CHARON_ENCRYPTION_KEY`
- **Action:** Investigate why encryption key breaks tests
- **Escalate to:** Backend service owners
**Phase 2 (Test Fixes):**
- STOP if: More than 5 test files need modifications
- **Action:** Consider global test fixture/seed instead
- **Escalate to:** Test infrastructure team
- STOP if: Fixing tests requires production code changes beyond error wrapping
- **Action:** Escalate as potential design issue
**Phase 3 (Validation):**
- STOP if: CI still fails after local validation passes
- **Action:** Compare CI environment vs. local (Go version, SQLite version, etc.)
- **Escalate to:** DevOps/CI team
### Go Conditions
**Phase 1 → Phase 2:**
- GO if: Tests run with no RotationService warnings
- GO if: Coverage remains >= 85%
**Phase 2 → Phase 3:**
- GO if: All identified test failures are fixed
- GO if: No new test failures introduced
**Phase 3 → Complete:**
- GO if: CI run passes with all checks green
- GO if: Codecov upload succeeds
---
## Success Metrics
### Quantitative
1. **RotationService Warnings:** 0 occurrences in CI logs
2. **Test Failures:** 0 "record not found" errors
3. **Coverage:** Maintain >= 85% backend coverage
4. **CI Duration:** No increase in test execution time
5. **Test Pass Rate:** 100% (all tests pass)
### Qualitative
1. **Code Quality:** Test fixes follow established patterns
2. **Documentation:** Changes are self-explanatory or documented
3. **Maintainability:** Future tests can easily create test data
4. **Security:** Encryption key handling follows best practices
---
## Timeline Estimate
| Phase | Estimated Duration | Confidence |
|-------|-------------------|-----------|
| Phase 1: Environment Variable (3 workflows) | 45 minutes | High |
| Phase 2: Test Fixes | 1-3 hours | Medium |
| Phase 3: Validation (3 workflows) | 45 minutes | High |
| **Total** | **2.5-4.5 hours** | **Medium** |
**Assumptions:**
- Fewer than 5 tests need fixing
- No production code changes required (beyond error wrapping)
- CI environment is stable
- All 3 workflows can be tested in parallel
**Phase 1 Breakdown:**
- Generate/configure secret: 5 minutes
- Update quality-checks.yml (2 env blocks): 15 minutes
- Update codecov-upload.yml (1 env block): 10 minutes
- Update benchmark.yml (2 env blocks): 10 minutes
- Document changes and verify: 5 minutes
**Contingency:**
- If more than 5 tests fail: +2 hours
- If production code needs refactoring: +4 hours
- If CI environment has additional issues: +1 hour
- If workflows have unexpected dependencies: +1 hour
---
## Follow-Up Actions
### Immediate (This PR)
1. ✅ Add `CHARON_ENCRYPTION_KEY` to CI workflow
2. ✅ Fix all identified test failures
3. ✅ Verify CI passes
### Short-Term (Next Sprint)
1. **Test Infrastructure Audit:**
- Document all required environment variables for tests
- Create standardized test setup utilities (`testutil` package)
- Add linting rule to catch missing test data setup
2. **Error Handling Standardization:**
- Define domain-specific errors for all services (not just ProxyHost)
- Document error handling conventions
- Apply pattern to all `*Service.GetByID()` methods
3. **CI Environment Documentation:**
- Document all GitHub Secrets required for workflows
- Create key rotation procedure
- Add CI environment variable checklist
### Long-Term (Future)
1. **Test Fixture Framework:**
- Evaluate using `testfixtures` or similar library
- Create declarative test data setup
- Reduce boilerplate in test files
2. **Integration Testing:**
- Separate unit tests (fast, mocked) from integration tests (real DB)
- Use build tags: `//go:build integration`
- Run integration tests separately in CI
3. **Service Constructor Refactoring:**
- Make `RotationService` initialization explicit
- Allow tests to inject mock `RotationService`
- Reduce warning messages in test output
---
## References
### Files Analyzed
**CI Configuration:**
- `.github/workflows/codecov-upload.yml` (workflow definition)
**Backend Services:**
- `backend/internal/crypto/rotation_service.go` (encryption key loading)
- `backend/internal/services/dns_provider_service.go` (RotationService usage)
- `backend/internal/services/credential_service.go` (RotationService usage)
- `backend/internal/services/proxyhost_service.go` (GetByID implementation)
**Tests:**
- `backend/internal/crypto/rotation_service_test.go` (key setup pattern)
- `backend/internal/services/dns_provider_service_test.go` (test setup)
- `backend/internal/services/credential_service_test.go` (test setup)
- `backend/internal/services/proxyhost_service_test.go` (CRUD test pattern)
- `backend/internal/api/handlers/proxy_host_handler_update_test.go` (test helper)
**Documentation:**
- `.env.example` (environment variable reference)
- `ARCHITECTURE.md` (encryption key documentation)
- `docs/guides/dns-providers.md` (encryption key usage guide)
### External Resources
- [GORM Error Handling](https://gorm.io/docs/error_handling.html)
- [GitHub Actions Secrets](https://docs.github.com/en/actions/security-guides/encrypted-secrets)
- [Go Testing Best Practices](https://go.dev/doc/effective_go#testing)
- [SQLite In-Memory Databases](https://www.sqlite.org/inmemorydb.html)
---
## Appendix A: Workflow Analysis Details
### Analysis Methodology
**Search Commands Used:**
```bash
# Find all workflow files
find .github/workflows -name "*.yml"
# Find workflows running Go tests
grep -r "go test\|go-test-coverage\.sh" .github/workflows/*.yml
# Find workflows with encryption key
grep -r "CHARON_ENCRYPTION_KEY" .github/workflows/*.yml
```
**Results:**
- **39 total workflow files** in `.github/workflows/`
- **3 workflows run Go unit tests** (affected by missing encryption key)
- **1 workflow (e2e-tests-split.yml)** already has encryption key configured
- **2 workflows (cerberus, crowdsec)** run integration tests (not affected)
- **33 workflows** don't run backend tests (not affected)
### Workflow-by-Workflow Breakdown
#### 1. quality-checks.yml (CRITICAL)
**Purpose:** PR quality gates that block merges
**Trigger:** On every pull_request to main/development
**Impact:** Most critical - blocks PR approvals
**Test Commands:**
- Line 43: `bash "scripts/go-test-coverage.sh"`
- Line 123: `go test -run TestPerf -v ./internal/api/handlers`
**Current Status:** ❌ Failing
**Fix Required:** Add `CHARON_ENCRYPTION_KEY` to both test steps
**Expected Result:** PR checks turn green, allowing merges
#### 2. codecov-upload.yml (HIGH PRIORITY)
**Purpose:** Upload test coverage to Codecov service
**Trigger:** On pull_request to main/development + workflow_dispatch
**Impact:** High - coverage tracking and reporting
**Test Commands:**
- Line 58: `bash scripts/go-test-coverage.sh`
**Current Status:** ❌ Failing
**Fix Required:** Add `CHARON_ENCRYPTION_KEY` to test step
**Expected Result:** Coverage reports appear on PRs
#### 3. benchmark.yml (MEDIUM PRIORITY)
**Purpose:** Performance regression detection
**Trigger:** After docker-build.yml completes + workflow_dispatch
**Impact:** Medium - catches performance regressions
**Test Commands:**
- Line 44: `go test -bench=. -benchmem -run='^$' ./...`
- Line 74: `go test -run TestPerf -v ./internal/api/handlers`
**Current Status:** ⚠️ At risk (may not have failed yet)
**Fix Required:** Add `CHARON_ENCRYPTION_KEY` to both test steps (proactive)
**Expected Result:** Benchmarks run cleanly without warnings
#### 4. e2e-tests-split.yml (ALREADY FIXED)
**Purpose:** End-to-end Playwright tests
**Trigger:** Multiple triggers, runs E2E test shards
**Status:** ✅ Already configured correctly
**Evidence of correct configuration:**
```yaml
# Lines 280, 481, 690, 894, 1098, 1310 - All identical:
- name: Generate test encryption key
run: echo "CHARON_ENCRYPTION_KEY=$(openssl rand -base64 32)" >> "$GITHUB_ENV"
```
**Why it's correct:** Each shard generates its own ephemeral key before running tests. This is the pattern Option B recommends.
#### 5. cerberus-integration.yml (NOT AFFECTED)
**Purpose:** Cerberus security stack integration tests
**Test Type:** Docker compose with integration scripts
**Why not affected:** Doesn't run `go test` - runs `scripts/cerberus_integration.sh`
**Status:** ✅ No changes needed
#### 6. crowdsec-integration.yml (NOT AFFECTED)
**Purpose:** CrowdSec bouncer integration tests
**Test Type:** Docker compose with integration scripts
**Why not affected:** Doesn't run `go test` - runs skill-based integration scripts
**Status:** ✅ No changes needed
### Why Other Workflows Aren't Affected
**Workflows without backend tests:**
- `docker-build.yml` - Builds images, no test execution
- `codeql.yml` - Security scanning only
- `supply-chain-*.yml` - SBOM and provenance only
- `release-goreleaser.yml` - Release automation
- `docs.yml` - Documentation deployment
- `repo-health.yml` - Repository maintenance
- `renovate_prune.yml` - Dependency management
- `auto-versioning.yml` - Version bumping
- `caddy-major-monitor.yml` - Upstream monitoring
- `update-geolite2.yml` - GeoIP updates
- `nightly-build.yml` - Scheduled builds
- `propagate-changes.yml` - Branch sync
- `weekly-nightly-promotion.yml` - Release promotion
- `gh_cache_cleanup.yml` - Cache maintenance
**Key Insight:** The CI failures only affect workflows that run `go test` commands, and specifically those that instantiate services requiring `RotationService`. Integration test workflows use Docker compose and don't instantiate Go services directly in the CI runner.
---
## Sign-Off
**Prepared by:** Investigation Agent
**Reviewed by:** Pending (Awaiting supervisor approval)
**Approved by:** Pending
**Next Action:** Await approval to proceed with Phase 1 implementation.