Files
Charon/docs/plans/current_spec.md
GitHub Actions 0c2e2f7214 fix: stabilize certificate test execution under CI concurrency
This change hardens certificate handler test execution so repeated CI runs are deterministic and no longer fail intermittently under concurrent scheduling and race-mode pressure.

It was necessary because initialization timing and test setup ordering created nondeterministic behavior that produced sporadic failures in the backend test suite.

The result is a stable certificate list test path with explicit validation gates and reproducible test artifacts for auditing.

Known container vulnerability findings remain documented and are treated as an accepted exception for this hotfix scope, with remediation deferred to the dedicated security track.
2026-02-19 00:44:44 +00:00

370 lines
17 KiB
Markdown

## Fix Flaky Go Test: `TestCertificateHandler_List_WithCertificates`
### 1) Introduction
This specification defines a focused, low-risk plan to eliminate intermittent CI failures for:
- `backend/internal/api/handlers.TestCertificateHandler_List_WithCertificates`
while preserving production behavior in:
- `backend/internal/services/certificate_service.go`
- `backend/internal/api/handlers/certificate_handler.go`
- `backend/internal/api/routes/routes.go`
Primary objective:
- Remove nondeterministic startup concurrency in certificate service initialization that intermittently races with handler list calls under CI parallelism/race detection.
### Decision Record - 2026-02-18
**Decision:** For this targeted flaky-fix, `docs/plans/current_spec.md` is the temporary single source of truth instead of creating separate `requirements.md`, `design.md`, and `tasks.md` artifacts.
**Context:** The scope is intentionally narrow (single flaky test root-cause + deterministic validation loop), and splitting artifacts now would add process overhead without reducing delivery risk.
**Traceability mapping:**
- Requirements content is captured in Section 3 (EARS requirements).
- Design content is captured in Section 4 (technical specifications).
- Task breakdown and validation gates are captured in Section 5 (implementation plan).
**Review trigger:** If scope expands beyond the certificate flaky-fix boundary, restore full artifact split into `requirements.md`, `design.md`, and `tasks.md`.
---
### 2) Research Findings
#### 2.1 CI failure evidence
Observed in `.github/logs/ci_failure.log`:
- `WARNING: DATA RACE`
- failing test: `TestCertificateHandler_List_WithCertificates`
- conflicting access path:
- **Write** from `(*CertificateService).SyncFromDisk` (`certificate_service.go`)
- **Read** from `(*CertificateService).ListCertificates` (`certificate_service.go`), triggered via handler list path
#### 2.2 Existing architecture and hotspot map
**Service layer**
- File: `backend/internal/services/certificate_service.go`
- Key symbols:
- `NewCertificateService(caddyDataDir string, db *gorm.DB) *CertificateService`
- `SyncFromDisk() error`
- `ListCertificates() ([]models.SSLCertificate, error)`
- `InvalidateCache()`
- `refreshCacheFromDB() error`
**Handler layer**
- File: `backend/internal/api/handlers/certificate_handler.go`
- Key symbols:
- `func (h *CertificateHandler) List(c *gin.Context)`
- `func (h *CertificateHandler) Upload(c *gin.Context)`
- `func (h *CertificateHandler) Delete(c *gin.Context)`
**Route wiring**
- File: `backend/internal/api/routes/routes.go`
- Key symbol usage:
- `services.NewCertificateService(caddyDataDir, db)`
**Tests and setup patterns**
- Files:
- `backend/internal/api/handlers/certificate_handler_coverage_test.go`
- `backend/internal/api/handlers/certificate_handler_test.go`
- `backend/internal/api/handlers/certificate_handler_security_test.go`
- `backend/internal/services/certificate_service_test.go`
- `backend/internal/api/handlers/testdb.go`
- Findings:
- many handler tests instantiate the real constructor directly.
- one test already includes a timing workaround (`time.Sleep`) due to startup race behavior.
- service tests already use a helper pattern that avoids constructor-side async startup, which validates the architectural direction for deterministic initialization in tests.
#### 2.3 Root cause summary
Two explicit root-cause branches are tracked for this flaky area:
- **Branch A — constructor startup race:** `NewCertificateService` starts async state mutation (`SyncFromDisk`) immediately, while handler-driven reads (`ListCertificates`) may execute concurrently. This matches CI race output in `.github/logs/ci_failure.log` for `TestCertificateHandler_List_WithCertificates`.
- **Branch B — DB schema/setup ordering drift in tests:** some test paths reach service queries before deterministic schema setup is complete, producing intermittent setup-order failures such as `no such table: ssl_certificates` and `no such table: proxy_hosts`.
Plan intent: address both branches together in one tight PR by making constructor behavior deterministic and enforcing migration/setup ordering in certificate handler test setup.
---
### 3) EARS Requirements (Source of Truth)
#### 3.1 Ubiquitous requirements
- THE SYSTEM SHALL construct `CertificateService` in a deterministic initialization state before first handler read in tests.
- THE SYSTEM SHALL provide race-free access to certificate cache state during list operations.
#### 3.2 Event-driven requirements
- WHEN `NewCertificateService` returns, THE SYSTEM SHALL guarantee that any required first-read synchronization state is consistent for immediate `ListCertificates` calls.
- WHEN `CertificateHandler.List` is called, THE SYSTEM SHALL return data or a deterministic error without concurrency side effects caused by service startup.
- WHEN certificate handler tests initialize database state, THE SYSTEM SHALL complete required schema setup for certificate-related tables before service/handler execution.
#### 3.3 Unwanted-behavior requirements
- IF CI executes tests with race detector or parallel scheduling, THEN THE SYSTEM SHALL NOT produce data races between service initialization and list reads.
- IF startup synchronization fails, THEN THE SYSTEM SHALL surface explicit errors and preserve handler-level HTTP error behavior.
- IF test setup ordering is incorrect, THEN THE SYSTEM SHALL fail fast in setup with explicit diagnostics instead of deferred `no such table` runtime query failures.
#### 3.4 Optional requirements
- WHERE test-only construction paths are needed, THE SYSTEM SHALL use explicit test helpers/options instead of sleeps or timing assumptions.
---
### 4) Technical Specifications
#### 4.1 Service design changes
Primary design target (preferred):
1. Update `NewCertificateService` to remove nondeterministic startup behavior at construction boundary.
2. Ensure `SyncFromDisk` / cache state transition is serialized relative to early `ListCertificates` usage.
3. Keep existing mutex discipline explicit and auditable (single writer, safe readers).
4. Enforce deterministic DB setup ordering in certificate handler tests through shared setup helper behavior (migrate/setup before constructor and request execution).
Design constraints:
- No public API break for callers in `routes.go`.
- No broad refactor of certificate business logic.
- Keep behavior compatible with current handler response contracts.
#### 4.2 Handler and route impact
- `certificate_handler.go` should require no contract changes.
- `routes.go` keeps existing construction call shape unless a minimally invasive constructor option path is required.
#### 4.3 Data flow (target state)
1. Route wiring constructs certificate service.
2. Service enters stable initialized state before first list read path.
3. Handler `List` requests certificate data.
4. Service reads synchronized cache/DB state.
5. Handler responds `200` with deterministic payload or explicit `500` error.
#### 4.4 Error handling matrix
| Scenario | Expected behavior | Validation |
|---|---|---|
| Startup sync succeeds | List path returns stable data | handler list tests pass repeatedly |
| Startup sync fails | Error bubbles predictably to handler | HTTP 500 tests remain deterministic |
| Empty store | 200 + empty list (or existing contract shape) | existing empty-list tests pass |
| Concurrent test execution | No race detector findings in certificate tests | `go test -race` targeted suite |
| Test DB setup ordering incomplete | setup fails immediately with explicit setup error; no deferred query-time table errors | dedicated setup-ordering test + repeated run threshold |
#### 4.5 API and schema impact
- API endpoints: **no external contract changes**.
- Request/response schema: **unchanged** for certificate list/upload/delete.
- Database schema: **no migrations required**.
---
### 5) Implementation Plan (Phased)
## Phase 1: Playwright / UI-UX Baseline (Gate)
Although this fix is backend-test focused, follow test protocol gate:
- Reuse healthy E2E environment when possible; rebuild only if runtime image inputs changed.
- Use exact task/suite gate:
- task ID `shell: Test: E2E Playwright (FireFox) - Core: Certificates`
- suite path executed by the task: `tests/core/certificates.spec.ts`
- If rebuild is required by test protocol, run task ID `shell: Docker: Rebuild E2E Environment` before Playwright.
Deliverables:
- passing targeted Playwright suite for certificate list interactions
- archived output in standard test artifacts
## Phase 2: Backend Service Stabilization
Scope:
- `backend/internal/services/certificate_service.go`
Tasks:
1. Make constructor initialization deterministic at first-use boundary.
2. Remove startup timing dependence that allows early `ListCertificates` to overlap with constructor-initiated sync mutation.
3. Preserve current cache invalidation and DB refresh semantics.
4. Keep lock lifecycle simple and reviewable.
Complexity: **Medium** (single component, concurrency-sensitive)
## Phase 3: Backend Test Hardening
Scope:
- `backend/internal/api/handlers/certificate_handler_coverage_test.go`
- `backend/internal/api/handlers/certificate_handler_test.go`
- `backend/internal/api/handlers/certificate_handler_security_test.go`
- optional shared test helpers:
- `backend/internal/api/handlers/testdb.go`
Tasks:
1. Remove sleep-based or timing-dependent assumptions.
2. Standardize deterministic service setup helper for handler tests.
3. Ensure flaky case (`TestCertificateHandler_List_WithCertificates`) is fully deterministic.
4. Preserve assertion intent and API-level behavior checks.
5. Add explicit setup-ordering validation in tests to guarantee required schema migration/setup completes before handler/service invocation.
Complexity: **Medium** (many callsites, low logic risk)
## Phase 4: Integration & Validation
Tasks:
1. Run reproducible stability stress loop for the known flaky case via task ID `shell: Test: Backend Flaky - Certificate List Stability Loop`.
- **Task command payload:**
- `mkdir -p test-results/flaky && cd /projects/Charon && go test ./backend/internal/api/handlers -run '^TestCertificateHandler_List_WithCertificates$' -count=100 -shuffle=on -parallel=8 -json 2>&1 | tee test-results/flaky/cert-list-stability.jsonl`
- **Artifactized logging requirement:** persist `test-results/flaky/cert-list-stability.jsonl` for reproducibility and CI comparison.
- **Pass threshold:** `100/100` successful runs, zero failures.
2. Run race-mode stress for the same path via task ID `shell: Test: Backend Flaky - Certificate List Race Loop`.
- **Task command payload:**
- `mkdir -p test-results/flaky && cd /projects/Charon && go test -race ./backend/internal/api/handlers -run '^TestCertificateHandler_List_WithCertificates$' -count=30 -shuffle=on -parallel=8 -json 2>&1 | tee test-results/flaky/cert-list-race.jsonl`
- **Artifactized logging requirement:** persist `test-results/flaky/cert-list-race.jsonl`.
- **Pass threshold:** exit code `0` and zero `WARNING: DATA RACE` occurrences.
3. Run setup-ordering validation loop via task ID `shell: Test: Backend Flaky - Certificate DB Setup Ordering Loop`.
- **Task command payload:**
- `mkdir -p test-results/flaky && cd /projects/Charon && go test ./backend/internal/api/handlers -run '^TestCertificateHandler_DBSetupOrdering' -count=50 -shuffle=on -parallel=8 -json 2>&1 | tee test-results/flaky/cert-db-setup-ordering.jsonl`
- **Pass threshold:** `50/50` successful runs and zero `no such table: ssl_certificates|proxy_hosts` messages from positive-path setup runs.
4. Run focused certificate handler regression suite via task ID `shell: Test: Backend Flaky - Certificate Handler Focused Regression`.
- **Task command payload:**
- `mkdir -p test-results/flaky && cd /projects/Charon && go test ./backend/internal/api/handlers -run '^TestCertificateHandler_' -count=1 -json 2>&1 | tee test-results/flaky/cert-handler-regression.jsonl`
- **Pass threshold:** all selected tests pass in one clean run.
5. Execute patch-coverage preflight via existing project task ID `shell: Test: Local Patch Report`.
- **Task command payload:** `bash scripts/local-patch-report.sh`
- **Pass threshold:** both artifacts exist: `test-results/local-patch-report.md` and `test-results/local-patch-report.json`.
Task definition status for Phase 4 gates:
- Existing task ID in `.vscode/tasks.json`:
- `shell: Test: Local Patch Report`
- New task IDs to add in `.vscode/tasks.json` with the exact command payloads above:
- `shell: Test: Backend Flaky - Certificate List Stability Loop`
- `shell: Test: Backend Flaky - Certificate List Race Loop`
- `shell: Test: Backend Flaky - Certificate DB Setup Ordering Loop`
- `shell: Test: Backend Flaky - Certificate Handler Focused Regression`
Complexity: **Low-Medium**
## Phase 5: Documentation & Operational Hygiene
Tasks:
1. Update `docs/plans/current_spec.md` (this file) if implementation details evolve.
2. Record final decision outcomes and any deferred cleanup tasks.
3. Defer unrelated repository config hygiene (`.gitignore`, `codecov.yml`, `.dockerignore`, `Dockerfile`) unless a direct causal link to this flaky test is proven during implementation.
Complexity: **Low**
---
### 6) PR Slicing Strategy
### Decision
**Primary decision: single PR** for this fix.
Why:
- Scope is tightly coupled (constructor semantics + related tests).
- Minimizes context switching and user review requests.
- Reduces risk of partially landed concurrency behavior.
### Trigger reasons to split into multiple PRs
Split only if one or more occur:
- Constructor change causes wider runtime behavior concerns requiring staged rollout.
- Test hardening touches many unrelated test modules beyond certificate handlers.
- Validation reveals additional root-cause branches outside certificate service/test setup ordering.
### Ordered fallback slices (if split is required)
#### PR-1: Service Determinism Core
Scope:
- `backend/internal/services/certificate_service.go`
Dependencies:
- none
Acceptance criteria:
- service constructor/list path no longer exhibits startup race in targeted race tests
- no public API break at route wiring callsite
Rollback/contingency:
- revert constructor change only; preserve tests unchanged for quick recovery
#### PR-2: Handler Test Determinism
Scope:
- `backend/internal/api/handlers/certificate_handler_coverage_test.go`
- `backend/internal/api/handlers/certificate_handler_test.go`
- `backend/internal/api/handlers/certificate_handler_security_test.go`
- optional helper consolidation in `backend/internal/api/handlers/testdb.go`
Dependencies:
- PR-1 merged (preferred)
Acceptance criteria:
- `TestCertificateHandler_List_WithCertificates` stable across repeated CI-style runs
- no `time.Sleep` timing guard required for constructor race avoidance
Rollback/contingency:
- revert only test refactor while keeping stable service change
---
### 7) Acceptance Criteria (Definition of Done)
1. Stability gate: task ID `shell: Test: Backend Flaky - Certificate List Stability Loop` passes `100/100`.
2. Race gate: task ID `shell: Test: Backend Flaky - Certificate List Race Loop` passes with zero race reports.
3. Setup-ordering gate: task ID `shell: Test: Backend Flaky - Certificate DB Setup Ordering Loop` passes `50/50`, with no `no such table: ssl_certificates|proxy_hosts` in positive-path setup runs.
4. Regression gate: task ID `shell: Test: Backend Flaky - Certificate Handler Focused Regression` passes.
5. Playwright baseline gate is completed via task ID `shell: Test: E2E Playwright (FireFox) - Core: Certificates` (suite: `tests/core/certificates.spec.ts`), with task ID `shell: Docker: Rebuild E2E Environment` run first only when rebuild criteria are met.
6. Patch coverage preflight task ID `shell: Test: Local Patch Report` generates `test-results/local-patch-report.md` and `test-results/local-patch-report.json`.
7. Scope gate: no changes to `.gitignore`, `codecov.yml`, `.dockerignore`, or `Dockerfile` unless directly required to fix this flaky test.
8. Reproducibility gate: stress-loop outputs are artifactized under `test-results/flaky/` and retained in PR evidence (`cert-list-stability.jsonl`, `cert-list-race.jsonl`, `cert-db-setup-ordering.jsonl`, `cert-handler-regression.jsonl`).
9. If any validation fails, failure evidence and explicit follow-up tasks are recorded before completion.
---
### 8) Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Constructor behavior change impacts startup timing | medium | keep API stable; run targeted route/handler regression tests |
| Over-fixing spreads beyond certificate scope | medium | constrain edits to service + certificate tests only |
| DB setup-ordering fixes accidentally mask true migration problems | medium | fail fast during setup with explicit diagnostics and dedicated setup-ordering tests |
| Hidden race persists in adjacent tests | medium | run repeated/race-targeted suite; expand only if evidence requires |
| Out-of-scope config churn creates review noise | low | explicitly defer unrelated config hygiene from this PR |
---
### 9) Handoff
After this plan is approved:
1. Delegate execution to Supervisor/implementation agent with this spec as the source of truth.
2. Execute phases in order with validation gates between phases.
3. Keep PR scope narrow and deterministic; prefer single PR unless split triggers are hit.