Files
Charon/docs/plans/current_spec.md
GitHub Actions 0c2e2f7214 fix: stabilize certificate test execution under CI concurrency
This change hardens certificate handler test execution so repeated CI runs are deterministic and no longer fail intermittently under concurrent scheduling and race-mode pressure.

It was necessary because initialization timing and test setup ordering created nondeterministic behavior that produced sporadic failures in the backend test suite.

The result is a stable certificate list test path with explicit validation gates and reproducible test artifacts for auditing.

Known container vulnerability findings remain documented and are treated as an accepted exception for this hotfix scope, with remediation deferred to the dedicated security track.
2026-02-19 00:44:44 +00:00

17 KiB

Fix Flaky Go Test: TestCertificateHandler_List_WithCertificates

1) Introduction

This specification defines a focused, low-risk plan to eliminate intermittent CI failures for:

  • backend/internal/api/handlers.TestCertificateHandler_List_WithCertificates

while preserving production behavior in:

  • backend/internal/services/certificate_service.go
  • backend/internal/api/handlers/certificate_handler.go
  • backend/internal/api/routes/routes.go

Primary objective:

  • Remove nondeterministic startup concurrency in certificate service initialization that intermittently races with handler list calls under CI parallelism/race detection.

Decision Record - 2026-02-18

Decision: For this targeted flaky-fix, docs/plans/current_spec.md is the temporary single source of truth instead of creating separate requirements.md, design.md, and tasks.md artifacts.

Context: The scope is intentionally narrow (single flaky test root-cause + deterministic validation loop), and splitting artifacts now would add process overhead without reducing delivery risk.

Traceability mapping:

  • Requirements content is captured in Section 3 (EARS requirements).
  • Design content is captured in Section 4 (technical specifications).
  • Task breakdown and validation gates are captured in Section 5 (implementation plan).

Review trigger: If scope expands beyond the certificate flaky-fix boundary, restore full artifact split into requirements.md, design.md, and tasks.md.


2) Research Findings

2.1 CI failure evidence

Observed in .github/logs/ci_failure.log:

  • WARNING: DATA RACE
  • failing test: TestCertificateHandler_List_WithCertificates
  • conflicting access path:
    • Write from (*CertificateService).SyncFromDisk (certificate_service.go)
    • Read from (*CertificateService).ListCertificates (certificate_service.go), triggered via handler list path

2.2 Existing architecture and hotspot map

Service layer

  • File: backend/internal/services/certificate_service.go
  • Key symbols:
    • NewCertificateService(caddyDataDir string, db *gorm.DB) *CertificateService
    • SyncFromDisk() error
    • ListCertificates() ([]models.SSLCertificate, error)
    • InvalidateCache()
    • refreshCacheFromDB() error

Handler layer

  • File: backend/internal/api/handlers/certificate_handler.go
  • Key symbols:
    • func (h *CertificateHandler) List(c *gin.Context)
    • func (h *CertificateHandler) Upload(c *gin.Context)
    • func (h *CertificateHandler) Delete(c *gin.Context)

Route wiring

  • File: backend/internal/api/routes/routes.go
  • Key symbol usage:
    • services.NewCertificateService(caddyDataDir, db)

Tests and setup patterns

  • Files:
    • backend/internal/api/handlers/certificate_handler_coverage_test.go
    • backend/internal/api/handlers/certificate_handler_test.go
    • backend/internal/api/handlers/certificate_handler_security_test.go
    • backend/internal/services/certificate_service_test.go
    • backend/internal/api/handlers/testdb.go
  • Findings:
    • many handler tests instantiate the real constructor directly.
    • one test already includes a timing workaround (time.Sleep) due to startup race behavior.
    • service tests already use a helper pattern that avoids constructor-side async startup, which validates the architectural direction for deterministic initialization in tests.

2.3 Root cause summary

Two explicit root-cause branches are tracked for this flaky area:

  • Branch A — constructor startup race: NewCertificateService starts async state mutation (SyncFromDisk) immediately, while handler-driven reads (ListCertificates) may execute concurrently. This matches CI race output in .github/logs/ci_failure.log for TestCertificateHandler_List_WithCertificates.
  • Branch B — DB schema/setup ordering drift in tests: some test paths reach service queries before deterministic schema setup is complete, producing intermittent setup-order failures such as no such table: ssl_certificates and no such table: proxy_hosts.

Plan intent: address both branches together in one tight PR by making constructor behavior deterministic and enforcing migration/setup ordering in certificate handler test setup.


3) EARS Requirements (Source of Truth)

3.1 Ubiquitous requirements

  • THE SYSTEM SHALL construct CertificateService in a deterministic initialization state before first handler read in tests.
  • THE SYSTEM SHALL provide race-free access to certificate cache state during list operations.

3.2 Event-driven requirements

  • WHEN NewCertificateService returns, THE SYSTEM SHALL guarantee that any required first-read synchronization state is consistent for immediate ListCertificates calls.
  • WHEN CertificateHandler.List is called, THE SYSTEM SHALL return data or a deterministic error without concurrency side effects caused by service startup.
  • WHEN certificate handler tests initialize database state, THE SYSTEM SHALL complete required schema setup for certificate-related tables before service/handler execution.

3.3 Unwanted-behavior requirements

  • IF CI executes tests with race detector or parallel scheduling, THEN THE SYSTEM SHALL NOT produce data races between service initialization and list reads.
  • IF startup synchronization fails, THEN THE SYSTEM SHALL surface explicit errors and preserve handler-level HTTP error behavior.
  • IF test setup ordering is incorrect, THEN THE SYSTEM SHALL fail fast in setup with explicit diagnostics instead of deferred no such table runtime query failures.

3.4 Optional requirements

  • WHERE test-only construction paths are needed, THE SYSTEM SHALL use explicit test helpers/options instead of sleeps or timing assumptions.

4) Technical Specifications

4.1 Service design changes

Primary design target (preferred):

  1. Update NewCertificateService to remove nondeterministic startup behavior at construction boundary.
  2. Ensure SyncFromDisk / cache state transition is serialized relative to early ListCertificates usage.
  3. Keep existing mutex discipline explicit and auditable (single writer, safe readers).
  4. Enforce deterministic DB setup ordering in certificate handler tests through shared setup helper behavior (migrate/setup before constructor and request execution).

Design constraints:

  • No public API break for callers in routes.go.
  • No broad refactor of certificate business logic.
  • Keep behavior compatible with current handler response contracts.

4.2 Handler and route impact

  • certificate_handler.go should require no contract changes.
  • routes.go keeps existing construction call shape unless a minimally invasive constructor option path is required.

4.3 Data flow (target state)

  1. Route wiring constructs certificate service.
  2. Service enters stable initialized state before first list read path.
  3. Handler List requests certificate data.
  4. Service reads synchronized cache/DB state.
  5. Handler responds 200 with deterministic payload or explicit 500 error.

4.4 Error handling matrix

Scenario Expected behavior Validation
Startup sync succeeds List path returns stable data handler list tests pass repeatedly
Startup sync fails Error bubbles predictably to handler HTTP 500 tests remain deterministic
Empty store 200 + empty list (or existing contract shape) existing empty-list tests pass
Concurrent test execution No race detector findings in certificate tests go test -race targeted suite
Test DB setup ordering incomplete setup fails immediately with explicit setup error; no deferred query-time table errors dedicated setup-ordering test + repeated run threshold

4.5 API and schema impact

  • API endpoints: no external contract changes.
  • Request/response schema: unchanged for certificate list/upload/delete.
  • Database schema: no migrations required.

5) Implementation Plan (Phased)

Phase 1: Playwright / UI-UX Baseline (Gate)

Although this fix is backend-test focused, follow test protocol gate:

  • Reuse healthy E2E environment when possible; rebuild only if runtime image inputs changed.
  • Use exact task/suite gate:
    • task ID shell: Test: E2E Playwright (FireFox) - Core: Certificates
    • suite path executed by the task: tests/core/certificates.spec.ts
  • If rebuild is required by test protocol, run task ID shell: Docker: Rebuild E2E Environment before Playwright.

Deliverables:

  • passing targeted Playwright suite for certificate list interactions
  • archived output in standard test artifacts

Phase 2: Backend Service Stabilization

Scope:

  • backend/internal/services/certificate_service.go

Tasks:

  1. Make constructor initialization deterministic at first-use boundary.
  2. Remove startup timing dependence that allows early ListCertificates to overlap with constructor-initiated sync mutation.
  3. Preserve current cache invalidation and DB refresh semantics.
  4. Keep lock lifecycle simple and reviewable.

Complexity: Medium (single component, concurrency-sensitive)

Phase 3: Backend Test Hardening

Scope:

  • backend/internal/api/handlers/certificate_handler_coverage_test.go
  • backend/internal/api/handlers/certificate_handler_test.go
  • backend/internal/api/handlers/certificate_handler_security_test.go
  • optional shared test helpers:
    • backend/internal/api/handlers/testdb.go

Tasks:

  1. Remove sleep-based or timing-dependent assumptions.
  2. Standardize deterministic service setup helper for handler tests.
  3. Ensure flaky case (TestCertificateHandler_List_WithCertificates) is fully deterministic.
  4. Preserve assertion intent and API-level behavior checks.
  5. Add explicit setup-ordering validation in tests to guarantee required schema migration/setup completes before handler/service invocation.

Complexity: Medium (many callsites, low logic risk)

Phase 4: Integration & Validation

Tasks:

  1. Run reproducible stability stress loop for the known flaky case via task ID shell: Test: Backend Flaky - Certificate List Stability Loop.
  • Task command payload:
    • mkdir -p test-results/flaky && cd /projects/Charon && go test ./backend/internal/api/handlers -run '^TestCertificateHandler_List_WithCertificates$' -count=100 -shuffle=on -parallel=8 -json 2>&1 | tee test-results/flaky/cert-list-stability.jsonl
  • Artifactized logging requirement: persist test-results/flaky/cert-list-stability.jsonl for reproducibility and CI comparison.
  • Pass threshold: 100/100 successful runs, zero failures.
  1. Run race-mode stress for the same path via task ID shell: Test: Backend Flaky - Certificate List Race Loop.
  • Task command payload:
    • mkdir -p test-results/flaky && cd /projects/Charon && go test -race ./backend/internal/api/handlers -run '^TestCertificateHandler_List_WithCertificates$' -count=30 -shuffle=on -parallel=8 -json 2>&1 | tee test-results/flaky/cert-list-race.jsonl
  • Artifactized logging requirement: persist test-results/flaky/cert-list-race.jsonl.
  • Pass threshold: exit code 0 and zero WARNING: DATA RACE occurrences.
  1. Run setup-ordering validation loop via task ID shell: Test: Backend Flaky - Certificate DB Setup Ordering Loop.
  • Task command payload:
    • mkdir -p test-results/flaky && cd /projects/Charon && go test ./backend/internal/api/handlers -run '^TestCertificateHandler_DBSetupOrdering' -count=50 -shuffle=on -parallel=8 -json 2>&1 | tee test-results/flaky/cert-db-setup-ordering.jsonl
  • Pass threshold: 50/50 successful runs and zero no such table: ssl_certificates|proxy_hosts messages from positive-path setup runs.
  1. Run focused certificate handler regression suite via task ID shell: Test: Backend Flaky - Certificate Handler Focused Regression.
  • Task command payload:
    • mkdir -p test-results/flaky && cd /projects/Charon && go test ./backend/internal/api/handlers -run '^TestCertificateHandler_' -count=1 -json 2>&1 | tee test-results/flaky/cert-handler-regression.jsonl
  • Pass threshold: all selected tests pass in one clean run.
  1. Execute patch-coverage preflight via existing project task ID shell: Test: Local Patch Report.
  • Task command payload: bash scripts/local-patch-report.sh
  • Pass threshold: both artifacts exist: test-results/local-patch-report.md and test-results/local-patch-report.json.

Task definition status for Phase 4 gates:

  • Existing task ID in .vscode/tasks.json:
    • shell: Test: Local Patch Report
  • New task IDs to add in .vscode/tasks.json with the exact command payloads above:
    • shell: Test: Backend Flaky - Certificate List Stability Loop
    • shell: Test: Backend Flaky - Certificate List Race Loop
    • shell: Test: Backend Flaky - Certificate DB Setup Ordering Loop
    • shell: Test: Backend Flaky - Certificate Handler Focused Regression

Complexity: Low-Medium

Phase 5: Documentation & Operational Hygiene

Tasks:

  1. Update docs/plans/current_spec.md (this file) if implementation details evolve.
  2. Record final decision outcomes and any deferred cleanup tasks.
  3. Defer unrelated repository config hygiene (.gitignore, codecov.yml, .dockerignore, Dockerfile) unless a direct causal link to this flaky test is proven during implementation.

Complexity: Low


6) PR Slicing Strategy

Decision

Primary decision: single PR for this fix.

Why:

  • Scope is tightly coupled (constructor semantics + related tests).
  • Minimizes context switching and user review requests.
  • Reduces risk of partially landed concurrency behavior.

Trigger reasons to split into multiple PRs

Split only if one or more occur:

  • Constructor change causes wider runtime behavior concerns requiring staged rollout.
  • Test hardening touches many unrelated test modules beyond certificate handlers.
  • Validation reveals additional root-cause branches outside certificate service/test setup ordering.

Ordered fallback slices (if split is required)

PR-1: Service Determinism Core

Scope:

  • backend/internal/services/certificate_service.go

Dependencies:

  • none

Acceptance criteria:

  • service constructor/list path no longer exhibits startup race in targeted race tests
  • no public API break at route wiring callsite

Rollback/contingency:

  • revert constructor change only; preserve tests unchanged for quick recovery

PR-2: Handler Test Determinism

Scope:

  • backend/internal/api/handlers/certificate_handler_coverage_test.go
  • backend/internal/api/handlers/certificate_handler_test.go
  • backend/internal/api/handlers/certificate_handler_security_test.go
  • optional helper consolidation in backend/internal/api/handlers/testdb.go

Dependencies:

  • PR-1 merged (preferred)

Acceptance criteria:

  • TestCertificateHandler_List_WithCertificates stable across repeated CI-style runs
  • no time.Sleep timing guard required for constructor race avoidance

Rollback/contingency:

  • revert only test refactor while keeping stable service change

7) Acceptance Criteria (Definition of Done)

  1. Stability gate: task ID shell: Test: Backend Flaky - Certificate List Stability Loop passes 100/100.
  2. Race gate: task ID shell: Test: Backend Flaky - Certificate List Race Loop passes with zero race reports.
  3. Setup-ordering gate: task ID shell: Test: Backend Flaky - Certificate DB Setup Ordering Loop passes 50/50, with no no such table: ssl_certificates|proxy_hosts in positive-path setup runs.
  4. Regression gate: task ID shell: Test: Backend Flaky - Certificate Handler Focused Regression passes.
  5. Playwright baseline gate is completed via task ID shell: Test: E2E Playwright (FireFox) - Core: Certificates (suite: tests/core/certificates.spec.ts), with task ID shell: Docker: Rebuild E2E Environment run first only when rebuild criteria are met.
  6. Patch coverage preflight task ID shell: Test: Local Patch Report generates test-results/local-patch-report.md and test-results/local-patch-report.json.
  7. Scope gate: no changes to .gitignore, codecov.yml, .dockerignore, or Dockerfile unless directly required to fix this flaky test.
  8. Reproducibility gate: stress-loop outputs are artifactized under test-results/flaky/ and retained in PR evidence (cert-list-stability.jsonl, cert-list-race.jsonl, cert-db-setup-ordering.jsonl, cert-handler-regression.jsonl).
  9. If any validation fails, failure evidence and explicit follow-up tasks are recorded before completion.

8) Risks and Mitigations

Risk Impact Mitigation
Constructor behavior change impacts startup timing medium keep API stable; run targeted route/handler regression tests
Over-fixing spreads beyond certificate scope medium constrain edits to service + certificate tests only
DB setup-ordering fixes accidentally mask true migration problems medium fail fast during setup with explicit diagnostics and dedicated setup-ordering tests
Hidden race persists in adjacent tests medium run repeated/race-targeted suite; expand only if evidence requires
Out-of-scope config churn creates review noise low explicitly defer unrelated config hygiene from this PR

9) Handoff

After this plan is approved:

  1. Delegate execution to Supervisor/implementation agent with this spec as the source of truth.
  2. Execute phases in order with validation gates between phases.
  3. Keep PR scope narrow and deterministic; prefer single PR unless split triggers are hit.