## QA Test Failure Remediation Addendum (2026-02-19) ### Scope - In scope: test instability fixes and UI expectation alignment for the currently failing QA gates only. - Out of scope: feature additions, backend behavior changes unrelated to failing assertions, design refactors, and unrelated test cleanup. ### Failure Inventory (Source: `docs/reports/qa_report.md`) 1. `tests/settings/notifications.spec.ts` - Failing case A: provider enable/disable checkbox interaction fails (click interception/out-of-viewport). - Failing case B: invalid template preview assertion matches hidden text instead of visible error feedback. 2. `tests/security-enforcement/zzz-security-ui/system-security-settings.spec.ts` - Failing case: save success toast not visible in `should save general settings successfully`. 3. `frontend/src/pages/__tests__/Security.functional.test.tsx` - Failing case: expects Notifications button disabled when Cerberus is disabled, but current UI behavior keeps header action enabled. 4. Local patch preflight warning - `scripts/local-patch-report.sh` exits warning mode when `frontend/coverage/lcov.info` is missing, while still producing required artifacts. ### Root-Cause Hypotheses (Per Failing Test) #### 1) `tests/settings/notifications.spec.ts` - **Case A (enable/disable provider):** - Locator strategy is fragile (`checkbox` with sibling-label heuristic and fallback `first()`), so interaction can target a non-actionable/offscreen element after modal/card layout shifts. - `setChecked` can still fail when the selected checkbox is occluded by sticky UI layers or not in viewport; no explicit scroll/visibility stabilization is performed before toggling. - **Case B (invalid template preview error):** - Assertion `getByText(/error|failed|invalid/i).first()` is broad and may resolve to hidden/static text (for example option labels or non-toast content) rather than user-visible error feedback. - Test does not scope to live regions, toast container, or form-level validation block, causing false target selection. #### 2) `tests/security-enforcement/zzz-security-ui/system-security-settings.spec.ts` - Save operation likely succeeds (API 200 observed), but toast visibility assertion is timing- and selector-sensitive. - Current assertion mixes generic role/status and data-testid checks without first anchoring to a deterministic post-save signal (response + stable UI state), so ephemeral toast rendering window is missed in some runs. #### 3) `frontend/src/pages/__tests__/Security.functional.test.tsx` - Test expectation drift: spec assumes Notifications header button must be disabled when Cerberus is off. - Current UI behavior appears to allow opening Notification settings regardless of Cerberus state (feature warning applies to enforcement controls, not necessarily header actions), making test assertion stale. #### 4) `scripts/local-patch-report.sh` / coverage preflight - Script behavior is by design: in missing-input mode it writes `test-results/local-patch-report.md` and `test-results/local-patch-report.json` then exits non-zero. - QA gate interpretation is misaligned: artifact generation succeeds, but warning mode is treated as failure rather than preflight warning requiring upstream coverage generation. ### Ordered Minimal Fix Steps (Test + UI Expectation Alignment Only) 1. **Stabilize Notifications checkbox interaction** **Target:** `tests/settings/notifications.spec.ts` - Replace fallback checkbox locator with deterministic test-id or form-scoped role query tied to Enabled control only. - Add pre-action stabilization: ensure element is attached, visible, enabled, and scrolled into view before toggle. - Prefer click on associated label/control pair in modal context if native checkbox is visually wrapped. 2. **Narrow invalid-template error assertion to visible feedback channel** **Target:** `tests/settings/notifications.spec.ts` - Replace generic text matcher with scoped locator order: explicit error toast (`data-testid`/role alert), then visible form validation container. - Assert visibility on that scoped locator and avoid `.first()` on broad text searches. 3. **Stabilize system settings save-success verification** **Target:** `tests/security-enforcement/zzz-security-ui/system-security-settings.spec.ts` - Keep API response wait, then assert one deterministic success channel (shared toast helper or single canonical toast locator) with bounded timeout. - Avoid mixed fallback selectors that may match stale/inactive nodes. 4. **Align unit test expectation with intended UI contract** **Target:** `frontend/src/pages/__tests__/Security.functional.test.tsx` (+ `frontend/src/pages/Security.tsx` only if behavior contract requires it) - Confirm intended contract from current security page behavior and copy text semantics. - Minimal change path: update assertion to expected enabled state if product behavior intentionally permits Notifications action while Cerberus is disabled. - Alternate (only if contract requires disabled action): adjust button disabled logic in `Security.tsx` and keep existing unit expectation. - Choose exactly one path; do not change unrelated security-card logic. 5. **Clarify patch-preflight gate interpretation and sequencing** **Targets:** `docs/reports/qa_report.md` (evidence wording), optional test/runbook docs if needed - Record that preflight artifacts were produced in warning mode due to missing frontend coverage input. - Keep remediation action in validation sequence: generate frontend coverage before local-patch-report when full green gate is required. ### Validation Sequence (Return Full Gate to Green) 1. Run targeted Playwright notifications suite: `tests/settings/notifications.spec.ts` (Firefox project as currently gated). 2. Run targeted security UI suite: `tests/security-enforcement/zzz-security-ui/system-security-settings.spec.ts` (`security-tests` project). 3. Run focused frontend unit test file: `frontend/src/pages/__tests__/Security.functional.test.tsx`. 4. Generate frontend coverage output (`frontend/coverage/lcov.info`) via frontend coverage task/script. 5. Re-run local patch preflight and verify both artifacts exist and warning is cleared. 6. Re-run full required QA gate order for this branch: - Playwright targeted suites - Local patch preflight - Backend/frontend coverage gates - Type check - Pre-commit all files - Security scans (Trivy, Docker image scan, CodeQL Go/JS + findings gate) ### Acceptance Criteria (Addendum-Specific) - All three failing test files pass in their targeted runs. - No assertion relies on broad hidden-text matching for error/success feedback. - Notifications unit assertion reflects the actual intended UI contract (test and implementation consistent). - `test-results/local-patch-report.md` and `test-results/local-patch-report.json` are present, and preflight no longer reports missing `frontend/coverage/lcov.info`. - No unrelated feature or architectural change is introduced. ## Notify-only PR1 Remediation Addendum (Supervisor Unblock) Date: 2026-02-19 Status: Active addendum for PR1 unblock. Precedence: This addendum overrides conflicting sections below for PR1 execution. ### Scope Lock - In scope: Notify-only cutover for external notification provider dispatch, legacy provider migration completeness, and minimal QA unblock evidence. - Out of scope: certificate flaky test work and any unrelated certificate/runtime refactors. ### Dispatch Source of Truth (Decision) - **Selected approach: direct hard-cutover runtime logic** (not router+flags dispatch). - Runtime authority is in: - `backend/internal/services/notification_service.go` - `supportsJSONTemplates(...)` - `(*NotificationService).SendExternal(...)` - `(*NotificationService).TestProvider(...)` - `legacyFallbackInvocationError(...)` - `backend/internal/api/routes/routes.go` - `notificationService := services.NewNotificationService(db)` - `backend/internal/notifications/router.go` is non-authoritative for PR1 runtime dispatch and must not be introduced into runtime flow for this unblock. ### Legacy Provider Migration Completeness (Notify-only Runtime) 1. Add a single reconciliation pass in `backend/internal/services/notification_service.go`: - target function: `EnsureNotifyOnlyProviderMigration(ctx context.Context) error` - invoked once from route boot path after `NewNotificationService(db)` wiring. 2. Reconcile each `notification_providers` row to terminal state: - Supported types (`webhook`, `discord`, `slack`, `gotify`, `generic`): - `engine="notify_v1"`, `migration_state="migrated"`, `migration_error=""`, `last_migrated_at=now` - Unsupported types under notify-only (for example `telegram`): - `migration_state="failed"`, `migration_error="unsupported provider type in notify-only runtime"`, `enabled=false`, `last_migrated_at=now` - Preserve prior origin URL in `legacy_url` when mutating ambiguous/legacy rows. 3. Keep fail-closed runtime semantics: - `SendExternal`/`TestProvider` reject unsupported providers without legacy fallback. 4. Migration completeness gate (must be zero): - enabled providers where `migration_state` is empty, `pending`, or `failed`. ### Ordered Remediation Tasks (Minimal, QA-first) 1. Update plan scope and precedence in `docs/plans/current_spec.md` (this addendum). 2. Lock runtime dispatch authority to direct notify-only logic in `backend/internal/services/notification_service.go`. 3. Add migration reconciliation pass in `backend/internal/services/notification_service.go` and call it from `backend/internal/api/routes/routes.go`. 4. Preserve server-managed migration fields behavior in `backend/internal/api/handlers/notification_provider_handler.go`. 5. Capture unblock evidence in `docs/reports/qa_report.md`. ### Test List (Execution Order) 1. `go test ./backend/internal/services -run 'TestNotificationService_SendExternal_NotifyOnlyBlocksLegacyFallback|TestSendExternal_UnsupportedProviderFailsClosed|TestTestProvider_NotifyOnlyRejectsUnsupportedProvider|TestTestProvider_DiscordUsesNotifyPathInPR1'` 2. `go test ./backend/internal/api/handlers -run 'TestNotificationProviderHandler_CreateIgnoresServerManagedMigrationFields|TestNotificationProviderHandler_UpdatePreservesServerManagedMigrationFields'` 3. Add/run migration completeness test: - `go test ./backend/internal/services -run 'TestNotificationService_EnsureNotifyOnlyProviderMigration'` 4. `npx playwright test tests/settings/notifications.spec.ts --project=firefox` 5. `bash scripts/local-patch-report.sh` (must emit `test-results/local-patch-report.md` and `test-results/local-patch-report.json`) ### Acceptance Criteria (PR1 Unblock) 1. `current_spec.md` clearly reflects Notify-only PR1 scope and excludes certificate flaky scope. 2. Exactly one backend dispatch source-of-truth is documented and used: direct hard-cutover in `NotificationService`. 3. Legacy provider migration reconciliation exists and is covered by targeted tests. 4. No enabled provider remains in non-terminal migration state under Notify-only runtime. 5. QA evidence is updated for Supervisor re-review. ## Fix Flaky Go Test: `TestCertificateHandler_List_WithCertificates` ### 1) Introduction This specification defines a focused, low-risk plan to eliminate intermittent CI failures for: - `backend/internal/api/handlers.TestCertificateHandler_List_WithCertificates` while preserving production behavior in: - `backend/internal/services/certificate_service.go` - `backend/internal/api/handlers/certificate_handler.go` - `backend/internal/api/routes/routes.go` Primary objective: - Remove nondeterministic startup concurrency in certificate service initialization that intermittently races with handler list calls under CI parallelism/race detection. ### Decision Record - 2026-02-18 **Decision:** For this targeted flaky-fix, `docs/plans/current_spec.md` is the temporary single source of truth instead of creating separate `requirements.md`, `design.md`, and `tasks.md` artifacts. **Context:** The scope is intentionally narrow (single flaky test root-cause + deterministic validation loop), and splitting artifacts now would add process overhead without reducing delivery risk. **Traceability mapping:** - Requirements content is captured in Section 3 (EARS requirements). - Design content is captured in Section 4 (technical specifications). - Task breakdown and validation gates are captured in Section 5 (implementation plan). **Review trigger:** If scope expands beyond the certificate flaky-fix boundary, restore full artifact split into `requirements.md`, `design.md`, and `tasks.md`. --- ### 2) Research Findings #### 2.1 CI failure evidence Observed in `.github/logs/ci_failure.log`: - `WARNING: DATA RACE` - failing test: `TestCertificateHandler_List_WithCertificates` - conflicting access path: - **Write** from `(*CertificateService).SyncFromDisk` (`certificate_service.go`) - **Read** from `(*CertificateService).ListCertificates` (`certificate_service.go`), triggered via handler list path #### 2.2 Existing architecture and hotspot map **Service layer** - File: `backend/internal/services/certificate_service.go` - Key symbols: - `NewCertificateService(caddyDataDir string, db *gorm.DB) *CertificateService` - `SyncFromDisk() error` - `ListCertificates() ([]models.SSLCertificate, error)` - `InvalidateCache()` - `refreshCacheFromDB() error` **Handler layer** - File: `backend/internal/api/handlers/certificate_handler.go` - Key symbols: - `func (h *CertificateHandler) List(c *gin.Context)` - `func (h *CertificateHandler) Upload(c *gin.Context)` - `func (h *CertificateHandler) Delete(c *gin.Context)` **Route wiring** - File: `backend/internal/api/routes/routes.go` - Key symbol usage: - `services.NewCertificateService(caddyDataDir, db)` **Tests and setup patterns** - Files: - `backend/internal/api/handlers/certificate_handler_coverage_test.go` - `backend/internal/api/handlers/certificate_handler_test.go` - `backend/internal/api/handlers/certificate_handler_security_test.go` - `backend/internal/services/certificate_service_test.go` - `backend/internal/api/handlers/testdb.go` - Findings: - many handler tests instantiate the real constructor directly. - one test already includes a timing workaround (`time.Sleep`) due to startup race behavior. - service tests already use a helper pattern that avoids constructor-side async startup, which validates the architectural direction for deterministic initialization in tests. #### 2.3 Root cause summary Two explicit root-cause branches are tracked for this flaky area: - **Branch A — constructor startup race:** `NewCertificateService` starts async state mutation (`SyncFromDisk`) immediately, while handler-driven reads (`ListCertificates`) may execute concurrently. This matches CI race output in `.github/logs/ci_failure.log` for `TestCertificateHandler_List_WithCertificates`. - **Branch B — DB schema/setup ordering drift in tests:** some test paths reach service queries before deterministic schema setup is complete, producing intermittent setup-order failures such as `no such table: ssl_certificates` and `no such table: proxy_hosts`. Plan intent: address both branches together in one tight PR by making constructor behavior deterministic and enforcing migration/setup ordering in certificate handler test setup. --- ### 3) EARS Requirements (Source of Truth) #### 3.1 Ubiquitous requirements - THE SYSTEM SHALL construct `CertificateService` in a deterministic initialization state before first handler read in tests. - THE SYSTEM SHALL provide race-free access to certificate cache state during list operations. #### 3.2 Event-driven requirements - WHEN `NewCertificateService` returns, THE SYSTEM SHALL guarantee that any required first-read synchronization state is consistent for immediate `ListCertificates` calls. - WHEN `CertificateHandler.List` is called, THE SYSTEM SHALL return data or a deterministic error without concurrency side effects caused by service startup. - WHEN certificate handler tests initialize database state, THE SYSTEM SHALL complete required schema setup for certificate-related tables before service/handler execution. #### 3.3 Unwanted-behavior requirements - IF CI executes tests with race detector or parallel scheduling, THEN THE SYSTEM SHALL NOT produce data races between service initialization and list reads. - IF startup synchronization fails, THEN THE SYSTEM SHALL surface explicit errors and preserve handler-level HTTP error behavior. - IF test setup ordering is incorrect, THEN THE SYSTEM SHALL fail fast in setup with explicit diagnostics instead of deferred `no such table` runtime query failures. #### 3.4 Optional requirements - WHERE test-only construction paths are needed, THE SYSTEM SHALL use explicit test helpers/options instead of sleeps or timing assumptions. --- ### 4) Technical Specifications #### 4.1 Service design changes Primary design target (preferred): 1. Update `NewCertificateService` to remove nondeterministic startup behavior at construction boundary. 2. Ensure `SyncFromDisk` / cache state transition is serialized relative to early `ListCertificates` usage. 3. Keep existing mutex discipline explicit and auditable (single writer, safe readers). 4. Enforce deterministic DB setup ordering in certificate handler tests through shared setup helper behavior (migrate/setup before constructor and request execution). Design constraints: - No public API break for callers in `routes.go`. - No broad refactor of certificate business logic. - Keep behavior compatible with current handler response contracts. #### 4.2 Handler and route impact - `certificate_handler.go` should require no contract changes. - `routes.go` keeps existing construction call shape unless a minimally invasive constructor option path is required. #### 4.3 Data flow (target state) 1. Route wiring constructs certificate service. 2. Service enters stable initialized state before first list read path. 3. Handler `List` requests certificate data. 4. Service reads synchronized cache/DB state. 5. Handler responds `200` with deterministic payload or explicit `500` error. #### 4.4 Error handling matrix | Scenario | Expected behavior | Validation | |---|---|---| | Startup sync succeeds | List path returns stable data | handler list tests pass repeatedly | | Startup sync fails | Error bubbles predictably to handler | HTTP 500 tests remain deterministic | | Empty store | 200 + empty list (or existing contract shape) | existing empty-list tests pass | | Concurrent test execution | No race detector findings in certificate tests | `go test -race` targeted suite | | Test DB setup ordering incomplete | setup fails immediately with explicit setup error; no deferred query-time table errors | dedicated setup-ordering test + repeated run threshold | #### 4.5 API and schema impact - API endpoints: **no external contract changes**. - Request/response schema: **unchanged** for certificate list/upload/delete. - Database schema: **no migrations required**. --- ### 5) Implementation Plan (Phased) ## Phase 1: Playwright / UI-UX Baseline (Gate) Although this fix is backend-test focused, follow test protocol gate: - Reuse healthy E2E environment when possible; rebuild only if runtime image inputs changed. - Use exact task/suite gate: - task ID `shell: Test: E2E Playwright (FireFox) - Core: Certificates` - suite path executed by the task: `tests/core/certificates.spec.ts` - If rebuild is required by test protocol, run task ID `shell: Docker: Rebuild E2E Environment` before Playwright. Deliverables: - passing targeted Playwright suite for certificate list interactions - archived output in standard test artifacts ## Phase 2: Backend Service Stabilization Scope: - `backend/internal/services/certificate_service.go` Tasks: 1. Make constructor initialization deterministic at first-use boundary. 2. Remove startup timing dependence that allows early `ListCertificates` to overlap with constructor-initiated sync mutation. 3. Preserve current cache invalidation and DB refresh semantics. 4. Keep lock lifecycle simple and reviewable. Complexity: **Medium** (single component, concurrency-sensitive) ## Phase 3: Backend Test Hardening Scope: - `backend/internal/api/handlers/certificate_handler_coverage_test.go` - `backend/internal/api/handlers/certificate_handler_test.go` - `backend/internal/api/handlers/certificate_handler_security_test.go` - optional shared test helpers: - `backend/internal/api/handlers/testdb.go` Tasks: 1. Remove sleep-based or timing-dependent assumptions. 2. Standardize deterministic service setup helper for handler tests. 3. Ensure flaky case (`TestCertificateHandler_List_WithCertificates`) is fully deterministic. 4. Preserve assertion intent and API-level behavior checks. 5. Add explicit setup-ordering validation in tests to guarantee required schema migration/setup completes before handler/service invocation. Complexity: **Medium** (many callsites, low logic risk) ## Phase 4: Integration & Validation Tasks: 1. Run reproducible stability stress loop for the known flaky case via task ID `shell: Test: Backend Flaky - Certificate List Stability Loop`. - **Task command payload:** - `mkdir -p test-results/flaky && cd /projects/Charon && go test ./backend/internal/api/handlers -run '^TestCertificateHandler_List_WithCertificates$' -count=100 -shuffle=on -parallel=8 -json 2>&1 | tee test-results/flaky/cert-list-stability.jsonl` - **Artifactized logging requirement:** persist `test-results/flaky/cert-list-stability.jsonl` for reproducibility and CI comparison. - **Pass threshold:** `100/100` successful runs, zero failures. 2. Run race-mode stress for the same path via task ID `shell: Test: Backend Flaky - Certificate List Race Loop`. - **Task command payload:** - `mkdir -p test-results/flaky && cd /projects/Charon && go test -race ./backend/internal/api/handlers -run '^TestCertificateHandler_List_WithCertificates$' -count=30 -shuffle=on -parallel=8 -json 2>&1 | tee test-results/flaky/cert-list-race.jsonl` - **Artifactized logging requirement:** persist `test-results/flaky/cert-list-race.jsonl`. - **Pass threshold:** exit code `0` and zero `WARNING: DATA RACE` occurrences. 3. Run setup-ordering validation loop via task ID `shell: Test: Backend Flaky - Certificate DB Setup Ordering Loop`. - **Task command payload:** - `mkdir -p test-results/flaky && cd /projects/Charon && go test ./backend/internal/api/handlers -run '^TestCertificateHandler_DBSetupOrdering' -count=50 -shuffle=on -parallel=8 -json 2>&1 | tee test-results/flaky/cert-db-setup-ordering.jsonl` - **Pass threshold:** `50/50` successful runs and zero `no such table: ssl_certificates|proxy_hosts` messages from positive-path setup runs. 4. Run focused certificate handler regression suite via task ID `shell: Test: Backend Flaky - Certificate Handler Focused Regression`. - **Task command payload:** - `mkdir -p test-results/flaky && cd /projects/Charon && go test ./backend/internal/api/handlers -run '^TestCertificateHandler_' -count=1 -json 2>&1 | tee test-results/flaky/cert-handler-regression.jsonl` - **Pass threshold:** all selected tests pass in one clean run. 5. Execute patch-coverage preflight via existing project task ID `shell: Test: Local Patch Report`. - **Task command payload:** `bash scripts/local-patch-report.sh` - **Pass threshold:** both artifacts exist: `test-results/local-patch-report.md` and `test-results/local-patch-report.json`. Task definition status for Phase 4 gates: - Existing task ID in `.vscode/tasks.json`: - `shell: Test: Local Patch Report` - New task IDs to add in `.vscode/tasks.json` with the exact command payloads above: - `shell: Test: Backend Flaky - Certificate List Stability Loop` - `shell: Test: Backend Flaky - Certificate List Race Loop` - `shell: Test: Backend Flaky - Certificate DB Setup Ordering Loop` - `shell: Test: Backend Flaky - Certificate Handler Focused Regression` Complexity: **Low-Medium** ## Phase 5: Documentation & Operational Hygiene Tasks: 1. Update `docs/plans/current_spec.md` (this file) if implementation details evolve. 2. Record final decision outcomes and any deferred cleanup tasks. 3. Defer unrelated repository config hygiene (`.gitignore`, `codecov.yml`, `.dockerignore`, `Dockerfile`) unless a direct causal link to this flaky test is proven during implementation. Complexity: **Low** --- ### 6) PR Slicing Strategy ### Decision **Primary decision: single PR** for this fix. Why: - Scope is tightly coupled (constructor semantics + related tests). - Minimizes context switching and user review requests. - Reduces risk of partially landed concurrency behavior. ### Trigger reasons to split into multiple PRs Split only if one or more occur: - Constructor change causes wider runtime behavior concerns requiring staged rollout. - Test hardening touches many unrelated test modules beyond certificate handlers. - Validation reveals additional root-cause branches outside certificate service/test setup ordering. ### Ordered fallback slices (if split is required) #### PR-1: Service Determinism Core Scope: - `backend/internal/services/certificate_service.go` Dependencies: - none Acceptance criteria: - service constructor/list path no longer exhibits startup race in targeted race tests - no public API break at route wiring callsite Rollback/contingency: - revert constructor change only; preserve tests unchanged for quick recovery #### PR-2: Handler Test Determinism Scope: - `backend/internal/api/handlers/certificate_handler_coverage_test.go` - `backend/internal/api/handlers/certificate_handler_test.go` - `backend/internal/api/handlers/certificate_handler_security_test.go` - optional helper consolidation in `backend/internal/api/handlers/testdb.go` Dependencies: - PR-1 merged (preferred) Acceptance criteria: - `TestCertificateHandler_List_WithCertificates` stable across repeated CI-style runs - no `time.Sleep` timing guard required for constructor race avoidance Rollback/contingency: - revert only test refactor while keeping stable service change --- ### 7) Acceptance Criteria (Definition of Done) 1. Stability gate: task ID `shell: Test: Backend Flaky - Certificate List Stability Loop` passes `100/100`. 2. Race gate: task ID `shell: Test: Backend Flaky - Certificate List Race Loop` passes with zero race reports. 3. Setup-ordering gate: task ID `shell: Test: Backend Flaky - Certificate DB Setup Ordering Loop` passes `50/50`, with no `no such table: ssl_certificates|proxy_hosts` in positive-path setup runs. 4. Regression gate: task ID `shell: Test: Backend Flaky - Certificate Handler Focused Regression` passes. 5. Playwright baseline gate is completed via task ID `shell: Test: E2E Playwright (FireFox) - Core: Certificates` (suite: `tests/core/certificates.spec.ts`), with task ID `shell: Docker: Rebuild E2E Environment` run first only when rebuild criteria are met. 6. Patch coverage preflight task ID `shell: Test: Local Patch Report` generates `test-results/local-patch-report.md` and `test-results/local-patch-report.json`. 7. Scope gate: no changes to `.gitignore`, `codecov.yml`, `.dockerignore`, or `Dockerfile` unless directly required to fix this flaky test. 8. Reproducibility gate: stress-loop outputs are artifactized under `test-results/flaky/` and retained in PR evidence (`cert-list-stability.jsonl`, `cert-list-race.jsonl`, `cert-db-setup-ordering.jsonl`, `cert-handler-regression.jsonl`). 9. If any validation fails, failure evidence and explicit follow-up tasks are recorded before completion. --- ### 8) Risks and Mitigations | Risk | Impact | Mitigation | |---|---|---| | Constructor behavior change impacts startup timing | medium | keep API stable; run targeted route/handler regression tests | | Over-fixing spreads beyond certificate scope | medium | constrain edits to service + certificate tests only | | DB setup-ordering fixes accidentally mask true migration problems | medium | fail fast during setup with explicit diagnostics and dedicated setup-ordering tests | | Hidden race persists in adjacent tests | medium | run repeated/race-targeted suite; expand only if evidence requires | | Out-of-scope config churn creates review noise | low | explicitly defer unrelated config hygiene from this PR | --- ### 9) Handoff After this plan is approved: 1. Delegate execution to Supervisor/implementation agent with this spec as the source of truth. 2. Execute phases in order with validation gates between phases. 3. Keep PR scope narrow and deterministic; prefer single PR unless split triggers are hit. - Migration failures emit structured logs and increment migration-failure counters. - Legacy path execution attempts increment explicit counters and write warning/error logs. ## 10) Critical Risks and Mitigations | Risk | Impact | Mitigation | |---|---|---| | Legacy provider records fail conversion under Notify-only runtime | Notification send failures for affected providers | Transactional migration, explicit `migration_state`/`migration_error`, actionable errors, focused migration tests | | Hidden fallback assumptions still exist in tests/code | False confidence and production drift | Sentinel-based tests that fail on fallback invocation; remove fallback wiring | | Rollout outage without fallback safety net | Increased short-term operational risk | Smaller PR slices, strict preflight E2E/tests/scans, fast revert procedure | | UI/API contract mismatch during canonicalization | Broken settings UX | Canonical endpoint tests across backend/frontend/E2E | | Silent regression in legacy-path blocking instrumentation | Fallback attempts occur without operational visibility, delaying incident response | Add explicit legacy-attempt counters/logs to DoD, dashboard alert thresholds, and validation step that asserts non-zero logging on injected fallback-attempt test path | ## 11) Handoff After approval, delegate execution to the `Supervisor` agent with this file as the single active plan baseline. - Migration failures emit structured logs and increment migration-failure counters. - Legacy path execution attempts increment explicit counters and write warning/error logs. ## 10) Critical Risks and Mitigations | Risk | Impact | Mitigation | |---|---|---| | Legacy provider records fail conversion under Notify-only runtime | Notification send failures for affected providers | Transactional migration, explicit `migration_state`/`migration_error`, actionable errors, focused migration tests | | Hidden fallback assumptions still exist in tests/code | False confidence and production drift | Sentinel-based tests that fail on fallback invocation; remove fallback wiring | | Rollout outage without fallback safety net | Increased short-term operational risk | Smaller PR slices, strict preflight E2E/tests/scans, fast revert procedure | | UI/API contract mismatch during canonicalization | Broken settings UX | Canonical endpoint tests across backend/frontend/E2E | | Silent regression in legacy-path blocking instrumentation | Fallback attempts occur without operational visibility, delaying incident response | Add explicit legacy-attempt counters/logs to DoD, dashboard alert thresholds, and validation step that asserts non-zero logging on injected fallback-attempt test path | ## 11) Handoff After approval, delegate execution to the `Supervisor` agent with this file as the single active plan baseline.