31 KiB
QA Test Failure Remediation Addendum (2026-02-19)
Scope
- In scope: test instability fixes and UI expectation alignment for the currently failing QA gates only.
- Out of scope: feature additions, backend behavior changes unrelated to failing assertions, design refactors, and unrelated test cleanup.
Failure Inventory (Source: docs/reports/qa_report.md)
tests/settings/notifications.spec.ts
- Failing case A: provider enable/disable checkbox interaction fails (click interception/out-of-viewport).
- Failing case B: invalid template preview assertion matches hidden text instead of visible error feedback.
tests/security-enforcement/zzz-security-ui/system-security-settings.spec.ts
- Failing case: save success toast not visible in
should save general settings successfully.
frontend/src/pages/__tests__/Security.functional.test.tsx
- Failing case: expects Notifications button disabled when Cerberus is disabled, but current UI behavior keeps header action enabled.
- Local patch preflight warning
scripts/local-patch-report.shexits warning mode whenfrontend/coverage/lcov.infois missing, while still producing required artifacts.
Root-Cause Hypotheses (Per Failing Test)
1) tests/settings/notifications.spec.ts
- Case A (enable/disable provider):
- Locator strategy is fragile (
checkboxwith sibling-label heuristic and fallbackfirst()), so interaction can target a non-actionable/offscreen element after modal/card layout shifts. setCheckedcan still fail when the selected checkbox is occluded by sticky UI layers or not in viewport; no explicit scroll/visibility stabilization is performed before toggling.
- Locator strategy is fragile (
- Case B (invalid template preview error):
- Assertion
getByText(/error|failed|invalid/i).first()is broad and may resolve to hidden/static text (for example option labels or non-toast content) rather than user-visible error feedback. - Test does not scope to live regions, toast container, or form-level validation block, causing false target selection.
- Assertion
2) tests/security-enforcement/zzz-security-ui/system-security-settings.spec.ts
- Save operation likely succeeds (API 200 observed), but toast visibility assertion is timing- and selector-sensitive.
- Current assertion mixes generic role/status and data-testid checks without first anchoring to a deterministic post-save signal (response + stable UI state), so ephemeral toast rendering window is missed in some runs.
3) frontend/src/pages/__tests__/Security.functional.test.tsx
- Test expectation drift: spec assumes Notifications header button must be disabled when Cerberus is off.
- Current UI behavior appears to allow opening Notification settings regardless of Cerberus state (feature warning applies to enforcement controls, not necessarily header actions), making test assertion stale.
4) scripts/local-patch-report.sh / coverage preflight
- Script behavior is by design: in missing-input mode it writes
test-results/local-patch-report.mdandtest-results/local-patch-report.jsonthen exits non-zero. - QA gate interpretation is misaligned: artifact generation succeeds, but warning mode is treated as failure rather than preflight warning requiring upstream coverage generation.
Ordered Minimal Fix Steps (Test + UI Expectation Alignment Only)
- Stabilize Notifications checkbox interaction
Target:
tests/settings/notifications.spec.ts
- Replace fallback checkbox locator with deterministic test-id or form-scoped role query tied to Enabled control only.
- Add pre-action stabilization: ensure element is attached, visible, enabled, and scrolled into view before toggle.
- Prefer click on associated label/control pair in modal context if native checkbox is visually wrapped.
- Narrow invalid-template error assertion to visible feedback channel
Target:
tests/settings/notifications.spec.ts
- Replace generic text matcher with scoped locator order: explicit error toast (
data-testid/role alert), then visible form validation container. - Assert visibility on that scoped locator and avoid
.first()on broad text searches.
- Stabilize system settings save-success verification
Target:
tests/security-enforcement/zzz-security-ui/system-security-settings.spec.ts
- Keep API response wait, then assert one deterministic success channel (shared toast helper or single canonical toast locator) with bounded timeout.
- Avoid mixed fallback selectors that may match stale/inactive nodes.
- Align unit test expectation with intended UI contract
Target:
frontend/src/pages/__tests__/Security.functional.test.tsx(+frontend/src/pages/Security.tsxonly if behavior contract requires it)
- Confirm intended contract from current security page behavior and copy text semantics.
- Minimal change path: update assertion to expected enabled state if product behavior intentionally permits Notifications action while Cerberus is disabled.
- Alternate (only if contract requires disabled action): adjust button disabled logic in
Security.tsxand keep existing unit expectation. - Choose exactly one path; do not change unrelated security-card logic.
- Clarify patch-preflight gate interpretation and sequencing
Targets:
docs/reports/qa_report.md(evidence wording), optional test/runbook docs if needed
- Record that preflight artifacts were produced in warning mode due to missing frontend coverage input.
- Keep remediation action in validation sequence: generate frontend coverage before local-patch-report when full green gate is required.
Validation Sequence (Return Full Gate to Green)
- Run targeted Playwright notifications suite:
tests/settings/notifications.spec.ts(Firefox project as currently gated). - Run targeted security UI suite:
tests/security-enforcement/zzz-security-ui/system-security-settings.spec.ts(security-testsproject). - Run focused frontend unit test file:
frontend/src/pages/__tests__/Security.functional.test.tsx. - Generate frontend coverage output (
frontend/coverage/lcov.info) via frontend coverage task/script. - Re-run local patch preflight and verify both artifacts exist and warning is cleared.
- Re-run full required QA gate order for this branch:
- Playwright targeted suites
- Local patch preflight
- Backend/frontend coverage gates
- Type check
- Pre-commit all files
- Security scans (Trivy, Docker image scan, CodeQL Go/JS + findings gate)
Acceptance Criteria (Addendum-Specific)
- All three failing test files pass in their targeted runs.
- No assertion relies on broad hidden-text matching for error/success feedback.
- Notifications unit assertion reflects the actual intended UI contract (test and implementation consistent).
test-results/local-patch-report.mdandtest-results/local-patch-report.jsonare present, and preflight no longer reports missingfrontend/coverage/lcov.info.- No unrelated feature or architectural change is introduced.
Notify-only PR1 Remediation Addendum (Supervisor Unblock)
Date: 2026-02-19 Status: Active addendum for PR1 unblock. Precedence: This addendum overrides conflicting sections below for PR1 execution.
Scope Lock
- In scope: Notify-only cutover for external notification provider dispatch, legacy provider migration completeness, and minimal QA unblock evidence.
- Out of scope: certificate flaky test work and any unrelated certificate/runtime refactors.
Dispatch Source of Truth (Decision)
- Selected approach: direct hard-cutover runtime logic (not router+flags dispatch).
- Runtime authority is in:
backend/internal/services/notification_service.gosupportsJSONTemplates(...)(*NotificationService).SendExternal(...)(*NotificationService).TestProvider(...)legacyFallbackInvocationError(...)backend/internal/api/routes/routes.gonotificationService := services.NewNotificationService(db)
backend/internal/notifications/router.gois non-authoritative for PR1 runtime dispatch and must not be introduced into runtime flow for this unblock.
Legacy Provider Migration Completeness (Notify-only Runtime)
- Add a single reconciliation pass in
backend/internal/services/notification_service.go:
- target function:
EnsureNotifyOnlyProviderMigration(ctx context.Context) error - invoked once from route boot path after
NewNotificationService(db)wiring.
- Reconcile each
notification_providersrow to terminal state:
- Supported types (
webhook,discord,slack,gotify,generic):engine="notify_v1",migration_state="migrated",migration_error="",last_migrated_at=now
- Unsupported types under notify-only (for example
telegram):migration_state="failed",migration_error="unsupported provider type in notify-only runtime",enabled=false,last_migrated_at=now
- Preserve prior origin URL in
legacy_urlwhen mutating ambiguous/legacy rows.
- Keep fail-closed runtime semantics:
SendExternal/TestProviderreject unsupported providers without legacy fallback.
- Migration completeness gate (must be zero):
- enabled providers where
migration_stateis empty,pending, orfailed.
Ordered Remediation Tasks (Minimal, QA-first)
- Update plan scope and precedence in
docs/plans/current_spec.md(this addendum). - Lock runtime dispatch authority to direct notify-only logic in
backend/internal/services/notification_service.go. - Add migration reconciliation pass in
backend/internal/services/notification_service.goand call it frombackend/internal/api/routes/routes.go. - Preserve server-managed migration fields behavior in
backend/internal/api/handlers/notification_provider_handler.go. - Capture unblock evidence in
docs/reports/qa_report.md.
Test List (Execution Order)
go test ./backend/internal/services -run 'TestNotificationService_SendExternal_NotifyOnlyBlocksLegacyFallback|TestSendExternal_UnsupportedProviderFailsClosed|TestTestProvider_NotifyOnlyRejectsUnsupportedProvider|TestTestProvider_DiscordUsesNotifyPathInPR1'go test ./backend/internal/api/handlers -run 'TestNotificationProviderHandler_CreateIgnoresServerManagedMigrationFields|TestNotificationProviderHandler_UpdatePreservesServerManagedMigrationFields'- Add/run migration completeness test:
go test ./backend/internal/services -run 'TestNotificationService_EnsureNotifyOnlyProviderMigration'
npx playwright test tests/settings/notifications.spec.ts --project=firefoxbash scripts/local-patch-report.sh(must emittest-results/local-patch-report.mdandtest-results/local-patch-report.json)
Acceptance Criteria (PR1 Unblock)
current_spec.mdclearly reflects Notify-only PR1 scope and excludes certificate flaky scope.- Exactly one backend dispatch source-of-truth is documented and used: direct hard-cutover in
NotificationService. - Legacy provider migration reconciliation exists and is covered by targeted tests.
- No enabled provider remains in non-terminal migration state under Notify-only runtime.
- QA evidence is updated for Supervisor re-review.
Fix Flaky Go Test: TestCertificateHandler_List_WithCertificates
1) Introduction
This specification defines a focused, low-risk plan to eliminate intermittent CI failures for:
backend/internal/api/handlers.TestCertificateHandler_List_WithCertificates
while preserving production behavior in:
backend/internal/services/certificate_service.gobackend/internal/api/handlers/certificate_handler.gobackend/internal/api/routes/routes.go
Primary objective:
- Remove nondeterministic startup concurrency in certificate service initialization that intermittently races with handler list calls under CI parallelism/race detection.
Decision Record - 2026-02-18
Decision: For this targeted flaky-fix, docs/plans/current_spec.md is the temporary single source of truth instead of creating separate requirements.md, design.md, and tasks.md artifacts.
Context: The scope is intentionally narrow (single flaky test root-cause + deterministic validation loop), and splitting artifacts now would add process overhead without reducing delivery risk.
Traceability mapping:
- Requirements content is captured in Section 3 (EARS requirements).
- Design content is captured in Section 4 (technical specifications).
- Task breakdown and validation gates are captured in Section 5 (implementation plan).
Review trigger: If scope expands beyond the certificate flaky-fix boundary, restore full artifact split into requirements.md, design.md, and tasks.md.
2) Research Findings
2.1 CI failure evidence
Observed in .github/logs/ci_failure.log:
WARNING: DATA RACE- failing test:
TestCertificateHandler_List_WithCertificates - conflicting access path:
- Write from
(*CertificateService).SyncFromDisk(certificate_service.go) - Read from
(*CertificateService).ListCertificates(certificate_service.go), triggered via handler list path
- Write from
2.2 Existing architecture and hotspot map
Service layer
- File:
backend/internal/services/certificate_service.go - Key symbols:
NewCertificateService(caddyDataDir string, db *gorm.DB) *CertificateServiceSyncFromDisk() errorListCertificates() ([]models.SSLCertificate, error)InvalidateCache()refreshCacheFromDB() error
Handler layer
- File:
backend/internal/api/handlers/certificate_handler.go - Key symbols:
func (h *CertificateHandler) List(c *gin.Context)func (h *CertificateHandler) Upload(c *gin.Context)func (h *CertificateHandler) Delete(c *gin.Context)
Route wiring
- File:
backend/internal/api/routes/routes.go - Key symbol usage:
services.NewCertificateService(caddyDataDir, db)
Tests and setup patterns
- Files:
backend/internal/api/handlers/certificate_handler_coverage_test.gobackend/internal/api/handlers/certificate_handler_test.gobackend/internal/api/handlers/certificate_handler_security_test.gobackend/internal/services/certificate_service_test.gobackend/internal/api/handlers/testdb.go
- Findings:
- many handler tests instantiate the real constructor directly.
- one test already includes a timing workaround (
time.Sleep) due to startup race behavior. - service tests already use a helper pattern that avoids constructor-side async startup, which validates the architectural direction for deterministic initialization in tests.
2.3 Root cause summary
Two explicit root-cause branches are tracked for this flaky area:
- Branch A — constructor startup race:
NewCertificateServicestarts async state mutation (SyncFromDisk) immediately, while handler-driven reads (ListCertificates) may execute concurrently. This matches CI race output in.github/logs/ci_failure.logforTestCertificateHandler_List_WithCertificates. - Branch B — DB schema/setup ordering drift in tests: some test paths reach service queries before deterministic schema setup is complete, producing intermittent setup-order failures such as
no such table: ssl_certificatesandno such table: proxy_hosts.
Plan intent: address both branches together in one tight PR by making constructor behavior deterministic and enforcing migration/setup ordering in certificate handler test setup.
3) EARS Requirements (Source of Truth)
3.1 Ubiquitous requirements
- THE SYSTEM SHALL construct
CertificateServicein a deterministic initialization state before first handler read in tests. - THE SYSTEM SHALL provide race-free access to certificate cache state during list operations.
3.2 Event-driven requirements
- WHEN
NewCertificateServicereturns, THE SYSTEM SHALL guarantee that any required first-read synchronization state is consistent for immediateListCertificatescalls. - WHEN
CertificateHandler.Listis called, THE SYSTEM SHALL return data or a deterministic error without concurrency side effects caused by service startup. - WHEN certificate handler tests initialize database state, THE SYSTEM SHALL complete required schema setup for certificate-related tables before service/handler execution.
3.3 Unwanted-behavior requirements
- IF CI executes tests with race detector or parallel scheduling, THEN THE SYSTEM SHALL NOT produce data races between service initialization and list reads.
- IF startup synchronization fails, THEN THE SYSTEM SHALL surface explicit errors and preserve handler-level HTTP error behavior.
- IF test setup ordering is incorrect, THEN THE SYSTEM SHALL fail fast in setup with explicit diagnostics instead of deferred
no such tableruntime query failures.
3.4 Optional requirements
- WHERE test-only construction paths are needed, THE SYSTEM SHALL use explicit test helpers/options instead of sleeps or timing assumptions.
4) Technical Specifications
4.1 Service design changes
Primary design target (preferred):
- Update
NewCertificateServiceto remove nondeterministic startup behavior at construction boundary. - Ensure
SyncFromDisk/ cache state transition is serialized relative to earlyListCertificatesusage. - Keep existing mutex discipline explicit and auditable (single writer, safe readers).
- Enforce deterministic DB setup ordering in certificate handler tests through shared setup helper behavior (migrate/setup before constructor and request execution).
Design constraints:
- No public API break for callers in
routes.go. - No broad refactor of certificate business logic.
- Keep behavior compatible with current handler response contracts.
4.2 Handler and route impact
certificate_handler.goshould require no contract changes.routes.gokeeps existing construction call shape unless a minimally invasive constructor option path is required.
4.3 Data flow (target state)
- Route wiring constructs certificate service.
- Service enters stable initialized state before first list read path.
- Handler
Listrequests certificate data. - Service reads synchronized cache/DB state.
- Handler responds
200with deterministic payload or explicit500error.
4.4 Error handling matrix
| Scenario | Expected behavior | Validation |
|---|---|---|
| Startup sync succeeds | List path returns stable data | handler list tests pass repeatedly |
| Startup sync fails | Error bubbles predictably to handler | HTTP 500 tests remain deterministic |
| Empty store | 200 + empty list (or existing contract shape) | existing empty-list tests pass |
| Concurrent test execution | No race detector findings in certificate tests | go test -race targeted suite |
| Test DB setup ordering incomplete | setup fails immediately with explicit setup error; no deferred query-time table errors | dedicated setup-ordering test + repeated run threshold |
4.5 API and schema impact
- API endpoints: no external contract changes.
- Request/response schema: unchanged for certificate list/upload/delete.
- Database schema: no migrations required.
5) Implementation Plan (Phased)
Phase 1: Playwright / UI-UX Baseline (Gate)
Although this fix is backend-test focused, follow test protocol gate:
- Reuse healthy E2E environment when possible; rebuild only if runtime image inputs changed.
- Use exact task/suite gate:
- task ID
shell: Test: E2E Playwright (FireFox) - Core: Certificates - suite path executed by the task:
tests/core/certificates.spec.ts
- task ID
- If rebuild is required by test protocol, run task ID
shell: Docker: Rebuild E2E Environmentbefore Playwright.
Deliverables:
- passing targeted Playwright suite for certificate list interactions
- archived output in standard test artifacts
Phase 2: Backend Service Stabilization
Scope:
backend/internal/services/certificate_service.go
Tasks:
- Make constructor initialization deterministic at first-use boundary.
- Remove startup timing dependence that allows early
ListCertificatesto overlap with constructor-initiated sync mutation. - Preserve current cache invalidation and DB refresh semantics.
- Keep lock lifecycle simple and reviewable.
Complexity: Medium (single component, concurrency-sensitive)
Phase 3: Backend Test Hardening
Scope:
backend/internal/api/handlers/certificate_handler_coverage_test.gobackend/internal/api/handlers/certificate_handler_test.gobackend/internal/api/handlers/certificate_handler_security_test.go- optional shared test helpers:
backend/internal/api/handlers/testdb.go
Tasks:
- Remove sleep-based or timing-dependent assumptions.
- Standardize deterministic service setup helper for handler tests.
- Ensure flaky case (
TestCertificateHandler_List_WithCertificates) is fully deterministic. - Preserve assertion intent and API-level behavior checks.
- Add explicit setup-ordering validation in tests to guarantee required schema migration/setup completes before handler/service invocation.
Complexity: Medium (many callsites, low logic risk)
Phase 4: Integration & Validation
Tasks:
- Run reproducible stability stress loop for the known flaky case via task ID
shell: Test: Backend Flaky - Certificate List Stability Loop.
- Task command payload:
mkdir -p test-results/flaky && cd /projects/Charon && go test ./backend/internal/api/handlers -run '^TestCertificateHandler_List_WithCertificates$' -count=100 -shuffle=on -parallel=8 -json 2>&1 | tee test-results/flaky/cert-list-stability.jsonl
- Artifactized logging requirement: persist
test-results/flaky/cert-list-stability.jsonlfor reproducibility and CI comparison. - Pass threshold:
100/100successful runs, zero failures.
- Run race-mode stress for the same path via task ID
shell: Test: Backend Flaky - Certificate List Race Loop.
- Task command payload:
mkdir -p test-results/flaky && cd /projects/Charon && go test -race ./backend/internal/api/handlers -run '^TestCertificateHandler_List_WithCertificates$' -count=30 -shuffle=on -parallel=8 -json 2>&1 | tee test-results/flaky/cert-list-race.jsonl
- Artifactized logging requirement: persist
test-results/flaky/cert-list-race.jsonl. - Pass threshold: exit code
0and zeroWARNING: DATA RACEoccurrences.
- Run setup-ordering validation loop via task ID
shell: Test: Backend Flaky - Certificate DB Setup Ordering Loop.
- Task command payload:
mkdir -p test-results/flaky && cd /projects/Charon && go test ./backend/internal/api/handlers -run '^TestCertificateHandler_DBSetupOrdering' -count=50 -shuffle=on -parallel=8 -json 2>&1 | tee test-results/flaky/cert-db-setup-ordering.jsonl
- Pass threshold:
50/50successful runs and zerono such table: ssl_certificates|proxy_hostsmessages from positive-path setup runs.
- Run focused certificate handler regression suite via task ID
shell: Test: Backend Flaky - Certificate Handler Focused Regression.
- Task command payload:
mkdir -p test-results/flaky && cd /projects/Charon && go test ./backend/internal/api/handlers -run '^TestCertificateHandler_' -count=1 -json 2>&1 | tee test-results/flaky/cert-handler-regression.jsonl
- Pass threshold: all selected tests pass in one clean run.
- Execute patch-coverage preflight via existing project task ID
shell: Test: Local Patch Report.
- Task command payload:
bash scripts/local-patch-report.sh - Pass threshold: both artifacts exist:
test-results/local-patch-report.mdandtest-results/local-patch-report.json.
Task definition status for Phase 4 gates:
- Existing task ID in
.vscode/tasks.json:shell: Test: Local Patch Report
- New task IDs to add in
.vscode/tasks.jsonwith the exact command payloads above:shell: Test: Backend Flaky - Certificate List Stability Loopshell: Test: Backend Flaky - Certificate List Race Loopshell: Test: Backend Flaky - Certificate DB Setup Ordering Loopshell: Test: Backend Flaky - Certificate Handler Focused Regression
Complexity: Low-Medium
Phase 5: Documentation & Operational Hygiene
Tasks:
- Update
docs/plans/current_spec.md(this file) if implementation details evolve. - Record final decision outcomes and any deferred cleanup tasks.
- Defer unrelated repository config hygiene (
.gitignore,codecov.yml,.dockerignore,Dockerfile) unless a direct causal link to this flaky test is proven during implementation.
Complexity: Low
6) PR Slicing Strategy
Decision
Primary decision: single PR for this fix.
Why:
- Scope is tightly coupled (constructor semantics + related tests).
- Minimizes context switching and user review requests.
- Reduces risk of partially landed concurrency behavior.
Trigger reasons to split into multiple PRs
Split only if one or more occur:
- Constructor change causes wider runtime behavior concerns requiring staged rollout.
- Test hardening touches many unrelated test modules beyond certificate handlers.
- Validation reveals additional root-cause branches outside certificate service/test setup ordering.
Ordered fallback slices (if split is required)
PR-1: Service Determinism Core
Scope:
backend/internal/services/certificate_service.go
Dependencies:
- none
Acceptance criteria:
- service constructor/list path no longer exhibits startup race in targeted race tests
- no public API break at route wiring callsite
Rollback/contingency:
- revert constructor change only; preserve tests unchanged for quick recovery
PR-2: Handler Test Determinism
Scope:
backend/internal/api/handlers/certificate_handler_coverage_test.gobackend/internal/api/handlers/certificate_handler_test.gobackend/internal/api/handlers/certificate_handler_security_test.go- optional helper consolidation in
backend/internal/api/handlers/testdb.go
Dependencies:
- PR-1 merged (preferred)
Acceptance criteria:
TestCertificateHandler_List_WithCertificatesstable across repeated CI-style runs- no
time.Sleeptiming guard required for constructor race avoidance
Rollback/contingency:
- revert only test refactor while keeping stable service change
7) Acceptance Criteria (Definition of Done)
- Stability gate: task ID
shell: Test: Backend Flaky - Certificate List Stability Looppasses100/100. - Race gate: task ID
shell: Test: Backend Flaky - Certificate List Race Looppasses with zero race reports. - Setup-ordering gate: task ID
shell: Test: Backend Flaky - Certificate DB Setup Ordering Looppasses50/50, with nono such table: ssl_certificates|proxy_hostsin positive-path setup runs. - Regression gate: task ID
shell: Test: Backend Flaky - Certificate Handler Focused Regressionpasses. - Playwright baseline gate is completed via task ID
shell: Test: E2E Playwright (FireFox) - Core: Certificates(suite:tests/core/certificates.spec.ts), with task IDshell: Docker: Rebuild E2E Environmentrun first only when rebuild criteria are met. - Patch coverage preflight task ID
shell: Test: Local Patch Reportgeneratestest-results/local-patch-report.mdandtest-results/local-patch-report.json. - Scope gate: no changes to
.gitignore,codecov.yml,.dockerignore, orDockerfileunless directly required to fix this flaky test. - Reproducibility gate: stress-loop outputs are artifactized under
test-results/flaky/and retained in PR evidence (cert-list-stability.jsonl,cert-list-race.jsonl,cert-db-setup-ordering.jsonl,cert-handler-regression.jsonl). - If any validation fails, failure evidence and explicit follow-up tasks are recorded before completion.
8) Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Constructor behavior change impacts startup timing | medium | keep API stable; run targeted route/handler regression tests |
| Over-fixing spreads beyond certificate scope | medium | constrain edits to service + certificate tests only |
| DB setup-ordering fixes accidentally mask true migration problems | medium | fail fast during setup with explicit diagnostics and dedicated setup-ordering tests |
| Hidden race persists in adjacent tests | medium | run repeated/race-targeted suite; expand only if evidence requires |
| Out-of-scope config churn creates review noise | low | explicitly defer unrelated config hygiene from this PR |
9) Handoff
After this plan is approved:
- Delegate execution to Supervisor/implementation agent with this spec as the source of truth.
- Execute phases in order with validation gates between phases.
- Keep PR scope narrow and deterministic; prefer single PR unless split triggers are hit.
- Migration failures emit structured logs and increment migration-failure counters.
- Legacy path execution attempts increment explicit counters and write warning/error logs.
10) Critical Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Legacy provider records fail conversion under Notify-only runtime | Notification send failures for affected providers | Transactional migration, explicit migration_state/migration_error, actionable errors, focused migration tests |
| Hidden fallback assumptions still exist in tests/code | False confidence and production drift | Sentinel-based tests that fail on fallback invocation; remove fallback wiring |
| Rollout outage without fallback safety net | Increased short-term operational risk | Smaller PR slices, strict preflight E2E/tests/scans, fast revert procedure |
| UI/API contract mismatch during canonicalization | Broken settings UX | Canonical endpoint tests across backend/frontend/E2E |
| Silent regression in legacy-path blocking instrumentation | Fallback attempts occur without operational visibility, delaying incident response | Add explicit legacy-attempt counters/logs to DoD, dashboard alert thresholds, and validation step that asserts non-zero logging on injected fallback-attempt test path |
11) Handoff
After approval, delegate execution to the Supervisor agent with this file as the single active plan baseline.
- Migration failures emit structured logs and increment migration-failure counters.
- Legacy path execution attempts increment explicit counters and write warning/error logs.
10) Critical Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Legacy provider records fail conversion under Notify-only runtime | Notification send failures for affected providers | Transactional migration, explicit migration_state/migration_error, actionable errors, focused migration tests |
| Hidden fallback assumptions still exist in tests/code | False confidence and production drift | Sentinel-based tests that fail on fallback invocation; remove fallback wiring |
| Rollout outage without fallback safety net | Increased short-term operational risk | Smaller PR slices, strict preflight E2E/tests/scans, fast revert procedure |
| UI/API contract mismatch during canonicalization | Broken settings UX | Canonical endpoint tests across backend/frontend/E2E |
| Silent regression in legacy-path blocking instrumentation | Fallback attempts occur without operational visibility, delaying incident response | Add explicit legacy-attempt counters/logs to DoD, dashboard alert thresholds, and validation step that asserts non-zero logging on injected fallback-attempt test path |
11) Handoff
After approval, delegate execution to the Supervisor agent with this file as the single active plan baseline.