chore: re-enable security e2e scaffolding and triage gaps
@@ -1,87 +1,103 @@
|
||||
# E2E Shard Failures – Run 21377510901 (PR 550)
|
||||
# Re-enable Security Playwright Tests and Run Full E2E (feature/beta-release)
|
||||
|
||||
**Issue**: CI shards are failing/flaking against Docker environment (localhost:8080) while local runs pass. Need root-cause plan without re-enabling Vite/coverage.
|
||||
**Goal**: Turn security Playwright tests back on, run the full E2E suite (including security flows) on Docker base URL, and prepare triage steps for any failures.
|
||||
**Status**: 🔴 ACTIVE – Planning
|
||||
**Priority**: 🔴 CRITICAL – CI blocked
|
||||
**Priority**: 🔴 CRITICAL – CI/CD gating
|
||||
**Created**: 2026-01-27
|
||||
|
||||
---
|
||||
|
||||
## 🔍 CI vs Local Findings
|
||||
|
||||
- **Shard 1** (passed but flaky): `tests/core/access-lists-crud.spec.ts` intermittently misses toast / ACL visibility assertion.
|
||||
- **Shard 2** (hard fail): `emergency-server/*.spec.ts` and `tier2-validation.spec.ts` hit `ECONNREFUSED ::1:2019/2020`; access list creation returns "Blocked by access control list".
|
||||
- **Shard 3** (fail): `tests/core/account-settings.spec.ts` certificate email validation – error message not visible after retries.
|
||||
- **Shard 4** (fail):
|
||||
- `tests/core/system-settings.spec.ts` success toast not observed.
|
||||
- `tests/core/user-management.spec.ts` invite/resend flows fail with strict mode locator collisions (multiple matching buttons).
|
||||
- **Container logs (shard 2 artifact)**: `Emergency server disabled (CHARON_EMERGENCY_SERVER_ENABLED=false)` and emergency bypass called. Tier-2 server (port 2020) never starts → explains connection refusals. Security ACL reported as disabled post emergency reset but initial access-list calls still 401/blocked until login.
|
||||
- **Environment parity**: Local likely starts the emergency server (or resolves `localhost` to 127.0.0.1), while CI disables it via env; CI connects over the IPv6 loopback (::1), so connections are refused while the service is off.
|
||||
- **Architecture**: Vite/coverage already removed; tests target Docker app only.
|
||||
## 🎯 Scope and Constraints
|
||||
- Target branch: `feature/beta-release`.
|
||||
- Base URL: Docker stack (`http://localhost:8080`) unless security tests require override.
|
||||
- Keep management-mode rule: no code reading here; instructions only for execution subagents.
|
||||
- Coverage: run E2E coverage only if already supported via Vite flow; otherwise note as optional follow-up.
|
||||
|
||||
---
|
||||
|
||||
## 🧭 Hypotheses
|
||||
|
||||
1) **Emergency server/tier2 disabled in CI** → all shard-2 tests fail; local enables by default. Root cause: env var CHARON_EMERGENCY_SERVER_ENABLED is false in e2e compose or workflow.
|
||||
2) **ACL bypass timing** → initial emergency reset happens, but ACL state may still block access-list creation; needs deterministic disable hook.
|
||||
3) **UI assertion drift** → account-settings/system-settings/user-management expectations mismatch current UI text/roles; strict-mode locator ambiguity for invite buttons.
|
||||
4) **Toast race / network latency** → success toasts not awaited with retryable locator; CI slower than local.
|
||||
## 🗂️ Files to Change (for execution agents)
|
||||
- [playwright.config.js](playwright.config.js): re-enable the security project/shard config, ensure `testDir` includes the security specs, and revert any `grep`/`grepInvert` filters that previously excluded them.
|
||||
- Security test fixtures/utilities: [tests/security/**](tests/security/), [tests/fixtures/security/**](tests/fixtures/security/), and any shared helpers under [tests/utils](tests/utils) that were toggled off (e.g., skip blocks, `test.skip`, env flags).
|
||||
- Workflows/toggles: [.github/workflows/*e2e*.yml](.github/workflows) and Docker compose overrides (e.g., [.docker/compose/docker-compose.e2e.yml](.docker/compose/docker-compose.e2e.yml)) to re-enable env vars/secrets for security tests (ACL/emergency/rate-limit toggles, tokens, base URLs).
|
||||
- Global setup/teardown: [tests/global-setup.ts](tests/global-setup.ts) and related teardown to ensure security setup hooks are active (if previously short-circuited).
|
||||
- Playwright reports/ignore lists: verify any `.gitignore` or report pruning that might suppress security artifacts.
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Action Plan (phased)
|
||||
## 🛠️ Implementation Steps
|
||||
0) **Prepare environment and secrets**
|
||||
- Ensure required secrets/vars are present (redact in logs): `CHARON_EMERGENCY_TOKEN`, `CHARON_ADMIN_USERNAME`/`CHARON_ADMIN_PASSWORD`, `PLAYWRIGHT_BASE_URL` (`http://localhost:8080` for Docker runs), feature toggles for security/ACL/rate-limit (e.g., `CHARON_SECURITY_TESTS_ENABLED`).
|
||||
- Source from GitHub Actions secrets for CI; `.env`/`.env.local` for local. Do not hardcode; validate presence before run. Redact values in logs (print presence only).
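
The presence check described above could look roughly like the sketch below. The variable names come from this plan; the helper names (`checkRequiredEnv`, `assertRequiredEnv`) and the `<set>`/`<missing>` output format are hypothetical and only illustrate the "print presence, never values" rule.

```typescript
// Variable names are taken from the plan; extend the list as needed.
const REQUIRED_VARS = [
  "CHARON_EMERGENCY_TOKEN",
  "CHARON_ADMIN_USERNAME",
  "CHARON_ADMIN_PASSWORD",
  "PLAYWRIGHT_BASE_URL",
] as const;

type Env = Record<string, string | undefined>;

interface EnvReport {
  missing: string[]; // names that are unset or blank
  lines: string[];   // log-safe lines: presence only, never the value
}

// Hypothetical helper: inspects an env object and reports presence only.
function checkRequiredEnv(env: Env, names: readonly string[] = REQUIRED_VARS): EnvReport {
  const missing = names.filter((name) => (env[name] ?? "").trim() === "");
  const lines = names.map(
    (name) => `${name}=${missing.includes(name) ? "<missing>" : "<set>"}`,
  );
  return { missing, lines };
}

// Hypothetical helper: fails fast before any test runs.
function assertRequiredEnv(env: Env, names: readonly string[] = REQUIRED_VARS): void {
  const { missing } = checkRequiredEnv(env, names);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(", ")}`);
  }
}
```

In a global setup this would presumably be called as `assertRequiredEnv(process.env)`, with `checkRequiredEnv(process.env).lines` written to the CI log.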
|
||||
|
||||
### Phase 1 – Environment parity (CI vs local)
|
||||
- Enable emergency server in CI Docker stack: set `CHARON_EMERGENCY_SERVER_ENABLED=true`, expose admin port 2019 and tier-2 port 2020, and ensure services bind for both IPv4/IPv6 (CI uses ::1).
|
||||
- Explicitly set emergency token for tier-2 if required; document its source (redacted) in test env.
|
||||
- Add startup assertion in global-setup to poll `http://localhost:2019/config/` and `http://localhost:2020/health` (skip if disabled) with short timeout to fail fast.
|
||||
- Capture env snapshot in CI logs for emergency-related vars (redact secrets) and note resolved base URL (IPv4 vs IPv6).
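
A minimal sketch of the fail-fast startup probe, assuming the health endpoints return a 2xx status when ready. `waitForHealthy` and `StatusProbe` are hypothetical names; the probe is injected so the polling logic stays independent of the HTTP client.

```typescript
// A probe resolves to an HTTP status code or rejects on connection errors.
// In Node 18+ it could be: async (url) => (await fetch(url)).status
type StatusProbe = (url: string) => Promise<number>;

interface WaitOptions {
  timeoutMs: number;  // total budget before giving up
  intervalMs: number; // pause between attempts
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: polls until the endpoint answers 2xx or the budget runs out.
async function waitForHealthy(
  url: string,
  probe: StatusProbe,
  opts: WaitOptions = { timeoutMs: 10_000, intervalMs: 500 },
): Promise<void> {
  const deadline = Date.now() + opts.timeoutMs;
  let lastError = "no attempt made";
  for (;;) {
    try {
      const status = await probe(url);
      if (status >= 200 && status < 300) return;
      lastError = `HTTP ${status}`;
    } catch (err) {
      // e.g. ECONNREFUSED while the emergency server is disabled
      lastError = err instanceof Error ? err.message : String(err);
    }
    if (Date.now() + opts.intervalMs > deadline) break;
    await sleep(opts.intervalMs);
  }
  throw new Error(`${url} not healthy after ${opts.timeoutMs}ms (last: ${lastError})`);
}
```

Keeping the last error in the message means a disabled tier-2 server shows up as a connection refusal in setup rather than as scattered spec failures.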
|
||||
1) **Restore security test inclusion**
|
||||
- Revert skips/filters: remove `test.skip`, `test.describe.skip`, or project-level `grepInvert` that excluded security specs.
|
||||
- Ensure `projects` in `playwright.config.js` include security shard (or merge back into main matrix) with correct `testDir`/`testMatch`.
|
||||
- Re-enable security fixture initialization in `global-setup.ts` (e.g., emergency server bootstrap, token wiring) if it was bypassed.
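
As a rough illustration of the inclusion check, the sketch below works on a simplified view of the `projects` array. `name`, `testDir`, `testMatch`, and `grepInvert` are real Playwright project options, but the project name `security`, the paths, and both helper functions are assumptions, not the repository's actual config.

```typescript
// Simplified subset of a Playwright project entry.
interface ProjectConfig {
  name: string;
  testDir?: string;
  testMatch?: RegExp;
  grepInvert?: RegExp;
}

// Hypothetical security project; name and paths are placeholders.
const SECURITY_PROJECT: ProjectConfig = {
  name: "security",
  testDir: "./tests/security",
  testMatch: /.*\.spec\.ts$/,
};

// Appends the security project unless one with that name already exists.
function ensureSecurityProject(projects: ProjectConfig[]): ProjectConfig[] {
  return projects.some((p) => p.name === SECURITY_PROJECT.name)
    ? projects
    : [...projects, SECURITY_PROJECT];
}

// Lists projects whose grepInvert would still filter out a sample security title.
function projectsExcludingSecurity(
  projects: ProjectConfig[],
  sampleTitle = "security",
): string[] {
  return projects.filter((p) => p.grepInvert?.test(sampleTitle)).map((p) => p.name);
}
```

An executor could run `projectsExcludingSecurity` against the loaded config as a quick sanity check that no leftover filter still hides the security specs.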
|
||||
|
||||
### Phase 2 – Deterministic security disable
|
||||
- After login/setup, call emergency reset and then verify ACL/rate-limit flags via `/api/v1/security/config` before continuing tests; make this idempotent and fail fast before any data creation.
|
||||
- If ACL still blocks creation, call `/api/v1/access-lists/templates` and assert a 200 response; if it is not 200, retry the emergency reset once, then fail with a clear error.
|
||||
- Add small utility in TestDataManager to assert ACL is disabled before creating ACL-dependent resources; short-circuit with actionable error.
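
The pre-creation guard might be sketched as below. The response shape of `/api/v1/security/config` is not documented in this plan, so the `{ enabled: boolean }` shape and the feature keys `acl` and `rate_limit` are assumptions to be replaced with the real contract.

```typescript
// ASSUMED shape of the security config payload; verify against the real API.
type SecurityConfig = Record<string, { enabled: boolean } | undefined>;

// ASSUMED feature keys; placeholders only.
const FEATURES_TO_CHECK = ["acl", "rate_limit"] as const;

// Returns the features that still report enabled.
// A feature missing from the payload is treated as disabled (a design choice).
function stillEnabled(
  config: SecurityConfig,
  features: readonly string[] = FEATURES_TO_CHECK,
): string[] {
  return features.filter((feature) => config[feature]?.enabled === true);
}

// Hypothetical guard to call before creating ACL-dependent resources.
function assertSecurityDisabled(
  config: SecurityConfig,
  features: readonly string[] = FEATURES_TO_CHECK,
): void {
  const active = stillEnabled(config, features);
  if (active.length > 0) {
    throw new Error(
      `Security features still enabled after emergency reset: ${active.join(", ")}. ` +
        "Re-run the emergency reset before creating test data.",
    );
  }
}
```

Because the guard is a pure check over the fetched config, it is idempotent and can run at the top of every data-creation helper.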
|
||||
2) **Re-enable env toggles and secrets**
|
||||
- In E2E workflow and Docker compose for tests, set required env vars (examples: `CHARON_EMERGENCY_SERVER_ENABLED=true`, `CHARON_SECURITY_TESTS_ENABLED=true`, tokens/ports 2019/2020) and confirm mounted secrets for security endpoints.
|
||||
- Verify base URL resolution matches Docker (avoid Vite unless running coverage skill).
|
||||
|
||||
### Phase 3 – Shard-specific fixes
|
||||
- **Shard 2**: Once emergency server enabled, rerun to confirm. Add health check for tier-2 server; fail early if down.
|
||||
- **Shard 1**: Wrap ACL toast assertions with `expect.poll`/`toHaveText` on role-based toast locator; ensure list refresh after create. Add a shared toast helper (role-based with short retries) to reuse across specs.
|
||||
- **Shard 3**: Update certificate email validation assertion to target the visible validation message role/text; avoid brittle `getByText` timeouts.
|
||||
- **Shard 4**:
|
||||
- System settings toast: use role-based toast locator with retry; ensure the form submit awaits network idle before assert.
|
||||
- User management invite/resend: replace ambiguous button locators with role+name scoped to each row (e.g., row locator then `getByRole('button', { name: /resend invite/i })`); add a row-scoped locator helper to avoid strict-mode collisions.
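
The row-scoped locator helper mentioned for Shard 4 could be sketched as follows. The `LocatorLike` and `PageLike` interfaces mirror only the `getByRole` and `filter({ hasText })` calls that Playwright's `Page` and `Locator` provide, so the sketch stays self-contained; `rowButton` is a hypothetical name, and it assumes the users table exposes rows with the `row` role.

```typescript
// Minimal structural stand-ins for the Playwright Locator/Page methods used here.
interface LocatorLike {
  filter(options: { hasText: string | RegExp }): LocatorLike;
  getByRole(role: string, options?: { name?: string | RegExp }): LocatorLike;
}

interface PageLike {
  getByRole(role: string, options?: { name?: string | RegExp }): LocatorLike;
}

// Hypothetical helper: find the row containing rowText, then the button inside it.
// Scoping to one row is intended to avoid strict-mode collisions when several
// rows render a button with the same accessible name.
function rowButton(
  page: PageLike,
  rowText: string | RegExp,
  buttonName: string | RegExp,
): LocatorLike {
  return page
    .getByRole("row")
    .filter({ hasText: rowText })
    .getByRole("button", { name: buttonName });
}
```

A spec would then use something like `rowButton(page, "alice@example.com", /resend invite/i)` and click the result, where the email is a placeholder for whatever uniquely identifies the row.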
|
||||
3) **Bring up/refresh test stack**
|
||||
- Start or rebuild test stack before running Playwright: use task `Docker: Start Local Environment` (or `Docker: Rebuild E2E Environment` if needed).
|
||||
- Health check: verify ports 8080/2019/2020 respond (`curl http://localhost:8080`, `http://localhost:2019/config`, `http://localhost:2020/health`).
|
||||
|
||||
### Phase 4 – Observability and flake defense
|
||||
- Add Playwright trace/video for shards 1–4 in CI (confirm whether this is already the default); keep artifacts for failing shards only to save time.
|
||||
- Log emergency server state (enabled/disabled), ACL status, and resolved base URL (IPv4 vs IPv6) at start of each project.
|
||||
- Add short retries (max 2) for toast assertions using auto-retrying expect.
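
For the IPv4 vs IPv6 logging above, a small sketch that labels each target URL by host kind. `hostKind` and `describeTarget` are hypothetical names; note that a plain hostname such as `localhost` cannot be classified from the URL alone, since the resolver decides whether it becomes `::1` or `127.0.0.1`.

```typescript
type HostKind = "ipv4" | "ipv6" | "hostname";

// Classifies the host part of a URL without doing any DNS lookup.
function hostKind(rawUrl: string): HostKind {
  const host = new URL(rawUrl).hostname;
  // WHATWG URL keeps the brackets around IPv6 literals, e.g. "[::1]".
  if (host.startsWith("[") && host.endsWith("]")) return "ipv6";
  if (/^\d{1,3}(\.\d{1,3}){3}$/.test(host)) return "ipv4";
  return "hostname"; // actual address family is decided at resolution time
}

// Produces one log-safe line per target (origin only, no path or query).
function describeTarget(label: string, rawUrl: string): string {
  return `${label}: ${new URL(rawUrl).origin} (${hostKind(rawUrl)})`;
}
```

Logging `describeTarget("base", baseURL)` alongside the admin (2019) and tier-2 (2020) URLs at project start should make an IPv6-only refusal easy to spot in CI output.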
|
||||
4) **Run full E2E suite (all browsers + security)**
|
||||
- Preferred tasks (from workspace tasks):
|
||||
- `Test: E2E Playwright (All Browsers)` for breadth.
|
||||
- `Test: E2E Playwright (Chromium)` for faster iteration.
|
||||
- `Test: E2E Playwright (Skill)` if automation wrapper required.
|
||||
- If security suite has its own task (e.g., `Test: E2E Playwright (Chromium) - Cerberus: Security Dashboard/Rate Limiting`), run those explicitly after re-enable.
|
||||
|
||||
### Phase 5 – Validation loop
|
||||
- Rerun shards 1–4 in CI after env toggle; compare to local.
|
||||
- If shard 2 passes but others fail, prioritize locator/UX updates in phases 3–4.
|
||||
- Keep Vite/coverage off until all shards green; plan separate coverage job later.
|
||||
5) **Optional coverage pass (only if Vite path)**
|
||||
- Coverage only meaningful via Vite coverage skill (port 5173). Docker/8080 runs will show 0% coverage—do not treat as failure.
|
||||
- If required: run `.github/skills/scripts/skill-runner.sh test-e2e-playwright-coverage`; target non-zero coverage and patch coverage on changed lines.
|
||||
|
||||
6) **Report collection and review**
|
||||
- Generate and open report: `npx playwright show-report` (or task `Test: E2E Playwright - View Report`).
|
||||
- For failures, gather traces/videos from `playwright-report/` and `test-results/`.
|
||||
|
||||
7) **Targeted rerun loop for failures**
|
||||
- For each failing spec: rerun with `npx playwright test --project=chromium --grep "<failing name>"` (and the corresponding security project if separate).
|
||||
- After fixes, rerun full Chromium suite; then run all-browsers suite.
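
Since `--grep` is interpreted as a regular expression, test titles containing characters such as parentheses need escaping before a targeted rerun. The sketch below builds the argument list for `npx`; `rerunArgs` is a hypothetical helper and the default project name follows the command shown above.

```typescript
// Escapes regex metacharacters so a test title matches literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: argument list for `npx`, one array entry per argument,
// so no shell quoting is needed when passed to a process-spawning API.
function rerunArgs(title: string, project = "chromium"): string[] {
  return ["playwright", "test", `--project=${project}`, "--grep", escapeRegExp(title)];
}
```

Passing the array to a spawn-style API rather than a shell string avoids a second layer of quoting problems for titles that contain spaces or quotes.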
|
||||
|
||||
8) **Triage loop**
|
||||
- Classify failures: environment/setup vs. locator/data vs. backend errors.
|
||||
- Log failing specs, error messages, and env snapshot (base URL, env flags) into triage doc or ticket.
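
A first-pass classifier for the three buckets above might look like this. The patterns are illustrative only: they are drawn from the error strings quoted in the shard findings (`ECONNREFUSED`, "Blocked by access control list") and from Playwright's strict-mode message, and `classifyFailure` is a hypothetical name.

```typescript
type FailureClass = "environment" | "locator" | "backend" | "unknown";

// Ordered rules: the first matching pattern wins.
const RULES: ReadonlyArray<{ kind: FailureClass; pattern: RegExp }> = [
  // Setup problems seen in shard 2: service down or ACL still active.
  { kind: "environment", pattern: /ECONNREFUSED|Blocked by access control list/i },
  // Ambiguous or stale locators, and assertions that timed out.
  { kind: "locator", pattern: /strict mode violation|Timed out/i },
  // Explicit HTTP error statuses mentioned in the message.
  { kind: "backend", pattern: /\b(?:HTTP|status)\s*[45]\d\d\b/i },
];

// Hypothetical helper: maps an error message to a triage bucket.
function classifyFailure(message: string): FailureClass {
  const rule = RULES.find((r) => r.pattern.test(message));
  return rule ? rule.kind : "unknown";
}
```

Anything landing in `unknown` still needs a manual look; the classifier is only meant to pre-sort the triage log, not to replace reading the trace.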
|
||||
|
||||
---
|
||||
|
||||
## 📄 Files/Areas to touch
|
||||
- Workflow/compose env: ensure `CHARON_EMERGENCY_SERVER_ENABLED=true`; expose tier-2 port 2020; confirm emergency token variable passed.
|
||||
- `tests/core/*`: adjust locators and toast assertions per shard notes.
|
||||
- `tests/utils/TestDataManager.ts`: add ACL-disabled check before ACL creation.
|
||||
- `global-setup.ts` (if needed): add emergency server health probe and state logging.
|
||||
## ✅ Validation Checklist (execution order)
|
||||
- [ ] Lint/typecheck: run `Lint: Frontend`, `Lint: TypeScript Check`, `Lint: Frontend (Fix)` if needed.
|
||||
- [ ] E2E full suite with security (Chromium): task `Test: E2E Playwright (Chromium)` plus security-specific tasks (Rate Limiting/Security Dashboard) once re-enabled.
|
||||
- [ ] E2E all browsers: `Test: E2E Playwright (All Browsers)`.
|
||||
- [ ] Coverage (if applicable): run coverage skill; verify non-zero coverage in `coverage/e2e/`.
|
||||
- [ ] Security scans: `Security: Trivy Scan` and `Security: Go Vulnerability Check` (or CodeQL tasks if required).
|
||||
- [ ] Reports reviewed: open Playwright HTML report, inspect traces/videos for any failing specs.
|
||||
- [ ] Triage log captured: record failing spec IDs, errors, env snapshot (base URL, env flags) and artifact links in shared location (e.g., `test-results/triage.md` or ticket).
|
||||
|
||||
---
|
||||
|
||||
## ✅ Completion checklist
|
||||
- [ ] CI env starts emergency server (port 2020) and admin API (2019); health probes added.
|
||||
- [ ] Security disable verified before data setup; ACL create no longer blocked.
|
||||
- [ ] Shard 1 toast flake mitigated with resilient locator/wait.
|
||||
- [ ] Shard 2 emergency/tier2 tests pass in CI.
|
||||
- [ ] Shard 3 account-settings validation assertion updated and passing.
|
||||
- [ ] Shard 4 system-settings toast and user-management locators stabilized.
|
||||
- [ ] Vite/coverage remain off during fixes; add a guard/checklist item in workflow to ensure coverage flags stay disabled during triage; plan coverage follow-up separately.
|
||||
## 🧪 Triage Strategy for Expected Failures
|
||||
- **Auth/boot failures**: Check `global-setup` logs, ensure emergency/ACL toggles and tokens present. Validate endpoints 2019/2020 reachable in Docker logs.
|
||||
- **Locator/strict mode issues**: Use role-based locators and scope to rows/sections; prefer `getByRole` with accessible names. Add short `expect` retries over manual waits.
|
||||
- **Timing/toast flakiness**: Switch to `await expect(locator).toHaveText(...)` with retries; avoid `waitForTimeout`. Ensure network idle or response awaited on submit.
|
||||
- **Backend 4xx/5xx**: Capture response bodies via `page.waitForResponse` or Playwright traces; verify env flags not disabling required features.
|
||||
- **Security endpoint mismatches**: Validate test data/fixtures match current API contract; update fixtures before rerunning.
|
||||
- **Next steps after failures**: Document failing spec paths, error messages, and suspected root cause; rerun focused spec with `--project` and `--grep` once fixes applied.
|
||||
|
||||
---
|
||||
|
||||
## 📎 Artifacts reviewed
|
||||
- GH Actions log: `.agent_work/run-21377510901.log`
|
||||
- Docker logs (shard 2): `.agent_work/run-21377510901-artifacts/docker-logs-shard-2.txt` (shows emergency server disabled, ACL reset attempts)
|
||||
## 📌 Commands for Executors
|
||||
- Re-enable/verify config: `node -e "console.log(require('./playwright.config'))"` (sanity on projects).
|
||||
- Run Chromium suite: task `Test: E2E Playwright (Chromium)`.
|
||||
- Run all browsers: task `Test: E2E Playwright (All Browsers)`.
|
||||
- Run security-focused tasks: `Test: E2E Playwright (Chromium) - Cerberus: Security Dashboard`, `... - Cerberus: Rate Limiting`.
|
||||
- Show report: `npx playwright show-report` or task `Test: E2E Playwright - View Report`.
|
||||
- Coverage (optional): `.github/skills/scripts/skill-runner.sh test-e2e-playwright-coverage`.
|
||||
|
||||
---
|
||||
|
||||
## 📎 Notes
|
||||
- Keep documentation of any env/secret re-introduction minimal and redacted; avoid hardcoding secrets.
|
||||
- If security tests require data resets, ensure teardown does not affect subsequent suites.
|
||||
|
||||