# E2E Playwright Shard Timeout Investigation — Current Spec
Last updated: 2026-02-10
## Goal
- Investigate GitHub Actions run https://github.com/Wikid82/Charon/actions/runs/21865692694, where the E2E Playwright job reports Shard 3 stopping at ~30 minutes despite configured timeouts of ~40 minutes. Produce reproducible diagnostics, collect artifacts and logs, identify root-cause hypotheses, and provide prioritized remediations plus short-term unblock steps.
## Phases
- Discover: collect logs and artifacts.
- Analyze: review config and correlate shard → tests.
- Remediate: short-term and long-term fixes.
- Verify: reproduce and confirm the fix.
---
## 1) Discover — exact places to collect logs & artifacts
### GitHub Actions (run-level)
- Run page: https://github.com/Wikid82/Charon/actions/runs/21865692694
- Run logs (zip): GET https://api.github.com/repos/Wikid82/Charon/actions/runs/21865692694/logs
- Programmatic commands:
```bash
export GITHUB_OWNER=Wikid82
export GITHUB_REPO=Charon
export RUN_ID=21865692694
# Requires GITHUB_TOKEN set with repo access
curl -H "Accept: application/vnd.github+json" \
-H "Authorization: token $GITHUB_TOKEN" \
-L "https://api.github.com/repos/$GITHUB_OWNER/$GITHUB_REPO/actions/runs/$RUN_ID/logs" \
-o run-${RUN_ID}-logs.zip
unzip -d run-${RUN_ID}-logs run-${RUN_ID}-logs.zip
```
- Artifacts list (API):
```bash
curl -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/$GITHUB_OWNER/$GITHUB_REPO/actions/runs/$RUN_ID/artifacts" | jq '.'
```
- gh CLI (interactive/script):
```bash
gh run view $RUN_ID --repo $GITHUB_OWNER/$GITHUB_REPO --log > run-$RUN_ID-summary.log
gh run download $RUN_ID --repo $GITHUB_OWNER/$GITHUB_REPO --dir artifacts-$RUN_ID
```
### GitHub Actions (job-level)
- List jobs for the run and find Playwright shard job(s):
```bash
curl -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/$GITHUB_OWNER/$GITHUB_REPO/actions/runs/$RUN_ID/jobs" | jq '.jobs[] | {id: .id, name: .name, runner_name: .runner_name, started_at: .started_at, completed_at: .completed_at}'
```
- For JOB_ID identified as the shard job, download job logs:
```bash
curl -H "Authorization: token $GITHUB_TOKEN" -L \
"https://api.github.com/repos/$GITHUB_OWNER/$GITHUB_REPO/actions/jobs/$JOB_ID/logs" -o job-${JOB_ID}-logs.zip
unzip -d job-${JOB_ID}-logs job-${JOB_ID}-logs.zip
```
### Playwright test outputs used by this project
- Search and collect the following files in the repo root (or workflow-run directories):
- `playwright.config.ts`, `playwright.config.js`, `playwright.config.mjs`
- `package.json` scripts invoking Playwright (e.g., `test:e2e`, `e2e:ci`)
- `.github/workflows/*` steps that run Playwright
- Typical Playwright outputs to collect (per-shard):
- `<outputDir>/trace.zip`
- `<outputDir>/test-results.json` or `test-results/*`
- `<outputDir>/video/*`
- `<outputDir>/*.log` (stdout/stderr)
Observed local example (for context): the developer ran
`npx playwright test --project=chromium --output=/tmp/playwright-chromium-output --reporter=list > /tmp/playwright-chromium.log 2>&1` — look for similar invocations in workflows/scripts.
### Repository container logs (containers/)
- containers/charon:
- Files to check: `containers/charon/docker-compose.yml`, any `logs/` or `data/` directories under `containers/charon/`.
- Local commands (when reproducing):
```bash
docker compose -f containers/charon/docker-compose.yml logs --no-color --timestamps > containers-charon-logs.txt
docker logs --timestamps --since "1h" charon-e2e > charon-e2e.log 2>&1 || true
```
- containers/caddy:
- Files: `containers/caddy/Caddyfile`, `containers/caddy/config/`, `containers/caddy/logs/`
- Local checks:
```bash
docker logs --timestamps caddy > caddy.log 2>&1 || true
curl -sS http://127.0.0.1:2019/ || true # admin
curl -sS http://127.0.0.1:2020/ || true # emergency
```
---
## 2) Analyze — specific files and config to review (exact paths)
- Workflows (search these paths):
- `.github/workflows/*.yml` — likely candidates: `.github/workflows/e2e.yml`, `.github/workflows/ci.yml`, `.github/workflows/playwright.yml` (run `grep -R "playwright" .github/workflows || true`).
- Look for `timeout-minutes:` either at top-level workflow or under `jobs:<job>.timeout-minutes`.
- Playwright config files:
- `/projects/Charon/playwright.config.ts`
- `/projects/Charon/playwright.config.js`
- `/projects/Charon/playwright.config.mjs`
- Inspect `projects`, `workers`, `retries`, `outputDir`, `reporter` sections.
- package.json and scripts:
- `/projects/Charon/package.json` — inspect `scripts` for e.g. `test:e2e`, `e2e:ci` and the exact Playwright CLI flags used by CI.
- GitHub skill scripts & E2E runner:
- `.github/skills/scripts/skill-runner.sh` — used in `docs` and testing instructions; check for `docker-rebuild-e2e`, `test-e2e-playwright-coverage`.
- Commands:
```bash
sed -n '1,240p' .github/skills/scripts/skill-runner.sh
grep -Rn "docker-rebuild-e2e\|test-e2e-playwright-coverage\|playwright" .github/skills || true
```
- Makefile:
- `/projects/Charon/Makefile` — search for targets related to `e2e`, `playwright`, `rebuild`.
---
## 3) Steps to download GitHub Actions logs & artifacts for run 21865692694
### Programmatic (API)
1. List artifacts for run:
```bash
curl -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/Wikid82/Charon/actions/runs/21865692694/artifacts" | jq '.'
```
2. Download run logs (zip):
```bash
curl -H "Authorization: token $GITHUB_TOKEN" -L \
"https://api.github.com/repos/Wikid82/Charon/actions/runs/21865692694/logs" -o run-21865692694-logs.zip
unzip -d run-21865692694-logs run-21865692694-logs.zip
```
3. List jobs to find Playwright shard job id(s):
```bash
curl -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/Wikid82/Charon/actions/runs/21865692694/jobs" | jq '.jobs[] | {id: .id, name: .name, runner_name: .runner_name, started_at: .started_at, completed_at: .completed_at}'
```
4. Download job logs by JOB_ID:
```bash
curl -H "Authorization: token $GITHUB_TOKEN" -L \
"https://api.github.com/repos/Wikid82/Charon/actions/jobs/$JOB_ID/logs" -o job-$JOB_ID-logs.zip
unzip -d job-$JOB_ID-logs job-$JOB_ID-logs.zip
```
### Using gh CLI
```bash
gh run view 21865692694 --repo Wikid82/Charon --log > run-21865692694-summary.log
gh run download 21865692694 --repo Wikid82/Charon --dir artifacts-21865692694
```
### Manual web UI
- Visit run page and download artifacts and job logs from the job view.
---
## 4) How to locate shard-specific logs and correlate shard indices to tests
- Typical patterns to inspect:
- Look for Playwright CLI flags in the job step (e.g., `--shard=INDEX/TOTAL`, `--output=/tmp/...`).
- If the job ran `npx playwright test --output=/tmp/...`, search the downloaded job logs for that exact command to find the shard index.
- Commands to list tests assigned to a shard (dry-run):
```bash
# Show which tests a given shard would run (no execution)
npx playwright test --list --shard=INDEX/TOTAL
# Or run with reporter=list (shows test items as executed)
npx playwright test --shard=INDEX/TOTAL --reporter=list
```
- Note: Playwright shard indices are one-based (`--shard=1/4` through `--shard=4/4`). If CI logs show `--shard=3/4`, that is the third of four shards; confirm its exact test assignment by re-running the `--list` command above.
Expected per-shard artifact names (if implemented):
- `e2e-shard-<INDEX>-output` containing `trace.zip`, `video/*`, `test-results.json`, and shard-specific logs (stdout/stderr files).
---
## 5) Runner/container logs to inspect
- GitHub-hosted runner: review the Actions job logs for runner messages and any `Runner` diagnostic lines. You cannot access host-level logs.
- Self-hosted runner (if used): retrieve host system logs (requires access to runner host):
```bash
sudo journalctl -u actions.runner.* -n 1000 > runner-service-journal.log
sudo journalctl -k --since "1 hour ago" | grep -i oom > runner-kernel-oom.log || true
sudo journalctl -u docker.service -n 200 > docker-journal.log
```
- Docker container logs (charon, caddy, charon-e2e):
```bash
docker ps -a --filter "name=charon" --format "{{.Names}} {{.Status}}" > containers-ps.txt
docker logs --since "1h" charon-e2e > charon-e2e.log 2>&1 || true
docker logs --since "1h" caddy > caddy.log 2>&1 || true
```
Check Caddy admin/emergency ports (2019 & 2020) to confirm the proxy was healthy during the test run:
```bash
curl -sS --max-time 5 http://127.0.0.1:2019/ || echo "admin not responding"
curl -sS --max-time 5 http://127.0.0.1:2020/ || echo "emergency not responding"
```
---
## 6) Hypotheses for why Shard 3 stopped at ~30m (descriptions + exact artifacts to search)
H1 — Workflow/job timeout configured smaller than expected
- Search:
- `.github/workflows/*` for `timeout-minutes:`
- job logs for `Timeout` or `Job execution time exceeded`
- Commands:
```bash
grep -n "timeout-minutes" .github/workflows -R || true
grep -i "timeout" -R run-${RUN_ID}-logs || true
```
- Confirmed by: `timeout-minutes: 30` or job logs showing `aborting execution due to timeout`.
H2 — Runner preemption / connection loss
- Search job logs for: `Runner lost`, `The runner has been shutdown`, `Connection to the server was lost`.
- Commands:
```bash
grep -iE "runner lost|runner.*shutdown|connection.*lost|Job canceled|cancelled by" -R run-${RUN_ID}-logs || true
```
- Confirmed by: runner disconnect lines and abrupt end of logs with no Playwright stack trace.
H3 — E2E environment container (charon/caddy) died or became unhealthy
- Search container logs for crash/fatal/panic messages and timestamps matching the job stop time.
- Commands:
```bash
docker ps -a --filter "name=charon" --format '{{.Names}} {{.Status}}'
docker logs charon-e2e --since "2h" | sed -n '1,200p'
grep -iE "panic|fatal|segfault|exited|health.*unhealthy|503|502" containers -R || true
```
- Confirmed by: container exit matching job finish time and Caddy returning 502/503 during run.
H4 — Playwright/Node process killed by OOM
- Search for `Killed`, kernel `oom_reaper` lines, system `dmesg` outputs.
- Commands:
```bash
grep -R "Killed" job-${JOB_ID}-logs || true
# on self-hosted runner host
sudo journalctl -k --since '2 hours ago' | grep -i oom || true
```
- Confirmed by: kernel OOM logs at same timestamp or `Killed` in job logs.
H5 — Script-level early timeout (explicit `timeout 30m` or `kill`)
- Search `.github/skills` and workflow steps for `timeout 30m`, `timeout 1800`, or `kill` calls.
- Commands:
```bash
grep -R "\btimeout\b\|kill -9\|kill -15\|pkill" -n .github || true
```
- Confirmed by: a script with `timeout 30m` or similar wrapper used in the job.
H6 — Misinterpreted units or mis-configuration (seconds vs minutes)
- Search for numeric values used in scripts and steps (e.g., `1800` used where minutes expected).
- Commands:
```bash
grep -R "\b1800\b\|\b3600\b\|timeout-minutes" -n .github || true
```
- Confirmed by: a value of `1800` where `timeout-minutes` or similar was expected to be minutes.
For each hypothesis, the exact lines/entries returned by the grep/journal/docker commands are the evidence to confirm or refute it. Keep timestamps to correlate with the job start/completion times in the run logs.
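To anchor that correlation, the job window can be extracted from the jobs JSON saved earlier. A sketch (requires `jq`; the job-name filter string in the example is a placeholder):

```shell
# Sketch: print the start/completion timestamps of jobs whose name matches a
# filter, so grep hits in the run logs can be checked against that window.
# Assumes a jobs JSON file saved from the /actions/runs/<id>/jobs API call.
job_window() {
  jobs_json=$1; job_name=$2
  jq -r --arg n "$job_name" \
    '.jobs[] | select(.name | contains($n)) | "\(.name)\t\(.started_at)\t\(.completed_at)"' \
    "$jobs_json"
}

# Example: job_window jobs.json "Shard 3"
```

Any log line timestamped at or just before `completed_at` is the most likely evidence of what killed the shard.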
---
## 7) Prioritized remediation plan (short-term → long-term)
### Short-term (unblock re-runs quickly)
1. Download and attach all logs/artifacts for run 21865692694 (use `gh run download`) and share with E2E test author.
2. Temporarily bump `timeout-minutes` for the failing workflow to 60 to allow full runs while diagnosing.
3. Add an `if: always()` step to the E2E job that collects diagnostics and uploads them as artifacts (free memory, `dmesg`, `ps aux`, `docker ps -a`, `docker logs charon-e2e`).
4. Re-run just the failing shard with added `DEBUG=pw:api` and `PWDEBUG=1` and persist shard outputs.
### Medium-term
1. Persist per-shard Playwright outputs via `actions/upload-artifact@v4` for traces/videos/test-results.
2. Add Playwright `retries` for transient failures and `--trace`/`--video` options.
3. Add a CI smoke check before full shard execution to confirm env health.
4. If self-hosted, add runner health checks and alerting (memory, disk, Docker status).
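The smoke-check idea in item 3 can be sketched as a tiny gate step. The URLs in the usage line are assumptions; substitute the real UI/API and emergency endpoints:

```shell
#!/usr/bin/env sh
# Minimal CI smoke check: fail fast if any required endpoint is unreachable
# before launching the full shard matrix.
smoke_check() {
  for url in "$@"; do
    if ! curl -sSf --max-time 5 "$url" > /dev/null; then
      echo "smoke check FAILED: $url" >&2
      return 1
    fi
    echo "smoke check ok: $url"
  done
}

# Example (hypothetical endpoints):
#   smoke_check http://127.0.0.1:8080/ http://127.0.0.1:2020/
```

Running this as the first job step turns an unhealthy environment into an immediate, clearly labeled failure instead of a 30-minute hang.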
### Long-term
1. Implement stable test splitting based on historical test durations rather than equal-file sharding.
2. Introduce resource constraints and monitoring to protect against OOM and flapping containers.
3. Build a golden-minimal E2E smoke job that must pass before running full shards.
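The duration-based splitting in item 1 can be sketched as a small script. The `durations.txt` input format is an assumption (one `<seconds> <file>` pair per line, derivable from past `test-results.json` reports):

```shell
# Sketch: split test files into N shards by historical duration instead of
# equal file counts, using a greedy longest-first assignment.
split_by_duration() {
  shards=$1; durations=$2
  sort -rn "$durations" | awk -v n="$shards" '
    {
      # greedy: put the next-longest test on the currently lightest shard
      min = 1
      for (i = 2; i <= n; i++) if (load[i] < load[min]) min = i
      load[min] += $1
      print min, $2
    }'
}

# Example: split_by_duration 4 durations.txt > shard-assignment.txt
```

Greedy longest-first keeps shard runtimes within a small factor of optimal, which is usually enough to stop one shard from running far longer than the others.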
---
## 8) Minimal reproduction checklist (local)
1. Rebuild E2E image used by CI (per repo skill):
```bash
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e
```
2. Start the environment (example):
```bash
docker compose -f containers/charon/docker-compose.yml up -d
```
3. Set base URL and run the same shard (replace INDEX/TOTAL with values from CI):
```bash
export PLAYWRIGHT_BASE_URL=http://localhost:5173
DEBUG=pw:api PWDEBUG=1 \
npx playwright test --shard=INDEX/TOTAL --project=chromium \
--output=/tmp/playwright-shard-INDEX --reporter=list > /tmp/playwright-shard-INDEX.log 2>&1
```
4. If reproducing a timeout, immediately collect:
```bash
docker ps -a --format '{{.Names}} {{.Status}}' > reproduce-docker-ps.txt
docker logs --since '1h' charon-e2e > reproduce-charon-e2e.log || true
tail -n 500 /tmp/playwright-shard-INDEX.log > reproduce-pw-tail.log
```
---
## 9) Required workflow/scripts changes to improve diagnostics & prevent recurrence
- Add `timeout-minutes: 60` to `.github/workflows/<e2e workflow>.yml` while diagnosing; later set to a reasoned SLA (e.g., 50m).
- Add an `always()` step to collect diagnostics on failure and upload artifacts. Example YAML snippet:
```yaml
- name: Collect diagnostics
if: always()
run: |
uptime > uptime.txt
free -m > free-m.txt
df -h > df-h.txt
ps aux > ps-aux.txt
docker ps -a > docker-ps.txt || true
docker logs --tail 500 charon-e2e > docker-charon-e2e.log || true
- uses: actions/upload-artifact@v4
with:
name: e2e-diagnostics-${{ github.run_id }}
path: |
uptime.txt
free-m.txt
df-h.txt
ps-aux.txt
docker-ps.txt
docker-charon-e2e.log
```
- Ensure each Playwright shard runs with `--output` pointing to a shard-specific path and upload that path as an artifact:
- artifact name convention: `e2e-shard-${{ matrix.index }}-output`.
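A minimal matrix sketch of that convention (job and step names are placeholders, and a 4-shard split is assumed):

```yaml
e2e:
  strategy:
    matrix:
      index: [1, 2, 3, 4]
  steps:
    - name: Run shard
      run: npx playwright test --shard=${{ matrix.index }}/4 --output=test-results/shard-${{ matrix.index }}
    - uses: actions/upload-artifact@v4
      if: always()   # upload even when the shard times out or fails
      with:
        name: e2e-shard-${{ matrix.index }}-output
        path: test-results/shard-${{ matrix.index }}
```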
---
## 10) People/roles to notify & recommended next actions
- Notify:
- CI/Infra owner or person in `CODEOWNERS` for `.github/workflows`
- E2E test author(s) (owners of failing tests)
- Self-hosted runner owner (if runner_name in job JSON indicates self-hosted)
- Recommended immediate actions for them:
1. Download run artifacts and job logs for run 21865692694 and share them with the test author.
2. Re-run the shard with `DEBUG=pw:api` and `PWDEBUG=1` enabled and ensure per-shard artifacts are uploaded.
3. If self-hosted, check runner host kernel logs for OOM and Docker container exits at the job time.
---
## 11) Verification steps (post-remediation)
1. Re-run E2E workflow end-to-end; verify Shard 3 completes.
2. Confirm artifacts `e2e-shard-3-output` exist and contain `trace.zip`, `video/*`, and `test-results.json`.
3. Confirm no `oom_reaper` or `Killed` messages in runner host logs during the run.
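Step 2 can be automated with a quick artifact check. A sketch, assuming the directory layout follows the artifact convention above (adjust if the workflow differs):

```shell
# Sketch: verify a downloaded shard artifact contains the expected outputs.
check_shard_artifact() {
  dir=$1; missing=0
  for f in trace.zip test-results.json; do
    [ -e "$dir/$f" ] || { echo "missing: $dir/$f" >&2; missing=1; }
  done
  ls "$dir"/video/* > /dev/null 2>&1 || { echo "missing: $dir/video/*" >&2; missing=1; }
  return $missing
}

# Example: check_shard_artifact artifacts-21865692694/e2e-shard-3-output
```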
---
## Appendix — quick extraction commands summary
```bash
# Download all artifacts and logs for RUN_ID
gh run download 21865692694 --repo Wikid82/Charon --dir ./artifacts-21865692694
# List jobs and find Playwright shard job(s)
curl -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/Wikid82/Charon/actions/runs/21865692694/jobs" | jq '.jobs[] | {id: .id, name: .name, runner_name: .runner_name, started_at: .started_at, completed_at: .completed_at}'
# Download job logs for JOB_ID
curl -H "Authorization: token $GITHUB_TOKEN" -L \
"https://api.github.com/repos/Wikid82/Charon/actions/jobs/$JOB_ID/logs" -o job-$JOB_ID-logs.zip
unzip -d job-$JOB_ID-logs job-$JOB_ID-logs.zip
# Grep for likely causes
grep -iE "timeout|minut|runner lost|cancelled|Killed|OOM|oom_reaper|Out of memory|panic|fatal" -R run-21865692694-logs || true
```
---
## Next three immediate actions (checklist)
1. Run `gh run download 21865692694 --repo Wikid82/Charon --dir ./artifacts-21865692694` and unzip the run logs.
2. Search the downloaded logs for `timeout-minutes`, `Runner lost`, `Killed`, and `oom_reaper` to triage H1-H4.
3. Re-run the failing shard locally with `DEBUG=pw:api PWDEBUG=1` and `--output=/tmp/playwright-shard-INDEX`, capture outputs, and upload them as artifacts.
---
post_title: "E2E Test Remediation Plan"
author1: "Charon Team"
post_slug: "e2e-test-remediation-plan"
microsoft_alias: "charon-team"
featured_image: "https://wikid82.github.io/charon/assets/images/featured/charon.png"
categories: ["testing"]
tags: ["playwright", "e2e", "remediation", "security"]
ai_note: "true"
summary: "Phased remediation plan for Charon Playwright E2E tests, covering
inventory, dependencies, runtime estimates, and quick start commands."
post_date: "2026-01-28"
---
## 1. Introduction
This plan replaces the current spec with a comprehensive, phased remediation
strategy for the Playwright E2E test suite under [tests](tests). The goal is to
stabilize execution, align dependencies, and sequence remediation work so that
core management flows, security controls, and integration workflows become
reliable in Docker-based E2E runs.
## 2. Research Findings
### 2.1 Test Harness and Global Dependencies
- Global setup and teardown are enforced by
[tests/global-setup.ts](tests/global-setup.ts),
[tests/auth.setup.ts](tests/auth.setup.ts), and
[tests/security-teardown.setup.ts](tests/security-teardown.setup.ts).
- Global setup validates the emergency token, checks health endpoints, and
resets security settings, which impacts all security-enforcement suites.
- Multiple suites depend on the emergency server (port 2020) and Cerberus
modules with explicit admin whitelist configuration.
### 2.2 Test Inventory and Feature Areas
- Core management flows: authentication, navigation, dashboard, proxy hosts,
certificates, access lists in [tests/core](tests/core).
- DNS providers and ACME workflows:
  [tests/dns-provider-crud.spec.ts](tests/dns-provider-crud.spec.ts),
  [tests/dns-provider-types.spec.ts](tests/dns-provider-types.spec.ts),
  [tests/manual-dns-provider.spec.ts](tests/manual-dns-provider.spec.ts).
- Monitoring: uptime and log streaming in
[tests/monitoring](tests/monitoring).
- Settings: system, account, SMTP, notifications, encryption, user management
in [tests/settings](tests/settings).
- Tasks and imports: backups, Caddyfile import flows, CrowdSec import, and log
viewing in [tests/tasks](tests/tasks).
- Security UI: dashboard, WAF, CrowdSec, headers, rate limiting, and audit logs
in [tests/security](tests/security).
- Security enforcement: ACL, WAF, rate limits, CrowdSec, emergency token, and
break-glass recovery in [tests/security-enforcement](tests/security-enforcement).
- Integration workflows: cross-feature scenarios in
[tests/integration](tests/integration).
- Browser-specific regressions for import flows in
[tests/webkit-specific](tests/webkit-specific) and
[tests/firefox-specific](tests/firefox-specific).
- Debug and diagnostics: certificates and Caddy import debug coverage in
[tests/debug/certificates-debug.spec.ts](tests/debug/certificates-debug.spec.ts),
[tests/tasks/caddy-import-gaps.spec.ts](tests/tasks/caddy-import-gaps.spec.ts),
[tests/tasks/caddy-import-cross-browser.spec.ts](tests/tasks/caddy-import-cross-browser.spec.ts),
and [tests/debug](tests/debug).
- UI triage and regression coverage: dropdown/modal coverage in
[tests/modal-dropdown-triage.spec.ts](tests/modal-dropdown-triage.spec.ts) and
[tests/proxy-host-dropdown-fix.spec.ts](tests/proxy-host-dropdown-fix.spec.ts).
- Shared utilities validation: wait helpers in
[tests/utils/wait-helpers.spec.ts](tests/utils/wait-helpers.spec.ts).
### 2.3 Dependency and Ordering Constraints
- The security-enforcement suite assumes Cerberus can be toggled on, and its
  final tests intentionally restore admin whitelist state (see
  [tests/security-enforcement/zzzz-break-glass-recovery.spec.ts](tests/security-enforcement/zzzz-break-glass-recovery.spec.ts)).
- Admin whitelist blocking is designed to run last using a `zzz` prefix (see
  [tests/security-enforcement/zzz-admin-whitelist-blocking.spec.ts](tests/security-enforcement/zzz-admin-whitelist-blocking.spec.ts)).
- Emergency server tests depend on port 2020 availability
(see [tests/security-enforcement/emergency-server](tests/security-enforcement/emergency-server)).
- Some import suites use real APIs and TestDataManager cleanup; others mock
requests. Remediation must avoid mixing mocked and real flows in a single
phase without clear isolation.
### 2.4 Runtime and Flake Hotspots
- Security-enforcement suites include extended retries, network propagation
delays, and rate limit loops.
- Import debug and gap-coverage suites perform real uploads, data creation, and
commit flows, making them sensitive to backend state and Caddy reload timing.
- Monitoring WebSocket tests require stable log streaming state.
## 3. Technical Specifications
### 3.1 Test Grouping and Shards
- **Foundation:** global setup, auth storage state, security teardown.
- **Core UI:** authentication, navigation, dashboard, proxy hosts, certificates,
access lists.
- **Settings:** system, account, SMTP, notifications, encryption, users.
- **Tasks:** backups, logs, Caddyfile import, CrowdSec import.
- **Monitoring:** uptime monitoring and real-time logs.
- **Security UI:** Cerberus dashboard, WAF config, headers, rate limiting,
CrowdSec config, audit logs.
- **Security Enforcement:** ACL/WAF/CrowdSec/rate limit enforcement, emergency
token and break-glass recovery, admin whitelist blocking.
- **Integration:** proxy + cert, proxy + DNS, backup restore, import workflows,
multi-feature workflows.
- **Browser-specific:** WebKit and Firefox import regressions.
- **Debug/POC:** diagnostics and investigation suites (Caddy import debug).
### 3.2 Dependency Graph (High-Level)
```mermaid
flowchart TD
A[global-setup + auth.setup] --> B[Core UI + Settings]
A --> C[Tasks + Monitoring]
A --> D[Security UI]
D --> E[Security Enforcement]
E --> F[Break-Glass Recovery]
B --> G[Integration Workflows]
C --> G
G --> H[Browser-specific Suites]
```
### 3.3 Runtime Estimates (Docker Mode)
| Group | Suite Examples | Expected Runtime | Prerequisites |
| --- | --- | --- | --- |
| Foundation | global setup + auth | 1-2 min | Docker E2E container, emergency token |
| Core UI | core specs | 6-10 min | Auth storage state, clean data |
| Settings | settings specs | 6-10 min | Auth storage state |
| Tasks | backups/import/logs | 10-16 min | Auth storage state, API mocks and real flows |
| Monitoring | monitoring specs | 5-8 min | WebSocket stability |
| Security UI | security specs | 10-14 min | Cerberus enabled, admin whitelist |
| Security Enforcement | enforcement specs | 15-25 min | Emergency token, port 2020, admin whitelist |
| Integration | integration specs | 12-20 min | Stable core + settings + tasks |
| Browser-specific | firefox/webkit | 8-12 min | Import baseline stable |
| Debug/POC | caddy import debug | 4-6 min | Docker logs available |
Assumed worker count: 4 (default) except security-enforcement which requires
`--workers=1`. Serial execution increases runtime for enforcement suites.
### 3.4 Environment Preconditions
- E2E container built and healthy via
`.github/skills/scripts/skill-runner.sh docker-rebuild-e2e`.
- Ports 8080 (UI/API) and 2020 (emergency server) reachable.
- `CHARON_EMERGENCY_TOKEN` configured and valid.
- Admin whitelist includes test runner ranges when Cerberus is enabled.
- Caddy admin health endpoints reachable for import workflows.
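A sketch of a preflight script for these preconditions. Port numbers and the token variable name are taken from this spec; the health-check paths are assumptions:

```shell
# Sketch: fail early if the E2E environment preconditions are not met.
preflight() {
  fail=0
  [ -n "$CHARON_EMERGENCY_TOKEN" ] || { echo "CHARON_EMERGENCY_TOKEN not set" >&2; fail=1; }
  for port in 8080 2020; do
    curl -sSf --max-time 5 "http://127.0.0.1:$port/" > /dev/null \
      || { echo "port $port not reachable" >&2; fail=1; }
  done
  return $fail
}
```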
### 3.5 Emergency Server and Security Prerequisites
- Port 2020 (emergency server) available and reachable for
[tests/security-enforcement/emergency-server](tests/security-enforcement/emergency-server).
- Port 2019 is reserved for the Caddy admin API; use 2020 for emergency server
tests to avoid conflicts.
- Basic Auth credentials required for emergency server tests. Defaults in test
fixtures are `admin` / `changeme` and should match the E2E compose config.
- Admin whitelist bypass must be configured before enforcement tests that
toggle Cerberus settings.
## 4. Implementation Plan
### Phase 1: Foundation and Test Harness Reliability
Objective: Ensure the shared test harness is stable before touching feature
flows.
- Validate global setup and storage state creation
(see [tests/global-setup.ts](tests/global-setup.ts) and
[tests/auth.setup.ts](tests/auth.setup.ts)).
- Confirm emergency server availability and credentials for break-glass suites.
- Establish baseline run for core login/navigation suites.
Estimated runtime: 2-4 minutes
Success criteria:
- Storage state created once and reused without re-auth flake.
- Emergency token validation passes and security reset executes.
### Phase 2: Core UI, Settings, Monitoring, and Task Flows
Objective: Remediate the highest-traffic user journeys and tasks.
- Core UI: authentication, navigation, dashboard, proxy hosts, certificates,
access lists (core CRUD and navigation).
- Settings: system, account, SMTP, notifications, encryption, users.
- Monitoring: uptime and real-time logs.
- Tasks: backups, logs viewing, and base Caddyfile import flows.
- Include modal/dropdown triage coverage and wait helpers validation.
Estimated runtime: 25-40 minutes
Success criteria:
- Core CRUD and navigation pass without retries.
- Monitoring WebSocket tests pass without timeouts.
- Backups and log viewing flows pass with mocks and deterministic waits.
### Phase 3: Security UI and Enforcement
Objective: Stabilize Cerberus UI configuration and enforcement workflows.
- Security dashboard and configuration pages.
- WAF, headers, rate limiting, CrowdSec, audit logs.
- Enforcement suites, including emergency token and whitelist blocking order.
Estimated runtime: 30-45 minutes
Success criteria:
- Security UI toggles and pages load without state leakage.
- Enforcement suites pass with Cerberus enabled and whitelist configured.
- Break-glass recovery restores bypass state for subsequent suites.
### Phase 4: Integration, Browser-Specific, and Debug Suites
Objective: Close cross-feature and browser-specific regressions.
- Integration workflows: proxy + cert, proxy + DNS, backup restore, import to
production, multi-feature workflows.
- Browser-specific Caddy import regressions (Firefox/WebKit).
- Debug/POC suites (Caddy import debug, diagnostics) run as opt-in,
including caddy-import-gaps and cross-browser import coverage.
Estimated runtime: 25-40 minutes
Success criteria:
- Integration workflows pass with stable TestDataManager cleanup.
- Browser-specific import tests show consistent API request handling.
- Debug suites remain optional and do not block core pipelines.
## 5. Acceptance Criteria (EARS)
- WHEN the E2E harness initializes, THE SYSTEM SHALL validate emergency token
and create a reusable auth state without flake.
- WHEN core management tests execute, THE SYSTEM SHALL complete CRUD flows
without manual retries or timeouts.
- WHEN security enforcement suites execute, THE SYSTEM SHALL apply Cerberus
settings with admin whitelist bypass and SHALL restore security state after
completion.
- WHEN integration workflows execute, THE SYSTEM SHALL complete cross-feature
journeys without data collisions or residual state.
## 6. Quick Start Commands
```bash
# Rebuild and start E2E container
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e
# PHASE 1: Foundation
cd /projects/Charon
npx playwright test tests/global-setup.ts tests/auth.setup.ts --project=firefox
# PHASE 2: Core UI, Settings, Tasks, Monitoring
# NOTE: PLAYWRIGHT_SKIP_SECURITY_DEPS=1 is automatically set in E2E scripts
# Security suites will NOT execute as dependencies
npx playwright test tests/core --project=firefox
npx playwright test tests/settings --project=firefox
npx playwright test tests/tasks --project=firefox
npx playwright test tests/monitoring --project=firefox
# PHASE 3: Security UI and Enforcement (SERIAL)
npx playwright test tests/security --project=firefox
npx playwright test tests/security-enforcement --project=firefox --workers=1
# PHASE 4: Integration, Browser-Specific, Debug (Optional)
npx playwright test tests/integration --project=firefox
npx playwright test tests/firefox-specific --project=firefox
npx playwright test tests/webkit-specific --project=webkit
npx playwright test tests/debug --project=firefox
npx playwright test tests/tasks/caddy-import-gaps.spec.ts --project=firefox
```
## 7. Risks and Mitigations
- Risk: Security suite state leaks across tests. Mitigation: enforce admin
whitelist reset and break-glass recovery ordering.
- Risk: File-name ordering (zzz-) not enforced without `--workers=1`.
Mitigation: document `--workers=1` requirement and make it mandatory in
CI and quick-start commands.
- Risk: Emergency server unavailable. Mitigation: gate enforcement suites on
health checks and document port 2020 requirements.
- Risk: Import suites combine mocked and real flows. Mitigation: isolate by
phase and keep debug suites opt-in.
- Risk: Missing test suites hide regressions. Mitigation: inventory now
includes all suites and maps them to phases.
## 8. Dependencies and Impacted Files
- Harness: [tests/global-setup.ts](tests/global-setup.ts),
[tests/auth.setup.ts](tests/auth.setup.ts),
[tests/security-teardown.setup.ts](tests/security-teardown.setup.ts).
- Core UI: [tests/core](tests/core).
- Settings: [tests/settings](tests/settings).
- Tasks: [tests/tasks](tests/tasks).
- Monitoring: [tests/monitoring](tests/monitoring).
- Security UI: [tests/security](tests/security).
- Security enforcement: [tests/security-enforcement](tests/security-enforcement).
- Integration: [tests/integration](tests/integration).
- Browser-specific: [tests/firefox-specific](tests/firefox-specific),
[tests/webkit-specific](tests/webkit-specific).
## 9. Confidence Score
Confidence: 79 percent
Rationale: The suite inventory and dependencies are well understood. The main
unknowns are timing-sensitive security propagation and emergency server
availability in varied environments.
## Review Feedback & Required Additions
Summary: the spec is thorough and well-structured but is missing several concrete
forensic and reproduction details needed to reliably diagnose shard timeouts
and to make CI-side fixes repeatable. The items below add those missing
artifacts, commands, and prioritized mitigations.
1) Test-forensics (how to analyze Playwright traces & map failing tests to shards)
- Extract and open traces per-shard: unzip the artifact and run:
```bash
npx playwright show-trace e2e-shard-<INDEX>-output/trace.zip
# or unzip first to inspect screenshots/network entries manually:
unzip e2e-shard-<INDEX>-output/trace.zip -d /tmp/trace-INDEX
```
- Use JSON reporter to map test IDs to trace files and timestamps:
```bash
# run locally to produce a reporter JSON for the shard
npx playwright test --shard=INDEX/TOTAL --project=chromium --reporter=json --output=/tmp/playwright-shard-INDEX --trace=on > /tmp/playwright-shard-INDEX.json
# suites nest recursively, so walk them with ..; specs carry file/line/title
# directly, and durations live on each test result
jq '[.. | objects | select(has("specs")) | .specs[] | {title, file, line, durations: [.tests[].results[].duration], statuses: [.tests[].results[].status]}]' /tmp/playwright-shard-INDEX.json
```
- Correlate test start/stop timestamps (from reporter JSON) with job logs and container logs to find the precise point where execution stopped.
- If only one test is hanging, use `--grep` (or pass the spec file path directly) to re-run that test with `--trace=on` and `DEBUG=pw:api` set in the environment, and capture the trace and stdout.
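To support the timestamp correlation above, the filter below flattens the nested-suite reporter JSON into one row per test result, sorted by start time, so the last row shows the last test that began before the shard stopped. Field names (`startTime`, `duration` on each result) follow Playwright's JSON reporter schema; the default report path matches the shard command above.

```bash
# Flatten nested suites into "startTime  duration  file  line  title" rows
# sorted by start time; the final row is the last test that started.
REPORT="${REPORT:-/tmp/playwright-shard-INDEX.json}"
TIMELINE_FILTER='[.. | objects | select(has("specs")) | .specs[]
  | {file, line, title} + {r: .tests[].results[]}
  | [.r.startTime, (.r.duration | tostring), .file, (.line | tostring), .title]
  | @tsv] | sort | .[]'
[ -f "$REPORT" ] && jq -r "$TIMELINE_FILTER" "$REPORT"
```

Compare the last row's `startTime` plus `duration` against the job's end timestamp: a large gap means the runner died or the harness hung outside any test.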
2) CI / Workflow checks (where to inspect timeouts and cancellation causes)
- Inspect `.github/workflows/*.yml` for both top-level `timeout-minutes:` and job-level `jobs.<job>.timeout-minutes`.
```bash
grep -Rn "timeout-minutes" .github/workflows || true
```
- From the run/job JSON (API), check the `status` and `conclusion` fields (a job killed by `timeout-minutes` ends with `conclusion: cancelled`) and use `started_at` / `completed_at` to measure the actual wall time:
```bash
curl -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/$GITHUB_OWNER/$GITHUB_REPO/actions/jobs/$JOB_ID" | jq '.'
```
- Search job logs for runner messages indicating preemption, OOM, or cancellation:
```bash
grep -RiE "Job canceled|cancelled|runner lost|Runner|Killed|OOM|oom_reaper|Timeout" job-$JOB_ID-logs || true
```
- Confirm whether the runner was `self-hosted` (job JSON `runner_name` / `runner_group_id`). If self-hosted, collect `journalctl` and docker host logs for the timestamp window.
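As a sketch of that check (assuming the job JSON from the API call above was saved to `job.json`), the filter below pulls the runner-identity and timing fields; `self-hosted` appearing in `.labels` is the reliable marker:

```bash
# Summarize runner identity and the job's timestamp window from job.json;
# started_at/completed_at bound the window to search in host logs.
JOB_FILTER='{status, conclusion, started_at, completed_at,
  runner_name, runner_group_name, labels,
  self_hosted: ((.labels // []) | index("self-hosted") != null)}'
[ -f job.json ] && jq "$JOB_FILTER" job.json
```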
3) Reproduction instructions (how to reproduce the shard locally exactly)
- Rebuild image used by CI (recommended to match CI):
```bash
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e
```
- Start E2E environment (use the same compose used in CI):
```bash
docker compose -f containers/charon/docker-compose.yml up -d
```
- Environment variables to set (use the values CI uses):
- `PLAYWRIGHT_BASE_URL`: the CI base URL (e.g. `http://localhost:8080` for Docker mode; `http://localhost:5173` for Vite dev).
- `CHARON_EMERGENCY_TOKEN`: the emergency token used by tests.
- `PLAYWRIGHT_JOBS` (parallelism) and debug toggles as needed: `DEBUG=pw:api` for API-level logging; `PWDEBUG=1` runs headed with the inspector and disables Playwright timeouts (useful for watching a hang interactively; drop it for unattended runs).
- Optional toggles used in CI: `PLAYWRIGHT_SKIP_SECURITY_DEPS=1`.
- Exact shard reproduction command (example matching CI):
```bash
export PLAYWRIGHT_BASE_URL=http://localhost:8080
export CHARON_EMERGENCY_TOKEN=changeme
# keep the JSON report and debug stderr in separate files so the JSON stays parseable
# (add PWDEBUG=1 to run headed with the inspector; note it disables Playwright timeouts)
DEBUG=pw:api \
npx playwright test --shard=INDEX/TOTAL --project=chromium \
  --output=/tmp/playwright-shard-INDEX --reporter=json --trace=on \
  > /tmp/playwright-shard-INDEX.json 2> /tmp/playwright-shard-INDEX.log
```
- To re-run a single failing test found in JSON:
```bash
npx playwright test tests/path/to/spec.ts -g "Exact test title" --project=chromium --trace=on --output=/tmp/playwright-single
```
4) Required artifacts & evidence to collect (exact list and commands)
- Per-shard Playwright outputs: `trace.zip`, `video/*`, the JSON reporter output (`test-results.json` or a stdout redirect), and the shard stdout/stderr log. Ensure `--output` points to a shard-specific path and upload it as an artifact.
- Job-level artifacts: GitHub Actions run logs ZIP, job logs ZIP, `gh run download` output.
- Runner/host diagnostics (self-hosted): `journalctl -u actions.runner.*`, `dmesg | grep -i oom`, `sudo journalctl -u docker.service`, `docker ps -a`, `docker logs --since` for charon-e2e and caddy.
- Capture a timestamped mapping file that lists: job start, shard start, last test start, last trace timestamp, job end. Example CSV header: `job_id,job_start,shard_index,shard_start,last_test_started_at,job_end,conclusion`.
- Attach a minimal repro package: Docker image tag, docker-compose file, the exact Playwright command-line, and the failing test id/title.
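A starting point for that mapping CSV can be generated from the run's job list (assuming `gh api "repos/Wikid82/Charon/actions/runs/21865692694/jobs?per_page=100" > jobs.json` was run first). The shard index is parsed from job names like `e2e (3)`, which is an assumption about this workflow's naming; `last_test_started_at` is left blank, to be filled from the per-shard reporter JSON:

```bash
# Emit one CSV row per job; shard_start equals job_start because each
# shard runs as its own job. last_test_started_at is filled in later.
ROW_FILTER='.jobs[]
  | [(.id | tostring), .started_at,
     ((.name | capture("\\((?<i>[0-9]+)").i) // ""),
     .started_at, "",
     (.completed_at // ""), (.conclusion // "")]
  | join(",")'
echo "job_id,job_start,shard_index,shard_start,last_test_started_at,job_end,conclusion"
[ -f jobs.json ] && jq -r "$ROW_FILTER" jobs.json
```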
5) Prioritization of fixes and quick mitigations (concrete)
- P0 (Immediate unblock):
- Temporarily increase `timeout-minutes` to 60 for failing workflow; add `if: always()` diagnostics step and artifact upload.
- Ensure each shard uses `--output` per-shard and is uploaded (`actions/upload-artifact`) so traces are available even on cancellation.
- Re-run failing shard locally with `DEBUG=pw:api PWDEBUG=1` and collect traces.
- P1 (Same-day):
- Add CI smoke healthcheck step that validates UI and emergency server before shards start (quick `curl` checks and a small Playwright smoke test).
- If self-hosted runner, add simple resource guard (systemd service restart prevention) and OOM monitoring alert.
- Configure a small number of Playwright retries for flaky tests, and run expensive suites with `--workers=1`.
- P2 (Next sprint):
- Implement historical-duration-based shard splitting to avoid heavy concentration in one shard.
- Add test-level tagging and targeted prioritization for long-running security-enforcement suites.
- Add CI-level telemetry: test-duration history, flaky-test dashboard.
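A minimal sketch of the P1 smoke healthcheck, assuming the Docker-mode UI on port 8080 and an emergency-server health endpoint on port 2020 (the exact paths are assumptions to adjust); the `RUN_HEALTHCHECK` guard is hypothetical and just keeps the probe opt-in:

```bash
# Poll a URL until it answers 2xx/3xx or the retry budget is exhausted.
wait_for() {
  url=$1; tries=${2:-30}
  i=1
  while [ "$i" -le "$tries" ]; do
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "OK: $url"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "FAIL: $url never became healthy" >&2
  return 1
}

# Gate the shards on both services (opt-in via RUN_HEALTHCHECK=1).
if [ "${RUN_HEALTHCHECK:-0}" = "1" ]; then
  wait_for "${PLAYWRIGHT_BASE_URL:-http://localhost:8080}" || exit 1
  wait_for "http://localhost:2020/health" || exit 1
fi
```

Failing here, before any shard starts, turns a 30-minute hang into an immediate red step with an obvious cause.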
Verdict: NEEDS CHANGES — the existing spec is a solid base, but add the forensic commands, reproducible shard reproduction steps, explicit artifact list, and CI checks above before marking this plan approved.
Actionable next steps (short list):
- Add the `always()` diagnostics step to `.github/workflows/<e2e-workflow>.yml` and upload diagnostics as artifacts.
- Modify the E2E job to set `--output` to `e2e-shard-${{ matrix.index }}-output` and upload that path.
- Run `gh run download 21865692694` and extract the per-job logs; parse the job JSON to determine if the runner was self-hosted and collect host logs if so.
- Reproduce the failing shard locally using the exact commands above and attach `trace.zip` and JSON reporter output to the issue.
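For the first bullet, the `always()` step could run a best-effort collection script along these lines before `actions/upload-artifact`; the container names (`charon-e2e`, `caddy`) match the ones listed in section 4, and every command is guarded so a missing tool or container cannot fail the step:

```bash
# Best-effort host/container diagnostics, written to a single directory
# that the workflow then uploads as an artifact.
mkdir -p diagnostics
{
  echo "=== collected at $(date -u +"%Y-%m-%dT%H:%M:%SZ") ==="
  docker ps -a 2>&1 || true
  docker logs --since 1h charon-e2e 2>&1 | tail -n 500 || true
  docker logs --since 1h caddy 2>&1 | tail -n 500 || true
  dmesg 2>&1 | grep -iE "oom|killed" || true
  df -h 2>&1 || true
  free -m 2>&1 || true
} > diagnostics/host-diagnostics.txt
echo "wrote $(wc -c < diagnostics/host-diagnostics.txt) bytes of diagnostics"
```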
If you want, I can apply the small CI YAML snippets (diagnostics + upload) as a targeted patch or download the run artifacts now (requires `GITHUB_TOKEN`).