E2E Playwright Shard Timeout Investigation — Current Spec
Last updated: 2026-02-10
Goal
- Concise summary: investigate GitHub Actions run https://github.com/Wikid82/Charon/actions/runs/21865692694 where the E2E Playwright job reports Shard 3 stopping at ~30 minutes despite configured timeouts of ~40 minutes. Produce reproducible diagnostics, collect artifacts/logs, identify root cause hypotheses, and provide prioritized remediations and short-term unblock steps.
Phases
- Discover: collect logs and artifacts.
- Analyze: review config and correlate shard → tests.
- Remediate: short-term and long-term fixes.
- Verify: reproduce and confirm the fix.
1) Discover — exact places to collect logs & artifacts
GitHub Actions (run-level)
- Run page: https://github.com/Wikid82/Charon/actions/runs/21865692694
- Run logs (zip): GET https://api.github.com/repos/Wikid82/Charon/actions/runs/21865692694/logs
- Programmatic commands:

  export GITHUB_OWNER=Wikid82
  export GITHUB_REPO=Charon
  export RUN_ID=21865692694
  # Requires GITHUB_TOKEN set with repo access
  curl -H "Accept: application/vnd.github+json" \
    -H "Authorization: token $GITHUB_TOKEN" \
    -L "https://api.github.com/repos/$GITHUB_OWNER/$GITHUB_REPO/actions/runs/$RUN_ID/logs" \
    -o run-${RUN_ID}-logs.zip
  unzip -d run-${RUN_ID}-logs run-${RUN_ID}-logs.zip
- Artifacts list (API):

  curl -H "Authorization: token $GITHUB_TOKEN" \
    "https://api.github.com/repos/$GITHUB_OWNER/$GITHUB_REPO/actions/runs/$RUN_ID/artifacts" | jq '.'

- gh CLI (interactive/script):

  gh run view $RUN_ID --repo $GITHUB_OWNER/$GITHUB_REPO --log > run-$RUN_ID-summary.log
  gh run download $RUN_ID --repo $GITHUB_OWNER/$GITHUB_REPO --dir artifacts-$RUN_ID
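The API endpoints above are easy to mistype when switching between run logs, artifacts, and jobs. A small generator can print them from the owner/repo/run id; this is a hypothetical convenience helper, not a script that exists in the repo:

```shell
# gh_run_urls — hypothetical helper (not in the repo): print the REST
# endpoints used in this section for a given owner/repo/run id, so the
# same URLs can be fed to curl without copy/paste typos.
gh_run_urls() {
  owner="$1"; repo="$2"; run_id="$3"
  base="https://api.github.com/repos/${owner}/${repo}/actions"
  echo "${base}/runs/${run_id}/logs"
  echo "${base}/runs/${run_id}/artifacts"
  echo "${base}/runs/${run_id}/jobs"
}

gh_run_urls Wikid82 Charon 21865692694
```

Each printed URL corresponds to one of the curl commands above (run logs zip, artifact listing, job listing).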
GitHub Actions (job-level)
- List jobs for the run and find the Playwright shard job(s):

  curl -H "Authorization: token $GITHUB_TOKEN" \
    "https://api.github.com/repos/$GITHUB_OWNER/$GITHUB_REPO/actions/runs/$RUN_ID/jobs" | jq '.jobs[] | {id: .id, name: .name, runner_name: .runner_name, started_at: .started_at, completed_at: .completed_at}'

- For the JOB_ID identified as the shard job, download the job logs:

  curl -H "Authorization: token $GITHUB_TOKEN" -L \
    "https://api.github.com/repos/$GITHUB_OWNER/$GITHUB_REPO/actions/jobs/$JOB_ID/logs" -o job-${JOB_ID}-logs.zip
  unzip -d job-${JOB_ID}-logs job-${JOB_ID}-logs.zip
Playwright test outputs used by this project
- Search and collect the following files in the repo root (or workflow-run directories):
  - playwright.config.ts, playwright.config.js, playwright.config.mjs
  - package.json scripts invoking Playwright (e.g., test:e2e, e2e:ci)
  - .github/workflows/* steps that run Playwright
- Typical Playwright outputs to collect (per-shard):
  - <outputDir>/trace.zip
  - <outputDir>/test-results.json or test-results/*
  - <outputDir>/video/*
  - <outputDir>/*.log (stdout/stderr)
- Observed local example (for context): the developer ran

  npx playwright test --project=chromium --output=/tmp/playwright-chromium-output --reporter=list > /tmp/playwright-chromium.log 2>&1

  Look for similar invocations in workflows/scripts.
Repository container logs (containers/)
- containers/charon:
  - Files to check: containers/charon/docker-compose.yml, any logs/ or data/ directories under containers/charon/.
  - Local commands (when reproducing):

    docker compose -f containers/charon/docker-compose.yml logs --no-color --timestamps > containers-charon-logs.txt
    docker logs --timestamps --since "1h" charon-e2e > charon-e2e.log 2>&1 || true

- containers/caddy:
  - Files: containers/caddy/Caddyfile, containers/caddy/config/, containers/caddy/logs/
  - Local checks:

    docker logs --timestamps caddy > caddy.log 2>&1 || true
    curl -sS http://127.0.0.1:2019/ || true  # admin
    curl -sS http://127.0.0.1:2020/ || true  # emergency
2) Analyze — specific files and config to review (exact paths)
- Workflows (search these paths):
  - .github/workflows/*.yml: likely candidates are .github/workflows/e2e.yml, .github/workflows/ci.yml, .github/workflows/playwright.yml (run grep -R "playwright" .github/workflows || true).
  - Look for timeout-minutes: either at the top level of the workflow or under jobs.<job>.timeout-minutes.
- Playwright config files:
  - /projects/Charon/playwright.config.ts, /projects/Charon/playwright.config.js, /projects/Charon/playwright.config.mjs
  - Inspect the projects, workers, retries, outputDir, and reporter sections.
- package.json and scripts:
  - /projects/Charon/package.json: inspect scripts for e.g. test:e2e, e2e:ci and the exact Playwright CLI flags used by CI.
- GitHub skill scripts & E2E runner:
  - .github/skills/scripts/skill-runner.sh: used in docs and testing instructions; check for docker-rebuild-e2e and test-e2e-playwright-coverage.
  - Commands:

    sed -n '1,240p' .github/skills/scripts/skill-runner.sh
    grep -n "docker-rebuild-e2e\|test-e2e-playwright-coverage\|playwright" -R .github/skills || true
- Makefile:
  - /projects/Charon/Makefile: search for targets related to e2e, playwright, and rebuild.
3) Steps to download GitHub Actions logs & artifacts for run 21865692694
Programmatic (API)
- List artifacts for run:
curl -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/Wikid82/Charon/actions/runs/21865692694/artifacts" | jq '.'
- Download run logs (zip):
curl -H "Authorization: token $GITHUB_TOKEN" -L \
"https://api.github.com/repos/Wikid82/Charon/actions/runs/21865692694/logs" -o run-21865692694-logs.zip
unzip -d run-21865692694-logs run-21865692694-logs.zip
- List jobs to find Playwright shard job id(s):
curl -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/Wikid82/Charon/actions/runs/21865692694/jobs" | jq '.jobs[] | {id: .id, name: .name, runner_name: .runner_name, started_at: .started_at, completed_at: .completed_at}'
- Download job logs by JOB_ID:
curl -H "Authorization: token $GITHUB_TOKEN" -L \
"https://api.github.com/repos/Wikid82/Charon/actions/jobs/$JOB_ID/logs" -o job-$JOB_ID-logs.zip
unzip -d job-$JOB_ID-logs job-$JOB_ID-logs.zip
Using gh CLI
gh run view 21865692694 --repo Wikid82/Charon --log > run-21865692694-summary.log
gh run download 21865692694 --repo Wikid82/Charon --dir artifacts-21865692694
Manual web UI
- Visit run page and download artifacts and job logs from the job view.
4) How to locate shard-specific logs and correlate shard indices to tests
- Typical patterns to inspect:
  - Look for Playwright CLI flags in the job step (e.g., --shard=INDEX/TOTAL, --output=/tmp/...).
  - If the job ran npx playwright test --output=/tmp/..., search the downloaded job logs for that exact command to find the shard index.
- Commands to list tests assigned to a shard (dry-run):
# Show which tests a given shard would run (no execution)
npx playwright test --list --shard=INDEX/TOTAL
# Or run with reporter=list (shows test items as executed)
npx playwright test --shard=INDEX/TOTAL --reporter=list
- Note: Playwright shard indices are one-based (--shard=1/4 through --shard=4/4). If CI logs show --shard=3/4, that is the third of four shards; confirm exactly which tests it covers by re-running the --list command with the same values.
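To build a full shard-to-tests mapping without executing anything, the --list dry-run can be generated for every shard of the split. A minimal sketch; the generator only prints the commands, and piping them to sh assumes npx and the Playwright dependencies are installed:

```shell
# gen_shard_list_cmds — hypothetical sketch: emit one `playwright test --list`
# dry-run command per shard of an N-way split. Piping the output to `sh`
# (with Playwright installed) prints the tests each shard would run.
gen_shard_list_cmds() {
  total="$1"
  i=1
  while [ "$i" -le "$total" ]; do
    echo "npx playwright test --list --shard=${i}/${total}"
    i=$((i + 1))
  done
}

gen_shard_list_cmds 4
```

Example use: gen_shard_list_cmds 4 | sh > shard-map.txt, then search shard-map.txt for the specs attributed to shard 3.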
Expected per-shard artifact names (if implemented):
- e2e-shard-<INDEX>-output containing trace.zip, video/*, test-results.json, and shard-specific logs (stdout/stderr files).
5) Runner/container logs to inspect
- GitHub-hosted runner: review the Actions job logs for runner messages and any Runner diagnostic lines. You cannot access host-level logs.
- Self-hosted runner (if used): retrieve host system logs (requires access to the runner host):

  sudo journalctl -u actions.runner.* -n 1000 > runner-service-journal.log
  sudo journalctl -k --since "1 hour ago" | grep -i oom > runner-kernel-oom.log || true
  sudo journalctl -u docker.service -n 200 > docker-journal.log

- Docker container logs (charon, caddy, charon-e2e):

  docker ps -a --filter "name=charon" --format "{{.Names}} {{.Status}}" > containers-ps.txt
  docker logs --since "1h" charon-e2e > charon-e2e.log 2>&1 || true
  docker logs --since "1h" caddy > caddy.log 2>&1 || true
Check Caddy admin/emergency ports (2019 & 2020) to confirm the proxy was healthy during the test run:
curl -sS --max-time 5 http://127.0.0.1:2019/ || echo "admin not responding"
curl -sS --max-time 5 http://127.0.0.1:2020/ || echo "emergency not responding"
6) Hypotheses for why Shard 3 stopped at ~30m (descriptions + exact artifacts to search)
H1 — Workflow/job timeout configured smaller than expected
- Search:
  - .github/workflows/* for timeout-minutes:
  - job logs for Timeout or Job execution time exceeded
- Commands:

  grep -n "timeout-minutes" .github/workflows -R || true
  grep -i "timeout" -R run-${RUN_ID}-logs || true

- Confirmed by: timeout-minutes: 30 or job logs showing aborting execution due to timeout.
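H1 can also be checked mechanically by scanning workflow files for any timeout-minutes value below the expected ~40. A hypothetical sketch in POSIX sh + awk; find_short_timeouts is illustrative, not an existing repo script:

```shell
# find_short_timeouts — hypothetical sketch: flag `timeout-minutes:` values
# below a threshold in the given files/directories. A hit of 30 or less
# would support H1 for the ~30 minute shard stop.
find_short_timeouts() {
  threshold="$1"; shift
  grep -RHn "timeout-minutes:" "$@" 2>/dev/null |
    awk -F'timeout-minutes:' -v t="$threshold" '
      { v = $2 + 0; if (v > 0 && v < t) print "SHORT:", $0 }'
}
```

Example use: find_short_timeouts 40 .github/workflows prints every workflow line whose configured timeout is under 40 minutes.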
H2 — Runner preemption / connection loss
- Search job logs for: Runner lost, The runner has been shutdown, Connection to the server was lost.
- Commands:

  grep -iE "runner lost|runner.*shutdown|connection.*lost|Job canceled|cancelled by" -R run-${RUN_ID}-logs || true

- Confirmed by: runner disconnect lines and an abrupt end of logs with no Playwright stack trace.
H3 — E2E environment container (charon/caddy) died or became unhealthy
- Search container logs for crash/fatal/panic messages and timestamps matching the job stop time.
- Commands:

  docker ps -a --filter "name=charon" --format '{{.Names}} {{.Status}}'
  docker logs charon-e2e --since "2h" | sed -n '1,200p'
  grep -iE "panic|fatal|segfault|exited|health.*unhealthy|503|502" containers -R || true

- Confirmed by: a container exit matching the job finish time and Caddy returning 502/503 during the run.
H4 — Playwright/Node process killed by OOM
- Search for Killed, kernel oom_reaper lines, and system dmesg output.
- Commands:

  grep -R "Killed" job-${JOB_ID}-logs || true
  # on the self-hosted runner host
  sudo journalctl -k --since '2 hours ago' | grep -i oom || true

- Confirmed by: kernel OOM logs at the same timestamp, or Killed in the job logs.
H5 — Script-level early timeout (explicit timeout 30m or kill)
- Search .github/skills and workflow steps for timeout 30m, timeout 1800, or kill calls.
- Commands:

  grep -R "\btimeout\b\|kill -9\|kill -15\|pkill" -n .github || true

- Confirmed by: a script wrapper such as timeout 30m used in the job.
H6 — Misinterpreted units or mis-configuration (seconds vs minutes)
- Search for numeric values used in scripts and steps (e.g., 1800 used where minutes were expected).
- Commands:

  grep -R "\b1800\b\|\b3600\b\|timeout-minutes" -n .github || true

- Confirmed by: a value of 1800 where timeout-minutes (or a similar minutes-based setting) was expected.
For each hypothesis, the exact lines/entries returned by the grep/journal/docker commands are the evidence to confirm or refute it. Keep timestamps to correlate with the job start/completion times in the run logs.
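The per-hypothesis greps above can be bundled into a single pass over an extracted log file, so triage of a fresh log is one command. A minimal, hypothetical sketch; the patterns mirror the searches above and should be tuned as evidence accumulates:

```shell
# triage_log — hypothetical sketch: scan one extracted job log file for the
# H1–H5 signatures described above and print which hypotheses have
# supporting evidence in that file.
triage_log() {
  log="$1"
  grep -qiE "timeout-minutes|aborting.*timeout|exceeded"     "$log" && echo "H1: job/workflow timeout"
  grep -qiE "runner lost|runner.*shutdown|connection.*lost"  "$log" && echo "H2: runner preemption/disconnect"
  grep -qiE "panic|fatal|segfault|unhealthy|502|503"         "$log" && echo "H3: container/proxy failure"
  grep -qiE "killed|oom_reaper|out of memory"                "$log" && echo "H4: OOM kill"
  grep -qiE "timeout 30m|timeout 1800"                       "$log" && echo "H5/H6: explicit script timeout"
  return 0
}
```

Example use: for f in run-21865692694-logs/*; do echo "== $f"; triage_log "$f"; done — then correlate each hit's timestamp with the job start/completion times.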
7) Prioritized remediation plan (short-term → long-term)
Short-term (unblock re-runs quickly)
- Download and attach all logs/artifacts for run 21865692694 (use gh run download) and share them with the E2E test author.
- Temporarily bump timeout-minutes for the failing workflow to 60 to allow full runs while diagnosing.
- Add an if: always() step to the E2E job that collects diagnostics and uploads them as artifacts (free, dmesg, ps aux, docker ps -a, docker logs charon-e2e).
- Re-run just the failing shard with DEBUG=pw:api and PWDEBUG=1 set, and persist the shard outputs.
Medium-term
- Persist per-shard Playwright outputs via actions/upload-artifact@v4 for traces/videos/test-results.
- Add Playwright retries for transient failures, plus the --trace/--video options.
- Add a CI smoke check before full shard execution to confirm environment health.
- If self-hosted, add runner health checks and alerting (memory, disk, Docker status).
Long-term
- Implement stable test splitting based on historical test durations rather than equal-file sharding.
- Introduce resource constraints and monitoring to protect against OOM and flapping containers.
- Build a golden-minimal E2E smoke job that must pass before running full shards.
8) Minimal reproduction checklist (local)
- Rebuild E2E image used by CI (per repo skill):
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e
- Start the environment (example):
docker compose -f containers/charon/docker-compose.yml up -d
- Set base URL and run the same shard (replace INDEX/TOTAL with values from CI):
export PLAYWRIGHT_BASE_URL=http://localhost:5173
DEBUG=pw:api PWDEBUG=1 \
npx playwright test --shard=INDEX/TOTAL --project=chromium \
--output=/tmp/playwright-shard-INDEX --reporter=list > /tmp/playwright-shard-INDEX.log 2>&1
- If reproducing a timeout, immediately collect:
docker ps -a --format '{{.Names}} {{.Status}}' > reproduce-docker-ps.txt
docker logs --since '1h' charon-e2e > reproduce-charon-e2e.log || true
tail -n 500 /tmp/playwright-shard-INDEX.log > reproduce-pw-tail.log
9) Required workflow/scripts changes to improve diagnostics & prevent recurrence
- Add timeout-minutes: 60 to .github/workflows/<e2e workflow>.yml while diagnosing; later set it to a reasoned SLA (e.g., 50m).
- Add an always() step to collect diagnostics on failure and upload artifacts. Example YAML snippet:

  - name: Collect diagnostics
    if: always()
    run: |
      uptime > uptime.txt
      free -m > free-m.txt
      df -h > df-h.txt
      ps aux > ps-aux.txt
      docker ps -a > docker-ps.txt || true
      docker logs --tail 500 charon-e2e > docker-charon-e2e.log || true
  - uses: actions/upload-artifact@v4
    with:
      name: e2e-diagnostics-${{ github.run_id }}
      path: |
        uptime.txt
        free-m.txt
        df-h.txt
        ps-aux.txt
        docker-ps.txt
        docker-charon-e2e.log

- Ensure each Playwright shard runs with --output pointing to a shard-specific path and upload that path as an artifact; artifact name convention: e2e-shard-${{ matrix.index }}-output.
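A sketch of what the per-shard invocation and upload could look like, assuming a matrix with index and total variables; the step names and paths are illustrative, not taken from the actual workflow:

```yaml
# Hypothetical matrix job fragment — adjust names/paths to the real workflow.
- name: Run Playwright shard
  run: >
    npx playwright test
    --shard=${{ matrix.index }}/${{ matrix.total }}
    --output=e2e-shard-${{ matrix.index }}-output
- uses: actions/upload-artifact@v4
  if: always()   # upload traces even when the shard is cancelled or times out
  with:
    name: e2e-shard-${{ matrix.index }}-output
    path: e2e-shard-${{ matrix.index }}-output
```

The if: always() on the upload step is the important part: it preserves traces for cancelled or timed-out shards, which is exactly the evidence currently missing.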
10) People/roles to notify & recommended next actions
- Notify:
  - CI/Infra owner or the person in CODEOWNERS for .github/workflows
  - E2E test author(s) (owners of the failing tests)
  - Self-hosted runner owner (if runner_name in the job JSON indicates self-hosted)
- Recommended immediate actions for them:
  - Download run artifacts and job logs for run 21865692694 and share them with the test author.
  - Re-run the shard with DEBUG=pw:api and PWDEBUG=1 enabled, and ensure per-shard artifacts are uploaded.
  - If self-hosted, check runner host kernel logs for OOM and Docker container exits at the job time.
11) Verification steps (post-remediation)
- Re-run E2E workflow end-to-end; verify Shard 3 completes.
- Confirm the e2e-shard-3-output artifact exists and contains trace.zip, video/*, and test-results.json.
- Confirm there are no oom_reaper or Killed messages in runner host logs during the run.
Appendix — quick extraction commands summary
# Download all artifacts and logs for RUN_ID
gh run download 21865692694 --repo Wikid82/Charon --dir ./artifacts-21865692694
# List jobs and find Playwright shard job(s)
curl -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/Wikid82/Charon/actions/runs/21865692694/jobs" | jq '.jobs[] | {id: .id, name: .name, runner_name: .runner_name, started_at: .started_at, completed_at: .completed_at}'
# Download job logs for JOB_ID
curl -H "Authorization: token $GITHUB_TOKEN" -L \
"https://api.github.com/repos/Wikid82/Charon/actions/jobs/$JOB_ID/logs" -o job-$JOB_ID-logs.zip
unzip -d job-$JOB_ID-logs job-$JOB_ID-logs.zip
# Grep for likely causes
grep -iE "timeout|minut|runner lost|cancelled|Killed|OOM|oom_reaper|Out of memory|panic|fatal" -R run-21865692694-logs || true
Next three immediate actions (checklist)
- Run gh run download 21865692694 --repo Wikid82/Charon --dir ./artifacts-21865692694 and unzip the run logs.
- Search the downloaded logs for timeout-minutes, Runner lost, Killed, and oom_reaper to triage H1–H4.
- Re-run the failing shard locally with DEBUG=pw:api PWDEBUG=1 and --output=/tmp/playwright-shard-INDEX, capture outputs, and upload them as artifacts.
---
post_title: "E2E Test Remediation Plan"
author1: "Charon Team"
post_slug: "e2e-test-remediation-plan"
microsoft_alias: "charon-team"
featured_image: "https://wikid82.github.io/charon/assets/images/featured/charon.png"
categories: ["testing"]
tags: ["playwright", "e2e", "remediation", "security"]
ai_note: "true"
summary: "Phased remediation plan for Charon Playwright E2E tests, covering inventory, dependencies, runtime estimates, and quick start commands."
post_date: "2026-01-28"
---
1. Introduction
This plan replaces the current spec with a comprehensive, phased remediation strategy for the Playwright E2E test suite under the tests/ directory. The goal is to stabilize execution, align dependencies, and sequence remediation work so that core management flows, security controls, and integration workflows become reliable in Docker-based E2E runs.
2. Research Findings
2.1 Test Harness and Global Dependencies
- Global setup and teardown are enforced by tests/global-setup.ts, tests/auth.setup.ts, and tests/security-teardown.setup.ts.
- Global setup validates the emergency token, checks health endpoints, and resets security settings, which impacts all security-enforcement suites.
- Multiple suites depend on the emergency server (port 2020) and Cerberus modules with explicit admin whitelist configuration.
2.2 Test Inventory and Feature Areas
- Core management flows: authentication, navigation, dashboard, proxy hosts, certificates, access lists in tests/core.
- DNS providers and ACME workflows: tests/dns-provider-crud.spec.ts, tests/dns-provider-types.spec.ts, tests/manual-dns-provider.spec.ts.
- Monitoring: uptime and log streaming in tests/monitoring.
- Settings: system, account, SMTP, notifications, encryption, user management in tests/settings.
- Tasks and imports: backups, Caddyfile import flows, CrowdSec import, and log viewing in tests/tasks.
- Security UI: dashboard, WAF, CrowdSec, headers, rate limiting, and audit logs in tests/security.
- Security enforcement: ACL, WAF, rate limits, CrowdSec, emergency token, and break-glass recovery in tests/security-enforcement.
- Integration workflows: cross-feature scenarios in tests/integration.
- Browser-specific regressions for import flows in tests/webkit-specific and tests/firefox-specific.
- Debug and diagnostics: certificates and Caddy import debug coverage in tests/debug/certificates-debug.spec.ts, tests/tasks/caddy-import-gaps.spec.ts, tests/tasks/caddy-import-cross-browser.spec.ts, and tests/debug.
- UI triage and regression coverage: dropdown/modal coverage in tests/modal-dropdown-triage.spec.ts and tests/proxy-host-dropdown-fix.spec.ts.
- Shared utilities validation: wait helpers in tests/utils/wait-helpers.spec.ts.
2.3 Dependency and Ordering Constraints
- The security-enforcement suite assumes Cerberus can be toggled on, and its final tests intentionally restore the admin whitelist state (see tests/security-enforcement/zzzz-break-glass-recovery.spec.ts).
- Admin whitelist blocking is designed to run last using a zzz prefix (see tests/security-enforcement/zzz-admin-whitelist-blocking.spec.ts).
- Emergency server tests depend on port 2020 availability (see tests/security-enforcement/emergency-server).
- Some import suites use real APIs and TestDataManager cleanup; others mock requests. Remediation must avoid mixing mocked and real flows in a single phase without clear isolation.
2.4 Runtime and Flake Hotspots
- Security-enforcement suites include extended retries, network propagation delays, and rate limit loops.
- Import debug and gap-coverage suites perform real uploads, data creation, and commit flows, making them sensitive to backend state and Caddy reload timing.
- Monitoring WebSocket tests require stable log streaming state.
3. Technical Specifications
3.1 Test Grouping and Shards
- Foundation: global setup, auth storage state, security teardown.
- Core UI: authentication, navigation, dashboard, proxy hosts, certificates, access lists.
- Settings: system, account, SMTP, notifications, encryption, users.
- Tasks: backups, logs, Caddyfile import, CrowdSec import.
- Monitoring: uptime monitoring and real-time logs.
- Security UI: Cerberus dashboard, WAF config, headers, rate limiting, CrowdSec config, audit logs.
- Security Enforcement: ACL/WAF/CrowdSec/rate limit enforcement, emergency token and break-glass recovery, admin whitelist blocking.
- Integration: proxy + cert, proxy + DNS, backup restore, import workflows, multi-feature workflows.
- Browser-specific: WebKit and Firefox import regressions.
- Debug/POC: diagnostics and investigation suites (Caddy import debug).
3.2 Dependency Graph (High-Level)
flowchart TD
A[global-setup + auth.setup] --> B[Core UI + Settings]
A --> C[Tasks + Monitoring]
A --> D[Security UI]
D --> E[Security Enforcement]
E --> F[Break-Glass Recovery]
B --> G[Integration Workflows]
C --> G
G --> H[Browser-specific Suites]
3.3 Runtime Estimates (Docker Mode)
| Group | Suite Examples | Expected Runtime | Prerequisites |
|---|---|---|---|
| Foundation | global setup + auth | 1-2 min | Docker E2E container, emergency token |
| Core UI | core specs | 6-10 min | Auth storage state, clean data |
| Settings | settings specs | 6-10 min | Auth storage state |
| Tasks | backups/import/logs | 10-16 min | Auth storage state, API mocks and real flows |
| Monitoring | monitoring specs | 5-8 min | WebSocket stability |
| Security UI | security specs | 10-14 min | Cerberus enabled, admin whitelist |
| Security Enforcement | enforcement specs | 15-25 min | Emergency token, port 2020, admin whitelist |
| Integration | integration specs | 12-20 min | Stable core + settings + tasks |
| Browser-specific | firefox/webkit | 8-12 min | Import baseline stable |
| Debug/POC | caddy import debug | 4-6 min | Docker logs available |
Assumed worker count: 4 (default), except security-enforcement, which requires --workers=1. Serial execution increases runtime for the enforcement suites.
3.4 Environment Preconditions
- E2E container built and healthy via .github/skills/scripts/skill-runner.sh docker-rebuild-e2e.
- Ports 8080 (UI/API) and 2020 (emergency server) reachable.
- CHARON_EMERGENCY_TOKEN configured and valid.
- Admin whitelist includes test runner ranges when Cerberus is enabled.
- Caddy admin health endpoints reachable for import workflows.
3.5 Emergency Server and Security Prerequisites
- Port 2020 (emergency server) available and reachable for tests/security-enforcement/emergency-server.
- Port 2019 is reserved for the Caddy admin API; use 2020 for emergency server tests to avoid conflicts.
- Basic Auth credentials are required for emergency server tests. Defaults in test fixtures are admin/changeme and should match the E2E compose config.
- Admin whitelist bypass must be configured before enforcement tests that toggle Cerberus settings.
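The preconditions in 3.4 and 3.5 lend themselves to a quick pre-flight probe before any shard starts. A hypothetical sketch; e2e_precheck is illustrative, and the endpoints passed to it should match the real compose configuration:

```shell
# e2e_precheck — hypothetical sketch: probe each given endpoint and fail fast
# if any is unreachable, so shards are not started against a dead environment.
e2e_precheck() {
  rc=0
  for url in "$@"; do
    if curl -sSf --max-time 5 "$url" >/dev/null 2>&1; then
      echo "ok:   $url"
    else
      echo "FAIL: $url"
      rc=1
    fi
  done
  return $rc
}
```

Example use: e2e_precheck http://localhost:8080/ http://localhost:2020/ || exit 1 as a gating step before the enforcement suites.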
4. Implementation Plan
Phase 1: Foundation and Test Harness Reliability
Objective: Ensure the shared test harness is stable before touching feature flows.
- Validate global setup and storage state creation (see tests/global-setup.ts and tests/auth.setup.ts).
- Confirm emergency server availability and credentials for break-glass suites.
- Establish baseline run for core login/navigation suites.
Estimated runtime: 2-4 minutes
Success criteria:
- Storage state created once and reused without re-auth flake.
- Emergency token validation passes and security reset executes.
Phase 2: Core UI, Settings, Monitoring, and Task Flows
Objective: Remediate the highest-traffic user journeys and tasks.
- Core UI: authentication, navigation, dashboard, proxy hosts, certificates, access lists (core CRUD and navigation).
- Settings: system, account, SMTP, notifications, encryption, users.
- Monitoring: uptime and real-time logs.
- Tasks: backups, logs viewing, and base Caddyfile import flows.
- Include modal/dropdown triage coverage and wait helpers validation.
Estimated runtime: 25-40 minutes
Success criteria:
- Core CRUD and navigation pass without retries.
- Monitoring WebSocket tests pass without timeouts.
- Backups and log viewing flows pass with mocks and deterministic waits.
Phase 3: Security UI and Enforcement
Objective: Stabilize Cerberus UI configuration and enforcement workflows.
- Security dashboard and configuration pages.
- WAF, headers, rate limiting, CrowdSec, audit logs.
- Enforcement suites, including emergency token and whitelist blocking order.
Estimated runtime: 30-45 minutes
Success criteria:
- Security UI toggles and pages load without state leakage.
- Enforcement suites pass with Cerberus enabled and whitelist configured.
- Break-glass recovery restores bypass state for subsequent suites.
Phase 4: Integration, Browser-Specific, and Debug Suites
Objective: Close cross-feature and browser-specific regressions.
- Integration workflows: proxy + cert, proxy + DNS, backup restore, import to production, multi-feature workflows.
- Browser-specific Caddy import regressions (Firefox/WebKit).
- Debug/POC suites (Caddy import debug, diagnostics) run as opt-in, including caddy-import-gaps and cross-browser import coverage.
Estimated runtime: 25-40 minutes
Success criteria:
- Integration workflows pass with stable TestDataManager cleanup.
- Browser-specific import tests show consistent API request handling.
- Debug suites remain optional and do not block core pipelines.
5. Acceptance Criteria (EARS)
- WHEN the E2E harness initializes, THE SYSTEM SHALL validate emergency token and create a reusable auth state without flake.
- WHEN core management tests execute, THE SYSTEM SHALL complete CRUD flows without manual retries or timeouts.
- WHEN security enforcement suites execute, THE SYSTEM SHALL apply Cerberus settings with admin whitelist bypass and SHALL restore security state after completion.
- WHEN integration workflows execute, THE SYSTEM SHALL complete cross-feature journeys without data collisions or residual state.
6. Quick Start Commands
# Rebuild and start E2E container
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e
# PHASE 1: Foundation
cd /projects/Charon
npx playwright test tests/global-setup.ts tests/auth.setup.ts --project=firefox
# PHASE 2: Core UI, Settings, Tasks, Monitoring
# NOTE: PLAYWRIGHT_SKIP_SECURITY_DEPS=1 is automatically set in E2E scripts
# Security suites will NOT execute as dependencies
npx playwright test tests/core --project=firefox
npx playwright test tests/settings --project=firefox
npx playwright test tests/tasks --project=firefox
npx playwright test tests/monitoring --project=firefox
# PHASE 3: Security UI and Enforcement (SERIAL)
npx playwright test tests/security --project=firefox
npx playwright test tests/security-enforcement --project=firefox --workers=1
# PHASE 4: Integration, Browser-Specific, Debug (Optional)
npx playwright test tests/integration --project=firefox
npx playwright test tests/firefox-specific --project=firefox
npx playwright test tests/webkit-specific --project=webkit
npx playwright test tests/debug --project=firefox
npx playwright test tests/tasks/caddy-import-gaps.spec.ts --project=firefox
7. Risks and Mitigations
- Risk: Security suite state leaks across tests. Mitigation: enforce admin whitelist reset and break-glass recovery ordering.
- Risk: File-name ordering (zzz-) is not enforced without --workers=1. Mitigation: document the --workers=1 requirement and make it mandatory in CI and quick-start commands.
- Risk: Emergency server unavailable. Mitigation: gate enforcement suites on health checks and document the port 2020 requirements.
- Risk: Import suites combine mocked and real flows. Mitigation: isolate by phase and keep debug suites opt-in.
- Risk: Missing test suites hide regressions. Mitigation: inventory now includes all suites and maps them to phases.
8. Dependencies and Impacted Files
- Harness: tests/global-setup.ts, tests/auth.setup.ts, tests/security-teardown.setup.ts.
- Core UI: tests/core.
- Settings: tests/settings.
- Tasks: tests/tasks.
- Monitoring: tests/monitoring.
- Security UI: tests/security.
- Security enforcement: tests/security-enforcement.
- Integration: tests/integration.
- Browser-specific: tests/firefox-specific, tests/webkit-specific.
9. Confidence Score
Confidence: 79 percent
Rationale: The suite inventory and dependencies are well understood. The main unknowns are timing-sensitive security propagation and emergency server availability in varied environments.
Review Feedback & Required Additions
Summary: the spec is thorough and well-structured but is missing several concrete forensic and reproduction details needed to reliably diagnose shard timeouts and to make CI-side fixes repeatable. The items below add those missing artifacts, commands, and prioritized mitigations.
- Test-forensics (how to analyze Playwright traces & map failing tests to shards)
- Extract and open traces per shard: unzip the artifact and run:

  unzip e2e-shard-<INDEX>-output/trace.zip -d /tmp/trace-INDEX
  npx playwright show-trace /tmp/trace-INDEX

- Use the JSON reporter to map test IDs to trace files and timestamps:

  # run locally to produce a reporter JSON for the shard
  npx playwright test --shard=INDEX/TOTAL --project=chromium --reporter=json --output=/tmp/playwright-shard-INDEX --trace=on > /tmp/playwright-shard-INDEX.json
  jq '.suites[].specs[]?.tests[] | {title: .title, file: .location.file, line: .location.line, duration: .duration, annotations: .annotations}' /tmp/playwright-shard-INDEX.json

- Correlate test start/stop timestamps (from the reporter JSON) with job logs and container logs to find the precise point where execution stopped.
- If only one test is hanging, use --grep (or pass the spec file path directly) to re-run that test with --trace=on and DEBUG=pw:api, and capture the trace and stdout.
- CI / Workflow checks (where to inspect timeouts and cancellation causes)
- Inspect .github/workflows/*.yml for both top-level timeout-minutes: and job-level jobs.<job>.timeout-minutes:

  grep -n "timeout-minutes" .github/workflows -R || true

- From the run/job JSON (API), check the status and conclusion fields and any cancellation timestamps:

  curl -H "Authorization: token $GITHUB_TOKEN" \
    "https://api.github.com/repos/$GITHUB_OWNER/$GITHUB_REPO/actions/jobs/$JOB_ID" | jq '.'

- Search job logs for runner messages indicating preemption, OOM, or cancellation:

  grep -iE "Job canceled|cancelled|runner lost|Runner|Killed|OOM|oom_reaper|Timeout" -R job-$JOB_ID-logs || true

- Confirm whether the runner was self-hosted (job JSON runner_name / runner_group_id). If self-hosted, collect journalctl and Docker host logs for the timestamp window.
- Reproduction instructions (how to reproduce the shard locally exactly)
- Rebuild the image used by CI (recommended, to match CI):

  .github/skills/scripts/skill-runner.sh docker-rebuild-e2e

- Start the E2E environment (use the same compose file used in CI):

  docker compose -f containers/charon/docker-compose.yml up -d

- Environment variables to set (use the values CI uses):
  - PLAYWRIGHT_BASE_URL: CI base URL (e.g. http://localhost:8080 for Docker mode; http://localhost:5173 for Vite dev).
  - CHARON_EMERGENCY_TOKEN: emergency token used by tests.
  - PLAYWRIGHT_JOBS or PWDEBUG as needed: DEBUG=pw:api PWDEBUG=1.
  - Optional toggles used in CI: PLAYWRIGHT_SKIP_SECURITY_DEPS=1.
- Exact shard reproduction command (example matching CI):

  export PLAYWRIGHT_BASE_URL=http://localhost:8080
  export CHARON_EMERGENCY_TOKEN=changeme
  DEBUG=pw:api PWDEBUG=1 \
    npx playwright test --shard=INDEX/TOTAL --project=chromium \
    --output=/tmp/playwright-shard-INDEX --reporter=json --trace=on > /tmp/playwright-shard-INDEX.log 2>&1

- To re-run a single failing test found in the JSON:

  npx playwright test tests/path/to/spec.ts -g "Exact test title" --project=chromium --trace=on --output=/tmp/playwright-single
- Required artifacts & evidence to collect (exact list and commands)
- Per-shard Playwright outputs: trace.zip, video/*, test-results.json or reporter JSON, and the shard stdout/stderr log. Ensure --output points to a shard-specific path and upload it as an artifact.
- Job-level artifacts: GitHub Actions run logs ZIP, job logs ZIP, gh run download output.
- Runner/host diagnostics (self-hosted): journalctl -u actions.runner.*, dmesg | grep -i oom, sudo journalctl -u docker.service, docker ps -a, docker logs --since for charon-e2e and caddy.
- Capture a timestamped mapping file that lists: job start, shard start, last test start, last trace timestamp, job end. Example CSV header: job_id,job_start,shard_index,shard_start,last_test_started_at,job_end,conclusion.
- Attach a minimal repro package: Docker image tag, docker-compose file, the exact Playwright command line, and the failing test id/title.
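The mapping file can be built incrementally as evidence is collected. A hypothetical helper; timeline_append is illustrative, not an existing script:

```shell
# timeline_append — hypothetical sketch: append one row to the timeline CSV
# described above, writing the header first if the file is new. The seven
# positional args after the file name map to the seven header columns.
timeline_append() {
  csv="$1"; shift
  [ -f "$csv" ] || echo "job_id,job_start,shard_index,shard_start,last_test_started_at,job_end,conclusion" > "$csv"
  printf '%s,%s,%s,%s,%s,%s,%s\n' "$@" >> "$csv"
}
```

Example use: timeline_append triage.csv 123456 2026-02-10T12:00:00Z 3 2026-02-10T12:01:10Z 2026-02-10T12:29:55Z 2026-02-10T12:30:02Z cancelled.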
- Prioritization of fixes and quick mitigations (concrete)
- P0 (Immediate unblock):
  - Temporarily increase timeout-minutes to 60 for the failing workflow; add an if: always() diagnostics step and artifact upload.
  - Ensure each shard uses a per-shard --output and is uploaded (actions/upload-artifact) so traces are available even on cancellation.
  - Re-run the failing shard locally with DEBUG=pw:api PWDEBUG=1 and collect traces.
- P1 (Same-day):
  - Add a CI smoke healthcheck step that validates the UI and emergency server before shards start (quick curl checks and a small Playwright smoke test).
  - If using a self-hosted runner, add a simple resource guard (systemd service restart prevention) and an OOM monitoring alert.
  - Configure Playwright retries for flaky tests (a small number) and run expensive suites with --workers=1.
- P2 (Next sprint):
- Implement historical-duration-based shard splitting to avoid heavy concentration in one shard.
- Add test-level tagging and targeted prioritization for long-running security-enforcement suites.
- Add CI-level telemetry: test-duration history, flaky-test dashboard.
Verdict: NEEDS CHANGES — the existing spec is a solid base, but add the forensic commands, reproducible shard reproduction steps, explicit artifact list, and CI checks above before marking this plan approved.
Actionable next steps (short list):
- Add the always() diagnostics step to .github/workflows/<e2e-workflow>.yml and upload diagnostics as artifacts.
- Modify the E2E job to set --output to e2e-shard-${{ matrix.index }}-output and upload that path.
- Run gh run download 21865692694 and extract the per-job logs; parse the job JSON to determine whether the runner was self-hosted, and collect host logs if so.
- Reproduce the failing shard locally using the exact commands above and attach trace.zip and the JSON reporter output to the issue.