Investigation Phase:
Problem:
- Tests hang AFTER global setup completes
- No test execution begins (hung before first test)
- Step timeout (15min) doesn't trigger properly
- Job timeout (45min) eventually kills process after 44min
Changes:
1. Added DEBUG=pw:api to all browser jobs
- Will show exact Playwright API calls
- Pinpoint where execution hangs (auth setup vs browser launch vs test init)
2. Reduced job timeout: 45min → 20min
- Fail faster when tests hang
- Reduces wasted CI resources
- Still allows normal test execution (local: 1.2min)
Expected Outcome:
- Verbose logs reveal hang location
- Faster feedback loop (20min vs 44min)
- Can identify if issue is:
* auth.setup.ts hanging
* Browser process not launching
* Connection issues to application
Next Steps Based on Logs:
- If browser launch hangs: Add dumb-init (Phase 3)
- If auth setup hangs: Investigate cookie/storage state
- If network hangs: Add localhost loopback routing
Phase: 2.5 of 3 (Diagnostic Logging)
See: docs/plans/ci_hang_remediation.md
Resource Constraint Management:
Problem:
- Tests hanging indefinitely during execution in CI
- 2-core runners resource-constrained vs local dev machines
- No timeout enforcement allows tests to run forever
Changes:
1. playwright.config.js:
- Reduced per-test timeout: 90s → 60s (CI only)
- Comment clarifies CI resource constraints
- Local dev keeps 90s for debugging
2. .github/workflows/e2e-tests-split.yml:
- Added timeout-minutes: 15 to all test steps
- Ensures CI fails explicitly after 15 minutes
- Prevents workflow hanging until 6-hour GitHub limit
Expected Outcome:
- Tests fail fast with timeout error instead of hanging
- Clearer debugging: timeout vs hang vs test failure
- CI resources freed up faster for other jobs
Phase: 2 of 3 (Resource Constraints)
See: docs/plans/ci_hang_remediation.md
Remove overly complex verification logic that was causing all browser
jobs to fail. Browser installation should fail fast and clearly if
there are issues.
Changes:
- Remove multi-line verification scripts from all 3 browser install steps
- Simplify to single command: npx playwright install --with-deps {browser}
- Let install step show actual errors if it fails
- Let test execution show "browser not found" errors if install incomplete
Rationale:
- Previous complex verification (using grep/find) was the failure point
- Simpler approach provides clearer error messages for debugging
- Tests themselves will fail clearly if browsers aren't available
Expected outcome:
- Install steps show actual error messages if they fail
- If install succeeds, tests execute normally
- If install "succeeds" but browser is missing, test step shows clear error
Timeout remains at 45 minutes (accommodates 10-15 min install + execution)
- Changed workflow name to reflect sequential execution for stability.
- Reduced test sharding from 4 to 1 per browser, resulting in 3 total jobs.
- Updated job summaries and documentation to clarify execution model.
- Added new documentation file for E2E CI failure diagnosis.
- Adjusted job summary tables to reflect changes in shard counts and execution type.
workflow_run triggers only fire for push events, not pull_request events,
causing PRs to skip integration and E2E tests entirely. Add dual triggers
to all test workflows so they run for both push (via workflow_run) and
pull_request events, while maintaining single-build architecture.
All workflows still pull pre-built images from docker-build.yml - no
redundant builds introduced. This fixes PR test coverage while preserving
the "Build Once, Test Many" optimization for push events.
Fixes: Build Once architecture (commit 928033ec)
Restructures CI/CD pipeline to eliminate redundant Docker image builds
across parallel test workflows. Previously, every PR triggered 5 separate
builds of identical images, consuming compute resources unnecessarily and
contributing to registry storage bloat.
Registry storage was growing at 20GB/week due to unmanaged transient tags
from multiple parallel builds. While automated cleanup exists, preventing
the creation of redundant images is more efficient than cleaning them up.
Changes CI/CD orchestration so docker-build.yml is the single source of
truth for all Docker images. Integration tests (CrowdSec, Cerberus, WAF,
Rate Limiting) and E2E tests now wait for the build to complete via
workflow_run triggers, then pull the pre-built image from GHCR.
PR and feature branch images receive immutable tags that include commit
SHA (pr-123-abc1234, feature-dns-provider-def5678) to prevent race
conditions when branches are updated during test execution. Tag
sanitization handles special characters, slashes, and name length limits
to ensure Docker compatibility.
Adds retry logic for registry operations to handle transient GHCR
failures, with dual-source fallback to artifact downloads when registry
pulls fail. Preserves all existing functionality and backward
compatibility while reducing parallel build count from 5× to 1×.
Security scanning now covers all PR images (previously skipped),
blocking merges on CRITICAL/HIGH vulnerabilities. Concurrency groups
prevent stale test runs from consuming resources when PRs are updated
mid-execution.
Expected impact: 80% reduction in compute resources, 4× faster
total CI time (120min → 30min), prevention of uncontrolled registry
storage growth, and 100% consistency guarantee (all tests validate
the exact same image that would be deployed).
Closes #[issue-number-if-exists]