# CI Workflow Analysis - E2E Timeout Investigation
## Scope
Reviewed CI workflow configuration and the provided E2E job logs to identify timeout and shard-related risks, per sections 2, 3, 7, and 9 of the current spec.
## CI Evidence Collection (Spec Sections 2, 3, 7, 9)
The following commands capture the exact evidence sources used for this investigation.
### Run Artifacts Download (gh)
```bash
gh run download 21865692694 --repo Wikid82/Charon --dir artifacts-21865692694
```
### Job Logs API Call (curl)
```bash
export GITHUB_OWNER=Wikid82
export GITHUB_REPO=Charon
export JOB_ID=<JOB_ID>
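curl -H "Accept: application/vnd.github+json" \
  -H "Authorization: token $GITHUB_TOKEN" \
  -L "https://api.github.com/repos/$GITHUB_OWNER/$GITHUB_REPO/actions/jobs/$JOB_ID/logs" \
  -o "job-$JOB_ID-logs.zip"
# GitHub documents this endpoint as redirecting to a plain-text log; the unzip
# step below only applies if the downloaded file is actually an archive.
unzip -d "job-$JOB_ID-logs" "job-$JOB_ID-logs.zip"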
```
### Artifact List API Call (curl)
```bash
export GITHUB_OWNER=Wikid82
export GITHUB_REPO=Charon
export RUN_ID=21865692694
curl -H "Accept: application/vnd.github+json" \
  -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/$GITHUB_OWNER/$GITHUB_REPO/actions/runs/$RUN_ID/artifacts" | jq '.'
```
### Job JSON Inspection (Cancellation Evidence)
```bash
export GITHUB_OWNER=Wikid82
export GITHUB_REPO=Charon
export JOB_ID=<JOB_ID>
curl -H "Accept: application/vnd.github+json" \
  -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/$GITHUB_OWNER/$GITHUB_REPO/actions/jobs/$JOB_ID" | jq '.'
```
## Current Timeout Configurations (Workflow Search)
- [.github/workflows/e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml#L216) - E2E Chromium Security timeout set to 60 minutes.
- [.github/workflows/e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml#L417) - E2E Firefox Security timeout set to 60 minutes.
- [.github/workflows/e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml#L626) - E2E WebKit Security timeout set to 60 minutes.
- [.github/workflows/e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml#L842) - E2E Chromium Shards timeout set to 60 minutes.
- [.github/workflows/e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml#L1046) - E2E Firefox Shards timeout set to 60 minutes.
- [.github/workflows/e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml#L1258) - E2E WebKit Shards timeout set to 60 minutes.
- [.github/workflows/docker-build.yml](.github/workflows/docker-build.yml#L52) - Docker build phase timeout set to 20 minutes (job-level).
- [.github/workflows/docker-build.yml](.github/workflows/docker-build.yml#L352) - Docker build phase timeout set to 2 minutes (step-level).
- [.github/workflows/docker-build.yml](.github/workflows/docker-build.yml#L637) - Docker build phase timeout set to 10 minutes (job-level).
- [.github/workflows/docs.yml](.github/workflows/docs.yml#L27) - Docs workflow timeout set to 10 minutes.
- [.github/workflows/docs.yml](.github/workflows/docs.yml#L368) - Docs workflow timeout set to 5 minutes.
- [.github/workflows/codecov-upload.yml](.github/workflows/codecov-upload.yml#L38) - Codecov upload timeout set to 15 minutes.
- [.github/workflows/codecov-upload.yml](.github/workflows/codecov-upload.yml#L72) - Codecov upload timeout set to 15 minutes.
- [.github/workflows/security-pr.yml](.github/workflows/security-pr.yml#L23) - Security PR workflow timeout set to 10 minutes.
- [.github/workflows/supply-chain-pr.yml](.github/workflows/supply-chain-pr.yml#L28) - Supply chain PR timeout set to 15 minutes.
- [.github/workflows/renovate.yml](.github/workflows/renovate.yml#L20) - Renovate timeout set to 30 minutes.
- [.github/workflows/security-weekly-rebuild.yml](.github/workflows/security-weekly-rebuild.yml#L30) - Security weekly rebuild timeout set to 60 minutes.
- [.github/workflows/cerberus-integration.yml](.github/workflows/cerberus-integration.yml#L24) - Cerberus integration timeout set to 20 minutes.
- [.github/workflows/crowdsec-integration.yml](.github/workflows/crowdsec-integration.yml#L24) - CrowdSec integration timeout set to 15 minutes.
- [.github/workflows/waf-integration.yml](.github/workflows/waf-integration.yml#L24) - WAF integration timeout set to 15 minutes.
- [.github/workflows/rate-limit-integration.yml](.github/workflows/rate-limit-integration.yml#L24) - Rate limit integration timeout set to 15 minutes.
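The list above can be regenerated whenever the workflows change. The following is a minimal sketch, assuming the workflows live under `.github/workflows` and that each setting sits on its own `timeout-minutes:` line; the function name `list_timeouts` and its `file:line:value` output format are illustrative and not part of the repository.

```bash
# List every timeout-minutes setting under a workflow directory,
# one per line, as "file:line:value".
# Assumption: the directory defaults to .github/workflows.
list_timeouts() {
  dir="${1:-.github/workflows}"
  # grep prints "file:line:   timeout-minutes: N"; sed collapses that
  # to "file:line:N" so the values are easy to compare.
  grep -rnE '^[[:space:]]*timeout-minutes:' "$dir" \
    | sed -E 's/:[[:space:]]*timeout-minutes:[[:space:]]*/:/' \
    | sort
}
```

Running `list_timeouts` from the repository root prints one entry per setting, which can be diffed against the list in this section.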
## E2E Playwright Invocation and Shard Strategy
- Playwright is invoked in the E2E workflow for security and non-security runs. See [.github/workflows/e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml#L331), [.github/workflows/e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml#L540), [.github/workflows/e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml#L749), [.github/workflows/e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml#L945), [.github/workflows/e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml#L1157), and [.github/workflows/e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml#L1369).
- Shard matrix configuration for non-security runs is set to 4 shards per browser. See [.github/workflows/e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml#L851-L852), [.github/workflows/e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml#L1055-L1056), and [.github/workflows/e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml#L1267-L1268).
## Reproduction Command Coverage (Spec Sections 3, 8)
The steps below mirror the CI flow, using the same compose file, environment variables, and Playwright CLI flags.
### Image Rebuild Steps (CI Parity)
```bash
# CI build job produces a local image and saves it as a tar.
# To match CI locally, rebuild the E2E image using the project skill:
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e
```
### Environment Start Commands (CI Compose)
```bash
# CI uses the Playwright CI compose file.
docker compose -f .docker/compose/docker-compose.playwright-ci.yml up -d
# Single health probe; CI repeats this request in a wait loop until it succeeds.
curl -sf http://127.0.0.1:8080/api/v1/health > /dev/null 2>&1
```
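The health check above is a single request, whereas CI is described as waiting in a loop. A minimal sketch of such a loop follows; the helper name `wait_until`, the attempt count, and the delay are assumptions for illustration and are not taken from the workflow.

```bash
# Retry a probe command until it succeeds or the attempt budget is spent.
# Usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
# The helper name and its parameters are hypothetical, not CI's values.
wait_until() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    # Sleep only if another attempt is still going to happen.
    [ "$attempt" -le "$max_attempts" ] && sleep "$delay"
  done
  return 1
}

# Example (not executed here):
#   wait_until 30 2 curl -sf -o /dev/null http://127.0.0.1:8080/api/v1/health
```

Taking the probe as a command keeps the loop independent of the endpoint, so the same helper could wrap other readiness checks.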
### Exact Playwright CLI Invocation (Non-Security Shards)
```bash
export PLAYWRIGHT_BASE_URL=http://127.0.0.1:8080
export CI=true
export TEST_WORKER_INDEX=<SHARD_INDEX>
export CHARON_EMERGENCY_TOKEN=<SECRET>
export CHARON_EMERGENCY_SERVER_ENABLED=true
export CHARON_SECURITY_TESTS_ENABLED=false
export CHARON_E2E_IMAGE_TAG=<IMAGE_TAG>
npx playwright test \
  --project=chromium \
  --shard=<SHARD_INDEX>/<TOTAL_SHARDS> \
  --output=playwright-output/chromium-shard-<SHARD_INDEX> \
  tests/core \
  tests/dns-provider-crud.spec.ts \
  tests/dns-provider-types.spec.ts \
  tests/integration \
  tests/manual-dns-provider.spec.ts \
  tests/monitoring \
  tests/settings \
  tests/tasks
```
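To reproduce more than one shard locally, the placeholders in the command above can be filled in by a small generator. This is a sketch only: the function names `shard_command` and `all_shard_commands` are hypothetical, and the suite list is copied from the invocation above. The functions print the commands and do not run them.

```bash
# Suite list copied from the non-security invocation in this report.
SUITES="tests/core tests/dns-provider-crud.spec.ts tests/dns-provider-types.spec.ts tests/integration tests/manual-dns-provider.spec.ts tests/monitoring tests/settings tests/tasks"

# Print the Playwright command for one shard.
# Usage: shard_command <project> <shard_index> <total_shards>
shard_command() {
  printf 'npx playwright test --project=%s --shard=%s/%s --output=playwright-output/%s-shard-%s %s\n' \
    "$1" "$2" "$3" "$1" "$2" "$SUITES"
}

# Print the commands for every shard of one browser project.
# Usage: all_shard_commands <project> <total_shards>
all_shard_commands() {
  shard_project="$1"
  shard_total="$2"
  shard_i=1
  while [ "$shard_i" -le "$shard_total" ]; do
    shard_command "$shard_project" "$shard_i" "$shard_total"
    shard_i=$((shard_i + 1))
  done
}
```

For example, `all_shard_commands chromium 4` prints four commands, one per shard. Printing first makes it easy to review the flags before running any of them with the environment variables listed above exported.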
### Post-Failure Diagnostic Collection (CI Always-Run)
```bash
mkdir -p diagnostics
uptime > diagnostics/uptime.txt
free -m > diagnostics/free-m.txt
df -h > diagnostics/df-h.txt
ps aux > diagnostics/ps-aux.txt
docker ps -a > diagnostics/docker-ps.txt || true
docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
docker compose -f .docker/compose/docker-compose.playwright-ci.yml logs > docker-logs-shard.txt 2>&1
```
## Emergency Server Port (2020) Configuration
- No explicit references to port 2020 were found in workflow YAMLs. The E2E workflow sets `CHARON_EMERGENCY_SERVER_ENABLED=true` but does not validate port 2020 availability.
## Job Log Evidence (Shard 3)
- No runner cancellation, runner lost, or OOM strings were present in the reviewed job log text.
- The job log shows Playwright test-level timeouts (10s and 60s expectations), not a job-level timeout.
- The job log shows the shard command executed with `--shard=3/4` and the standard suite list, indicating that the job did run sharded Playwright as expected.

Excerpt:
```
2026-02-10T12:58:19.5379132Z npx playwright test \
2026-02-10T12:58:19.5379658Z --shard=3/4 \
2026-02-10T13:06:49.1304667Z Test timeout of 60000ms exceeded.
```
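The timestamps in the excerpt can be turned into an elapsed time to show how far into the step the test timeout was logged. The sketch below assumes a `date` that accepts `-d "YYYY-MM-DD HH:MM:SS"` (GNU coreutils or BusyBox; BSD/macOS `date` needs different flags). The function names are illustrative.

```bash
# Convert a GitHub Actions log timestamp (ISO 8601, UTC) to epoch seconds.
# Fractional seconds are truncated, so results are whole seconds.
to_epoch() {
  ts="${1%%.*}"   # drop fractional seconds (the trailing Z goes with them)
  ts="${ts%Z}"    # drop the Z when there was no fractional part
  ts="$(printf '%s' "$ts" | tr 'T' ' ')"
  date -u -d "$ts" +%s
}

# Print the whole seconds between two log timestamps.
# Usage: elapsed_seconds <start_timestamp> <end_timestamp>
elapsed_seconds() {
  start_epoch="$(to_epoch "$1")"
  end_epoch="$(to_epoch "$2")"
  echo $((end_epoch - start_epoch))
}
```

Applied to the first and last lines of the excerpt, `elapsed_seconds 2026-02-10T12:58:19.5379132Z 2026-02-10T13:06:49.1304667Z` gives 510 seconds, or about 8.5 minutes. That is far below the 60-minute job limit, which is consistent with a test-level timeout and not a job-level one.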
## Proposed Workflow YAML Changes (Section 9)
The following changes were applied to the E2E workflow to align with the spec:
```yaml
# Timeout increase (temporary)
e2e-chromium:
  timeout-minutes: 60

# Per-shard output + artifact upload
- name: Run Chromium Non-Security Tests (Shard ${{ matrix.shard }}/${{ matrix.total-shards }})
  run: |
    npx playwright test \
      --project=chromium \
      --shard=${{ matrix.shard }}/${{ matrix.total-shards }} \
      --output=playwright-output/chromium-shard-${{ matrix.shard }} \
      ...
- name: Upload Playwright output (Chromium shard ${{ matrix.shard }})
  if: always()
  uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
  with:
    name: playwright-output-chromium-shard-${{ matrix.shard }}
    path: playwright-output/chromium-shard-${{ matrix.shard }}/

# Diagnostics (always)
- name: Collect diagnostics
  if: always()
  run: |
    mkdir -p diagnostics
    uptime > diagnostics/uptime.txt
    free -m > diagnostics/free-m.txt
    df -h > diagnostics/df-h.txt
    ps aux > diagnostics/ps-aux.txt
    docker ps -a > diagnostics/docker-ps.txt || true
    docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
- name: Upload diagnostics
  if: always()
  uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
  with:
    name: e2e-diagnostics-chromium-shard-${{ matrix.shard }}
    path: diagnostics/
```
## Quick Mitigation Checklist (P0)
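- Increase E2E job timeouts to 60 minutes in the E2E workflow to reduce the risk of premature job cancellation. The reviewed log shows test-level timeouts, so this is a safeguard and does not address the failing tests themselves.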
- Collect diagnostics on every shard with `if: always()` and upload artifacts.
- Enforce per-shard `--output` paths and upload them as artifacts so traces and JSON are preserved even on failure.
- Re-run the failing shard locally with the exact shard flags and diagnostics enabled to capture a trace.
## CI Remediation Priority Labels (Spec Section 5)
### P0 (Immediate - already applied)
- Timeout increase to 60 minutes for E2E shard jobs.
- Always-run diagnostics collection and artifact upload.
### P1 (Same-day)
- Add a lightweight CI smoke check step before shard execution (health check + minimal Playwright smoke).
- Add basic resource monitoring output (CPU/memory/disk) to the diagnostics bundle.
### P2 (Next sprint)
- Implement shard balancing based on historical test durations.
- Stand up a test-duration/flake telemetry dashboard for CI trends.
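As an illustration of the P2 shard-balancing idea, one common approach is a greedy assignment: sort spec files by historical duration, longest first, and give each one to the shard with the smallest accumulated load. The sketch below assumes input lines of the form `<seconds> <spec-file>`. That format, the function name `balance_shards`, and the choice of a greedy strategy are assumptions for illustration, not something the spec or the workflow defines.

```bash
# Greedy longest-first shard assignment.
# Usage: balance_shards <total_shards> < durations.txt
# Input lines:  "<seconds> <spec-file>"   (hypothetical format)
# Output lines: "<shard-number> <spec-file>"
balance_shards() {
  total_shards="$1"
  sort -k1,1nr | awk -v n="$total_shards" '
    {
      # Find the shard with the smallest accumulated duration so far.
      best = 1
      for (i = 2; i <= n; i++) {
        if (load[i] < load[best]) best = i
      }
      load[best] += $1
      print best, $2
    }'
}
```

The output is only an assignment plan. Feeding it back into CI would mean passing each shard an explicit file list, which differs from how the current `--shard=<index>/<total>` invocation distributes tests.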
## Explicit Confirmation Checklist
- [x] Workflow timeout-minutes locations identified
  ✓ Found timeout-minutes entries in .github/workflows (e.g., [.github/workflows/e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml#L216), [.github/workflows/docker-build.yml](.github/workflows/docker-build.yml#L52), [.github/workflows/docs.yml](.github/workflows/docs.yml#L27), [.github/workflows/security-weekly-rebuild.yml](.github/workflows/security-weekly-rebuild.yml#L30)).
- [x] Job cancellation evidence searched
  ✓ Searched /tmp/job-63106399789-logs.zip for "Job canceled", "cancelled", and "runner lost"; no matches found.
- [x] OOM/kill signals searched
  ✓ Searched /tmp/job-63106399789-logs.zip for "Killed", "OOM", "oom_reaper", and "Out of memory"; no matches found.
- [x] Runner type confirmed (hosted vs self-hosted)
  ✓ E2E workflow runs on GitHub-hosted runners via runs-on: ubuntu-latest (see [.github/workflows/e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml#L108)).
- [x] Emergency server port config validated
  ✓ Port 2020 is configured in Playwright CI compose with host mapping and bind (see [.docker/compose/docker-compose.playwright-ci.yml](.docker/compose/docker-compose.playwright-ci.yml#L42) and [.docker/compose/docker-compose.playwright-ci.yml](.docker/compose/docker-compose.playwright-ci.yml#L61)).
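
The cancellation and OOM searches in the checklist can be repeated against any extracted job log with a short helper. This is a sketch under assumptions: it expects a plain-text log file (not the zip itself), matches the checklist strings case-sensitively, and uses the hypothetical name `scan_log`. The exact search method used for the original investigation is not recorded in this report.

```bash
# Count log lines that match the cancellation and OOM strings from the
# checklist. Prints "cancel=<n> oom=<n>".
# Usage: scan_log <plain-text-log-file>
scan_log() {
  log_file="$1"
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the
  # function usable under "set -e".
  cancel_hits="$(grep -cE 'Job canceled|cancelled|runner lost' "$log_file" || true)"
  oom_hits="$(grep -cE 'Killed|OOM|oom_reaper|Out of memory' "$log_file" || true)"
  echo "cancel=$cancel_hits oom=$oom_hits"
}
```

A result of `cancel=0 oom=0` corresponds to the "no matches found" entries above.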