E2E CI Failure Diagnosis - 100% CI Failure vs 90% Local Pass

Date: February 4, 2026
Status: 🔴 CRITICAL - 100% CI failure rate vs 90% local pass rate
Urgency: HIGH - Blocking all PRs and CI/CD pipeline


Executive Summary

Problem: E2E tests exhibit a critical environmental discrepancy:

  • Local Environment: 90% of E2E tests PASS when running via skill-runner.sh test-e2e-playwright
  • CI Environment: 100% of E2E jobs FAIL in GitHub Actions workflow (e2e-tests-split.yml)

Root Cause Hypothesis: Multiple critical configuration differences between local and CI environments create an inconsistent test execution environment, leading to systematic failures in CI.

Impact:

  • All PRs blocked due to failing E2E checks
  • Cannot merge to main or development
  • CI/CD pipeline completely stalled
  • ⚠️ Development velocity severely impacted

Configuration Comparison Matrix

Docker Compose Configuration Differences

| Configuration | Local (docker-compose.playwright-local.yml) | CI (docker-compose.playwright-ci.yml) | Impact |
|---|---|---|---|
| Environment | CHARON_ENV=e2e | CHARON_ENV=test | 🔴 HIGH - Different runtime behavior |
| Credential Source | env_file: ../../.env | Environment variables from $GITHUB_ENV | 🟡 MEDIUM - Potential missing vars |
| Encryption Key | Loaded from .env file | Generated ephemeral: openssl rand -base64 32 | 🟢 LOW - Both valid |
| Emergency Token | Loaded from .env file | From GitHub Secrets (CHARON_EMERGENCY_TOKEN) | 🟡 MEDIUM - Potential missing/invalid token |
| Security Tests Flag | NOT SET | CHARON_SECURITY_TESTS_ENABLED=true | 🔴 CRITICAL - May enable security modules |
| Data Storage | tmpfs: /app/data (in-memory, ephemeral) | Named volumes (playwright_data, etc.) | 🟡 MEDIUM - Different persistence behavior |
| Security Profile | Not enabled by default | --profile security-tests (enables CrowdSec) | 🔴 CRITICAL - Different security modules active |
| Image Source | charon:local (fresh local build) | charon:e2e-test (loaded from artifact) | 🟢 LOW - Both should be identical builds |
| Container Name | charon-e2e | charon-playwright | 🟢 LOW - Cosmetic difference |

GitHub Actions Workflow Environment

| Variable | CI Value | Local Equivalent | Impact |
|---|---|---|---|
| CI | true | Not set | 🟡 MEDIUM - Playwright retries, workers, etc. |
| PLAYWRIGHT_BASE_URL | http://localhost:8080 | http://localhost:8080 | 🟢 LOW - Identical |
| PLAYWRIGHT_COVERAGE | 0 (disabled by default) | 0 | 🟢 LOW - Identical |
| CHARON_EMERGENCY_SERVER_ENABLED | true | true | 🟢 LOW - Identical |
| CHARON_EMERGENCY_BIND | 0.0.0.0:2020 | 0.0.0.0:2020 | 🟢 LOW - Identical |
| NODE_VERSION | 20 | User-dependent | 🟡 MEDIUM - May differ |
| GO_VERSION | 1.25.6 | User-dependent | 🟡 MEDIUM - May differ |
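
Because the local Node.js and Go versions are user-dependent, a quick check against the CI pins can rule them in or out early. The following is a minimal sketch; the helper name version_matches is hypothetical, and it assumes version strings in the usual `node --version` (v20.x.y) or Go (go1.25.6) form.

```shell
# Succeeds when a local version string matches the version pinned in CI.
# $1 = local version (a leading "v" or "go" is ignored)
# $2 = pinned version or version prefix, e.g. "20" or "1.25.6"
version_matches() {
  local_v="${1#v}"
  local_v="${local_v#go}"
  case "$local_v" in
    "$2" | "$2".*) return 0 ;;   # exact match, or pinned prefix followed by a dot
    *) return 1 ;;
  esac
}

# Example usage (assumes node and go are on PATH):
#   version_matches "$(node --version)" 20 || echo "Node.js differs from CI"
#   version_matches "$(go version | awk '{ print $3 }')" 1.25.6 || echo "Go differs from CI"
```

Matching on a prefix lets Node.js be compared by major version only, while Go can be compared against the full pinned version.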

Local Test Execution Flow

User runs E2E tests locally:

# Step 1: Rebuild E2E container (CRITICAL: user must do this)
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e

# Default behavior: NO security profile enabled
# Result: CrowdSec NOT running
# CHARON_SECURITY_TESTS_ENABLED: NOT SET

# Step 2: Run tests
.github/skills/scripts/skill-runner.sh test-e2e-playwright

What's missing locally:

  1. No --profile security-tests (CrowdSec not running)
  2. No CHARON_SECURITY_TESTS_ENABLED environment variable
  3. CHARON_ENV=e2e instead of CHARON_ENV=test
  4. Uses .env file (requires user to have created it)

CI Test Execution Flow

GitHub Actions runs E2E tests:

# Step 1: Generate ephemeral encryption key
- name: Generate ephemeral encryption key
  run: echo "CHARON_ENCRYPTION_KEY=$(openssl rand -base64 32)" >> "$GITHUB_ENV"

# Step 2: Validate emergency token
- name: Validate Emergency Token Configuration
  # Checks CHARON_EMERGENCY_TOKEN from secrets

# Step 3: Start with security-tests profile
- name: Start test environment
  run: |
    docker compose -f .docker/compose/docker-compose.playwright-ci.yml --profile security-tests up -d

# Environment variables in workflow:
env:
  CHARON_EMERGENCY_TOKEN: ${{ secrets.CHARON_EMERGENCY_TOKEN }}
  CHARON_EMERGENCY_SERVER_ENABLED: "true"
  CHARON_SECURITY_TESTS_ENABLED: "true"  # ← SET IN CI
  CHARON_E2E_IMAGE_TAG: charon:e2e-test

# Step 4: Wait for health check (30 attempts, 2s interval)

# Step 5: Run tests with sharding
npx playwright test --project=chromium --shard=1/4

What's different in CI:

  1. --profile security-tests enabled (CrowdSec running)
  2. CHARON_SECURITY_TESTS_ENABLED=true explicitly set
  3. CHARON_ENV=test (not e2e)
  4. Named volumes (persistent data within workflow run)
  5. Sharding enabled (4 shards per browser)
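
The health-check wait in Step 4 (30 attempts, 2s interval) can be reproduced locally to see whether the container becomes ready within the same budget. This is a sketch of a generic retry loop, not the workflow's actual script; the function name wait_until and the health endpoint path in the usage comment are assumptions.

```shell
# Retry a probe command until it succeeds or the attempts run out.
# $1 = max attempts, $2 = seconds between attempts, remaining args = probe command
wait_until() {
  attempts="$1"
  interval="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}

# Example usage mirroring the CI settings.
# The endpoint path is hypothetical; substitute the application's real health URL.
#   wait_until 30 2 curl -fsS http://localhost:8080/api/v1/health
```

With these settings the container has at most about 60 seconds to become healthy, which is worth comparing against the startup time seen with the security-tests profile enabled.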

Root Cause Analysis

Critical Difference #1: CHARON_ENV (e2e vs test)

Evidence: Local uses CHARON_ENV=e2e, CI uses CHARON_ENV=test

Behavior Difference: Looking at backend/internal/caddy/config.go:92:

isE2E := os.Getenv("CHARON_ENV") == "e2e"

if acmeEmail != "" || isE2E {
    // E2E environment allows certificate generation without email
}

Impact: The application may behave differently in rate limiting, certificate generation, or other environment-specific logic depending on this variable.

Severity: 🔴 HIGH - Fundamental environment difference

Hypothesis: If there's rate limiting logic checking for CHARON_ENV == "e2e" to provide lenient limits, the CI environment with CHARON_ENV=test may enforce stricter limits, causing test failures.

Critical Difference #2: CHARON_SECURITY_TESTS_ENABLED

Evidence: NOT set locally, explicitly set to "true" in CI

Where it's set:

  • CI Workflow: CHARON_SECURITY_TESTS_ENABLED: "true" in env block
  • CI Compose: CHARON_SECURITY_TESTS_ENABLED=${CHARON_SECURITY_TESTS_ENABLED:-true}
  • Local Compose: NOT PRESENT

Impact: UNKNOWN - This variable is NOT used anywhere in the backend Go code (confirmed by grep search). However, it may:

  1. Be checked in the frontend TypeScript code
  2. Control test fixture behavior
  3. Be a vestigial variable that was removed from code but left in compose files

Severity: 🟡 MEDIUM - Present in CI but not local, unexplained purpose

Action Required: Search frontend and test fixtures for usage of this variable.

Critical Difference #3: Security Profile (CrowdSec)

Evidence: CI runs with --profile security-tests, local does NOT (unless manually specified)

Impact:

  • CI: CrowdSec container running alongside charon-app
  • Local: No CrowdSec (unless user runs docker-rebuild-e2e --profile=security-tests)

CrowdSec Service Configuration:

crowdsec:
  image: crowdsecurity/crowdsec:latest
  profiles:
    - security-tests
  environment:
    - COLLECTIONS=crowdsecurity/nginx crowdsecurity/http-cve
    - BOUNCER_KEY_charon=test-bouncer-key-for-e2e
    - DISABLE_ONLINE_API=true

Severity: 🔴 CRITICAL - Entire security module missing locally

Hypothesis: Tests may be failing in CI because:

  1. CrowdSec is blocking requests that should pass
  2. CrowdSec has configuration issues in CI environment
  3. Tests are written assuming CrowdSec is NOT running
  4. Network routing through CrowdSec causes latency or timeouts

Critical Difference #4: Data Storage (tmpfs vs named volumes)

Evidence:

  • Local: tmpfs: /app/data:size=100M,mode=1777 (in-memory, cleared on restart)
  • CI: Named volumes playwright_data, playwright_caddy_data, playwright_caddy_config

Impact:

  • Local: True ephemeral storage - every restart is 100% fresh
  • CI: Volumes persist across container restarts within the same workflow run

Severity: 🟡 MEDIUM - Could cause state pollution in CI

Hypothesis: If CI containers are restarted mid-workflow (e.g., between shards), the volumes retain data, potentially causing state pollution that doesn't exist locally.

Critical Difference #5: Credential Management

Evidence:

  • Local: Uses env_file: ../../.env to load all credentials
  • CI: Passes credentials explicitly via $GITHUB_ENV and secrets

Failure Scenario:

  1. User creates .env file with CHARON_ENCRYPTION_KEY and CHARON_EMERGENCY_TOKEN
  2. Local tests pass because both variables are loaded from .env
  3. CI generates ephemeral CHARON_ENCRYPTION_KEY (always fresh)
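  4. CI loads CHARON_EMERGENCY_TOKEN from GitHub Secrets
  5. Any other variable defined only in the local .env never reaches the CI container, so behavior that depends on it can fail in CI alone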

Potential Issues:

  • Is CHARON_EMERGENCY_TOKEN correctly configured in GitHub Secrets?
  • Is the token length validation passing in CI? (requires ≥64 characters)
  • Are there any other variables loaded from .env locally that are missing in CI?
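
The token length requirement can be checked locally without printing the token itself. A minimal sketch, assuming the ≥64 character rule stated above counts characters and that .env holds a plain KEY=value line; the function name token_long_enough is hypothetical.

```shell
# Succeeds when the token is at least min_len characters long.
# $1 = token value, $2 = minimum length (defaults to 64)
token_long_enough() {
  token="$1"
  min_len="${2:-64}"
  [ "${#token}" -ge "$min_len" ]
}

# Example usage: read the token from the local .env without echoing it.
# Assumes an unquoted KEY=value line.
#   token="$(grep '^CHARON_EMERGENCY_TOKEN=' .env | cut -d= -f2-)"
#   token_long_enough "$token" || echo "token is shorter than 64 characters"
```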

Severity: 🔴 HIGH - Credential mismatches can cause authentication failures


Suspected Failure Scenarios

Scenario A: CrowdSec Blocking Legitimate Test Requests

Hypothesis: CrowdSec in CI is blocking test requests that would pass locally without CrowdSec.

Evidence Needed:

  1. Docker logs from CrowdSec container in failed CI runs
  2. Charon application logs showing blocked requests
  3. Test failure patterns (are they authentication/authorization related?)

Test: Run locally with security-tests profile:

.github/skills/scripts/skill-runner.sh docker-rebuild-e2e --profile=security-tests
.github/skills/scripts/skill-runner.sh test-e2e-playwright

Expected: If this is the root cause, tests will fail locally with the profile enabled.

Scenario B: CHARON_ENV=test Enforces Stricter Limits

Hypothesis: The test environment enforces production-like limits (rate limiting, timeouts) that break tests designed for lenient e2e environment.

Evidence Needed:

  1. Search backend code for all uses of CHARON_ENV
  2. Identify rate limiting, timeout, or other behavior differences
  3. Check if tests make rapid API calls that would hit rate limits
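
The first evidence item can be gathered with a plain search over the Go sources. The sketch below uses find and grep only; the function name find_env_uses is hypothetical, and the backend/ path is taken from the file reference earlier in this document.

```shell
# List every Go source line under a directory that mentions a variable name.
# $1 = variable name (matched as a fixed string), $2 = directory to search
# Output format: <file>:<line number>:<line>
find_env_uses() {
  # /dev/null forces grep to print file names even when only one file is passed.
  find "$2" -type f -name '*.go' -exec grep -nF -- "$1" /dev/null {} +
}

# Example usage:
#   find_env_uses CHARON_ENV backend/
```

Each hit beyond config.go:92 is a candidate behavior difference between the e2e and test environments.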

Test: Modify local compose to use CHARON_ENV=test:

# .docker/compose/docker-compose.playwright-local.yml
environment:
  - CHARON_ENV=test  # Change from e2e

Expected: If this is the root cause, tests will fail locally with CHARON_ENV=test.

Scenario C: Missing Environment Variable in CI

Hypothesis: The CI environment is missing a critical environment variable that's loaded from .env locally but not set in CI compose/workflow.

Evidence Needed:

  1. Compare .env.example with all variables explicitly set in docker-compose.playwright-ci.yml and the workflow
  2. Check application startup logs for warnings about missing environment variables
  3. Review test failure messages for configuration errors

Test: Audit all environment variables:

# Local container
docker exec charon-e2e env | sort > local-env.txt

# CI container (from failed run logs)
# Download docker logs artifact and extract env vars
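
Once both environment dumps exist, comparing variable names rather than full lines shows what is missing on either side without exposing secret values. A minimal sketch; the function name vars_only_in_first and the file name ci-env.txt are assumptions.

```shell
# Print variable names present in the first env dump but absent from the second.
# Only names are compared, so secret values never reach the terminal.
# $1, $2 = files containing KEY=value lines (for example, output of `env`)
vars_only_in_first() {
  a="$(mktemp)"
  b="$(mktemp)"
  cut -d= -f1 "$1" | sort -u > "$a"
  cut -d= -f1 "$2" | sort -u > "$b"
  comm -23 "$a" "$b"    # lines unique to the first sorted list
  rm -f "$a" "$b"
}

# Example usage:
#   vars_only_in_first local-env.txt ci-env.txt   # set locally, missing in CI
#   vars_only_in_first ci-env.txt local-env.txt   # set in CI, missing locally
```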

Scenario D: Image Build Differences (Local vs CI Artifact)

Hypothesis: The Docker image built locally (charon:local) differs from the CI artifact (charon:e2e-test) in some way that causes test failures.

Evidence Needed:

  1. Compare Dockerfile build args between local and CI
  2. Inspect image layers to identify differences
  3. Check if CI cache is corrupted

Test: Load the CI artifact locally and run tests against it:

# Download the image artifact from a failed CI run, then load it
docker load -i charon-e2e-image.tar
# Run tests against the CI artifact locally

Diagnostic Action Plan

Phase 1: Evidence Collection (Immediate)

Task 1.1: Download recent failed CI run artifacts

  • Download Docker logs from latest failed run
  • Download test traces and videos
  • Download HTML test reports

Task 1.2: Capture local environment baseline

# With default settings (passing tests)
docker exec charon-e2e env | sort > local-env-baseline.txt
docker logs charon-e2e > local-logs-baseline.txt

Task 1.3: Search for CHARON_SECURITY_TESTS_ENABLED usage

# Frontend
grep -r "CHARON_SECURITY_TESTS_ENABLED" frontend/

# Tests
grep -r "CHARON_SECURITY_TESTS_ENABLED" tests/

# Backend (already confirmed: NOT USED)

Task 1.4: Document test failure patterns in CI

  • Review last 10 failed CI runs
  • Identify common error messages
  • Check if specific tests always fail
  • Check if failures are random or deterministic
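
To separate deterministic from random failures, the failing test names from each run can be tallied. The sketch below assumes one text file per CI run with one failing test name per line; that file layout and the function name tally_lines are hypothetical.

```shell
# Count how often each line occurs across the given files, most frequent first.
# Output format: "<count> <line>"
tally_lines() {
  cat "$@" | sort | uniq -c | sort -rn |
    awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n, $0 }'
}

# Example usage (hypothetical file names, one per reviewed run):
#   tally_lines run-01-failures.txt run-02-failures.txt run-03-failures.txt
```

A test whose count equals the number of runs reviewed fails deterministically; lower counts point toward flakiness or shard-dependent behavior.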

Phase 2: Controlled Experiments (Next)

Experiment 2.1: Enable security-tests profile locally

.github/skills/scripts/skill-runner.sh docker-rebuild-e2e --profile=security-tests --clean
.github/skills/scripts/skill-runner.sh test-e2e-playwright

Expected Outcome: If CrowdSec is the root cause, tests will fail locally.

Experiment 2.2: Change CHARON_ENV to "test" locally

# Edit .docker/compose/docker-compose.playwright-local.yml
# Change: CHARON_ENV=e2e → CHARON_ENV=test
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e --clean
.github/skills/scripts/skill-runner.sh test-e2e-playwright

Expected Outcome: If environment-specific behavior differs, tests will fail locally.

Experiment 2.3: Add CHARON_SECURITY_TESTS_ENABLED locally

# Edit .docker/compose/docker-compose.playwright-local.yml
# Add: - CHARON_SECURITY_TESTS_ENABLED=true
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e --clean
.github/skills/scripts/skill-runner.sh test-e2e-playwright

Expected Outcome: If this flag controls critical behavior, tests may fail locally.

Experiment 2.4: Use named volumes instead of tmpfs locally

# Edit .docker/compose/docker-compose.playwright-local.yml
# Replace tmpfs with named volumes matching CI config
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e --clean
.github/skills/scripts/skill-runner.sh test-e2e-playwright

Expected Outcome: If volume persistence causes state pollution, tests may behave differently.

Phase 3: CI Simplification (Final)

If the experiments identify the root cause, apply the corresponding fix to CI:

Fix 3.1: Remove security-tests profile from CI (if CrowdSec is the culprit)

# .github/workflows/e2e-tests-split.yml
- name: Start test environment
  run: |
    docker compose -f .docker/compose/docker-compose.playwright-ci.yml up -d
    # Remove: --profile security-tests

Fix 3.2: Align CI environment to match local (if CHARON_ENV is the issue)

# .docker/compose/docker-compose.playwright-ci.yml
environment:
  - CHARON_ENV=e2e  # Change from test to e2e

Fix 3.3: Remove CHARON_SECURITY_TESTS_ENABLED (if unused)

# Remove from workflow and compose if truly unused

Fix 3.4: Use tmpfs in CI (if volume persistence is the issue)

# .docker/compose/docker-compose.playwright-ci.yml
tmpfs:
  - /app/data:size=100M,mode=1777
# Remove: playwright_data volume

Investigation Priorities

🔴 CRITICAL - Investigate First

  1. CrowdSec Profile Difference

    • CI runs with CrowdSec, local does not (by default)
    • Most likely root cause of 100% failure rate
    • Action: Run Experiment 2.1 immediately
  2. CHARON_ENV Difference (e2e vs test)

    • Known to affect certificate generation (config.go:92); any effect on rate limiting is still a hypothesis
    • Action: Run Experiment 2.2 immediately
  3. Emergency Token Validation

    • CI validates token length (≥64 chars)
    • Local loads from .env (unchecked)
    • Action: Review CI logs for token validation failures

🟡 MEDIUM - Investigate Next

  1. CHARON_SECURITY_TESTS_ENABLED Purpose

    • Set in CI, not in local
    • Not used in backend Go code
    • Action: Search frontend/tests for usage
  2. Named Volumes vs tmpfs

    • CI uses persistent volumes
    • Local uses ephemeral tmpfs
    • Action: Run Experiment 2.4 to test state pollution theory
  3. Image Build Differences

    • Local builds fresh, CI loads from artifact
    • Action: Load CI artifact locally and compare

🟢 LOW - Investigate Last

  1. Node.js/Go Version Differences

    • Unlikely to cause 100% failure
    • More likely to cause flaky tests, not systematic failures
  2. Sharding Differences

    • CI uses sharding (4 shards per browser)
    • Local runs all tests in a single unsharded run
    • Action: Test with sharding locally
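
Testing with sharding locally can reuse the same --shard flag the CI command uses. A minimal sketch, assuming four shards as in CI; the helper name shard_args is hypothetical.

```shell
# Print the --shard arguments for a run split into N shards, one per line.
# $1 = total number of shards
shard_args() {
  total="$1"
  i=1
  while [ "$i" -le "$total" ]; do
    echo "--shard=$i/$total"
    i=$((i + 1))
  done
}

# Example usage: run the four Chromium shards one after another.
#   for s in $(shard_args 4); do
#     npx playwright test --project=chromium "$s" || echo "failed: $s"
#   done
```

Running the shards sequentially against one container also exercises the state-pollution theory, since data written by one shard is visible to the next.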

Success Criteria for Resolution

Definition of Done: CI environment matches local environment in all critical configuration aspects, resulting in:

  1. CI E2E tests pass at ≥90% rate (matching local)
  2. Root cause identified and documented
  3. Configuration differences eliminated or explained
  4. Reproducible test environment (local = CI)
  5. All experiments documented with results
  6. Runbook created for future E2E debugging
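
The first criterion is a simple ratio, and a small helper makes it easy to check from the passed and total counts in a test report. This is a sketch with hypothetical function names; it assumes the total is greater than zero.

```shell
# Print the pass rate as a percentage with one decimal place.
# $1 = passed tests, $2 = total tests (must be greater than zero)
pass_rate() {
  awk -v p="$1" -v t="$2" 'BEGIN { printf "%.1f\n", (p * 100) / t }'
}

# Succeeds when the pass rate is at or above the threshold.
# $1 = passed tests, $2 = total tests, $3 = threshold in percent (defaults to 90)
meets_threshold() {
  awk -v p="$1" -v t="$2" -v min="${3:-90}" 'BEGIN { exit !(p * 100 >= min * t) }'
}

# Example usage:
#   pass_rate 11 12                      # prints 91.7
#   meets_threshold 11 12 && echo "criterion 1 met"
```

The threshold comparison multiplies instead of dividing, which avoids floating-point rounding right at the 90% boundary.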

Rollback Plan: If fixes introduce new issues, revert changes and document findings for deeper investigation.


References

Files to Review:

  • .github/workflows/e2e-tests-split.yml - CI workflow configuration
  • .docker/compose/docker-compose.playwright-ci.yml - CI docker compose
  • .docker/compose/docker-compose.playwright-local.yml - Local docker compose
  • .github/skills/scripts/skill-runner.sh - Skill runner orchestration
  • .github/skills/test-e2e-playwright-scripts/run.sh - Local test execution
  • .github/skills/docker-rebuild-e2e-scripts/run.sh - Local container rebuild
  • backend/internal/caddy/config.go - CHARON_ENV usage
  • playwright.config.js - Playwright test configuration

Related Documentation:

  • .github/instructions/testing.instructions.md - Test protocols
  • .github/instructions/playwright-typescript.instructions.md - Playwright guidelines
  • docs/reports/gh_actions_diagnostic.md - Previous CI failure analysis

GitHub Actions Runs (recent failures):

  • Check Actions tab for latest failed runs on e2e-tests-split.yml
  • Download artifacts: Docker logs, test reports, traces

Next Action: Execute Phase 1 evidence collection, focusing on CrowdSec profile and CHARON_ENV differences as primary suspects.

Assigned To: Supervisor Agent (for review and approval of diagnostic experiments)

Timeline:

  • Phase 1 (Evidence): 1-2 hours
  • Phase 2 (Experiments): 2-4 hours
  • Phase 3 (Fixes): 1-2 hours
  • Total Estimated Time: 4-8 hours to resolution

Diagnostic Plan Generated: February 4, 2026
Author: GitHub Copilot (Planning Mode)