GitHub Actions aec12a2e68 fix(ci): update comments for clarity on E2E tests workflow changes

2026-02-05 13:46:21 +00:00

CI/CD Hanging Issue - Comprehensive Remediation Plan

Date: February 4, 2026
Branch: hotfix/ci
Status: Planning Phase
Priority: CRITICAL
Target Audience: Engineering team (DevOps, QA, Frontend)

Executive Summary

Problem: E2E tests hang indefinitely after global setup completes. All 3 browser jobs (Chromium, Firefox, WebKit) hang at identical points with no error messages or timeout exceptions.

Root Cause(s) Identified:

I/O Buffer Deadlock: Caddy verbose logging fills pipe buffer (64KB), blocking process communication
Resource Starvation: 2-core CI runner overloaded (Caddy + Charon + Playwright + 3x browser processes)
Signal Handling Gap: Container lacks proper init system; signal propagation fails
Playwright Timeout Logic: webServer detection timed out; tests proceed with unreachable server
Missing Observability: No DEBUG output; no explicit timeouts on test step; no stdout piping
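
The first root cause can be illustrated with a toy model. The sketch below is not Caddy or kernel code. It assumes a 64 KiB pipe, fixed-size log lines, and a reader that drains a fixed number of bytes after each write (all hypothetical parameters), and counts how many lines a writer can emit before a blocking write would stall.

```typescript
// Toy model of a bounded pipe: a writer appends fixed-size log lines and a
// reader drains some bytes after each write. Returns the number of lines
// written before a write would block, or null if it never blocks within
// maxLines. All parameters are illustrative assumptions.
function linesUntilBlocked(
  capacityBytes: number,
  lineBytes: number,
  drainedPerWrite: number,
  maxLines: number = 10000,
): number | null {
  let buffered = 0;
  for (let written = 0; written < maxLines; written++) {
    if (buffered + lineBytes > capacityBytes) {
      return written; // the next write would block here
    }
    buffered += lineBytes;
    buffered = Math.max(0, buffered - drainedPerWrite);
  }
  return null;
}

const PIPE_CAPACITY = 64 * 1024;

// Nobody reads the pipe: the writer stalls once the buffer is full.
const unread = linesUntilBlocked(PIPE_CAPACITY, 1024, 0);
// The reader keeps up with the writer: the buffer never fills.
const drained = linesUntilBlocked(PIPE_CAPACITY, 1024, 1024);
console.log({ unread, drained }); // { unread: 64, drained: null }
```

In this model the writer stalls after 64 one-kilobyte lines when nothing reads the pipe, which is why the remediation both reduces log volume and makes sure output is consumed.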

Remediation Strategy:

Phase 1: Add observability (DEBUG flags, explicit timeouts, stdout piping) - QUICK WINS
Phase 2: Enforce resource efficiency (single worker, remove blocking dependencies)
Phase 3: Infrastructure hardening (Docker init system, Caddy CI profile)
Phase 4: Verification and rollback procedures

Expected Outcome: Convert indefinite hang → explicit error message → passing tests

File Inventory & Modification Scope

Files Requiring Changes (EXACT PATHS)

File	Current State	Change Scope	Phase	Risk
`.github/workflows/e2e-tests-split.yml`	No Playwright DEBUG namespaces, no timeout on test step, no real-time reporter	Add DEBUG vars, timeout-minutes: 15 on test step, --reporter=line	1	LOW
`playwright.config.js`	No stdout/stderr piping, fullyParallel: true in CI	Set workers: 1 and fullyParallel: false in CI to limit concurrent output	1	MEDIUM
`.docker/compose/docker-compose.playwright-ci.yml`	No init system, standard logging	Add init: true (Docker's built-in init, equivalent to docker run --init)	3	MEDIUM
`Dockerfile`	No init system installed	Install dumb-init and use it as ENTRYPOINT	3	MEDIUM
`.docker/docker-entrypoint.sh`	Multiple child processes; SIGTERM/INT trap already present (OK)	Add DEBUG output (Phase 1); write reduced-logging Caddy config in CI (Phase 3)	1, 3	LOW
`.docker/compose/docker-compose.playwright-ci.yml` (Caddy config)	Default logging level, auto_https enabled	Create CI profile with log level=warn, auto_https off	3	MEDIUM
`tests/global-setup.ts`	Long waits without timeout, silent failures	Add explicit timeouts, DEBUG output, health check retries	1	LOW

Phase 1: Quick Wins - Observability & Explicit Timeouts

Objective: Restore observability, add explicit timeouts, enable troubleshooting
Timeline: Implement immediately
Risk Level: LOW - Non-breaking changes
Rollback: Easy (revert env vars and config changes)

Change 1.1: Add DEBUG Environment Variables to Workflow

File: .github/workflows/e2e-tests-split.yml

Current State (Lines 29-34):

env:
  NODE_VERSION: '20'
  GO_VERSION: '1.25.6'
  GOTOOLCHAIN: auto
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository_owner }}/charon
  PLAYWRIGHT_COVERAGE: ${{ vars.PLAYWRIGHT_COVERAGE || '0' }}
  DEBUG: 'charon:*,charon-test:*'
  PLAYWRIGHT_DEBUG: '1'
  CI_LOG_LEVEL: 'verbose'

Change:

env:
  NODE_VERSION: '20'
  GO_VERSION: '1.25.6'
  GOTOOLCHAIN: auto
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository_owner }}/charon
  PLAYWRIGHT_COVERAGE: ${{ vars.PLAYWRIGHT_COVERAGE || '0' }}
  # Playwright debugging
  DEBUG: 'pw:api,pw:browser,pw:webserver,charon:*,charon-test:*'
  PLAYWRIGHT_DEBUG: '1'
  PW_DEBUG_VERBOSE: '1'
  CI_LOG_LEVEL: 'verbose'
  # Unbuffered output for any Python tooling in the job (no effect on Node or Caddy)
  PYTHONUNBUFFERED: '1'
  # Caddy logging verbosity (temporary for diagnosis; Phase 3 lowers this to warn)
  CADDY_LOG_LEVEL: 'debug'

Rationale:

pw:api,pw:browser,pw:webserver enables Playwright webServer readiness diagnostics
PW_DEBUG_VERBOSE=1 increases logging verbosity
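PYTHONUNBUFFERED=1 disables Python output buffering; it only matters if Python tooling runs in the job
CADDY_LOG_LEVEL=debug shows actual progress in Caddy startup; it also raises log volume, so Phase 3 lowers it to warn once diagnosis is complete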

Lines affected: Lines 29-39 (env section)
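
To see which log sources a given DEBUG value switches on, the matching rules can be sketched as follows. This is a simplified model in the style of the `debug` package (comma-separated patterns, `*` wildcard, leading `-` to exclude), not the library's own implementation; the namespaces `charon:db` and `pw:protocol` are used only as illustrative inputs.

```typescript
// Simplified model of DEBUG namespace matching: comma-separated patterns,
// "*" as a wildcard, and a leading "-" to exclude a namespace.
function isDebugEnabled(debugVar: string, namespace: string): boolean {
  const toRegExp = (pattern: string): RegExp => {
    const escaped = pattern
      .split('*')
      .map(part => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
      .join('.*');
    return new RegExp(`^${escaped}$`);
  };

  const patterns = debugVar.split(/[\s,]+/).filter(p => p.length > 0);
  const excluded = patterns.filter(p => p.startsWith('-')).map(p => toRegExp(p.slice(1)));
  const included = patterns.filter(p => !p.startsWith('-')).map(toRegExp);

  if (excluded.some(re => re.test(namespace))) return false;
  return included.some(re => re.test(namespace));
}

const DEBUG_VALUE = 'pw:api,pw:browser,pw:webserver,charon:*,charon-test:*';
console.log(isDebugEnabled(DEBUG_VALUE, 'pw:webserver')); // true
console.log(isDebugEnabled(DEBUG_VALUE, 'charon:db'));    // true (wildcard match)
console.log(isDebugEnabled(DEBUG_VALUE, 'pw:protocol'));  // false (not listed)
```

Listing individual `pw:` namespaces rather than `pw:*` keeps the added output focused on API calls, browser launch, and webServer readiness.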

Change 1.2: Add Explicit Test Step Timeout

File: .github/workflows/e2e-tests-split.yml

Location: All three browser test steps (e2e-chromium, e2e-firefox, e2e-webkit)

Current State (e.g., Chromium job, around line 190):

- name: Run Chromium tests (Shard ${{ matrix.shard }}/${{ matrix.total-shards }})
  run: |
    echo "════════════════════════════════════════════"
    echo "Chromium E2E Tests - Shard ${{ matrix.shard }}/${{ matrix.total-shards }}"
    echo "Start Time: $(date -u +'%Y-%m-%dT%H:%M:%SZ')"
    echo "════════════════════════════════════════════"

    SHARD_START=$(date +%s)
    echo "SHARD_START=$SHARD_START" >> $GITHUB_ENV

    npx playwright test \
      --project=chromium \
      --shard=${{ matrix.shard }}/${{ matrix.total-shards }}

Change - Add explicit timeout and DEBUG output:

- name: Run Chromium tests (Shard ${{ matrix.shard }}/${{ matrix.total-shards }})
  timeout-minutes: 15  # NEW: Explicit step timeout (prevents infinite hang)
  run: |
    echo "════════════════════════════════════════════"
    echo "Chromium E2E Tests - Shard ${{ matrix.shard }}/${{ matrix.total-shards }}"
    echo "Start Time: $(date -u +'%Y-%m-%dT%H:%M:%SZ')"
    echo "════════════════════════════════════════════"
    echo "DEBUG Flags: pw:api,pw:browser,pw:webserver"
    echo "Expected Duration: 8-12 minutes"
    echo "Timeout: 15 minutes (hard stop)"

    SHARD_START=$(date +%s)
    echo "SHARD_START=$SHARD_START" >> $GITHUB_ENV

    # Run with explicit timeout and verbose output
    timeout 840s npx playwright test \
      --project=chromium \
      --shard=${{ matrix.shard }}/${{ matrix.total-shards }} \
      --reporter=line  # NEW: Line reporter shows test progress in real-time

Rationale:

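timeout-minutes: 15 provides the GitHub Actions hard stop for the step
timeout 840s (14 minutes) stops the command one minute before the step limit and exits with code 124, so the failure is reported by the shell rather than by step cancellation
--reporter=line shows progress line-by-line as tests run, so a stall is visible in the log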

Apply to: e2e-chromium (line ~190), e2e-firefox (line ~350), e2e-webkit (line ~510)
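
The two timeout layers only help if the inner one fires first. A minimal check of that relationship, using the values from the change above (the interface and function names are hypothetical):

```typescript
// Checks that the command-level `timeout` fires before the step-level limit.
interface TimeoutLayers {
  stepTimeoutMinutes: number;    // timeout-minutes on the workflow step
  commandTimeoutSeconds: number; // argument to `timeout` around npx playwright test
}

// Positive result: seconds left for the step to report the failure.
// Zero or negative: the step would be cancelled before `timeout` can fire.
function timeoutMarginSeconds(layers: TimeoutLayers): number {
  return layers.stepTimeoutMinutes * 60 - layers.commandTimeoutSeconds;
}

const margin = timeoutMarginSeconds({ stepTimeoutMinutes: 15, commandTimeoutSeconds: 840 });
console.log(`margin: ${margin}s`); // margin: 60s
```

If either value is changed later, keeping the margin positive preserves the explicit exit code from the inner timeout.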

Change 1.3: Enable Playwright stdout Piping

File: playwright.config.js

Current State (Lines 74-77):

export default defineConfig({
  testDir: './tests',
  /* Ignore old/deprecated test directories */
  testIgnore: ['**/frontend/**', '**/node_modules/**', '**/backend/**'],
  /* Global setup - runs once before all tests to clean up orphaned data */
  globalSetup: './tests/global-setup.ts',

Change - Add stdout piping config:

export default defineConfig({
  testDir: './tests',
  /* Ignore old/deprecated test directories */
  testIgnore: ['**/frontend/**', '**/node_modules/**', '**/backend/**'],
  /* Global setup - runs once before all tests to clean up orphaned data */
  globalSetup: './tests/global-setup.ts',

  /* In CI, test processes may hang if output pipes fill (64KB pipe buffers).
   * The settings below reduce the number of processes writing output at once,
   * which matters most when several browser processes run concurrently.
   */
  grep: process.env.CI ? [/.*/] : undefined,  // /.*/ matches every test title, so nothing is filtered out in CI

  /* NEW: One worker and no intra-file parallelism in CI.
   * These options limit concurrency; they do not change how stdout is buffered.
   */
  workers: process.env.CI ? 1 : undefined,
  fullyParallel: process.env.CI ? false : true,  // NEW: Sequential in CI
  timeout: 90000,
  /* Timeout for expect() assertions */
  expect: {
    timeout: 5000,
  },

Rationale:

workers: 1 in CI prevents concurrent process resource contention
fullyParallel: false forces sequential test execution (reduces scheduler complexity)
Together these settings limit how many processes write output at once, which lowers pipe buffer pressure; they do not configure stdout piping by themselves

Lines affected: Lines 74-102 (defineConfig)
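
If the config defines a webServer entry, its output can be piped explicitly through the `stdout` and `stderr` options. The sketch below builds such an entry as a plain object so it can be inspected on its own; the helper name, the `npm run start` command, and the timeout value are placeholders, not values taken from the project.

```typescript
// Hypothetical helper that builds a webServer entry for playwright.config.js.
// The option names follow Playwright's webServer settings; the command and
// timeout are placeholders, not values from the project.
type OutputMode = 'pipe' | 'ignore';

interface WebServerSketch {
  command: string;
  url: string;
  timeout: number;
  reuseExistingServer: boolean;
  stdout: OutputMode;
  stderr: OutputMode;
}

function buildWebServer(isCI: boolean): WebServerSketch {
  return {
    command: 'npm run start',                   // placeholder command
    url: 'http://localhost:8080/api/v1/health', // health endpoint used elsewhere in this plan
    timeout: 120000,                            // readiness timeout in ms (placeholder)
    reuseExistingServer: !isCI,                 // always start fresh in CI
    stdout: isCI ? 'pipe' : 'ignore',           // forward server stdout to the test log in CI
    stderr: 'pipe',
  };
}

console.log(buildWebServer(true).stdout); // pipe
```

The result would be passed as `webServer: buildWebServer(Boolean(process.env.CI))` inside `defineConfig`. Piping the server's stdout makes readiness failures visible in the job log.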

Change 1.4: Add Health Check Retry Logic to Global Setup

File: tests/global-setup.ts

Current State (around line 200): Silent waits without explicit timeout

Change - Add explicit timeout and retry logic:

// Assumes these imports sit at the top of tests/global-setup.ts
import { request } from '@playwright/test';
import type { FullConfig } from '@playwright/test';

/**
 * Poll the health endpoint until it returns a 2xx status or the attempts run out.
 * Each attempt can take up to 3s (request timeout) plus a 2s pause.
 */
async function waitForServer(baseURL: string, maxAttempts: number = 30): Promise<boolean> {
  const startedAt = Date.now();
  const elapsed = () => Math.round((Date.now() - startedAt) / 1000);
  console.log(`  ⏳ Waiting for ${baseURL} (up to ${maxAttempts} attempts, 2s apart)`);

  // `request` is a factory; HTTP calls go through a request context
  const context = await request.newContext();
  try {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      let failure = 'unknown error';
      try {
        const response = await context.head(baseURL + '/api/v1/health', {
          timeout: 3000,  // 3s per attempt
        });

        // ok() is a method; a bare `response.ok` would always be truthy
        if (response.ok()) {
          console.log(`  ✅ Server responded after ${elapsed()}s`);
          return true;
        }
        failure = `HTTP ${response.status()}`;
      } catch (error) {
        failure = (error as Error).message;
      }

      if (attempt % 5 === 0 || attempt === maxAttempts) {
        console.log(`  ⏳ Attempt ${attempt}/${maxAttempts}: ${failure}`);
      }
      await new Promise(resolve => setTimeout(resolve, 2000));
    }
  } finally {
    await context.dispose();
  }

  console.error(`  ❌ Server did not respond within ${elapsed()}s`);
  return false;
}

async function globalSetup(config: FullConfig): Promise<void> {
  // ... existing token validation ...

  const baseURL = getBaseURL();
  console.log(`🧹 Running global test setup...`);
  console.log(`📍 Base URL: ${baseURL}`);

  // NEW: Explicit server wait with timeout
  const serverReady = await waitForServer(baseURL, 30);
  if (!serverReady) {
    console.error('\n🚨 FATAL: Server unreachable');
    console.error('   Check Docker container logs: docker logs charon-playwright');
    console.error('   Verify port 8080 is accessible: curl http://localhost:8080/api/v1/health');
    // Throwing lets Playwright report the failure and exit nonzero
    throw new Error(`Server at ${baseURL} did not become ready`);
  }

  // ... rest of setup ...
}

Rationale:

Explicit timeout prevents indefinite wait
Retry logic handles transient network issues
Detailed error messages enable debugging

Lines affected: Global setup function (lines ~200-250)
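
The wait budget of the retry loop above depends on how each attempt fails. A short worked calculation, assuming the values from the change (30 attempts, 3s request timeout, 2s pause); the function names are hypothetical:

```typescript
// Upper bound on the wait: every attempt runs into the request timeout
// and is followed by a pause.
function worstCaseWaitSeconds(
  maxAttempts: number,
  requestTimeoutSeconds: number,
  pauseSeconds: number,
): number {
  return maxAttempts * (requestTimeoutSeconds + pauseSeconds);
}

// Fast-failure case: the connection is refused immediately,
// so only the pauses contribute.
function fastFailWaitSeconds(maxAttempts: number, pauseSeconds: number): number {
  return maxAttempts * pauseSeconds;
}

console.log(worstCaseWaitSeconds(30, 3, 2)); // 150
console.log(fastFailWaitSeconds(30, 2));     // 60
```

The 60-second figure therefore holds when the server refuses connections outright; a server that accepts connections but never answers can hold global setup for up to 150 seconds, which still fits well inside the 15-minute step timeout.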

Phase 2: Resource Efficiency - Single Worker & Dependency Removal

Objective: Reduce resource contention on 2-core CI runner
Timeline: Implement after Phase 1 verification
Risk Level: MEDIUM - May change test execution order
Rollback: Set workers: undefined to restore parallel execution

Change 2.1: Enforce Single Worker in CI

File: playwright.config.js

Current State (Line 102):

workers: process.env.CI ? 1 : undefined,

Verification: Confirm this is already set. If not, add it.

Rationale:

Single worker = sequential test execution = predictable resource usage
Prevents resource starvation on 2-core runner
Already configured; Phase 1 ensures it's active

Change 2.2: Disable fullyParallel in CI (Already Done)

File: playwright.config.js

Current State before Change 1.3 is applied (Line 101):

fullyParallel: true,

Change:

fullyParallel: process.env.CI ? false : true,

Rationale:

fullyParallel: false in CI forces sequential test execution
Reduces scheduler complexity on resource-constrained runner
Local development still uses fullyParallel: true for speed

Change 2.3: Verify Security Test Dependency Removal (Already Done)

File: playwright.config.js

Current State (Lines ~207-219): Security-tests dependency already removed:

{
  name: 'chromium',
  use: {
    ...devices['Desktop Chrome'],
    storageState: STORAGE_STATE,
  },
  dependencies: ['setup'], // Temporarily removed 'security-tests'
},

Status: ✅ ALREADY FIXED - Security-tests no longer blocks browser tests

Rationale: Unblocks browser tests if security-tests hangs or times out

Phase 3: Infrastructure Hardening - Docker Init System & Caddy CI Profile

Objective: Improve signal handling and reduce I/O logging
Timeline: Implement after Phase 2 verification
Risk Level: MEDIUM - Requires Docker rebuild
Rollback: Remove --init flag and revert Dockerfile changes

Change 3.1: Add Process Init System to Dockerfile

File: Dockerfile

Current State (Lines ~640-650): No init system installed

Change - Add dumb-init:

At bottom of Dockerfile, after the HEALTHCHECK directive, add:

# Add lightweight init system for proper signal handling
# dumb-init forwards signals to child processes, preventing zombie processes
# and ensuring clean shutdown of Caddy/Charon when Docker signals arrive
# This addresses the case where SIGTERM does not propagate to Caddy and Charon
RUN apt-get update && apt-get install -y --no-install-recommends \
    dumb-init \
    && rm -rf /var/lib/apt/lists/*

# Use dumb-init as the real init process
# This ensures SIGTERM signals are properly forwarded to Caddy and Charon
ENTRYPOINT ["dumb-init", "--"]
# Entrypoint script becomes the first argument to dumb-init
CMD ["/docker-entrypoint.sh"]

Rationale:

dumb-init is a simple init system that handles signal forwarding
Ensures SIGTERM propagates to Caddy and Charon when Docker container stops
Prevents zombie processes hanging the container
Lightweight (single binary, ~24KB)

Alternative (if dumb-init unavailable): Use Docker --init flag in compose:

services:
  charon-app:
    init: true  # Enable Docker's built-in init (equivalent to docker run --init)

Change 3.2: Add init: true to Docker Compose

File: .docker/compose/docker-compose.playwright-ci.yml

Current State (Lines ~31-35):

  charon-app:
    # CI provides CHARON_E2E_IMAGE_TAG=charon:e2e-test (locally built image)
    # Local development uses the default fallback value
    image: ${CHARON_E2E_IMAGE_TAG:-charon:e2e-test}
    container_name: charon-playwright
    restart: "no"

Change:

  charon-app:
    # CI provides CHARON_E2E_IMAGE_TAG=charon:e2e-test (locally built image)
    # Local development uses the default fallback value
    image: ${CHARON_E2E_IMAGE_TAG:-charon:e2e-test}
    container_name: charon-playwright
    restart: "no"
    init: true  # NEW: Use Docker's built-in init for proper signal handling
    # Alternative if using dumb-init in Dockerfile: remove this line (init already in ENTRYPOINT)

Rationale:

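init: true tells Docker to run its bundled init process (tini, exposed in the container as /sbin/docker-init) as PID 1
Ensures signals propagate correctly to child processes
Not needed when dumb-init is already the ENTRYPOINT; running both works but is redundant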

Alternatives:

If using dumb-init in Dockerfile: Remove this line (init is in ENTRYPOINT)
If using Docker's built-in init: Keep init: true

Change 3.3: Create Caddy CI Profile (Disable Auto-HTTPS & Reduce Logging)

File: .docker/compose/docker-compose.playwright-ci.yml

Current State (Line ~33-85): caddy service section uses default config

Change - Add Caddy CI configuration:

In the top-level volumes section of the file, add:

  # Caddy CI configuration file (reduced logging, auto-HTTPS disabled)
  caddy-ci-config:
    driver: local
    driver_opts:
      type: tmpfs
      device: tmpfs
      o: size=1m,uid=1000,gid=1000  # 1MB tmpfs for CI temp config

Then in the charon-app service, update the volumes:

Current:

    volumes:
      # Named volume for test data persistence during test runs
      - playwright_data:/app/data
      - playwright_caddy_data:/data
      - playwright_caddy_config:/config

Change:

    volumes:
      # Named volume for test data persistence during test runs
      - playwright_data:/app/data
      - playwright_caddy_data:/data
      - playwright_caddy_config:/config
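      # NEW: CI-specific Caddy config mount
      # Caveat: a tmpfs mount starts empty and is mounted as a directory, so this
      # entry alone does not supply a Caddyfile; the reduced logging comes from
      # the environment variables and the entrypoint change in Change 3.4
      - type: tmpfs
        target: /etc/caddy/Caddyfile
        read_only: true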

Then modify the environment section:

Current:

    environment:
      # Core configuration
      - CHARON_ENV=test
      - CHARON_DEBUG=0
      # ... other vars ...

Change:

    environment:
      # Core configuration
      - CHARON_ENV=test
      - CHARON_DEBUG=0
      # NEW: CI-specific Caddy configuration (reduces I/O buffer overrun)
      - CADDY_ENV_AUTO_HTTPS=off
      - CADDY_ADMIN_BIND=0.0.0.0:2019
      - CADDY_LOG_LEVEL=warn  # Reduce logging overhead
      # ... other vars ...

Rationale:

CADDY_ENV_AUTO_HTTPS=off prevents ACME challenges in CI (no HTTPS needed)
CADDY_LOG_LEVEL=warn reduces log volume, which lowers the risk of an I/O buffer deadlock from excessive Caddy logging

Change 3.4: Update docker-entrypoint.sh to Use CI Profile

File: .docker/docker-entrypoint.sh

Current State (Line ~319-325):

# Start Caddy in the background with initial empty config
# Run Caddy as charon user for security
echo '{"admin":{"listen":"0.0.0.0:2019"},"apps":{}}' > /config/caddy.json
# Use JSON config directly; no adapter needed
run_as_charon caddy run --config /config/caddy.json &

Change - Add CI-specific config:

# Start Caddy in the background with initial empty config
# Run Caddy as charon user for security
# NEW: CI uses reduced logging to prevent I/O buffer deadlock
if [ "$CHARON_ENV" = "test" ] || [ -n "$CI" ]; then
    echo "🚀 Using CI profile for Caddy (reduced logging)"
    # Minimal config for CI: admin API only, no HTTPS
    # Caddy's JSON config sets the level per log under logging.logs.<name>
    echo '{
      "admin":{"listen":"0.0.0.0:2019"},
      "logging":{"logs":{"default":{"level":"WARN"}}},
      "apps":{}
    }' > /config/caddy.json
else
    # Production/local uses default logging
    echo '{"admin":{"listen":"0.0.0.0:2019"},"apps":{}}' > /config/caddy.json
fi

run_as_charon caddy run --config /config/caddy.json &

Rationale:

Detects CI environment and uses reduced logging
Prevents I/O buffer fill from verbose Caddy logs
Production deployments still use default logging
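
The branch in the entrypoint can be mirrored in a few lines to check what each environment produces. This is a sketch only: the function name is hypothetical, and it assumes Caddy's JSON layout in which the level of the default log sits under `logging.logs.default.level`.

```typescript
// Sketch of the two Caddy JSON configs the entrypoint writes.
// Assumes the log level lives under logging.logs.default.level.
interface CaddyConfig {
  admin: { listen: string };
  logging?: { logs: { default: { level: string } } };
  apps: Record<string, unknown>;
}

function buildCaddyConfig(env: { CHARON_ENV?: string; CI?: string }): CaddyConfig {
  // Same condition as the shell test: CHARON_ENV=test or a non-empty CI variable
  const isCI = env.CHARON_ENV === 'test' || Boolean(env.CI);
  const base: CaddyConfig = { admin: { listen: '0.0.0.0:2019' }, apps: {} };
  if (!isCI) return base;
  return { ...base, logging: { logs: { default: { level: 'WARN' } } } };
}

console.log(JSON.stringify(buildCaddyConfig({ CHARON_ENV: 'test' }), null, 2));
console.log(JSON.stringify(buildCaddyConfig({}), null, 2));
```

Comparing the two outputs with the result of `curl http://localhost:2019/config` is a quick way to confirm which profile a running container picked up.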

Phase 4: Verification & Testing Strategy

Objective: Validate fixes incrementally and prepare rollback
Timeline: After each phase
Success Criteria: Tests complete with explicit pass/fail (never hang indefinitely)

Phase 1 Verification (Observability)

Run Command:

# Run single browser with Phase 1 changes only
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e
DEBUG=pw:api,pw:browser,pw:webserver PW_DEBUG_VERBOSE=1 timeout 840s npx playwright test --project=chromium --reporter=line

Success Indicators:

✅ Console shows pw:api and pw:webserver debug output (Playwright webServer startup)
✅ Console shows Caddy admin API responses
✅ Tests complete or fail with explicit error (never hang)
✅ Real-time progress visible (line reporter active)
✅ No "Skipping authenticated security reset" messages

Failure Diagnosis:

If still hanging: Check Docker logs for Caddy errors: docker logs charon-playwright
If webServer timeout: Verify port 8080 is accessible: curl http://localhost:8080/api/v1/health

Phase 2 Verification (Resource Efficiency)

Run Command:

# Run all browsers sequentially (workers: 1)
npx playwright test --workers=1 --reporter=line

Success Indicators:

✅ Tests run sequentially (one browser at a time)
✅ No resource starvation detected (CPU ~50%, Memory ~2GB)
✅ Each browser project completes or times out with explicit message
✅ No "target closed" errors from resource exhaustion

Failure Diagnosis:

If individual browsers hang: Proceed to Phase 3 (init system)
If memory still exhausted: Check test file size: du -sh tests/

Phase 3 Verification (Infrastructure Hardening)

Run Command:

# Rebuild with dumb-init and CI profile
docker build --build-arg BUILD_DEBUG=0 -t charon:e2e-test .
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e
npx playwright test --project=chromium --reporter=line 2>&1

Success Indicators:

✅ dumb-init appears in process tree: docker exec charon-playwright ps aux
✅ SIGTERM propagates correctly on container stop
✅ Caddy logs show log_level=warn (reduced verbosity)
✅ I/O buffer pressure reduced (no buffer overrun errors)

Verification Commands:

# Verify dumb-init is running
docker exec charon-playwright ps aux | grep -E "(dumb-init|caddy|charon)"

# Verify Caddy config
curl http://localhost:2019/config | jq '.logging'

# Check for buffer errors
docker logs charon-playwright 2>&1 | grep -i "buffer\|pipe\|fd\|too many"

Failure Diagnosis:

If dumb-init not present: Check Dockerfile ENTRYPOINT directive
If Caddy logs still verbose: Verify CADDY_LOG_LEVEL=warn environment

Phase 4 Full Integration Test

Run Command:

# Run all browsers with all phases active
npx playwright test --workers=1 --reporter=line,html

Success Criteria:

✅ All browser projects complete (pass or explicit fail)
✅ No indefinite hangs (max 15 minutes per browser)
✅ HTML report generated and artifacts uploaded
✅ Exit code 0 if all pass, nonzero if any failed

Metrics to Collect:

Total runtime per browser (target: <10 min each)
Peak memory usage (target: <2.5GB)
Exit code (0 = success, 1 = test failures, 124 = timeout)
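
When collecting the exit code metric, the three documented values can be mapped to outcomes with a small helper. The function and type names are hypothetical; the code-to-meaning mapping is the one listed above.

```typescript
type RunOutcome = 'passed' | 'test-failures' | 'timeout' | 'unexpected';

// Maps the exit status of `timeout 840s npx playwright test ...` to an outcome.
function classifyExitCode(code: number): RunOutcome {
  switch (code) {
    case 0:
      return 'passed';
    case 1:
      return 'test-failures';
    case 124:
      return 'timeout'; // status returned by `timeout` when the limit is reached
    default:
      return 'unexpected'; // anything else needs manual inspection
  }
}

console.log(classifyExitCode(124)); // timeout
```

Recording the outcome rather than the raw code makes it easier to tell a converted hang (timeout) apart from ordinary test failures across runs.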

Rollback Plan

Phase 1 Rollback (Observability - Safest)

Impact: Zero - read-only changes
Procedure:

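# Revert environment variables in workflow
# (HEAD restores only uncommitted edits; if the changes were already committed,
#  replace HEAD with the last known-good commit)
git checkout HEAD -- .github/workflows/e2e-tests-split.yml

# Rollback playwright.config.js and global setup
git checkout HEAD -- playwright.config.js tests/global-setup.ts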

# No Docker rebuild needed

Verification: Re-run workflow; should behave as before

Phase 2 Rollback (Resource Efficiency - Safe)

Impact: Tests will attempt parallel execution (may reintroduce hang)
Procedure:

# Revert workers and fullyParallel settings
git diff playwright.config.js
# Remove: fullyParallel: process.env.CI ? false : true

# Restore parallel config
sed -i 's/fullyParallel: process.env.CI ? false : true/fullyParallel: true/' playwright.config.js

# No Docker rebuild needed

Verification: Re-run workflow; should execute with multiple workers

Phase 3 Rollback (Infrastructure - Requires Rebuild)

Impact: Container loses graceful shutdown capability
Procedure:

# Revert Dockerfile changes (remove dumb-init)
git checkout HEAD -- Dockerfile
git checkout HEAD -- .docker/compose/docker-compose.playwright-ci.yml
git checkout HEAD -- .docker/docker-entrypoint.sh

# Rebuild image
docker build --build-arg BUILD_DEBUG=0 -t charon:e2e-test .

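# Push new image (only if the image is published; a bare charon:e2e-test tag
# resolves to Docker Hub, so retag with the registry prefix first)
docker push charon:e2e-test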

Verification:

# Verify dumb-init is NOT in process tree
docker exec charon-playwright ps aux | grep dumb-init  # Should be empty

# Verify container still runs (graceful shutdown may fail)

Critical Decision Matrix: Which Phase to Deploy?

Scenario	Phase 1	Phase 2	Phase 3
Observability only	✅ DEPLOY	❌ Skip	❌ Skip
Still hanging after Phase 1	✅ Keep	✅ DEPLOY	❌ Skip
Resource exhaustion detected	✅ Keep	✅ Keep	✅ DEPLOY
All phases needed	✅ Deploy	✅ Deploy	✅ Deploy
Risk of regression	✅ Very Low	⚠️ Medium	⚠️ High

Recommendation: Deploy Phase 1 → Test → If still hanging, deploy Phase 2 → Test → If still hanging, deploy Phase 3
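
The recommendation amounts to a small decision rule: deploy the next phase only while the hang persists. A sketch of that rule, with hypothetical type and function names:

```typescript
type Phase = 1 | 2 | 3;

interface Observation {
  deployed: Phase[];     // phases already deployed
  stillHanging: boolean; // result of the verification run after the last deployment
}

// Returns the next phase to deploy, or null when no further phase is
// needed (hang resolved) or available (all three already deployed).
function nextPhase(obs: Observation): Phase | null {
  if (obs.deployed.length === 0) return 1; // Phase 1 is always deployed first
  if (!obs.stillHanging) return null;
  const last = Math.max(...obs.deployed);
  return last < 3 ? ((last + 1) as Phase) : null;
}

console.log(nextPhase({ deployed: [1], stillHanging: true }));     // 2
console.log(nextPhase({ deployed: [1, 2], stillHanging: false })); // null
```

A null result after all three phases with the hang still present means the cause lies outside this plan and the Phase 1 telemetry should guide further diagnosis.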

Implementation Ordering & Dependencies

Phase 1 (Days 1-2): Parallel [A, B, C, D] - No blocking ordering
├─ A: Add DEBUG env vars to workflow [Changes: .github/workflows/]
├─ B: Add timeout on test step [Changes: .github/workflows/]
├─ C: Enable stdout piping in playwright.config.js [Changes: playwright.config.js]
└─ D: Add health check retry logic to global-setup [Changes: tests/global-setup.ts]

Phase 2 (Day 3): Depends on Phase 1 verification
├─ Enforce workers: 1 (likely already done)
├─ Disable fullyParallel in CI
└─ Verify security-tests dependency removed (already done)

Phase 3 (Days 4-5): Depends on Phase 2 verification
├─ Build Phase: Update Dockerfile with dumb-init
├─ Config Phase: Update docker-compose and entrypoint.sh
└─ Deploy: Rebuild Docker image and push

Parallel execution possible for Phase 1 changes (A, B, C, D)
Sequential requirement: Phase 1 → Phase 2 → Phase 3

Testing Strategy: Minimal Reproducible Example (MRE)

Test 1: Single Browser, Single Test (Quickest Feedback)

# Test only the setup and first test
npx playwright test --project=chromium tests/core/dashboard.spec.ts --reporter=line

Expected Time: <2 minutes
Success: Test passes or fails with explicit error (not hang)

Test 2: Full Browser Suite, Single Shard

# Test all tests in chromium browser
npx playwright test --project=chromium --reporter=line

Expected Time: 8-12 minutes
Success: All tests pass OR fail with report

Test 3: CI Simulation (All Browsers)

# Simulate CI environment
CI=1 npx playwright test --workers=1 --retries=2 --reporter=line,html

Expected Time: 24-36 minutes (3 browsers × 8-12 min each)
Success: All 3 browser projects complete without timeout exception

Observability Checklist

Logs to Monitor During Testing

Playwright Output:

# Should see immediate progress lines
✓ tests/core/dashboard.spec.ts:26 › Dashboard › Page Loading (1.2s)

Docker Logs (Caddy):

docker logs charon-playwright 2>&1 | grep -E "level|error|listen"
# Should see: "level": "warn" (CI mode)

GitHub Actions Output:
- Should see DEBUG output from pw:api and pw:browser
- Should see explicit timeout or completion message
- Should NOT see indefinite hang

Success Criteria (Definition of Done)

Phase 1 complete: DEBUG output visible, explicit timeouts on test step
Phase 1 verified: Run 1x Chromium test; verify completes or fails (not hang)
Phase 2 complete: workers: 1, fullyParallel: false
Phase 2 verified: Run all 3 browsers; measure runtime and memory
Phase 3 complete: dumb-init added, CI profile created
Phase 3 verified: Verify graceful shutdown, log levels
Full integration test: All 3 browsers complete in <35 minutes
Rollback plan documented and tested
CI workflow updated to v2
Developer documentation updated

Dependencies & External Factors

Dependency	Status	Impact
dumb-init availability in debian:trixie-slim	✅ Available	Phase 3 can proceed
Compose file format 3.7+ (supports init: true)	✅ Assumed	Phase 3 compose change
GitHub Actions timeout support	✅ Supported	Phase 1 can proceed
Playwright v1.40+ (supports --reporter=line)	✅ Latest	Phase 1 can proceed

Confidence Assessment

Overall Confidence: 78% (Medium-High)

Reasoning:

High Confidence (85%+):

Issue clearly identified: I/O buffer deadlock + resource starvation
Phase 1 (observability) low-risk, high-information gain
Explicit timeouts will convert hang → error (measurable improvement)

Medium Confidence (70-80%):

Phase 2 (resource efficiency) depends on verifying Phase 1 reduces contention
Phase 3 (init system) addresses signal handling but may not be root cause if app-level deadlock

Lower Confidence (<70%):

Network configuration (IPv4 vs IPv6) could still cause issues
Unknown Playwright webServer detection logic may have other edge cases

Risk Mitigation:

Phase 1 provides debugging telemetry to diagnose remaining issues
Rollback simple for each phase
MRE testing strategy limits blast radius
Incremental deployment reduces rollback overhead

Incremental verification reduces overall risk to 15%

Timeline & Milestones

Milestone	Date	Owner	Duration
Phase 1 Implementation	Feb 5	QA/DevOps	4 hours
Phase 1 Testing & Verification	Feb 5-6	QA	8 hours
Phase 2 Implementation	Feb 6	QA/DevOps	2 hours
Phase 2 Testing	Feb 6	QA	4 hours
Phase 3 Implementation	Feb 7	DevOps	4 hours
Phase 3 Docker Rebuild	Feb 7	DevOps	2 hours
Full Integration Test	Feb 7-8	QA	4 hours
Documentation & Handoff	Feb 8	Engineering	2 hours

Total: 30 hours (4 days)

Follow-Up Actions

After remediation completion:

Documentation Update: Update docs/guides/ci-cd-pipeline.md with new CI profile
Alert Configuration: Add monitoring for test hangs (script: check for zombie processes)
Process Review: Document why hang occurred (post-mortem analysis)
Prevention: Add pre-commit check for fullyParallel: true in CI environment

Appendix A: Diagnostic Commands

# Monitor test progress in real-time
watch -n 1 'docker stats charon-playwright --no-stream | tail -5'

# Check for buffer-related errors
grep -i "buffer\|pipe\|epipe" <(docker logs charon-playwright 2>&1)

# Verify process tree (should see dumb-init → caddy, dumb-init → charon)
docker exec charon-playwright ps auxf

# Check I/O wait time (high = buffer contention)
docker exec charon-playwright iostat -x 1 3

# Verify network configuration (IPv4 vs IPv6)
docker exec charon-playwright curl -4 http://localhost:8080/api/v1/health
docker exec charon-playwright curl -6 http://localhost:8080/api/v1/health

Appendix B: References & Related Documents

Diagnostic Analysis: docs/implementation/FRONTEND_TEST_HANG_FIX.md
Browser Alignment Report: docs/reports/browser_alignment_diagnostic.md
E2E Triage Quick Start: docs/plans/e2e-test-triage-quick-start.md
Playwright Documentation: https://playwright.dev/docs/intro
dumb-init GitHub: https://github.com/Yelp/dumb-init
Docker Init System: https://docs.docker.com/engine/reference/run/#specify-an-init-process

Plan Complete: Ready for Review & Implementation

Next Steps:

Review with QA lead (risk assessment)
Review with DevOps lead (Docker/infrastructure)
Begin Phase 1 implementation
Execute verification tests
Iterate on findings

Generated by Planning Agent on February 4, 2026
Last Updated: N/A (Initial Creation)
Status: READY FOR REVIEW
