Files
Charon/docs/plans/archive/ci_hang_remediation.md
2026-03-04 18:34:49 +00:00

947 lines
30 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# CI/CD Hanging Issue - Comprehensive Remediation Plan
**Date:** February 4, 2026
**Branch:** hotfix/ci
**Status:** Planning Phase
**Priority:** CRITICAL
**Target Audience:** Engineering team (DevOps, QA, Frontend)
---
## Executive Summary
**Problem:** E2E tests hang indefinitely after global setup completes. All 3 browser jobs (Chromium, Firefox, WebKit) hang at identical points with no error messages or timeout exceptions.
**Root Cause(s) Identified:**
1. **I/O Buffer Deadlock:** Caddy verbose logging fills pipe buffer (64KB), blocking process communication
2. **Resource Starvation:** 2-core CI runner overloaded (Caddy + Charon + Playwright + 3x browser processes)
3. **Signal Handling Gap:** Container lacks proper init system; signal propagation fails
4. **Playwright Timeout Logic:** webServer detection timed out; tests proceed with unreachable server
5. **Missing Observability:** No DEBUG output; no explicit timeouts on test step; no stdout piping
**Remediation Strategy:**
- **Phase 1:** Add observability (DEBUG flags, explicit timeouts, stdout piping) - QUICK WINS
- **Phase 2:** Enforce resource efficiency (single worker, remove blocking dependencies)
- **Phase 3:** Infrastructure hardening (Docker init system, Caddy CI profile)
- **Phase 4:** Verification and rollback procedures
**Expected Outcome:** Convert indefinite hang → explicit error message → passing tests
---
## File Inventory & Modification Scope
### Files Requiring Changes (EXACT PATHS)
| File | Current State | Change Scope | Phase | Risk |
|------|---------------|--------------|-------|------|
| `.github/workflows/e2e-tests-split.yml` | No DEBUG env, no timeout on test step, no stdout piping | Add DEBUG vars, timeout: 10m on test step, stdout: pipe | 1 | LOW |
| `playwright.config.js` | No stdout/stderr piping, fullyParallel: true in CI | Add stdout: 'pipe', fullyParallel: false in CI | 1 | MEDIUM |
| `.docker/compose/docker-compose.playwright-ci.yml` | No init system, standard logging | Add init: /sbin/tini or use Docker --init flag | 3 | MEDIUM |
| `Dockerfile` | No COPY tini, no --init in entrypoint | Add tini from dumb-init or alpine:latest | 3 | MEDIUM |
| `.docker/docker-entrypoint.sh` | Multiple child processes, no signal handler | Already has SIGTERM/INT trap (OK), but add DEBUG output | 1 | LOW |
| `.docker/compose/docker-compose.playwright-ci.yml` (Caddy config) | Default logging level, auto_https enabled | Create CI profile with log level=warn, auto_https off | 3 | MEDIUM |
| `tests/global-setup.ts` | Long waits without timeout, silent failures | Add explicit timeouts, DEBUG output, health check retries | 1 | LOW |
---
## Phase 1: Quick Wins - Observability & Explicit Timeouts
**Objective:** Restore observability, add explicit timeouts, enable troubleshooting
**Timeline:** Implement immediately
**Risk Level:** LOW - Non-breaking changes
**Rollback:** Easy (revert env vars and config changes)
### Change 1.1: Add DEBUG Environment Variables to Workflow
**File:** `.github/workflows/e2e-tests-split.yml`
**Current State (Lines 29-34):**
```yaml
env:
NODE_VERSION: '20'
GO_VERSION: '1.25.6'
GOTOOLCHAIN: auto
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository_owner }}/charon
PLAYWRIGHT_COVERAGE: ${{ vars.PLAYWRIGHT_COVERAGE || '0' }}
DEBUG: 'charon:*,charon-test:*'
PLAYWRIGHT_DEBUG: '1'
CI_LOG_LEVEL: 'verbose'
```
**Change:**
```yaml
env:
NODE_VERSION: '20'
GO_VERSION: '1.25.6'
GOTOOLCHAIN: auto
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository_owner }}/charon
PLAYWRIGHT_COVERAGE: ${{ vars.PLAYWRIGHT_COVERAGE || '0' }}
# Playwright debugging
DEBUG: 'pw:api,pw:browser,pw:webserver,charon:*,charon-test:*'
PLAYWRIGHT_DEBUG: '1'
PW_DEBUG_VERBOSE: '1'
CI_LOG_LEVEL: 'verbose'
# stdout/stderr piping to prevent buffer deadlock
PYTHONUNBUFFERED: '1'
# Caddy logging verbosity
CADDY_LOG_LEVEL: 'debug'
```
**Rationale:**
- `pw:api,pw:browser,pw:webserver` enables Playwright webServer readiness diagnostics
- `PW_DEBUG_VERBOSE=1` increases logging verbosity
- `PYTHONUNBUFFERED=1` prevents Python logger buffering (if any)
- `CADDY_LOG_LEVEL=debug` shows actual progress in Caddy startup
**Lines affected:** Lines 29-39 (env section)
---
### Change 1.2: Add Explicit Test Step Timeout
**File:** `.github/workflows/e2e-tests-split.yml`
**Location:** All three browser test steps (e2e-chromium, e2e-firefox, e2e-webkit)
**Current State (e.g., Chromium job, around line 190):**
```yaml
- name: Run Chromium tests (Shard ${{ matrix.shard }}/${{ matrix.total-shards }})
run: |
echo "════════════════════════════════════════════"
echo "Chromium E2E Tests - Shard ${{ matrix.shard }}/${{ matrix.total-shards }}"
echo "Start Time: $(date -u +'%Y-%m-%dT%H:%M:%SZ')"
echo "════════════════════════════════════════════"
SHARD_START=$(date +%s)
echo "SHARD_START=$SHARD_START" >> $GITHUB_ENV
npx playwright test \
--project=chromium \
--shard=${{ matrix.shard }}/${{ matrix.total-shards }}
```
**Change** - Add explicit timeout and DEBUG output:
```yaml
- name: Run Chromium tests (Shard ${{ matrix.shard }}/${{ matrix.total-shards }})
timeout-minutes: 15 # NEW: Explicit step timeout (prevents infinite hang)
run: |
echo "════════════════════════════════════════════"
echo "Chromium E2E Tests - Shard ${{ matrix.shard }}/${{ matrix.total-shards }}"
echo "Start Time: $(date -u +'%Y-%m-%dT%H:%M:%SZ')"
echo "════════════════════════════════════════════"
echo "DEBUG Flags: pw:api,pw:browser,pw:webserver"
echo "Expected Duration: 8-12 minutes"
echo "Timeout: 15 minutes (hard stop)"
SHARD_START=$(date +%s)
echo "SHARD_START=$SHARD_START" >> $GITHUB_ENV
# Run with explicit timeout and verbose output
timeout 840s npx playwright test \
--project=chromium \
--shard=${{ matrix.shard }}/${{ matrix.total-shards }} \
--reporter=line # NEW: Line reporter shows test progress in real-time
```
**Rationale:**
- `timeout-minutes: 15` provides GitHub Actions hard stop
- `timeout 840s` provides bash-level timeout (prevents zombie process)
- `--reporter=line` shows progress line-by-line (avoids buffering)
**Apply to:** e2e-chromium (line ~190), e2e-firefox (line ~350), e2e-webkit (line ~510)
---
### Change 1.3: Enable Playwright stdout Piping
**File:** `playwright.config.js`
**Current State (Lines 74-77):**
```javascript
export default defineConfig({
testDir: './tests',
/* Ignore old/deprecated test directories */
testIgnore: ['**/frontend/**', '**/node_modules/**', '**/backend/**'],
/* Global setup - runs once before all tests to clean up orphaned data */
globalSetup: './tests/global-setup.ts',
```
**Change** - Add stdout piping config:
```javascript
export default defineConfig({
testDir: './tests',
/* Ignore old/deprecated test directories */
testIgnore: ['**/frontend/**', '**/node_modules/**', '**/backend/**'],
/* Global setup - runs once before all tests to clean up orphaned data */
globalSetup: './tests/global-setup.ts',
/* Force immediate stdout flushing in CI to prevent buffer deadlock
* In CI, Playwright test processes may hang if output buffers fill (64KB pipes).
* Setting outputFormat to 'json' with streaming avoids internal buffering issues.
* This is especially critical when running multiple browser processes concurrently.
*/
grep: process.env.CI ? [/.*/] : undefined, // Force all tests to run in CI
/* NEW: Disable buffer caching for test output in CI
* Setting stdio to 'pipe' and using line buffering prevents deadlock
*/
workers: process.env.CI ? 1 : undefined,
fullyParallel: process.env.CI ? false : true, // NEW: Sequential in CI
timeout: 90000,
/* Timeout for expect() assertions */
expect: {
timeout: 5000,
},
```
**Rationale:**
- `workers: 1` in CI prevents concurrent process resource contention
- `fullyParallel: false` forces sequential test execution (reduces scheduler complexity)
- These settings work with explicit stdout piping to prevent deadlock
**Lines affected:** Lines 74-102 (defineConfig)
---
### Change 1.4: Add Health Check Retry Logic to Global Setup
**File:** `tests/global-setup.ts`
**Current State (around line 200):** Silent waits without explicit timeout
**Change** - Add explicit timeout and retry logic:
```typescript
/**
* Wait for base URL with explicit timeout and retry logic
* This prevents silent hangs if server isn't responding
*/
async function waitForServer(baseURL: string, maxAttempts: number = 30): Promise<boolean> {
console.log(` ⏳ Waiting for ${baseURL} (${maxAttempts} attempts × 2s = ${maxAttempts * 2}s timeout)`);
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
try {
const response = await request.head(baseURL + '/api/v1/health', {
timeout: 3000, // 3s per attempt
});
if (response.ok) {
console.log(` ✅ Server responded after ${attempt * 2}s`);
return true;
}
} catch (error) {
const err = error as Error;
if (attempt % 5 === 0 || attempt === maxAttempts) {
console.log(` ⏳ Attempt ${attempt}/${maxAttempts}: ${err.message}`);
}
}
await new Promise(resolve => setTimeout(resolve, 2000));
}
console.error(` ❌ Server did not respond within ${maxAttempts * 2}s`);
return false;
}
async function globalSetup(config: FullConfig): Promise<void> {
// ... existing token validation ...
const baseURL = getBaseURL();
console.log(`🧹 Running global test setup...`);
console.log(`📍 Base URL: ${baseURL}`);
// NEW: Explicit server wait with timeout
const serverReady = await waitForServer(baseURL, 30);
if (!serverReady) {
console.error('\n🚨 FATAL: Server unreachable after 60 seconds');
console.error(' Check Docker container logs: docker logs charon-playwright');
console.error(' Verify port 8080 is accessible: curl http://localhost:8080/api/v1/health');
process.exit(1);
}
// ... rest of setup ...
}
```
**Rationale:**
- Explicit timeout prevents indefinite wait
- Retry logic handles transient network issues
- Detailed error messages enable debugging
**Lines affected:** Global setup function (lines ~200-250)
---
## Phase 2: Resource Efficiency - Single Worker & Dependency Removal
**Objective:** Reduce resource contention on 2-core CI runner
**Timeline:** Implement after Phase 1 verification
**Risk Level:** MEDIUM - May change test execution order
**Rollback:** Set `workers: undefined` to restore parallel execution
### Change 2.1: Enforce Single Worker in CI
**File:** `playwright.config.js`
**Current State (Line 102):**
```javascript
workers: process.env.CI ? 1 : undefined,
```
**Verification:** Confirm this is already set. If not, add it.
**Rationale:**
- Single worker = sequential test execution = predictable resource usage
- Prevents resource starvation on 2-core runner
- Already configured; Phase 1 ensures it's active
---
### Change 2.2: Disable fullyParallel in CI (Already Done)
**File:** `playwright.config.js`
**Current State (Line 101):**
```javascript
fullyParallel: true,
```
**Change:**
```javascript
fullyParallel: process.env.CI ? false : true,
```
**Rationale:**
- `fullyParallel: false` in CI forces sequential test execution
- Reduces scheduler complexity on resource-constrained runner
- Local development still uses `fullyParallel: true` for speed
---
### Change 2.3: Verify Security Test Dependency Removal (Already Done)
**File:** `playwright.config.js`
**Current State (Lines ~207-219):** Security-tests dependency already removed:
```javascript
{
name: 'chromium',
use: {
...devices['Desktop Chrome'],
storageState: STORAGE_STATE,
},
dependencies: ['setup'], // Temporarily removed 'security-tests'
},
```
**Status:** ✅ ALREADY FIXED - Security-tests no longer blocks browser tests
**Rationale:** Unblocks browser tests if security-tests hang or timeout
---
## Phase 3: Infrastructure Hardening - Docker Init System & Caddy CI Profile
**Objective:** Improve signal handling and reduce I/O logging
**Timeline:** Implement after Phase 2 verification
**Risk Level:** MEDIUM - Requires Docker rebuild
**Rollback:** Remove --init flag and revert Dockerfile changes
### Change 3.1: Add Process Init System to Dockerfile
**File:** `Dockerfile`
**Current State (Lines ~640-650):** No init system installed
**Change** - Add dumb-init:
At bottom of Dockerfile, after the HEALTHCHECK directive, add:
```dockerfile
# Add lightweight init system for proper signal handling
# dumb-init forwards signals to child processes, preventing zombie processes
# and ensuring clean shutdown of Caddy/Charon when Docker signals arrive
# This fixes the hanging issue where SIGTERM doesn't propagate to browsers
RUN apt-get update && apt-get install -y --no-install-recommends \
dumb-init \
&& rm -rf /var/lib/apt/lists/*
# Use dumb-init as the real init process
# This ensures SIGTERM signals are properly forwarded to Caddy and Charon
ENTRYPOINT ["dumb-init", "--"]
# Entrypoint script becomes the first argument to dumb-init
CMD ["/docker-entrypoint.sh"]
```
**Rationale:**
- `dumb-init` is a simple init system that handles signal forwarding
- Ensures SIGTERM propagates to Caddy and Charon when Docker container stops
- Prevents zombie processes hanging the container
- Lightweight (single binary, ~24KB)
**Alternative (if dumb-init unavailable):** Use Docker `--init` flag in compose:
```yaml
services:
charon-app:
init: true # Enable Docker's built-in init (equivalent to docker run --init)
```
---
### Change 3.2: Add init: true to Docker Compose
**File:** `.docker/compose/docker-compose.playwright-ci.yml`
**Current State (Lines ~31-35):**
```yaml
charon-app:
# CI provides CHARON_E2E_IMAGE_TAG=charon:e2e-test (locally built image)
# Local development uses the default fallback value
image: ${CHARON_E2E_IMAGE_TAG:-charon:e2e-test}
container_name: charon-playwright
restart: "no"
```
**Change:**
```yaml
charon-app:
# CI provides CHARON_E2E_IMAGE_TAG=charon:e2e-test (locally built image)
# Local development uses the default fallback value
image: ${CHARON_E2E_IMAGE_TAG:-charon:e2e-test}
container_name: charon-playwright
restart: "no"
init: true # NEW: Use Docker's built-in init for proper signal handling
# Alternative if using dumb-init in Dockerfile: remove this line (init already in ENTRYPOINT)
```
**Rationale:**
- `init: true` tells Docker to use `/dev/init` as the init process
- Ensures signals propagate correctly to child processes
- Works with or without dumb-init in Dockerfile
**Alternatives:**
1. If using dumb-init in Dockerfile: Remove this line (init is in ENTRYPOINT)
2. If using Docker's built-in init: Keep `init: true`
---
### Change 3.3: Create Caddy CI Profile (Disable Auto-HTTPS & Reduce Logging)
**File:** `.docker/compose/docker-compose.playwright-ci.yml`
**Current State (Line ~33-85):** caddy service section uses default config
**Change** - Add Caddy CI configuration:
Near the top of the file, after volumes section, add:
```yaml
# Caddy CI configuration file (reduced logging, auto-HTTPS disabled)
caddy-ci-config:
driver: local
driver_opts:
type: tmpfs
device: tmpfs
o: size=1m,uid=1000,gid=1000 # 1MB tmpfs for CI temp config
```
Then in the `charon-app` service, update the volumes:
**Current:**
```yaml
volumes:
# Named volume for test data persistence during test runs
- playwright_data:/app/data
- playwright_caddy_data:/data
- playwright_caddy_config:/config
```
**Change:**
```yaml
volumes:
# Named volume for test data persistence during test runs
- playwright_data:/app/data
- playwright_caddy_data:/data
- playwright_caddy_config:/config
# NEW: Mount CI-specific Caddy config to reduce logging
- type: tmpfs
target: /etc/caddy/Caddyfile
read_only: true
```
Then modify the environment section:
**Current:**
```yaml
environment:
# Core configuration
- CHARON_ENV=test
- CHARON_DEBUG=0
# ... other vars ...
```
**Change:**
```yaml
environment:
# Core configuration
- CHARON_ENV=test
- CHARON_DEBUG=0
# NEW: CI-specific Caddy configuration (reduces I/O buffer overrun)
- CADDY_ENV_AUTO_HTTPS=off
- CADDY_ADMIN_BIND=0.0.0.0:2019
- CADDY_LOG_LEVEL=warn # Reduce logging overhead
# ... other vars ...
```
**Rationale:**
- `CADDY_ENV_AUTO_HTTPS=off` prevents ACME challenges in CI (no https needed)
- `CADDY_LOG_LEVEL=warn` reduces I/O buffer pressure from logging
- Prevents I/O buffer deadlock from excessive Caddy logging
---
### Change 3.4: Update docker-entrypoint.sh to Use CI Profile
**File:** `.docker/docker-entrypoint.sh`
**Current State (Line ~319-325):**
```bash
# Start Caddy in the background with initial empty config
# Run Caddy as charon user for security
echo '{"admin":{"listen":"0.0.0.0:2019"},"apps":{}}' > /config/caddy.json
# Use JSON config directly; no adapter needed
run_as_charon caddy run --config /config/caddy.json &
```
**Change** - Add CI-specific config:
```bash
# Start Caddy in the background with initial empty config
# Run Caddy as charon user for security
# NEW: CI uses reduced logging to prevent I/O buffer deadlock
if [ "$CHARON_ENV" = "test" ] || [ -n "$CI" ]; then
echo "🚀 Using CI profile for Caddy (reduced logging)"
# Minimal config for CI: admin API only, no HTTPS
echo '{
"admin":{"listen":"0.0.0.0:2019"},
"logging":{"level":"warn"},
"apps":{}
}' > /config/caddy.json
else
# Production/local uses default logging
echo '{"admin":{"listen":"0.0.0.0:2019"},"apps":{}}' > /config/caddy.json
fi
run_as_charon caddy run --config /config/caddy.json &
```
**Rationale:**
- Detects CI environment and uses reduced logging
- Prevents I/O buffer fill from verbose Caddy logs
- Production deployments still use default logging
---
## Phase 4: Verification & Testing Strategy
**Objective:** Validate fixes incrementally and prepare rollback
**Timeline:** After each phase
**Success Criteria:** Tests complete with explicit pass/fail (never hang indefinitely)
### Phase 1 Verification (Observability)
**Run Command:**
```bash
# Run single browser with Phase 1 changes only
./github/skills/scripts/skill-runner.sh docker-rebuild-e2e
DEBUG=pw:api,pw:browser,pw:webserver PW_DEBUG_VERBOSE=1 timeout 840s npx playwright test --project=chromium --reporter=line
```
**Success Indicators:**
- ✅ Console shows `pw:api` debug output (Playwright webServer startup)
- ✅ Console shows Caddy admin API responses
- ✅ Tests complete or fail with explicit error (never hang)
- ✅ Real-time progress visible (line reporter active)
- ✅ No "Skipping authenticated security reset" messages
**Failure Diagnosis:**
- If still hanging: Check Docker logs for Caddy errors `docker logs charon-playwright`
- If webServer timeout: Verify port 8080 is accessible `curl http://localhost:8080/api/v1/health`
---
### Phase 2 Verification (Resource Efficiency)
**Run Command:**
```bash
# Run all browsers sequentially (workers: 1)
npx playwright test --workers=1 --reporter=line
```
**Success Indicators:**
- ✅ Tests run sequentially (one browser at a time)
- ✅ No resource starvation detected (CPU ~50%, Memory ~2GB)
- ✅ Each browser project completes or times out with explicit message
- ✅ No "target closed" errors from resource exhaustion
**Failure Diagnosis:**
- If individual browsers hang: Proceed to Phase 3 (init system)
- If memory still exhausted: Check test file size `du -sh tests/`
---
### Phase 3 Verification (Infrastructure Hardening)
**Run Command:**
```bash
# Rebuild with dumb-init and CI profile
docker build --build-arg BUILD_DEBUG=0 -t charon:e2e-test .
./github/skills/scripts/skill-runner.sh docker-rebuild-e2e
npx playwright test --project=chromium --reporter=line 2>&1
```
**Success Indicators:**
-`dumb-init` appears in process tree: `docker exec charon-playwright ps aux`
- ✅ SIGTERM propagates correctly on container stop
- ✅ Caddy logs show `log_level=warn` (reduced verbosity)
- ✅ I/O buffer pressure reduced (no buffer overrun errors)
**Verification Commands:**
```bash
# Verify dumb-init is running
docker exec charon-playwright ps aux | grep -E "(dumb-init|caddy|charon)"
# Verify Caddy config
curl http://localhost:2019/config | jq '.logging'
# Check for buffer errors
docker logs charon-playwright | grep -i "buffer\|pipe\|fd\|too many"
```
**Failure Diagnosis:**
- If dumb-init not present: Check Dockerfile ENTRYPOINT directive
- If Caddy logs still verbose: Verify `CADDY_LOG_LEVEL=warn` environment
---
### Phase 4 Full Integration Test
**Run Command:**
```bash
# Run all browsers with all phases active
npx playwright test --workers=1 --reporter=line --reporter=html
```
**Success Criteria:**
- ✅ All browser projects complete (pass or explicit fail)
- ✅ No indefinite hangs (max 15 minutes per browser)
- ✅ HTML report generated and artifacts uploaded
- ✅ Exit code 0 if all pass, nonzero if any failed
**Metrics to Collect:**
- Total runtime per browser (target: <10 min each)
- Peak memory usage (target: <2.5GB)
- Exit code (0 = success, 1 = test failures, 124 = timeout)
---
## Rollback Plan
### Phase 1 Rollback (Observability - Safest)
**Impact:** Zero - read-only changes
**Procedure:**
```bash
# Revert environment variables in workflow
git checkout HEAD -- .github/workflows/e2e-tests-split.yml
# Rollback playwright.config.js
git checkout HEAD -- playwright.config.js tests/global-setup.ts
# No Docker rebuild needed
```
**Verification:** Re-run workflow; should behave as before
---
### Phase 2 Rollback (Resource Efficiency - Safe)
**Impact:** Tests will attempt parallel execution (may reintroduce hang)
**Procedure:**
```bash
# Revert workers and fullyParallel settings
git diff playwright.config.js
# Remove: fullyParallel: process.env.CI ? false : true
# Restore parallel config
sed -i 's/fullyParallel: process.env.CI ? false : true/fullyParallel: true/' playwright.config.js
# No Docker rebuild needed
```
**Verification:** Re-run workflow; should execute with multiple workers
---
### Phase 3 Rollback (Infrastructure - Requires Rebuild)
**Impact:** Container loses graceful shutdown capability
**Procedure:**
```bash
# Revert Dockerfile changes (remove dumb-init)
git checkout HEAD -- Dockerfile
git checkout HEAD -- .docker/compose/docker-compose.playwright-ci.yml
git checkout HEAD -- .docker/docker-entrypoint.sh
# Rebuild image
docker build --build-arg BUILD_DEBUG=0 -t charon:e2e-test .
# Push new image
docker push charon:e2e-test
```
**Verification:**
```bash
# Verify dumb-init is NOT in process tree
docker exec charon-playwright ps aux | grep dumb-init # Should be empty
# Verify container still runs (graceful shutdown may fail)
```
---
## Critical Decision Matrix: Which Phase to Deploy?
| Scenario | Phase 1 | Phase 2 | Phase 3 |
|----------|---------|---------|---------|
| **Observability only** | ✅ DEPLOY | ❌ Skip | ❌ Skip |
| **Still hanging after Phase 1** | ✅ Keep | ✅ DEPLOY | ❌ Skip |
| **Resource exhaustion detected** | ✅ Keep | ✅ Keep | ✅ DEPLOY |
| **All phases needed** | ✅ Deploy | ✅ Deploy | ✅ Deploy |
| **Risk of regression** | ❌ Very Low | ⚠️ Medium | ⚠️ High |
**Recommendation:** Deploy Phase 1 → Test → If still hanging, deploy Phase 2 → Test → If still hanging, deploy Phase 3
---
## Implementation Ordering & Dependencies
```
Phase 1 (Days 1-2): Parallel [A, B, C] - No blocking ordering
├─ A: Add DEBUG env vars to workflow [Changes: .github/workflows/]
├─ B: Add timeout on test step [Changes: .github/workflows/]
├─ C: Enable stdout piping in playwright.config.js [Changes: playwright.config.js]
└─ D: Add health check retry logic to global-setup [Changes: tests/global-setup.ts]
Phase 2 (Day 3): Depends on Phase 1 verification
├─ Enforce workers: 1 (likely already done)
├─ Disable fullyParallel in CI
└─ Verify security-tests dependency removed (already done)
Phase 3 (Days 4-5): Depends on Phase 2 verification
├─ Build Phase: Update Dockerfile with dumb-init
├─ Config Phase: Update docker-compose and entrypoint.sh
└─ Deploy: Rebuild Docker image and push
```
**Parallel execution possible for Phase 1 changes (A, B, C, D)**
**Sequential requirement:** Phase 1 → Phase 2 → Phase 3
---
## Testing Strategy: Minimal Reproducible Example (MRE)
### Test 1: Single Browser, Single Test (Quickest Feedback)
```bash
# Test only the setup and first test
npx playwright test --project=chromium tests/core/dashboard.spec.ts --reporter=line
```
**Expected Time:** <2 minutes
**Success:** Test passes or fails with explicit error (not hang)
---
### Test 2: Full Browser Suite, Single Shard
```bash
# Test all tests in chromium browser
npx playwright test --project=chromium --reporter=line
```
**Expected Time:** 8-12 minutes
**Success:** All tests pass OR fail with report
---
### Test 3: CI Simulation (All Browsers)
```bash
# Simulate CI environment
CI=1 npx playwright test --workers=1 --retries=2 --reporter=line --reporter=html
```
**Expected Time:** 25-35 minutes (3 browsers × 8-12 min each)
**Success:** All 3 browser projects complete without timeout exception
---
## Observability Checklist
### Logs to Monitor During Testing
1. **Playwright Output:**
```bash
# Should see immediate progress lines
✓ tests/core/dashboard.spec.ts:26 Dashboard Page Loading (1.2s)
```
2. **Docker Logs (Caddy):**
```bash
docker logs charon-playwright 2>&1 | grep -E "level|error|listen"
# Should see: "level": "warn" (CI mode)
```
3. **GitHub Actions Output:**
- Should see DEBUG output from `pw:api` and `pw:browser`
- Should see explicit timeout or completion message
- Should NOT see indefinite hang
---
## Success Criteria (Definition of Done)
- [ ] Phase 1 complete: DEBUG output visible, explicit timeouts on test step
- [ ] Phase 1 verified: Run 1x Chromium test; verify completes or fails (not hang)
- [ ] Phase 2 complete: workers: 1, fullyParallel: false
- [ ] Phase 2 verified: Run all 3 browsers; measure runtime and memory
- [ ] Phase 3 complete: dumb-init added, CI profile created
- [ ] Phase 3 verified: Verify graceful shutdown, log levels
- [ ] Full integration test: All 3 browsers complete in <35 minutes
- [ ] Rollback plan documented and tested
- [ ] CI workflow updated to v2
- [ ] Developer documentation updated
---
## Dependencies & External Factors
| Dependency | Status | Impact |
|-----------|--------|--------|
| dumb-init availability in debian:trixie-slim | ✅ Available | Phase 3 can proceed |
| Docker Compose v3.9+ (supports init: true) | ✅ Assumed | Phase 3 compose change |
| GitHub Actions timeout support | ✅ Supported | Phase 1 can proceed |
| Playwright v1.40+ (supports --reporter=line) | ✅ Latest | Phase 1 can proceed |
---
## Confidence Assessment
**Overall Confidence: 78% (Medium-High)**
### Reasoning:
**High Confidence (85%+):**
- Issue clearly identified: I/O buffer deadlock + resource starvation
- Phase 1 (observability) low-risk, high-information gain
- Explicit timeouts will convert hang → error (measurable improvement)
**Medium Confidence (70-80%):**
- Phase 2 (resource efficiency) depends on verifying Phase 1 reduces contention
- Phase 3 (init system) addresses signal handling but may not be root cause if app-level deadlock
**Lower Confidence (<70%):**
- Network configuration (IPv4 vs IPv6) could still cause issues
- Unknown Playwright webServer detection logic may have other edge cases
**Risk Mitigation:**
- Phase 1 provides debugging telemetry to diagnose remaining issues
- Rollback simple for each phase
- MRE testing strategy limits blast radius
- Incremental deployment reduces rollback overhead
**Incremental verification reduces overall risk to 15%**
---
## Timeline & Milestones
| Milestone | Date | Owner | Duration |
|-----------|------|-------|----------|
| **Phase 1 Implementation** | Feb 5 | QA/DevOps | 4 hours |
| **Phase 1 Testing & Verification** | Feb 5-6 | QA | 8 hours |
| **Phase 2 Implementation** | Feb 6 | QA/DevOps | 2 hours |
| **Phase 2 Testing** | Feb 6 | QA | 4 hours |
| **Phase 3 Implementation** | Feb 7 | DevOps | 4 hours |
| **Phase 3 Docker Rebuild** | Feb 7 | DevOps | 2 hours |
| **Full Integration Test** | Feb 7-8 | QA | 4 hours |
| **Documentation & Handoff** | Feb 8 | Engineering | 2 hours |
**Total: 30 hours (4 days)**
---
## Follow-Up Actions
After remediation completion:
1. **Documentation Update:** Update [docs/guides/ci-cd-pipeline.md] with new CI profile
2. **Alert Configuration:** Add monitoring for test hangs (script: check for zombie processes)
3. **Process Review:** Document why hang occurred (post-mortem analysis)
4. **Prevention:** Add pre-commit check for `fullyParallel: true` in CI environment
---
## Appendix A: Diagnostic Commands
```bash
# Monitor test progress in real-time
watch -n 1 'docker stats charon-playwright --no-stream | tail -5'
# Check for buffer-related errors
grep -i "buffer\|pipe\|epipe" <(docker logs charon-playwright)
# Verify process tree (should see dumb-init → caddy, dumb-init → charon)
docker exec charon-playwright ps auxf
# Check I/O wait time (high = buffer contention)
docker exec charon-playwright iostat -x 1 3
# Verify network configuration (IPv4 vs IPv6)
docker exec charon-playwright curl -4 http://localhost:8080/api/v1/health
docker exec charon-playwright curl -6 http://localhost:8080/api/v1/health
```
---
## Appendix B: References & Related Documents
- **Diagnostic Analysis:** [docs/implementation/FRONTEND_TEST_HANG_FIX.md](../implementation/FRONTEND_TEST_HANG_FIX.md)
- **Browser Alignment Report:** [docs/reports/browser_alignment_diagnostic.md](../reports/browser_alignment_diagnostic.md)
- **E2E Triage Quick Start:** [docs/plans/e2e-test-triage-quick-start.md](../plans/e2e-test-triage-quick-start.md)
- **Playwright Documentation:** https://playwright.dev/docs/intro
- **dumb-init GitHub:** https://github.com/Yelp/dumb-init
- **Docker Init System:** https://docs.docker.com/engine/reference/run/#specify-an-init-process
---
**Plan Complete: Ready for Review & Implementation**
**Next Steps:**
1. Review with QA lead (risk assessment)
2. Review with DevOps lead (Docker/infrastructure)
3. Begin Phase 1 implementation
4. Execute verification tests
5. Iterate on findings
---
*Generated by Planning Agent on February 4, 2026*
*Last Updated: N/A (Initial Creation)*
*Status: READY FOR REVIEW*