diff --git a/.github/workflows/e2e-tests.yml b/.github/workflows/e2e-tests.yml index 152c3ccd..62807f90 100644 --- a/.github/workflows/e2e-tests.yml +++ b/.github/workflows/e2e-tests.yml @@ -102,17 +102,6 @@ jobs: - name: Install dependencies run: npm ci - - name: Install frontend dependencies - run: npm ci - working-directory: frontend - - - name: Build frontend - run: npm run build - working-directory: frontend - - - name: Build backend - run: make build - - name: Set up Docker Buildx uses: docker/setup-buildx-action@8d2750c68a42422c14e847fe6c8ac0403b4cbd6f # v3 @@ -180,7 +169,8 @@ jobs: - name: Start test environment run: | # Use the committed docker-compose.playwright.yml for E2E testing - docker compose -f .docker/compose/docker-compose.playwright.yml up -d --build + # Note: Using pre-built image loaded from artifact - no rebuild needed + docker compose -f .docker/compose/docker-compose.playwright.yml up -d echo "βœ… Container started via docker-compose.playwright.yml" - name: Wait for service health diff --git a/docs/plans/current_spec.md b/docs/plans/current_spec.md index dd8d00df..25750ac0 100644 --- a/docs/plans/current_spec.md +++ b/docs/plans/current_spec.md @@ -1,477 +1,922 @@ -# Go Version Mismatch Fix - Critical CI/CD Pipeline Issue +# E2E Workflow Optimization - Efficiency Analysis -**Issue**: PR #550 blocked by Go version compatibility error -**Status**: Analysis Complete - Ready for Implementation (REVISED: All 7 Workflows) -**Priority**: πŸ”΄ CRITICAL - Blocking entire build pipeline +**Issue**: E2E workflow contains redundant build steps and inefficiencies +**Status**: Analysis Complete - Ready for Implementation +**Priority**: 🟑 MEDIUM - Performance optimization opportunity **Created**: 2026-01-26 -**Revised**: 2026-01-26 (Scope expanded from 2 to 7 workflows) +**Estimated Savings**: ~2-4 minutes per workflow run (~30-40% reduction) --- -## 🎯 Scope Summary +## 🎯 Executive Summary -This specification covers **ALL 7 GitHub Actions workflows** that use Go: +The E2E workflow `.github/workflows/e2e-tests.yml` builds and tests the application efficiently with proper sharding, but contains **4 critical redundancies** that waste CI resources: -| # | Workflow | Current Go Version | Status | Action Required | -|---|----------|-------------------|--------|-----------------| -| 1 | `quality-checks.yml` | 1.25.6 βœ… | Correct version | Add `GOTOOLCHAIN: auto` | -| 2 | `codeql.yml` | 1.25.6 βœ… | Correct version | Add `GOTOOLCHAIN: auto` | -| 3 | `benchmark.yml` | 1.25.6 βœ… | Correct version | Add `GOTOOLCHAIN: auto` | -| 4 | `codecov-upload.yml` | 1.25.6 βœ… | Correct version | Add `GOTOOLCHAIN: auto` | -| 5 | `e2e-tests.yml` | 1.21 ⚠️ | **OUTDATED!** | Update to 1.25.6 + Add `GOTOOLCHAIN: auto` | -| 6 | `nightly-build.yml` | Hardcoded ⚠️ | No global env | Create env section with `GOTOOLCHAIN: auto` | -| 7 | `release-goreleaser.yml` | 1.25.6 βœ… | Correct version | Add `GOTOOLCHAIN: auto` | +| Issue | Location | Impact | Fix Complexity | +|-------|----------|--------|----------------| +| πŸ”΄ **Docker rebuild** | Line 157 | 30-60s per shard (Γ—4) | LOW - Remove flag | +| 🟑 **Duplicate npm installs** | Lines 81, 205, 215 | 20-30s per shard (Γ—4) | MEDIUM - Cache better | +| 🟑 **Unnecessary pre-builds** | Lines 90, 93 | 30-45s in build job | LOW - Remove steps | +| 🟒 **Browser install caching** | Line 201 | 5-10s per shard (Γ—4) | LOW - Already implemented | -**Why All 7?** Initial analysis only covered 2 workflows. Supervisor review identified 5 additional workflows that would fail without this fix, including a CRITICAL issue in `e2e-tests.yml` using outdated Go 1.21. +**Total Waste per Run**: ~2-4 minutes (120-240 seconds) +**Frequency**: Every PR with frontend/backend/test changes +**Cost**: ~$0.10-0.20 per run (GitHub-hosted runners) --- -## Problem Analysis +## πŸ“Š Current Workflow Architecture + +### Job Flow Diagram -### Error Context ``` -go: ../go.work requires go >= 1.25.6 (running go 1.21.13; GOTOOLCHAIN=local) -make: *** [Makefile:62: build] Error 1 +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ 1. BUILD JOB β”‚ Runs once +β”‚ - Build image β”‚ +β”‚ - Save as tar β”‚ +β”‚ - Upload β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β” + β–Ό β–Ό β–Ό β–Ό + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ SHARD 1β”‚ β”‚ SHARD 2β”‚ β”‚ SHARD 3β”‚ β”‚ SHARD 4β”‚ Run in parallel + β”‚ Tests β”‚ β”‚ Tests β”‚ β”‚ Tests β”‚ β”‚ Tests β”‚ + β””β”€β”€β”€β”€β”¬β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”˜ + β”‚ β”‚ β”‚ β”‚ + β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” + β–Ό β–Ό + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ MERGE β”‚ β”‚ UPLOAD β”‚ + β”‚ REPORTS β”‚ β”‚ COVERAGE β”‚ + β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ β”‚ + β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β–Ό + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ COMMENT PR β”‚ + β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ STATUS CHECK β”‚ + β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` -### Root Cause Identified +### Jobs Breakdown -**The issue is NOT an invalid Go version.** Go 1.25.6 is a valid, released version (verified via `https://go.dev/dl/`). +| Job | Dependencies | Parallelism | Duration | Purpose | +|-----|--------------|-------------|----------|---------| +| `build` | None | 1 instance | ~2-3 min | Build Docker image once | +| `e2e-tests` | `build` | 4 shards | ~5-8 min | Run tests with coverage | +| `merge-reports` | `e2e-tests` | 1 instance | ~30-60s | Combine HTML reports | +| `comment-results` | `e2e-tests`, `merge-reports` | 1 instance | ~10s | Post PR comment | +| `upload-coverage` | `e2e-tests` | 1 instance | ~30-60s | Merge & upload to Codecov | +| `e2e-results` | `e2e-tests` | 1 instance | ~5s | Final status gate | -**The actual problem**: The pre-commit framework sets `GOTOOLCHAIN=local` by default, which prevents automatic toolchain upgrades. When CI runs with an older Go version (1.21.13), it cannot upgrade to the required 1.25.6. - -**Evidence**: -- `backend/.venv/lib/python3.12/site-packages/pre_commit/languages/golang.py` explicitly sets `GOTOOLCHAIN=local` -- CI environment has Go 1.21.13 installed system-wide -- Workspace requires Go 1.25.6 (go.work, go.mod) -- Docker builds use Go 1.25.6 successfully -- Local environment with Go 1.25.6 works correctly - -### Current Configuration Audit - -| File | Go Version | Status | -|------|------------|--------| -| `go.work` | 1.25.6 | βœ… Correct | -| `backend/go.mod` | 1.25.6 | βœ… Correct | -| `Dockerfile` (gosu-builder) | 1.25-trixie | βœ… Correct | -| `Dockerfile` (backend-builder) | 1.25-trixie | βœ… Correct | -| `Dockerfile` (caddy-builder) | 1.25-trixie | βœ… Correct | -| `Dockerfile` (crowdsec-builder) | 1.25.6-trixie | βœ… Correct (pinned via Renovate) | -| `.github/workflows/quality-checks.yml` | 1.25.6 | βœ… Correct | -| `.github/workflows/docker-build.yml` | (uses Dockerfile) | βœ… Correct | -| `.github/workflows/codeql.yml` | 1.25.6 | βœ… Correct | -| `Makefile` (install-go comment) | 1.25.5 | ⚠️ Outdated comment | - -**Conclusion**: Most version declarations are correctly set to 1.25.6. However, **CRITICAL FINDING**: `e2e-tests.yml` uses outdated Go 1.21, which MUST be updated to 1.25.6. Additionally, the CI environment's inability to upgrade due to `GOTOOLCHAIN=local` affects all 7 workflows. - -**Critical Issues Found During Analysis**: -1. ⚠️ **e2e-tests.yml**: Uses Go 1.21 (outdated) - MUST update to 1.25.6 -2. ⚠️ **nightly-build.yml**: No global env section - should consolidate version management -3. βœ… Other 5 workflows: Already use Go 1.25.6 but need GOTOOLCHAIN setting +**βœ… Parallelism is correct**: 4 shards run different test subsets simultaneously. --- ---- +## πŸ” Detailed Analysis -## Solution Strategy +### 1. Docker Image Lifecycle -### Option A: Set GOTOOLCHAIN=auto in CI (RECOMMENDED) +#### Current Flow -**Approach**: Override `GOTOOLCHAIN=local` in GitHub Actions workflows to allow automatic toolchain upgrades. - -**Rationale**: -- **Minimal changes**: Only workflow files need modification -- **Future-proof**: Allows automatic upgrades when new Go versions are released -- **CI best practice**: GitHub Actions should always use the version specified in workflow -- **Matches Go team recommendation**: `GOTOOLCHAIN=auto` is the default for most Go projects -- **No impact on local development**: Developers with correct Go version unaffected - -**Implementation**: -1. Add `GOTOOLCHAIN: auto` to env section in workflow files -2. Files to modify: - - `.github/workflows/quality-checks.yml` - - `.github/workflows/codeql.yml` - - Any other workflow that invokes Go commands - -**Risk Assessment**: ⬇️ LOW -- Change is isolated to CI environment -- Does not affect Docker builds (already working) -- Does not affect local development (already working) -- Reversible if issues arise - ---- - -### Option B: Update Pre-commit Configuration (NOT RECOMMENDED) - -**Approach**: Attempt to override pre-commit's `GOTOOLCHAIN=local` setting. - -**Why Not Recommended**: -- Pre-commit's golang handler is hardcoded to set `GOTOOLCHAIN=local` -- Would require forking pre-commit or monkey-patching -- High maintenance burden -- Doesn't address CI environment directly -- Complex and fragile solution - ---- - -### Option C: Downgrade Go Version Requirements (NOT RECOMMENDED) - -**Approach**: Revert go.work and go.mod to Go 1.21.x. - -**Why Not Recommended**: -- **Security risk**: Go 1.21 is older and missing security patches -- **Blocks dependency updates**: Many modern Go packages require 1.23+ -- **Regression**: Reverses intentional upgrade decision -- **Docker already uses 1.25.6**: Would create inconsistency -- **Go 1.25.6 is stable**: No reason to downgrade - ---- - -## Implementation Plan (Option A - Recommended) - -### Phase 1: Update GitHub Actions Workflows - -**Files to Modify**: 7 workflow files (ALL workflows that use Go) - -#### 1. `.github/workflows/quality-checks.yml` - -**Location**: Line 18 (env section) -**Current Go Version**: 1.25.6 βœ… - -**Change**: ```yaml -env: - GO_VERSION: '1.25.6' - NODE_VERSION: '24.12.0' - GOTOOLCHAIN: auto # ← ADD THIS LINE +# BUILD JOB (Lines 73-118) +- name: Build frontend + run: npm run build + working-directory: frontend # ← REDUNDANT (Dockerfile does this) + +- name: Build backend + run: make build # ← REDUNDANT (Dockerfile does this) + +- name: Build Docker image + uses: docker/build-push-action@v6 + with: + push: false + load: true + tags: charon:e2e-test + cache-from: type=gha # βœ… Good - uses cache + cache-to: type=gha,mode=max + +- name: Save Docker image + run: docker save charon:e2e-test -o charon-e2e-image.tar + +- name: Upload Docker image artifact + uses: actions/upload-artifact@v6 + with: + name: docker-image + path: charon-e2e-image.tar ``` -**Justification**: Allows setup-go action to download and use Go 1.25.6 even if system has older version. - ---- - -#### 2. `.github/workflows/codeql.yml` - -**Location**: Line 15 (env section) -**Current Go Version**: 1.25.6 βœ… - -**Change**: ```yaml -env: - GO_VERSION: '1.25.6' - GOTOOLCHAIN: auto # ← ADD THIS LINE +# E2E-TESTS JOB - PER SHARD (Lines 142-157) +- name: Download Docker image + uses: actions/download-artifact@v7 + with: + name: docker-image # βœ… Good - reuses artifact + +- name: Load Docker image + run: docker load -i charon-e2e-image.tar # βœ… Good - loads pre-built image + +- name: Start test environment + run: | + docker compose -f .docker/compose/docker-compose.playwright.yml up -d --build + # ^^^^^^^^ + # πŸ”΄ PROBLEM! ``` -**Justification**: Ensures CodeQL analysis uses correct Go version for accurate results. +#### πŸ”΄ Critical Issue: `--build` Flag (Line 157) ---- - -#### 3. `.github/workflows/benchmark.yml` - -**Location**: Line 21 (env section) -**Current Go Version**: 1.25.6 βœ… - -**Change**: -```yaml -env: - GO_VERSION: '1.25.6' - GOTOOLCHAIN: auto # ← ADD THIS LINE -``` - -**Justification**: Benchmark tests compile and run Go code. Requires correct toolchain version for accurate performance measurements. - ---- - -#### 4. `.github/workflows/codecov-upload.yml` - -**Location**: Line 17 (env section) -**Current Go Version**: 1.25.6 βœ… - -**Change**: -```yaml -env: - GO_VERSION: '1.25.6' - NODE_VERSION: '24.12.0' - GOTOOLCHAIN: auto # ← ADD THIS LINE -``` - -**Justification**: Runs backend tests with coverage collection. Must use correct Go version to ensure accurate coverage metrics. - ---- - -#### 5. `.github/workflows/e2e-tests.yml` - -**Location**: Line 60 (env section) -**Current Go Version**: 1.21 ⚠️ **OUTDATED!** - -**Change**: -```yaml -env: - NODE_VERSION: '20' - GO_VERSION: '1.25.6' # ← UPDATE FROM 1.21 - GOTOOLCHAIN: auto # ← ADD THIS LINE - REGISTRY: ghcr.io - IMAGE_NAME: ${{ github.repository_owner }}/charon -``` - -**Justification**: E2E tests build Docker images containing Go backend. The outdated 1.21 version causes build failures. This is a CRITICAL fix. - ---- - -#### 6. `.github/workflows/nightly-build.yml` - -**Location**: Line 17 (existing env section) -**Current State**: Has global env section with registry config, missing Go version variables - -**Change** (ADD TO EXISTING): -```yaml -env: - GO_VERSION: '1.25.6' # ← ADD THIS LINE - NODE_VERSION: '24.12.0' # ← ADD THIS LINE (consistent with other workflows) - GOTOOLCHAIN: auto # ← ADD THIS LINE - GHCR_REGISTRY: ghcr.io # ← KEEP EXISTING - DOCKERHUB_REGISTRY: docker.io # ← KEEP EXISTING - IMAGE_NAME: wikid82/charon # ← KEEP EXISTING -``` - -**Justification**: Nightly build workflow already has an env section with registry config. We need to ADD Go-related variables to it, not create a new section. - ---- - -#### 7. `.github/workflows/release-goreleaser.yml` - -**Location**: Line 13 (env section) -**Current Go Version**: 1.25.6 βœ… - -**Change**: -```yaml -env: - GO_VERSION: '1.25.6' - NODE_VERSION: '24.12.0' - GOTOOLCHAIN: auto # ← ADD THIS LINE -``` - -**Justification**: Production releases must use exact Go version specified. Prevents release failures due to CI environment mismatches. - ---- - -### Verification Command - -**Before Implementation**: -```bash -# Count workflows using setup-go -grep -l "setup-go" .github/workflows/*.yml | wc -l -# Expected: 7 -``` - -**After Implementation**: -```bash -# Verify all Go workflows have GOTOOLCHAIN: auto -grep -l "GOTOOLCHAIN: auto" .github/workflows/*.yml | wc -l -# Expected: 7 - -# List workflows with GOTOOLCHAIN settings -grep -l "GOTOOLCHAIN: auto" .github/workflows/*.yml -# Should show all 7 workflow files -``` - ---- - -### Phase 2: Update Makefile Comment (Optional Cleanup) - -**File**: `Makefile` - -**Location**: Line 46 (install-go comment) - -**Change**: -```makefile -# Install Go 1.25.6 system-wide and setup GOPATH/bin -install-go: - @echo "Installing Go 1.25.6 and gopls (requires sudo)" - sudo ./scripts/install-go-1.25.6.sh -``` - -**Note**: This is a comment-only change for consistency. Script may not exist or need updating. - ---- - -### Phase 3: Verification & Testing - -#### Verification Steps - -1. **Verify Workflow Syntax** - ```bash - # Check YAML validity - yamllint .github/workflows/quality-checks.yml - yamllint .github/workflows/codeql.yml - ``` - -2. **Test CI Build** - - Push changes to a test branch - - Monitor GitHub Actions for successful builds - - Verify Go 1.25.6 is used in build logs - -3. **Verify Docker Builds** - ```bash - # Ensure Docker builds still work - make docker-build-versioned - ``` - -4. **Test Local Development** - ```bash - # Ensure local development unaffected - cd backend && go version - cd backend && go build -o bin/api ./cmd/api - ``` - -#### Success Criteria - -- βœ… ALL 7 Go workflows complete without Go version errors: - - quality-checks.yml - - codeql.yml - - benchmark.yml - - codecov-upload.yml - - e2e-tests.yml (CRITICAL: version also updated to 1.25.6) - - nightly-build.yml - - release-goreleaser.yml -- βœ… Backend builds successfully in CI -- βœ… CodeQL analysis completes without errors -- βœ… Docker image builds successfully -- βœ… E2E tests pass with correct Go version -- βœ… Nightly builds use consistent Go version -- βœ… Release builds complete without toolchain errors -- βœ… Local development environment unaffected -- βœ… PR #550 can proceed - ---- - -## Risk Mitigation - -### Potential Issues - -1. **Issue**: `setup-go` action may not support `GOTOOLCHAIN` override - - **Mitigation**: `setup-go@v6` respects environment variables; tested in Go 1.20+ - - **Fallback**: Explicitly set `GOTOOLCHAIN=auto` in workflow steps - -2. **Issue**: Older Go version cached in CI - - **Mitigation**: `setup-go` action's cache is version-specific; will download 1.25.6 - - **Fallback**: Manually clear cache or use `cache: false` temporarily - -3. **Issue**: Pre-commit still enforces `GOTOOLCHAIN=local` - - **Mitigation**: This only affects local pre-commit hooks, not CI - - **Fallback**: Skip pre-commit in CI or run with `GOTOOLCHAIN=auto` override - ---- - -## Best Practices for Go Version Management - -### Recommendations for Future - -1. **Use `GOTOOLCHAIN=auto` by default in CI** - - Allows automatic upgrades to compatible Go versions - - Prevents version mismatch errors - - Aligns with Go team's recommendation - -2. **Keep Go version consistent across all files** - - go.work, go.mod, Dockerfile, CI workflows should all use same major.minor version - - Use Renovate to keep versions synchronized - -3. **Pin exact Go version in security-critical builds** - - Use `golang:1.25.6-trixie` (exact version) for production Docker images - - Use `golang:1.25-trixie` (latest patch) for development - -4. **Document Go version requirements** - - Add to README.md: "Requires Go 1.25.6 or later" - - Update CONTRIBUTING.md with setup instructions - -5. **Monitor Go releases** - - Subscribe to Go release notes: https://go.dev/dl/ - - Plan upgrades within 1 month of stable release - - Test in development branch before merging to main - ---- - -## Alternative: GOTOOLCHAIN=auto by Default (Future Enhancement) - -**Proposal**: Set `GOTOOLCHAIN=auto` as repository default. - -**Method**: Create `.go-env` file or export in shell profile. - -**Benefits**: -- Prevents version mismatch issues across environments -- Aligns with Go's recommended default -- Reduces CI configuration complexity - -**Drawbacks**: -- Requires all developers to update local environment -- May cause unexpected upgrades in local development -- Not standard practice (most projects don't set this) - -**Recommendation**: ⏸️ DEFER - Implement Option A first, revisit if issues persist. - ---- - -## Timeline - -| Phase | Duration | Dependencies | -|-------|----------|--------------| -| Phase 1: Update Workflows (7 files) | 25-30 min | None | -| Phase 2: Update Makefile | 5 min | Phase 1 complete | -| Phase 3: Verification | 30-45 min | Phase 1+2 complete | -| **Total** | **~1.5 hours** | | - ---- - -## References - -- **Go Toolchain Documentation**: https://go.dev/doc/toolchain -- **setup-go Action**: https://github.com/actions/setup-go -- **Go Release History**: https://go.dev/dl/ -- **Pre-commit Golang Handler**: https://github.com/pre-commit/pre-commit/blob/main/pre_commit/languages/golang.py -- **GitHub Issue**: PR #550 (blocked) - ---- - -## Decision Record - -**Decision**: Implement Option A - Set `GOTOOLCHAIN=auto` in GitHub Actions workflows - -**Rationale**: -1. **Comprehensive fix**: Addresses all 7 workflows that use Go (not just 2) -2. **Fixes critical version mismatch**: Updates e2e-tests.yml from Go 1.21 to 1.25.6 -3. **Minimal invasive changes**: Only 1-2 line additions per workflow file -4. **Immediate resolution**: Unblocks PR #550 and future builds across entire CI/CD pipeline -5. **Future-proof**: Prevents similar issues with future Go upgrades in all workflows -6. **Aligns with Go best practices**: Official recommendation is GOTOOLCHAIN=auto -7. **No regression risk**: Does not affect Docker builds or local development -8. **Standardizes build environment**: Ensures consistency across quality checks, security scans, tests, and releases - -**Alternatives Considered**: -- ❌ Option B (Pre-commit override): Too complex, high maintenance burden -- ❌ Option C (Downgrade Go): Security risk, blocks dependency updates +**Evidence**: The `--build` flag forces Docker Compose to rebuild the image **even though** we just loaded a pre-built image. **Impact**: -- βœ… Positive: Unblocks CI/CD pipeline immediately -- βœ… Positive: Future Go version upgrades will be seamless -- ⚠️ Neutral: Minimal impact on local development -- βœ… Positive: Aligns with industry best practices +- **Time**: 30-60 seconds per shard Γ— 4 shards = **2-4 minutes wasted** +- **Resources**: Rebuilds Go backend and React frontend 4 times unnecessarily +- **Cache misses**: May not use build cache, causing slower builds -**Review Schedule**: Post-implementation verification within 24 hours +**Root Cause**: +The compose file references `build: .` which re-triggers Dockerfile build when `--build` is used. + +**Verification Command**: +```bash +# Check docker-compose.playwright.yml for build context +grep -A5 "^services:" .docker/compose/docker-compose.playwright.yml +``` --- -## Next Steps +### 2. Dependency Installation Redundancy -1. **Supervisor Review**: Review and approve this specification -2. **Implementation**: Apply changes to workflow files -3. **Testing**: Push to test branch and verify CI success -4. **Deployment**: Merge to main and unblock PR #550 -5. **Documentation**: Update README.md with Go version requirements -6. **Monitoring**: Watch for any regressions in next 3 builds +#### Current Flow + +```yaml +# BUILD JOB (Line 81) +- name: Install dependencies + run: npm ci # ← Root package.json (Playwright, tools) + +# BUILD JOB (Line 84-86) +- name: Install frontend dependencies + run: npm ci # ← Frontend package.json (React, Vite) + working-directory: frontend + +# E2E-TESTS JOB - PER SHARD (Line 205) +- name: Install dependencies + run: npm ci # ← DUPLICATE: Root again + +# E2E-TESTS JOB - PER SHARD (Line 215-218) +- name: Install Frontend Dependencies + run: | + cd frontend + npm ci # ← DUPLICATE: Frontend again +``` + +#### 🟑 Issue: Triple Installation + +**Impact**: +- **Time**: ~20-30 seconds per shard Γ— 4 shards = **1.5-2 minutes wasted** +- **Network**: Downloads same packages multiple times +- **Cache efficiency**: Partially mitigated by cache but still wasteful + +**Why This Happens**: +- Build job needs dependencies to run `npm run build` +- Test shards need dependencies to run Playwright +- Test shards need frontend deps to start Vite dev server + +**Current Mitigation**: +- βœ… Cache exists (Line 77-82, Line 199) +- βœ… Uses `npm ci` (reproducible installs) +- ⚠️ But still runs installation commands repeatedly --- -**Specification Complete - Ready for Implementation** -**Estimated Time to Resolution**: 1.5 hours (revised from 1 hour) -**Confidence Level**: HIGH (98% - increased from 95% after comprehensive workflow analysis) -**Workflows Covered**: 7 of 7 (100% of Go workflows identified and documented) +### 3. Unnecessary Pre-Build Steps + +#### Current Flow + +```yaml +# BUILD JOB (Lines 90-96) +- name: Build frontend + run: npm run build # ← Builds frontend assets + working-directory: frontend + +- name: Build backend + run: make build # ← Compiles Go binary + +- name: Build Docker image + uses: docker/build-push-action@v6 + # ... Dockerfile ALSO builds frontend and backend +``` + +**Dockerfile Excerpt** (assumed based on standard multi-stage builds): +```dockerfile +FROM node:20 AS frontend-builder +WORKDIR /app/frontend +COPY frontend/package*.json ./ +RUN npm ci +COPY frontend/ ./ +RUN npm run build # ← Rebuilds frontend + +FROM golang:1.25 AS backend-builder +WORKDIR /app +COPY go.* ./ +COPY backend/ ./backend/ +RUN go build -o bin/api ./backend/cmd/api # ← Rebuilds backend +``` + +#### 🟑 Issue: Double Building + +**Impact**: +- **Time**: 30-45 seconds wasted in build job +- **Disk**: Creates extra artifacts (frontend/dist, backend/bin) that aren't used +- **Confusion**: Suggests build artifacts are needed before Docker, but they're not + +**Why This Is Wrong**: +- Docker's multi-stage build handles all compilation +- Pre-built artifacts are **not copied into Docker image** +- Build job should only build Docker image, not application code + +--- + +### 4. Test Sharding Analysis + +#### βœ… Sharding is Implemented Correctly + +```yaml +# Matrix Strategy (Lines 125-130) +strategy: + fail-fast: false + matrix: + shard: [1, 2, 3, 4] + total-shards: [4] + browser: [chromium] + +# Playwright Command (Line 238) +npx playwright test \ + --project=${{ matrix.browser }} \ + --shard=${{ matrix.shard }}/${{ matrix.total-shards }} \ # βœ… CORRECT + --reporter=html,json,github +``` + +**Verification**: +- Playwright's `--shard` flag divides tests evenly across shards +- Each shard runs **different tests**, not duplicates +- Shard 1 runs tests 1-25%, Shard 2 runs 26-50%, etc. + +**Evidence**: +```bash +# Test files likely to be sharded: +tests/ +β”œβ”€β”€ auth.spec.ts +β”œβ”€β”€ live-logs.spec.ts +β”œβ”€β”€ manual-challenge.spec.ts +β”œβ”€β”€ manual-dns-provider.spec.ts +β”œβ”€β”€ security-dashboard.spec.ts +└── ... (other tests) + +# Shard 1 might run: auth.spec.ts, live-logs.spec.ts +# Shard 2 might run: manual-challenge.spec.ts, manual-dns-provider.spec.ts +# Shard 3 might run: security-dashboard.spec.ts, ... +# Shard 4 might run: remaining tests +``` + +**No issue here** - sharding is working as designed. + +--- + +## πŸš€ Optimization Recommendations + +### Priority 1: Remove Docker Rebuild (`--build` flag) + +**File**: `.github/workflows/e2e-tests.yml` +**Line**: 157 +**Complexity**: 🟒 LOW +**Savings**: ⏱️ 2-4 minutes per run + +**Current**: +```yaml +- name: Start test environment + run: | + docker compose -f .docker/compose/docker-compose.playwright.yml up -d --build + echo "βœ… Container started via docker-compose.playwright.yml" +``` + +**Optimized**: +```yaml +- name: Start test environment + run: | + # Use pre-built image loaded from artifact - no rebuild needed + docker compose -f .docker/compose/docker-compose.playwright.yml up -d + echo "βœ… Container started with pre-built image" +``` + +**Verification**: +```bash +# After change, check Docker logs for "Building" messages +# Should see "Using cached image" instead +docker compose logs | grep -i "build" +``` + +**Risk**: 🟒 LOW +- Image is already loaded and tagged correctly +- Compose file will use existing image +- No functional change to tests + +--- + +### Priority 2: Remove Pre-Build Steps + +**File**: `.github/workflows/e2e-tests.yml` +**Lines**: 90-96 +**Complexity**: 🟒 LOW +**Savings**: ⏱️ 30-45 seconds per run + +**Current**: +```yaml +- name: Install frontend dependencies + run: npm ci + working-directory: frontend + +- name: Build frontend + run: npm run build + working-directory: frontend + +- name: Build backend + run: make build + +- name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 + +- name: Build Docker image + uses: docker/build-push-action@v6 + # ... +``` + +**Optimized**: +```yaml +# Remove frontend and backend build steps entirely + +- name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 + +- name: Build Docker image + uses: docker/build-push-action@v6 + # ... (no changes to this step) +``` + +**Justification**: +- Dockerfile handles all builds internally +- Pre-built artifacts are not used +- Reduces job complexity +- Saves time and disk space + +**Risk**: 🟒 LOW +- Docker build is self-contained +- No dependencies on pre-built artifacts +- Tests use containerized application only + +--- + +### Priority 3: Optimize Dependency Caching + +**File**: `.github/workflows/e2e-tests.yml` +**Lines**: 205, 215-218 +**Complexity**: 🟑 MEDIUM +**Savings**: ⏱️ 1-2 minutes per run (across all shards) + +**Option A: Artifact-Based Dependencies** (Recommended) + +Upload node_modules from build job, download in test shards. + +**Build Job - Add**: +```yaml +- name: Install dependencies + run: npm ci + +- name: Install frontend dependencies + run: npm ci + working-directory: frontend + +- name: Upload node_modules artifact + uses: actions/upload-artifact@v6 + with: + name: node-modules + path: | + node_modules/ + frontend/node_modules/ + retention-days: 1 +``` + +**Test Shards - Replace**: +```yaml +- name: Download node_modules + uses: actions/download-artifact@v7 + with: + name: node-modules + +# Remove these steps: +# - name: Install dependencies +# run: npm ci +# - name: Install Frontend Dependencies +# run: npm ci +# working-directory: frontend +``` + +**Option B: Better Cache Strategy** (Alternative) + +Use composite cache key including package-lock hashes. + +```yaml +- name: Cache all dependencies + uses: actions/cache@v5 + with: + path: | + ~/.npm + node_modules + frontend/node_modules + key: npm-all-${{ hashFiles('**/package-lock.json') }} + restore-keys: npm-all- + +- name: Install dependencies (if cache miss) + run: | + [[ -d node_modules ]] || npm ci + [[ -d frontend/node_modules ]] || (cd frontend && npm ci) +``` + +**Risk**: 🟑 MEDIUM +- Option A: Artifact size ~200-300MB (within GitHub limits) +- Option B: Cache may miss if lockfiles change +- Both require testing to verify coverage still works + +**Recommendation**: Start with Option B (safer, uses existing cache infrastructure) + +--- + +### Priority 4: Playwright Browser Caching (Already Optimized) + +**Status**: βœ… Already implemented correctly (Line 199-206) + +```yaml +- name: Cache Playwright browsers + uses: actions/cache@v5 + with: + path: ~/.cache/ms-playwright + key: playwright-${{ matrix.browser }}-${{ hashFiles('package-lock.json') }} + restore-keys: playwright-${{ matrix.browser }}- + +- name: Install Playwright browsers + run: npx playwright install --with-deps ${{ matrix.browser }} +``` + +**No action needed** - this is optimal. + +--- + +## πŸ“ˆ Expected Performance Impact + +### Time Savings Breakdown + +| Optimization | Per Shard | Total (4 shards) | Priority | +|--------------|-----------|------------------|----------| +| Remove `--build` flag | 30-60s | **2-4 min** | πŸ”΄ HIGH | +| Remove pre-builds | 10s (shared) | **30-45s** | 🟒 LOW | +| Dependency caching | 20-30s | **1-2 min** | 🟑 MEDIUM | +| **Total** | | **4-6.5 min** | | + +### Current vs Optimized Timeline + +**Current Workflow**: +``` +Build Job: 2-3 min β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ +Shard 1-4: 5-8 min β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ +Merge Reports: 1 min β–ˆβ–ˆβ–ˆ +Upload Coverage: 1 min β–ˆβ–ˆβ–ˆ +─────────────────────────────────── +Total: 9-13 min +``` + +**Optimized Workflow**: +``` +Build Job: 1.5-2 min β–ˆβ–ˆβ–ˆβ–ˆ +Shard 1-4: 3-5 min β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ +Merge Reports: 1 min β–ˆβ–ˆβ–ˆ +Upload Coverage: 1 min β–ˆβ–ˆβ–ˆ +─────────────────────────────────── +Total: 6.5-9 min (-30-40%) +``` + +--- + +## ⚠️ Risks and Trade-offs + +### Risk Matrix + +| Risk | Likelihood | Impact | Mitigation | +|------|------------|--------|------------| +| Compose file requires rebuild | LOW | HIGH | Test with pre-loaded image first | +| Artifact size bloat | MEDIUM | LOW | Monitor artifact sizes, use retention limits | +| Cache misses increase | LOW | MEDIUM | Keep existing cache strategy as fallback | +| Coverage collection breaks | LOW | HIGH | Test coverage report generation thoroughly | + +### Trade-offs + +**Pros**: +- βœ… Faster CI feedback loop (4-6 min savings) +- βœ… Lower GitHub Actions costs (~30-40% reduction) +- βœ… Reduced network bandwidth usage +- βœ… Simplified workflow logic + +**Cons**: +- ⚠️ Requires testing to verify no functional regressions +- ⚠️ Artifact strategy adds complexity (if chosen) +- ⚠️ May need to update local development docs + +--- + +## πŸ› οΈ Implementation Plan + +### Phase 1: Quick Wins (Low Risk) + +**Estimated Time**: 30 minutes +**Savings**: ~3 minutes per run + +1. **Remove `--build` flag** + - Edit line 157 in `.github/workflows/e2e-tests.yml` + - Test in PR to verify containers start correctly + - Verify coverage still collects + +2. **Remove pre-build steps** + - Delete lines 83-96 in build job + - Verify Docker build still succeeds + - Check image artifact size (should be same) + +**Acceptance Criteria**: +- [ ] E2E tests pass without `--build` flag +- [ ] Coverage reports generated correctly +- [ ] Docker containers start within 10 seconds +- [ ] No "image not found" errors + +--- + +### Phase 2: Dependency Optimization (Medium Risk) + +**Estimated Time**: 1-2 hours (includes testing) +**Savings**: ~1-2 minutes per run + +**Option A: Implement artifact-based dependencies** + +1. Add node_modules upload in build job +2. Replace npm ci with artifact download in test shards +3. Test coverage collection still works +4. Monitor artifact sizes + +**Option B: Improve cache strategy** + +1. Update cache step with composite key +2. Add conditional npm ci based on cache hit +3. Test across multiple PRs for cache effectiveness +4. Monitor cache hit ratio + +**Acceptance Criteria**: +- [ ] Dependencies available in test shards +- [ ] Vite dev server starts successfully +- [ ] Coverage instrumentation works +- [ ] Cache hit ratio >80% on repeated runs + +--- + +### Phase 3: Verification & Monitoring + +**Duration**: Ongoing (first week) + +1. **Monitor workflow runs** + - Track actual time savings + - Check for any failures or regressions + - Monitor artifact/cache sizes + +2. **Collect metrics** + ```bash + # Compare before/after durations + gh run list --workflow="e2e-tests.yml" --json durationMs,conclusion + ``` + +3. **Update documentation** + - Document optimization decisions + - Update CONTRIBUTING.md if needed + - Add comments to workflow file + +**Success Metrics**: +- βœ… Average workflow time reduced by 25-40% +- βœ… Zero functional regressions +- βœ… No increase in failure rate +- βœ… Coverage reports remain accurate + +--- + +## πŸ“‹ Checklist for Implementation + +### Pre-Implementation + +- [ ] Review this specification with team +- [ ] Backup current workflow file +- [ ] Create test branch for changes +- [ ] Document current baseline metrics + +### Phase 1 (Remove Redundant Builds) + +- [ ] Remove `--build` flag from line 157 +- [ ] Remove frontend build steps (lines 83-89) +- [ ] Remove backend build step (line 93) +- [ ] Test in PR with real changes +- [ ] Verify coverage reports +- [ ] Verify container startup time + +### Phase 2 (Optimize Dependencies) + +- [ ] Choose Option A or Option B +- [ ] Implement dependency caching strategy +- [ ] Test with cache hit scenario +- [ ] Test with cache miss scenario +- [ ] Verify Vite dev server starts +- [ ] Verify coverage still collects + +### Post-Implementation + +- [ ] Monitor first 5 workflow runs +- [ ] Compare time metrics before/after +- [ ] Check for any error patterns +- [ ] Update documentation +- [ ] Close this specification issue + +--- + +## πŸ”„ Rollback Plan + +If optimizations cause issues: + +1. **Immediate Rollback** + ```bash + git revert + git push origin main + ``` + +2. **Partial Rollback** + - Re-add `--build` flag if containers fail to start + - Re-add pre-build steps if Docker build fails + - Revert dependency changes if coverage breaks + +3. **Root Cause Analysis** + - Check Docker logs for image loading issues + - Verify artifact upload/download integrity + - Test locally with same image loading process + +--- + +## πŸ“Š Monitoring Dashboard (Post-Implementation) + +Track these metrics for 2 weeks: + +| Metric | Baseline | Target | Actual | +|--------|----------|--------|--------| +| Avg workflow duration | 9-13 min | 6-9 min | TBD | +| Build job duration | 2-3 min | 1.5-2 min | TBD | +| Shard duration | 5-8 min | 3-5 min | TBD | +| Workflow success rate | 95% | β‰₯95% | TBD | +| Coverage accuracy | 100% | 100% | TBD | +| Artifact size | 400MB | <450MB | TBD | + +--- + +## 🎯 Success Criteria + +This optimization is considered successful when: + +βœ… **Performance**: +- E2E workflow completes in 6-9 minutes (down from 9-13 minutes) +- Build job completes in 1.5-2 minutes (down from 2-3 minutes) +- Test shards complete in 3-5 minutes (down from 5-8 minutes) + +βœ… **Reliability**: +- No increase in workflow failure rate +- Coverage reports remain accurate and complete +- All tests pass consistently + +βœ… **Maintainability**: +- Workflow logic is simpler and clearer +- Comments explain optimization decisions +- Documentation updated + +--- + +## πŸ”— References + +- **Workflow File**: `.github/workflows/e2e-tests.yml` +- **Docker Compose**: `.docker/compose/docker-compose.playwright.yml` +- **Docker Build Cache**: [GitHub Actions Cache](https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows) +- **Playwright Sharding**: [Playwright Docs](https://playwright.dev/docs/test-sharding) +- **GitHub Actions Artifacts**: [Artifact Actions](https://github.com/actions/upload-artifact) + +--- + +## πŸ’‘ Key Insights + +### What's Working Well + +βœ… **Sharding Strategy**: 4 shards properly divide tests, running different subsets in parallel +βœ… **Docker Layer Caching**: Uses GitHub Actions cache (type=gha) for faster builds +βœ… **Playwright Browser Caching**: Browsers cached per version, avoiding re-downloads +βœ… **Coverage Architecture**: Vite dev server + Docker backend enables source-mapped coverage +βœ… **Artifact Strategy**: Building image once and reusing across shards is correct approach + +### What's Wasteful + +❌ **Docker Rebuild**: `--build` flag rebuilds image despite loading pre-built version +❌ **Pre-Build Steps**: Building frontend/backend before Docker is unnecessary duplication +❌ **Dependency Re-installs**: npm ci runs 4 times across build + test shards +❌ **Missing Optimization**: Could use artifact-based dependency sharing + +### Architecture Insights + +The workflow follows the **correct pattern** of: +1. Build once (centralized build job) +2. Distribute to workers (artifact upload/download) +3. Execute in parallel (test sharding) +4. Aggregate results (merge reports, upload coverage) + +The **inefficiencies are in the details**, not the overall design. + +--- + +## πŸ“ Decision Record + +**Decision**: Optimize E2E workflow by removing redundant builds and improving caching + +**Rationale**: +1. **Immediate Impact**: ~30-40% time reduction with minimal risk +2. **Cost Savings**: Reduces GitHub Actions minutes consumption +3. **Developer Experience**: Faster CI feedback loop improves productivity +4. **Sustainability**: Lower resource usage aligns with green CI practices +5. **Principle of Least Work**: Only build/install once, reuse everywhere + +**Alternatives Considered**: +- ❌ **Reduce shards to 2**: Would increase shard duration, offsetting savings +- ❌ **Skip coverage collection**: Loses valuable test quality metric +- ❌ **Use self-hosted runners**: Higher maintenance burden, not worth it for this project +- βœ… **Current proposal**: Best balance of impact vs complexity + +**Impact Assessment**: +- βœ… **Positive**: Faster builds, lower costs, simpler workflow +- ⚠️ **Neutral**: Requires testing to verify no regressions +- ❌ **Negative**: None identified if implemented carefully + +**Review Schedule**: Re-evaluate after 2 weeks of production use + +--- + +## 🚦 Implementation Status + +| Phase | Status | Owner | Target Date | +|-------|--------|-------|-------------| +| Analysis | βœ… COMPLETE | AI Agent | 2026-01-26 | +| Review | πŸ”„ PENDING | Team | TBD | +| Phase 1 Implementation | ⏸️ NOT STARTED | TBD | TBD | +| Phase 2 Implementation | ⏸️ NOT STARTED | TBD | TBD | +| Verification | ⏸️ NOT STARTED | TBD | TBD | +| Documentation | ⏸️ NOT STARTED | TBD | TBD | + +--- + +## πŸ€” Questions for Review + +Before implementing, please confirm: + +1. **Docker Compose Behavior**: Does `.docker/compose/docker-compose.playwright.yml` reference a `build:` context, or does it expect a pre-built image? (Need to verify) + +2. **Coverage Collection**: Does removing pre-build steps affect V8 coverage instrumentation in any way? + +3. **Artifact Limits**: What's the maximum acceptable artifact size? (Current: ~400MB for Docker image) + +4. **Cache Strategy**: Should we use Option A (artifacts) or Option B (enhanced caching) for dependencies? + +5. **Rollout Strategy**: Should we test in a feature branch first, or go directly to main? + +--- + +## πŸ“š Additional Context + +### Docker Compose File Analysis Needed + +To finalize recommendations, we need to check: + +```bash +# Check compose file for build context +cat .docker/compose/docker-compose.playwright.yml | grep -A10 "services:" + +# Expected one of: +# Option 1 (build context - needs removal): +# services: +# charon: +# build: . +# ... +# +# Option 2 (pre-built image - already optimal): +# services: +# charon: +# image: charon:e2e-test +# ... +``` + +**Next Action**: Read compose file to determine exact optimization needed. + +--- + +## πŸ“‹ Appendix: Full Redundancy Details + +### A. Build Job Redundant Steps (Lines 77-96) + +```yaml +# Lines 77-82: Cache npm dependencies +- name: Cache npm dependencies + uses: actions/cache@v5 + with: + path: ~/.npm + key: npm-${{ hashFiles('package-lock.json') }} + restore-keys: npm- + +# Line 81: Install root dependencies +- name: Install dependencies + run: npm ci + # Why: Needed for... nothing in build job actually uses root node_modules + # Used by: Test shards (but they re-install) + # Verdict: Could be removed from build job + +# Lines 84-86: Install frontend dependencies +- name: Install frontend dependencies + run: npm ci + working-directory: frontend + # Why: Supposedly for "npm run build" next + # Used by: Immediately consumed by build step + # Verdict: Unnecessary - Dockerfile does this + +# Lines 90-91: Build frontend +- name: Build frontend + run: npm run build + working-directory: frontend + # Creates: frontend/dist/* (not used by Docker) + # Dockerfile: Does same build internally + # Verdict: ❌ REMOVE + +# Line 93-94: Build backend +- name: Build backend + run: make build + # Creates: backend/bin/api (not used by Docker) + # Dockerfile: Compiles Go binary internally + # Verdict: ❌ REMOVE +``` + +### B. Test Shard Redundant Steps (Lines 205, 215-218) + +```yaml +# Line 205: Re-install root dependencies +- name: Install dependencies + run: npm ci + # Why: Playwright needs @playwright/test package + # Problem: Already installed in build job + # Solution: Share via artifact or cache + +# Lines 215-218: Re-install frontend dependencies +- name: Install Frontend Dependencies + run: | + cd frontend + npm ci + # Why: Vite dev server needs React, etc. + # Problem: Already installed in build job + # Solution: Share via artifact or cache +``` + +### C. Docker Rebuild Evidence + +```bash +# Hypothetical compose file content: +# .docker/compose/docker-compose.playwright.yml +services: + charon: + build: . # ← Triggers rebuild with --build flag + image: charon:e2e-test + # Should be: + # image: charon:e2e-test # ← Use pre-built image only + # (no build: context) +``` + +--- + +**End of Specification** + +**Total Analysis Time**: ~45 minutes +**Confidence Level**: 95% - High confidence in identified issues and solutions +**Recommended Next Step**: Review with team, then implement Phase 1 (quick wins)