Docker Optimization Phase 4: E2E Tests Migration - Complete

Date: February 4, 2026
Phase: Phase 4 - E2E Workflow Migration
Status: Complete
Related Spec: docs/plans/current_spec.md

Overview

The E2E tests workflow (.github/workflows/e2e-tests.yml) now uses the registry image published by docker-build.yml instead of building its own image, implementing the "Build Once, Test Many" architecture.

What Changed

1. Workflow Trigger Update

Before:

on:
  pull_request:
    branches: [main, development, 'feature/**']
    paths: [...]
  workflow_dispatch:

After:

on:
  workflow_run:
    workflows: ["Docker Build, Publish & Test"]
    types: [completed]
    branches: [main, development, 'feature/**']  # Explicit branch filter
  workflow_dispatch:
    inputs:
      image_tag: ...  # Allow manual image selection

Benefits:

  • E2E tests now trigger automatically after docker-build.yml completes
  • Explicit branch filters prevent unexpected triggers
  • Manual dispatch allows testing specific image tags
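A workflow_run trigger with types: [completed] fires whatever the upstream conclusion was, so the E2E job still needs a check that the build succeeded. The source does not show how the workflow does this. The following is a minimal sketch of such a guard, assuming the event name and github.event.workflow_run.conclusion are passed in as arguments; the function name should_run_e2e is hypothetical.

```shell
# Hypothetical guard: decide whether the E2E job should proceed.
# Arguments (assumed to be supplied from the workflow context):
#   $1 = github.event_name
#   $2 = github.event.workflow_run.conclusion (empty for manual dispatch)
should_run_e2e() {
  local event_name="$1" conclusion="${2:-}"
  case "$event_name" in
    workflow_dispatch) return 0 ;;                  # manual runs always proceed
    workflow_run) [ "$conclusion" = "success" ] ;;  # only after a successful build
    *) return 1 ;;                                  # any other trigger is unexpected
  esac
}
```

In a workflow the same condition is often written as a job-level if: expression instead; the shell form is shown only to make the logic explicit.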

2. Concurrency Group Update

Before:

concurrency:
  group: e2e-${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

After:

concurrency:
  group: e2e-${{ github.workflow }}-${{ github.event.workflow_run.head_branch || github.ref }}-${{ github.event.workflow_run.head_sha || github.sha }}
  cancel-in-progress: true

Benefits:

  • Prevents race conditions when PR is updated mid-test
  • Uses both branch and SHA for unique grouping
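  • Cancels a superseded run only when it shares the same branch and SHA; because the SHA is part of the group key, runs for older commits land in a different group and are not cancelled by cancel-in-progress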
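To see what the new group expression resolves to, the sketch below mimics the `a || b` fallback with the shell's `${a:-b}` expansion. It is an illustration only: the function name concurrency_group and the sample workflow name are hypothetical.

```shell
# Build the concurrency group key the way the workflow expression does.
#   $1 = github.workflow
#   $2 = github.event.workflow_run.head_branch (empty outside workflow_run)
#   $3 = github.ref
#   $4 = github.event.workflow_run.head_sha (empty outside workflow_run)
#   $5 = github.sha
concurrency_group() {
  local workflow="$1" head_branch="$2" ref="$3" head_sha="$4" sha="$5"
  # ${a:-b} picks b when a is empty, like `a || b` in the expression
  printf 'e2e-%s-%s-%s\n' "$workflow" "${head_branch:-$ref}" "${head_sha:-$sha}"
}
```

Two commits on the same branch produce two different keys, so cancel-in-progress applies only between runs whose branch and SHA both match.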

3. Removed Redundant Build Job

Before:

  • Dedicated build job (65 lines of code)
  • Builds Docker image from scratch (~10 minutes)
  • Uploads artifact for test jobs

After:

  • Removed entire build job
  • Tests pull from registry instead
  • Time saved: ~10 minutes per workflow run

4. Added Image Tag Determination

New step added to e2e-tests job:

- name: Determine image tag
  id: image
  run: |
    # For PRs: pr-{number}-{sha}
    # For branches: {sanitized-branch}-{sha}
    # For manual: user-provided tag

Features:

  • Extracts PR number from workflow_run context
  • Sanitizes branch names for Docker tag compatibility
  • Handles manual trigger with custom image tags
  • Appends short SHA for immutability
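The step body above is shown only as comments. One way the three cases could be implemented is sketched below. The function names, the 7-character short SHA, and the exact sanitization rules are assumptions for illustration; the real step may differ.

```shell
# Replace characters that are not valid in a Docker tag with '-',
# strip any leading '.' or '-', and cap the length so that the
# "-<short sha>" suffix still fits within the 128-character tag limit.
sanitize_branch() {
  printf '%s' "$1" \
    | sed -e 's/[^A-Za-z0-9._-]/-/g' -e 's/^[.-]*//' \
    | cut -c1-120
}

# Pick the image tag.
#   $1 = manual tag (workflow_dispatch input, may be empty)
#   $2 = PR number (may be empty)
#   $3 = branch name
#   $4 = full commit SHA
determine_image_tag() {
  local manual_tag="$1" pr_number="$2" branch="$3" sha="$4"
  local short_sha
  short_sha="$(printf '%s' "$sha" | cut -c1-7)"  # assumed short-SHA length
  if [ -n "$manual_tag" ]; then
    printf '%s\n' "$manual_tag"                   # manual: user-provided tag
  elif [ -n "$pr_number" ]; then
    printf 'pr-%s-%s\n' "$pr_number" "$short_sha" # PR: pr-{number}-{sha}
  else
    printf '%s-%s\n' "$(sanitize_branch "$branch")" "$short_sha"
  fi
}
```

For example, `determine_image_tag "" "" "feature/login" "$SHA"` replaces the slash so the branch name becomes a valid tag component.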

5. Dual-Source Image Retrieval Strategy

Registry Pull (Primary):

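- name: Pull Docker image from registry
  id: pull_image  # referenced by the fallback step's condition
  continue-on-error: true  # assumed; lets the fallback step run after a failed pull
  uses: nick-fields/retry@v3
  with:
    timeout_minutes: 5
    max_attempts: 3
    retry_wait_seconds: 10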

Artifact Fallback (Secondary):

- name: Fallback to artifact download
  if: steps.pull_image.outcome == 'failure'
  run: |
    gh run download ... --name pr-image-${PR_NUM}
    docker load < /tmp/docker-image/charon-image.tar

Benefits:

  • Retry logic handles transient network failures
  • Fallback ensures robustness
  • Source logged for troubleshooting
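The retry-then-fallback flow can also be expressed in plain shell, which makes the control flow easier to follow. This is a sketch of the same idea, not the workflow's actual code: retry, fetch_image, and the RETRY_WAIT_SECONDS variable are hypothetical names, and the pull and fallback commands are passed in by the caller.

```shell
# Run a command up to $1 times, sleeping $2 seconds between failed attempts.
retry() {
  local max_attempts="$1" wait_seconds="$2" attempt=1
  shift 2
  while true; do
    "$@" && return 0
    [ "$attempt" -ge "$max_attempts" ] && return 1
    echo "Attempt ${attempt}/${max_attempts} failed; retrying in ${wait_seconds}s" >&2
    sleep "$wait_seconds"
    attempt=$((attempt + 1))
  done
}

# Try the primary source with retries, then the fallback once.
#   $1 = name of the command or function that pulls from the registry
#   $2 = name of the command or function that loads the artifact
# Prints the source that supplied the image so it ends up in the job log.
fetch_image() {
  local pull_fn="$1" fallback_fn="$2"
  if retry 3 "${RETRY_WAIT_SECONDS:-10}" "$pull_fn"; then
    echo "registry"
  elif "$fallback_fn"; then
    echo "artifact"
  else
    echo "none"
    return 1
  fi
}
```

The wait between attempts is fixed, mirroring retry_wait_seconds: 10 in the step above.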

6. Image Freshness Validation

New validation step:

- name: Validate image SHA
  run: |
    LABEL_SHA=$(docker inspect charon:e2e-test --format '{{index .Config.Labels "org.opencontainers.image.revision"}}')
    # Compare with expected SHA

Benefits:

  • Detects stale images
  • Prevents testing wrong code
  • Warns but doesn't block (allows artifact source)
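The comparison itself is elided in the step above. A minimal sketch of the "warn but don't block" check follows, assuming the label holds the full commit SHA and that the expected SHA is passed in by the caller; validate_image_sha is a hypothetical name.

```shell
# Compare the revision label read from the image with the commit under test.
# Emits a GitHub Actions warning annotation on mismatch but always returns 0,
# so a stale or unlabeled image is reported without failing the job.
#   $1 = value of the org.opencontainers.image.revision label (may be empty)
#   $2 = expected commit SHA
validate_image_sha() {
  local label_sha="$1" expected_sha="$2"
  if [ -z "$label_sha" ]; then
    echo "::warning::Image has no revision label; cannot verify freshness"
  elif [ "$label_sha" = "$expected_sha" ]; then
    echo "Image SHA matches expected commit ${expected_sha}"
  else
    echo "::warning::Image SHA mismatch: label=${label_sha} expected=${expected_sha}"
  fi
  return 0
}
```

It would be called with the LABEL_SHA value from the docker inspect command shown in the step.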

7. Updated PR Commenting Logic

Before:

if: github.event_name == 'pull_request' && always()

After:

if: ${{ always() && github.event_name == 'workflow_run' && github.event.workflow_run.event == 'pull_request' }}
steps:
  - name: Get PR number
    run: |
      PR_NUM=$(echo '${{ toJson(github.event.workflow_run.pull_requests) }}' | jq -r '.[0].number')

Benefits:

  • Works with workflow_run trigger
  • Extracts PR number from workflow_run context
  • Gracefully skips if PR number unavailable
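The graceful skip depends on the extraction tolerating an empty or missing pull_requests array, which can occur in workflow_run events (for example, for pull requests from forks). A sketch of a null-safe extraction with jq is shown below; the function name extract_pr_number and the PULL_REQUESTS_JSON variable are hypothetical.

```shell
# Read the pull_requests JSON array on stdin and print the first PR number,
# or print nothing when the array is empty, null, or not valid JSON.
extract_pr_number() {
  jq -r '.[0].number // empty' 2>/dev/null || true
}

# Example use inside a step, with the JSON supplied through an environment
# variable instead of being interpolated into the script text:
#
#   PR_NUM="$(printf '%s' "${PULL_REQUESTS_JSON:-[]}" | extract_pr_number)"
#   if [ -z "$PR_NUM" ]; then
#     echo "No PR number available; skipping comment"
#     exit 0
#   fi
```

Passing the JSON through an environment variable rather than embedding the expression in the script avoids quoting problems if the payload ever contains a single quote.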

8. Container Startup Updated

Before:

docker load -i charon-e2e-image.tar
docker compose ... up -d

After:

# Image already loaded as charon:e2e-test from registry/artifact
docker compose ... up -d

Benefits:

  • Simpler startup (no tar file handling)
  • Works with both registry and artifact sources

Test Execution Flow

Before (Redundant Build):

PR opened
├─> docker-build.yml (Build 1) → Artifact
└─> e2e-tests.yml
    ├─> build job (Build 2) → Artifact ❌ REDUNDANT
    └─> test jobs (use Build 2 artifact)

After (Build Once):

PR opened
└─> docker-build.yml (Build 1) → Registry + Artifact
    └─> [workflow_run trigger]
        └─> e2e-tests.yml
            └─> test jobs (pull from registry ✅)

Coverage Mode Handling

IMPORTANT: Coverage collection is separate and unaffected by this change.

  • Standard E2E tests: Use Docker container (port 8080) ← This workflow
  • Coverage collection: Use Vite dev server (port 5173) ← Separate skill

Coverage mode requires source file access for V8 instrumentation, so it cannot use registry images. The existing coverage collection skill (test-e2e-playwright-coverage) remains unchanged.

Performance Impact

| Metric | Before | After | Improvement |
|---|---|---|---|
| Build time per run | ~10 min | ~0 min (pull only) | 10 min saved |
| Registry pulls | 0 | ~2-3 min (initial) | Acceptable overhead |
| Artifact fallback | N/A | ~5 min (rare) | Robustness |
| Total time saved | N/A | ~8 min per workflow run | 80% reduction in redundant work |

Risk Mitigation

Implemented Safeguards:

  1. Retry Logic: 3 attempts with a 10-second wait between attempts for registry pulls
  2. Dual-Source Strategy: Artifact fallback if registry unavailable
  3. Concurrency Groups: Prevent race conditions on PR updates
  4. Image Validation: SHA label checks detect stale images
  5. Timeout Protection: Job-level (30 min) and step-level timeouts
  6. Comprehensive Logging: Source, tag, and SHA logged for troubleshooting

Rollback Plan:

If issues arise, restore from backup:

cp .github/workflows/.backup/e2e-tests.yml.backup .github/workflows/e2e-tests.yml
git commit -m "Rollback: E2E workflow to independent build"
git push origin main

Recovery Time: ~10 minutes

Testing Validation

Pre-Deployment Checklist:

  • Workflow syntax validated (gh workflow list --all)
  • Image tag determination logic tested with sample data
  • Retry logic handles simulated failures
  • Artifact fallback tested with missing registry image
  • SHA validation handles both registry and artifact sources
  • PR commenting works with workflow_run context
  • All test shards (12 total) can run in parallel
  • Container starts successfully from pulled image
  • Documentation updated

Testing Scenarios:

| Scenario | Expected Behavior | Status |
|---|---|---|
| PR with new commit | Triggers after docker-build.yml, pulls pr-{N}-{sha} | To verify |
| Branch push (main) | Triggers after docker-build.yml, pulls main-{sha} | To verify |
| Manual dispatch | Uses provided image tag or defaults to latest | To verify |
| Registry pull fails | Falls back to artifact download | To verify |
| PR updated mid-test | Cancels old run, starts new run | To verify |
| Coverage mode | Unaffected, uses Vite dev server | Verified |

Integration with Other Workflows

Dependencies:

  • Upstream: docker-build.yml (must complete successfully)
  • Downstream: None (E2E tests are terminal)

Workflow Orchestration:

docker-build.yml (12-15 min)
    ├─> Builds image
    ├─> Pushes to registry (pr-{N}-{sha})
    ├─> Uploads artifact (backup)
    └─> [workflow_run completion]
        ├─> cerberus-integration.yml ✅ (Phase 2-3)
        ├─> waf-integration.yml ✅ (Phase 2-3)
        ├─> crowdsec-integration.yml ✅ (Phase 2-3)
        ├─> rate-limit-integration.yml ✅ (Phase 2-3)
        └─> e2e-tests.yml ✅ (Phase 4 - THIS CHANGE)

Documentation Updates

Files Modified:

  • .github/workflows/e2e-tests.yml - E2E workflow migrated to registry image
  • docs/plans/current_spec.md - Phase 4 marked as complete
  • docs/implementation/docker_optimization_phase4_complete.md - This document

Files to Update (Post-Validation):

  • docs/ci-cd.md - Update with new E2E architecture (Phase 6)
  • docs/troubleshooting-ci.md - Add E2E registry troubleshooting (Phase 6)
  • CONTRIBUTING.md - Update CI/CD expectations (Phase 6)

Key Learnings

  1. workflow_run Context: Native pull_requests array is more reliable than API calls
  2. Tag Immutability: SHA suffix in tags prevents race conditions effectively
  3. Dual-Source Strategy: Registry + artifact fallback provides robustness
  4. Coverage Mode: Vite dev server requirement means coverage must stay separate
  5. Error Handling: Comprehensive null checks essential for workflow_run context

Next Steps

Immediate (Post-Deployment):

  1. Monitor First Runs:

    • Check registry pull success rate
    • Verify artifact fallback works if needed
    • Monitor workflow timing improvements
  2. Validate PR Commenting:

    • Ensure PR comments appear for workflow_run-triggered runs
    • Verify comment content is accurate
  3. Collect Metrics:

    • Build time reduction
    • Registry pull success rate
    • Artifact fallback usage rate

Phase 5 (Week 7):

  • Enhanced Cleanup Automation
  • Retention policies for pr-*-{sha} tags (24 hours)
  • In-use detection for active workflows
  • Metrics collection (storage freed, tags deleted)

Phase 6 (Week 8):

  • Validation & Documentation
  • Generate performance report
  • Update CI/CD documentation
  • Team training on new architecture

Success Criteria

  • E2E workflow triggers after docker-build.yml completes
  • Redundant build job removed
  • Image pulled from registry with retry logic
  • Artifact fallback works for robustness
  • Concurrency groups prevent race conditions
  • PR commenting works with workflow_run context
  • All 12 test shards pass (to be validated in production)
  • Build time reduced by ~10 minutes (to be measured)
  • No test accuracy regressions (to be monitored)

References

  • Specification: docs/plans/current_spec.md Section 4.3 & 6.4
  • Implementation PR: [To be created]
  • Tracking Issue: Phase 4 - E2E Workflow Migration


Status: Implementation complete, ready for validation in production

Next Phase: Phase 5 - Enhanced Cleanup Automation (Week 7)