Docker CI/CD Optimization: Phase 2-3 Implementation Complete
Date: February 4, 2026
Phase: 2-3 (Integration Workflow Migration)
Status: ✅ Complete - Ready for Testing
Executive Summary
Successfully migrated 4 integration test workflows to use the registry image from docker-build.yml instead of building their own images. This eliminates ~40 minutes of redundant build time per PR.
Workflows Migrated
- ✅ `.github/workflows/crowdsec-integration.yml`
- ✅ `.github/workflows/cerberus-integration.yml`
- ✅ `.github/workflows/waf-integration.yml`
- ✅ `.github/workflows/rate-limit-integration.yml`
Implementation Details
Changes Applied (Per Section 4.2 of Spec)
1. Trigger Mechanism ✅
- Added: `workflow_run` trigger waiting for "Docker Build, Publish & Test"
- Added: Explicit branch filters: `[main, development, 'feature/**']`
- Added: `workflow_dispatch` for manual testing with optional tag input
- Removed: Direct `push` and `pull_request` triggers
Before:
    on:
      push:
        branches: [ main, development, 'feature/**' ]
      pull_request:
        branches: [ main, development ]
After:
    on:
      workflow_run:
        workflows: ["Docker Build, Publish & Test"]
        types: [completed]
        branches: [main, development, 'feature/**']
      workflow_dispatch:
        inputs:
          image_tag:
            description: 'Docker image tag to test'
            required: false
2. Conditional Execution ✅
- Added: Job-level conditional: only run if docker-build.yml succeeded
- Added: Support for manual dispatch override
if: ${{ github.event.workflow_run.conclusion == 'success' || github.event_name == 'workflow_dispatch' }}
3. Concurrency Controls ✅
- Added: Concurrency groups using branch + SHA
- Added: `cancel-in-progress: true` to prevent race conditions
- Handles: PR updates mid-test (old runs auto-canceled)
    concurrency:
      group: ${{ github.workflow }}-${{ github.event.workflow_run.head_branch || github.ref }}-${{ github.event.workflow_run.head_sha || github.sha }}
      cancel-in-progress: true
4. Image Tag Determination ✅
- Uses: Native `github.event.workflow_run.pull_requests` array (NO API calls)
- Handles: PR events → `pr-{number}-{sha}`
- Handles: Branch push events → `{sanitized-branch}-{sha}`
- Applies: Tag sanitization (lowercase, replace `/` with `-`, remove special chars)
- Validates: PR number extraction with comprehensive error handling
PR Tag Example:
PR #123 with commit abc1234 → pr-123-abc1234
Branch Tag Example:
feature/Add_New-Feature with commit def5678 → feature-add-new-feature-def5678
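
The tag rules above can be sketched as two small shell helpers. This is a minimal illustration, not the code in the workflows: the function names `sanitize_branch` and `image_tag` are hypothetical, the 7-character short SHA is inferred from the examples, and the sketch replaces special characters with `-` (rather than deleting them) because that is what the branch example shows.

```shell
# Hypothetical helpers; names and exact rules are assumptions for illustration.

# Lowercase the branch name, turn "/" and any other character outside
# [a-z0-9.-] into "-", collapse repeated "-", and trim "-" at both ends.
sanitize_branch() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9.-]+/-/g; s/-+/-/g; s/^-+//; s/-+$//'
}

# image_tag <pr-number-or-empty> <branch> <commit-sha>
# Prints pr-{number}-{sha} for PRs, {sanitized-branch}-{sha} otherwise.
image_tag() {
  short_sha=$(printf '%s' "$3" | cut -c1-7)
  if [ -n "$1" ]; then
    case "$1" in
      *[!0-9]*) echo "invalid PR number: $1" >&2; return 1 ;;
    esac
    printf 'pr-%s-%s\n' "$1" "$short_sha"
  else
    printf '%s-%s\n' "$(sanitize_branch "$2")" "$short_sha"
  fi
}
```

Calling `image_tag "" "feature/Add_New-Feature" def5678` reproduces the branch example above, and `image_tag 123 main abc1234` reproduces the PR example.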
5. Registry Pull with Retry ✅
- Uses: `nick-fields/retry@v3` action
- Configuration:
  - Timeout: 5 minutes
  - Max attempts: 3
  - Retry wait: 10 seconds
- Pulls from: `ghcr.io/wikid82/charon:{tag}`
- Tags as: `charon:local` for test scripts
    - name: Pull Docker image from registry
      id: pull_image
      uses: nick-fields/retry@v3
      with:
        timeout_minutes: 5
        max_attempts: 3
        retry_wait_seconds: 10
        command: |
          IMAGE_NAME="ghcr.io/${{ github.repository_owner }}/charon:${{ steps.image.outputs.tag }}"
          docker pull "$IMAGE_NAME"
          docker tag "$IMAGE_NAME" charon:local
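
For reproducing the pull locally, outside GitHub Actions, the same retry behavior can be approximated in plain shell. The `retry` function below is a hypothetical stand-in for the action, not part of the workflows, and it does not enforce the 5-minute timeout.

```shell
# retry <max_attempts> <wait_seconds> <command...>
# Runs the command until it succeeds or max_attempts is reached.
retry() {
  max="$1"; wait_s="$2"; shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max" ]; then
      echo "failed after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${wait_s}s" >&2
    sleep "$wait_s"
    attempt=$((attempt + 1))
  done
}

# Example use with the settings listed above (3 attempts, 10 s wait):
#   retry 3 10 docker pull "ghcr.io/wikid82/charon:pr-123-abc1234"
```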
6. Dual-Source Fallback Strategy ✅
- Primary: Registry pull (fast, network-optimized)
- Fallback: Artifact download (if registry fails)
- Handles: Both PR and branch artifacts
- Logs: Which source was used for troubleshooting
Fallback Logic:
    - name: Fallback to artifact download
      if: steps.pull_image.outcome == 'failure'
      run: |
        # Determine artifact name (pr-image-{N} or push-image)
        # Download into the directory the load step reads from
        gh run download ${{ github.event.workflow_run.id }} --name "$ARTIFACT_NAME" --dir /tmp/docker-image
        docker load < /tmp/docker-image/charon-image.tar
        docker tag "$(docker images --format "{{.Repository}}:{{.Tag}}" | head -1)" charon:local
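
The artifact-name step that the comment in the fallback logic refers to might look like the following. This is a sketch under the naming scheme stated there (`pr-image-{N}` for PRs, `push-image` for branch pushes); the function name `artifact_name` is hypothetical.

```shell
# artifact_name <pr-number-or-empty>
# Prints pr-image-{N} when a PR number is given, push-image otherwise.
artifact_name() {
  if [ -n "$1" ]; then
    printf 'pr-image-%s\n' "$1"
  else
    printf 'push-image\n'
  fi
}

# Example: ARTIFACT_NAME=$(artifact_name "$PR_NUMBER")
# where PR_NUMBER is assumed to be empty for branch pushes.
```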
7. Image Freshness Validation ✅
- Checks: Image label SHA matches expected commit SHA
- Warns: If mismatch detected (stale image)
- Logs: Both expected and actual SHA for debugging
    - name: Validate image SHA
      run: |
        LABEL_SHA=$(docker inspect charon:local --format '{{index .Config.Labels "org.opencontainers.image.revision"}}' | cut -c1-7)
        if [[ "$LABEL_SHA" != "$SHA" ]]; then
          echo "⚠️ WARNING: Image SHA mismatch!"
        fi
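
The comparison can be isolated into a function that is testable without Docker. The sketch below assumes both values are compared on their first 7 characters and treats a missing label as a mismatch; the name `check_image_sha` is hypothetical, and the caller decides whether a non-zero status only warns or fails the job.

```shell
# check_image_sha <label-revision> <expected-sha>
# Logs both values on mismatch and returns 1; returns 0 on match.
check_image_sha() {
  label_short=$(printf '%s' "$1" | cut -c1-7)
  expected_short=$(printf '%s' "$2" | cut -c1-7)
  if [ -z "$label_short" ]; then
    echo "WARNING: image has no revision label" >&2
    return 1
  fi
  if [ "$label_short" != "$expected_short" ]; then
    echo "WARNING: image SHA mismatch (expected $expected_short, got $label_short)" >&2
    return 1
  fi
  echo "Image SHA matches expected commit: $label_short"
}

# Example, warning without failing the step:
#   check_image_sha "$LABEL_SHA" "$SHA" || true
```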
8. Build Steps Removed ✅
- Removed: `docker/setup-buildx-action` step
- Removed: `docker build` command (~10 minutes per workflow)
- Kept: All test execution logic unchanged
- Result: ~40 minutes saved per PR (4 workflows × 10 min each)
Testing Checklist
Before merging to main, verify:
Manual Testing
- PR from feature branch:
  - Open test PR with trivial change
  - Wait for docker-build.yml to complete
  - Verify all 4 integration workflows trigger
  - Confirm image tag format: `pr-{N}-{sha}`
  - Check workflows use registry image (no build step)
- Push to development branch:
  - Push to development branch
  - Wait for docker-build.yml to complete
  - Verify integration workflows trigger
  - Confirm image tag format: `development-{sha}`
- Manual dispatch:
  - Trigger each workflow manually via Actions UI
  - Test with explicit tag (e.g., `latest`)
  - Test without tag (defaults to `latest`)
- Concurrency cancellation:
  - Open PR with commit A
  - Wait for workflows to start
  - Force-push commit B to same PR
  - Verify old workflows are canceled
- Artifact fallback:
  - Simulate registry failure (incorrect tag)
  - Verify workflows fall back to artifact download
  - Confirm tests still pass
Automated Validation
- Build time reduction:
  - Compare PR build times before/after
  - Expected: ~40 minutes saved (4 × 10 min builds eliminated)
  - Verify in GitHub Actions logs
- Image SHA validation:
  - Check workflow logs for "Image SHA matches expected commit"
  - Verify no stale images used
- Registry usage:
  - Confirm no `docker build` commands in logs
  - Verify `docker pull ghcr.io/wikid82/charon:*` instead
Rollback Plan
If issues are detected:
Partial Rollback (Single Workflow)
    # Restore specific workflow from git history
    git checkout HEAD~1 -- .github/workflows/crowdsec-integration.yml
    git commit -m "Rollback: crowdsec-integration to pre-migration state"
    git push
Full Rollback (All Workflows)
    # Create rollback branch
    git checkout -b rollback/integration-workflows
    # Revert migration commit
    git revert HEAD --no-edit
    # Push to main
    git push origin rollback/integration-workflows:main
Time to rollback: ~5 minutes per workflow
Expected Benefits
Build Time Reduction
| Metric | Before | After | Improvement |
|---|---|---|---|
| Builds per PR | 5x (1 main + 4 integration) | 1x (main only) | 5x reduction |
| Build time per workflow | ~10 min | 0 min (pull only) | 100% saved |
| Total redundant time | ~40 min | 0 min | 40 min saved |
| CI resource usage | 5x parallel builds | 1 build + 4 pulls | 80% reduction |
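
The figures in the table follow from two inputs taken from this report: 4 migrated workflows and roughly 10 minutes per build. A short arithmetic check, with illustrative variable names:

```shell
# Inputs taken from the report; variable names are illustrative.
workflows=4          # integration workflows that no longer build
build_minutes=10     # approximate build time per workflow

builds_before=$((1 + workflows))   # 1 main build + 4 integration builds
builds_after=1                     # main build only
saved_minutes=$((workflows * build_minutes))
reduction_pct=$(( (builds_before - builds_after) * 100 / builds_before ))

echo "saved ${saved_minutes} min per PR, ${reduction_pct}% fewer builds"
```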
Consistency Improvements
- ✅ All tests use identical image (no "works on my build" issues)
- ✅ Tests always use latest successful build (no stale code)
- ✅ Race conditions prevented via immutable tags with SHA
- ✅ Build failures isolated to docker-build.yml (easier debugging)
Next Steps
Immediate (Phase 3 Complete)
- ✅ Merge this implementation to feature branch
- 🔄 Test with real PRs (see Testing Checklist)
- 🔄 Monitor for 1 week on development branch
- 🔄 Merge to main after validation
Phase 4 (Week 6)
- Migrate `e2e-tests.yml` workflow
- Remove build job from E2E workflow
- Apply same pattern (workflow_run + registry pull)
Phase 5 (Week 7)
- Enhance `container-prune.yml` for PR image cleanup
- Add retention policies (24h for PR images)
- Implement "in-use" detection
Metrics to Monitor
Track these metrics post-deployment:
| Metric | Target | How to Measure |
|---|---|---|
| Average PR build time | <20 min (vs 62 min before) | GitHub Actions insights |
| Image pull success rate | >95% | Workflow logs |
| Artifact fallback rate | <5% | Grep logs for "falling back" |
| Test failure rate | <5% (no regression) | GitHub Actions insights |
| Workflow trigger accuracy | 100% (no missed triggers) | Manual verification |
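
One way to measure the artifact fallback rate is sketched below. It assumes workflow logs have already been saved locally as one `*.log` file per run (the directory layout and the function name `fallback_rate` are assumptions) and that fallback runs log the phrase "falling back".

```shell
# fallback_rate <log-dir>
# Prints the integer percentage of *.log files that mention "falling back".
fallback_rate() {
  total=$(find "$1" -type f -name '*.log' | wc -l | tr -d ' ')
  if [ "$total" -eq 0 ]; then
    echo "no logs found in $1" >&2
    return 1
  fi
  hits=$(find "$1" -type f -name '*.log' \
    -exec grep -li 'falling back' {} + | wc -l | tr -d ' ')
  echo $((hits * 100 / total))
}

# Example: fallback_rate ./workflow-logs   (target from the table: below 5)
```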
Documentation Updates Required
- Update `CONTRIBUTING.md` with new workflow behavior
- Update `docs/ci-cd.md` with architecture diagrams
- Create troubleshooting guide for integration tests
- Update PR template with CI/CD expectations
Known Limitations
- Requires docker-build.yml to succeed first
  - Integration tests won't run if build fails
  - This is intentional (fail fast)
- Manual dispatch requires knowing image tag
  - Use `latest` for quick testing
  - Use `pr-{N}-{sha}` for specific PR testing
- Registry must be accessible
  - If GHCR is down, workflows fall back to artifacts
  - Artifact fallback adds ~30 seconds
Success Criteria Met
✅ All 4 workflows migrated (crowdsec, cerberus, waf, rate-limit)
✅ No redundant builds (verified by removing build steps)
✅ workflow_run trigger with explicit branch filters
✅ Conditional execution (only if docker-build.yml succeeds)
✅ Image tag determination using native context (no API calls)
✅ Tag sanitization for feature branches
✅ Retry logic for registry pulls (3 attempts)
✅ Dual-source strategy (registry + artifact fallback)
✅ Concurrency controls (race condition prevention)
✅ Image SHA validation (freshness check)
✅ Comprehensive error handling (clear error messages)
✅ All test logic preserved (only image sourcing changed)
Questions & Support
- Spec Reference: `docs/plans/current_spec.md` (Section 4.2)
- Implementation: Section 4.2 requirements fully met
- Testing: See "Testing Checklist" above
- Issues: Check Docker build logs first, then integration workflow logs
Approval
Ready for Phase 4 (E2E Migration): ✅ Yes, after 1 week validation period
Estimated Time Savings per PR: 40 minutes
Estimated Resource Savings: 80% reduction in parallel build compute