Docker Compose CI Failure Remediation Plan
Status: Active
Created: 2026-01-30
Priority: CRITICAL (Blocking CI)
Executive Summary
The E2E test workflow (e2e-tests.yml) is failing when attempting to start containers via docker-compose.playwright-ci.yml. The root cause is an incorrect Docker image reference format in the compose file that attempts to use a bare SHA256 digest instead of a fully-qualified image reference with registry and repository.
Error Message:
charon-app Error pull access denied for sha256, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
Root Cause: The compose file's image: directive evaluates to a bare SHA256 digest (e.g., sha256:057a9998...) instead of a properly formatted image reference like ghcr.io/wikid82/charon@sha256:057a9998....
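To make the difference between the two forms concrete, here is a minimal sketch of a helper that assembles a digest-pinned reference and refuses to emit one when any part is missing. The function name `build_digest_ref` and its argument order are hypothetical and are not part of the existing workflow or compose file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the existing workflow): build a digest-pinned
# image reference of the form registry/repository@sha256:<hex>.
build_digest_ref() {
  local registry="$1" repository="$2" digest="$3"

  # A reference with no registry/repository collapses to a bare digest,
  # which is the shape that produces "pull access denied for sha256".
  if [[ -z "$registry" || -z "$repository" ]]; then
    echo "registry and repository are required" >&2
    return 1
  fi

  # A SHA-256 digest is the prefix "sha256:" followed by 64 hex characters.
  if [[ ! "$digest" =~ ^sha256:[a-f0-9]{64}$ ]]; then
    echo "digest must look like sha256:<64 hex chars>, got: ${digest}" >&2
    return 1
  fi

  printf '%s/%s@%s\n' "$registry" "$repository" "$digest"
}
```

A caller could then write something like `IMAGE_REF="$(build_digest_ref ghcr.io wikid82/charon "$DIGEST")"`, where `DIGEST` is assumed to hold the digest reported by the build step.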
Issue 1: Nightly Build - GoReleaser macOS Cross-Compile Failure
Problem Statement
The nightly build fails during the GoReleaser release step when cross-compiling for macOS (darwin) using Zig:
release failed after 4m19s
error=
build failed: exit status 1: go: downloading github.com/gin-gonic/gin v1.11.0
info: zig can provide libc for related target x86_64-macos.11-none
target=darwin_amd64_v1
Root Cause Analysis
The .goreleaser.yaml darwin build uses incorrect Zig target specification:
Current (WRONG):
CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-gnu
CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-gnu
Issue: macOS uses its own libc (libSystem), not GNU libc. The -gnu suffix is invalid for macOS targets. Zig expects -macos-none or -macos.11-none for macOS builds.
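The mapping the GoReleaser template is meant to produce can be sketched as a small shell function. The name `zig_macos_target` is hypothetical. Unlike the template, which treats every non-amd64 architecture as aarch64, this sketch rejects unknown architectures explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a Go architecture name to the Zig target triple
# expected for darwin builds (the -none ABI suffix, never -gnu).
zig_macos_target() {
  local goarch="$1"
  case "$goarch" in
    amd64) echo "x86_64-macos-none" ;;
    arm64) echo "aarch64-macos-none" ;;
    *)
      echo "unsupported GOARCH for darwin: ${goarch}" >&2
      return 1
      ;;
  esac
}
```

For a local experiment this could feed the compiler variable directly, for example `CC="zig cc -target $(zig_macos_target amd64)"`.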
Affected Files
| File | Change Type |
|---|---|
| .goreleaser.yaml | Fix Zig target for darwin builds |
Recommended Fix
Update the darwin build configuration to use the correct Zig target triple:
Option A: Use -macos-none (Recommended)
- id: darwin
dir: backend
main: ./cmd/api
binary: charon
env:
- CGO_ENABLED=1
- CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-none
- CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-none
Option B: Specify macOS version (for specific SDK compatibility)
- CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos.11-none
- CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos.11-none
Option C: Remove darwin builds entirely (if macOS support is not required)
# Remove the entire `- id: darwin` build block from .goreleaser.yaml
# Update archives section to remove darwin from the `nix` archive builds
Implementation Details
--- a/.goreleaser.yaml
+++ b/.goreleaser.yaml
@@ -47,8 +47,8 @@
binary: charon
env:
- CGO_ENABLED=1
- - CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-gnu
- - CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-gnu
+ - CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-none
+ - CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-none
goos:
- darwin
goarch:
Verification
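# Local test (requires Zig installed); GOOS/GOARCH make the Go toolchain target darwin as well
cd backend
GOOS=darwin GOARCH=amd64 CGO_ENABLED=1 CC="zig cc -target x86_64-macos-none" CXX="zig c++ -target x86_64-macos-none" go build -o charon-darwin ./cmd/api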
# Nightly workflow test
gh workflow run nightly-build.yml --ref development -f reason="Test darwin build fix"
Issue 2: Playwright E2E - Admin API Socket Hang Up
Problem Statement
Playwright test zzz-admin-whitelist-blocking.spec.ts:126 fails with:
Error: apiRequestContext.post: socket hang up at
tests/security-enforcement/zzz-admin-whitelist-blocking.spec.ts:126:21
The test POSTs to http://localhost:2020/emergency/security-reset but cannot reach the emergency server.
Root Cause Analysis
The playwright.yml workflow starts the Charon container but does not set the CHARON_EMERGENCY_BIND environment variable:
Current workflow (.github/workflows/playwright.yml):
docker run -d \
--name charon-test \
-p 8080:8080 \
-p 127.0.0.1:2019:2019 \
-p "[::1]:2019:2019" \
-p 127.0.0.1:2020:2020 \
-p "[::1]:2020:2020" \
-e CHARON_ENV="${CHARON_ENV}" \
-e CHARON_DEBUG="${CHARON_DEBUG}" \
-e CHARON_ENCRYPTION_KEY="${CHARON_ENCRYPTION_KEY}" \
-e CHARON_EMERGENCY_TOKEN="${CHARON_EMERGENCY_TOKEN}" \
-e CHARON_EMERGENCY_SERVER_ENABLED="${CHARON_EMERGENCY_SERVER_ENABLED}" \
"${IMAGE_REF}"
Missing: CHARON_EMERGENCY_BIND=0.0.0.0:2020
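Without this variable, the emergency server may fall back to a loopback-only address inside the container. Docker's published ports forward traffic to the container's network interface, not to its loopback, so a listener bound to 127.0.0.1 inside the container is unreachable from the host even though port 2020 is mapped.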
Comparison with working compose file:
# .docker/compose/docker-compose.playwright-ci.yml
- CHARON_EMERGENCY_BIND=0.0.0.0:2020
- CHARON_EMERGENCY_USERNAME=admin
- CHARON_EMERGENCY_PASSWORD=changeme
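A workflow could check the bind value before starting the container. The following is a sketch only: the function name `bind_is_reachable_via_port_mapping` is hypothetical, and it treats an empty value as unsafe because the server's default bind address is not stated in this plan.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: succeed only when a host:port bind value
# names an address that Docker port mapping can reach.
bind_is_reachable_via_port_mapping() {
  local bind="$1"

  # Unset/empty: the server's default is unknown here, so treat it as unsafe.
  if [[ -z "$bind" ]]; then
    return 1
  fi

  local host="${bind%:*}"   # strip the trailing :port
  case "$host" in
    127.*|localhost|"[::1]")
      # Loopback inside the container: not reachable through published ports.
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}
```

With the value used in the compose file, `bind_is_reachable_via_port_mapping "0.0.0.0:2020"` succeeds, while a loopback value such as `127.0.0.1:2020` fails the check.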
Affected Files
| File | Change Type |
|---|---|
| .github/workflows/playwright.yml | Add missing emergency server env vars |
Recommended Fix
Add the missing emergency server environment variables to the docker run command:
--- a/.github/workflows/playwright.yml
+++ b/.github/workflows/playwright.yml
@@ -163,6 +163,10 @@ jobs:
-e CHARON_ENCRYPTION_KEY="${CHARON_ENCRYPTION_KEY}" \
-e CHARON_EMERGENCY_TOKEN="${CHARON_EMERGENCY_TOKEN}" \
-e CHARON_EMERGENCY_SERVER_ENABLED="${CHARON_EMERGENCY_SERVER_ENABLED}" \
+ -e CHARON_EMERGENCY_BIND="0.0.0.0:2020" \
+ -e CHARON_EMERGENCY_USERNAME="admin" \
+ -e CHARON_EMERGENCY_PASSWORD="changeme" \
+ -e CHARON_SECURITY_TESTS_ENABLED="true" \
"${IMAGE_REF}"
Full Updated Step
- name: Start Charon container
if: steps.check-artifact.outputs.artifact_exists == 'true'
run: |
echo "🚀 Starting Charon container..."
# Normalize image name (GitHub lowercases repository owner names in GHCR)
IMAGE_NAME=$(echo "${{ github.repository_owner }}/charon" | tr '[:upper:]' '[:lower:]')
if [[ "${{ steps.pr-info.outputs.is_push }}" == "true" ]]; then
IMAGE_REF="ghcr.io/${IMAGE_NAME}:${{ steps.sanitize.outputs.branch }}"
else
IMAGE_REF="ghcr.io/${IMAGE_NAME}:pr-${{ steps.pr-info.outputs.pr_number }}"
fi
echo "📦 Starting container with image: ${IMAGE_REF}"
docker run -d \
--name charon-test \
-p 8080:8080 \
-p 127.0.0.1:2019:2019 \
-p "[::1]:2019:2019" \
-p 127.0.0.1:2020:2020 \
-p "[::1]:2020:2020" \
-e CHARON_ENV="${CHARON_ENV}" \
-e CHARON_DEBUG="${CHARON_DEBUG}" \
-e CHARON_ENCRYPTION_KEY="${CHARON_ENCRYPTION_KEY}" \
-e CHARON_EMERGENCY_TOKEN="${CHARON_EMERGENCY_TOKEN}" \
-e CHARON_EMERGENCY_SERVER_ENABLED="${CHARON_EMERGENCY_SERVER_ENABLED}" \
-e CHARON_EMERGENCY_BIND="0.0.0.0:2020" \
-e CHARON_EMERGENCY_USERNAME="admin" \
-e CHARON_EMERGENCY_PASSWORD="changeme" \
-e CHARON_SECURITY_TESTS_ENABLED="true" \
"${IMAGE_REF}"
echo "✅ Container started"
Verification
# After fix, verify emergency server is listening
docker exec charon-test curl -sf http://localhost:2020/health || echo "Failed"
# Test emergency reset endpoint
curl -X POST http://localhost:2020/emergency/security-reset \
-H "Authorization: Basic $(echo -n 'admin:changeme' | base64)" \
-H "X-Emergency-Token: $CHARON_EMERGENCY_TOKEN"
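The health probe above can race container startup, so a single failed attempt does not necessarily mean the bind address is wrong. A minimal retry wrapper, assuming a fixed delay between attempts, might look like this; the function name `retry` and its argument order are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1

  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
}
```

In the workflow this could wrap the existing probe, for example `retry 10 2 docker exec charon-test curl -sf http://localhost:2020/health`.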
Issue 3: Trivy Scan - Invalid Image Reference Format
Problem Statement
Trivy scan fails with "invalid image reference format" when:
- PR number is missing (manual dispatch without PR number)
- Feature branch names contain / characters (e.g., feature/new-thing)
- is_push and pr_number are both empty/false
Resulting in invalid Docker tags like:
- ghcr.io/owner/charon:pr- (empty PR number)
- ghcr.io/owner/charon: (no tag at all)
Root Cause Analysis
Location: .github/workflows/playwright.yml - "Start Charon container" step
if [[ "${{ steps.pr-info.outputs.is_push }}" == "true" ]]; then
IMAGE_REF="ghcr.io/${IMAGE_NAME}:${{ steps.sanitize.outputs.branch }}"
else
IMAGE_REF="ghcr.io/${IMAGE_NAME}:pr-${{ steps.pr-info.outputs.pr_number }}"
fi
Problem: When is_push != "true" AND pr_number is empty, this creates:
IMAGE_REF="ghcr.io/owner/charon:pr-"
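The tag pr- is truncated and matches no image that was pushed, so the pull or scan fails. When the tag is empty altogether (ghcr.io/owner/charon:), the reference is not even syntactically valid.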
Affected Files
| File | Change Type |
|---|---|
| .github/workflows/playwright.yml | Add validation for IMAGE_REF |
| .github/workflows/docker-build.yml | Add validation guards (CVE verification step) |
Recommended Fix
Add defensive validation to fail fast with a clear error message:
--- a/.github/workflows/playwright.yml
+++ b/.github/workflows/playwright.yml
# Normalize image name (GitHub lowercases repository owner names in GHCR)
IMAGE_NAME=$(echo "${{ github.repository_owner }}/charon" | tr '[:upper:]' '[:lower:]')
if [[ "${{ steps.pr-info.outputs.is_push }}" == "true" ]]; then
IMAGE_REF="ghcr.io/${IMAGE_NAME}:${{ steps.sanitize.outputs.branch }}"
- else
+ elif [[ -n "${{ steps.pr-info.outputs.pr_number }}" ]]; then
IMAGE_REF="ghcr.io/${IMAGE_NAME}:pr-${{ steps.pr-info.outputs.pr_number }}"
+ else
+ echo "❌ ERROR: Cannot determine image reference"
+ echo " - is_push: ${{ steps.pr-info.outputs.is_push }}"
+ echo " - pr_number: ${{ steps.pr-info.outputs.pr_number }}"
+ echo " - branch: ${{ steps.sanitize.outputs.branch }}"
+ echo ""
+ echo "This can happen when:"
+ echo " 1. workflow_dispatch without pr_number input"
+ echo " 2. workflow_run triggered by non-PR, non-push event"
+ exit 1
fi
+ # Validate the image reference format
+ if [[ ! "${IMAGE_REF}" =~ ^ghcr\.io/[a-z0-9_-]+/[a-z0-9_-]+:[a-zA-Z0-9._-]+$ ]]; then
+ echo "❌ ERROR: Invalid image reference format: ${IMAGE_REF}"
+ exit 1
+ fi
+
echo "📦 Starting container with image: ${IMAGE_REF}"
Additional Fix for docker-build.yml
The same issue can occur in docker-build.yml at the CVE verification step:
# Line ~174 in docker-build.yml
if [ "${{ github.event_name }}" = "pull_request" ]; then
IMAGE_REF="${{ env.GHCR_REGISTRY }}/${{ env.IMAGE_NAME }}:pr-${{ github.event.pull_request.number }}"
Fix:
--- a/.github/workflows/docker-build.yml
+++ b/.github/workflows/docker-build.yml
# Determine the image reference based on event type
if [ "${{ github.event_name }}" = "pull_request" ]; then
- IMAGE_REF="${{ env.GHCR_REGISTRY }}/${{ env.IMAGE_NAME }}:pr-${{ github.event.pull_request.number }}"
+ PR_NUM="${{ github.event.pull_request.number }}"
+ if [ -z "${PR_NUM}" ]; then
+ echo "❌ ERROR: Pull request number is empty"
+ exit 1
+ fi
+ IMAGE_REF="${{ env.GHCR_REGISTRY }}/${{ env.IMAGE_NAME }}:pr-${PR_NUM}"
echo "Using PR image: $IMAGE_REF"
else
+ if [ -z "${{ steps.build-and-push.outputs.digest }}" ]; then
+ echo "❌ ERROR: Build digest is empty"
+ exit 1
+ fi
IMAGE_REF="${{ env.GHCR_REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build-and-push.outputs.digest }}"
echo "Using digest: $IMAGE_REF"
fi
Verification
# Test with empty PR number (should fail fast with clear error)
gh workflow run playwright.yml --ref development
# Check IMAGE_REF construction in logs
gh run view --log | grep "IMAGE_REF"
Implementation Plan
Phase 1: Immediate Fixes (Single PR)
Objective: Fix all three CI failures in a single PR for immediate resolution.
Files to Modify:
| File | Changes |
|---|---|
| .goreleaser.yaml | Change -macos-gnu to -macos-none for darwin builds |
| .github/workflows/playwright.yml | Add missing emergency server env vars; Add IMAGE_REF validation |
| .github/workflows/docker-build.yml | Add IMAGE_REF validation guards |
Phase 2: Verification
- Push changes to a feature branch
- Open PR to trigger docker-build.yml
- Verify Trivy scan passes with valid IMAGE_REF
- Verify Playwright workflow if triggered
- Manually trigger nightly-build.yml with --ref pointing to feature branch
- Verify darwin build succeeds
Phase 3: Cleanup (Optional)
- Add validation logic to a shared script (scripts/validate-image-ref.sh)
- Add integration tests for emergency server connectivity
- Document Zig target requirements for future contributors
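One possible shape for the shared validator is sketched below. The script path comes from the plan, but the function name `validate_image_ref`, the two regular expressions, and the choice to accept digest-pinned references are all assumptions. A dangling tag such as pr- is still syntactically valid and passes this check, so the empty-PR-number guard remains necessary alongside it.

```shell
#!/usr/bin/env bash
# Sketch for scripts/validate-image-ref.sh (function body is an assumption).
# Accepts ghcr.io/<owner>/<name> followed by either :tag or @sha256:<digest>.
validate_image_ref() {
  local ref="$1"
  # Tag form: first char is a word character, up to 128 characters in total.
  local tag_re='^ghcr\.io/[a-z0-9_-]+/[a-z0-9_-]+:[a-zA-Z0-9_][a-zA-Z0-9._-]{0,127}$'
  # Digest form: "sha256:" plus exactly 64 lowercase hex characters.
  local digest_re='^ghcr\.io/[a-z0-9_-]+/[a-z0-9_-]+@sha256:[a-f0-9]{64}$'

  if [[ "$ref" =~ $tag_re || "$ref" =~ $digest_re ]]; then
    return 0
  fi

  echo "ERROR: invalid image reference format: '${ref}'" >&2
  return 1
}
```

Both workflows could then call `validate_image_ref "${IMAGE_REF}" || exit 1` after constructing the reference instead of repeating the regex inline.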
Requirements (EARS Notation)
- WHEN GoReleaser builds darwin targets, THE SYSTEM SHALL use the -macos-none Zig target (not -macos-gnu).
- WHEN the Playwright workflow starts the Charon container, THE SYSTEM SHALL set CHARON_EMERGENCY_BIND=0.0.0.0:2020 to ensure the emergency server is reachable.
- WHEN constructing Docker image references, THE SYSTEM SHALL validate that the tag portion is non-empty before attempting to use it.
- IF the PR number is empty in a PR-triggered workflow, THEN THE SYSTEM SHALL fail fast with a clear error message explaining the issue.
- WHEN a feature branch contains / characters, THE SYSTEM SHALL sanitize the branch name by replacing / with - before using it as a Docker tag.
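The branch-sanitization requirement can be sketched in a few lines of shell. The function name `sanitize_branch_tag` is hypothetical, and the workflow's existing sanitize step (whose implementation is not shown in this plan) may differ.

```shell
#!/usr/bin/env bash
# Hypothetical sanitizer: turn a Git branch name into a Docker-tag-safe string
# by replacing every "/" with "-", and fail when the result would be empty.
sanitize_branch_tag() {
  local branch="$1"
  local tag="${branch//\//-}"

  if [[ -z "$tag" ]]; then
    echo "branch name is empty" >&2
    return 1
  fi

  printf '%s\n' "$tag"
}
```

For example, `sanitize_branch_tag "feature/new-thing"` prints `feature-new-thing`, which can be used as the tag portion of the image reference.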
Acceptance Criteria
- Nightly build completes successfully with darwin binaries
- Playwright E2E tests pass with emergency server accessible on port 2020
- Trivy scan passes with valid image reference for all trigger types
- Workflow failures produce clear, actionable error messages
- No regression in existing CI functionality
Risks & Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Zig target change breaks darwin binaries | Low | High | Test with local Zig build first |
| Emergency server env vars conflict with existing config | Low | Medium | Verify against docker-compose.playwright-ci.yml |
| IMAGE_REF validation too strict | Medium | Low | Use permissive regex, log values before validation |
Handoff Contract
{
"plan": "CI Workflow Failures - Fix Plan",
"status": "Ready for Implementation",
"owner": "DevOps",
"handoffTargets": ["Backend_Dev", "DevOps"],
"files": [
".goreleaser.yaml",
".github/workflows/playwright.yml",
".github/workflows/docker-build.yml"
],
"estimatedEffort": "2-3 hours",
"priority": "HIGH",
"blockedWorkflows": [
"nightly-build.yml",
"playwright.yml",
"docker-build.yml (Trivy scan step)"
]
}