CI Workflow Failures - Fix Plan
Version: 1.0
Status: Ready for Implementation
Priority: HIGH
Created: 2026-01-30
Scope: Three CI failures in GitHub Actions workflows
Executive Summary
Three CI workflows are failing in production. This plan documents the root causes, affected files, and specific fixes required for each issue:
- Nightly Build Failure: GoReleaser macOS cross-compile failing with incorrect Zig target
- Playwright E2E Failure: Emergency server unreachable on port 2020 due to missing env var
- Trivy Scan Failure: Invalid Docker image reference when PR number is missing
Issue 1: Nightly Build - GoReleaser macOS Cross-Compile Failure
Problem Statement
The nightly build fails during GoReleaser release step when cross-compiling for macOS (darwin) using Zig:
release failed after 4m19s
error=
build failed: exit status 1: go: downloading github.com/gin-gonic/gin v1.11.0
info: zig can provide libc for related target x86_64-macos.11-none
target=darwin_amd64_v1
Root Cause Analysis
The .goreleaser.yaml darwin build uses incorrect Zig target specification:
Current (WRONG):
CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-gnu
CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-gnu
Issue: macOS uses its own libc (libSystem), not GNU libc. The -gnu suffix is invalid for macOS targets. Zig expects -macos-none or -macos.11-none for macOS builds.
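The mapping that the GoReleaser template performs can be sketched as a small shell function. This is an illustration only: `zig_macos_target` is a hypothetical helper that does not exist in the repository, and it simply shows how the architecture and an optional minimum macOS version combine into the target triples named above.

```shell
# Hypothetical helper (not in the repository): build the Zig target triple
# for a macOS build from a Go architecture name.
zig_macos_target() {
  zmt_goarch="$1"
  zmt_min_version="${2:-}"   # optional minimum macOS version, e.g. "11"

  case "$zmt_goarch" in
    amd64) zmt_cpu="x86_64" ;;
    arm64) zmt_cpu="aarch64" ;;
    *)
      echo "unsupported GOARCH: $zmt_goarch" >&2
      return 1
      ;;
  esac

  # macOS targets end in -none, never -gnu.
  if [ -n "$zmt_min_version" ]; then
    echo "${zmt_cpu}-macos.${zmt_min_version}-none"
  else
    echo "${zmt_cpu}-macos-none"
  fi
}

zig_macos_target amd64      # x86_64-macos-none
zig_macos_target arm64 11   # aarch64-macos.11-none
```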
Affected Files
| File | Change Type |
|---|---|
| .goreleaser.yaml | Fix Zig target for darwin builds |
Recommended Fix
Update the darwin build configuration to use the correct Zig target triple:
Option A: Use -macos-none (Recommended)
- id: darwin
  dir: backend
  main: ./cmd/api
  binary: charon
  env:
    - CGO_ENABLED=1
    - CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-none
    - CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-none
Option B: Specify macOS version (for specific SDK compatibility)
- CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos.11-none
- CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos.11-none
Option C: Remove darwin builds entirely (if macOS support is not required)
# Remove the entire `- id: darwin` build block from .goreleaser.yaml
# Update archives section to remove darwin from the `nix` archive builds
Implementation Details
--- a/.goreleaser.yaml
+++ b/.goreleaser.yaml
@@ -47,8 +47,8 @@
binary: charon
env:
- CGO_ENABLED=1
- - CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-gnu
- - CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-gnu
+ - CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-none
+ - CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-none
goos:
- darwin
goarch:
Verification
# Local test (requires Zig installed)
cd backend
GOOS=darwin GOARCH=amd64 CGO_ENABLED=1 CC="zig cc -target x86_64-macos-none" go build -o charon-darwin ./cmd/api
# Nightly workflow test
gh workflow run nightly-build.yml --ref development -f reason="Test darwin build fix"
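Beyond a successful exit code, it may be worth confirming that the local build actually produced a macOS binary. The following is a minimal sketch, assuming a thin (single-architecture) 64-bit Mach-O file, whose first four bytes on disk are `cf fa ed fe`. The name `is_macho64` is hypothetical.

```shell
# Hypothetical helper: succeed if the file starts with the 64-bit Mach-O
# magic number (0xfeedfacf, stored little-endian as cf fa ed fe).
is_macho64() {
  im_magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$im_magic" = "cffaedfe" ]
}

# Usage after the local build:
#   is_macho64 backend/charon-darwin && echo "Mach-O binary produced"
```

A universal (fat) binary starts with a different magic number, so this check would need extending if such binaries are ever produced.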
Issue 2: Playwright E2E - Admin API Socket Hang Up
Problem Statement
Playwright test zzz-admin-whitelist-blocking.spec.ts:126 fails with:
Error: apiRequestContext.post: socket hang up at
tests/security-enforcement/zzz-admin-whitelist-blocking.spec.ts:126:21
The test POSTs to http://localhost:2020/emergency/security-reset but cannot reach the emergency server.
Root Cause Analysis
The playwright.yml workflow starts the Charon container but does not set the CHARON_EMERGENCY_BIND environment variable:
Current workflow (.github/workflows/playwright.yml):
docker run -d \
--name charon-test \
-p 8080:8080 \
-p 127.0.0.1:2019:2019 \
-p "[::1]:2019:2019" \
-p 127.0.0.1:2020:2020 \
-p "[::1]:2020:2020" \
-e CHARON_ENV="${CHARON_ENV}" \
-e CHARON_DEBUG="${CHARON_DEBUG}" \
-e CHARON_ENCRYPTION_KEY="${CHARON_ENCRYPTION_KEY}" \
-e CHARON_EMERGENCY_TOKEN="${CHARON_EMERGENCY_TOKEN}" \
-e CHARON_EMERGENCY_SERVER_ENABLED="${CHARON_EMERGENCY_SERVER_ENABLED}" \
"${IMAGE_REF}"
Missing: CHARON_EMERGENCY_BIND=0.0.0.0:2020
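Without this variable, the emergency server may not bind to the correct address, or may bind to a loopback-only address inside the container. Docker's published ports forward traffic to the container's network interface rather than to its loopback, so a listener on 127.0.0.1:2020 inside the container is unreachable from the host even though -p 127.0.0.1:2020:2020 is set.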
Comparison with working compose file:
# .docker/compose/docker-compose.playwright-ci.yml
- CHARON_EMERGENCY_BIND=0.0.0.0:2020
- CHARON_EMERGENCY_USERNAME=admin
- CHARON_EMERGENCY_PASSWORD=changeme
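The bind-address reasoning can be expressed as a small pre-flight check. This is a sketch only: `bind_is_publishable` is a hypothetical name, and it assumes the bind value has the `host:port` form shown in the compose file. Treating an empty value as unsafe is also an assumption, since the application's default bind address is not stated in this plan.

```shell
# Hypothetical pre-flight check: succeed only when the bind address is one
# that Docker port publishing can reach (not loopback inside the container).
bind_is_publishable() {
  case "$1" in
    "") return 1 ;;                            # unset: depends on the app default (assumed unsafe)
    127.*|localhost:*|"[::1]:"*) return 1 ;;   # loopback-only listeners
    *) return 0 ;;                             # e.g. 0.0.0.0:2020 or :2020
  esac
}

# Usage:
#   bind_is_publishable "$CHARON_EMERGENCY_BIND" || echo "emergency server will be unreachable"
```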
Affected Files
| File | Change Type |
|---|---|
| .github/workflows/playwright.yml | Add missing emergency server env vars |
Recommended Fix
Add the missing emergency server environment variables to the docker run command:
--- a/.github/workflows/playwright.yml
+++ b/.github/workflows/playwright.yml
@@ -163,6 +163,10 @@ jobs:
-e CHARON_ENCRYPTION_KEY="${CHARON_ENCRYPTION_KEY}" \
-e CHARON_EMERGENCY_TOKEN="${CHARON_EMERGENCY_TOKEN}" \
-e CHARON_EMERGENCY_SERVER_ENABLED="${CHARON_EMERGENCY_SERVER_ENABLED}" \
+ -e CHARON_EMERGENCY_BIND="0.0.0.0:2020" \
+ -e CHARON_EMERGENCY_USERNAME="admin" \
+ -e CHARON_EMERGENCY_PASSWORD="changeme" \
+ -e CHARON_SECURITY_TESTS_ENABLED="true" \
"${IMAGE_REF}"
Full Updated Step
- name: Start Charon container
  if: steps.check-artifact.outputs.artifact_exists == 'true'
  run: |
    echo "🚀 Starting Charon container..."
    # Normalize image name (GitHub lowercases repository owner names in GHCR)
    IMAGE_NAME=$(echo "${{ github.repository_owner }}/charon" | tr '[:upper:]' '[:lower:]')
    if [[ "${{ steps.pr-info.outputs.is_push }}" == "true" ]]; then
      IMAGE_REF="ghcr.io/${IMAGE_NAME}:${{ steps.sanitize.outputs.branch }}"
    else
      IMAGE_REF="ghcr.io/${IMAGE_NAME}:pr-${{ steps.pr-info.outputs.pr_number }}"
    fi
    echo "📦 Starting container with image: ${IMAGE_REF}"
    docker run -d \
      --name charon-test \
      -p 8080:8080 \
      -p 127.0.0.1:2019:2019 \
      -p "[::1]:2019:2019" \
      -p 127.0.0.1:2020:2020 \
      -p "[::1]:2020:2020" \
      -e CHARON_ENV="${CHARON_ENV}" \
      -e CHARON_DEBUG="${CHARON_DEBUG}" \
      -e CHARON_ENCRYPTION_KEY="${CHARON_ENCRYPTION_KEY}" \
      -e CHARON_EMERGENCY_TOKEN="${CHARON_EMERGENCY_TOKEN}" \
      -e CHARON_EMERGENCY_SERVER_ENABLED="${CHARON_EMERGENCY_SERVER_ENABLED}" \
      -e CHARON_EMERGENCY_BIND="0.0.0.0:2020" \
      -e CHARON_EMERGENCY_USERNAME="admin" \
      -e CHARON_EMERGENCY_PASSWORD="changeme" \
      -e CHARON_SECURITY_TESTS_ENABLED="true" \
      "${IMAGE_REF}"
    echo "✅ Container started"
Verification
# After fix, verify emergency server is listening
docker exec charon-test curl -sf http://localhost:2020/health || echo "Failed"
# Test emergency reset endpoint
curl -X POST http://localhost:2020/emergency/security-reset \
-H "Authorization: Basic $(echo -n 'admin:changeme' | base64)" \
-H "X-Emergency-Token: $CHARON_EMERGENCY_TOKEN"
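A single curl immediately after `docker run` can fail simply because the server has not finished starting. One way to make the check more robust is a small retry loop. The sketch below is an assumption about how that could look; `wait_for` is a hypothetical helper, and the `/health` path is taken from the verification command in this plan.

```shell
# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds between failed attempts. Returns 0 on the first success.
wait_for() {
  wf_attempts="$1"
  wf_delay="$2"
  shift 2

  wf_i=1
  while [ "$wf_i" -le "$wf_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Only sleep when another attempt remains.
    if [ "$wf_i" -lt "$wf_attempts" ]; then
      sleep "$wf_delay"
    fi
    wf_i=$((wf_i + 1))
  done
  return 1
}

# Usage, assuming the /health endpoint used in the verification commands:
#   wait_for 30 1 curl -sf http://localhost:2020/health
```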
Issue 3: Trivy Scan - Invalid Image Reference Format
Problem Statement
Trivy scan fails with "invalid image reference format" when:
- PR number is missing (manual dispatch without PR number)
- Feature branch names contain / characters (e.g., feature/new-thing)
- is_push and pr_number are both empty/false
Resulting in invalid Docker tags like:
- ghcr.io/owner/charon:pr- (empty PR number)
- ghcr.io/owner/charon: (no tag at all)
Root Cause Analysis
Location: .github/workflows/playwright.yml - "Start Charon container" step
if [[ "${{ steps.pr-info.outputs.is_push }}" == "true" ]]; then
IMAGE_REF="ghcr.io/${IMAGE_NAME}:${{ steps.sanitize.outputs.branch }}"
else
IMAGE_REF="ghcr.io/${IMAGE_NAME}:pr-${{ steps.pr-info.outputs.pr_number }}"
fi
Problem: When is_push != "true" AND pr_number is empty, this creates:
IMAGE_REF="ghcr.io/owner/charon:pr-"
This is an invalid Docker reference.
Affected Files
| File | Change Type |
|---|---|
| .github/workflows/playwright.yml | Add validation for IMAGE_REF |
| .github/workflows/docker-build.yml | Add validation guards (CVE verification step) |
Recommended Fix
Add defensive validation to fail fast with a clear error message:
--- a/.github/workflows/playwright.yml
+++ b/.github/workflows/playwright.yml
# Normalize image name (GitHub lowercases repository owner names in GHCR)
IMAGE_NAME=$(echo "${{ github.repository_owner }}/charon" | tr '[:upper:]' '[:lower:]')
if [[ "${{ steps.pr-info.outputs.is_push }}" == "true" ]]; then
IMAGE_REF="ghcr.io/${IMAGE_NAME}:${{ steps.sanitize.outputs.branch }}"
- else
+ elif [[ -n "${{ steps.pr-info.outputs.pr_number }}" ]]; then
IMAGE_REF="ghcr.io/${IMAGE_NAME}:pr-${{ steps.pr-info.outputs.pr_number }}"
+ else
+ echo "❌ ERROR: Cannot determine image reference"
+ echo " - is_push: ${{ steps.pr-info.outputs.is_push }}"
+ echo " - pr_number: ${{ steps.pr-info.outputs.pr_number }}"
+ echo " - branch: ${{ steps.sanitize.outputs.branch }}"
+ echo ""
+ echo "This can happen when:"
+ echo " 1. workflow_dispatch without pr_number input"
+ echo " 2. workflow_run triggered by non-PR, non-push event"
+ exit 1
fi
+ # Validate the image reference format
+ if [[ ! "${IMAGE_REF}" =~ ^ghcr\.io/[a-z0-9_-]+/[a-z0-9_-]+:[a-zA-Z0-9_][a-zA-Z0-9._-]{0,127}$ ]]; then
+ echo "❌ ERROR: Invalid image reference format: ${IMAGE_REF}"
+ exit 1
+ fi
+
echo "📦 Starting container with image: ${IMAGE_REF}"
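The branching logic of this guard can be tried outside the workflow as a plain shell function. This is a sketch under stated assumptions: `resolve_image_ref` is a hypothetical name, its four arguments stand in for the image name and the `is_push`, sanitized branch, and `pr_number` workflow outputs, and the extra check for an empty branch on push is an addition that the diff itself does not make.

```shell
# Hypothetical standalone version of the guard. Arguments:
#   $1 image name (owner/charon), $2 is_push, $3 sanitized branch, $4 pr_number
resolve_image_ref() {
  rir_image="$1"
  rir_is_push="$2"
  rir_branch="$3"
  rir_pr="$4"

  if [ "$rir_is_push" = "true" ] && [ -n "$rir_branch" ]; then
    echo "ghcr.io/${rir_image}:${rir_branch}"
  elif [ "$rir_is_push" != "true" ] && [ -n "$rir_pr" ]; then
    echo "ghcr.io/${rir_image}:pr-${rir_pr}"
  else
    # Fail fast instead of emitting a reference with an empty tag.
    echo "cannot determine image reference (is_push='$rir_is_push', branch='$rir_branch', pr_number='$rir_pr')" >&2
    return 1
  fi
}

resolve_image_ref owner/charon false "" 42   # ghcr.io/owner/charon:pr-42
```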
Additional Fix for docker-build.yml
The same issue can occur in docker-build.yml at the CVE verification step:
# Line ~174 in docker-build.yml
if [ "${{ github.event_name }}" = "pull_request" ]; then
IMAGE_REF="${{ env.GHCR_REGISTRY }}/${{ env.IMAGE_NAME }}:pr-${{ github.event.pull_request.number }}"
Fix:
--- a/.github/workflows/docker-build.yml
+++ b/.github/workflows/docker-build.yml
# Determine the image reference based on event type
if [ "${{ github.event_name }}" = "pull_request" ]; then
- IMAGE_REF="${{ env.GHCR_REGISTRY }}/${{ env.IMAGE_NAME }}:pr-${{ github.event.pull_request.number }}"
+ PR_NUM="${{ github.event.pull_request.number }}"
+ if [ -z "${PR_NUM}" ]; then
+ echo "❌ ERROR: Pull request number is empty"
+ exit 1
+ fi
+ IMAGE_REF="${{ env.GHCR_REGISTRY }}/${{ env.IMAGE_NAME }}:pr-${PR_NUM}"
echo "Using PR image: $IMAGE_REF"
else
IMAGE_REF="${{ env.GHCR_REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build-and-push.outputs.digest }}"
+ if [ -z "${{ steps.build-and-push.outputs.digest }}" ]; then
+ echo "❌ ERROR: Build digest is empty"
+ exit 1
+ fi
echo "Using digest: $IMAGE_REF"
fi
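The digest guard above only checks for an empty value. A slightly stricter variant could also check the digest's shape. The sketch below assumes the build step emits the usual `sha256:` prefix followed by 64 lowercase hex characters; `is_sha256_digest` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only for a digest of the form
# sha256:<64 lowercase hex characters>.
is_sha256_digest() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^sha256:[0-9a-f]{64}$'
}

# Usage inside the workflow step (DIGEST is an assumed variable name):
#   is_sha256_digest "$DIGEST" || { echo "unexpected digest: $DIGEST"; exit 1; }
```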
Verification
# Test with empty PR number (should fail fast with clear error)
gh workflow run playwright.yml --ref development
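# Check IMAGE_REF construction in logs (replace <run-id> with the ID shown by `gh run list`;
# without an ID, `gh run view` prompts interactively and fails when piped)
gh run view <run-id> --log | grep "IMAGE_REF"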
Implementation Plan
Phase 1: Immediate Fixes (Single PR)
Objective: Fix all three CI failures in a single PR for immediate resolution.
Files to Modify:
| File | Changes |
|---|---|
| .goreleaser.yaml | Change -macos-gnu to -macos-none for darwin builds |
| .github/workflows/playwright.yml | Add missing emergency server env vars; Add IMAGE_REF validation |
| .github/workflows/docker-build.yml | Add IMAGE_REF validation guards |
Phase 2: Verification
- Push changes to a feature branch
- Open PR to trigger docker-build.yml
- Verify Trivy scan passes with valid IMAGE_REF
- Verify Playwright workflow if triggered
- Manually trigger nightly-build.yml with --ref pointing to feature branch
- Verify darwin build succeeds
Phase 3: Cleanup (Optional)
- Add validation logic to a shared script (scripts/validate-image-ref.sh)
- Add integration tests for emergency server connectivity
- Document Zig target requirements for future contributors
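As a starting point for the shared script, the following sketch shows one possible shape. The path scripts/validate-image-ref.sh comes from this plan, but the contents are an assumption. Note that a purely syntactic check accepts a bare `pr-` tag, so the sketch rejects that case explicitly as a project-specific rule.

```shell
# Sketch of a possible scripts/validate-image-ref.sh body (contents assumed).
validate_image_ref() {
  vir_ref="$1"

  # Project-specific rule: a bare "pr-" tag means the PR number was empty.
  case "$vir_ref" in
    *:pr-) return 1 ;;
  esac

  # Syntactic check: ghcr.io/<owner>/<image>:<tag>, where the tag starts with
  # a letter, digit, or underscore and is at most 128 characters long.
  printf '%s\n' "$vir_ref" |
    LC_ALL=C grep -Eq '^ghcr\.io/[a-z0-9_-]+/[a-z0-9_-]+:[A-Za-z0-9_][A-Za-z0-9._-]{0,127}$'
}

# Usage:
#   validate_image_ref "$IMAGE_REF" || { echo "Invalid image reference: $IMAGE_REF"; exit 1; }
```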
Requirements (EARS Notation)
- WHEN GoReleaser builds darwin targets, THE SYSTEM SHALL use -macos-none Zig target (not -macos-gnu).
- WHEN the Playwright workflow starts the Charon container, THE SYSTEM SHALL set CHARON_EMERGENCY_BIND=0.0.0.0:2020 to ensure the emergency server is reachable.
- WHEN constructing Docker image references, THE SYSTEM SHALL validate that the tag portion is non-empty before attempting to use it.
- IF the PR number is empty in a PR-triggered workflow, THEN THE SYSTEM SHALL fail fast with a clear error message explaining the issue.
- WHEN a feature branch contains / characters, THE SYSTEM SHALL sanitize the branch name by replacing / with - before using it as a Docker tag.
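The branch-sanitizing requirement can be illustrated with a one-line function. The workflow's actual `sanitize` step is not shown in this plan, so `sanitize_branch_tag` is a hypothetical name and the sketch covers only the `/` to `-` replacement stated in the requirement.

```shell
# Hypothetical helper: replace every "/" in a branch name with "-" so the
# result can be used as a Docker tag.
sanitize_branch_tag() {
  printf '%s' "$1" | tr '/' '-'
}

sanitize_branch_tag "feature/new-thing"   # feature-new-thing
```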
Acceptance Criteria
- Nightly build completes successfully with darwin binaries
- Playwright E2E tests pass with emergency server accessible on port 2020
- Trivy scan passes with valid image reference for all trigger types
- Workflow failures produce clear, actionable error messages
- No regression in existing CI functionality
Risks & Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Zig target change breaks darwin binaries | Low | High | Test with local Zig build first |
| Emergency server env vars conflict with existing config | Low | Medium | Verify against docker-compose.playwright-ci.yml |
| IMAGE_REF validation too strict | Medium | Low | Use permissive regex, log values before validation |
Handoff Contract
{
"plan": "CI Workflow Failures - Fix Plan",
"status": "Ready for Implementation",
"owner": "DevOps",
"handoffTargets": ["Backend_Dev", "DevOps"],
"files": [
".goreleaser.yaml",
".github/workflows/playwright.yml",
".github/workflows/docker-build.yml"
],
"estimatedEffort": "2-3 hours",
"priority": "HIGH",
"blockedWorkflows": [
"nightly-build.yml",
"playwright.yml",
"docker-build.yml (Trivy scan step)"
]
}