CI Workflow Failures - Fix Plan

Version: 1.0
Status: Ready for Implementation
Priority: HIGH
Created: 2026-01-30
Scope: Three CI failures in GitHub Actions workflows


Executive Summary

Three CI workflows are failing in production. This plan documents the root causes, affected files, and specific fixes required for each issue:

  1. Nightly Build Failure: GoReleaser macOS cross-compile failing with incorrect Zig target
  2. Playwright E2E Failure: Emergency server unreachable on port 2020 due to missing env var
  3. Trivy Scan Failure: Invalid Docker image reference when PR number is missing

Issue 1: Nightly Build - GoReleaser macOS Cross-Compile Failure

Problem Statement

The nightly build fails during the GoReleaser release step when cross-compiling for macOS (darwin) using Zig:

release failed after 4m19s
error=
  build failed: exit status 1: go: downloading github.com/gin-gonic/gin v1.11.0
  info: zig can provide libc for related target x86_64-macos.11-none
target=darwin_amd64_v1

Root Cause Analysis

The darwin build in .goreleaser.yaml uses an incorrect Zig target specification:

Current (WRONG):

CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-gnu
CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-gnu

Issue: macOS uses its own libc (libSystem), not GNU libc. The -gnu suffix is invalid for macOS targets. Zig expects -macos-none or -macos.11-none for macOS builds.
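
As a rough illustration of the corrected mapping, the sketch below mirrors the template's architecture switch in shell. The function name zig_macos_target is hypothetical and not part of the repository. Unlike the GoReleaser template, which falls back to aarch64 for any non-amd64 value, this sketch rejects unknown architectures.

```shell
# Hypothetical helper: map a Go GOARCH value to a Zig macOS target triple.
# Uses the "-macos-none" suffix, since macOS has no GNU libc ABI.
zig_macos_target() {
  case "$1" in
    amd64) echo "x86_64-macos-none" ;;
    arm64) echo "aarch64-macos-none" ;;
    *)
      echo "unsupported GOARCH for darwin: $1" >&2
      return 1
      ;;
  esac
}

# Print the CC value each darwin architecture would use.
for arch in amd64 arm64; do
  echo "CC=zig cc -target $(zig_macos_target "$arch")"
done
```

Rejecting unknown values is a design choice for the sketch only; it makes a typo in the architecture name visible instead of silently selecting aarch64.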

Affected Files

| File | Change Type |
|------|-------------|
| .goreleaser.yaml | Fix Zig target for darwin builds |

Update the darwin build configuration to use the correct Zig target triple:

Option A: Use -macos-none (Recommended)

- id: darwin
  dir: backend
  main: ./cmd/api
  binary: charon
  env:
    - CGO_ENABLED=1
    - CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-none
    - CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-none

Option B: Specify macOS version (for specific SDK compatibility)

    - CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos.11-none
    - CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos.11-none

Option C: Remove darwin builds entirely (if macOS support is not required)

# Remove the entire `- id: darwin` build block from .goreleaser.yaml
# Update archives section to remove darwin from the `nix` archive builds

Implementation Details

--- a/.goreleaser.yaml
+++ b/.goreleaser.yaml
@@ -47,8 +47,8 @@
     binary: charon
     env:
       - CGO_ENABLED=1
-      - CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-gnu
-      - CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-gnu
+      - CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-none
+      - CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-none
     goos:
       - darwin
     goarch:

Verification

# Local test (requires Zig installed)
cd backend
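# GOOS/GOARCH must be set explicitly, otherwise go builds for the host platform
CGO_ENABLED=1 GOOS=darwin GOARCH=amd64 CC="zig cc -target x86_64-macos-none" go build -o charon-darwin ./cmd/api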

# Nightly workflow test
gh workflow run nightly-build.yml --ref development -f reason="Test darwin build fix"

Issue 2: Playwright E2E - Admin API Socket Hang Up

Problem Statement

Playwright test zzz-admin-whitelist-blocking.spec.ts:126 fails with:

Error: apiRequestContext.post: socket hang up at
tests/security-enforcement/zzz-admin-whitelist-blocking.spec.ts:126:21

The test POSTs to http://localhost:2020/emergency/security-reset but cannot reach the emergency server.

Root Cause Analysis

The playwright.yml workflow starts the Charon container but does not set the CHARON_EMERGENCY_BIND environment variable:

Current workflow (.github/workflows/playwright.yml):

docker run -d \
  --name charon-test \
  -p 8080:8080 \
  -p 127.0.0.1:2019:2019 \
  -p "[::1]:2019:2019" \
  -p 127.0.0.1:2020:2020 \
  -p "[::1]:2020:2020" \
  -e CHARON_ENV="${CHARON_ENV}" \
  -e CHARON_DEBUG="${CHARON_DEBUG}" \
  -e CHARON_ENCRYPTION_KEY="${CHARON_ENCRYPTION_KEY}" \
  -e CHARON_EMERGENCY_TOKEN="${CHARON_EMERGENCY_TOKEN}" \
  -e CHARON_EMERGENCY_SERVER_ENABLED="${CHARON_EMERGENCY_SERVER_ENABLED}" \
  "${IMAGE_REF}"

Missing: CHARON_EMERGENCY_BIND=0.0.0.0:2020

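Without this variable, the emergency server may not bind to the correct address, or may bind to a loopback-only address (for example 127.0.0.1:2020). A server bound to loopback listens only on the container's own loopback interface, which Docker's port mapping cannot reach, so requests from the host to port 2020 fail even though the port is published.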
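
To make the distinction concrete, here is a minimal sketch that classifies a bind address as loopback-only or not. The helper name is_loopback_bind is hypothetical, and the sketch does not know what default address Charon actually uses; it only illustrates why 0.0.0.0:2020 works with a published port while a loopback address would not.

```shell
# Hypothetical helper: succeed if a host:port bind address is loopback-only.
is_loopback_bind() {
  local host="${1%:*}"   # strip the trailing ":port"
  case "$host" in
    127.*|localhost|"[::1]") return 0 ;;
    *) return 1 ;;
  esac
}

# Compare a loopback bind with the value used in the compose file.
for bind in "127.0.0.1:2020" "[::1]:2020" "0.0.0.0:2020"; do
  if is_loopback_bind "$bind"; then
    echo "$bind: loopback-only inside the container, not reachable via -p"
  else
    echo "$bind: reachable via published port"
  fi
done
```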

Comparison with working compose file:

# .docker/compose/docker-compose.playwright-ci.yml
- CHARON_EMERGENCY_BIND=0.0.0.0:2020
- CHARON_EMERGENCY_USERNAME=admin
- CHARON_EMERGENCY_PASSWORD=changeme

Affected Files

| File | Change Type |
|------|-------------|
| .github/workflows/playwright.yml | Add missing emergency server env vars |

Add the missing emergency server environment variables, along with CHARON_SECURITY_TESTS_ENABLED, to the docker run command:

--- a/.github/workflows/playwright.yml
+++ b/.github/workflows/playwright.yml
@@ -163,6 +163,10 @@ jobs:
             -e CHARON_ENCRYPTION_KEY="${CHARON_ENCRYPTION_KEY}" \
             -e CHARON_EMERGENCY_TOKEN="${CHARON_EMERGENCY_TOKEN}" \
             -e CHARON_EMERGENCY_SERVER_ENABLED="${CHARON_EMERGENCY_SERVER_ENABLED}" \
+            -e CHARON_EMERGENCY_BIND="0.0.0.0:2020" \
+            -e CHARON_EMERGENCY_USERNAME="admin" \
+            -e CHARON_EMERGENCY_PASSWORD="changeme" \
+            -e CHARON_SECURITY_TESTS_ENABLED="true" \
             "${IMAGE_REF}"

Full Updated Step

      - name: Start Charon container
        if: steps.check-artifact.outputs.artifact_exists == 'true'
        run: |
          echo "🚀 Starting Charon container..."

          # Normalize image name (GitHub lowercases repository owner names in GHCR)
          IMAGE_NAME=$(echo "${{ github.repository_owner }}/charon" | tr '[:upper:]' '[:lower:]')
          if [[ "${{ steps.pr-info.outputs.is_push }}" == "true" ]]; then
            IMAGE_REF="ghcr.io/${IMAGE_NAME}:${{ steps.sanitize.outputs.branch }}"
          else
            IMAGE_REF="ghcr.io/${IMAGE_NAME}:pr-${{ steps.pr-info.outputs.pr_number }}"
          fi

          echo "📦 Starting container with image: ${IMAGE_REF}"
          docker run -d \
            --name charon-test \
            -p 8080:8080 \
            -p 127.0.0.1:2019:2019 \
            -p "[::1]:2019:2019" \
            -p 127.0.0.1:2020:2020 \
            -p "[::1]:2020:2020" \
            -e CHARON_ENV="${CHARON_ENV}" \
            -e CHARON_DEBUG="${CHARON_DEBUG}" \
            -e CHARON_ENCRYPTION_KEY="${CHARON_ENCRYPTION_KEY}" \
            -e CHARON_EMERGENCY_TOKEN="${CHARON_EMERGENCY_TOKEN}" \
            -e CHARON_EMERGENCY_SERVER_ENABLED="${CHARON_EMERGENCY_SERVER_ENABLED}" \
            -e CHARON_EMERGENCY_BIND="0.0.0.0:2020" \
            -e CHARON_EMERGENCY_USERNAME="admin" \
            -e CHARON_EMERGENCY_PASSWORD="changeme" \
            -e CHARON_SECURITY_TESTS_ENABLED="true" \
            "${IMAGE_REF}"

          echo "✅ Container started"

Verification

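# After fix, verify the emergency server is reachable from the host through the published port
# (a check run inside the container would pass even with a loopback-only bind)
curl -sf http://localhost:2020/health || echo "Failed"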

# Test emergency reset endpoint
curl -X POST http://localhost:2020/emergency/security-reset \
  -H "Authorization: Basic $(echo -n 'admin:changeme' | base64)" \
  -H "X-Emergency-Token: $CHARON_EMERGENCY_TOKEN"

Issue 3: Trivy Scan - Invalid Image Reference Format

Problem Statement

Trivy scan fails with "invalid image reference format" when:

  1. PR number is missing (manual dispatch without PR number)
  2. Feature branch names contain / characters (e.g., feature/new-thing)
  3. is_push and pr_number are both empty/false

These conditions produce invalid Docker tags such as:

  • ghcr.io/owner/charon:pr- (empty PR number)
  • ghcr.io/owner/charon: (no tag at all)

Root Cause Analysis

Location: .github/workflows/playwright.yml - "Start Charon container" step

if [[ "${{ steps.pr-info.outputs.is_push }}" == "true" ]]; then
  IMAGE_REF="ghcr.io/${IMAGE_NAME}:${{ steps.sanitize.outputs.branch }}"
else
  IMAGE_REF="ghcr.io/${IMAGE_NAME}:pr-${{ steps.pr-info.outputs.pr_number }}"
fi

Problem: When is_push != "true" AND pr_number is empty, this creates:

IMAGE_REF="ghcr.io/owner/charon:pr-"

This is an invalid Docker reference.

Affected Files

| File | Change Type |
|------|-------------|
| .github/workflows/playwright.yml | Add validation for IMAGE_REF |
| .github/workflows/docker-build.yml | Add validation guards (CVE verification step) |

Add defensive validation to fail fast with a clear error message:

--- a/.github/workflows/playwright.yml
+++ b/.github/workflows/playwright.yml
           # Normalize image name (GitHub lowercases repository owner names in GHCR)
           IMAGE_NAME=$(echo "${{ github.repository_owner }}/charon" | tr '[:upper:]' '[:lower:]')

           if [[ "${{ steps.pr-info.outputs.is_push }}" == "true" ]]; then
             IMAGE_REF="ghcr.io/${IMAGE_NAME}:${{ steps.sanitize.outputs.branch }}"
-          else
+          elif [[ -n "${{ steps.pr-info.outputs.pr_number }}" ]]; then
             IMAGE_REF="ghcr.io/${IMAGE_NAME}:pr-${{ steps.pr-info.outputs.pr_number }}"
+          else
+            echo "❌ ERROR: Cannot determine image reference"
+            echo "  - is_push: ${{ steps.pr-info.outputs.is_push }}"
+            echo "  - pr_number: ${{ steps.pr-info.outputs.pr_number }}"
+            echo "  - branch: ${{ steps.sanitize.outputs.branch }}"
+            echo ""
+            echo "This can happen when:"
+            echo "  1. workflow_dispatch without pr_number input"
+            echo "  2. workflow_run triggered by non-PR, non-push event"
+            exit 1
           fi

+          # Validate the image reference format
+          if [[ ! "${IMAGE_REF}" =~ ^ghcr\.io/[a-z0-9_-]+/[a-z0-9_-]+:[a-zA-Z0-9._-]+$ ]]; then
+            echo "❌ ERROR: Invalid image reference format: ${IMAGE_REF}"
+            exit 1
+          fi
+
           echo "📦 Starting container with image: ${IMAGE_REF}"
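
The guarded branch logic in the diff above can be exercised locally with a sketch like the following. The function name resolve_image_ref and its positional arguments are assumptions made for illustration; in the workflow the values come from step outputs. The extra check for an empty branch is an addition of this sketch, not part of the diff.

```shell
# Hypothetical helper mirroring the if/elif/else from the diff.
# Args: image_name is_push branch pr_number
resolve_image_ref() {
  local image_name="$1" is_push="$2" branch="$3" pr_number="$4"
  if [ "$is_push" = "true" ] && [ -n "$branch" ]; then
    echo "ghcr.io/${image_name}:${branch}"
  elif [ -n "$pr_number" ]; then
    echo "ghcr.io/${image_name}:pr-${pr_number}"
  else
    # Fail fast instead of emitting "ghcr.io/...:pr-" or "ghcr.io/...:"
    echo "ERROR: cannot determine image reference (is_push='${is_push}', pr_number='${pr_number}')" >&2
    return 1
  fi
}

resolve_image_ref "owner/charon" "true" "main" ""
resolve_image_ref "owner/charon" "false" "" "42"
resolve_image_ref "owner/charon" "false" "" "" || echo "refused to build a reference"
```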

Additional Fix for docker-build.yml

The same issue can occur in docker-build.yml at the CVE verification step:

# Line ~174 in docker-build.yml
if [ "${{ github.event_name }}" = "pull_request" ]; then
  IMAGE_REF="${{ env.GHCR_REGISTRY }}/${{ env.IMAGE_NAME }}:pr-${{ github.event.pull_request.number }}"

Fix:

--- a/.github/workflows/docker-build.yml
+++ b/.github/workflows/docker-build.yml
           # Determine the image reference based on event type
           if [ "${{ github.event_name }}" = "pull_request" ]; then
-            IMAGE_REF="${{ env.GHCR_REGISTRY }}/${{ env.IMAGE_NAME }}:pr-${{ github.event.pull_request.number }}"
+            PR_NUM="${{ github.event.pull_request.number }}"
+            if [ -z "${PR_NUM}" ]; then
+              echo "❌ ERROR: Pull request number is empty"
+              exit 1
+            fi
+            IMAGE_REF="${{ env.GHCR_REGISTRY }}/${{ env.IMAGE_NAME }}:pr-${PR_NUM}"
             echo "Using PR image: $IMAGE_REF"
           else
             IMAGE_REF="${{ env.GHCR_REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build-and-push.outputs.digest }}"
+            if [ -z "${{ steps.build-and-push.outputs.digest }}" ]; then
+              echo "❌ ERROR: Build digest is empty"
+              exit 1
+            fi
             echo "Using digest: $IMAGE_REF"
           fi

Verification

# Test with empty PR number (should fail fast with clear error)
gh workflow run playwright.yml --ref development

# Check IMAGE_REF construction in logs
gh run view --log | grep "IMAGE_REF"

Implementation Plan

Phase 1: Immediate Fixes (Single PR)

Objective: Fix all three CI failures in a single PR for immediate resolution.

Files to Modify:

| File | Changes |
|------|---------|
| .goreleaser.yaml | Change -macos-gnu to -macos-none for darwin builds |
| .github/workflows/playwright.yml | Add missing emergency server env vars; add IMAGE_REF validation |
| .github/workflows/docker-build.yml | Add IMAGE_REF validation guards |

Phase 2: Verification

  1. Push changes to a feature branch
  2. Open PR to trigger docker-build.yml
  3. Verify Trivy scan passes with valid IMAGE_REF
  4. Verify Playwright workflow if triggered
  5. Manually trigger nightly-build.yml with --ref pointing to feature branch
  6. Verify darwin build succeeds

Phase 3: Cleanup (Optional)

  1. Add validation logic to a shared script (scripts/validate-image-ref.sh)
  2. Add integration tests for emergency server connectivity
  3. Document Zig target requirements for future contributors
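
A shared validation script along the lines of item 1 might look like the sketch below. The function name validate_image_ref is an assumption, and the pattern is the one used in the playwright.yml diff. Note that the pattern on its own accepts a tag of exactly pr- (the hyphen is in the allowed character class), so this sketch rejects that case explicitly.

```shell
# Hypothetical contents of scripts/validate-image-ref.sh (sketch only).
validate_image_ref() {
  local ref="$1"
  # Reject a dangling "pr-" tag explicitly; the pattern below would accept it.
  case "$ref" in
    *:pr-) return 1 ;;
  esac
  printf '%s\n' "$ref" |
    grep -Eq '^ghcr\.io/[a-z0-9_-]+/[a-z0-9_-]+:[a-zA-Z0-9._-]+$'
}

# Try a valid reference and the failure cases described in Issue 3.
for ref in \
  "ghcr.io/owner/charon:pr-42" \
  "ghcr.io/owner/charon:pr-" \
  "ghcr.io/owner/charon:" \
  "ghcr.io/owner/charon:feature/new-thing"; do
  if validate_image_ref "$ref"; then
    echo "valid:   $ref"
  else
    echo "invalid: $ref"
  fi
done
```

Using grep -E rather than the bash-only [[ =~ ]] operator keeps the sketch usable from a plain sh step; that is a choice made here, not something the workflows require.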

Requirements (EARS Notation)

  1. WHEN GoReleaser builds darwin targets, THE SYSTEM SHALL use the -macos-none Zig target (not -macos-gnu).
  2. WHEN the Playwright workflow starts the Charon container, THE SYSTEM SHALL set CHARON_EMERGENCY_BIND=0.0.0.0:2020 to ensure the emergency server is reachable.
  3. WHEN constructing Docker image references, THE SYSTEM SHALL validate that the tag portion is non-empty before attempting to use it.
  4. IF the PR number is empty in a PR-triggered workflow, THEN THE SYSTEM SHALL fail fast with a clear error message explaining the issue.
  5. WHEN a feature branch contains / characters, THE SYSTEM SHALL sanitize the branch name by replacing / with - before using it as a Docker tag.
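
Requirement 5 can be sketched as a one-line shell function. The name sanitize_branch is hypothetical; the implementation of the existing sanitize step in the workflow is not shown in this plan and may differ.

```shell
# Hypothetical helper: make a branch name usable as a Docker tag
# by replacing every "/" with "-".
sanitize_branch() {
  printf '%s\n' "$1" | tr '/' '-'
}

sanitize_branch "feature/new-thing"
sanitize_branch "development"
```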

Acceptance Criteria

  1. Nightly build completes successfully with darwin binaries
  2. Playwright E2E tests pass with emergency server accessible on port 2020
  3. Trivy scan passes with valid image reference for all trigger types
  4. Workflow failures produce clear, actionable error messages
  5. No regression in existing CI functionality

Risks & Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Zig target change breaks darwin binaries | Low | High | Test with local Zig build first |
| Emergency server env vars conflict with existing config | Low | Medium | Verify against docker-compose.playwright-ci.yml |
| IMAGE_REF validation too strict | Medium | Low | Use permissive regex, log values before validation |

Handoff Contract

{
  "plan": "CI Workflow Failures - Fix Plan",
  "status": "Ready for Implementation",
  "owner": "DevOps",
  "handoffTargets": ["Backend_Dev", "DevOps"],
  "files": [
    ".goreleaser.yaml",
    ".github/workflows/playwright.yml",
    ".github/workflows/docker-build.yml"
  ],
  "estimatedEffort": "2-3 hours",
  "priority": "HIGH",
  "blockedWorkflows": [
    "nightly-build.yml",
    "playwright.yml",
    "docker-build.yml (Trivy scan step)"
  ]
}

References