# Docker Compose CI Failure Remediation Plan
**Status**: Active
**Created**: 2026-01-30
**Priority**: CRITICAL (Blocking CI)
---
## Executive Summary
The E2E test workflow (`e2e-tests.yml`) is failing when attempting to start containers via `docker-compose.playwright-ci.yml`. The root cause is an incorrect Docker image reference format in the compose file that attempts to use a bare SHA256 digest instead of a fully-qualified image reference with registry and repository.
**Error Message**:
```
charon-app Error pull access denied for sha256, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
```
**Root Cause**: The compose file's `image:` directive evaluates to a bare SHA256 digest (e.g., `sha256:057a9998...`) instead of a properly formatted image reference like `ghcr.io/wikid82/charon@sha256:057a9998...`.
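
The difference between the two forms can be sketched in shell. This is a minimal illustration, not the compose file's actual logic: the function name `qualify_image_ref` is hypothetical, and the repository `ghcr.io/wikid82/charon` is taken from the example above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn a bare digest into a fully-qualified image reference.
# Usage: qualify_image_ref <registry/repository> <digest-or-reference>
qualify_image_ref() {
  local repo="$1" ref="$2"
  case "$ref" in
    sha256:*)
      # Bare digest: prepend the repository so the registry can be resolved.
      printf '%s@%s\n' "$repo" "$ref"
      ;;
    "")
      echo "empty image reference" >&2
      return 1
      ;;
    *)
      # Anything else is assumed to be qualified already and passes through.
      printf '%s\n' "$ref"
      ;;
  esac
}
```

For example, `qualify_image_ref ghcr.io/wikid82/charon "sha256:057a9998..."` would print the `ghcr.io/wikid82/charon@sha256:057a9998...` form that Docker can pull.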
---
## Issue 1: Nightly Build - GoReleaser macOS Cross-Compile Failure
### Problem Statement
The nightly build fails during the GoReleaser release step when cross-compiling for macOS (darwin) using Zig:
```text
release failed after 4m19s
error=
build failed: exit status 1: go: downloading github.com/gin-gonic/gin v1.11.0
info: zig can provide libc for related target x86_64-macos.11-none
target=darwin_amd64_v1
```
### Root Cause Analysis
The darwin build in `.goreleaser.yaml` uses an incorrect Zig target specification:
**Current (WRONG):**
```yaml
CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-gnu
CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-gnu
```
**Issue:** macOS uses its own libc (libSystem), not GNU libc. The `-gnu` suffix is invalid for macOS targets. Zig expects `-macos-none` or `-macos.11-none` for macOS builds.
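
The architecture-to-target mapping that the GoReleaser template performs can be written out explicitly for local testing. The sketch below is an assumption-level helper (the name `zig_macos_target` is hypothetical); it only encodes the `-macos-none` triples described above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a Go architecture to a Zig macOS target triple.
zig_macos_target() {
  case "$1" in
    amd64) echo "x86_64-macos-none" ;;
    arm64) echo "aarch64-macos-none" ;;
    *)
      echo "unsupported GOARCH: $1" >&2
      return 1
      ;;
  esac
}
```

A local build could then set `CC="zig cc -target $(zig_macos_target amd64)"`, mirroring what the template expands to for `darwin_amd64`.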
### Affected Files
| File | Change Type |
|------|-------------|
| `.goreleaser.yaml` | Fix Zig target for darwin builds |
### Recommended Fix
Update the darwin build configuration to use the correct Zig target triple:
**Option A: Use `-macos-none` (Recommended)**
```yaml
- id: darwin
  dir: backend
  main: ./cmd/api
  binary: charon
  env:
    - CGO_ENABLED=1
    - CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-none
    - CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-none
```
**Option B: Specify macOS version (for specific SDK compatibility)**
```yaml
- CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos.11-none
- CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos.11-none
```
**Option C: Remove darwin builds entirely (if macOS support is not required)**
```yaml
# Remove the entire `- id: darwin` build block from .goreleaser.yaml
# Update archives section to remove darwin from the `nix` archive builds
```
### Implementation Details
```diff
--- a/.goreleaser.yaml
+++ b/.goreleaser.yaml
@@ -47,8 +47,8 @@
     binary: charon
     env:
       - CGO_ENABLED=1
-      - CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-gnu
-      - CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-gnu
+      - CC=zig cc -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-none
+      - CXX=zig c++ -target {{ if eq .Arch "amd64" }}x86_64{{ else }}aarch64{{ end }}-macos-none
     goos:
       - darwin
     goarch:
```
### Verification
```bash
# Local test (requires Zig installed)
cd backend
CGO_ENABLED=1 CC="zig cc -target x86_64-macos-none" go build -o charon-darwin ./cmd/api
# Nightly workflow test
gh workflow run nightly-build.yml --ref development -f reason="Test darwin build fix"
```
---
## Issue 2: Playwright E2E - Admin API Socket Hang Up
### Problem Statement
Playwright test `zzz-admin-whitelist-blocking.spec.ts:126` fails with:
```text
Error: apiRequestContext.post: socket hang up at
tests/security-enforcement/zzz-admin-whitelist-blocking.spec.ts:126:21
```
The test POSTs to `http://localhost:2020/emergency/security-reset` but cannot reach the emergency server.
### Root Cause Analysis
The `playwright.yml` workflow starts the Charon container but **does not set** the `CHARON_EMERGENCY_BIND` environment variable:
**Current workflow (`.github/workflows/playwright.yml`):**
```bash
docker run -d \
  --name charon-test \
  -p 8080:8080 \
  -p 127.0.0.1:2019:2019 \
  -p "[::1]:2019:2019" \
  -p 127.0.0.1:2020:2020 \
  -p "[::1]:2020:2020" \
  -e CHARON_ENV="${CHARON_ENV}" \
  -e CHARON_DEBUG="${CHARON_DEBUG}" \
  -e CHARON_ENCRYPTION_KEY="${CHARON_ENCRYPTION_KEY}" \
  -e CHARON_EMERGENCY_TOKEN="${CHARON_EMERGENCY_TOKEN}" \
  -e CHARON_EMERGENCY_SERVER_ENABLED="${CHARON_EMERGENCY_SERVER_ENABLED}" \
  "${IMAGE_REF}"
```
**Missing:** `CHARON_EMERGENCY_BIND=0.0.0.0:2020`
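Without this variable, the emergency server may not bind to the expected address. If it binds to a loopback-only address inside the container, it listens only in the container's own network namespace, and Docker's published port (`-p 127.0.0.1:2020:2020`) has nothing to forward to. That is consistent with the `socket hang up` the test reports.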
**Comparison with working compose file:**
```yaml
# .docker/compose/docker-compose.playwright-ci.yml
- CHARON_EMERGENCY_BIND=0.0.0.0:2020
- CHARON_EMERGENCY_USERNAME=admin
- CHARON_EMERGENCY_PASSWORD=changeme
```
### Affected Files
| File | Change Type |
|------|-------------|
| `.github/workflows/playwright.yml` | Add missing emergency server env vars |
### Recommended Fix
Add the missing emergency server environment variables to the docker run command:
```diff
--- a/.github/workflows/playwright.yml
+++ b/.github/workflows/playwright.yml
@@ -163,6 +163,10 @@ jobs:
     -e CHARON_ENCRYPTION_KEY="${CHARON_ENCRYPTION_KEY}" \
     -e CHARON_EMERGENCY_TOKEN="${CHARON_EMERGENCY_TOKEN}" \
     -e CHARON_EMERGENCY_SERVER_ENABLED="${CHARON_EMERGENCY_SERVER_ENABLED}" \
+    -e CHARON_EMERGENCY_BIND="0.0.0.0:2020" \
+    -e CHARON_EMERGENCY_USERNAME="admin" \
+    -e CHARON_EMERGENCY_PASSWORD="changeme" \
+    -e CHARON_SECURITY_TESTS_ENABLED="true" \
     "${IMAGE_REF}"
```
### Full Updated Step
```yaml
- name: Start Charon container
  if: steps.check-artifact.outputs.artifact_exists == 'true'
  run: |
    echo "🚀 Starting Charon container..."
    # Normalize image name (GitHub lowercases repository owner names in GHCR)
    IMAGE_NAME=$(echo "${{ github.repository_owner }}/charon" | tr '[:upper:]' '[:lower:]')
    if [[ "${{ steps.pr-info.outputs.is_push }}" == "true" ]]; then
      IMAGE_REF="ghcr.io/${IMAGE_NAME}:${{ steps.sanitize.outputs.branch }}"
    else
      IMAGE_REF="ghcr.io/${IMAGE_NAME}:pr-${{ steps.pr-info.outputs.pr_number }}"
    fi
    echo "📦 Starting container with image: ${IMAGE_REF}"
    docker run -d \
      --name charon-test \
      -p 8080:8080 \
      -p 127.0.0.1:2019:2019 \
      -p "[::1]:2019:2019" \
      -p 127.0.0.1:2020:2020 \
      -p "[::1]:2020:2020" \
      -e CHARON_ENV="${CHARON_ENV}" \
      -e CHARON_DEBUG="${CHARON_DEBUG}" \
      -e CHARON_ENCRYPTION_KEY="${CHARON_ENCRYPTION_KEY}" \
      -e CHARON_EMERGENCY_TOKEN="${CHARON_EMERGENCY_TOKEN}" \
      -e CHARON_EMERGENCY_SERVER_ENABLED="${CHARON_EMERGENCY_SERVER_ENABLED}" \
      -e CHARON_EMERGENCY_BIND="0.0.0.0:2020" \
      -e CHARON_EMERGENCY_USERNAME="admin" \
      -e CHARON_EMERGENCY_PASSWORD="changeme" \
      -e CHARON_SECURITY_TESTS_ENABLED="true" \
      "${IMAGE_REF}"
    echo "✅ Container started"
```
### Verification
```bash
# After fix, verify emergency server is listening
docker exec charon-test curl -sf http://localhost:2020/health || echo "Failed"
# Test emergency reset endpoint
curl -X POST http://localhost:2020/emergency/security-reset \
-H "Authorization: Basic $(echo -n 'admin:changeme' | base64)" \
-H "X-Emergency-Token: $CHARON_EMERGENCY_TOKEN"
```
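
Because the container may need a few seconds before the emergency server accepts connections, the health check above can be wrapped in a retry loop. The following is a minimal sketch; the `retry` helper and its attempt and delay values are assumptions, not part of the existing workflow.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: retry <attempts> <delay-seconds> <command...>
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final failure.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  echo "command failed after ${attempts} attempts: $*" >&2
  return 1
}
```

For instance, `retry 30 2 curl -sf http://localhost:2020/health` would poll the health endpoint for up to about a minute before giving up.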
---
## Issue 3: Trivy Scan - Invalid Image Reference Format
### Problem Statement
Trivy scan fails with "invalid image reference format" when:
1. PR number is missing (manual dispatch without PR number)
2. Feature branch names contain `/` characters (e.g., `feature/new-thing`)
3. `is_push` and `pr_number` are both empty/false
This results in unusable image references such as:
- `ghcr.io/owner/charon:pr-` (empty PR number)
- `ghcr.io/owner/charon:` (no tag at all)
### Root Cause Analysis
**Location:** `.github/workflows/playwright.yml` - "Start Charon container" step
```bash
if [[ "${{ steps.pr-info.outputs.is_push }}" == "true" ]]; then
IMAGE_REF="ghcr.io/${IMAGE_NAME}:${{ steps.sanitize.outputs.branch }}"
else
IMAGE_REF="ghcr.io/${IMAGE_NAME}:pr-${{ steps.pr-info.outputs.pr_number }}"
fi
```
**Problem:** When `is_push != "true"` AND `pr_number` is empty, this creates:
```
IMAGE_REF="ghcr.io/owner/charon:pr-"
```
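This reference does not identify a real image: `pr-` is acceptable to Docker's tag grammar but matches no PR build, so the failure only surfaces later. When the tag is empty altogether (`ghcr.io/owner/charon:`), the reference itself is malformed.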
### Affected Files
| File | Change Type |
|------|-------------|
| `.github/workflows/playwright.yml` | Add validation for IMAGE_REF |
| `.github/workflows/docker-build.yml` | Add validation guards (CVE verification step) |
### Recommended Fix
Add defensive validation to fail fast with a clear error message:
```diff
--- a/.github/workflows/playwright.yml
+++ b/.github/workflows/playwright.yml
 # Normalize image name (GitHub lowercases repository owner names in GHCR)
 IMAGE_NAME=$(echo "${{ github.repository_owner }}/charon" | tr '[:upper:]' '[:lower:]')
 if [[ "${{ steps.pr-info.outputs.is_push }}" == "true" ]]; then
   IMAGE_REF="ghcr.io/${IMAGE_NAME}:${{ steps.sanitize.outputs.branch }}"
-else
+elif [[ -n "${{ steps.pr-info.outputs.pr_number }}" ]]; then
   IMAGE_REF="ghcr.io/${IMAGE_NAME}:pr-${{ steps.pr-info.outputs.pr_number }}"
+else
+  echo "❌ ERROR: Cannot determine image reference"
+  echo "  - is_push: ${{ steps.pr-info.outputs.is_push }}"
+  echo "  - pr_number: ${{ steps.pr-info.outputs.pr_number }}"
+  echo "  - branch: ${{ steps.sanitize.outputs.branch }}"
+  echo ""
+  echo "This can happen when:"
+  echo "  1. workflow_dispatch without pr_number input"
+  echo "  2. workflow_run triggered by non-PR, non-push event"
+  exit 1
 fi
+# Validate the image reference format
+if [[ ! "${IMAGE_REF}" =~ ^ghcr\.io/[a-z0-9_-]+/[a-z0-9_-]+:[a-zA-Z0-9._-]+$ ]]; then
+  echo "❌ ERROR: Invalid image reference format: ${IMAGE_REF}"
+  exit 1
+fi
+
 echo "📦 Starting container with image: ${IMAGE_REF}"
```
### Additional Fix for docker-build.yml
The same issue can occur in `docker-build.yml` at the CVE verification step:
```yaml
# Line ~174 in docker-build.yml
if [ "${{ github.event_name }}" = "pull_request" ]; then
IMAGE_REF="${{ env.GHCR_REGISTRY }}/${{ env.IMAGE_NAME }}:pr-${{ github.event.pull_request.number }}"
```
**Fix:**
```diff
--- a/.github/workflows/docker-build.yml
+++ b/.github/workflows/docker-build.yml
 # Determine the image reference based on event type
 if [ "${{ github.event_name }}" = "pull_request" ]; then
-  IMAGE_REF="${{ env.GHCR_REGISTRY }}/${{ env.IMAGE_NAME }}:pr-${{ github.event.pull_request.number }}"
+  PR_NUM="${{ github.event.pull_request.number }}"
+  if [ -z "${PR_NUM}" ]; then
+    echo "❌ ERROR: Pull request number is empty"
+    exit 1
+  fi
+  IMAGE_REF="${{ env.GHCR_REGISTRY }}/${{ env.IMAGE_NAME }}:pr-${PR_NUM}"
   echo "Using PR image: $IMAGE_REF"
 else
+  # Check the digest before building the reference from it
+  if [ -z "${{ steps.build-and-push.outputs.digest }}" ]; then
+    echo "❌ ERROR: Build digest is empty"
+    exit 1
+  fi
   IMAGE_REF="${{ env.GHCR_REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build-and-push.outputs.digest }}"
   echo "Using digest: $IMAGE_REF"
 fi
```
### Verification
```bash
# Test with empty PR number (should fail fast with clear error)
gh workflow run playwright.yml --ref development
# Check IMAGE_REF construction in logs (pass a run ID, or pick the run when prompted)
gh run view --log | grep "IMAGE_REF"
```
---
## Implementation Plan
### Phase 1: Immediate Fixes (Single PR)
**Objective:** Fix all three CI failures in a single PR for immediate resolution.
**Files to Modify:**
| File | Changes |
|------|---------|
| `.goreleaser.yaml` | Change `-macos-gnu` to `-macos-none` for darwin builds |
| `.github/workflows/playwright.yml` | Add missing emergency server env vars; Add IMAGE_REF validation |
| `.github/workflows/docker-build.yml` | Add IMAGE_REF validation guards |
### Phase 2: Verification
1. Push changes to a feature branch
2. Open a PR to trigger `docker-build.yml`
3. Verify the Trivy scan passes with a valid IMAGE_REF
4. Verify the Playwright workflow, if triggered
5. Manually trigger `nightly-build.yml` with `--ref` pointing to the feature branch
6. Verify the darwin build succeeds
### Phase 3: Cleanup (Optional)
1. Add validation logic to a shared script (`scripts/validate-image-ref.sh`)
2. Add integration tests for emergency server connectivity
3. Document Zig target requirements for future contributors
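
A shared validation script along the lines of item 1 might look like the sketch below. The script contents are an assumption; the pattern is the same one proposed inline for `playwright.yml`, so it covers tag-style references only and would not match the `@sha256:` digest form used in `docker-build.yml`. Note also that it accepts a tag such as `pr-`, so the explicit empty-PR-number guard is still needed.

```bash
#!/usr/bin/env bash
# Sketch of scripts/validate-image-ref.sh (contents are hypothetical).
# Returns 0 when the argument looks like ghcr.io/<owner>/<name>:<tag>.
validate_image_ref() {
  local ref="$1"
  # Same pattern as the inline check; stored in a variable so bash
  # treats it as a regex on the right-hand side of =~.
  local pattern='^ghcr\.io/[a-z0-9_-]+/[a-z0-9_-]+:[a-zA-Z0-9._-]+$'
  if [[ ! "$ref" =~ $pattern ]]; then
    echo "ERROR: Invalid image reference format: ${ref}" >&2
    return 1
  fi
}
```

A workflow step could then source the script and call `validate_image_ref "${IMAGE_REF}" || exit 1` instead of repeating the regex in each workflow.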
---
## Requirements (EARS Notation)
1. WHEN GoReleaser builds darwin targets, THE SYSTEM SHALL use the `-macos-none` Zig target (not `-macos-gnu`).
2. WHEN the Playwright workflow starts the Charon container, THE SYSTEM SHALL set `CHARON_EMERGENCY_BIND=0.0.0.0:2020` to ensure the emergency server is reachable.
3. WHEN constructing Docker image references, THE SYSTEM SHALL validate that the tag portion is non-empty before attempting to use it.
4. IF the PR number is empty in a PR-triggered workflow, THEN THE SYSTEM SHALL fail fast with a clear error message explaining the issue.
5. WHEN a feature branch contains `/` characters, THE SYSTEM SHALL sanitize the branch name by replacing `/` with `-` before using it as a Docker tag.
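
Requirement 5 can be illustrated with a small shell function. This is a sketch under the assumption that replacing `/` is the only transformation needed; the existing `sanitize` step in the workflow may do more, and the name `sanitize_branch` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: make a branch name usable as a Docker tag
# by replacing every "/" with "-".
sanitize_branch() {
  local branch="$1"
  printf '%s\n' "${branch//\//-}"
}
```

With this, `sanitize_branch feature/new-thing` yields `feature-new-thing`, which is safe to append after the `:` in an image reference.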
---
## Acceptance Criteria
1. [ ] Nightly build completes successfully with darwin binaries
2. [ ] Playwright E2E tests pass with emergency server accessible on port 2020
3. [ ] Trivy scan passes with valid image reference for all trigger types
4. [ ] Workflow failures produce clear, actionable error messages
5. [ ] No regression in existing CI functionality
---
## Risks & Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Zig target change breaks darwin binaries | Low | High | Test with local Zig build first |
| Emergency server env vars conflict with existing config | Low | Medium | Verify against docker-compose.playwright-ci.yml |
| IMAGE_REF validation too strict | Medium | Low | Use permissive regex, log values before validation |
---
## Handoff Contract
```json
{
"plan": "CI Workflow Failures - Fix Plan",
"status": "Ready for Implementation",
"owner": "DevOps",
"handoffTargets": ["Backend_Dev", "DevOps"],
"files": [
".goreleaser.yaml",
".github/workflows/playwright.yml",
".github/workflows/docker-build.yml"
],
"estimatedEffort": "2-3 hours",
"priority": "HIGH",
"blockedWorkflows": [
"nightly-build.yml",
"playwright.yml",
"docker-build.yml (Trivy scan step)"
]
}
```
---
## References
- [docs/actions/nightly-build-failure.md](../actions/nightly-build-failure.md)
- [docs/actions/playwright-e2e-failures.md](../actions/playwright-e2e-failures.md)
- [Zig Cross-Compilation Targets](https://ziglang.org/documentation/master/#Targets)
- [GoReleaser CGO Cross-Compilation](https://goreleaser.com/customization/build/#cross-compiling)