fix: address CI Docker build and scanning failure for PR #666

GitHub Actions
2026-02-08 01:19:50 +00:00
parent 61dc2098df
commit 62a36dff01
5 changed files with 484 additions and 271 deletions
+278 -93
@@ -1,141 +1,326 @@
# Plan: Conditional E2E Rebuild Rules + Navigation Test Continuation
---
title: "CI Docker Build and Scanning Blocker (PR #666)"
status: "draft"
scope: "ci/docker-build-scan"
---
## 1. Introduction
This plan updates E2E testing instructions so the Docker rebuild runs only when application code changes, explicitly skips rebuilds for test-only changes, and then continues navigation E2E testing using the existing task. The intent is to reduce unnecessary rebuild time while keeping the environment reliable and consistent.
This plan addresses the CI failure that blocks Docker build and scanning
for PR #666. The goal is to restore a clean, deterministic pipeline
where the image builds once, scans consistently, and security artifacts
align across workflows. The approach is minimal and evidence-driven:
collect logs, map the path, isolate the blocker, and apply the smallest
effective fix.
Objectives:
- Define clear, repeatable criteria for when an E2E container rebuild is required vs optional.
- Update instruction and agent documents to use the same conditional rebuild guidance.
- Preserve current E2E execution behavior and task surface, then proceed with navigation testing.
- Identify the exact failing step in the build/scan chain.
- Trace the failure to a reproducible root cause.
- Propose minimal workflow/Dockerfile changes to restore green CI.
- Ensure all scan workflows resolve the same PR image.
- Review .gitignore, codecov.yml, .dockerignore, and Dockerfile if
needed for artifact hygiene.
## 2. Research Findings
- The current testing protocol mandates rebuilding the E2E container before Playwright runs in [testing.instructions.md](.github/instructions/testing.instructions.md).
- The Management and Playwright agent definitions require rebuilding the E2E container before each test run in [Management.agent.md](.github/agents/Management.agent.md) and [Playwright_Dev.agent.md](.github/agents/Playwright_Dev.agent.md).
- QA Security also mandates rebuilds on every code change in [QA_Security.agent.md](.github/agents/QA_Security.agent.md).
- The main E2E skill doc encourages rebuilds before testing in [test-e2e-playwright.SKILL.md](.github/skills/test-e2e-playwright.SKILL.md).
- The rebuild skill itself is stable and already describes when it should be used in [docker-rebuild-e2e.SKILL.md](.github/skills/docker-rebuild-e2e.SKILL.md).
- Navigation test tasks already exist in [tasks.json](.vscode/tasks.json), including “Test: E2E Playwright (FireFox) - Core: Navigation”.
- CI E2E jobs rebuild via Docker image creation in [e2e-tests-split.yml](.github/workflows/e2e-tests-split.yml); no CI changes are required for this instruction-only update.
CI workflow and build context (already reviewed):
- Docker build orchestration: .github/workflows/docker-build.yml
- Security scan for PR artifacts: .github/workflows/security-pr.yml
- Supply-chain verification for PRs: .github/workflows/supply-chain-pr.yml
- SBOM verification for non-PR builds: .github/workflows/supply-chain-verify.yml
- Dockerfile linting: .github/workflows/docker-lint.yml and
.hadolint.yaml
- Weekly rebuild and scan: .github/workflows/security-weekly-rebuild.yml
- Quality checks (non-Docker): .github/workflows/quality-checks.yml
- Build context filters: .dockerignore
- Runtime Docker build instructions: Dockerfile
- Ignored artifacts: .gitignore
- Coverage configuration: codecov.yml
Observed from the public workflow summary (PR #666):
- Job build-and-push failed in the Docker Build, Publish & Test workflow.
- Logs require GitHub authentication; obtained via gh CLI after auth.
- Evidence status: confirmed via gh CLI logs (see Results).
Root cause captured from CI logs (authenticated gh CLI):
- npm ci failed with ERESOLVE due to eslint@10 conflicting with the
@typescript-eslint peer dependency range.
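To make the ERESOLVE condition concrete, the following is a minimal sketch of a caret-range check. The peer range string `^8.57.0 || ^9.0.0` is an assumption for illustration; the actual range declared by the installed @typescript-eslint packages should be read from the CI log or package metadata. The helper names are hypothetical, and only caret ranges with a non-zero major joined by `||` are handled.

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in text.strip().split("."))
    return major, minor, patch


def satisfies_caret_range(version: str, peer_range: str) -> bool:
    """Return True if version matches any '^X.Y.Z' alternative in peer_range."""
    candidate = parse_version(version)
    for alternative in peer_range.split("||"):
        base = parse_version(alternative.strip().lstrip("^"))
        # ^X.Y.Z with X > 0 allows versions >= X.Y.Z that share major X.
        if candidate[0] == base[0] and candidate >= base:
            return True
    return False


# Assumed peer range, for illustration only (not taken from the CI logs).
ASSUMED_PEER_RANGE = "^8.57.0 || ^9.0.0"

if __name__ == "__main__":
    for eslint_version in ("9.12.0", "10.0.0"):
        ok = satisfies_caret_range(eslint_version, ASSUMED_PEER_RANGE)
        print(f"eslint@{eslint_version}: {'ok' if ok else 'peer conflict'}")
```

Under that assumed range, any eslint 10.x version falls outside every alternative, which is the kind of conflict npm ci reports as ERESOLVE.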
Secondary/unconfirmed mismatch to verify only if remediation fails:
- PR tags are generated as pr-{number}-{short-sha} in docker-build.yml.
- Several steps reference pr-{number} (no short SHA) and use
--pull=never.
- This can cause image-not-found errors after Buildx pushes without
--load.
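The suspected mismatch can be illustrated with a small sketch that derives both tag forms and checks whether the referenced tag was actually produced. The function names are hypothetical, and the 7-character short SHA length is an assumption; the real length depends on how docker/metadata-action is configured.

```python
def pr_tags(pr_number: int, commit_sha: str, short_len: int = 7) -> dict[str, str]:
    """Return the tag the build emits and the tag verification steps reference."""
    short_sha = commit_sha[:short_len]  # short_len=7 is an assumed default
    return {
        "built": f"pr-{pr_number}-{short_sha}",
        "referenced": f"pr-{pr_number}",
    }


def tag_is_available(tag: str, pushed_tags: set[str]) -> bool:
    """A step using --pull=never can only succeed if the exact tag exists."""
    return tag in pushed_tags


if __name__ == "__main__":
    tags = pr_tags(666, "0123456789abcdef")  # placeholder SHA
    pushed = {tags["built"]}
    print(tags["built"], tag_is_available(tags["built"], pushed))
    print(tags["referenced"], tag_is_available(tags["referenced"], pushed))
```

If only the SHA-suffixed tag is pushed, a lookup of the bare pr-{number} tag fails, which matches the image-not-found symptom described above.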
## 3. Technical Specifications
### 3.1 Rebuild Decision Rules
### 3.1 CI Flow Map (Build -> Scan -> Verify)
Define explicit change categories to decide when to rebuild:
```mermaid
flowchart LR
A[PR Push] --> B[docker-build.yml: build-and-push]
B --> C[docker-build.yml: scan-pr-image]
B --> D[security-pr.yml: Trivy binary scan]
B --> E[supply-chain-pr.yml: SBOM + Grype]
    B --> F["supply-chain-verify.yml: SBOM verify (non-PR)"]
```
- **Rebuild Required (application/runtime changes)**
- Application code or dependencies: backend/**, frontend/**, backend/go.mod, backend/go.sum, package.json, package-lock.json.
- Container build/runtime configuration: Dockerfile, .docker/**, .docker/compose/docker-compose.playwright-*.yml, .docker/docker-entrypoint.sh.
- Runtime behavior changes that affect container startup (e.g., config files baked into the image).
### 3.2 Primary Failure Hypotheses (Ordered)
- **Rebuild Optional (test-only changes)**
- Playwright tests and fixtures: tests/**.
- Playwright config and test runners: playwright.config.js, playwright.caddy-debug.config.js.
- Documentation or planning files: docs/**, requirements.md, design.md, tasks.md.
- CI/workflow changes that do not affect runtime images: .github/workflows/**.
1) ESLint peer dependency conflict (confirmed root cause)
Decision guidance:
- npm ci failed with ERESOLVE due to eslint@10 conflicting with the
@typescript-eslint peer dependency range.
- If only test files or documentation change, reuse the existing E2E container if it is already healthy.
- If the container is not running, start it with docker-rebuild-e2e even for test-only changes.
- If there is uncertainty about whether a change affects the runtime image, default to rebuilding.
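The decision guidance above can be sketched as a small classifier over changed paths. This is an illustrative sketch, not an existing script in the repository: the function name and the prefix/file tables are assumptions derived from the categories listed above, and anything unclassified defaults to a rebuild.

```python
# Path tables derived from the rebuild-required and rebuild-optional lists.
REBUILD_REQUIRED_PREFIXES = ("backend/", "frontend/", ".docker/")
REBUILD_REQUIRED_FILES = {"Dockerfile", "package.json", "package-lock.json"}
TEST_ONLY_PREFIXES = ("tests/", "docs/", ".github/workflows/")
TEST_ONLY_FILES = {
    "playwright.config.js",
    "playwright.caddy-debug.config.js",
    "requirements.md",
    "design.md",
    "tasks.md",
}


def needs_rebuild(changed_paths: list[str], container_healthy: bool) -> bool:
    """Return True when the E2E container should be rebuilt before testing."""
    if not container_healthy:
        # A stopped or suspect container is rebuilt even for test-only changes.
        return True
    for path in changed_paths:
        if path in REBUILD_REQUIRED_FILES or path.startswith(REBUILD_REQUIRED_PREFIXES):
            return True
        if path in TEST_ONLY_FILES or path.startswith(TEST_ONLY_PREFIXES):
            continue
        # Unclassified path: default to rebuilding when unsure.
        return True
    return False


if __name__ == "__main__":
    print(needs_rebuild(["tests/navigation.spec.ts"], container_healthy=True))
    print(needs_rebuild(["backend/go.mod"], container_healthy=True))
```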
2) Tag mismatch between build output and verification steps
### 3.2 Instruction Targets and Proposed Wording
- Build tags for PRs are pr-{number}-{short-sha} (metadata action).
- Verification steps reference pr-{number} (no SHA) and do not pull.
- This is consistent with image-not-found errors.
Status: unconfirmed secondary hypothesis.
Update the following instruction and agent files to align with the conditional rebuild policy:
3) Buildx push without local image for verification steps
- [testing.instructions.md](.github/instructions/testing.instructions.md)
- Replace the current “Always rebuild the E2E container before running Playwright tests” statement with:
- “Rebuild the E2E container when application or Docker build inputs change (backend, frontend, dependencies, Dockerfile, .docker/compose). If changes are test-only, reuse the existing container when it is already healthy; rebuild only if the container is not running or state is suspect.”
- Add a short file-scope checklist defining “rebuild required” vs “test-only.”
- Build uses docker buildx build --push without --load.
- Verification steps use docker run --pull=never with local tags.
- With the default docker exporter, --load cannot export a multi-arch
build result; it loads only a single-platform image. For multi-arch,
prefer pulling by digest or publish a single-platform build output for
local checks.
- If the tag is not local and not pulled, verification fails.
- [Management.agent.md](.github/agents/Management.agent.md)
- Update the “PREREQUISITE: Rebuild E2E container before each test run” bullet to:
- “PREREQUISITE: Rebuild the E2E container only when application or Docker build inputs change; skip rebuild for test-only changes if the container is already healthy.”
4) Dockerfile stage failure during network-heavy steps
- [Playwright_Dev.agent.md](.github/agents/Playwright_Dev.agent.md)
- Update “ALWAYS rebuild the E2E container before running tests” to:
- “Rebuild the E2E container when application or Docker build inputs change. For test-only changes, reuse the running container if healthy; rebuild only when the container is not running or state is suspect.”
- gosu-builder: git clone and Go build
- frontend-builder: npm ci / npm run build
- backend-builder: go mod download / xx-go build
- caddy-builder: xcaddy build and Go dependency patching
- crowdsec-builder: git clone + go get + sed patch
- GeoLite2 download and checksum verification
- [QA_Security.agent.md](.github/agents/QA_Security.agent.md)
- Update workflow step 1 to:
- “Rebuild the E2E image and container when application or Docker build inputs change. Skip rebuild for test-only changes if the container is already healthy.”
Any of these can fail with network timeouts or dependency resolution
errors in CI. The eslint peer dependency conflict is confirmed; other
hypotheses remain unconfirmed.
- [test-e2e-playwright.SKILL.md](.github/skills/test-e2e-playwright.SKILL.md)
- Adjust “Quick Start” language to:
- “Run docker-rebuild-e2e when application or Docker build inputs change. If only tests changed and the container is already healthy, skip rebuild and run the tests.”
### 3.3 Evidence Required (Single-Request Capture)
- Optional alignment (if desired for consistency):
- [test-e2e-playwright-debug.SKILL.md](.github/skills/test-e2e-playwright-debug.SKILL.md)
- [test-e2e-playwright-coverage.SKILL.md](.github/skills/test-e2e-playwright-coverage.SKILL.md)
- Update prerequisite language in the same conditional format when referencing docker-rebuild-e2e.
Evidence capture completed in a single session. The following items
were captured:
### 3.3 Data Flow and Component Impact
- Full logs for the failing docker-build.yml build-and-push job
- Failing step name and exit code
- Buildx command line as executed
- Metadata tags produced by docker/metadata-action
- Dockerfile stage that failed (if build failure)
- No API, database, or runtime component changes are introduced.
- The change is documentation-only: it modifies decision guidance for when to rebuild the E2E container.
- The E2E execution flow remains: optionally rebuild → run navigation test task → review Playwright report.
If accessible, also capture downstream scan job logs to confirm the image
reference used.
### 3.4 Error Handling and Edge Cases
### 3.4 Specific Files and Components to Investigate
- If the container is running but tests fail due to stale state, rebuild with docker-rebuild-e2e and re-run the navigation test.
- If only tests changed but the container is stopped, rebuild to create a known-good environment.
- If Dockerfile or .docker/compose changes occurred, rebuild is required even if tests are the only edited files in the last commit.
Docker build and tagging:
## 4. Implementation Plan
- .github/workflows/docker-build.yml
- Generate Docker metadata (tag formatting)
- Build and push Docker image (with retry)
- Verify Caddy Security Patches
- Verify CrowdSec Security Patches
- Job: scan-pr-image
### Phase 1: Instruction Updates (Documentation-only)
Security scanning:
- Update conditional rebuild guidance in the instruction files listed in section 3.2.
- Ensure the rebuild decision criteria are consistent and use the same file-scope examples across documents.
- .github/workflows/security-pr.yml
- Extract PR number from workflow_run
- Extract charon binary from container (image reference)
- Trivy scans (fs, SARIF, blocking table)
### Phase 2: Supporting Artifacts
Supply-chain verification:
- Update requirements.md with EARS requirements for conditional rebuild behavior.
- Update design.md to document the decision rules and file-scope criteria.
- Update tasks.md with a checklist that explicitly separates rebuild-required vs test-only scenarios.
- .github/workflows/supply-chain-pr.yml
- Check for PR image artifact
- Load Docker image (artifact)
- Build Docker image (Local)
### Phase 3: Navigation Test Continuation
Dockerfile stages and critical components:
- Determine change scope:
- If application/runtime files changed, run the Docker rebuild step first.
- If only tests or docs changed and the E2E container is already healthy, skip rebuild.
- Run the existing navigation task: “Test: E2E Playwright (FireFox) - Core: Navigation” from [tasks.json](.vscode/tasks.json).
- If the navigation test fails due to environment issues, rebuild and re-run.
- gosu-builder: /tmp/gosu build
- frontend-builder: /app/frontend build
- backend-builder: /app/backend build
- caddy-builder: xcaddy build and Go dependency patching
- crowdsec-builder: Go build and sed patch in
pkg/exprhelpers/debugger.go
- Final runtime stage: GeoLite2 download and checksum
## 5. Acceptance Criteria
### 3.5 Tag and Digest Source-of-Truth Propagation
- Instruction and agent files reflect the same conditional rebuild policy.
- Rebuild-required vs test-only criteria are explicitly defined with file path examples.
- Navigation tests can be run without a rebuild when only tests change and the container is healthy.
- The navigation test task remains unchanged and is used for validation.
- requirements.md, design.md, and tasks.md are updated to reflect the new rebuild rules.
Source of truth for the PR image reference is the output of the metadata
and build steps in docker-build.yml. Downstream workflows must consume a
single canonical reference, defined as:
## 6. Testing Steps
- primary: digest from buildx outputs (immutable)
- secondary: pr-{number}-{short-sha} tag (human-friendly)
- If application/runtime files changed, run the E2E rebuild using docker-rebuild-e2e before testing.
- If only tests changed and the container is healthy, skip rebuild.
- Run the navigation test task: “Test: E2E Playwright (FireFox) - Core: Navigation”.
- Review Playwright report and logs if failures occur; rebuild and re-run if the failure is environment-related.
Propagation rules:
## 7. Config Hygiene Review (Requested Files)
- docker-build.yml SHALL publish the digest and tag as job outputs.
- docker-build.yml SHALL write digest and tag to a small artifact
(e.g., pr-image-ref.txt) for downstream workflow_run consumers.
- security-pr.yml and supply-chain-pr.yml SHALL prefer the digest from
outputs or artifact, and only fall back to tag if digest is absent.
- Any step that runs a local container SHALL ensure the referenced image
is available by either --load (local) or explicit pull by digest.
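A minimal sketch of the digest-first resolution described by these rules follows. The function name, the exception class, and the repository name in the usage example are hypothetical; how the digest and tag reach the consumer (job outputs or the pr-image-ref.txt artifact) is left out.

```python
from typing import Optional


class ImageReferenceError(RuntimeError):
    """Raised when neither a digest nor a tag is available."""


def resolve_image_ref(repository: str, digest: Optional[str], tag: Optional[str]) -> str:
    """Prefer the immutable digest; fall back to the emitted tag."""
    if digest:
        if not digest.startswith("sha256:"):
            raise ImageReferenceError(f"unexpected digest format: {digest!r}")
        return f"{repository}@{digest}"
    if tag:
        return f"{repository}:{tag}"
    # Fail fast with a message that names both expected inputs.
    raise ImageReferenceError(
        f"cannot resolve PR image for {repository}: digest={digest!r}, tag={tag!r}"
    )


if __name__ == "__main__":
    repo = "ghcr.io/example/app"  # hypothetical repository name
    print(resolve_image_ref(repo, "sha256:" + "a" * 64, "pr-666-0123456"))
    print(resolve_image_ref(repo, None, "pr-666-0123456"))
```

Returning a single string keeps every downstream step on the same reference, whether it pulls by digest or by tag.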
- .gitignore: No change required for this instruction update.
- codecov.yml: No change required; E2E outputs are already ignored.
- .dockerignore: No change required; tests/ and Playwright artifacts remain excluded from image build context.
- Dockerfile: No change required.
### 3.6 Required Outcome (EARS Requirements)
## 8. Risks and Mitigations
- WHEN a pull request triggers docker-build.yml, THE SYSTEM SHALL build a
PR image tagged as pr-{number}-{short-sha} and emit its digest.
- WHEN verification steps run in docker-build.yml, THE SYSTEM SHALL
reference the same digest or tag emitted by the build step.
- WHEN security-pr.yml runs for a workflow_run, THE SYSTEM SHALL resolve
the PR image using the digest or the emitted tag.
- WHEN supply-chain-pr.yml runs, THE SYSTEM SHALL load the exact PR image
by digest or by the emitted tag without ambiguity.
- IF the image reference cannot be resolved, THEN THE SYSTEM SHALL fail
fast with a clear message that includes the expected digest and tag.
- Risk: Tests may run against stale containers when changes are misclassified as test-only. Mitigation: Provide explicit file-scope criteria and default to rebuild when unsure.
- Risk: Contributors interpret “test-only” too narrowly. Mitigation: include dependency files and Docker build inputs in rebuild-required list.
### 3.7 Config Hygiene Review (Requested Files)
## 9. Confidence Score
.gitignore:
- Ensure CI scan artifacts are ignored locally, including any new names
introduced by fixes (e.g., trivy-pr-results.sarif,
trivy-binary-results.sarif, grype-results.json,
sbom.cyclonedx.json).
codecov.yml:
- Confirm CI-generated security artifacts are excluded from coverage.
- Add any new artifact names if introduced by fixes.
.dockerignore:
- Verify required frontend/backend sources and manifests are included.
Dockerfile:
- Review GeoLite2 download behavior for CI reliability.
- Confirm CADDY_IMAGE build-arg naming consistency across workflows.
## 4. Implementation Plan (Minimal-Request Phases)
### Phase 0: Evidence Capture (Single Request)
Status: completed. Evidence captured and root cause confirmed.
- Retrieve full logs for the failing docker-build.yml build-and-push job.
- Capture the exact failing step, error output, and emitted tags/digest.
- Record the buildx command output as executed.
- Capture downstream scan logs if accessible to confirm image reference.
### Phase 1: Reproducibility Pass (Single Local Build)
- Run a local docker buildx build using the same arguments as
docker-build.yml.
- Capture any stage failures and map them to Dockerfile stages.
- Confirm whether Buildx produces local images or only remote tags.
### Phase 2: Root Cause Isolation
Status: completed. Root cause identified as the eslint peer dependency
conflict in the frontend build stage.
- If failure is tag mismatch, trace tag references across docker-build.yml,
security-pr.yml, and supply-chain-pr.yml.
- If failure is a Dockerfile stage, isolate to specific step (gosu,
frontend, backend, caddy, crowdsec, GeoLite2).
- If failure is network-related, document retries/timeout behavior and
any missing mirrors.
### Phase 3: Targeted Remediation Plan
Focus on validating the eslint remediation. Revisit secondary
hypotheses only if the remediation does not resolve CI.
Conditional options (fallbacks, unconfirmed):
Option A (Tag alignment):
- Update verification steps to use pr-{number}-{short-sha} tag.
- Or add a secondary tag pr-{number} for compatibility.
Option B (Local image availability):
- Add --load for PR builds so verification can run locally.
- Or explicitly pull by digest/tag before verification and remove
--pull=never.
Option C (Workflow scan alignment):
- Update security-pr.yml and supply-chain-pr.yml to consume the digest
or emitted tag from docker-build.yml outputs/artifact.
- Add fallback order: digest artifact -> emitted tag -> local build.
## 5. Results (Evidence)
Evidence status: confirmed via gh CLI logs after authentication.
Root cause (confirmed):
- npm ci fails with ERESOLVE in the frontend build stage because
eslint@10 conflicts with the @typescript-eslint peer dependency range.
Remediation:
- Align eslint with the @typescript-eslint peer range.
### Phase 4: Validation (Minimal Jobs)
- Rerun docker-build.yml for the PR (or workflow_dispatch).
- Confirm build-and-push succeeds and verification steps resolve the
exact digest or tag.
- Confirm security-pr.yml and supply-chain-pr.yml resolve the same
digest or tag and complete scans.
- Deterministic check: use docker buildx imagetools inspect on the
emitted tag and compare the reported digest to the recorded build
digest, or pull by digest and verify the digest of the local image
matches the build output.
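The deterministic check could be scripted roughly as follows. This sketch assumes the manifest JSON was captured beforehand with something like `docker buildx imagetools inspect --format '{{json .Manifest}}' <ref>` and that the output carries a top-level `digest` field; the function names are hypothetical, and the exact output shape should be confirmed against the Buildx version used in CI.

```python
import json


def manifest_digest(inspect_json: str) -> str:
    """Extract the digest from captured imagetools manifest JSON (assumed shape)."""
    manifest = json.loads(inspect_json)
    return manifest["digest"]


def digests_match(inspect_json: str, recorded_digest: str) -> bool:
    """Compare the registry-reported digest with the digest recorded at build time."""
    return manifest_digest(inspect_json) == recorded_digest.strip()


if __name__ == "__main__":
    recorded = "sha256:" + "a" * 64  # placeholder digest
    captured = json.dumps({"schemaVersion": 2, "digest": recorded})
    print("aligned" if digests_match(captured, recorded) else "MISMATCH")
```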
### Phase 5: Documentation and Hygiene
- Document the final tag/digest propagation in this plan.
- Update .gitignore / .dockerignore / codecov.yml if new artifacts are
produced.
## 6. Acceptance Criteria
- docker-build.yml build-and-push succeeds for PR #666.
- Verification steps resolve the same digest or tag emitted by build.
- security-pr.yml and supply-chain-pr.yml consume the same digest or tag
published by docker-build.yml.
- A validation check confirms tag-to-digest alignment across workflows
(digest matches tag for the PR image), using buildx imagetools inspect
or an equivalent digest comparison.
- No new CI artifacts are committed to the repository.
- Root cause is documented with logs and mapped to specific steps.
## 7. Risks and Mitigations
- Risk: CI logs are inaccessible without login, delaying diagnosis.
- Mitigation: request logs or export them once, then reproduce locally.
- Risk: Multiple workflows use divergent tag formats.
- Mitigation: define a single source of truth for PR tags and digest
propagation.
- Risk: Buildx produces only remote tags, breaking local verification.
- Mitigation: add --load for PR builds or pull by digest before
verification.
## 8. Confidence Score
Confidence: 88 percent
Rationale: This is a documentation-only change with no runtime or CI impact, but it relies on consistent interpretation of file-scope criteria.
Rationale: The eslint peer dependency conflict is confirmed as the
frontend build failure. Secondary tag mismatch hypotheses remain
unconfirmed and are now conditional fallbacks only.