Files
Charon/docs/plans/current_spec.md

384 lines
14 KiB
Markdown

# Security Scan (PR) Deterministic Artifact Policy - Supervisor Remediation Plan
## 1. Introduction
### Overview
`Security Scan (PR)` failed because `.github/workflows/security-pr.yml` loaded
an artifact image tag (`pr-718-385081f`) and later attempted extraction with a
different synthesized tag (`pr-718`).
Supervisor conflict resolution in this plan selects Option A:
`workflow_run` artifact handling is restricted to upstream
`pull_request` events only.
### Root-Cause Clarity (Preserved)
The failure was not a Docker load failure. It was a source-of-truth violation in
image selection:
1. Artifact load path succeeded.
2. Extraction path reconstructed an alternate reference.
3. Alternate reference did not exist, causing `docker create ... not found`.
This plan keeps scope strictly on `.github/workflows/security-pr.yml`.
### Objectives
1. Remove all ambiguous behavior for artifact absence on `workflow_run`.
2. Remove `workflow_run` support for upstream `push` events to align with PR
artifact naming contract (`pr-image-<pr_number>`).
3. Codify one deterministic `workflow_dispatch` policy in SHALL form.
4. Harden image selection so it is not brittle on `RepoTags[0]`.
5. Add CI security hardening requirements for permissions and trust boundary.
6. Expand validation matrix to include `pull_request` and negative paths.
---
## 2. Research Findings
### 2.1 Failure Evidence
Source: `.github/logs/ci_failure.log`
Observed facts:
1. Artifact `pr-image-718` was found and downloaded from run `22164807859`.
2. `docker load` reported: `Loaded image: ghcr.io/wikid82/charon:pr-718-385081f`.
3. Extraction attempted: `docker create ghcr.io/wikid82/charon:pr-718`.
4. Docker reported: `... pr-718: not found`.
### 2.2 Producer Contract
Source: `.github/workflows/docker-build.yml`
Producer emits immutable PR tags with SHA suffix (`pr-<num>-<sha>`). Consumer
must consume artifact metadata/load output, not reconstruct mutable tags.
### 2.3 Current Consumer Gaps
Source: `.github/workflows/security-pr.yml`
Current consumer contains ambiguous policy points:
1. `workflow_run` artifact absence behavior can be interpreted as skip or fail.
2. `workflow_dispatch` policy is not single-path deterministic.
3. Image identification relies on single `RepoTags[0]` assumption.
4. Trust boundary and permission minimization are not explicitly codified as
requirements.
---
## 3. Technical Specifications
### 3.1 Deterministic EARS Requirements (Blocking)
1. WHEN `security-pr.yml` is triggered by `workflow_run` with
`conclusion == success` and upstream event `pull_request`, THE SYSTEM SHALL
require the expected image artifact to exist and SHALL hard fail the job if
the artifact is missing.
2. WHEN `security-pr.yml` is triggered by `workflow_run` and artifact lookup
fails, THEN THE SYSTEM SHALL exit non-zero with a diagnostic that includes:
upstream run id, expected artifact name, and reason category (`not found` or
`api/error`).
3. WHEN `security-pr.yml` is triggered by `workflow_run` and upstream event is
not `pull_request`, THEN THE SYSTEM SHALL hard fail immediately with reason
category `unsupported_upstream_event` and SHALL NOT attempt artifact lookup,
image load, or extraction.
4. WHEN `security-pr.yml` is triggered by `workflow_dispatch`, THE SYSTEM SHALL
require `inputs.pr_number` and SHALL hard fail immediately if input is empty.
5. WHEN `security-pr.yml` is triggered by `workflow_dispatch` with valid
`inputs.pr_number`, THE SYSTEM SHALL resolve artifact `pr-image-<pr_number>`
from the latest successful `docker-build.yml` run for that PR and SHALL hard
fail if artifact resolution or download fails.
6. WHEN artifact image is loaded, THE SYSTEM SHALL derive a canonical local
image alias (`charon:artifact`) from validated load result and SHALL use only
that alias for `docker create` in artifact-based paths.
7. WHEN artifact metadata parsing is required, THE SYSTEM SHALL NOT depend only
on `RepoTags[0]`; it SHALL validate all available repo tags and SHALL support
fallback selection using docker load image ID when tags are absent/corrupt.
8. IF no valid tag and no valid load image ID can be resolved, THEN THE SYSTEM
SHALL hard fail before extraction.
9. WHEN event is `pull_request` or `push`, THE SYSTEM SHALL build and use
`charon:local` only and SHALL NOT execute artifact lookup/load logic.
### 3.2 Deterministic Policy Decisions
#### Policy A: `workflow_run` Missing Artifact
Decision: hard fail only.
No skip behavior is allowed for upstream-success `workflow_run`.
#### Policy A1: `workflow_run` Upstream Event Contract
Decision: upstream event MUST be `pull_request`.
If upstream event is `push` or any non-PR event, fail immediately with
`unsupported_upstream_event`; no artifact path execution is allowed.
#### Policy B: `workflow_dispatch`
Decision: artifact-only manual replay.
No local-build fallback is allowed for `workflow_dispatch`. Required input is
`pr_number`; missing input is immediate hard fail.
### 3.3 Image Selection Hardening Contract
For step `Load Docker image` in `.github/workflows/security-pr.yml`:
1. Validate artifact file exists and is readable tar.
2. Parse `manifest.json` and iterate all candidate tags under `RepoTags[]`.
3. Run `docker load` and capture structured output.
4. Resolve source image by deterministic priority:
- First valid tag from `RepoTags[]` that exists locally after load.
- Else image ID extracted from `docker load` output (if present).
- Else fail.
5. Retag resolved source to `charon:artifact`.
6. Emit outputs:
- `image_ref=charon:artifact`
- `source_image_ref=<resolved tag or image id>`
- `source_resolution_mode=manifest_tag|load_image_id`
### 3.4 CI Security Hardening Requirements
For job `security-scan` in `.github/workflows/security-pr.yml`:
1. THE SYSTEM SHALL enforce least-privilege permissions by default:
- `contents: read`
- `actions: read`
- `security-events: write`
- No additional write scopes unless explicitly required.
2. THE SYSTEM SHALL restrict `pull-requests: write` usage to only steps that
require PR annotations/comments. If no such step exists, this permission
SHALL be removed.
3. THE SYSTEM SHALL enforce workflow_run trust boundary guards:
- Upstream workflow name must match expected producer.
- Upstream conclusion must be `success`.
- Upstream event must be `pull_request` only.
- Upstream head repository must equal `${{ github.repository }}` (same-repo
trust boundary), otherwise hard fail.
4. THE SYSTEM SHALL NOT use untrusted `workflow_run` payload values to build
shell commands without validation and quoting.
### 3.5 Step-Level Scope in `security-pr.yml`
Targeted steps:
1. `Extract PR number from workflow_run`
2. `Validate workflow_run upstream event contract`
3. `Check for PR image artifact`
4. `Skip if no artifact` (to be converted to deterministic fail paths for
`workflow_run` and `workflow_dispatch`)
5. `Load Docker image`
6. `Extract charon binary from container`
### 3.6 Event Data Flow (Deterministic)
```text
pull_request/push
-> Build Docker image (Local)
-> image_ref=charon:local
-> Extract /app/charon
-> Trivy scan
workflow_run (upstream success only)
-> Assert upstream event == pull_request (hard fail if false)
-> Require artifact exists (hard fail if missing)
-> Load/validate image
-> image_ref=charon:artifact
-> Extract /app/charon
-> Trivy scan
workflow_dispatch
-> Require pr_number input (hard fail if missing)
-> Resolve pr-image-<pr_number> artifact (hard fail if missing)
-> Load/validate image
-> image_ref=charon:artifact
-> Extract /app/charon
-> Trivy scan
```
### 3.7 Error Handling Matrix
| Step | Condition | Required Behavior |
|---|---|---|
| Validate workflow_run upstream event contract | `workflow_run` upstream event is not `pull_request` | Hard fail with `unsupported_upstream_event`; stop before artifact lookup |
| Check for PR image artifact | `workflow_run` upstream success but artifact missing | Hard fail with run id + artifact name |
| Extract PR number from workflow_run | `workflow_dispatch` and empty `inputs.pr_number` | Hard fail with input requirement message |
| Load Docker image | Missing/corrupt `charon-pr-image.tar` | Hard fail before `docker load` |
| Load Docker image | Missing/corrupt `manifest.json` | Attempt load-image-id fallback; fail if unresolved |
| Load Docker image | No valid `RepoTags[]` and no load image id | Hard fail |
| Extract charon binary from container | Empty/invalid `image_ref` | Hard fail before `docker create` |
| Extract charon binary from container | `/app/charon` missing | Hard fail with chosen image reference |
### 3.8 API/DB Changes
No backend API, frontend, or database schema changes.
---
## 4. Implementation Plan
### Phase 1: Playwright Impact Check
1. Mark Playwright scope as N/A because this change is workflow-only.
2. Record N/A rationale in PR description.
### Phase 2: Deterministic Event Policies
File: `.github/workflows/security-pr.yml`
1. Convert ambiguous skip/fail logic to hard-fail policy for
`workflow_run` missing artifact after upstream success.
2. Enforce deterministic `workflow_dispatch` policy:
- Required `pr_number` input.
- Artifact-only replay path.
- No local fallback.
3. Enforce PR-only `workflow_run` event contract:
- Upstream event must be `pull_request`.
- Upstream `push` or any non-PR event hard fails with
`unsupported_upstream_event`.
### Phase 3: Image Selection Hardening
File: `.github/workflows/security-pr.yml`
1. Harden `Load Docker image` with manifest validation and multi-tag handling.
2. Add fallback resolution via docker load image ID.
3. Emit explicit outputs for traceability (`source_resolution_mode`).
4. Ensure extraction consumes only selected alias (`charon:artifact`).
### Phase 4: CI Security Hardening
File: `.github/workflows/security-pr.yml`
1. Reduce job permissions to least privilege.
2. Remove/conditionalize `pull-requests: write` if not required.
3. Add workflow_run trust-boundary guard conditions and explicit fail messages.
### Phase 5: Validation
1. `pre-commit run actionlint --files .github/workflows/security-pr.yml`
2. Simulate deterministic paths (or equivalent CI replay) for all matrix cases.
3. Verify logs show chosen `source_image_ref` and `source_resolution_mode`.
---
## 5. Validation Matrix
| ID | Trigger Path | Scenario | Expected Result |
|---|---|---|---|
| V1 | `workflow_run` | Upstream success + artifact present | Pass, uses `charon:artifact` |
| V2 | `workflow_run` | Upstream success + artifact missing | Hard fail (non-zero) |
| V3 | `workflow_run` | Upstream success + artifact manifest corrupted | Hard fail after validation/fallback attempt |
| V4 | `workflow_run` | Upstream success + upstream event `push` | Hard fail with `unsupported_upstream_event` |
| V5 | `pull_request` | Direct PR trigger | Pass, uses `charon:local`, no artifact lookup |
| V6 | `push` | Direct push trigger | Pass, uses `charon:local`, no artifact lookup |
| V7 | `workflow_dispatch` | Missing `pr_number` input | Hard fail immediately |
| V8 | `workflow_dispatch` | Valid `pr_number` + artifact exists | Pass, uses `charon:artifact` |
| V9 | `workflow_dispatch` | Valid `pr_number` + artifact missing | Hard fail |
| V10 | `workflow_run` | Upstream from untrusted repository context | Hard fail by trust-boundary guard |
---
## 6. Acceptance Criteria
1. Plan states unambiguous hard-fail behavior for missing artifact on
`workflow_run` after upstream `pull_request` success.
2. Plan states `workflow_run` event contract is PR-only and that upstream
`push` is a deterministic hard-fail contract violation.
3. Plan states one deterministic `workflow_dispatch` policy in SHALL terms:
required `pr_number`, artifact-only path, no local fallback.
4. Plan defines robust image resolution beyond `RepoTags[0]`, including
load-image-id fallback and deterministic aliasing.
5. Plan includes least-privilege permissions and explicit workflow_run trust
boundary constraints.
6. Plan includes validation coverage for `pull_request` and direct `push` local
paths plus negative paths: unsupported upstream event, missing dispatch
input, missing artifact, corrupted/missing manifest.
7. Root cause remains explicit: image-reference mismatch inside
`.github/workflows/security-pr.yml` after successful artifact load.
---
## 7. Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Overly strict dispatch policy blocks ad-hoc scans | Medium | Document explicit manual replay contract in workflow description |
| PR-only workflow_run contract fails upstream push-triggered runs | Medium | Intentional contract enforcement; document `unsupported_upstream_event` and route push scans through direct push path |
| Manifest parsing edge cases | Medium | Multi-source resolver with load-image-id fallback |
| Permission tightening breaks optional PR annotations | Low | Make PR-write permission step-scoped only if needed |
| Trust-boundary guards reject valid internal events | Medium | Add clear diagnostics and test cases V1/V10 |
---
## 8. PR Slicing Strategy
### Decision
Single PR.
### Trigger Reasons
1. Change is isolated to one workflow (`security-pr.yml`).
2. Deterministic policy + hardening are tightly coupled and safest together.
3. Split PRs would create temporary policy inconsistency.
### Ordered Slice
#### PR-1: Deterministic Policy and Security Hardening for `security-pr.yml`
Scope:
1. Deterministic missing-artifact handling (`workflow_run` hard fail).
2. Deterministic `workflow_dispatch` artifact-only policy.
3. Hardened image resolution and aliasing.
4. Least-privilege + trust-boundary constraints.
5. Validation matrix execution evidence.
Files:
1. `.github/workflows/security-pr.yml`
2. `docs/plans/current_spec.md`
Dependencies:
1. `.github/workflows/docker-build.yml` artifact naming contract unchanged.
Validation Gates:
1. actionlint passes.
2. Validation matrix V1-V10 results captured.
3. No regression to `ghcr.io/...:pr-<num> not found` pattern.
Rollback / Contingency:
1. Revert PR-1 if trust-boundary guards block legitimate same-repo runs.
2. Keep hard-fail semantics; adjust guard predicate, not policy.
---
## 9. Handoff
After approval, implementation handoff to Supervisor SHALL include:
1. Exact step-level edits required in `.github/workflows/security-pr.yml`.
2. Proof logs for each failed/pass matrix case.
3. Confirmation that no files outside plan scope were required.
3. Require explicit evidence that artifact path no longer performs GHCR PR tag
reconstruction.