Charon/docs/plans/current_spec.md

Overview

This document outlines an investigation and remediation plan for the error message: "Could not build remote workspace index. Could not check the remote index status for this repo." It contains diagnoses, exact checks, reproductions, fixes, CI/security checks, and acceptance criteria.

Summary: Background & likely root causes

  • Background: Remote workspace index builds (e.g., on GitHub or other code hosting platforms) analyze repository contents to provide search, code navigation, and other services. Index processes can fail due to repository content, metadata, workflows, or platform limitations.
  • Likely root causes:
    • Permission issues: index build needs read access or specific tokens (missing PAT or actions secrets, restricted branch protection blocking re-run).
    • Repository size limits: excessive repo size (large binaries, codeql databases, caddy caches) triggers failures.
    • Git LFS misconfiguration: large files stored in Git rather than LFS; remote indexer times out scanning big objects.
    • Cyclic symlinks or malformed symlinks: indexers can hit infinite loops or unsupported references.
    • Large or unsupported binary files: vendor/binaries and artifacts committed directly instead of using artifacts or releases.
    • Missing or stale code scanning artifacts (e.g., codeql-db) present in repo root that confuse the indexer.
    • Malformed .git/config or broken submodules: submodule references or malformed remote URL prevents indexer from checking status.
    • Network issues: temporary outages between indexer service and repo host.
    • GitHub Actions failures: workflows that set up the environment fail due to missing CI keys or secrets.
    • Branch protection / repo policies: preventing GitHub from performing necessary indexing operations.

Files and Locations To Inspect (exact)

  1. Repository configuration
    • .git/config — inspect remotes, submodules, and alternate refs
    • .gitattributes — look for LFS vs plainly tracked files
    • .gitignore — ensure generated artifacts and codeql-db are ignored
  2. CI and workflows
    • .github/workflows/**/* — workflows and secrets usage
    • .github/ — branch protection or other settings surfaced in the repo web UI
  3. Code scanning artifacts
    • codeql-db/ (codeql-db-go/, codeql-db-js/) — ensure these are not committed as code
    • backend/codeql-db/ or backend/data/ — artifacts that increase repo size and confuse the indexer
  4. Build & deploy config
    • Dockerfile — check for COPY of large files
    • docker-compose*.yml (docker-compose.yml, docker-compose.local.yml) — services and volumes
  5. Platform / metadata
    • .vscode/* — local workspace settings
    • go.work — ensure workspace Go settings are correct
  6. Frontend & package manifests
    • frontend/package.json — scripts and postinstall tasks that may run CI or build steps
    • package.json (root) — library or workspace settings affecting index size
  7. Binary caches & artifacts
    • data/ and backend/data/ — DB files, Caddy files, and caches
    • tools/ — large binaries or vendor files committed for local dev
  8. Git LFS and hooks
    • .git/hooks/* — check for LFS pre-commit/pre-push hooks
  9. GitHub admin side
    • Branch protection settings in the GitHub UI — check protected branches and required status checks
  10. Logs & scan results
    • .github/workflows action logs (in the GitHub Actions run view or via the gh CLI)
    • backend/test-output.txt and other preexisting logs

Commands & Logs To Collect (reproduction & evidence gathering)

  • Local repo inspection
    1. git status --porcelain --ignored — show ignored & modified files
    2. git fsck --full — scan the object store for corruption
    3. git count-objects -vH — get repo size (packed/unpacked)
    4. du -sh . and du -sh $(git rev-parse --show-toplevel)/* — check workspace size from the file system
    5. git rev-list --objects --all | sort -k 2 > allfiles.txt then git rev-list --objects --all | sort -k 2 | tail -n 50 — list all objects with their paths
    6. git verify-pack -v .git/objects/pack/pack-*.idx | sort -k 3 -n | tail -n 50 — inspect object packs and find the largest objects
    7. git lfs ls-files and git lfs env — check LFS-tracked files and the LFS environment
    8. find . -type f -size +100M -exec ls -lh {} \; — detect large files in the working tree
    9. rg -S "codeql-db|codeql database|codeql" -n — ripgrep for CodeQL references
    10. git submodule status --recursive — check submodules
    11. git config --list --show-origin — confirm config values and their origins (e.g., LFS config)
  • Reproducing the remote run locally / in GH Actions
    12. Use the gh CLI: gh run list --repo owner/repo then gh run view RUN_ID --log --repo owner/repo to collect logs
    13. Rerun the failing index action: gh run rerun RUN_ID --repo owner/repo (requires permissions)
    14. Recreate the CodeQL database locally to test: codeql database create codeql-db --language=go --command='cd backend && go test ./... -c' then codeql database analyze codeql-db codeql-custom-queries-go to check for broken DBs
    15. docker build --no-cache -t charon:local . and docker run --rm charon:local to catch missing git or LFS artifacts copied into images

  • Network & permissions
    16. Check PAT and Actions tokens: gh auth status and gh secret list --repo owner/repo, and verify repository secrets usage in workflows
    17. Verify branch protection policies via gh api repos/:owner/:repo/branches/:branch or through the GitHub UI
    18. Use curl -I https://api.github.com/repos/:owner/:repo to confirm GitHub's API reachability

Steps To Gather Evidence (sequence)

  1. Triage: Run the local repo inspections above (commands 1-11). Save outputs to scripts/diagnostics/repo_health.OUTPUT.
  2. Confirm large files or code scan DBs are present: find . -type d \( -name "codeql-db*" -o -name "db*" -o -name "node_modules" -o -name "vendor" \).
  3. Inspect git lfs ls-files and git lfs env for LFS misconfig.
  4. Collect CI history and logs for failing runs using gh run view --log for the action(s) that produce the index error.
  5. Verify permission and branch protections: gh api queries & gh secret list --repo.
  6. Verify Dockerfile and workflow steps that might include copying large artifacts and causing indexer timeouts.
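The triage sequence above could be wrapped in a small collection script. A hedged sketch: the scripts/diagnostics output location comes from this plan, while the save_cmd helper and the per-command file naming are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch: collect the triage outputs listed above into one directory.
# The save_cmd helper and file layout are illustrative, not an existing script.
set -u   # no -e: individual probes may fail without aborting the triage run

OUT_DIR="${OUT_DIR:-scripts/diagnostics}"

# Run a command, recording the command line and its stdout+stderr
# to "<OUT_DIR>/<label>.txt".
save_cmd() {
  local label="$1"; shift
  mkdir -p "$OUT_DIR"
  printf '$ %s\n' "$*" > "$OUT_DIR/$label.txt"
  "$@" >> "$OUT_DIR/$label.txt" 2>&1
}

collect_all() {
  save_cmd git-status  git status --porcelain --ignored
  save_cmd git-fsck    git fsck --full
  save_cmd git-size    git count-objects -vH
  save_cmd lfs-files   git lfs ls-files
  save_cmd large-files find . -type f -size +100M -not -path './.git/*'
}

# Only collect when invoked with --run, so the file can be sourced safely.
if [ "${1:-}" = "--run" ]; then
  collect_all
fi
```

The resulting directory can then be attached to the diagnostic report referenced in Phase 1.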

Short-Term Mitigations (quick steps)

  1. Disable or re-run failing index builds with proper permissions: gh run rerun RUN_ID --repo owner/repo.
  2. Move large artifacts out of the repo: store them externally (releases, artifact storage, or an untracked data/ directory) and add codeql-db*, tools/ binaries, data/, and backend/data/ to .gitignore and .gitattributes so they are no longer committed or indexed.
  3. Add codeql-db to .gitignore and create .gitattributes entries to ensure no binary files are tracked without LFS:
    • .gitattributes additions: codeql-db/** binary to mark CodeQL database files as binary.
    • Update .gitignore: add /codeql-db, /data, /backend/data, and /.vscode/.
  4. Re-run the index build once large files are removed.

Medium-Term Fixes

  1. Git filter-repo/BFG: If large files are committed, remove them from history using git filter-repo or BFG and force-push a cleaned branch: git filter-repo --strip-blobs-bigger-than 50M.
  2. Convert large tracked binaries to Git LFS: git lfs track "*.iso" and git lfs track "*.db", then git add .gitattributes && git commit -m "Track binaries in LFS" && git push.
  3. Prevent accidental artifacts in the future with a pre-commit hook (add to scripts/pre-commit-hooks and reference in .git/hooks or pre-commit framework): run pre-commit rule to enforce max-file-size and LFS checks.
  4. Add GH actions health-check workflow (e.g., .github/workflows/repo-health.yml) that runs a small script to check for large files, LFS config, and codeql-db folders and opens an issue if thresholds are exceeded.
  5. Document LFS / large file policy in docs/ (e.g., docs/getting-started.md and docs/features.md) and add codeowner references.

Longer-Term Hardening

  1. Automate periodic repo health checks (monthly) via GH Actions to run git count-objects -v and a find size check (e.g., -size +100M, as used elsewhere in this plan), and warn maintainers.
  2. Add a repository dashboard for codeql-db and other artifacts using scripts/ and a GitHub Action to report stats to PRs or issues.
  3. Harden the remote indexing process by requesting GitHub support if the issue is intermittent and caused by GitHub's indexer failure.
  4. Add a scripts/ci_pre_check.sh script to run on PR open that checks git fsck, LFS, and ensures codeql-db is not committed.
  5. Add a scripts/repo_health_check.sh file and include it in .vscode/tasks.json and as a GH Action to be optionally invoked by maintainers.
Detailed Implementation Tasks

  1. Add .gitignore and .gitattributes updates

    • Files to edit: .gitignore, .gitattributes.
    • Steps:
      • Add /codeql-db, /codeql-db-*, /.vscode/, /backend/data, /data, /node_modules to .gitignore.
      • Add *.db filter=lfs diff=lfs merge=lfs -text to .gitattributes for DB files and codeql-db/** binary to mark codeql files as binary.
      • Commit and push.
    • Tests & checks:
      • Locally run git status --ignored to ensure these are now ignored.
      • Run git lfs ls-files to ensure large files are LFS tracked if intended.
      • CI: Ensure pre-commit checks pass, run gh run view to verify indexing.
  2. Add small repo-health GH Action

    • Files to add: .github/workflows/repo-health.yml, scripts/repo_health_check.sh.
    • Steps:
      • Implement scripts/repo_health_check.sh that runs git count-objects -vH, find . -size +100M and git fsck and prints a short JSON summary.
      • Add repo-health.yml with a scheduled trigger and PR check to run the script.
    • Tests & checks:
      • Run bash scripts/repo_health_check.sh locally. Ensure it exits 0 when checks are clean.
      • CI: Ensure the workflow runs and reports results in the Actions tab.
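A minimal sketch of what scripts/repo_health_check.sh could look like, following the description above (git count-objects -vH, a find for files over 100M, git fsck, and a short JSON summary, exiting 0 when clean). The helper names count_large_files and json_summary are illustrative, not an existing API.

```shell
#!/usr/bin/env bash
# Sketch of scripts/repo_health_check.sh: run a few git checks and print a
# short JSON summary; exits non-zero when any check fails.
set -u

MAX_SIZE="${MAX_SIZE:-+100M}"

count_large_files() {            # count working-tree files larger than MAX_SIZE
  find . -type f -size "$MAX_SIZE" -not -path './.git/*' | wc -l | tr -d ' '
}

json_summary() {                 # args: <large_file_count> <fsck_ok 0|1>
  printf '{"large_files":%s,"fsck_ok":%s}\n' "$1" "$2"
}

main() {
  local large fsck_ok=1
  git count-objects -vH >&2     # informational: pack/object sizes
  large=$(count_large_files)
  git fsck --full >/dev/null 2>&1 || fsck_ok=0
  json_summary "$large" "$fsck_ok"
  [ "$large" -eq 0 ] && [ "$fsck_ok" -eq 1 ]
}

# Guarded so the helpers can be sourced/tested outside a git repo.
if [ "${1:-}" = "--run" ]; then
  main
fi
```

The JSON line is easy to parse from a workflow step that decides whether to open an issue.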
  3. Build-time protections for codeql artifacts

    • Files to edit: .github/workflows/ci.yml (or equivalent CI) and .gitattributes / .gitignore.
    • Steps:
      • Remove codeql-db directories from CI cache and artifact paths; don't commit them.
      • Ensure CodeQL analysis workflow uses the actions/cache and actions/upload-artifact correctly, not storing DBs in the repo.
    • Tests & checks:
      • Re-run gh actions CodeQL workflow: gh run rerun RUN_ID --repo owner/repo and verify action no longer stores DBs as code.
  4. Pre-commit hook for large files & LFS enforcement

    • Files to add: scripts/pre-commit-hooks/check-large-file.sh, and enable via .pre-commit-config.yaml or scripts/pre-commit-install.sh.
    • Steps:
      • Implement a hook to fail commits larger than 50MB unless tracked in LFS.
      • Add to pre-commit config and install in the repo.
    • Tests & checks:
      • Attempt to commit a test large file > 50MB to verify the commit is rejected unless LFS tracked.
      • CI: Add a PR check running pre-commit to ensure commits follow policy.
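A sketch of the large-file/LFS pre-commit hook described above, assuming the 50MB limit and using git check-attr to detect LFS-routed paths. The helper names are illustrative, and this sketch does not handle filenames containing spaces.

```shell
#!/usr/bin/env bash
# Sketch of scripts/pre-commit-hooks/check-large-file.sh: reject staged files
# over 50MB unless .gitattributes routes them to Git LFS.
set -u

LIMIT_BYTES=$((50 * 1024 * 1024))   # 50MB

file_size() {                    # portable size-in-bytes for one file
  wc -c < "$1" | tr -d ' '
}

is_lfs_tracked() {               # does .gitattributes route this path to LFS?
  git check-attr filter -- "$1" 2>/dev/null | grep -q 'filter: lfs'
}

check_staged() {
  local rc=0 f
  # Note: word splitting means paths with spaces are not handled here.
  for f in $(git diff --cached --name-only --diff-filter=AM); do
    [ -f "$f" ] || continue
    if [ "$(file_size "$f")" -gt "$LIMIT_BYTES" ] && ! is_lfs_tracked "$f"; then
      echo "ERROR: $f exceeds 50MB and is not tracked by Git LFS" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Guarded so the helpers can be sourced/tested outside a git repo.
if [ "${1:-}" = "--run" ]; then
  check_staged
fi
```

Wired into .pre-commit-config.yaml, a non-zero exit blocks the commit, which is exactly the behavior the test step above verifies with a >50MB file.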
  5. CI policy verification for branches

    • Files / settings to revise: .github/workflows for runner permissions, branch protection settings via GitHub UI
    • Steps:
      • Confirm user-level or organization-level actions and workflows permissions allow required actions to run indexers.
      • Modify workflow triggers: ensure that pull_request and push do not include large artifacts or directories.
    • Tests & checks:
      • Open PR with scripts/ changes to trigger the updated workflows; verify that they run and pass.
  6. Automated Monitoring & Alerts

    • Files to add: .github/workflows/monitor-repo.yml, scripts/repo-monitor.sh.
    • Steps:
      • Implement periodic monitoring workflow to run repo health checks and open an issue or send a slack message when thresholds crossed.
    • Tests & checks:
      • Local run and scheduled run for the workflow to prove the alert state.
  7. Documentation updates

    • Files to update: docs/getting-started.md, docs/features.md, docs/security.md, CONTRIBUTING.md
    • Steps:
      • Add guidelines for how to store large artifacts, a policy to use Git LFS, instructions on running scripts/repo_health_check.sh.
    • Tests & checks:
      • Verify that the docs build and CI pass with the updated instructions.
  8. CI Integration Tests to validate fixes

    • Files to add/edit: .github/workflows/ci.yml, backend build scripts, frontend script checks
    • Steps:
      • Add a ci.yml step to run bash scripts/repo_health_check.sh, go test ./... and npm run build (frontend) as a gating check.
    • Tests & checks:
      • Ensure go build, go test, and npm run build pass in CI after changes.
  9. Forced cleanup and migration of large objects (if necessary)

    • Files to change: none—these are history-edit operations
    • Steps:
      • If large files are present and the repo will not admit LFS for them, use git filter-repo --strip-blobs-bigger-than 50M or BFG and push to a cleaned branch.
      • Recreate workflows or branch references as needed after forced push.
    • Tests & checks:
      • Run git count-objects -vH before/after and verify the pack size decreases significantly.
  10. Validate GH Actions & secrets

  • Files to inspect: .github/workflows/* and GitHub repo settings (secrets)
  • Steps:
    • Ensure that required secrets, PATs, or GITHUB_TOKEN usage is correct; verify actions/checkout is configured for LFS: actions/checkout@v2 with fetch-depth: 0 and lfs: true.
  • Tests & checks:
    • Rerun an action to ensure git lfs ls-files lists expected 4-5 known files and that the gh run does not fail with read errors.
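The checkout configuration mentioned above might look like the following workflow fragment (a sketch: the plan references actions/checkout@v2; fetch-depth: 0 and lfs: true are the standard inputs for full history and LFS object download):

```yaml
# Sketch of a checkout step with LFS enabled, per the item above.
- uses: actions/checkout@v2   # the plan pins v2; newer major versions exist
  with:
    fetch-depth: 0            # fetch full history
    lfs: true                 # download Git LFS objects during checkout
```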

Phased Work Plan

Phase 1 — Triage & evidence collection (2-3 days)

  • Tasks:
    1. Run all diagnostic commands and collect output (scripts/diagnostics/repo_health.OUTPUT).
    2. Collect failing GH Action run logs and CodeQL run logs via gh run view.
    3. Determine whether the failure is reproducible or intermittent.
  • Acceptance criteria:
    • Have a full diagnostic report that identifies the top two likely causes.
    • Have the failing Action ID or workflow file referenced.

Phase 2 — Short-term fixes & re-run (1-2 days)

  • Tasks:
    1. Add immediate .gitignore and .gitattributes protections for known artifacts.
    2. Add scripts/repo_health_check.sh, a .github/workflows/repo-health.yml workflow, and pre-commit LFS checks.
    3. Re-run GH Actions and index builds to check for improvement.
  • Acceptance criteria:
    • Repo health workflow runs on PRs and schedule; the health check succeeds or reports actionable items.
    • Index build does not fail due to size or missing LFS objects on re-run.
    • No unexpected artifacts are present in commits.
    • Pre-commit hooks block large files that are not tracked by Git LFS.

Status: Short-term fixes implemented (gitattributes, pre-commit LFS check, health script, scheduled workflow).

Phase 3 — Medium-term fixes (2-5 days)

  • Tasks:
    1. Add pre-commit hooks and a …

CrowdSec Hub Presets Sync & Apply Plan (feature/beta-release)

🚨 CI/CD Incident Report - Run 20046135423-29 (2025-12-08 23:20 UTC)

Status: ALL BUILDS FAILING on feature/beta-release
Trigger: Push of commit 571a61a (CrowdSec cscli installation)
Impact: Docker publish blocked, codecov upload failed, all integration tests skipped

Root Causes Identified

1. CRITICAL: Missing Frontend File frontend/src/data/crowdsecPresets.ts

  • Evidence: TypeScript compilation fails in Docker build, frontend tests, and WAF integration
  • Error: Cannot find module '../data/crowdsecPresets' or its corresponding type declarations
  • Affected Jobs:
    • Run 20046135429 (Docker Build) - exit code 2 at Dockerfile:47
    • Run 20046135423 (Frontend Codecov) - 2 test suites failed
    • Run 20046135424 (WAF Integration) - Docker build failed
    • Run 20046135426 (Quality Checks - Frontend) - test failures
  • Files Importing Missing Module:
    • frontend/src/pages/CrowdSecConfig.tsx:17
    • frontend/src/pages/__tests__/CrowdSecConfig.spec.tsx:13
    • frontend/src/pages/__tests__/CrowdSecConfig.test.tsx (indirect via CrowdSecConfig.tsx)
  • Type Errors Cascade:
    • CrowdSecConfig.tsx(86,52): error TS7006: Parameter 'preset' implicitly has an 'any' type
    • CrowdSecConfig.tsx(92-96): error TS2339: Property 'title'|'description'|'content'|'tags'|'warning' does not exist on type '{}'
    • 10 TypeScript errors total prevent npm build completion
  • Git History: File never existed in repository; referenced in current_spec.md line 6 but never committed
  • Remediation: Create frontend/src/data/crowdsecPresets.ts with CROWDSEC_PRESETS constant and CrowdsecPreset type export

2. Backend Coverage Below Threshold (84.8% < 85.0% required)

  • Evidence: Go test suite passes all tests but coverage check fails
  • Error: Coverage 84.8% is below required 85% (set CHARON_MIN_COVERAGE or CPM_MIN_COVERAGE to override)
  • Affected Job: Run 20046135423 (Backend Codecov Upload) - exit code 1
  • Impact: Codecov upload skipped, quality gate not met
  • Analysis: Recent commits added CrowdSec hub sync code without corresponding unit tests
  • Likely Contributors:
    • Commit be2900b: "add HUB_BASE_URL configuration and enhance CrowdSec hub sync"
    • Commit 571a61a: "install CrowdSec CLI (cscli) in Docker runtime"
    • New code in backend/internal/crowdsec/ lacks test coverage
  • Remediation Options:
    1. Add unit tests for new CrowdSec hub sync functions to reach 85%+ coverage
    2. Temporarily lower threshold via CHARON_MIN_COVERAGE=84 (not recommended for merge)
    3. Exclude untested experimental code from coverage calculation until implementation complete

3. Frontend Test Failures (2 test suites failed, 587 tests passed)

  • Evidence: Vitest reports 2 failed suites due to missing module
  • Affected Job: Run 20046135423 (Frontend Codecov Upload) & Run 20046135426 (Quality Checks)
  • Failed Suites:
    • src/pages/__tests__/CrowdSecConfig.spec.tsx
    • src/pages/__tests__/CrowdSecConfig.test.tsx
  • Root Cause: Same as #1 - missing crowdsecPresets.ts file
  • Consequence: Frontend coverage calculation incomplete, 587 other tests pass

4. Docker Multi-Arch Build Failure (linux/amd64, linux/arm64)

  • Evidence: Build canceled at frontend stage with TypeScript errors
  • Affected Job: Run 20046135429 (Docker Build, Publish & Test)
  • Build Stages:
    • Stage frontend-builder 6/6 failed during npm run build
    • Stage backend-builder canceled due to the frontend failure
    • No images pushed to ghcr.io, Trivy scan skipped
  • Root Cause: Same as #1 - TypeScript compilation blocked by missing module
  • Downstream Impact: Test Docker Image job skipped (no image available)

5. WAF Integration Tests Skipped

  • Evidence: Docker build failed before tests could run
  • Affected Job: Run 20046135424 (WAF Integration Tests)
  • Build Step Failure: Same TypeScript errors at Dockerfile:47
  • Container Status: charon-debug container never created
  • Root Cause: Same as #1 - build precondition not met

Are These Fixed by Recent Commits?

NO - Analysis of commits since 571a61a (the triggering commit at 2025-12-08 23:19:38Z):

  • Commit 8f48e03: Merge development → feature/beta-release (no fixes)
  • Commit 32ed8bc: Merge PR #332 development → feature/beta-release (no fixes)
  • Latest commit on feature/beta-release: 32ed8bc (2025-12-09 00:26:07Z)
  • Missing file still not present in workspace or git history
  • Coverage issue unaddressed - no new tests added

Required Remediation Steps

IMMEDIATE (blocks all CI):

  1. Create Missing Frontend File
    # Create frontend/src/data/crowdsecPresets.ts with structure:
    export interface CrowdsecPreset {
      slug: string;
      title: string;
      description: string;
      content: string;
      tags: string[];
      warning?: string;
    }
    export const CROWDSEC_PRESETS: CrowdsecPreset[] = [
      // Populate from backend/internal/crowdsec/presets.go or empty array
    ];
    
  2. Verify TypeScript Compilation
    cd frontend && npm run build
    cd frontend && npm run test:ci
    

REQUIRED (for merge):

  1. Add Backend Unit Tests for CrowdSec Hub Sync
    • Target files: backend/internal/crowdsec/hub_sync.go, hub_cache.go
    • Create: backend/internal/crowdsec/hub_sync_test.go, hub_cache_test.go
    • Achieve: ≥85% coverage threshold
  2. Run Full CI Validation
    # Backend
    cd backend && go test ./... -v -coverprofile=coverage.txt
    # Frontend
    cd frontend && npm run test:ci
    # Docker
    docker build --platform linux/amd64 -t charon:test .
    

OPTIONAL (technical debt):

  1. Update Documentation
    • Fix docs/plans/current_spec.md line 6 reference to non-existent file
    • Add troubleshooting entry for missing preset file scenario
  2. Add Pre-commit Hook
    • Validate TypeScript imports resolve before commit
    • Block commits with missing module references

Prevention Measures

  • Pre-commit validation: TypeScript type checking must pass (tsc --noEmit)
  • Coverage enforcement: CI should fail immediately when coverage drops below threshold
  • Integration test gating: Block merge if Docker build fails on any platform
  • Module existence checks: Lint for import statements referencing non-existent files
  • Test coverage for new features: Require tests in same commit as feature code

Current State (what exists today)

  • Backend: backend/internal/api/handlers/crowdsec_handler.go exposes ListPresets (returns curated list from backend/internal/crowdsec/presets.go) and a stubbed PullAndApplyPreset that only validates slug and returns preview or HTTP 501 when apply=true. No real hub sync or apply.
  • Backend uses CommandExecutor for cscli decisions only; no hub pull/install logic and no cache/backups beyond file write backups in WriteFile and import flow.
  • Frontend: frontend/src/pages/CrowdSecConfig.tsx calls pullAndApplyCrowdsecPreset then falls back to local writeCrowdsecFile apply. Preset catalog merges backend list with frontend/src/data/crowdsecPresets.ts. Errors 501/404 are surfaced as info to keep local apply working. Overview toggle/start/stop already wired to startCrowdsec/stopCrowdsec.
  • Docs: docs/cerberus.md still notes CrowdSec integration is a placeholder; no hub sync described.

Recent updates (2025-12-09)

  • Backed up and removed local codeql-db* directories from the working tree to data/backups/codeql-db-backup-<timestamp>.tar.gz to avoid indexer confusion and reduce working tree noise.
  • Added a pre-commit hook scripts/pre-commit-hooks/block-codeql-db-commits.sh and enabled it in .pre-commit-config.yaml to prevent committing codeql-db artifacts.
  • Added a repo health check and CI safety steps; ran scripts/repo_health_check.sh locally and confirmed it exits OK after directory removal.
  • Created a local backup: data/backups/codeql-db-backup-20251209T015533Z.tar.gz (not tracked).

Next steps

  • Consider history cleaning if archived codeql DBs affect size or packs; recommend git filter-repo with the --strip-blobs-bigger-than option and a clear backup/rewrite plan.
  • Run git gc --prune=now and git fsck --full to clean garbage objects after a scripted history rewrite or large object removals.

History Rewrite Plan (if required)

  1. Confirm the set of blobs & paths that must be removed from history:
    • git rev-list --objects --all | sort -k2 | rg "codeql-db|codeql-db-" and git verify-pack -v .git/objects/pack/*.idx | sort -k3 -n | tail -n 300.
  2. Create a safe snapshot & announce the planned rewrite to maintainers:
    • Create git branch backup/main-YYYYMMDD and push it to origin as a backup branch (do not delete yet).
  3. Run git filter-repo to remove the heavy blobs or paths:
    • Example: git filter-repo --invert-paths --paths codeql-db --paths backend/codeql-db --paths codeql-db-js --paths codeql-db-go
    • Or git filter-repo --strip-blobs-bigger-than 50M to strip large blobs.
  4. Validate the repo after the rewrite:
    • git count-objects -vH should show shrunken pack sizes.
    • Run CI checks locally: backend Go tests, frontend build, pre-commit run --all-files.
  5. Coordinate the forced push & relay steps to contributors:
    • git push --force --all and git push --force --tags (announce to collaborators).
    • Ask maintainers to rebase/force-pull their forks and branches.
  6. Ensure post-clean tasks:
    • Add a branch protection policy to block force pushes to main except when required, with documented approvals.
    • Create a short script & docs (scripts/repair_after_filter_repo.sh) for maintainers describing the rebase steps.
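Step 1 of the plan above can be sharpened with a helper that lists the largest blobs in history together with their paths, using git cat-file --batch-check to resolve sizes (the largest_blobs function name is illustrative):

```shell
#!/usr/bin/env bash
# Sketch: list the largest blobs in history with their paths, to confirm what
# a filter-repo run would remove. Output: <size-bytes> <sha> <path>.
set -u

largest_blobs() {                # arg: how many entries to show (default 20)
  local top="${1:-20}"
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    awk '$1 == "blob" {print $2, $3, $4}' |   # keep blobs: size, sha, path
    sort -rn |
    head -n "$top"
}

# Guarded so the function can be sourced/tested outside a git repo.
if [ "${1:-}" = "--run" ]; then
  largest_blobs "${2:-20}"
fi
```

Comparing this listing before and after the rewrite doubles as the validation in step 4.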

Notes:

  • History rewrite is destructive; only do after explicit approval and scheduling during a low-impact window.
  • If the repo has widely used forks or CI jobs referencing old commit hashes, establish a temporary redirect communication plan.

History rewrite summary (safe workflow)

For repository history cleanup to remove committed CodeQL DBs or large blobs, the repo now contains a small set of tools under scripts/history-rewrite to help plan and safely execute this action. They are:

  • scripts/history-rewrite/clean_history.sh — Preview and optionally (with --force) run a git-filter-repo history rewrite. Default is --dry-run and the script creates a timestamped backup branch named backup/history-YYYYMMDD-HHMMSS before any destructive operations. The script logs operations to data/backups/history_cleanup-YYYYMMDD-HHMMSS.log and prints next-step instructions. Do NOT run --force on main or master and coordinate with maintainers before force-pushing.

  • scripts/history-rewrite/preview_removals.sh — Print commit/object lists and example large blobs relevant to the paths and strip size for verification.

  • scripts/history-rewrite/validate_after_rewrite.sh — Run git fsck, git count-objects -vH, pre-commit hooks, backend go test ./..., and frontend npm run build to verify the repository after a rewrite.

Quick clean_history.sh usage examples

  • Dry-run:

    • scripts/history-rewrite/clean_history.sh --dry-run --paths 'backend/codeql-db,codeql-db' --strip-size 50
    • This logs what would be removed without making any changes; review data/backups/history_cleanup-*.log for details.
  • Preview only:

    • scripts/history-rewrite/preview_removals.sh --paths 'backend/codeql-db,codeql-db' --strip-size 50
  • Destructive rewrite (ONLY after approval):

    • scripts/history-rewrite/clean_history.sh --force --paths 'backend/codeql-db,codeql-db' --strip-size 50
    • It will create backup/history-YYYYMMDD-HHMMSS, prompt for explicit confirmation I UNDERSTAND, run git filter-repo locally, then run git fsck and git gc.
    • After rewrite, do not auto-push; perform git push --all --force and git push --tags --force only after team approval.

Warnings & notes

  • The scripts only prepare and perform the rewrite locally; they will not force-push to remote unless you do so manually.
  • Avoid running --force on main or master. Use a feature branch or a controlled clone.
  • The rewrite is destructive; maintainers must rebase or re-clone after force-push.
  • Always verify with scripts/history-rewrite/validate_after_rewrite.sh before any force push.

Incident Triage: CrowdSec preset pull/apply 502/500 (feature/beta-release)

  • Logs to pull first: backend app/GIN logs under /app/data/logs/charon.log (or data/logs/charon.log in dev) via backend/cmd/api/main.go; look for warnings "crowdsec preset pull failed" / "crowdsec preset apply failed" emitted in backend/internal/api/handlers/crowdsec_handler.go. Access logs will also show 502/500 for POST /api/v1/admin/crowdsec/presets/pull and /apply.
  • Routes and code paths: handlers PullPreset and ApplyPreset live in backend/internal/api/handlers/crowdsec_handler.go and delegate to HubService.Pull/Apply in backend/internal/crowdsec/hub_sync.go with cache helpers in backend/internal/crowdsec/hub_cache.go. Data dir used is data/crowdsec with cache under data/crowdsec/hub_cache from backend/internal/api/routes/routes.go.
  • Quick checks before repro: (1) Cerberus enabled (feature.cerberus.enabled setting or FEATURE_CERBERUS_ENABLED/CERBERUS_ENABLED env) or handler returns 404 early; (2) cscli on PATH and executable (HubService uses real executor and calls cscli version/cscli hub install); (3) outbound HTTPS to https://hub.crowdsec.net reachable (fallback after cscli hub list); (4) cache dir writable data/crowdsec/hub_cache and contains per-slug metadata.json, bundle.tgz, preview.yaml; (5) backup path writable (apply renames data/crowdsec to data/crowdsec.backup.<ts>).
  • Likely 502 on pull: hub cache unavailable or init failed (cache dir permission), invalid slug, hub index fetch errors (cscli hub list -o json or direct GET /api/index.json), download blocked/size >25MiB, preview/download HTTP non-200, or cache write errors. Handler logs warning and returns 502 with error string.
  • Likely 500 on apply: backup rename fails, cscli install fails with no cache fallback (if pull never succeeded or cache expired/missing), cache read errors (metadata.json/bundle.tgz unreadable), tar extraction rejects symlinks/unsafe paths, or rollback after extract failure. Handler writes CrowdsecPresetEvent (if DB reachable) with backup path and returns 500 with backup hint.
  • Validation steps during triage: verify cache entry freshness (TTL 24h) via metadata.json timestamps; confirm cscli hub install <slug> succeeds manually; if cscli missing, ensure prior pull populated cache; test hub egress with curl to hub index and archive URLs; check file ownership/permissions on data/crowdsec and data/crowdsec/hub_cache; confirm log lines around warnings for exact error message; inspect backup directory to restore if partial apply.
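The quick checks above can be sketched as a script. A hedged sketch: the cache directory and hub URL come from this plan; the check_* and report helper names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch of the pre-repro quick checks: cscli availability, hub egress,
# and cache-dir writability. Paths/URL default to the values in this plan.
set -u

CACHE_DIR="${CACHE_DIR:-data/crowdsec/hub_cache}"
HUB_URL="${HUB_URL:-https://hub.crowdsec.net}"

check_cscli() {                  # cscli on PATH and runnable
  command -v cscli >/dev/null 2>&1 && cscli version >/dev/null 2>&1
}

check_hub_egress() {             # outbound HTTPS to the hub reachable
  curl -fsS --max-time 10 -o /dev/null "$HUB_URL"
}

check_cache_writable() {         # per-slug cache dir exists and is writable
  mkdir -p "$CACHE_DIR" 2>/dev/null && [ -w "$CACHE_DIR" ]
}

report() {                       # args: <name> <exit-code>
  if [ "$2" -eq 0 ]; then echo "OK: $1"; else echo "FAIL: $1"; fi
}

main() {
  check_cscli;          report "cscli on PATH and runnable" $?
  check_hub_egress;     report "outbound HTTPS to hub"      $?
  check_cache_writable; report "cache dir writable"         $?
}

# Guarded so the helpers can be sourced/tested in any environment.
if [ "${1:-}" = "--run" ]; then
  main
fi
```

A FAIL line maps directly onto one of the likely 502/500 causes listed above.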

Current incident: preset apply returning "Network Error" (feature/beta-release)

  • What we see: frontend reports axios "Network Error" while applying a preset. Backend logs do not yet show the apply warning, suggesting the client drops before an HTTP response arrives. Apply path runs HubService.Apply in backend/internal/crowdsec/hub_sync.go with a 15s context; pull uses a 10s HTTP client timeout and does not follow redirects. Axios flags a network error when the TCP connection is reset/timeout rather than when a 4xx/5xx is returned.
  • Probable roots to verify quickly:
    • Hub index/preview/archives now redirect to another host; our HTTP client forbids redirects, so FetchIndex/Pull return an error and the handler responds 502 only after the hub timeout. Long hub connect attempts can hit the 10s client timeout, causing the upstream (Caddy) or browser to drop the socket and surface a network error.
    • Runtime image may be missing cscli if the release archive layout changed; Dockerfile only moves the binaries when expected paths exist. Without cscli, Apply falls back to cache, but if Pull already failed, Apply exits with an error and no response body. Validate cscli version inside the running container built from feature/beta-release.
    • Outbound egress/proxy: container must reach https://hub-data.crowdsec.net (default) from within the Docker network. Missing HTTP(S)_PROXY/NO_PROXY or a transparent MITM can cause TLS handshake or connection timeouts that the client reports as network errors.
    • TLS/HTML responses: hub returning HTML (maintenance/Cloudflare) or a 3xx/302 to http is treated as an error (hub index responded with HTML), which becomes 502. If the redirect/HTML arrives after ~10s the browser may already have given up.
    • Timeout budget: 10s pull / 15s apply may be too tight for hub downloads + cscli install. When the context cancels mid-stream, gin closes the connection and axios logs network error instead of an HTTP code.
  • Remediation plan (no code yet):
    • Confirm cscli exists in the runtime image from Dockerfile by running cscli version inside the failing container; if missing, adjust build or add a startup preflight that logs absence and forces HTTP hub path.
    • Override HUB_BASE_URL to a known JSON endpoint (e.g., https://hub-data.crowdsec.net/api/index.json) when redirects occur, or point to an internal mirror reachable from the Docker network; document this in env examples.
    • Ensure outbound 443 to hub-data is allowed or set HTTP(S)_PROXY/NO_PROXY on the container; retry pull/apply after validating curl -v https://hub-data.crowdsec.net/api/index.json inside the runtime.
    • Consider raising pull/apply timeouts (and matching frontend request timeout) and log when contexts cancel so we return a 504/timeout JSON instead of a dropped socket.
    • Capture docker logs for charon-debug during repro; look for crowdsec preset pull/apply failed warnings and any TLS/redirect messages from backend/internal/crowdsec/hub_sync.go.

Goal

Implement real CrowdSec Hub preset sync + apply on backend (using cscli or direct hub index) with caching, validation, backups, rollback, and wire the UI to new endpoints so operators can preview/apply hub items with clear status/errors.

Backend Plan (handlers, helpers, storage)

  1. Route adjustments (gin group under /admin/crowdsec in backend/internal/api/handlers/crowdsec_handler.go):
    • Replace stub endpoint with POST /admin/crowdsec/presets/pull → fetch hub item and cache; returns metadata + preview + cache key/etag.
    • Add POST /admin/crowdsec/presets/apply → apply previously pulled item by cache key/slug; performs backup + cscli install + optional restart.
    • Keep GET /admin/crowdsec/presets but include hub/etag info and whether cached locally.
    • Optional: GET /admin/crowdsec/presets/cache/:slug → raw preview/download for UI.
  2. Hub sync helper (new backend/internal/crowdsec/hub_sync.go):
    • Provide type HubClient interface { FetchIndex(ctx) (HubIndex, error); FetchPreset(ctx, slug) (PresetBundle, error) } with real impl using either: a) cscli hub list -o json and cscli hub update + cscli hub install <item> (preferred if cscli present), or b) direct fetch of https://hub.crowdsec.net/ or GitHub raw .index.json + tarball download.
    • Validate downloads: size limits, tarball path traversal guard, checksum/etag compare, basic YAML validation.
  3. Caching (new backend/internal/crowdsec/hub_cache.go):
    • Cache pulled bundles under ${DataDir}/hub_cache/<slug>/ with index metadata (etag, fetched_at, source URL) and preview YAML.
    • Expose LoadCachedPreset(slug) and StorePreset(slug, bundle); evict stale on TTL (configurable, default 24h) or when etag changes.
  4. Apply flow (extend handler):
    • Pull: fetch index, resolve slug, download bundle to cache, return preview + warnings (missing cscli, requires restart, etc.).
    • Apply: before modifying, create backupDir := DataDir + ".backup." + timestamp (mirroring current write/import backups). Then: a) If cscli is available: cscli hub update, cscli hub install <slug> (or collection path), optionally a cscli decisions list sanity check; use CommandExecutor with a context timeout. b) If cscli is absent: extract the bundle into DataDir with sanitized paths; preserve permissions. c) Write an audit record to DB table crowdsec_preset_events (new model in backend/internal/models).
    • On failure: restore backup (rename back), surface error + backup path.
  5. Status and restart:
    • After apply, optionally call h.Executor.Stop/Start if running to reload config; or cscli service reload when available. Return reload_performed flag.
  6. Validation & security hardening:
    • Enforce Cerberus enablement check (isCerberusEnabled) on all new routes.
    • Path sanitization with filepath.Clean, limit tar extraction to DataDir, reject symlinks/abs paths.
    • Timeouts on all external calls; default 10s pull, 15s apply.
    • Log with context: slug, etag, source, backup path; redact secrets.
  7. Migration of curated list:
    • Keep curated presets in backend/internal/crowdsec/presets.go but add Source: "hub" for hub-backed items and include RequiresHub true when not bundled.
    • ListPresets should merge curated + live hub index when available, mark availability per slug (cached, remote-only, local-bundled).
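
The tarball path-traversal guard mentioned in the validation hardening above could look roughly like this. A minimal sketch, assuming a helper name (`sanitizeTarPath`) that is not in the codebase; symlink entries should additionally be rejected by checking the tar header type before calling it.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// sanitizeTarPath rejects absolute paths and ".." escapes, and confines every
// archive entry to the destination directory (DataDir or the hub cache).
func sanitizeTarPath(dest, name string) (string, error) {
	clean := filepath.Clean(name)
	if filepath.IsAbs(clean) || clean == ".." || strings.HasPrefix(clean, ".."+string(os.PathSeparator)) {
		return "", fmt.Errorf("illegal path in archive: %q", name)
	}
	target := filepath.Join(dest, clean)
	// Even after Clean, verify the joined path stays under dest.
	if !strings.HasPrefix(target, filepath.Clean(dest)+string(os.PathSeparator)) {
		return "", fmt.Errorf("path escapes destination: %q", name)
	}
	return target, nil
}

func main() {
	if _, err := sanitizeTarPath("/data/hub_cache/x", "../../etc/passwd"); err == nil {
		panic("traversal should be rejected")
	}
	p, err := sanitizeTarPath("/data/hub_cache/x", "scenarios/http-probing.yaml")
	fmt.Println(p, err)
}
```

The same guard serves both the cache extraction (item 3) and the no-cscli apply fallback (item 4).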

Frontend Plan (API wiring + UX)

  1. API client updates in frontend/src/api/presets.ts:
    • Replace pullAndApplyCrowdsecPreset with pullCrowdsecPreset({ slug }) and applyCrowdsecPreset({ slug, cache_key }); include response typing for preview/status/errors.
    • Add getCrowdsecPresetCache(slug) if backend exposes cache preview.
  2. CrowdSec config page frontend/src/pages/CrowdSecConfig.tsx:
    • Use new mutations: pull to show preview + metadata (etag, fetched_at, source); disable local fallback unless backend says apply_supported=false.
    • Show status strip (success/error) and backup path from apply response; surface reload flag and errors inline.
    • Gate preset actions when Cerberus disabled; show tooltip if hub unreachable.
    • Keep local backup + manual file apply as last-resort only when backend explicitly returns 501/NotImplemented.
  3. Overview page frontend/src/pages/Security.tsx:
    • No UI change except error surfacing when start/stop fails due to hub apply requiring reload; show toast from handler message.
  4. Import page frontend/src/pages/ImportCrowdSec.tsx:
    • Add note linking to presets apply so users prefer presets over raw package imports.

Hub Fetch/Validate/Apply Flow (detailed)

  1. Pull
    • Handler: CrowdsecHandler.PullPreset(ctx) (new) calls HubClient.FetchPreset → HubCache.StorePreset → returns {preset, preview_yaml, etag, cache_key, fetched_at}.
    • If hub unavailable, return 503 with message; UI shows retry/cached copy option.
  2. Apply
    • Handler: CrowdsecHandler.ApplyPreset(ctx) loads cache by slug/cache_key → backupCurrentConfig() → InstallPreset() (cscli or manual) → optional restart → returns {status:"applied", backup, reloaded:true/false}.
    • On error: restore backup, include {status:"failed", backup, error}.
  3. Caching & rollback
    • Cache directory per slug with checksum file; TTL enforced on pull; apply uses cached bundle unless force_refetch flag.
    • Backups stored with timestamp; keep last N (configurable). Provide restoration note in response for UI.
  4. Validation
    • Tarball extraction guard: reject absolute paths, .., symlinks; limit total size.
    • YAML sanity: parse key scenario/collection files to ensure readable; log warning not blocker unless parse fails.
    • Require explicit apply=true separate from pull; no implicit apply on pull.

Security Considerations

  • Only allow these endpoints when Cerberus enabled and user authenticated to admin scope.
  • Use CommandExecutor to shell out to cscli; restrict PATH and working dir; do not pass user-controlled args without whitelist.
  • Network egress: if hub URL configurable, validate scheme is https and host is allowlisted (crowdsec official or configured mirror).
  • Rate limit pull/apply (simple in-memory token bucket) to avoid abuse.
  • Logging: include slug and etag, omit file contents; redact download URLs if they contain tokens (unlikely).
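
The in-memory token bucket proposed above is small enough to sketch inline. Type and field names here are assumptions, not existing code.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

// tokenBucket is a minimal in-memory limiter for the pull/apply endpoints:
// refill by elapsed time, cap at burst, consume one token per request.
type tokenBucket struct {
	mu     sync.Mutex
	tokens float64
	last   time.Time
	rate   float64 // refill rate, tokens per second
	burst  float64 // maximum bucket size
}

func newTokenBucket(rate, burst float64) *tokenBucket {
	return &tokenBucket{tokens: burst, last: time.Now(), rate: rate, burst: burst}
}

// Allow reports whether a request may proceed right now.
func (b *tokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens = math.Min(b.burst, b.tokens+now.Sub(b.last).Seconds()*b.rate)
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	limiter := newTokenBucket(0.2, 2) // 2-call burst, one new token every 5s
	fmt.Println(limiter.Allow(), limiter.Allow(), limiter.Allow())
}
```

A bucket per client IP (or a single global one, given these are admin-only routes) would be stored in the handler and checked before any hub fetch.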

Required Tests

  • Backend unit/integration:
    • backend/internal/api/handlers/crowdsec_handler_test.go: success and error cases for PullPreset (hub reachable/unreachable, invalid slug), ApplyPreset (cscli success, cscli missing fallback, apply fails and restores backup), ListPresets merging cached hub entries.
    • backend/internal/crowdsec/hub_sync_test.go: parse index JSON, validate tar extraction guards, TTL eviction.
    • backend/internal/crowdsec/hub_cache_test.go: store/load/evict logic and checksum verification.
    • backend/internal/api/handlers/crowdsec_exec_test.go: ensure executor timeouts/commands constructed for cscli hub calls.
  • Frontend unit/UI:
    • frontend/src/api/presets.ts tests: pullCrowdsecPreset/applyCrowdsecPreset call shapes, response typing, and error propagation.
    • frontend/src/pages/CrowdSecConfig.tsx tests: preview rendering after pull, apply success/failure status strip, and gating when Cerberus is disabled or the hub is unreachable.

Docs Updates

  • Update docs/cerberus.md CrowdSec section with new hub preset flow, backup/rollback notes, and requirement for cscli availability when using hub.
  • Update docs/features.md to list “CrowdSec Hub presets sync/apply (admin)” and mention offline curated fallback.
  • Add short troubleshooting entry in docs/troubleshooting/crowdsec.md (new) for hub unreachable, checksum mismatch, or cscli missing.

Migration Notes

  • Existing curated presets remain but are marked as bundled; UI should continue to show them even if hub unreachable.
  • Stub endpoint POST /admin/crowdsec/presets/pull/apply is replaced by separate pull and apply; frontend must switch to new API paths before backend removal to avoid 404.
  • Backward compatibility: keep returning 501 from old endpoint until frontend merged; remove once new routes live and tested.

Automated CI Dry-Run & PR Checklist Plan

Objective: Add a CI dry-run workflow and a PR checklist enforcement job and template to reduce release/merge regressions. The dry-run will validate build/test/lint/packaging steps without publishing or writing secrets. The PR checklist job ensures contributors used the PR template.

Files to add / edit

  • Add: .github/workflows/dry-run.yml — main dry-run workflow triggered on PR and schedule (weekly)
  • Add: .github/workflows/pr-checklist.yml — PR body validation workflow to ensure PR checklist compliance
  • Add: .github/PULL_REQUEST_TEMPLATE/pr_template.md — PR template with developer checklist
  • Add: scripts/ci/dry_run_build.sh — wrapper script used by dry-run to orchestrate checks
  • Add: scripts/ci/dry_run_goreleaser.sh — goreleaser snapshot dry-run runner
  • Add: scripts/ci/check_pr_checklist.sh — standalone PR-checklist script (for local/CI usage)
  • Edit: .pre-commit-config.yaml — recommend adding tsc --noEmit and golangci-lint local hooks (if not already present as mandatory checks)
  • Review: .gitignore, .gitattributes — ensure scripts/ci temp artifacts are ignored and LFS/gitattribute rules still appropriate

Suggested job names & responsibilities

  • dry-run-backend (backend build: go build, go test with coverage, go vet)
  • dry-run-frontend (frontend build & tests, npm ci, npm run build, npm run test:ci)
  • dry-run-docker (docker build only, docker build, no push; multi-platform on branches only)
  • dry-run-goreleaser (goreleaser release in snapshot mode; --snapshot builds artifacts without publishing)
  • dry-run-security (Trivy fs scan of built binary and optional static scans; non-publish SARIF upload disabled for fork PRs)
  • validate-pr-checklist (validate PR body contains required checklist items)

Workflow triggers

  • pull_request (types: opened, edited, synchronize, reopened)
  • workflow_dispatch (manual run)
  • schedule (cron — e.g., 0 3 * * 0 weekly; optional daily lightweight run for repo health exists already via repo-health.yml)

Permissions and secrets considerations

  • Default minimal perms: contents: read for checkout; use checks: write to set check statuses if necessary.
  • Do not grant packages: write or other GITHUB_TOKEN write permissions on PRs (unless gated to branch-origin PRs). The dry-run avoids publishing and won't require packages: write.
  • Conditionally run goreleaser or any publish checks only when PR originates from repo owner and/or is a branch on the main repo (forks cannot access secrets): if: ${{ github.event.pull_request.head.repo.full_name == github.repository }}.
  • CHARON_TOKEN or custom PAT should not be used in PR runs from forks to avoid secret leakage.

YAML job snippet samples

  1. Example dry-run-backend job
jobs:
  dry-run-backend:
    name: Backend Dry-Run
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v4
        with: { go-version: '1.25' }
      - name: Run repo health check
        run: bash scripts/repo_health_check.sh
      - name: Run backend tests
        run: bash scripts/go-test-coverage.sh
  1. Example dry-run-frontend job
  dry-run-frontend:
    name: Frontend Dry-Run
    runs-on: ubuntu-latest
    needs: dry-run-backend
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '24' }
      - name: Install + Test
        working-directory: frontend
        run: |
          npm ci
          bash ../scripts/frontend-test-coverage.sh 2>&1 | tee frontend/test-output.txt
  1. Example dry-run-docker job
  dry-run-docker:
    name: Build Docker Image (No Push)
    runs-on: ubuntu-latest
    needs: dry-run-frontend
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker
        run: docker build --platform linux/amd64 -t charon:pr-${{ github.sha }} .
  1. Example dry-run-goreleaser job protected from fork PR secrets exposure
  dry-run-goreleaser:
    name: GoReleaser Dry-Run
    runs-on: ubuntu-latest
    needs: dry-run-docker
    if: ${{ github.event_name == 'workflow_dispatch' || github.repository == github.event.pull_request.head.repo.full_name }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v4
        with: { go-version: '1.25' }
      - name: Run GoReleaser (dry-run)
        uses: goreleaser/goreleaser-action@v6
        with: { args: 'release --snapshot --clean' }
  1. PR checklist validation job (using github-script or local script)
jobs:
  validate-pr-checklist:
    name: Validate PR Checklist
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate checklist
        uses: actions/github-script@v7
        with:
          script: |
            const pr = await github.rest.pulls.get({owner: context.repo.owner, repo: context.repo.repo, pull_number: context.issue.number});
            const body = (pr.data && pr.data.body) || '';
            const required = [ 'Tests (backend & frontend) validated', 'Lint and type checks pass', 'Changelog updated', 'No sensitive info or secrets' ];
            for (const r of required) {
              if (!body.toLowerCase().includes(r.toLowerCase())) { core.setFailed('Missing checklist item: '+r); return; }
            }

Expected behavior and gating strategy

  • The dry-run workflow runs on PRs and fails the PR if build/test/goreleaser dry-run steps fail.
  • For PRs from forks, goreleaser/publish and other actions requiring secrets must be gated and skipped.
  • Maintain only lightweight jobs for PRs (run code checks, not artifact uploads). For nightly schedules, run heavier checks.

Edge cases & mitigation

  • Fork PRs: secrets are not available. Skip secret-dependent steps and run only build/test/lint steps.
  • Large builds or timeouts: break checks into small jobs chained with needs so failures surface early; set timeouts on the longest-running jobs and use caches conservatively.
  • Flaky tests: mark flaky steps non-blocking but create an issue/annotation when they fail.

CI Scheduler recommendation

  • Keep the existing repo-health.yml schedule (daily). Add dry-run.yml with weekly schedule (0 3 * * 0) to validate the entire build pipeline weekly.

PR Template (create .github/PULL_REQUEST_TEMPLATE/pr_template.md) — basic checklist

## Summary

## Checklist
- [ ] Tests (backend & frontend) validated locally and CI
- [ ] All tests pass in CI (Quality Checks)
- [ ] Lint and type checks pass (pre-commit hooks run locally)
- [ ] Changelog updated (if relevant)
- [ ] Docs updated (if user-facing change)
- [ ] No sensitive info or secrets in this PR

.gitattributes, .gitignore, and pre-commit checks review

  • .gitattributes: verify existing rules mark *.db, *.sqlite, and codeql-db/** as LFS/binary as appropriate (the file already exists, so no changes may be required); add patterns for scripts/ci artifacts if necessary.
  • .gitignore: add /.tmp-ci or other ephemeral artifact patterns for CI scripts to avoid noise.
  • .pre-commit-config.yaml: add a local tsc or npx tsc --noEmit hook if not present; keep check-lfs-large-files and block-codeql-db-commits enforced. Consider adding a new local ci-dry-run hook for pre-push gating (manual stage) but mainly rely on the CI workflows.

Testing plan & acceptance criteria

  • Run dry-run.yml via workflow_dispatch for a test branch (internal) and validate the dry-run jobs pass.
  • Open a sample PR with an incomplete checklist and confirm pr-checklist fails with actionable message.
  • Verify goreleaser dry-run step will not run for forked PRs and is only executed for branches in the same repository.
  • Acceptance: all jobs pass on a healthy branch; failing checks block PRs targeting main and feature/beta-release.

Next steps for maintainers

  1. Create the initial dry-run.yml & pr-checklist.yml workflows and a test PR to check behavior.
  2. Add scripts/ci/dry_run_* scripts and link them from the new workflows.
  3. Add the PR template file in .github/PULL_REQUEST_TEMPLATE and update CONTRIBUTING.md to require PR checklist checks.
  4. Test behavior for fork PRs and set branch protection rules to require these checks on relevant branches.
  5. Iterate in a small number of PRs to tune the threshold and gating behavior.

Status: Implemented

Files added and wired into CI:

  • .github/workflows/dry-run-history-rewrite.yml — runs a non-destructive history/large-file check on PRs and schedule.
  • .github/PULL_REQUEST_TEMPLATE/history-rewrite.md — PR checklist for history rewrite PRs.
  • scripts/ci/dry_run_history_rewrite.sh — CI wrapper that fails when banned paths or large historical objects are found.
  • .github/workflows/pr-checklist.yml — validates the PR body contains required checklist items for history-rewrite PRs.

Validation steps performed locally and via CI (dry-run):
  • scripts/ci/dry_run_history_rewrite.sh returns non-zero when repo history contains objects or commits touching backend/codeql-db or other listed paths.
  • The workflow uses actions/checkout@v4 with fetch-depth: 0 to ensure history is available for the check.
  • PR template ensures contributors attach dry-run output and backup logs prior to destructive cleanups.

Next considerations:

  • Add a validate-pr-checklist workflow to enforce general PR checklist items for all PRs if desired (future improvement).