Overview
This document outlines an investigation and remediation plan for the error message: "Could not build remote workspace index. Could not check the remote index status for this repo." It contains diagnoses, exact checks, reproductions, fixes, CI/security checks, and acceptance criteria.
Summary: Background & likely root causes
- Background: Remote workspace index builds (e.g., on GitHub or other code hosting platforms) analyze repository contents to provide search, code navigation, and other services. Index processes can fail due to repository content, metadata, workflows, or platform limitations.
- Likely root causes:
- Permission issues: index build needs read access or specific tokens (missing PAT or actions secrets, restricted branch protection blocking re-run).
- Repository size limits: excessive repo size (large binaries, codeql databases, caddy caches) triggers failures.
- Git LFS misconfiguration: large files stored in Git rather than LFS; remote indexer times out scanning big objects.
- Cyclic symlinks or malformed symlinks: indexers can hit infinite loops or unsupported references.
- Large or unsupported binary files: vendor/binaries and artifacts committed directly instead of using artifacts or releases.
- Missing or stale code scanning artifacts (e.g., codeql-db) present in the repo root, which confuse the indexer.
- Malformed .git/config or broken submodules: submodule references or malformed remote URL prevents indexer from checking status.
- Network issues: temporary outages between indexer service and repo host.
- GitHub Actions failures: workflows that set up the environment fail due to missing CI keys or secrets.
- Branch protection / repo policies: settings that prevent GitHub from performing necessary indexing operations.
Files and Locations To Inspect (exact)
- Repository configuration
  - `.git/config` — inspect remotes, submodules, and alternate refs
  - `.gitattributes` — look for LFS vs plainly tracked files
  - `.gitignore` — ensure generated artifacts and codeql-db are ignored
- CI and workflows
  - `.github/workflows/**/*` — workflows and secrets usage
  - `.github/` — branch protection and other settings shown in the repo web UI
- Code scanning artifacts
  - `codeql-db/` (`codeql-db-go/`, `codeql-db-js/`) — ensure these are not committed as code
  - `backend/codeql-db/` or `backend/data/` — artifacts that increase repo size and confuse the indexer
- Build & deploy config
  - `Dockerfile` — check for COPY of large files
  - `docker-compose*.yml` (`docker-compose.yml`, `docker-compose.local.yml`) — services and volumes
- Platform / metadata
  - `.vscode/*` — workspace settings, local workspace mapping
  - `go.work` — ensure workspace Go settings are correct
- Frontend & package manifests
  - `frontend/package.json` — scripts and postinstall tasks that may run CI or build steps
  - `package.json` (root) — library or workspace settings affecting index size
- Binary caches & artifacts
  - `data/`, `backend/data/` — DB files, Caddy files, and caches
  - `tools/` — large binaries or vendor files committed for local dev
- Git LFS and hooks
  - `.git/hooks/*` — check for LFS pre-commit/pre-push hooks
- GitHub admin side
  - Branch protection settings in the GitHub UI (branches) — check protected branches and required status checks
- Logs & scan results
  - `.github/workflows` action logs (in the GitHub Actions run view or via the `gh` CLI)
  - `backend/test-output.txt` and other preexisting logs
Commands & Logs To Collect (reproduction & evidence gathering)
- Local repo inspection
  - `git status --porcelain --ignored` — show ignored & modified files
  - `git fsck --full` — scan the object store for corruption
  - `git count-objects -vH` — get repo size (packed/unpacked)
  - `du -sh .` and `du -sh $(git rev-parse --show-toplevel)/*` — check workspace size from the file system
  - `git rev-list --objects --all | sort -k 2 > allfiles.txt`, then `git rev-list --objects --all | sort -k 2 | tail -n 50` — find large commits/objects
  - `git verify-pack -v .git/objects/pack/pack-*.idx | sort -k 3 -n | tail -n 50` — inspect object packs
  - `git lfs ls-files` and `git lfs env` — check LFS-tracked files and environment
  - `find . -type f -size +100M -exec ls -lh {} \;` — detect large files in the working tree
  - `rg -S "codeql-db|codeql database|codeql" -n` — ripgrep for CodeQL references
  - `git submodule status --recursive` — check submodules
  - `git config --list --show-origin` — confirm config values and their origins (e.g., LFS config)
- Reproducing the remote run locally / in GH Actions
  - Use the `gh` CLI: `gh run list --repo owner/repo`, then `gh run view RUN_ID --log --repo owner/repo` to collect logs
  - Rerun the failing index action: `gh run rerun RUN_ID --repo owner/repo` (requires permissions)
  - Recreate the CodeQL database locally to test: `codeql database create codeql-db --language=go --command='cd backend && go test ./... -c'`, then `codeql database analyze codeql-db codeql-custom-queries-go` to check for broken DBs
  - `docker build --no-cache -t charon:local .` and `docker run --rm charon:local` to catch missing `git` or LFS artifacts copied into images
- Network & permissions
  - Check PAT and Actions tokens: `gh auth status` and `gh secret list --repo owner/repo`, and verify repository secrets usage in workflows
  - Verify branch protection policies via `gh api repos/:owner/:repo/branches/:branch` or through the GitHub UI
  - Use `curl -I https://api.github.com/repos/:owner/:repo` to confirm GitHub's API reachability
Steps To Gather Evidence (sequence)
- Triage: run the local repo inspection commands above and save the output to `scripts/diagnostics/repo_health.OUTPUT`.
- Confirm whether large files or code-scan DBs are present: `find . -type d -name "codeql-db*" -o -name "db*" -o -name "node_modules" -o -name "vendor"`.
- Inspect `git lfs ls-files` and `git lfs env` for LFS misconfiguration.
- Collect CI history and logs for failing runs using `gh run view --log` for the action(s) that produce the index error.
- Verify permissions and branch protections: `gh api` queries and `gh secret list --repo`.
- Review Dockerfile and workflow steps that might copy large artifacts and cause indexer timeouts.
Short-Term Mitigations (quick steps)
- Disable or re-run failing index builds with proper permissions: `gh run rerun RUN_ID --repo owner/repo`.
- Move large artifacts out of the repo: create a new docs/ or data/ location to store metadata, and add `codeql-db*`, `tools/`, `data/`, and `backend/data/` to `.gitignore` and `.gitattributes` to exclude them from the index.
- Add `codeql-db` to `.gitignore` and create `.gitattributes` entries to ensure no binary files are tracked without LFS (`.gitattributes` addition: `codeql-db/** binary`).
- Update `.gitignore`: add `/codeql-db`, `/data`, `/backend/data`, `/.vscode/`.
- Re-run the index build once large files are removed.
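The ignore/attribute additions above can be applied with a couple of `printf` appends; the entries below mirror the paths this plan names (demonstrated in a scratch directory so the block is self-contained — run it against the real repo root):

```shell
# Scratch directory for demonstration; run from the repo root in practice.
tmp=$(mktemp -d) && cd "$tmp"

# Ignore entries named in this plan
printf '%s\n' '/codeql-db' '/codeql-db-*' '/data' '/backend/data' '/.vscode/' >> .gitignore

# Attribute entries: mark codeql artifacts binary, route DB files through LFS
printf '%s\n' 'codeql-db/** binary' '*.db filter=lfs diff=lfs merge=lfs -text' >> .gitattributes

cat .gitignore .gitattributes
```

Appending (rather than overwriting) preserves any entries the repo already has; review the combined files before committing.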
Medium-Term Fixes
- Git filter-repo/BFG: if large files are committed, remove them from history using `git filter-repo` or BFG and force-push a cleaned branch: `git filter-repo --strip-blobs-bigger-than 50M`.
- Convert large binary files to Git LFS: `git lfs track "*.iso"` and `git lfs track "*.db"`, then `git add .gitattributes && git commit -m "Track binaries in LFS" && git push`.
- Prevent future accidental artifacts with a pre-commit hook (add it to `scripts/pre-commit-hooks` and reference it from `.git/hooks` or the pre-commit framework): run a `pre-commit` rule enforcing a max file size and LFS checks.
- Add a GH Actions health-check workflow (e.g., `.github/workflows/repo-health.yml`) that runs a small script to check for large files, LFS config, and codeql-db folders, and opens an issue if thresholds are exceeded.
- Document the LFS / large-file policy in `docs/` (e.g., `docs/getting-started.md` and `docs/features.md`) and add code-owner references.
Longer-Term Hardening
- Automate periodic repo health checks (monthly) via GH Actions to run `git count-objects -v` and `find . -size +100M` and warn maintainers.
- Add a repository dashboard for `codeql-db` and other artifacts, using `scripts/` and a GitHub Action to report stats to PRs or issues.
- Harden the remote indexing process; if the failure is intermittent and caused by GitHub's indexer, request GitHub support.
- Add a `scripts/ci_pre_check.sh` script, run on PR open, that checks `git fsck` and LFS, and ensures `codeql-db` is not committed.
- Add a `scripts/repo_health_check.sh` file and include it in `.vscode/tasks.json` and as a GH Action that maintainers can optionally invoke.
Recommended Fixes (exact files & edits, tests to run, CI checks)
- Add `.gitignore` and `.gitattributes` updates
  - Files to edit: `.gitignore`, `.gitattributes`.
  - Steps:
    - Add `/codeql-db`, `/codeql-db-*`, `/.vscode/`, `/backend/data`, `/data`, `/node_modules` to `.gitignore`.
    - Add `*.db filter=lfs diff=lfs merge=lfs -text` to `.gitattributes` for DB files, and `codeql-db/** binary` to mark codeql files as binary.
    - Commit and push.
  - Tests & checks:
    - Locally run `git status --ignored` to ensure these paths are now ignored.
    - Run `git lfs ls-files` to ensure large files are LFS-tracked where intended.
    - CI: ensure `pre-commit` checks pass; run `gh run view` to verify indexing.
- Add a small repo-health GH Action
  - Files to add: `.github/workflows/repo-health.yml`, `scripts/repo_health_check.sh`.
  - Steps:
    - Implement `scripts/repo_health_check.sh` to run `git count-objects -vH`, `find . -size +100M`, and `git fsck`, and print a short JSON summary.
    - Add `repo-health.yml` with a scheduled trigger and a PR check that runs the script.
  - Tests & checks:
    - Run `bash scripts/repo_health_check.sh` locally; ensure it exits 0 when checks are clean.
    - CI: ensure the workflow runs and reports results in the Actions tab.
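A minimal sketch of what `scripts/repo_health_check.sh` could contain, assuming the checks and JSON summary described above (the thresholds and JSON field names are this plan's choices, not an existing script; the scratch repo makes the block self-contained):

```shell
# Sketch of scripts/repo_health_check.sh; demonstrated here in a scratch repo.
tmp=$(mktemp -d) && cd "$tmp" && git init -q .

large=$(find . -type f -size +100M 2>/dev/null | wc -l | tr -d ' ')
codeql=$(find . -type d -name 'codeql-db*' 2>/dev/null | wc -l | tr -d ' ')
fsck_ok=true
git fsck --full >/dev/null 2>&1 || fsck_ok=false

# Short JSON summary, as the plan describes
printf '{"large_files":%s,"codeql_dirs":%s,"fsck_ok":%s}\n' "$large" "$codeql" "$fsck_ok"

# Succeed only when nothing needs attention (exit code is what CI gates on)
[ "$large" -eq 0 ] && [ "$codeql" -eq 0 ] && [ "$fsck_ok" = true ]
```

The last line gives the script the "exits 0 when checks are clean" contract the tests above expect, so both the scheduled workflow and the PR check can gate on the exit code alone.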
- Build-time protections for CodeQL artifacts
  - Files to edit: `.github/workflows/ci.yml` (or the equivalent CI) and `.gitattributes`/`.gitignore`.
  - Steps:
    - Remove `codeql-db` directories from CI cache and artifact paths; don't commit them.
    - Ensure the CodeQL analysis workflow uses `actions/cache` and `actions/upload-artifact` correctly rather than storing DBs in the repo.
  - Tests & checks:
    - Re-run the CodeQL workflow (`gh run rerun RUN_ID --repo owner/repo`) and verify the action no longer stores DBs as code.
- Pre-commit hook for large files & LFS enforcement
  - Files to add: `scripts/pre-commit-hooks/check-large-file.sh`, enabled via `.pre-commit-config.yaml` or `scripts/pre-commit-install.sh`.
  - Steps:
    - Implement a hook that fails commits containing files larger than 50 MB unless they are tracked in LFS.
    - Add it to the `pre-commit` config and install it in the repo.
  - Tests & checks:
    - Attempt to commit a test file larger than 50 MB and verify the commit is rejected unless LFS-tracked.
    - CI: add a PR check running `pre-commit` to ensure commits follow the policy.
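One plausible shape for `scripts/pre-commit-hooks/check-large-file.sh` is sketched below (the 50 MB limit is this plan's threshold; the function body is an assumption of how the hook could work, demonstrated in a scratch repo — a real hook would `exit` with the function's return code):

```shell
# Scratch repo so the demonstration is self-contained.
tmp=$(mktemp -d) && cd "$tmp" && git init -q .
git config user.email demo@example.com && git config user.name demo

check_large_files() {
  limit=$((50 * 1024 * 1024))
  fail=0
  for f in $(git diff --cached --name-only --diff-filter=AM); do
    [ -f "$f" ] || continue
    size=$(wc -c < "$f" | tr -d ' ')
    # Oversized and not routed through LFS -> block the commit
    if [ "$size" -gt "$limit" ] && ! git check-attr filter -- "$f" | grep -q 'filter: lfs'; then
      echo "ERROR: $f is ${size} bytes (>50MB) and not tracked by Git LFS" >&2
      fail=1
    fi
  done
  return "$fail"
}

# Demo: stage a 60 MB file and watch the check reject it
dd if=/dev/zero of=big.bin bs=1048576 count=60 2>/dev/null
git add big.bin
check_large_files || echo "hook would block this commit"
```

Checking `git check-attr filter` means a path that `.gitattributes` routes through the `lfs` filter passes, which is exactly the "allowed only if LFS-tracked" policy the step above describes.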
- CI policy verification for branches
  - Files / settings to revise: `.github/workflows` for runner permissions; branch protection settings via the GitHub UI.
  - Steps:
    - Confirm that user-level or organization-level `actions` and `workflows` permissions allow the required actions to run indexers.
    - Modify workflow triggers: ensure `pull_request` and `push` do not include large artifacts or directories.
  - Tests & checks:
    - Open a PR with `scripts/` changes to trigger the updated workflows; verify that they run and pass.
- Automated monitoring & alerts
  - Files to add: `.github/workflows/monitor-repo.yml`, `scripts/repo-monitor.sh`.
  - Steps:
    - Implement a periodic monitoring workflow that runs the repo health checks and opens an issue or sends a Slack message when thresholds are crossed.
  - Tests & checks:
    - Run the workflow locally and on schedule to prove the alerting path.
- Documentation updates
  - Files to update: `docs/getting-started.md`, `docs/features.md`, `docs/security.md`, `CONTRIBUTING.md`.
  - Steps:
    - Add guidelines for storing large artifacts, a policy to use Git LFS, and instructions for running `scripts/repo_health_check.sh`.
  - Tests & checks:
    - Verify that the docs build and CI pass with the updated instructions.
- CI integration tests to validate fixes
  - Files to add/edit: `.github/workflows/ci.yml`, `backend` build scripts, `frontend` script checks.
  - Steps:
    - Add a `ci.yml` step that runs `bash scripts/repo_health_check.sh`, `go test ./...`, and `npm run build` (frontend) as a gating check.
  - Tests & checks:
    - Ensure `go build`, `go test`, and `npm run build` pass in CI after the changes.
- Forced cleanup and migration of large objects (if necessary)
  - Files to change: none; these are history-edit operations.
  - Steps:
    - If large files are present and will not be moved to LFS, use `git filter-repo --strip-blobs-bigger-than 50M` or BFG and push to a cleaned branch.
    - Recreate workflows or branch references as needed after the forced push.
  - Tests & checks:
    - Run `git count-objects -vH` before and after, and verify the pack size decreases significantly.
- Validate GH Actions & secrets
  - Files to inspect: `.github/workflows/*` and GitHub repo settings (secrets).
  - Steps:
    - Ensure required secrets, PATs, or `GITHUB_TOKEN` usage is correct; verify `actions/checkout` fetches LFS properly (`actions/checkout@v2` with `fetch-depth: 0` and `lfs` enabled).
  - Tests & checks:
    - Rerun an action and confirm `git lfs ls-files` lists the expected 4-5 known files and that the `gh run` does not fail with read errors.
Phased Work Plan
Phase 1 — Triage & evidence collection (2-3 days)
- Tasks:
  - Run all diagnostic commands and collect the output (`scripts/diagnostics/repo_health.OUTPUT`).
  - Collect failing GH Action run logs and CodeQL run logs via `gh run view`.
  - Determine whether the failure is reproducible or intermittent.
- Acceptance criteria:
  - A full diagnostic report (PAIR) that identifies the top two likely causes.
  - The failing Action ID or workflow file is referenced.
Phase 2 — Short-term fixes & re-run (1-2 days)
- Tasks:
  - Add immediate `.gitignore` and `.gitattributes` protections for known artifacts.
  - Add `scripts/repo_health_check.sh`, a `.github/workflows/repo-health.yml` workflow, and pre-commit LFS checks.
  - Re-run GH Actions and index builds to check for improvement.
- Acceptance criteria:
- Repo health workflow runs on PRs and schedule; the health check succeeds or reports actionable items.
- Index build does not fail due to size or missing LFS objects on re-run.
- No unexpected artifacts are present in commits.
- Pre-commit hooks block large files that are not tracked by Git LFS.
Status: Short-term fixes implemented (gitattributes, pre-commit LFS check, health script, scheduled workflow).
Phase 3 — Medium-term fixes (2-5 days)
- Tasks:
  - Add `pre-commit` hooks and a …

CrowdSec Hub Presets Sync & Apply Plan (feature/beta-release)
🚨 CI/CD Incident Report - Run 20046135423-29 (2025-12-08 23:20 UTC)
Status: ALL BUILDS FAILING on feature/beta-release
Trigger: Push of commit 571a61a (CrowdSec cscli installation)
Impact: Docker publish blocked, codecov upload failed, all integration tests skipped
Root Causes Identified
1. CRITICAL: Missing Frontend File frontend/src/data/crowdsecPresets.ts
- Evidence: TypeScript compilation fails in Docker build, frontend tests, and WAF integration
- Error: `Cannot find module '../data/crowdsecPresets' or its corresponding type declarations`
- Affected Jobs:
- Run 20046135429 (Docker Build) - exit code 2 at Dockerfile:47
- Run 20046135423 (Frontend Codecov) - 2 test suites failed
- Run 20046135424 (WAF Integration) - Docker build failed
- Run 20046135426 (Quality Checks - Frontend) - test failures
- Files importing the missing module:
  - `frontend/src/pages/CrowdSecConfig.tsx:17`
  - `frontend/src/pages/__tests__/CrowdSecConfig.spec.tsx:13`
  - `frontend/src/pages/__tests__/CrowdSecConfig.test.tsx` (indirect via CrowdSecConfig.tsx)
- Type error cascade:
  - `CrowdSecConfig.tsx(86,52): error TS7006: Parameter 'preset' implicitly has an 'any' type`
  - `CrowdSecConfig.tsx(92-96): error TS2339: Property 'title'|'description'|'content'|'tags'|'warning' does not exist on type '{}'`
  - 10 TypeScript errors in total prevent npm build completion
- Git History: File never existed in repository; referenced in current_spec.md line 6 but never committed
- Remediation: create `frontend/src/data/crowdsecPresets.ts` with a `CROWDSEC_PRESETS` constant and a `CrowdsecPreset` type export
2. Backend Coverage Below Threshold (84.8% < 85.0% required)
- Evidence: Go test suite passes all tests but coverage check fails
- Error: `Coverage 84.8% is below required 85% (set CHARON_MIN_COVERAGE or CPM_MIN_COVERAGE to override)`
- Affected Job: Run 20046135423 (Backend Codecov Upload) - exit code 1
- Impact: Codecov upload skipped, quality gate not met
- Analysis: Recent commits added CrowdSec hub sync code without corresponding unit tests
- Likely Contributors:
- Remediation Options:
- Add unit tests for new CrowdSec hub sync functions to reach 85%+ coverage
- Temporarily lower the threshold via `CHARON_MIN_COVERAGE=84` (not recommended for merge)
- Exclude untested experimental code from coverage calculation until the implementation is complete
3. Frontend Test Failures (2 test suites failed, 587 tests passed)
- Evidence: Vitest reports 2 failed suites due to missing module
- Affected Job: Run 20046135423 (Frontend Codecov Upload) & Run 20046135426 (Quality Checks)
- Failed suites:
  - `src/pages/__tests__/CrowdSecConfig.spec.tsx`
  - `src/pages/__tests__/CrowdSecConfig.test.tsx`
- Root cause: same as #1 - missing `crowdsecPresets.ts` file
- Consequence: frontend coverage calculation incomplete; the 587 other tests pass
4. Docker Multi-Arch Build Failure (linux/amd64, linux/arm64)
- Evidence: Build canceled at frontend stage with TypeScript errors
- Affected Job: Run 20046135429 (Docker Build, Publish & Test)
- Build stages:
  - Stage `frontend-builder 6/6` failed during `npm run build`
  - Stage `backend-builder` canceled due to the frontend failure
  - No images pushed to ghcr.io; Trivy scan skipped
- Root Cause: Same as #1 - TypeScript compilation blocked by missing module
- Downstream Impact: Test Docker Image job skipped (no image available)
5. WAF Integration Tests Skipped
- Evidence: Docker build failed before tests could run
- Affected Job: Run 20046135424 (WAF Integration Tests)
- Build Step Failure: Same TypeScript errors at Dockerfile:47
- Container status: `charon-debug` container never created
- Root cause: same as #1 - build precondition not met
Are These Fixed by Recent Commits?
NO - Analysis of commits since 571a61a (the triggering commit at 2025-12-08 23:19:38Z):
- Commit `8f48e03`: merge development → feature/beta-release (no fixes)
- Commit `32ed8bc`: merge PR #332 development → feature/beta-release (no fixes)
- Latest commit on feature/beta-release: `32ed8bc` (2025-12-09 00:26:07Z)
- Missing file still not present in the workspace or git history
- Coverage issue unaddressed - no new tests added
Required Remediation Steps
IMMEDIATE (blocks all CI):
- Create the missing frontend file `frontend/src/data/crowdsecPresets.ts` with the structure:

      export interface CrowdsecPreset {
        slug: string;
        title: string;
        description: string;
        content: string;
        tags: string[];
        warning?: string;
      }
      export const CROWDSEC_PRESETS: CrowdsecPreset[] = [
        // Populate from backend/internal/crowdsec/presets.go or leave empty
      ];

- Verify TypeScript compilation:

      cd frontend && npm run build
      cd frontend && npm run test:ci
REQUIRED (for merge):
- Add Backend Unit Tests for CrowdSec Hub Sync
  - Target files: `backend/internal/crowdsec/hub_sync.go`, `hub_cache.go`
  - Create: `backend/internal/crowdsec/hub_sync_test.go`, `hub_cache_test.go`
  - Achieve: ≥85% coverage threshold
- Run full CI validation:

      # Backend
      cd backend && go test ./... -v -coverprofile=coverage.txt
      # Frontend
      cd frontend && npm run test:ci
      # Docker
      docker build --platform linux/amd64 -t charon:test .
OPTIONAL (technical debt):
- Update Documentation
- Fix docs/plans/current_spec.md line 6 reference to non-existent file
- Add troubleshooting entry for missing preset file scenario
- Add Pre-commit Hook
- Validate TypeScript imports resolve before commit
- Block commits with missing module references
Prevention Measures
- Pre-commit validation: TypeScript type checking must pass (`tsc --noEmit`)
- Coverage enforcement: CI should fail immediately when coverage drops below the threshold
- Integration test gating: Block merge if Docker build fails on any platform
- Module existence checks: Lint for import statements referencing non-existent files
- Test coverage for new features: Require tests in same commit as feature code
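The module-existence check above can be approximated with a shell heuristic that resolves each relative `from '...'` import against the importing file's directory (a rough sketch, not a real lint rule; it only tries `.ts`, `.tsx`, and `index.ts` candidates, and the scratch tree below exists only to make it demonstrable):

```shell
# Scratch tree demonstrating the heuristic: one resolvable import, one missing.
tmp=$(mktemp -d) && cd "$tmp"
mkdir -p src/data src/pages
printf 'export const X = 1;\n' > src/data/existing.ts
printf "import { X } from '../data/existing';\nimport { Y } from '../data/missing';\n" \
  > src/pages/Page.tsx

missing=0
for f in $(grep -rlE "from '\.\.?/" src); do
  dir=$(dirname "$f")
  for p in $(grep -oE "from '\.\.?/[^']+'" "$f" | sed "s/^from '//; s/'$//"); do
    # Accept .ts, .tsx, or directory-index resolution; anything else is missing
    if [ ! -f "$dir/$p.ts" ] && [ ! -f "$dir/$p.tsx" ] && [ ! -f "$dir/$p/index.ts" ]; then
      echo "missing module: $p (imported by $f)"
      missing=$((missing + 1))
    fi
  done
done
echo "missing=$missing"
```

In CI or a pre-commit hook, a non-zero `missing` count would block the commit; `tsc --noEmit` remains the authoritative check, this heuristic just fails faster.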
Current State (what exists today)
- Backend: backend/internal/api/handlers/crowdsec_handler.go exposes `ListPresets` (returns the curated list from backend/internal/crowdsec/presets.go) and a stubbed `PullAndApplyPreset` that only validates the slug and returns a preview, or HTTP 501 when `apply=true`. No real hub sync or apply.
- Backend uses `CommandExecutor` for `cscli decisions` only; no hub pull/install logic, and no cache/backups beyond the file-write backups in `WriteFile` and the import flow.
- Frontend: frontend/src/pages/CrowdSecConfig.tsx calls `pullAndApplyCrowdsecPreset`, then falls back to the local `writeCrowdsecFile` apply. The preset catalog merges the backend list with frontend/src/data/crowdsecPresets.ts. Errors 501/404 are surfaced as info to keep the local apply working. The overview toggle/start/stop is already wired to `startCrowdsec`/`stopCrowdsec`.
- Docs: docs/cerberus.md still notes that the CrowdSec integration is a placeholder; no hub sync is described.
Recent updates (2025-12-09)
- Backed up and removed local `codeql-db*` directories from the working tree to `data/backups/codeql-db-backup-<timestamp>.tar.gz` to avoid indexer confusion and reduce working-tree noise.
- Added a pre-commit hook `scripts/pre-commit-hooks/block-codeql-db-commits.sh` and enabled it in `.pre-commit-config.yaml` to prevent committing `codeql-db` artifacts.
- Added a repo health check and CI safety steps; ran `scripts/repo_health_check.sh` locally and confirmed it exits OK after the directory removal.
- Created a local backup: data/backups/codeql-db-backup-20251209T015533Z.tar.gz (not tracked).
Next steps
- Consider history cleaning if archived CodeQL DBs affect size or packs; recommend `git filter-repo` with the `--strip-blobs-bigger-than` option and a clear backup/rewrite plan.
- Run `git gc --prune=now` and `git fsck --full` to clean garbage objects after a scripted history rewrite or large-object removals.
History Rewrite Plan (if required)
- Confirm the set of blobs & paths that must be removed from history:
`git rev-list --objects --all | sort -k2 | rg "codeql-db|codeql-db-"` and `git verify-pack -v .git/objects/pack/*.idx | sort -k3 -n | tail -n 300`.
- Create a safe snapshot & announce planned rewrite to maintainers:
`git branch backup/main-YYYYMMDD`, then push it to origin as a backup branch (do not delete yet).
- Run `git filter-repo` to remove the heavy blobs or paths:
  - Example: `git filter-repo --invert-paths --paths codeql-db --paths backend/codeql-db --paths codeql-db-js --paths codeql-db-go`
  - Or `git filter-repo --strip-blobs-bigger-than 50M` to strip large blobs.
- Validate the repo after rewrite:
  - `git count-objects -vH` shows shrunk pack sizes
  - Run CI checks locally: backend Go tests, frontend build, `pre-commit run --all-files`.
- Coordinate forced push & relay steps to contributors:
  - `git push --force --all` and `git push --force --tags` (advertise this to collaborators).
  - Ask maintainers to rebase/force-pull their forks/branches.
- Ensure post-clean tasks:
  - Add a branch protection policy blocking force pushes to `main` except when required, with documented approvals.
  - Create a short script & docs, `scripts/repair_after_filter_repo.sh`, describing the rebase steps for maintainers.
Notes:
- History rewrite is destructive; only do after explicit approval and scheduling during a low-impact window.
- If the repo has widely used forks or CI jobs referencing old commit hashes, establish a temporary redirect communication plan.
History rewrite summary (safe workflow)
For repository history cleanup to remove committed CodeQL DBs or large blobs, the repo now contains a small set of tools under scripts/history-rewrite to help plan and safely execute this action. They are:
- `scripts/history-rewrite/clean_history.sh` — preview and optionally (with `--force`) run a git-filter-repo history rewrite. The default is `--dry-run`, and the script creates a timestamped backup branch named `backup/history-YYYYMMDD-HHMMSS` before any destructive operation. It logs operations to `data/backups/history_cleanup-YYYYMMDD-HHMMSS.log` and prints next-step instructions. Do NOT run `--force` on `main` or `master`, and coordinate with maintainers before force-pushing.
- `scripts/history-rewrite/preview_removals.sh` — print commit/object lists and example large blobs relevant to the given paths and strip size, for verification.
- `scripts/history-rewrite/validate_after_rewrite.sh` — run `git fsck`, `git count-objects -vH`, `pre-commit` hooks, backend `go test ./...`, and frontend `npm run build` to verify the repository after a rewrite.
Quick clean_history.sh usage examples
- Dry-run: `scripts/history-rewrite/clean_history.sh --dry-run --paths 'backend/codeql-db,codeql-db' --strip-size 50`
  - This logs what would be removed without making any changes; review `data/backups/history_cleanup-*.log` for details.
- Preview only: `scripts/history-rewrite/preview_removals.sh --paths 'backend/codeql-db,codeql-db' --strip-size 50`
- Destructive rewrite (ONLY after approval): `scripts/history-rewrite/clean_history.sh --force --paths 'backend/codeql-db,codeql-db' --strip-size 50`
  - It will create `backup/history-YYYYMMDD-HHMMSS`, prompt for the explicit confirmation `I UNDERSTAND`, run `git filter-repo` locally, then run `git fsck` and `git gc`.
  - After the rewrite, do not auto-push; perform `git push --all --force` and `git push --tags --force` only after team approval.
Warnings & notes
- The scripts only prepare and perform the rewrite locally; they will not force-push to remote unless you do so manually.
- Avoid running `--force` on `main` or `master`; use a feature branch or a controlled clone.
- The rewrite is destructive; maintainers must rebase or re-clone after the force-push.
- Always verify with `scripts/history-rewrite/validate_after_rewrite.sh` before any force push.
Incident Triage: CrowdSec preset pull/apply 502/500 (feature/beta-release)
- Logs to pull first: backend app/GIN logs under `/app/data/logs/charon.log` (or `data/logs/charon.log` in dev) via backend/cmd/api/main.go; look for the warnings "crowdsec preset pull failed" / "crowdsec preset apply failed" emitted in backend/internal/api/handlers/crowdsec_handler.go. Access logs will also show 502/500 for POST `/api/v1/admin/crowdsec/presets/pull` and `/apply`.
- Routes and code paths: handlers `PullPreset` and `ApplyPreset` live in backend/internal/api/handlers/crowdsec_handler.go and delegate to `HubService.Pull`/`Apply` in backend/internal/crowdsec/hub_sync.go, with cache helpers in backend/internal/crowdsec/hub_cache.go. The data dir used is `data/crowdsec`, with the cache under `data/crowdsec/hub_cache`, from backend/internal/api/routes/routes.go.
- Quick checks before repro: (1) Cerberus enabled (`feature.cerberus.enabled` setting or `FEATURE_CERBERUS_ENABLED`/`CERBERUS_ENABLED` env), or the handler returns 404 early; (2) `cscli` on PATH and executable (`HubService` uses the real executor and calls `cscli version`/`cscli hub install`); (3) outbound HTTPS to https://hub.crowdsec.net reachable (fallback after `cscli hub list`); (4) cache dir `data/crowdsec/hub_cache` writable and containing per-slug `metadata.json`, `bundle.tgz`, `preview.yaml`; (5) backup path writable (apply renames `data/crowdsec` to `data/crowdsec.backup.<ts>`).
- Likely 502 on pull: hub cache unavailable or init failed (cache dir permission), invalid slug, hub index fetch errors (`cscli hub list -o json` or direct GET `/api/index.json`), download blocked or larger than 25 MiB, preview/download HTTP non-200, or cache write errors. The handler logs a warning and returns 502 with the error string.
- Likely 500 on apply: backup rename fails; `cscli` install fails with no cache fallback (if pull never succeeded or the cache expired or went missing); cache read errors (`metadata.json`/`bundle.tgz` unreadable); tar extraction rejecting symlinks/unsafe paths; or rollback after an extract failure. The handler writes a `CrowdsecPresetEvent` (if the DB is reachable) with the backup path and returns 500 with a `backup` hint.
- Validation steps during triage: verify cache-entry freshness (TTL 24h) via `metadata.json` timestamps; confirm `cscli hub install <slug>` succeeds manually; if cscli is missing, ensure a prior pull populated the cache; test hub egress with curl against the hub index and archive URLs; check file ownership/permissions on `data/crowdsec` and `data/crowdsec/hub_cache`; find the log lines around the warnings for the exact error message; inspect the backup directory to restore if an apply was partial.
Current incident: preset apply returning "Network Error" (feature/beta-release)
- What we see: the frontend reports an axios "Network Error" while applying a preset. Backend logs do not yet show the apply warning, suggesting the client drops before an HTTP response arrives. The apply path runs `HubService.Apply` in backend/internal/crowdsec/hub_sync.go with a 15s context; pull uses a 10s HTTP client timeout and does not follow redirects. Axios flags a network error when the TCP connection is reset or times out, rather than when a 4xx/5xx is returned.
- Probable roots to verify quickly:
  - The hub index/preview/archives now redirect to another host; our HTTP client forbids redirects, so FetchIndex/Pull return an error and the handler responds 502 only after the hub timeout. Long hub connect attempts can hit the 10s client timeout, causing the upstream (Caddy) or the browser to drop the socket and surface a network error.
  - The runtime image may be missing `cscli` if the release archive layout changed; the Dockerfile only moves the binaries when the expected paths exist. Without cscli, Apply falls back to the cache, but if Pull already failed, Apply exits with an error and no response body. Validate `cscli version` inside the running container built from feature/beta-release.
  - Outbound egress/proxy: the container must reach https://hub-data.crowdsec.net (default) from within the Docker network. A missing `HTTP(S)_PROXY`/`NO_PROXY` or a transparent MITM can cause TLS handshake or connection timeouts that the client reports as network errors.
  - TLS/HTML responses: the hub returning HTML (maintenance/Cloudflare) or a 3xx/302 to http is treated as an error (`hub index responded with HTML`), which becomes 502. If the redirect/HTML arrives after ~10s, the browser may already have given up.
  - Timeout budget: 10s pull / 15s apply may be too tight for hub downloads plus cscli install. When the context cancels mid-stream, gin closes the connection and axios logs a network error instead of an HTTP code.
- Remediation plan (no code yet):
  - Confirm cscli exists in the runtime image by running `cscli version` inside the failing container; if missing, adjust the build or add a startup preflight that logs its absence and forces the HTTP hub path.
  - Override HUB_BASE_URL to a known JSON endpoint (e.g., `https://hub-data.crowdsec.net/api/index.json`) when redirects occur, or point to an internal mirror reachable from the Docker network; document this in env examples.
  - Ensure outbound 443 to hub-data is allowed, or set `HTTP(S)_PROXY`/`NO_PROXY` on the container; retry pull/apply after validating `curl -v https://hub-data.crowdsec.net/api/index.json` inside the runtime.
  - Consider raising the pull/apply timeouts (and matching the frontend request timeout), and log when contexts cancel so we return a 504/timeout JSON instead of a dropped socket.
  - Capture docker logs for `charon-debug` during the repro; look for `crowdsec preset pull/apply failed` warnings and any TLS/redirect messages from backend/internal/crowdsec/hub_sync.go.
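The cscli and egress checks above can be combined into a quick preflight (the hub URL is the one named in the remediation notes; the block only reports status and never exits non-zero, so it is safe to paste into a degraded container):

```shell
# Report cscli availability; in the suspect container, "missing" supports the
# release-archive-layout theory described above.
if command -v cscli >/dev/null 2>&1; then
  cscli_status=present
else
  cscli_status=missing
fi
echo "cscli: $cscli_status"

# Probe the hub index endpoint; -f makes curl fail on HTTP errors, and
# --max-time keeps the check under the backend's 10s pull timeout.
hub_url=https://hub-data.crowdsec.net/api/index.json
if curl -fsS --max-time 8 -o /dev/null "$hub_url" 2>/dev/null; then
  hub_status=reachable
else
  hub_status="unreachable (check egress/proxy/TLS)"
fi
echo "hub index: $hub_status"
```

Run it via `docker exec <container> sh -c '…'`; two "ok" answers here shift suspicion to the redirect handling and timeout-budget theories instead.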
Goal
Implement real CrowdSec Hub preset sync + apply on backend (using cscli or direct hub index) with caching, validation, backups, rollback, and wire the UI to new endpoints so operators can preview/apply hub items with clear status/errors.
Backend Plan (handlers, helpers, storage)
- Route adjustments (gin group under `/admin/crowdsec` in backend/internal/api/handlers/crowdsec_handler.go):
  - Replace the stub endpoint with `POST /admin/crowdsec/presets/pull` → fetch the hub item and cache it; return metadata + preview + cache key/etag.
  - Add `POST /admin/crowdsec/presets/apply` → apply a previously pulled item by cache key/slug; performs backup + cscli install + optional restart.
  - Keep `GET /admin/crowdsec/presets` but include hub/etag info and whether the item is cached locally.
  - Optional: `GET /admin/crowdsec/presets/cache/:slug` → raw preview/download for the UI.
- Hub sync helper (new backend/internal/crowdsec/hub_sync.go):
  - Provide `type HubClient interface { FetchIndex(ctx) (HubIndex, error); FetchPreset(ctx, slug) (PresetBundle, error) }` with a real implementation using either: a) `cscli hub list -o json` and `cscli hub update` + `cscli hub install <item>` (preferred if cscli is present), or b) direct fetch of https://hub.crowdsec.net/ or GitHub raw `index.json` + tarball download.
  - Validate downloads: size limits, tarball path traversal guard, checksum/etag compare, basic YAML validation.
- Caching (new backend/internal/crowdsec/hub_cache.go):
  - Cache pulled bundles under `${DataDir}/hub_cache/<slug>/` with index metadata (etag, fetched_at, source URL) and preview YAML.
  - Expose `LoadCachedPreset(slug)` and `StorePreset(slug, bundle)`; evict stale entries on TTL (configurable, default 24h) or when the etag changes.
- Apply flow (extend handler):
  - Pull: fetch index, resolve slug, download bundle to cache, return preview + warnings (missing cscli, requires restart, etc.).
  - Apply: before modifying, run `backupDir := DataDir + ".backup." + timestamp` (mirroring current write/import backups). Then: a) if cscli is available: `cscli hub update`, `cscli hub install <slug>` (or collection path), maybe a `cscli decisions list` sanity check; use `CommandExecutor` with a context timeout. b) If cscli is absent: extract the bundle into DataDir with sanitized paths; preserve permissions. c) Write an audit record to DB table `crowdsec_preset_events` (new model in backend/internal/models).
  - On failure: restore the backup (rename back), surface the error + backup path.
- Status and restart:
  - After apply, optionally call `h.Executor.Stop/Start` if running to reload the config, or `cscli service reload` when available. Return a `reload_performed` flag.
- Validation & security hardening:
  - Enforce the Cerberus enablement check (`isCerberusEnabled`) on all new routes.
  - Path sanitization with `filepath.Clean`; limit tar extraction to DataDir; reject symlinks and absolute paths.
  - Timeouts on all external calls; default 10s pull, 15s apply.
  - Log with context: slug, etag, source, backup path; redact secrets.
- Migration of curated list:
  - Keep curated presets in backend/internal/crowdsec/presets.go but add `Source: "hub"` for hub-backed items and include `RequiresHub: true` when not bundled. `ListPresets` should merge curated + live hub index when available, marking availability per slug (cached, remote-only, local-bundled).
Frontend Plan (API wiring + UX)
- API client updates in frontend/src/api/presets.ts:
  - Replace `pullAndApplyCrowdsecPreset` with `pullCrowdsecPreset({ slug })` and `applyCrowdsecPreset({ slug, cache_key })`; include response typing for preview/status/errors.
  - Add `getCrowdsecPresetCache(slug)` if the backend exposes a cache preview.
- CrowdSec config page frontend/src/pages/CrowdSecConfig.tsx:
  - Use the new mutations: `pull` shows the preview + metadata (etag, fetched_at, source); disable the local fallback unless the backend reports `apply_supported=false`.
  - Show a status strip (success/error) and the backup path from the apply response; surface the reload flag and errors inline.
  - Gate preset actions when Cerberus is disabled; show a tooltip if the hub is unreachable.
  - Keep local backup + manual file apply as a last resort, only when the backend explicitly returns 501/NotImplemented.
- Overview page frontend/src/pages/Security.tsx:
  - No UI change except error surfacing when start/stop fails because a hub apply requires a reload; show a toast from the handler message.
- Import page frontend/src/pages/ImportCrowdSec.tsx:
  - Add a note linking to presets apply so users prefer presets over raw package imports.
Hub Fetch/Validate/Apply Flow (detailed)
- Pull
  - Handler: `CrowdsecHandler.PullPreset(ctx)` (new) calls `HubClient.FetchPreset` → `HubCache.StorePreset` → returns `{preset, preview_yaml, etag, cache_key, fetched_at}`.
  - If the hub is unavailable, return 503 with a message; the UI shows a retry/cached-copy option.
- Apply
  - Handler: `CrowdsecHandler.ApplyPreset(ctx)` loads the cache by slug/cache_key → `backupCurrentConfig()` → `InstallPreset()` (cscli or manual) → optional restart → returns `{status:"applied", backup, reloaded:true/false}`.
  - On error: restore the backup and include `{status:"failed", backup, error}`.
- Caching & rollback
  - Cache directory per slug with a checksum file; TTL enforced on pull; apply uses the cached bundle unless the `force_refetch` flag is set.
  - Backups are stored with a timestamp; keep the last N (configurable). Provide a restoration note in the response for the UI.
- Validation
  - Tarball extraction guard: reject absolute paths, `..`, and symlinks; limit total size.
  - YAML sanity: parse key scenario/collection files to ensure they are readable; log a warning rather than blocking, unless parsing fails outright.
  - Require an explicit `apply=true` separate from pull; no implicit apply on pull.
Security Considerations
- Only allow these endpoints when Cerberus enabled and user authenticated to admin scope.
- Use `CommandExecutor` to shell out to cscli; restrict PATH and the working directory; do not pass user-controlled args without a whitelist.
- Network egress: if the hub URL is configurable, validate that the scheme is https and the host is allowlisted (the official CrowdSec hub or a configured mirror).
- Rate limit pull/apply (simple in-memory token bucket) to avoid abuse.
- Logging: include slug and etag, omit file contents; redact download URLs if they contain tokens (unlikely).
Required Tests
- Backend unit/integration:
  - backend/internal/api/handlers/crowdsec_handler_test.go: success and error cases for `PullPreset` (hub reachable/unreachable, invalid slug), `ApplyPreset` (cscli success, fallback when cscli is missing, apply fails and restores backup), and `ListPresets` merging cached hub entries.
  - backend/internal/crowdsec/hub_sync_test.go: parse index JSON, validate tar extraction guards, TTL eviction.
  - backend/internal/crowdsec/hub_cache_test.go: store/load/evict logic and checksum verification.
  - backend/internal/api/handlers/crowdsec_exec_test.go: ensure executor timeouts and commands are constructed correctly for cscli hub calls.
- Frontend unit/UI:
- frontend/src/pages/tests/CrowdSecConfig.test.tsx: pull shows preview, apply success shows backup path/reload flag, hub failure falls back to cached/local message, Cerberus disabled disables actions.
- frontend/src/api/tests/presets.test.ts: client hits new endpoints and maps response.
- frontend/src/pages/tests/Security.test.tsx: start/stop toasts remain correct when apply errors bubble.
Docs Updates
- Update docs/cerberus.md CrowdSec section with new hub preset flow, backup/rollback notes, and requirement for cscli availability when using hub.
- Update docs/features.md to list “CrowdSec Hub presets sync/apply (admin)” and mention offline curated fallback.
- Add short troubleshooting entry in docs/troubleshooting/crowdsec.md (new) for hub unreachable, checksum mismatch, or cscli missing.
Migration Notes
- Existing curated presets remain but are marked as bundled; UI should continue to show them even if hub unreachable.
- Stub endpoint `POST /admin/crowdsec/presets/pull/apply` is replaced by separate `pull` and `apply` endpoints; the frontend must switch to the new API paths before the backend removal to avoid 404s.
- Backward compatibility: keep returning 501 from the old endpoint until the frontend change is merged; remove it once the new routes are live and tested.
Automated CI Dry-Run & PR Checklist Plan
Objective: Add a CI dry-run workflow and a PR checklist enforcement job and template to reduce release/merge regressions. The dry-run will validate build/test/lint/packaging steps without publishing or writing secrets. The PR checklist job ensures contributors used the PR template.
Files to add / edit
- Add: `.github/workflows/dry-run.yml` — main dry-run workflow triggered on PRs and a schedule (weekly)
- Add: `.github/workflows/pr-checklist.yml` — PR body validation workflow to ensure PR checklist compliance
- Add: `.github/PULL_REQUEST_TEMPLATE/pr_template.md` — PR template with a developer checklist
- Add: `scripts/ci/dry_run_build.sh` — wrapper script used by the dry-run to orchestrate checks
- Add: `scripts/ci/dry_run_goreleaser.sh` — goreleaser snapshot dry-run runner
- Add: `scripts/ci/check_pr_checklist.sh` — standalone PR-checklist script (for local/CI usage)
- Edit: `.pre-commit-config.yaml` — recommend adding `tsc --noEmit` and `golangci-lint` local hooks (if not already present as mandatory checks)
- Review: `.gitignore`, `.gitattributes` — ensure scripts/ci temp artifacts are ignored and LFS/gitattributes rules are still appropriate
Suggested job names & responsibilities
- `dry-run-backend` — backend build: `go build`, `go test` with coverage, `go vet`
- `dry-run-frontend` — frontend build & tests: `npm ci`, `npm run build`, `npm run test:ci`
- `dry-run-docker` — docker build only (`docker build`, no push; multi-platform on branches only)
- `dry-run-goreleaser` — goreleaser release in dry-run mode (`--snapshot` or `--skip-publish`)
- `dry-run-security` — Trivy fs scan of the built binary and optional static scans; non-publish SARIF upload disabled for fork PRs
- `validate-pr-checklist` — validate the PR body contains the required checklist items
Workflow triggers
- `pull_request` (types: opened, edited, synchronize, reopened)
- `workflow_dispatch` (manual run)
- `schedule` (cron — e.g., `0 3 * * 0` for weekly; an optional daily lightweight run for repo health already exists via `repo-health.yml`)
Permissions and secrets considerations
- Default minimal permissions: `contents: read` for checkout; use `checks: write` to set check statuses if necessary.
- Do not grant `packages: write` or other `GITHUB_TOKEN` write scopes on PRs (unless gated to branch-origin PRs). The dry-run avoids publishing and won't require `packages: write`.
- Conditionally run goreleaser or any publish checks only when the PR originates from a branch on the main repo (forks cannot access secrets): `if: ${{ github.event.pull_request.head.repo.full_name == github.repository }}`.
- `CHARON_TOKEN` or a custom PAT should not be used in PR runs from forks to avoid secret leakage.
YAML job snippet samples
- Example `dry-run-backend` job:

```yaml
jobs:
  dry-run-backend:
    name: Backend Dry-Run
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v4
        with: { go-version: '1.25' }
      - name: Run repo health check
        run: bash scripts/repo_health_check.sh
      - name: Run backend tests
        run: bash scripts/go-test-coverage.sh
```
- Example `dry-run-frontend` job (note: with `working-directory: frontend`, the tee target is relative to that directory):

```yaml
  dry-run-frontend:
    name: Frontend Dry-Run
    runs-on: ubuntu-latest
    needs: dry-run-backend
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '24' }
      - name: Install + Test
        working-directory: frontend
        run: |
          npm ci
          bash ../scripts/frontend-test-coverage.sh 2>&1 | tee test-output.txt
```
- Example `dry-run-docker` job:

```yaml
  dry-run-docker:
    name: Build Docker Image (No Push)
    runs-on: ubuntu-latest
    needs: dry-run-frontend
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker
        run: docker build --platform linux/amd64 -t charon:pr-${{ github.sha }} .
```
- Example `dry-run-goreleaser` job, protected from fork-PR secrets exposure:

```yaml
  dry-run-goreleaser:
    name: GoReleaser Dry-Run
    runs-on: ubuntu-latest
    needs: dry-run-docker
    if: ${{ github.event_name == 'workflow_dispatch' || github.repository == github.event.pull_request.head.repo.full_name }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v4
        with: { go-version: '1.25' }
      - name: Run GoReleaser (dry-run)
        uses: goreleaser/goreleaser-action@v6
        with: { args: 'release --snapshot --clean' }
```

Note: goreleaser-action@v6 uses GoReleaser v2, where `--rm-dist` and `--skip-publish` were removed in favor of `--clean` and `--skip=publish`; `--snapshot` already implies no publishing.
- PR checklist validation job (using `github-script` or a local script):

```yaml
jobs:
  validate-pr-checklist:
    name: Validate PR Checklist
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate checklist
        uses: actions/github-script@v7
        with:
          script: |
            const pr = await github.rest.pulls.get({owner: context.repo.owner, repo: context.repo.repo, pull_number: context.issue.number});
            const body = (pr.data && pr.data.body) || '';
            const required = [ 'Tests (backend/frontend) pass', 'Pre-commit checks pass', 'Lints and type checks pass', 'Changelog updated (if applicable)' ];
            for (const r of required) {
              if (!body.toLowerCase().includes(r.toLowerCase())) { core.setFailed('Missing checklist item: ' + r); return; }
            }
```
Expected behavior and gating strategy
- The dry-run workflow runs on PRs and fails the PR if build/test/goreleaser dry-run steps fail.
- For PRs from forks, goreleaser/publish and other actions requiring secrets must be gated and skipped.
- Maintain only lightweight jobs for PRs (run code checks, not artifact uploads). For nightly schedules, run heavier checks.
Edge cases & mitigation
- Fork PRs: secrets are not available. Skip secret-dependent steps and run only build/test/lint steps.
- Large builds or timeouts: break checks into small jobs with `needs` to fail early, set timeouts on the longest-running jobs, and take precautions with cache usage.
- Flaky tests: mark flaky steps non-blocking but create an issue/annotation whenever a flaky test fails.
CI Scheduler recommendation
- Keep the existing `repo-health.yml` schedule (daily). Add `dry-run.yml` with a weekly schedule (`0 3 * * 0`) to validate the entire build pipeline weekly.
PR Template (create `.github/PULL_REQUEST_TEMPLATE/pr_template.md`) — basic checklist
## Summary
## Checklist
- [ ] Tests (backend & frontend) validated locally and CI
- [ ] All tests pass in CI (Quality Checks)
- [ ] Lint and type checks pass (pre-commit hooks run locally)
- [ ] Changelog updated (if relevant)
- [ ] Docs updated (if user-facing change)
- [ ] No sensitive info or secrets in this PR
.gitattributes, .gitignore, and pre-commit checks review
- `.gitattributes`: ensure `*.db`, `*.sqlite`, and `codeql-db/**` are LFS/binary as appropriate (the file exists): verify no changes are required; add new patterns for scripts/ci artifacts if necessary.
- `.gitignore`: add `/.tmp-ci` or other ephemeral artifact patterns for CI scripts to avoid noise.
- `.pre-commit-config.yaml`: add a local `tsc` or `npx tsc --noEmit` hook if not present; keep `check-lfs-large-files` and `block-codeql-db-commits` enforced. Consider adding a new local `ci-dry-run` hook for pre-push gating (manual stage), but mainly rely on the CI workflows.
Testing plan & acceptance criteria
- Run `dry-run.yml` via `workflow_dispatch` for a test branch (internal) and validate that the dry-run jobs pass.
- Open a sample PR with an incomplete checklist and confirm `pr-checklist` fails with an actionable message.
- Verify the goreleaser dry-run step does not run for forked PRs and is only executed for branches in the same repository.
- Acceptance: jobs complete successfully under normal code status; PRs are blocked on failures for `main` + `feature/beta-release`.
Next steps for maintainers
- Create the initial `dry-run.yml` & `pr-checklist.yml` workflows and a test PR to check behavior.
- Add `scripts/ci/dry_run_*` scripts and link them from the new workflows.
- Add the PR template file in `.github/PULL_REQUEST_TEMPLATE` and update `CONTRIBUTING.md` to require the PR checklist checks.
- Test behavior for fork PRs and set branch protection rules to require these checks on relevant branches.
- Iterate in a small number of PRs to tune the threshold and gating behavior.
Status: Implemented
Files added and wired into CI:
- `.github/workflows/dry-run-history-rewrite.yml` — runs a non-destructive history/large-file check on PRs and on a schedule.
- `.github/PULL_REQUEST_TEMPLATE/history-rewrite.md` — PR checklist for history-rewrite PRs.
- `scripts/ci/dry_run_history_rewrite.sh` — CI wrapper that fails when banned paths or large historical objects are found.
- `.github/workflows/pr-checklist.yml` — validates that the PR body contains the required checklist items for history-rewrite PRs.
Validation steps performed locally and via CI (dry-run):
- `scripts/ci/dry_run_history_rewrite.sh` returns non-zero when the repo history contains objects or commits touching `backend/codeql-db` or other listed paths.
- The workflow uses `actions/checkout@v4` with `fetch-depth: 0` to ensure history is available for the check.
- The PR template ensures contributors attach dry-run output and backup logs prior to destructive cleanups.
Next considerations:
- Add a `validate-pr-checklist` workflow to enforce general PR checklist items for all PRs, if desired (future improvement).