Overview
This document outlines an investigation and remediation plan for the error message: "Could not build remote workspace index. Could not check the remote index status for this repo." It contains diagnoses, exact checks, reproductions, fixes, CI/security checks, and acceptance criteria.
Summary: Background & likely root causes
- Background: Remote workspace index builds (e.g., on GitHub or other code hosting platforms) analyze repository contents to provide search, code navigation, and other services. Index processes can fail due to repository content, metadata, workflows, or platform limitations.
- Likely root causes:
- Permission issues: index build needs read access or specific tokens (missing PAT or actions secrets, restricted branch protection blocking re-run).
- Repository size limits: excessive repo size (large binaries, codeql databases, caddy caches) triggers failures.
- Git LFS misconfiguration: large files stored in Git rather than LFS; remote indexer times out scanning big objects.
- Cyclic symlinks or malformed symlinks: indexers can hit infinite loops or unsupported references.
- Large or unsupported binary files: vendor/binaries and artifacts committed directly instead of using artifacts or releases.
- Missing or stale code scanning artifacts (e.g., codeql-db) present in repo root that confuse the indexer.
- Malformed .git/config or broken submodules: submodule references or malformed remote URL prevents indexer from checking status.
- Network issues: temporary outages between indexer service and repo host.
- GitHub Actions failures: workflows that set up the environment fail due to missing CI keys or secrets.
- Branch protection / repo policies: preventing GitHub from performing necessary indexing operations.
Files and Locations To Inspect (exact)
- Repository configuration
- .git/config — inspect remotes, submodules, and alternate refs
- .gitattributes — look for LFS-tracked vs plainly tracked files
- .gitignore — ensure generated artifacts and codeql-db are ignored
- CI and workflows
- .github/workflows/**/* — workflows and secrets usage
- .github/ (branch protection or other settings logged in repo web UI)
- Code scanning artifacts
- codeql-db/ (codeql-db-go/, codeql-db-js/) — ensure not committed as code
- backend/codeql-db/ or backend/data/ — artifacts that increase repo size and confuse indexer
- Build & deploy config
- Dockerfile — check for COPY of large files
- docker-compose*.yml (docker-compose.yml, docker-compose.local.yml) — services and volumes
- Platform / metadata
- .vscode/* (workspace settings) — local workspace mapping
- go.work — ensure workspace Go settings are correct
- Frontend & package manifests
- frontend/package.json (frontend/package.json) — scripts and postinstall tasks that may run CI or build steps
- package.json (root/package.json) — library or workspace settings affecting index size
- Binary caches & artifacts
- data/ and backend/data/ — DB files, Caddy files, and caches
- tools/ (tools/) — large binaries or vendor files committed for local dev
- Git LFS and hooks
- .git/hooks/* — check for LFS pre-commit/pre-push hooks
- GitHub admin side
- Branch protection settings in GitHub UI (branches) — check protected branches and required status checks
- Logs & scan results
- .github/workflows action logs (in the GitHub Actions run view or via the `gh` CLI)
- backend/test-output.txt and other preexisting logs
Commands & Logs To Collect (reproduction & evidence gathering)
- Local repo inspection
  1. `git status --porcelain --ignored` — show ignored & modified files
  2. `git fsck --full` — scan the object store for corruption
  3. `git count-objects -vH` — get repo size (packed/unpacked)
  4. `du -sh .` and `du -sh $(git rev-parse --show-toplevel)/*` — check workspace size from the file system
  5. `git rev-list --objects --all | sort -k 2 > allfiles.txt`, then `git rev-list --objects --all | sort -k 2 | tail -n 50` — find large commits/objects
  6. `git verify-pack -v .git/objects/pack/pack-*.idx | sort -k 3 -n | tail -n 50` — inspect object packs
  7. `git lfs ls-files` and `git lfs env` — check LFS-tracked files and the LFS environment
  8. `find . -type f -size +100M -exec ls -lh {} \;` — detect large files in the working tree
  9. `rg -S "codeql-db|codeql database|codeql" -n` — ripgrep for CodeQL references
  10. `git submodule status --recursive` — check submodules
  11. `git config --list --show-origin` — confirm config values and their origin (e.g., LFS config)
- Reproducing the remote run locally / in GH Actions
  12. Use the `gh` CLI: `gh run list --repo owner/repo`, then `gh run view RUN_ID --log --repo owner/repo` to collect logs
  13. Rerun the failing index action: `gh run rerun RUN_ID --repo owner/repo` (requires permissions)
  14. Recreate the CodeQL database locally to test: `codeql database create codeql-db --language=go --command='cd backend && go test ./... -c'`, then `codeql database analyze codeql-db codeql-custom-queries-go` to check for broken DBs
  15. `docker build --no-cache -t charon:local .` and `docker run --rm charon:local` to catch missing `git` or LFS artifacts copied into images
- Network & permissions
  16. Check PAT and Actions tokens: `gh auth status` and `gh secret list --repo owner/repo`, and verify repository secrets usage in workflows
  17. Verify branch protection policies via `gh api repos/:owner/:repo/branches/:branch` or through the GitHub UI
  18. Use `curl -I https://api.github.com/repos/:owner/:repo` to confirm GitHub's API reachability
Steps To Gather Evidence (sequence)
- Triage: run the local repo inspections above (commands 1–11). Save outputs to `scripts/diagnostics/repo_health.OUTPUT`.
- Confirm whether large files or code-scan DBs are present: `find . -type d -name "codeql-db*" -o -name "db*" -o -name "node_modules" -o -name "vendor"`.
- Inspect `git lfs ls-files` and `git lfs env` for LFS misconfiguration.
- Collect CI history and logs for failing runs using `gh run view --log` for the action(s) that produce the index error.
- Verify permissions and branch protections: `gh api` queries and `gh secret list --repo`.
- Review Dockerfile and workflow steps that might copy large artifacts and cause indexer timeouts.
Short-Term Mitigations (quick steps)
- Disable or re-run failing index builds with proper permissions: `gh run rerun RUN_ID --repo owner/repo`.
- Move large artifacts out of the repo: create a new docs/ or data/ location to store metadata, and add `codeql-db*`, `tools/`, `data/`, and `backend/data/` to `.gitignore` and `.gitattributes` so they are excluded from the index.
- Add `codeql-db` to `.gitignore` and create `.gitattributes` entries to ensure no binary files are tracked without LFS:
  - `.gitattributes` addition: `codeql-db/** binary`
- Update `.gitignore`: add `/codeql-db`, `/data`, `/backend/data`, `/.vscode/`.
- Re-run the index build once large files are removed.
Medium-Term Fixes
- Git filter-repo/BFG: if large files are committed, remove them from history using `git filter-repo` or BFG and force-push a cleaned branch: `git filter-repo --strip-blobs-bigger-than 50M`.
- Convert large tracked binaries to Git LFS: `git lfs track "*.iso"` and `git lfs track "*.db"`, then `git add .gitattributes && git commit -m "Track binaries in LFS" && git push`.
- Prevent accidental artifacts in the future with a pre-commit hook (add to `scripts/pre-commit-hooks` and reference it in `.git/hooks` or the pre-commit framework): run a `pre-commit` rule to enforce `max-file-size` and LFS checks.
- Add a GH Actions health-check workflow (e.g., `.github/workflows/repo-health.yml`) that runs a small script to check for large files, LFS config, and codeql-db folders, and opens an issue if thresholds are exceeded.
- Document the LFS / large-file policy in docs/ (e.g., docs/getting-started.md and docs/features.md) and add codeowner references.
Longer-Term Hardening
- Automate periodic repo health checks (monthly) via GH Actions to run `git count-objects -v` and `find -size +` checks, and warn maintainers.
- Add a repository dashboard for `codeql-db` and other artifacts using `scripts/` and a GitHub Action to report stats to PRs or issues.
- Harden the remote indexing process by requesting GitHub support if the issue is intermittent and caused by GitHub's indexer failure.
- Add a `scripts/ci_pre_check.sh` script to run on PR open that checks `git fsck` and LFS, and ensures `codeql-db` is not committed.
- Add a `scripts/repo_health_check.sh` file and include it in `.vscode/tasks.json` and as a GH Action to be optionally invoked by maintainers.
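A minimal sketch of what `scripts/ci_pre_check.sh` could contain, limited to the checks named above; the `pre_check` helper name is mine, and the LFS comparison would still need to be filled in to match the repo's policy:

```shell
#!/bin/sh
# Hypothetical sketch of scripts/ci_pre_check.sh: fail fast when git fsck
# reports corruption or when a codeql-db directory is committed. An LFS
# check (e.g. comparing against `git lfs ls-files`) would sit alongside.
pre_check() {
  rc=0
  if git rev-parse --git-dir >/dev/null 2>&1; then
    git fsck --no-progress >/dev/null 2>&1 || { echo "git fsck failed" >&2; rc=1; }
    if git ls-files | grep -q '^codeql-db'; then
      echo "codeql-db is committed; remove it from the index" >&2
      rc=1
    fi
  fi
  return $rc
}

pre_check && echo "pre-check: OK"
```

On a clean checkout the script exits 0, so it can be wired directly into a PR-open workflow as a gating step.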
Recommended Fixes (exact files & edits, tests to run, CI checks)
- Add `.gitignore` and `.gitattributes` updates
  - Files to edit: `.gitignore`, `.gitattributes`.
  - Steps:
    - Add `/codeql-db`, `/codeql-db-*`, `/.vscode/`, `/backend/data`, `/data`, and `/node_modules` to `.gitignore`.
    - Add `*.db filter=lfs diff=lfs merge=lfs -text` to `.gitattributes` for DB files and `codeql-db/** binary` to mark CodeQL files as binary.
    - Commit and push.
  - Tests & checks:
    - Locally run `git status --ignored` to ensure these are now ignored.
    - Run `git lfs ls-files` to ensure large files are LFS-tracked if intended.
    - CI: ensure `pre-commit` checks pass; run `gh run view` to verify indexing.
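These ignore/attribute edits can be scripted so they are safe to re-run; a minimal sketch with the entry list taken from this plan (the demo writes into a scratch dir, but in practice it would run from the repo root):

```shell
#!/bin/sh
# Append the ignore/attribute entries from this plan, skipping lines that
# are already present so the script stays idempotent.
set -u
cd "$(mktemp -d)"   # demo only: use a scratch dir; run from the repo root in practice

add_line() {  # add_line FILE LINE
  touch "$1"
  grep -qxF "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

for p in /codeql-db '/codeql-db-*' /.vscode/ /backend/data /data /node_modules; do
  add_line .gitignore "$p"
done
add_line .gitattributes '*.db filter=lfs diff=lfs merge=lfs -text'
add_line .gitattributes 'codeql-db/** binary'

wc -l .gitignore .gitattributes
```

Running it twice leaves the files unchanged, which makes it suitable for a setup script or CI bootstrap step.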
- Add a small repo-health GH Action
  - Files to add: `.github/workflows/repo-health.yml`, `scripts/repo_health_check.sh`.
  - Steps:
    - Implement `scripts/repo_health_check.sh` to run `git count-objects -vH`, `find . -size +100M`, and `git fsck`, and print a short JSON summary.
    - Add `repo-health.yml` with a scheduled trigger and a PR check to run the script.
  - Tests & checks:
    - Run `bash scripts/repo_health_check.sh` locally; ensure it exits 0 when checks are clean.
    - CI: ensure the workflow runs and reports results in the Actions tab.
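A sketch of what `scripts/repo_health_check.sh` might look like, under the assumptions in this plan; the 100MB threshold matches the commands above, while the JSON field names and depth limit are illustrative:

```shell
#!/bin/sh
# Sketch of scripts/repo_health_check.sh: summarize repo size, files over
# 100MB, and committed codeql-db folders as a short JSON object.
set -u

size_kb=$(du -sk . 2>/dev/null | cut -f1)
large=$(find . -maxdepth 6 -type f -size +100M 2>/dev/null | wc -l | tr -d ' ')
codeql=$(find . -maxdepth 6 -type d -name 'codeql-db*' 2>/dev/null | wc -l | tr -d ' ')

printf '{"size_kb":%s,"files_over_100M":%s,"codeql_db_dirs":%s}\n' \
  "${size_kb:-0}" "$large" "$codeql"

if [ "$large" -eq 0 ] && [ "$codeql" -eq 0 ]; then
  echo "repo-health: OK"
else
  echo "repo-health: thresholds exceeded" >&2
fi
```

The JSON summary is easy to attach to a PR comment or issue from the scheduled workflow described above.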
- Build-time protections for CodeQL artifacts
  - Files to edit: `.github/workflows/ci.yml` (or the equivalent CI config) and `.gitattributes`/`.gitignore`.
  - Steps:
    - Remove `codeql-db` directories from CI cache and artifact paths; don't commit them.
    - Ensure the CodeQL analysis workflow uses `actions/cache` and `actions/upload-artifact` correctly, not storing DBs in the repo.
  - Tests & checks:
    - Re-run the CodeQL workflow: `gh run rerun RUN_ID --repo owner/repo` and verify the action no longer stores DBs as code.
- Pre-commit hook for large files & LFS enforcement
  - Files to add: `scripts/pre-commit-hooks/check-large-file.sh`, enabled via `.pre-commit-config.yaml` or `scripts/pre-commit-install.sh`.
  - Steps:
    - Implement a hook that fails commits containing files larger than 50MB unless they are tracked in LFS.
    - Add it to the `pre-commit` config and install it in the repo.
  - Tests & checks:
    - Attempt to commit a test file >50MB to verify the commit is rejected unless LFS-tracked.
    - CI: add a PR check running `pre-commit` to ensure commits follow policy.
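The hook's core check can be sketched as follows, assuming the pre-commit framework passes staged filenames as arguments; the `check_large` name and the pointer-file heuristic are mine, and a real hook would also consult `git lfs ls-files`:

```shell
#!/bin/sh
# Hypothetical core of scripts/pre-commit-hooks/check-large-file.sh: reject
# any listed file over 50MB. Files tracked by LFS appear in the index as
# tiny pointer files, so a genuinely large blob here means LFS is not used.
check_large() {
  max=$((50 * 1024 * 1024))
  rc=0
  for f in "$@"; do
    [ -f "$f" ] || continue
    size=$(wc -c < "$f")
    if [ "$size" -gt "$max" ]; then
      echo "ERROR: $f is $size bytes (>50MB); track it with git lfs" >&2
      rc=1
    fi
  done
  return $rc
}

printf 'small file' > /tmp/demo_small.txt
check_large /tmp/demo_small.txt && echo "hook passed"
```

Returning non-zero is all the pre-commit framework needs to block the commit and print the error lines to the committer.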
- CI policy verification for branches
  - Files / settings to revise: .github/workflows for runner permissions; branch protection settings via the GitHub UI.
  - Steps:
    - Confirm that user-level or organization-level `actions` and `workflows` permissions allow the required actions to run indexers.
    - Modify workflow triggers: ensure that `pull_request` and `push` do not include large artifacts or directories.
  - Tests & checks:
    - Open a PR with `scripts/` changes to trigger the updated workflows; verify that they run and pass.
- Automated monitoring & alerts
  - Files to add: `.github/workflows/monitor-repo.yml`, `scripts/repo-monitor.sh`.
  - Steps:
    - Implement a periodic monitoring workflow that runs the repo health checks and opens an issue or sends a Slack message when thresholds are crossed.
  - Tests & checks:
    - Run the workflow locally and on schedule to prove the alert state.
- Documentation updates
  - Files to update: docs/getting-started.md, docs/features.md, docs/security.md, CONTRIBUTING.md.
  - Steps:
    - Add guidelines for storing large artifacts, a policy to use Git LFS, and instructions for running `scripts/repo_health_check.sh`.
  - Tests & checks:
    - Verify that docs builds and CI pass with the updated instructions.
- CI integration tests to validate fixes
  - Files to add/edit: `.github/workflows/ci.yml`, `backend` build scripts, `frontend` script checks.
  - Steps:
    - Add a `ci.yml` step to run `bash scripts/repo_health_check.sh`, `go test ./...`, and `npm run build` (frontend) as a gating check.
  - Tests & checks:
    - Ensure `go build`, `go test`, and `npm run build` pass in CI after the changes.
- Forced cleanup and migration of large objects (if necessary)
  - Files to change: none; these are history-edit operations.
  - Steps:
    - If large files are present and the repo will not adopt LFS for them, use `git filter-repo --strip-blobs-bigger-than 50M` or BFG and push to a cleaned branch.
    - Recreate workflows or branch references as needed after the forced push.
  - Tests & checks:
    - Run `git count-objects -vH` before and after; verify the pack size decreases significantly.
- Validate GH Actions & secrets
  - Files to inspect: `.github/workflows/*` and GitHub repo settings (secrets).
  - Steps:
    - Ensure that required secrets, PATs, or `GITHUB_TOKEN` usage is correct; verify `actions/checkout` uses LFS fetch: `actions/checkout@v2` with `fetch-depth: 0` and `lfs` properly enabled.
  - Tests & checks:
    - Rerun an action to ensure `git lfs ls-files` lists the expected 4-5 known files and that the `gh run` does not fail with read errors.
Phased Work Plan
Phase 1 — Triage & evidence collection (2-3 days)
- Tasks:
- Run all diagnostic commands and collect output (`scripts/diagnostics/repo_health.OUTPUT`).
- Collect failing GH Action run logs and CodeQL run logs via `gh run view`.
- Determine whether the failure is reproducible or intermittent.
- Acceptance criteria:
- Have a full diagnostic report that identifies the top two likely causes.
- Have the failing Action ID or workflow file referenced.
Phase 2 — Short-term fixes & re-run (1-2 days)
- Tasks:
- Add immediate `.gitignore` and `.gitattributes` protections for known artifacts.
- Add `scripts/repo_health_check.sh`, a `.github/workflows/repo-health.yml` workflow, and pre-commit LFS checks.
- Re-run GH Actions and index builds to check for improvement.
- Acceptance criteria:
- Repo health workflow runs on PRs and schedule; the health check succeeds or reports actionable items.
- Index build does not fail due to size or missing LFS objects on re-run.
- No unexpected artifacts are present in commits.
- Pre-commit hooks block large files that are not tracked by Git LFS.
Status: Short-term fixes implemented (gitattributes, pre-commit LFS check, health script, scheduled workflow).
Phase 3 — Medium-term fixes (2-5 days)
- Tasks:
- Add `pre-commit` hooks and a …

CrowdSec Hub Presets Sync & Apply Plan (feature/beta-release)
🚨 CI/CD Incident Report - Run 20046135423-29 (2025-12-08 23:20 UTC)
Status: ALL BUILDS FAILING on feature/beta-release
Trigger: Push of commit 571a61a (CrowdSec cscli installation)
Impact: Docker publish blocked, codecov upload failed, all integration tests skipped
Root Causes Identified
1. CRITICAL: Missing Frontend File frontend/src/data/crowdsecPresets.ts
- Evidence: TypeScript compilation fails in Docker build, frontend tests, and WAF integration
- Error: `Cannot find module '../data/crowdsecPresets' or its corresponding type declarations`
- Affected Jobs:
  - Run 20046135429 (Docker Build) - exit code 2 at Dockerfile:47
  - Run 20046135423 (Frontend Codecov) - 2 test suites failed
  - Run 20046135424 (WAF Integration) - Docker build failed
  - Run 20046135426 (Quality Checks - Frontend) - test failures
- Files Importing Missing Module:
  - frontend/src/pages/CrowdSecConfig.tsx:17
  - frontend/src/pages/__tests__/CrowdSecConfig.spec.tsx:13
  - frontend/src/pages/__tests__/CrowdSecConfig.test.tsx (indirect via CrowdSecConfig.tsx)
- Type Errors Cascade:
  - CrowdSecConfig.tsx(86,52): error TS7006: Parameter 'preset' implicitly has an 'any' type
  - CrowdSecConfig.tsx(92-96): error TS2339: Property 'title'|'description'|'content'|'tags'|'warning' does not exist on type '{}'
  - 10 TypeScript errors in total prevent the npm build from completing
- Git History: the file never existed in the repository; it is referenced in current_spec.md line 6 but was never committed
- Remediation: create `frontend/src/data/crowdsecPresets.ts` with a `CROWDSEC_PRESETS` constant and a `CrowdsecPreset` type export
2. Backend Coverage Below Threshold (84.8% < 85.0% required)
- Evidence: Go test suite passes all tests but coverage check fails
- Error: `Coverage 84.8% is below required 85% (set CHARON_MIN_COVERAGE or CPM_MIN_COVERAGE to override)`
- Affected Job: Run 20046135423 (Backend Codecov Upload) - exit code 1
- Impact: Codecov upload skipped, quality gate not met
- Analysis: recent commits added CrowdSec hub sync code without corresponding unit tests
- Likely Contributors:
- Remediation Options:
  - Add unit tests for the new CrowdSec hub sync functions to reach 85%+ coverage
  - Temporarily lower the threshold via `CHARON_MIN_COVERAGE=84` (not recommended for merge)
  - Exclude untested experimental code from the coverage calculation until the implementation is complete
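The threshold comparison behind that CI error can be sketched as a small gate; `coverage_ok` is a hypothetical helper name, while the `CHARON_MIN_COVERAGE` override comes from the error message above:

```shell
#!/bin/sh
# Hypothetical sketch of the coverage gate: compare a measured total (e.g.
# parsed from the last line of `go tool cover -func=coverage.txt`) against
# the minimum, honoring the CHARON_MIN_COVERAGE override named in the error.
coverage_ok() {
  total=$1
  min=${CHARON_MIN_COVERAGE:-85.0}
  awk -v t="$total" -v m="$min" 'BEGIN { if (t + 0 >= m + 0) exit 0; exit 1 }'
}

coverage_ok 84.8 || echo "coverage below threshold"
```

With the default minimum, 84.8 fails the gate, which matches the reported run; the override path is what option two above relies on.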
3. Frontend Test Failures (2 test suites failed, 587 tests passed)
- Evidence: Vitest reports 2 failed suites due to missing module
- Affected Job: Run 20046135423 (Frontend Codecov Upload) & Run 20046135426 (Quality Checks)
- Failed Suites:
  - src/pages/__tests__/CrowdSecConfig.spec.tsx
  - src/pages/__tests__/CrowdSecConfig.test.tsx
- Root Cause: same as #1 - missing `crowdsecPresets.ts` file
- Consequence: frontend coverage calculation incomplete; the other 587 tests pass
4. Docker Multi-Arch Build Failure (linux/amd64, linux/arm64)
- Evidence: Build canceled at frontend stage with TypeScript errors
- Affected Job: Run 20046135429 (Docker Build, Publish & Test)
- Build Stages:
  - Stage `frontend-builder 6/6` failed during `npm run build`
  - Stage `backend-builder` canceled due to the frontend failure
  - No images pushed to ghcr.io; Trivy scan skipped
- Root Cause: Same as #1 - TypeScript compilation blocked by missing module
- Downstream Impact: Test Docker Image job skipped (no image available)
5. WAF Integration Tests Skipped
- Evidence: Docker build failed before tests could run
- Affected Job: Run 20046135424 (WAF Integration Tests)
- Build Step Failure: Same TypeScript errors at Dockerfile:47
- Container Status: `charon-debug` container never created
- Root Cause: same as #1 - build precondition not met
Are These Fixed by Recent Commits?
NO - Analysis of commits since 571a61a (the triggering commit at 2025-12-08 23:19:38Z):
- Commit 8f48e03: merge development → feature/beta-release (no fixes)
- Commit 32ed8bc: merge PR #332 development → feature/beta-release (no fixes)
- Latest commit on feature/beta-release: 32ed8bc (2025-12-09 00:26:07Z)
- Missing file still not present in workspace or git history
- Coverage issue unaddressed - no new tests added
Required Remediation Steps
IMMEDIATE (blocks all CI):
- Create Missing Frontend File (frontend/src/data/crowdsecPresets.ts) with this structure:

  ```typescript
  export interface CrowdsecPreset {
    slug: string;
    title: string;
    description: string;
    content: string;
    tags: string[];
    warning?: string;
  }

  export const CROWDSEC_PRESETS: CrowdsecPreset[] = [
    // Populate from backend/internal/crowdsec/presets.go or leave empty
  ];
  ```

- Verify TypeScript Compilation:

  ```shell
  cd frontend && npm run build
  cd frontend && npm run test:ci
  ```
REQUIRED (for merge):
- Add Backend Unit Tests for CrowdSec Hub Sync
  - Target files: backend/internal/crowdsec/hub_sync.go, hub_cache.go
  - Create: backend/internal/crowdsec/hub_sync_test.go, hub_cache_test.go
  - Achieve: ≥85% coverage threshold
- Run Full CI Validation:

  ```shell
  # Backend
  cd backend && go test ./... -v -coverprofile=coverage.txt
  # Frontend
  cd frontend && npm run test:ci
  # Docker
  docker build --platform linux/amd64 -t charon:test .
  ```
OPTIONAL (technical debt):
- Update Documentation
- Fix docs/plans/current_spec.md line 6 reference to non-existent file
- Add troubleshooting entry for missing preset file scenario
- Add Pre-commit Hook
- Validate TypeScript imports resolve before commit
- Block commits with missing module references
Prevention Measures
- Pre-commit validation: TypeScript type checking must pass (`tsc --noEmit`)
- Coverage enforcement: CI should fail immediately when coverage drops below the threshold
- Integration test gating: Block merge if Docker build fails on any platform
- Module existence checks: Lint for import statements referencing non-existent files
- Test coverage for new features: Require tests in same commit as feature code
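The "module existence checks" item above could be prototyped as a small lint over relative import specifiers; this is a hedged sketch, not an existing tool, and the `check_imports` helper plus the extension list are assumptions:

```shell
#!/bin/sh
# Hypothetical lint: for each relative `from '...'` specifier in a TS file,
# verify that a matching source file exists next to it.
check_imports() {
  file=$1; dir=$(dirname "$file"); rc=0
  mods=$(sed -n "s/.*from ['\"]\(\.[^'\"]*\)['\"].*/\1/p" "$file")
  for mod in $mods; do
    found=0
    for ext in .ts .tsx .js /index.ts; do
      [ -e "$dir/$mod$ext" ] && found=1
    done
    if [ "$found" -eq 0 ]; then
      echo "unresolved import in $file: $mod" >&2
      rc=1
    fi
  done
  return $rc
}

mkdir -p /tmp/lint_demo/src/data /tmp/lint_demo/src/pages
printf 'export const CROWDSEC_PRESETS = [];\n' > /tmp/lint_demo/src/data/crowdsecPresets.ts
printf "import { CROWDSEC_PRESETS } from '../data/crowdsecPresets';\n" \
  > /tmp/lint_demo/src/pages/CrowdSecConfig.tsx
check_imports /tmp/lint_demo/src/pages/CrowdSecConfig.tsx && echo "imports resolve"
```

In practice `tsc --noEmit` already catches this class of error; the shell version is only useful as a fast pre-commit guard without a node toolchain.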
Current State (what exists today)
- Backend: backend/internal/api/handlers/crowdsec_handler.go exposes `ListPresets` (returns the curated list from backend/internal/crowdsec/presets.go) and a stubbed `PullAndApplyPreset` that only validates the slug and returns a preview, or HTTP 501 when `apply=true`. No real hub sync or apply.
- Backend uses `CommandExecutor` for `cscli decisions` only; there is no hub pull/install logic and no cache/backups beyond the file-write backups in `WriteFile` and the import flow.
- Frontend: frontend/src/pages/CrowdSecConfig.tsx calls `pullAndApplyCrowdsecPreset`, then falls back to a local `writeCrowdsecFile` apply. The preset catalog merges the backend list with frontend/src/data/crowdsecPresets.ts. Errors 501/404 are surfaced as info to keep local apply working. The overview toggle/start/stop is already wired to `startCrowdsec`/`stopCrowdsec`.
- Docs: docs/cerberus.md still notes the CrowdSec integration is a placeholder; no hub sync is described.
Incident Triage: CrowdSec preset pull/apply 502/500 (feature/beta-release)
- Logs to pull first: backend app/GIN logs under `/app/data/logs/charon.log` (or `data/logs/charon.log` in dev) via backend/cmd/api/main.go; look for the warnings "crowdsec preset pull failed" / "crowdsec preset apply failed" emitted in backend/internal/api/handlers/crowdsec_handler.go. Access logs will also show 502/500 for POST `/api/v1/admin/crowdsec/presets/pull` and `/apply`.
- Routes and code paths: handlers `PullPreset` and `ApplyPreset` live in backend/internal/api/handlers/crowdsec_handler.go and delegate to `HubService.Pull`/`Apply` in backend/internal/crowdsec/hub_sync.go, with cache helpers in backend/internal/crowdsec/hub_cache.go. The data dir used is `data/crowdsec`, with the cache under `data/crowdsec/hub_cache`, from backend/internal/api/routes/routes.go.
- Quick checks before repro: (1) Cerberus enabled (`feature.cerberus.enabled` setting or `FEATURE_CERBERUS_ENABLED`/`CERBERUS_ENABLED` env), or the handler returns 404 early; (2) `cscli` on PATH and executable (`HubService` uses the real executor and calls `cscli version`/`cscli hub install`); (3) outbound HTTPS to https://hub.crowdsec.net reachable (fallback after `cscli hub list`); (4) cache dir `data/crowdsec/hub_cache` writable and containing per-slug `metadata.json`, `bundle.tgz`, `preview.yaml`; (5) backup path writable (apply renames `data/crowdsec` to `data/crowdsec.backup.<ts>`).
- Likely 502 on pull: hub cache unavailable or init failed (cache dir permission), invalid slug, hub index fetch errors (`cscli hub list -o json` or direct GET `/api/index.json`), download blocked or size >25MiB, preview/download HTTP non-200, or cache write errors. The handler logs a warning and returns 502 with the error string.
- Likely 500 on apply: backup rename fails, `cscli` install fails with no cache fallback (if pull never succeeded or the cache expired or went missing), cache read errors (`metadata.json`/`bundle.tgz` unreadable), tar extraction rejects symlinks/unsafe paths, or rollback after extract failure. The handler writes a `CrowdsecPresetEvent` (if the DB is reachable) with the backup path and returns 500 with a `backup` hint.
- Validation steps during triage: verify cache entry freshness (TTL 24h) via `metadata.json` timestamps; confirm `cscli hub install <slug>` succeeds manually; if cscli is missing, ensure a prior pull populated the cache; test hub egress with curl to the hub index and archive URLs; check file ownership/permissions on `data/crowdsec` and `data/crowdsec/hub_cache`; inspect the log lines around the warnings for the exact error message; inspect the backup directory to restore if an apply was partial.
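The cache-freshness check from the validation steps can be sketched as follows; the per-slug `metadata.json` layout comes from this triage doc, while using the file's mtime (rather than a parsed timestamp field) and the dual `stat` invocation for GNU vs BSD are my simplifications:

```shell
#!/bin/sh
# Sketch of the 24h TTL check: a cache entry is fresh if its metadata.json
# was written less than TTL seconds ago.
cache_fresh() {
  meta=$1
  ttl_secs=$((24 * 3600))
  [ -f "$meta" ] || return 1
  now=$(date +%s)
  mtime=$(stat -c %Y "$meta" 2>/dev/null || stat -f %m "$meta")
  [ $((now - mtime)) -lt "$ttl_secs" ]
}

d=$(mktemp -d)
printf '{}' > "$d/metadata.json"
cache_fresh "$d/metadata.json" && echo "cache entry is fresh"
```

During triage, pointing the same check at `data/crowdsec/hub_cache/<slug>/metadata.json` quickly tells you whether Apply could have fallen back to a valid cached bundle.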
Current incident: preset apply returning "Network Error" (feature/beta-release)
- What we see: the frontend reports an axios "Network Error" while applying a preset. Backend logs do not yet show the apply warning, suggesting the client drops before an HTTP response arrives. The apply path runs `HubService.Apply` in backend/internal/crowdsec/hub_sync.go with a 15s context; pull uses a 10s HTTP client timeout and does not follow redirects. Axios flags a network error when the TCP connection is reset or times out, rather than when a 4xx/5xx is returned.
- Probable roots to verify quickly:
  - Hub index/preview/archives now redirect to another host; our HTTP client forbids redirects, so FetchIndex/Pull return an error and the handler responds 502 only after the hub timeout. Long hub connect attempts can hit the 10s client timeout, causing the upstream (Caddy) or the browser to drop the socket and surface a network error.
  - The runtime image may be missing `cscli` if the release archive layout changed; the Dockerfile only moves the binaries when the expected paths exist. Without cscli, Apply falls back to the cache, but if Pull already failed, Apply exits with an error and no response body. Validate `cscli version` inside the running container built from feature/beta-release.
  - Outbound egress/proxy: the container must reach https://hub-data.crowdsec.net (default) from within the Docker network. A missing `HTTP(S)_PROXY`/`NO_PROXY` or a transparent MITM can cause TLS handshake or connection timeouts that the client reports as network errors.
  - TLS/HTML responses: the hub returning HTML (maintenance/Cloudflare) or a 3xx/302 to http is treated as an error (`hub index responded with HTML`), which becomes a 502. If the redirect/HTML arrives after ~10s, the browser may already have given up.
  - Timeout budget: 10s pull / 15s apply may be too tight for hub downloads plus cscli install. When the context cancels mid-stream, gin closes the connection and axios logs a network error instead of an HTTP code.
- Remediation plan (no code yet):
  - Confirm cscli exists in the runtime image from the Dockerfile by running `cscli version` inside the failing container; if missing, adjust the build or add a startup preflight that logs its absence and forces the HTTP hub path.
  - Override HUB_BASE_URL to a known JSON endpoint (e.g., https://hub-data.crowdsec.net/api/index.json) when redirects occur, or point to an internal mirror reachable from the Docker network; document this in the env examples.
  - Ensure outbound 443 to hub-data is allowed, or set `HTTP(S)_PROXY`/`NO_PROXY` on the container; retry pull/apply after validating `curl -v https://hub-data.crowdsec.net/api/index.json` inside the runtime.
  - Consider raising the pull/apply timeouts (and matching the frontend request timeout), and log when contexts cancel so we return a 504/timeout JSON instead of a dropped socket.
  - Capture docker logs for `charon-debug` during repro; look for `crowdsec preset pull/apply failed` warnings and any TLS/redirect messages from backend/internal/crowdsec/hub_sync.go.
Goal
Implement real CrowdSec Hub preset sync + apply on backend (using cscli or direct hub index) with caching, validation, backups, rollback, and wire the UI to new endpoints so operators can preview/apply hub items with clear status/errors.
Backend Plan (handlers, helpers, storage)
- Route adjustments (gin group under `/admin/crowdsec` in backend/internal/api/handlers/crowdsec_handler.go):
  - Replace the stub endpoint with `POST /admin/crowdsec/presets/pull` → fetch the hub item and cache it; returns metadata + preview + cache key/etag.
  - Add `POST /admin/crowdsec/presets/apply` → apply a previously pulled item by cache key/slug; performs backup + cscli install + optional restart.
  - Keep `GET /admin/crowdsec/presets` but include hub/etag info and whether the item is cached locally.
  - Optional: `GET /admin/crowdsec/presets/cache/:slug` → raw preview/download for the UI.
- Hub sync helper (new backend/internal/crowdsec/hub_sync.go):
  - Provide `type HubClient interface { FetchIndex(ctx) (HubIndex, error); FetchPreset(ctx, slug) (PresetBundle, error) }` with a real impl using either: a) `cscli hub list -o json` and `cscli hub update` + `cscli hub install <item>` (preferred if cscli is present), or b) a direct fetch of the https://hub.crowdsec.net/ or GitHub raw `index.json` + tarball download.
  - Validate downloads: size limits, tarball path-traversal guard, checksum/etag compare, basic YAML validation.
- Caching (new backend/internal/crowdsec/hub_cache.go):
  - Cache pulled bundles under `${DataDir}/hub_cache/<slug>/` with index metadata (etag, fetched_at, source URL) and a preview YAML.
  - Expose `LoadCachedPreset(slug)` and `StorePreset(slug, bundle)`; evict stale entries on TTL (configurable, default 24h) or when the etag changes.
- Apply flow (extend handler):
  - Pull: fetch index, resolve slug, download bundle to cache, return preview + warnings (missing cscli, requires restart, etc.).
  - Apply: before modifying, run `backupDir := DataDir + ".backup." + timestamp` (mirroring the current write/import backups). Then: a) if cscli is available: `cscli hub update`, `cscli hub install <slug>` (or the collection path), maybe a `cscli decisions list` sanity check, using `CommandExecutor` with a context timeout; b) if cscli is absent: extract the bundle into DataDir with sanitized paths, preserving permissions; c) write an audit record to a DB table `crowdsec_preset_events` (new model in backend/internal/models).
  - On failure: restore the backup (rename back), surface the error + backup path.
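The backup-then-rollback sequence in the apply flow can be sketched in shell for clarity; the real flow is Go code in hub_sync.go, `apply_with_backup` and the injected install command are illustrative, and `mktemp` here only guarantees a unique backup name (the plan names backups `data/crowdsec.backup.<ts>`):

```shell
#!/bin/sh
# Illustrative backup/rollback: rename the data dir aside, run the install
# step into a fresh dir, and restore the backup if the step fails.
apply_with_backup() {
  data_dir=$1; install_cmd=$2
  backup=$(mktemp -d "${data_dir}.backup.XXXXXX")  # unique backup name
  rm -rf "$backup"
  mv "$data_dir" "$backup" || return 1
  mkdir -p "$data_dir"
  if sh -c "$install_cmd"; then
    echo "applied; backup at $backup"
  else
    rm -rf "$data_dir" && mv "$backup" "$data_dir"   # rollback
    echo "apply failed; restored from $backup" >&2
    return 1
  fi
}

demo=$(mktemp -d)/crowdsec
mkdir -p "$demo"; echo 'old: config' > "$demo/acquis.yaml"
apply_with_backup "$demo" "echo 'new: config' > $demo/acquis.yaml"
```

Keeping the backup around after a successful apply is what lets the handler return the backup path in its response for manual restoration.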
- Status and restart:
  - After apply, optionally call `h.Executor.Stop`/`Start` if running to reload the config, or `cscli service reload` when available. Return a `reload_performed` flag.
- Validation & security hardening:
  - Enforce the Cerberus enablement check (`isCerberusEnabled`) on all new routes.
  - Path sanitization with `filepath.Clean`; limit tar extraction to DataDir; reject symlinks and absolute paths.
  - Timeouts on all external calls; default 10s pull, 15s apply.
  - Log with context: slug, etag, source, backup path; redact secrets.
- Migration of curated list:
  - Keep curated presets in backend/internal/crowdsec/presets.go but add `Source: "hub"` for hub-backed items and set `RequiresHub` true when not bundled.
  - `ListPresets` should merge curated + live hub index when available and mark availability per slug (cached, remote-only, local-bundled).
Frontend Plan (API wiring + UX)
- API client updates in frontend/src/api/presets.ts:
  - Replace `pullAndApplyCrowdsecPreset` with `pullCrowdsecPreset({ slug })` and `applyCrowdsecPreset({ slug, cache_key })`; include response typing for preview/status/errors.
  - Add `getCrowdsecPresetCache(slug)` if the backend exposes a cache preview.
- CrowdSec config page frontend/src/pages/CrowdSecConfig.tsx:
  - Use the new mutations: `pull` to show preview + metadata (etag, fetched_at, source); disable the local fallback unless the backend says `apply_supported=false`.
  - Show a status strip (success/error) and the backup path from the apply response; surface the reload flag and errors inline.
  - Gate preset actions when Cerberus is disabled; show a tooltip if the hub is unreachable.
  - Keep local backup + manual file apply as a last resort, only when the backend explicitly returns 501/NotImplemented.
- Overview page frontend/src/pages/Security.tsx:
  - No UI change except error surfacing when start/stop fails because a hub apply requires a reload; show the toast from the handler message.
- Import page frontend/src/pages/ImportCrowdSec.tsx:
  - Add a note linking to the presets apply flow so users prefer presets over raw package imports.
Hub Fetch/Validate/Apply Flow (detailed)
- Pull
  - Handler: `CrowdsecHandler.PullPreset(ctx)` (new) calls `HubClient.FetchPreset` → `HubCache.StorePreset` → returns `{preset, preview_yaml, etag, cache_key, fetched_at}`.
  - If the hub is unavailable, return 503 with a message; the UI shows a retry/cached-copy option.
- Apply
  - Handler: `CrowdsecHandler.ApplyPreset(ctx)` loads the cache by slug/cache_key → `backupCurrentConfig()` → `InstallPreset()` (cscli or manual) → optional restart → returns `{status:"applied", backup, reloaded:true/false}`.
  - On error: restore the backup, include `{status:"failed", backup, error}`.
- Caching & rollback
  - Cache directory per slug with a checksum file; TTL enforced on pull; apply uses the cached bundle unless a `force_refetch` flag is set.
  - Backups stored with a timestamp; keep the last N (configurable). Provide a restoration note in the response for the UI.
- Validation
  - Tarball extraction guard: reject absolute paths, `..`, and symlinks; limit total size.
  - YAML sanity: parse key scenario/collection files to ensure they are readable; log a warning rather than block, unless the parse fails.
  - Require an explicit `apply=true` separate from pull; no implicit apply on pull.
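The extraction guard can be prototyped as a filter over archive member names; this is a shell sketch only (the real guard lives in Go in hub_sync.go), and symlink rejection additionally needs type information from `tar -tvzf`:

```shell
#!/bin/sh
# Reject unsafe tar member names: absolute paths or any ".." component.
# Reads one member name per line, e.g. from `tar -tzf bundle.tgz`.
safe_members() {
  rc=0
  while IFS= read -r m; do
    case "$m" in
      /*|../*|*/../*|*/..|..) echo "unsafe member: $m" >&2; rc=1 ;;
    esac
  done
  return $rc
}

printf 'scenarios/foo.yaml\nparsers/s01-parse/bar.yaml\n' | safe_members && echo "bundle looks safe"
```

Checking the member list before extracting mirrors the plan's rule of limiting extraction to DataDir: anything that would escape the target directory is rejected up front.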
Security Considerations
- Only allow these endpoints when Cerberus enabled and user authenticated to admin scope.
- Use `CommandExecutor` to shell out to cscli; restrict PATH and the working dir; do not pass user-controlled args without a whitelist.
- Network egress: if the hub URL is configurable, validate that the scheme is https and the host is allowlisted (CrowdSec official or a configured mirror).
- Rate limit pull/apply (simple in-memory token bucket) to avoid abuse.
- Logging: include slug and etag, omit file contents; redact download URLs if they contain tokens (unlikely).
Required Tests
- Backend unit/integration:
  - backend/internal/api/handlers/crowdsec_handler_test.go: success and error cases for `PullPreset` (hub reachable/unreachable, invalid slug), `ApplyPreset` (cscli success, cscli-missing fallback, apply fails and restores backup), and `ListPresets` merging cached hub entries.
  - backend/internal/crowdsec/hub_sync_test.go: parse index JSON, validate tar extraction guards, TTL eviction.
  - backend/internal/crowdsec/hub_cache_test.go: store/load/evict logic and checksum verification.
  - backend/internal/api/handlers/crowdsec_exec_test.go: ensure executor timeouts/commands are constructed correctly for cscli hub calls.
- Frontend unit/UI:
  - frontend/src/pages/__tests__/CrowdSecConfig.test.tsx: pull shows preview; apply success shows backup path/reload flag; hub failure falls back to the cached/local message; Cerberus disabled disables actions.
  - frontend/src/api/__tests__/presets.test.ts: client hits the new endpoints and maps the response.
  - frontend/src/pages/__tests__/Security.test.tsx: start/stop toasts remain correct when apply errors bubble up.
Docs Updates
- Update docs/cerberus.md CrowdSec section with new hub preset flow, backup/rollback notes, and requirement for cscli availability when using hub.
- Update docs/features.md to list “CrowdSec Hub presets sync/apply (admin)” and mention offline curated fallback.
- Add short troubleshooting entry in docs/troubleshooting/crowdsec.md (new) for hub unreachable, checksum mismatch, or cscli missing.
Migration Notes
- Existing curated presets remain but are marked as bundled; UI should continue to show them even if hub unreachable.
- Stub endpoint `POST /admin/crowdsec/presets/pull/apply` is replaced by separate `pull` and `apply` endpoints; the frontend must switch to the new API paths before backend removal to avoid 404s.
- Backward compatibility: keep returning 501 from the old endpoint until the frontend change is merged; remove it once the new routes are live and tested.