CrowdSec Preset Apply Cache Miss — Bot Mitigation Essentials
Date: December 11, 2025
Incident: CrowdSec preset add error: Apply failed: load cache: load cache for bot-mitigation-essentials: cache miss. Backup created at data/crowdsec.backup.20251210-193359
Context Snapshot
- Observed error path: `HubService.Apply()` → `loadCacheMeta()` → `HubCache.Load()` returns `ErrCacheMiss`, while apply already created a backup at `data/crowdsec.backup.*`, indicating we fell through the cscli path and then the manual cache path without a cached bundle.
- Key components in play:
  - Cache layer: backend/internal/crowdsec/hub_cache.go (`Store`, `Load`, `List`, `Exists`, `Touch`)
  - Hub orchestration: backend/internal/crowdsec/hub_sync.go (`Pull`, `Apply`, `loadCacheMeta`, `runCSCLI`, `extractTarGz`)
  - HTTP surface: backend/internal/api/handlers/crowdsec_handler.go (`PullPreset`, `ApplyPreset`, `ListPresets`, `GetCachedPreset`)
  - Coverage and repro baselines: backend/internal/crowdsec/hub_pull_apply_test.go, backend/internal/api/handlers/crowdsec_pull_apply_integration_test.go
- Hypotheses to validate:
  - Cache never created for slug `bot-mitigation-essentials` (e.g., hub index didn't contain slug, slug mismatch, or pull failure masked by fallback logging).
  - Cache existed but expired/evicted (24h TTL default in `NewHubCache`, `ErrCacheExpired` treated as miss) before apply.
  - cscli path failed and manual path fell back to cache that was missing; backup already created → rollback not restoring correctly on miss.
  - Slug naming drift between curated presets and hub index (e.g., `crowdsecurity/bot-mitigation-essentials` vs `bot-mitigation-essentials`).
Plan (phased; minimize requests)
Phase 1 — Fast Forensics (no new mutations)
- Inspect logs for the failing apply to capture:
  - `crowdsec preset apply failed` entries in backend/internal/api/handlers/crowdsec_handler.go (ensure we log `cache_key`, `backup_path`, `hub_base_url`).
  - Prior `preset pulled and cached successfully` entries for the same slug to see if pull ever succeeded.
- Check cache filesystem state without new pulls:
  - List `data/hub_cache/` and `backend/data/hub_cache/` for `bot-mitigation-essentials` to confirm presence of `metadata.json`, `bundle.tgz`, `preview.yaml`.
  - Read `metadata.json` to confirm `retrieved_at` vs TTL and `cache_key`.
- Confirm whether curated presets include the slug:
  - Inspect `ListCuratedPresets()` in backend/internal/crowdsec/presets.go (if present) and compare to hub index slugs.
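The `metadata.json` check in this phase can be scripted rather than read by eye. The following is a minimal sketch, assuming the file is a JSON object whose `retrieved_at` field is an RFC 3339 timestamp and whose `cache_key` field is a string. The struct, the helper names (`parseMeta`, `ttlRemaining`), and the path layout are hypothetical; the 24h TTL is the default noted in the hypotheses above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// cacheMeta mirrors only the two fields this plan mentions.
// Assumption: retrieved_at is serialized as an RFC 3339 timestamp.
type cacheMeta struct {
	CacheKey    string    `json:"cache_key"`
	RetrievedAt time.Time `json:"retrieved_at"`
}

// parseMeta decodes the raw bytes of a metadata.json file.
func parseMeta(raw []byte) (cacheMeta, error) {
	var m cacheMeta
	err := json.Unmarshal(raw, &m)
	return m, err
}

// ttlRemaining reports how much of the TTL is left at time now.
// A zero or negative result means the entry has expired.
func ttlRemaining(m cacheMeta, ttl time.Duration, now time.Time) time.Duration {
	return ttl - now.Sub(m.RetrievedAt)
}

func main() {
	// Hypothetical path; adjust to the real cache layout.
	path := "data/hub_cache/bot-mitigation-essentials/metadata.json"
	raw, err := os.ReadFile(path)
	if err != nil {
		fmt.Println("no metadata found (likely a cache miss):", err)
		return
	}
	m, err := parseMeta(raw)
	if err != nil {
		fmt.Println("metadata unreadable:", err)
		return
	}
	left := ttlRemaining(m, 24*time.Hour, time.Now())
	fmt.Printf("cache_key=%s retrieved_at=%s ttl_remaining=%s\n",
		m.CacheKey, m.RetrievedAt.Format(time.RFC3339), left)
}
```

This only reads from disk, so it stays within the "no new mutations" constraint of this phase.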
Phase 2 — Reproduce with Minimal Requests
- Execute one controlled pull + apply sequence for `bot-mitigation-essentials` only:
  - `POST /api/v1/admin/crowdsec/presets/pull {slug}`: capture response `cache_key`, `etag`, and verify cache files written.
  - `POST /api/v1/admin/crowdsec/presets/apply {slug}`: watch for fallback message `load cache for ... cache miss`.
- Capture logs around these calls to see which path ran:
  - `HubService.Apply()` branch (`hasCSCLI`, `runCSCLI` success/fail, then `loadCacheMeta`).
  - `HubCache.Load()` result (hit/expired/miss).
- Validate backup rollback: ensure `data/crowdsec` is restored from `data/crowdsec.backup.*` when a cache miss occurs.
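The single pull + apply sequence could be driven from a small Go program so that exactly two requests are made. This is a sketch under assumptions: the base URL `http://localhost:8080` is hypothetical, the `{slug}` body is assumed to be JSON of the form `{"slug": "..."}`, and any authentication the admin API requires is omitted.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// newPresetRequest builds a POST for the pull or apply endpoint.
// Assumption: the {slug} body is JSON of the form {"slug": "..."}.
func newPresetRequest(baseURL, action, slug string) (*http.Request, error) {
	body, err := json.Marshal(map[string]string{"slug": slug})
	if err != nil {
		return nil, err
	}
	url := baseURL + "/api/v1/admin/crowdsec/presets/" + action
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	client := &http.Client{Timeout: 30 * time.Second}
	// Exactly one pull followed by one apply, to keep hub traffic minimal.
	for _, action := range []string{"pull", "apply"} {
		req, err := newPresetRequest("http://localhost:8080", action, "bot-mitigation-essentials")
		if err != nil {
			fmt.Println("build request:", err)
			return
		}
		resp, err := client.Do(req)
		if err != nil {
			fmt.Println(action, "request failed:", err)
			return
		}
		payload, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		// Print the raw payload so cache_key, etag, or the error text can be captured.
		fmt.Printf("%s -> %s\n%s\n", action, resp.Status, payload)
	}
}
```

Printing the raw payload avoids guessing the response schema; the fields of interest can be picked out of the output by hand.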
Phase 3 — Code Fix Design (targeted, low-risk)
- Cache resilience:
  - In `HubService.Apply()`, when `runCSCLI` fails and `loadCacheMeta` returns `ErrCacheMiss`, attempt a single `Pull()` retry (hub available) before failing, but guard with context and size limits.
  - On `ErrCacheExpired`, auto-evict + repull once to refresh.
- Slug correctness & curated mapping:
  - Ensure curated preset slug list includes `crowdsecurity/bot-mitigation-essentials` (verify file backend/internal/crowdsec/presets.go).
  - In `findIndexEntry` (hub_sync.go), consider accepting slug without namespace by matching suffix when unique to avoid hub miss.
- Better guidance and rollback:
  - In `ApplyPreset` handler, if cache miss occurs after backup creation, ensure rollback succeeds and return `backup` + actionable guidance (e.g., "Pull preset again; cache missing").
  - Add explicit log when rollback triggers due to cache miss, including backup path and slug.
- TTL visibility:
  - Add `retrieved_at` and TTL remaining to `GetCachedPreset` and `ListPresets` outputs to help UI warn about expired cache.
- CSCLI guardrails:
  - If `cscli` is not found or returns non-zero, include stderr in logs and surface a friendlier hint in the error payload.
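The cache-resilience item in this phase (one re-pull on miss, evict then re-pull on expiry) can be sketched as a small helper. This is not the existing `HubService` code: the `bundleStore` interface, `loadWithRepull`, and the in-memory `memStore` fake are hypothetical, and the sentinel errors are local stand-ins that only reuse the names from this plan. Size limits are left out; the context is threaded through so a caller can bound the retry.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Sentinel names follow the plan; the definitions here are stand-ins.
var (
	ErrCacheMiss    = errors.New("cache miss")
	ErrCacheExpired = errors.New("cache expired")
)

// bundleStore is a hypothetical seam over the cache and hub operations.
type bundleStore interface {
	Load(ctx context.Context, slug string) (string, error)
	Evict(ctx context.Context, slug string) error
	Pull(ctx context.Context, slug string) error
}

// loadWithRepull loads a cached bundle and, on a miss or an expired entry,
// re-pulls at most once before giving up.
func loadWithRepull(ctx context.Context, s bundleStore, slug string) (string, error) {
	bundle, err := s.Load(ctx, slug)
	if err == nil {
		return bundle, nil
	}
	switch {
	case errors.Is(err, ErrCacheExpired):
		// Drop the stale entry so the re-pull writes a fresh one.
		if evictErr := s.Evict(ctx, slug); evictErr != nil {
			return "", fmt.Errorf("evict %s: %w", slug, evictErr)
		}
	case errors.Is(err, ErrCacheMiss):
		// Nothing to evict.
	default:
		return "", err
	}
	if pullErr := s.Pull(ctx, slug); pullErr != nil {
		return "", fmt.Errorf("repull %s after %v: %w", slug, err, pullErr)
	}
	// Second and final attempt; a failure here is returned as-is.
	return s.Load(ctx, slug)
}

// memStore is an in-memory fake used to exercise the helper.
type memStore struct {
	bundles map[string]string
	pulls   int
}

func (m *memStore) Load(_ context.Context, slug string) (string, error) {
	b, ok := m.bundles[slug]
	if !ok {
		return "", ErrCacheMiss
	}
	return b, nil
}

func (m *memStore) Evict(_ context.Context, slug string) error {
	delete(m.bundles, slug)
	return nil
}

func (m *memStore) Pull(_ context.Context, slug string) error {
	m.pulls++
	m.bundles[slug] = "bundle:" + slug
	return nil
}

func main() {
	s := &memStore{bundles: map[string]string{}}
	b, err := loadWithRepull(context.Background(), s, "bot-mitigation-essentials")
	fmt.Println(b, err, s.pulls)
}
```

Keeping the retry count fixed at one means a hub outage still fails fast, which matters because the backup has already been created by the time this path runs.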
Phase 4 — Tests & Repro Harness
- Add regression tests:
  - `HubService` unit: `Apply` with `ErrCacheMiss` triggers single repull then succeeds (mock HTTP + cache).
  - Integration handler: simulate missing cache after pull (evict between pull/apply) → expect repull or clear error and rollback confirmed.
  - Slug normalization test: `bot-mitigation-essentials` (no namespace) maps to `crowdsecurity/bot-mitigation-essentials` when hub index only has the namespaced entry.
  - Backup rollback test: ensure `data/crowdsec` restored on cache-miss failure.
- Extend logging assertions in existing tests to validate `cache_key` and `backup` presence in error responses.
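The slug normalization case is easiest to test if the matching rule is a pure function. A minimal sketch of the "match by suffix when unique" idea follows; `resolveSlug` is a hypothetical helper, not the current `findIndexEntry`, and the index is modelled as a plain slice of slug strings for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveSlug returns the hub index entry for want. An exact match wins;
// otherwise a bare slug matches a namespaced entry by "/<slug>" suffix,
// but only when exactly one entry qualifies.
func resolveSlug(index []string, want string) (string, bool) {
	for _, entry := range index {
		if entry == want {
			return entry, true
		}
	}
	if strings.Contains(want, "/") {
		// Already namespaced: do not guess across namespaces.
		return "", false
	}
	match, count := "", 0
	for _, entry := range index {
		if strings.HasSuffix(entry, "/"+want) {
			match = entry
			count++
		}
	}
	if count == 1 {
		return match, true
	}
	return "", false // no match, or ambiguous
}

func main() {
	// Sample index; the second entry is made up for the example.
	index := []string{
		"crowdsecurity/bot-mitigation-essentials",
		"acme/other-preset",
	}
	got, ok := resolveSlug(index, "bot-mitigation-essentials")
	fmt.Println(got, ok)
}
```

Rejecting ambiguous suffixes keeps the fallback from silently applying a preset from the wrong namespace.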
Phase 5 — Observability & UX polish
- Add a lightweight cache status endpoint or extend `ListPresets` to include `cache_state: [hit|expired|miss]` per slug.
- Frontend (CrowdSecConfig.tsx) follow-up (future PR): surface cache age, "repull" CTA on cache miss, and show backup path when apply fails. (Keep frontend changes out of this fix unless necessary.)
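One possible shape for the per-slug `cache_state` value is sketched below. The `presetStatus` struct and its `slug` and `ttl_remaining_seconds` fields are illustrative assumptions; only `cache_state` and `retrieved_at` come from this plan, and the actual `ListPresets` payload may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// presetStatus is a hypothetical per-slug payload.
type presetStatus struct {
	Slug                string     `json:"slug"`
	CacheState          string     `json:"cache_state"`
	RetrievedAt         *time.Time `json:"retrieved_at,omitempty"`
	TTLRemainingSeconds int64      `json:"ttl_remaining_seconds"`
}

// cacheState classifies an entry as "hit", "expired", or "miss".
func cacheState(cached bool, age, ttl time.Duration) string {
	switch {
	case !cached:
		return "miss"
	case age >= ttl:
		return "expired"
	default:
		return "hit"
	}
}

// statusFor builds the payload; a nil retrievedAt means nothing is cached.
func statusFor(slug string, retrievedAt *time.Time, ttl time.Duration, now time.Time) presetStatus {
	st := presetStatus{Slug: slug, RetrievedAt: retrievedAt}
	var age time.Duration
	if retrievedAt != nil {
		age = now.Sub(*retrievedAt)
	}
	st.CacheState = cacheState(retrievedAt != nil, age, ttl)
	if st.CacheState == "hit" {
		st.TTLRemainingSeconds = int64((ttl - age).Seconds())
	}
	return st
}

func main() {
	now := time.Now()
	fresh := now.Add(-1 * time.Hour)
	out, _ := json.MarshalIndent([]presetStatus{
		statusFor("crowdsecurity/bot-mitigation-essentials", &fresh, 24*time.Hour, now),
		statusFor("bot-mitigation-essentials", nil, 24*time.Hour, now),
	}, "", "  ")
	fmt.Println(string(out))
}
```

Reporting the state as a string rather than a boolean lets the UI distinguish "never pulled" from "pulled but stale", which call for different prompts.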
Phase 6 — Verification Checklist (one pass)
- `go test ./backend/internal/crowdsec ./backend/internal/api/handlers -run 'Pull|Apply' -v` (or focused test names added above). Quote the `-run` pattern so the shell does not treat `|` as a pipe.
- `cd backend && go test ./...` to ensure no regressions.
- Manual: pull + apply `crowdsecurity/bot-mitigation-essentials` twice; second apply should hit cache without backup churn.
- Confirm logs show cache hit and no `cache miss` warnings; backup directory not recreated on cache hit.
- Validate data directories remain git-ignored (`/data/`, `/backend/data/`, backups under `/data/backups/`).
Config File Review
- .gitignore: already ignores `/data/` and `/data/backups/`; covers cache/backup artifacts (`backend/data/`). No change needed.
- .dockerignore: excludes `data/` and `backend/data/`, keeping hub cache/backup out of build context. No change needed.
- .codecov.yml: excludes `backend/data/**`; cache/backup coverage not expected. No change needed.
- Dockerfile: installs `cscli`; ensure version is recent enough for hub pulls (currently `CROWDSEC_VERSION=1.7.4`). No adjustments required for this fix, but verify the image still includes cscli after build.
Deliverables
- Patch for cache-miss resilience and slug normalization in `HubService.Apply()` and helpers.
- Error/logging improvements in `ApplyPreset` handler.
- Regression tests covering cache-miss + repull, slug normalization, and rollback behavior.
- Optional: cache-status enrichment for UI consumption (if small and low-risk).