CrowdSec Preset Apply Failure - Fix Plan
Date: December 12, 2025
Status: Analysis Complete - Ready for Implementation
Severity: High
Issue Summary
User reported error when applying a CrowdSec preset:
Apply failed: read archive: open /app/data/crowdsec/hub_cache/crowdsecurity/caddy/bundle.tgz: no such file or directory. Backup created at /app/data/crowdsec.backup.20251211-194408
Root Cause Analysis
The Bug
The Apply() function in hub_sync.go has a fatal ordering bug that destroys the cache before reading from it.
Detailed Flow
1. Pull Phase (works correctly)
   - User pulls preset `crowdsecurity/caddy`
   - `HubCache.Store()` writes to `/app/data/crowdsec/hub_cache/crowdsecurity/caddy/bundle.tgz`
   - `CachedPreset.ArchivePath` stores this absolute path
2. Apply Phase (bug occurs)
   - Step 1: `loadCacheMeta()` returns `meta.ArchivePath = "/app/data/crowdsec/hub_cache/.../bundle.tgz"`
   - Step 2: `backupExisting()` RENAMES `/app/data/crowdsec` to `/app/data/crowdsec.backup.TIMESTAMP`. ⚠️ This moves the cache too, because `hub_cache` is INSIDE `crowdsec/`
   - Step 3: `cscli` fails (not available, or the preset is not in the hub)
   - Step 4: `os.ReadFile(meta.ArchivePath)` fails with FILE NOT FOUND: the path still points to `/app/data/crowdsec/...`, but that directory was renamed
Visual Representation
Before Backup:
/app/data/crowdsec/
├── hub_cache/
│ └── crowdsecurity/
│ └── caddy/
│ ├── bundle.tgz ← meta.ArchivePath points here
│ ├── preview.yaml
│ └── metadata.json
├── config.yaml
└── other_files/
After backupExisting() (line 535):
/app/data/crowdsec.backup.20251211-194408/ ← Renamed!
├── hub_cache/
│ └── crowdsecurity/
│ └── caddy/
│ ├── bundle.tgz ← File is now HERE
│ ├── preview.yaml
│ └── metadata.json
├── config.yaml
└── other_files/
/app/data/crowdsec/ ← Directory no longer exists!
Result: os.ReadFile(meta.ArchivePath) fails because the path /app/data/crowdsec/hub_cache/.../bundle.tgz no longer exists.
Why This Wasn't Caught Earlier
- Tests use temp directories: each test creates fresh directories, so the ordering bug never manifests
- `cscli` path succeeds in CI: when `cscli` is available and works, the code returns early before hitting the bug
- Recent changes to backup logic: the copy-based fallback and backup improvements may have introduced this ordering issue
- Cache directory nested inside `DataDir`: the architecture decision to put `hub_cache` inside `DataDir` (the CrowdSec config dir) creates this coupling
Fix Options
Option A: Read Archive Before Backup (Recommended)
Rationale: Simple, minimal change, maintains existing backup behavior.
File: backend/internal/crowdsec/hub_sync.go
Changes:
func (s *HubService) Apply(ctx context.Context, slug string) (ApplyResult, error) {
// ... existing validation code ...
result := ApplyResult{AppliedPreset: cleanSlug, Status: "failed"}
meta, metaErr := s.loadCacheMeta(applyCtx, cleanSlug)
if metaErr == nil {
result.CacheKey = meta.CacheKey
}
hasCS := s.hasCSCLI(applyCtx)
// === NEW: Read archive BEFORE backup ===
var archive []byte
var archiveErr error
if metaErr == nil {
archive, archiveErr = os.ReadFile(meta.ArchivePath)
if archiveErr != nil {
logger.Log().WithError(archiveErr).WithField("archive_path", meta.ArchivePath).Warn("failed to read cached archive before backup")
}
}
// === END NEW ===
backupPath := filepath.Clean(s.DataDir) + ".backup." + time.Now().Format("20060102-150405")
if err := s.backupExisting(backupPath); err != nil {
return result, fmt.Errorf("backup: %w", err)
}
result.BackupPath = backupPath
// Try cscli first
if hasCS {
cscliErr := s.runCSCLI(applyCtx, cleanSlug)
if cscliErr == nil {
result.Status = "applied"
result.ReloadHint = true
result.UsedCSCLI = true
return result, nil
}
logger.Log().WithField("slug", cleanSlug).WithError(cscliErr).Warn("cscli install failed; attempting cache fallback")
}
// === MODIFIED: Use pre-loaded archive or refresh ===
if metaErr != nil || archiveErr != nil {
refreshed, refreshErr := s.refreshCache(applyCtx, cleanSlug, metaErr)
if refreshErr != nil {
_ = s.rollback(backupPath)
return result, fmt.Errorf("load cache for %s: %w", cleanSlug, refreshErr)
}
meta = refreshed
result.CacheKey = meta.CacheKey
// Re-read archive from refreshed cache location
archive, archiveErr = os.ReadFile(meta.ArchivePath)
if archiveErr != nil {
_ = s.rollback(backupPath)
return result, fmt.Errorf("read archive: %w", archiveErr)
}
}
// Use the pre-loaded archive bytes
if err := s.extractTarGz(applyCtx, archive, s.DataDir); err != nil {
_ = s.rollback(backupPath)
return result, fmt.Errorf("extract: %w", err)
}
// === END MODIFIED ===
result.Status = "applied"
result.ReloadHint = true
result.UsedCSCLI = false
return result, nil
}
Option B: Move Cache Outside DataDir
Rationale: Architectural fix - separates transient cache from operational config.
Files to modify:
- backend/internal/api/handlers/crowdsec_handler.go - Change cache location
- backend/internal/crowdsec/hub_sync.go - Add cache dir parameter
Changes:
// In NewCrowdsecHandler:
// BEFORE:
cacheDir := filepath.Join(dataDir, "hub_cache")
// AFTER:
cacheDir := filepath.Join(filepath.Dir(dataDir), "hub_cache")
// Results in: /app/data/hub_cache (sibling of crowdsec, not child)
Pros: Clean separation; cache survives config resets.
Cons: Breaking change for existing installs; requires migration.
Option C: Selective Backup (Exclude Cache)
Rationale: Only backup config files, not cache.
Changes to backupExisting():
func (s *HubService) backupExisting(backupPath string) error {
    // ... existing checks ...
    // Skip hub_cache during backup - it's transient.
    return filepath.WalkDir(s.DataDir, func(path string, d fs.DirEntry, err error) error {
        if err != nil {
            return err
        }
        // Return SkipDir only for the hub_cache directory entry itself;
        // returning it for a file would skip the rest of that file's parent dir.
        if d.IsDir() && d.Name() == "hub_cache" {
            return filepath.SkipDir
        }
        // ... copy logic ...
        return nil
    })
}
Pros: Faster backups; cache preserved.
Cons: More complex; backup is no longer a complete snapshot.
Recommended Implementation
Choose Option A for these reasons:
- Minimal code change - Single function modification
- No breaking changes - Existing cache paths remain valid
- No migration needed - Works immediately
- Maintains complete backups - Backup still captures full state
- Easy to test - Clear before/after behavior
Files to Modify
| File | Change |
|---|---|
| backend/internal/crowdsec/hub_sync.go | Reorder archive read before backup in Apply() |
| backend/internal/crowdsec/hub_sync_test.go | Add test for apply with backup scenario |
| backend/internal/crowdsec/hub_pull_apply_test.go | Add regression test |
Specific Code Changes
Change 1: hub_sync.go - Apply() Function
Location: Lines 514-580
Before:
func (s *HubService) Apply(ctx context.Context, slug string) (ApplyResult, error) {
cleanSlug := sanitizeSlug(slug)
// ... validation ...
result := ApplyResult{AppliedPreset: cleanSlug, Status: "failed"}
meta, metaErr := s.loadCacheMeta(applyCtx, cleanSlug)
if metaErr == nil {
result.CacheKey = meta.CacheKey
}
hasCS := s.hasCSCLI(applyCtx)
backupPath := filepath.Clean(s.DataDir) + ".backup." + time.Now().Format("20060102-150405")
if err := s.backupExisting(backupPath); err != nil {
return result, fmt.Errorf("backup: %w", err)
}
result.BackupPath = backupPath
// Try cscli first
if hasCS {
// ... cscli logic ...
}
if metaErr != nil {
// ... refresh cache logic ...
}
archive, err := os.ReadFile(meta.ArchivePath) // ❌ FAILS - file moved by backup!
if err != nil {
_ = s.rollback(backupPath)
return result, fmt.Errorf("read archive: %w", err)
}
// ...
}
After:
func (s *HubService) Apply(ctx context.Context, slug string) (ApplyResult, error) {
cleanSlug := sanitizeSlug(slug)
// ... validation ...
result := ApplyResult{AppliedPreset: cleanSlug, Status: "failed"}
meta, metaErr := s.loadCacheMeta(applyCtx, cleanSlug)
if metaErr == nil {
result.CacheKey = meta.CacheKey
}
hasCS := s.hasCSCLI(applyCtx)
// ✅ NEW: Read archive into memory BEFORE backup moves the files
var archive []byte
var archiveReadErr error
if metaErr == nil {
archive, archiveReadErr = os.ReadFile(meta.ArchivePath)
if archiveReadErr != nil {
logger.Log().WithError(archiveReadErr).WithField("archive_path", meta.ArchivePath).
Warn("failed to read cached archive before backup")
}
}
backupPath := filepath.Clean(s.DataDir) + ".backup." + time.Now().Format("20060102-150405")
if err := s.backupExisting(backupPath); err != nil {
return result, fmt.Errorf("backup: %w", err)
}
result.BackupPath = backupPath
// Try cscli first
if hasCS {
cscliErr := s.runCSCLI(applyCtx, cleanSlug)
if cscliErr == nil {
result.Status = "applied"
result.ReloadHint = true
result.UsedCSCLI = true
return result, nil
}
logger.Log().WithField("slug", cleanSlug).WithError(cscliErr).
Warn("cscli install failed; attempting cache fallback")
}
// ✅ MODIFIED: Handle cache miss OR failed archive read
if metaErr != nil || archiveReadErr != nil {
// Need to refresh cache (either wasn't cached or file was unreadable)
originalErr := metaErr
if originalErr == nil {
originalErr = archiveReadErr
}
refreshed, refreshErr := s.refreshCache(applyCtx, cleanSlug, originalErr)
if refreshErr != nil {
_ = s.rollback(backupPath)
logger.Log().WithError(refreshErr).WithField("slug", cleanSlug).
WithField("backup_path", backupPath).
Warn("cache refresh failed; rolled back backup")
result.ErrorMessage = fmt.Sprintf("load cache for %s: %v", cleanSlug, refreshErr)
return result, fmt.Errorf("load cache for %s: %w", cleanSlug, refreshErr)
}
meta = refreshed
result.CacheKey = meta.CacheKey
// Read from the newly refreshed cache
archive, archiveReadErr = os.ReadFile(meta.ArchivePath)
if archiveReadErr != nil {
_ = s.rollback(backupPath)
return result, fmt.Errorf("read archive after refresh: %w", archiveReadErr)
}
}
// ✅ Use pre-loaded archive bytes (no file read here)
if err := s.extractTarGz(applyCtx, archive, s.DataDir); err != nil {
_ = s.rollback(backupPath)
return result, fmt.Errorf("extract: %w", err)
}
result.Status = "applied"
result.ReloadHint = true
result.UsedCSCLI = false
return result, nil
}
Change 2: Add Regression Test
File: backend/internal/crowdsec/hub_pull_apply_test.go
New test:
func TestApplyReadsArchiveBeforeBackup(t *testing.T) {
// This test verifies the fix for the bug where Apply() would:
// 1. Load cache metadata (getting archive path)
// 2. Backup DataDir (moving the cache!)
// 3. Try to read archive from original path (FAIL!)
baseDir := t.TempDir()
dataDir := filepath.Join(baseDir, "crowdsec")
cacheDir := filepath.Join(dataDir, "hub_cache")
// Create cache
cache, err := NewHubCache(cacheDir, time.Hour)
require.NoError(t, err)
// Create a mock hub server
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if strings.Contains(r.URL.Path, ".tgz") {
// Return a valid tar.gz
var buf bytes.Buffer
gw := gzip.NewWriter(&buf)
tw := tar.NewWriter(gw)
content := []byte("test: value\n")
tw.WriteHeader(&tar.Header{Name: "test.yaml", Size: int64(len(content)), Mode: 0644})
tw.Write(content)
tw.Close()
gw.Close()
w.Write(buf.Bytes())
return
}
if strings.Contains(r.URL.Path, ".yaml") {
w.Write([]byte("preview: content"))
return
}
// Index
w.Write([]byte(`{"items":[{"name":"test/preset","version":"1.0"}]}`))
}))
defer server.Close()
hub := &HubService{
Cache: cache,
DataDir: dataDir,
HTTPClient: server.Client(),
HubBaseURL: server.URL,
MirrorBaseURL: server.URL,
PullTimeout: 10 * time.Second,
ApplyTimeout: 10 * time.Second,
}
ctx := context.Background()
// Pull to populate cache
_, err = hub.Pull(ctx, "test/preset")
require.NoError(t, err, "pull should succeed")
// Verify cache exists
_, err = cache.Load(ctx, "test/preset")
require.NoError(t, err, "cache should exist after pull")
// Add some extra files to DataDir to make backup more realistic
require.NoError(t, os.WriteFile(filepath.Join(dataDir, "config.yaml"), []byte("test: config"), 0644))
// Apply - this should NOT fail with "read archive: no such file"
result, err := hub.Apply(ctx, "test/preset")
require.NoError(t, err, "apply should succeed - archive should be read before backup")
assert.Equal(t, "applied", result.Status)
assert.NotEmpty(t, result.BackupPath)
// Verify backup was created
_, err = os.Stat(result.BackupPath)
assert.NoError(t, err, "backup should exist")
}
Edge Cases to Consider
| Scenario | Current Behavior | Fixed Behavior |
|---|---|---|
| First-time apply (no cache) | Fails with cache miss | Attempts refresh, same behavior |
| cscli available and works | Returns early, never hits bug | Same - returns early |
| cscli fails, cache exists | FAILS - archive moved | Succeeds - archive pre-loaded |
| Archive file corrupted | Fails on read | Same - fails on read, but before backup |
| Network down during refresh | Fails | Same - fails with clear error |
| Large archive (>25MB) | Limited by maxArchiveSize | Same - memory is fine for 25MB |
| Concurrent applies | Potential race | Still potential race (separate issue) |
Testing Plan
1. Unit Tests
   - `TestApplyReadsArchiveBeforeBackup`: new regression test
   - Existing `TestPullThenApplyFlow` should still pass
   - `TestApplyWithoutPullFails` should still pass
2. Integration Tests
   - Manual test in a Docker container
   - Pull preset via UI
   - Apply preset via UI
   - Verify no "read archive" error
3. Edge Case Tests
   - Apply with expired cache (should refresh)
   - Apply with network failure (should error gracefully)
   - Apply with cscli available (should use the cscli path)
Rollout Plan
1. Implement fix in `hub_sync.go`
2. Add regression test in `hub_pull_apply_test.go`
3. Run the full test suite: `go test ./...`
4. Run pre-commit: `pre-commit run --all-files`
5. Build and test locally: `docker build -t charon:local .`
6. Manual verification in the container
7. Commit with: `fix: read archive before backup in CrowdSec preset apply`
Related Files Reference
| File | Purpose |
|---|---|
| hub_sync.go | HubService.Apply() - main fix location |
| hub_cache.go | Cache storage, stores ArchivePath |
| crowdsec_handler.go | HTTP handler, initializes cache |
| routes.go | Sets crowdsecDataDir from config |
| config.go | CrowdSecConfigDir default |
Summary
Root Cause: The Apply() function backs up the entire DataDir (which includes the cache) before reading the cached archive, resulting in a "file not found" error.
Fix: Read the archive into memory before creating the backup.
Impact: Low risk - the fix only changes the order of operations and doesn't affect the backup or extraction logic.
Effort: ~30 minutes implementation + testing

Cerberus / CrowdSec Toggle Issues - Fix Plan

| # | Issue | Severity |
|---|---|---|
| 1 | Cerberus shows ON by default on first load (should be OFF) | High |
| 2 | Cerberus dashboard header shows "disabled" even when enabled | Medium |
| 3 | CrowdSec toggle auto-enables when Cerberus is enabled | Medium |
| 4 | CrowdSec toggle unresponsive + Config button grayed out | High |
Root Cause Analysis
Issue 1: Cerberus Shows ON by Default
Root Cause: The feature_flags_handler.go has a default value of true for all feature flags including feature.cerberus.enabled.
File: backend/internal/api/handlers/feature_flags_handler.go#L39-L42
// Line 39-42
for _, key := range defaultFlags {
    defaultVal := true // <-- THIS IS THE BUG
    if v, ok := defaultFlagValues[key]; ok {
        defaultVal = v
    }
    // ...
}
Problem: The code sets defaultVal := true for all flags, then only overrides it if the key exists in defaultFlagValues. However, feature.cerberus.enabled is NOT in defaultFlagValues:
// Line 29-31
var defaultFlagValues = map[string]bool{
"feature.crowdsec.console_enrollment": false,
}
Result: On first load with an empty database, feature.cerberus.enabled defaults to true instead of false.
Additional Context:
- `backend/internal/config/config.go#L60` correctly defaults `CerberusEnabled` to `false`: `CerberusEnabled: getEnvAny("false", "CERBERUS_SECURITY_CERBERUS_ENABLED", ...) == "true"`
- However, the feature flags handler ignores this config and uses its own default.
Issue 2: Dashboard Header Shows "Disabled" Even When Enabled
Root Cause: The header banner logic in Security.tsx checks status.cerberus?.enabled which comes from the security status API, but there's a data source mismatch.
Files:
- frontend/src/pages/Security.tsx#L141-L153 - Header banner logic
- backend/internal/api/handlers/security_handler.go#L35-L49 - Security status API
Problem Flow:
1. Security.tsx checks `status.cerberus?.enabled` from `/api/v1/security/status`
2. security_handler.go reads from config AND the settings table:

   // Line 36-48
   enabled := h.cfg.CerberusEnabled
   var settingKey = "security.cerberus.enabled" // <-- WRONG KEY!
   if h.db != nil {
       var setting struct{ Value string }
       if err := h.db.Raw("SELECT value FROM settings WHERE key = ? LIMIT 1", settingKey).Scan(&setting).Error; ...

3. SystemSettings.tsx toggles `feature.cerberus.enabled` (via the feature flags API)
The Mismatch:
| Component | Key Used |
|---|---|
| SystemSettings toggle | feature.cerberus.enabled |
| Security status API | security.cerberus.enabled |
The toggle writes to feature.cerberus.enabled but the security status reads from security.cerberus.enabled - two different keys!
Issue 3: CrowdSec Auto-Enables When Cerberus is Enabled
Root Cause: The docker-compose.override.yml and docker-compose.local.yml both set CHARON_SECURITY_CROWDSEC_MODE=local:
File: docker-compose.override.yml#L21
- CHARON_SECURITY_CROWDSEC_MODE=local
Problem: When the container starts:
1. Config loads with `CrowdSecMode: "local"` from the env var
2. The security status API returns `crowdsec.enabled: true` because the mode is "local"
3. The frontend shows CrowdSec as enabled
File: backend/internal/api/handlers/security_handler.go#L59-L62
// Allow runtime override for CrowdSec enabled flag via settings table
crowdsecEnabled := mode == "local" // <-- Auto-true if mode is "local"
Issue 4: CrowdSec Toggle Unresponsive + Config Button Grayed Out
Root Cause: Multiple issues combine to break the toggle:
A. Toggle Disabled Logic:
File: frontend/src/pages/Security.tsx#L127
const crowdsecToggleDisabled = cerberusDisabled || crowdsecPowerMutation.isPending
File: frontend/src/pages/Security.tsx#L126
const cerberusDisabled = !status.cerberus?.enabled
Since status.cerberus?.enabled is false due to Issue 2 (wrong settings key), cerberusDisabled is true, making the toggle disabled.
B. Config Button Disabled:
File: frontend/src/pages/Security.tsx#L128
const crowdsecControlsDisabled = cerberusDisabled || crowdsecPowerMutation.isPending
Same logic - the controls are disabled because Cerberus appears disabled.
C. Switch Component Event Handling:
File: frontend/src/components/ui/Switch.tsx#L17-L20
The Switch component passes disabled to the native checkbox input, which prevents click events. This is correct behavior - the issue is the disabled prop is incorrectly true.
Recommended Fixes
Fix 1: Update Feature Flag Defaults
File: backend/internal/api/handlers/feature_flags_handler.go
// Change defaultFlagValues to include cerberus.enabled as false
var defaultFlagValues = map[string]bool{
"feature.cerberus.enabled": false, // ADD THIS
"feature.crowdsec.console_enrollment": false,
"feature.uptime.enabled": true, // Uptime can default ON
}
Fix 2: Align Settings Keys
Option A (Recommended): Update security_handler.go to read from feature flags key
File: backend/internal/api/handlers/security_handler.go
// Line 37: Change from
var settingKey = "security.cerberus.enabled"
// To
var settingKey = "feature.cerberus.enabled"
Option B: Create a sync mechanism between feature flags and security settings
Fix 3: Remove CrowdSec Mode Override from Docker Compose
Files:
- `docker-compose.override.yml`
- `docker-compose.local.yml`
# Remove or comment out:
# - CHARON_SECURITY_CROWDSEC_MODE=local
# Or change to:
- CHARON_SECURITY_CROWDSEC_MODE=disabled
Fix 4: No Additional Fix Needed
Issue 4 is a symptom of Issues 1-2. Once those are fixed:
- `cerberusDisabled` will be `false` when Cerberus is enabled
- `crowdsecToggleDisabled` will be `false`
- `crowdsecControlsDisabled` will be `false`
- The toggle and Config button will be interactive
Test Scenarios
Test 1: Fresh Install Default State
Given: Clean database, no env vars set
When: User loads the Settings > System page
Then: Cerberus toggle should be OFF
And: /api/v1/feature-flags returns { "feature.cerberus.enabled": false }
Test 2: Cerberus Toggle Sync
Given: User is on Settings > System page
When: User enables Cerberus toggle
Then: /api/v1/security/status returns { "cerberus": { "enabled": true } }
And: Security dashboard header banner is NOT displayed
Test 3: CrowdSec Toggle Interaction
Given: Cerberus is enabled
And: User is on Security dashboard
When: User clicks CrowdSec toggle
Then: Toggle should respond to click
And: CrowdSec enabled state should change
And: Toast notification should appear
Test 4: CrowdSec Config Button
Given: Cerberus is enabled
And: User is on Security dashboard
When: User clicks CrowdSec "Config" button
Then: User should navigate to /security/crowdsec
And: Button should NOT be grayed out
Test 5: Environment Variable Override
Given: CERBERUS_SECURITY_CERBERUS_ENABLED=true set
When: User loads Settings > System (fresh DB)
Then: Cerberus toggle should be ON (env override)
Implementation Priority
| Priority | Fix | Effort | Impact |
|---|---|---|---|
| P0 | Fix 2 (Key alignment) | Low | High - Fixes Issues 2, 4 |
| P1 | Fix 1 (Default values) | Low | High - Fixes Issue 1 |
| P2 | Fix 3 (Docker compose) | Low | Medium - Fixes Issue 3 |
Files to Modify
- backend/internal/api/handlers/feature_flags_handler.go: add a default value for cerberus
- backend/internal/api/handlers/security_handler.go: change the settings key to `feature.cerberus.enabled`
- docker-compose.override.yml: remove or change the CrowdSec mode
- docker-compose.local.yml: remove or change the CrowdSec mode
Additional Observations
1. Dual Control Systems: there are two overlapping control systems:
   - Feature flags (`feature.cerberus.enabled`): toggled in SystemSettings.tsx
   - Security config (`SecurityConfig.Enabled` in the DB): used by the Enable/Disable endpoints

   Consider consolidating to one source of truth.
2. Config vs Settings: the `config.SecurityConfig` struct loaded from env vars is separate from the DB-backed `SecurityConfig` model. This creates confusion about which takes precedence.
3. No Migration: when updating default values, existing users may need a migration or reset to see the new defaults.
Code Reference Summary
| File | Line | Purpose |
|---|---|---|
| feature_flags_handler.go | L29-31 | Missing cerberus default |
| feature_flags_handler.go | L39 | `defaultVal := true` bug |
| security_handler.go | L37 | Wrong settings key |
| Security.tsx | L126-128 | Disabled state logic |
| SystemSettings.tsx | L99-105 | Feature toggle UI |
| docker-compose.override.yml | L21 | CrowdSec mode env var |
| config.go | L60 | Correct cerberus default |