Charon/docs/plans/current_spec.md

CrowdSec Preset Apply Failure - Fix Plan

Date: December 12, 2025
Status: Analysis Complete - Ready for Implementation
Severity: High


Issue Summary

User reported error when applying a CrowdSec preset:

Apply failed: read archive: open /app/data/crowdsec/hub_cache/crowdsecurity/caddy/bundle.tgz: no such file or directory. Backup created at /app/data/crowdsec.backup.20251211-194408

Root Cause Analysis

The Bug

The Apply() function in hub_sync.go has an ordering bug: it renames the cache directory away (as part of the backup) before reading the archive from it.

Detailed Flow

  1. Pull Phase (Works Correctly)

    • User pulls preset crowdsecurity/caddy
    • HubCache.Store() writes to: /app/data/crowdsec/hub_cache/crowdsecurity/caddy/bundle.tgz
    • CachedPreset.ArchivePath stores this absolute path
  2. Apply Phase (Bug Occurs)

    Step 1: loadCacheMeta() → Returns meta.ArchivePath = "/app/data/crowdsec/hub_cache/.../bundle.tgz"
    Step 2: backupExisting() → RENAMES "/app/data/crowdsec" to "/app/data/crowdsec.backup.TIMESTAMP"
            ⚠️ THIS MOVES THE CACHE TOO! hub_cache is INSIDE crowdsec/
    Step 3: cscli fails (not available or preset not in hub)
    Step 4: os.ReadFile(meta.ArchivePath) → FILE NOT FOUND!
            The path still points to "/app/data/crowdsec/..." but that directory was renamed!
    

Visual Representation

Before Backup:

/app/data/crowdsec/
├── hub_cache/
│   └── crowdsecurity/
│       └── caddy/
│           ├── bundle.tgz      ← meta.ArchivePath points here
│           ├── preview.yaml
│           └── metadata.json
├── config.yaml
└── other_files/

After backupExisting() (line 535):

/app/data/crowdsec.backup.20251211-194408/  ← Renamed!
├── hub_cache/
│   └── crowdsecurity/
│       └── caddy/
│           ├── bundle.tgz      ← File is now HERE
│           ├── preview.yaml
│           └── metadata.json
├── config.yaml
└── other_files/

/app/data/crowdsec/                         ← Directory no longer exists!

Result: os.ReadFile(meta.ArchivePath) fails because the path /app/data/crowdsec/hub_cache/.../bundle.tgz no longer exists.


Why This Wasn't Caught Earlier

  1. Tests use temp directories - Each test creates fresh directories, so the ordering bug never manifests
  2. cscli path succeeds in CI - When cscli is available and works, the code returns early before hitting the bug
  3. Recent changes to backup logic - The copy-based fallback and backup improvements may have introduced this ordering issue
  4. Cache directory nested inside DataDir - The architecture decision to put hub_cache inside DataDir (crowdsec config) creates this coupling

Fix Options

Option A: Read Archive Before Backup (Recommended)

Rationale: Simple, minimal change, maintains existing backup behavior.

File: backend/internal/crowdsec/hub_sync.go

Changes:

func (s *HubService) Apply(ctx context.Context, slug string) (ApplyResult, error) {
    // ... existing validation code ...

    result := ApplyResult{AppliedPreset: cleanSlug, Status: "failed"}
    meta, metaErr := s.loadCacheMeta(applyCtx, cleanSlug)
    if metaErr == nil {
        result.CacheKey = meta.CacheKey
    }
    hasCS := s.hasCSCLI(applyCtx)

    // === NEW: Read archive BEFORE backup ===
    var archive []byte
    var archiveErr error
    if metaErr == nil {
        archive, archiveErr = os.ReadFile(meta.ArchivePath)
        if archiveErr != nil {
            logger.Log().WithError(archiveErr).WithField("archive_path", meta.ArchivePath).Warn("failed to read cached archive before backup")
        }
    }
    // === END NEW ===

    backupPath := filepath.Clean(s.DataDir) + ".backup." + time.Now().Format("20060102-150405")
    if err := s.backupExisting(backupPath); err != nil {
        return result, fmt.Errorf("backup: %w", err)
    }
    result.BackupPath = backupPath

    // Try cscli first
    if hasCS {
        cscliErr := s.runCSCLI(applyCtx, cleanSlug)
        if cscliErr == nil {
            result.Status = "applied"
            result.ReloadHint = true
            result.UsedCSCLI = true
            return result, nil
        }
        logger.Log().WithField("slug", cleanSlug).WithError(cscliErr).Warn("cscli install failed; attempting cache fallback")
    }

    // === MODIFIED: Use pre-loaded archive or refresh ===
    if metaErr != nil || archiveErr != nil {
        refreshed, refreshErr := s.refreshCache(applyCtx, cleanSlug, metaErr)
        if refreshErr != nil {
            _ = s.rollback(backupPath)
            return result, fmt.Errorf("load cache for %s: %w", cleanSlug, refreshErr)
        }
        meta = refreshed
        result.CacheKey = meta.CacheKey
        // Re-read archive from refreshed cache location
        archive, archiveErr = os.ReadFile(meta.ArchivePath)
        if archiveErr != nil {
            _ = s.rollback(backupPath)
            return result, fmt.Errorf("read archive: %w", archiveErr)
        }
    }

    // Use the pre-loaded archive bytes
    if err := s.extractTarGz(applyCtx, archive, s.DataDir); err != nil {
        _ = s.rollback(backupPath)
        return result, fmt.Errorf("extract: %w", err)
    }
    // === END MODIFIED ===

    result.Status = "applied"
    result.ReloadHint = true
    result.UsedCSCLI = false
    return result, nil
}

Option B: Move Cache Outside DataDir

Rationale: Architectural fix - separates transient cache from operational config.

Changes (in the handler that initializes the cache):

// In NewCrowdsecHandler:
// BEFORE:
cacheDir := filepath.Join(dataDir, "hub_cache")

// AFTER:
cacheDir := filepath.Join(filepath.Dir(dataDir), "hub_cache")
// Results in: /app/data/hub_cache (sibling of crowdsec, not child)

Pros: Clean separation, cache survives config resets Cons: Breaking change for existing installs, requires migration

Option C: Selective Backup (Exclude Cache)

Rationale: Only backup config files, not cache.

Changes to backupExisting():

func (s *HubService) backupExisting(backupPath string) error {
    // ... existing checks ...

    // Skip hub_cache during backup - it's transient
    return filepath.WalkDir(s.DataDir, func(path string, d fs.DirEntry, err error) error {
        if err != nil {
            return err
        }
        // SkipDir is only meaningful on the directory entry itself;
        // matching on the full path would also skip unrelated siblings
        if d.IsDir() && d.Name() == "hub_cache" {
            return filepath.SkipDir
        }
        // ... copy logic ...
    })
}

Pros: Faster backups, cache preserved Cons: More complex, backup is no longer complete snapshot


Recommendation

Choose Option A for these reasons:

  1. Minimal code change - Single function modification
  2. No breaking changes - Existing cache paths remain valid
  3. No migration needed - Works immediately
  4. Maintains complete backups - Backup still captures full state
  5. Easy to test - Clear before/after behavior

Files to Modify

| File | Change |
|------|--------|
| backend/internal/crowdsec/hub_sync.go | Reorder archive read before backup in Apply() |
| backend/internal/crowdsec/hub_sync_test.go | Add test for apply with backup scenario |
| backend/internal/crowdsec/hub_pull_apply_test.go | Add regression test |

Specific Code Changes

Change 1: hub_sync.go - Apply() Function

Location: Lines 514-580

Before:

func (s *HubService) Apply(ctx context.Context, slug string) (ApplyResult, error) {
    cleanSlug := sanitizeSlug(slug)
    // ... validation ...

    result := ApplyResult{AppliedPreset: cleanSlug, Status: "failed"}
    meta, metaErr := s.loadCacheMeta(applyCtx, cleanSlug)
    if metaErr == nil {
        result.CacheKey = meta.CacheKey
    }
    hasCS := s.hasCSCLI(applyCtx)

    backupPath := filepath.Clean(s.DataDir) + ".backup." + time.Now().Format("20060102-150405")
    if err := s.backupExisting(backupPath); err != nil {
        return result, fmt.Errorf("backup: %w", err)
    }
    result.BackupPath = backupPath

    // Try cscli first
    if hasCS {
        // ... cscli logic ...
    }

    if metaErr != nil {
        // ... refresh cache logic ...
    }

    archive, err := os.ReadFile(meta.ArchivePath)  // ❌ FAILS - file moved by backup!
    if err != nil {
        _ = s.rollback(backupPath)
        return result, fmt.Errorf("read archive: %w", err)
    }
    // ...
}

After:

func (s *HubService) Apply(ctx context.Context, slug string) (ApplyResult, error) {
    cleanSlug := sanitizeSlug(slug)
    // ... validation ...

    result := ApplyResult{AppliedPreset: cleanSlug, Status: "failed"}
    meta, metaErr := s.loadCacheMeta(applyCtx, cleanSlug)
    if metaErr == nil {
        result.CacheKey = meta.CacheKey
    }
    hasCS := s.hasCSCLI(applyCtx)

    // ✅ NEW: Read archive into memory BEFORE backup moves the files
    var archive []byte
    var archiveReadErr error
    if metaErr == nil {
        archive, archiveReadErr = os.ReadFile(meta.ArchivePath)
        if archiveReadErr != nil {
            logger.Log().WithError(archiveReadErr).WithField("archive_path", meta.ArchivePath).
                Warn("failed to read cached archive before backup")
        }
    }

    backupPath := filepath.Clean(s.DataDir) + ".backup." + time.Now().Format("20060102-150405")
    if err := s.backupExisting(backupPath); err != nil {
        return result, fmt.Errorf("backup: %w", err)
    }
    result.BackupPath = backupPath

    // Try cscli first
    if hasCS {
        cscliErr := s.runCSCLI(applyCtx, cleanSlug)
        if cscliErr == nil {
            result.Status = "applied"
            result.ReloadHint = true
            result.UsedCSCLI = true
            return result, nil
        }
        logger.Log().WithField("slug", cleanSlug).WithError(cscliErr).
            Warn("cscli install failed; attempting cache fallback")
    }

    // ✅ MODIFIED: Handle cache miss OR failed archive read
    if metaErr != nil || archiveReadErr != nil {
        // Need to refresh cache (either wasn't cached or file was unreadable)
        originalErr := metaErr
        if originalErr == nil {
            originalErr = archiveReadErr
        }
        refreshed, refreshErr := s.refreshCache(applyCtx, cleanSlug, originalErr)
        if refreshErr != nil {
            _ = s.rollback(backupPath)
            logger.Log().WithError(refreshErr).WithField("slug", cleanSlug).
                WithField("backup_path", backupPath).
                Warn("cache refresh failed; rolled back backup")
            result.ErrorMessage = fmt.Sprintf("load cache for %s: %v", cleanSlug, refreshErr)
            return result, fmt.Errorf("load cache for %s: %w", cleanSlug, refreshErr)
        }
        meta = refreshed
        result.CacheKey = meta.CacheKey

        // Read from the newly refreshed cache
        archive, archiveReadErr = os.ReadFile(meta.ArchivePath)
        if archiveReadErr != nil {
            _ = s.rollback(backupPath)
            return result, fmt.Errorf("read archive after refresh: %w", archiveReadErr)
        }
    }

    // ✅ Use pre-loaded archive bytes (no file read here)
    if err := s.extractTarGz(applyCtx, archive, s.DataDir); err != nil {
        _ = s.rollback(backupPath)
        return result, fmt.Errorf("extract: %w", err)
    }

    result.Status = "applied"
    result.ReloadHint = true
    result.UsedCSCLI = false
    return result, nil
}

Change 2: Add Regression Test

File: backend/internal/crowdsec/hub_pull_apply_test.go

New test:

func TestApplyReadsArchiveBeforeBackup(t *testing.T) {
    // This test verifies the fix for the bug where Apply() would:
    // 1. Load cache metadata (getting archive path)
    // 2. Backup DataDir (moving the cache!)
    // 3. Try to read archive from original path (FAIL!)

    baseDir := t.TempDir()
    dataDir := filepath.Join(baseDir, "crowdsec")
    cacheDir := filepath.Join(dataDir, "hub_cache")

    // Create cache
    cache, err := NewHubCache(cacheDir, time.Hour)
    require.NoError(t, err)

    // Create a mock hub server
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if strings.Contains(r.URL.Path, ".tgz") {
            // Return a valid tar.gz
            var buf bytes.Buffer
            gw := gzip.NewWriter(&buf)
            tw := tar.NewWriter(gw)
            content := []byte("test: value\n")
            tw.WriteHeader(&tar.Header{Name: "test.yaml", Size: int64(len(content)), Mode: 0644})
            tw.Write(content)
            tw.Close()
            gw.Close()
            w.Write(buf.Bytes())
            return
        }
        if strings.Contains(r.URL.Path, ".yaml") {
            w.Write([]byte("preview: content"))
            return
        }
        // Index
        w.Write([]byte(`{"items":[{"name":"test/preset","version":"1.0"}]}`))
    }))
    defer server.Close()

    hub := &HubService{
        Cache:         cache,
        DataDir:       dataDir,
        HTTPClient:    server.Client(),
        HubBaseURL:    server.URL,
        MirrorBaseURL: server.URL,
        PullTimeout:   10 * time.Second,
        ApplyTimeout:  10 * time.Second,
    }

    ctx := context.Background()

    // Pull to populate cache
    _, err = hub.Pull(ctx, "test/preset")
    require.NoError(t, err, "pull should succeed")

    // Verify cache exists
    _, err = cache.Load(ctx, "test/preset")
    require.NoError(t, err, "cache should exist after pull")

    // Add some extra files to DataDir to make backup more realistic
    require.NoError(t, os.WriteFile(filepath.Join(dataDir, "config.yaml"), []byte("test: config"), 0644))

    // Apply - this should NOT fail with "read archive: no such file"
    result, err := hub.Apply(ctx, "test/preset")
    require.NoError(t, err, "apply should succeed - archive should be read before backup")
    assert.Equal(t, "applied", result.Status)
    assert.NotEmpty(t, result.BackupPath)

    // Verify backup was created
    _, err = os.Stat(result.BackupPath)
    assert.NoError(t, err, "backup should exist")
}

Edge Cases to Consider

| Scenario | Current Behavior | Fixed Behavior |
|----------|------------------|----------------|
| First-time apply (no cache) | Fails with cache miss | Attempts refresh, same behavior |
| cscli available and works | Returns early, never hits bug | Same - returns early |
| cscli fails, cache exists | FAILS - archive moved | Succeeds - archive pre-loaded |
| Archive file corrupted | Fails on read | Same - fails on read, but before backup |
| Network down during refresh | Fails | Same - fails with clear error |
| Large archive (>25MB) | Limited by maxArchiveSize | Same - memory is fine for 25MB |
| Concurrent applies | Potential race | Still potential race (separate issue) |

Testing Plan

  1. Unit Tests

    • TestApplyReadsArchiveBeforeBackup - New regression test
    • Existing TestPullThenApplyFlow should still pass
    • TestApplyWithoutPullFails should still pass
  2. Integration Tests

    • Manual test in Docker container
    • Pull preset via UI
    • Apply preset via UI
    • Verify no "read archive" error
  3. Edge Case Tests

    • Apply with expired cache (should refresh)
    • Apply with network failure (should error gracefully)
    • Apply with cscli available (should use cscli path)

Rollout Plan

  1. Implement fix in hub_sync.go
  2. Add regression test in hub_pull_apply_test.go
  3. Run full test suite: go test ./...
  4. Run pre-commit: pre-commit run --all-files
  5. Build and test locally: docker build -t charon:local .
  6. Manual verification in container
  7. Commit with: fix: read archive before backup in CrowdSec preset apply

| File | Purpose |
|------|---------|
| hub_sync.go | HubService.Apply() - main fix location |
| hub_cache.go | Cache storage, stores ArchivePath |
| crowdsec_handler.go | HTTP handler, initializes cache |
| routes.go | Sets crowdsecDataDir from config |
| config.go | CrowdSecConfigDir default |

Summary

Root Cause: The Apply() function backs up the entire DataDir (which includes the cache) before reading the cached archive, resulting in a "file not found" error.

Fix: Read the archive into memory before creating the backup.

Impact: Low risk - the fix only changes the order of operations and doesn't affect the backup or extraction logic.

Effort: ~30 minutes implementation + testing

Cerberus / CrowdSec UI Issues

| # | Issue | Severity |
|---|-------|----------|
| 1 | Cerberus shows ON by default on first load (should be OFF) | High |
| 2 | Cerberus dashboard header shows "disabled" even when enabled | Medium |
| 3 | CrowdSec toggle auto-enables when Cerberus is enabled | Medium |
| 4 | CrowdSec toggle unresponsive + Config button grayed out | High |


Root Cause Analysis

Issue 1: Cerberus Shows ON by Default

Root Cause: The feature_flags_handler.go has a default value of true for all feature flags including feature.cerberus.enabled.

File: backend/internal/api/handlers/feature_flags_handler.go#L39-L42

// Line 39-42
for _, key := range defaultFlags {
    defaultVal := true  // <-- THIS IS THE BUG
    if v, ok := defaultFlagValues[key]; ok {
        defaultVal = v
    }

Problem: The code sets defaultVal := true for all flags, then only overrides it if the key exists in defaultFlagValues. However, feature.cerberus.enabled is NOT in defaultFlagValues:

// Line 29-31
var defaultFlagValues = map[string]bool{
    "feature.crowdsec.console_enrollment": false,
}

Result: On first load with an empty database, feature.cerberus.enabled defaults to true instead of false.

Additional Context:

  • The backend/internal/config/config.go#L60 correctly defaults CerberusEnabled to false:
    CerberusEnabled: getEnvAny("false", "CERBERUS_SECURITY_CERBERUS_ENABLED", ...) == "true"
    
  • However, the feature flags handler ignores this config and uses its own default.

Issue 2: Dashboard Header Shows "Disabled" Even When Enabled

Root Cause: The header banner logic in Security.tsx checks status.cerberus?.enabled which comes from the security status API, but there's a data source mismatch.

Files:

Problem Flow:

  1. Security.tsx checks status.cerberus?.enabled from /api/v1/security/status
  2. security_handler.go reads from config AND settings table:
    // Line 36-48
    enabled := h.cfg.CerberusEnabled
    var settingKey = "security.cerberus.enabled"  // <-- WRONG KEY!
    if h.db != nil {
        var setting struct{ Value string }
        if err := h.db.Raw("SELECT value FROM settings WHERE key = ? LIMIT 1", settingKey).Scan(&setting).Error; ...
    
  3. SystemSettings.tsx toggles feature.cerberus.enabled (via feature flags API)

The Mismatch:

| Component | Key Used |
|-----------|----------|
| SystemSettings toggle | feature.cerberus.enabled |
| Security status API | security.cerberus.enabled |

The toggle writes to feature.cerberus.enabled but the security status reads from security.cerberus.enabled - two different keys!


Issue 3: CrowdSec Auto-Enables When Cerberus is Enabled

Root Cause: The docker-compose.override.yml and docker-compose.local.yml both set CHARON_SECURITY_CROWDSEC_MODE=local:

File: docker-compose.override.yml#L21

- CHARON_SECURITY_CROWDSEC_MODE=local

Problem: When the container starts:

  1. Config loads with CrowdSecMode: "local" from env var
  2. Security status API returns crowdsec.enabled: true because mode is "local"
  3. Frontend shows CrowdSec as enabled

File: backend/internal/api/handlers/security_handler.go#L59-L62

// Allow runtime override for CrowdSec enabled flag via settings table
crowdsecEnabled := mode == "local"  // <-- Auto-true if mode is "local"

Issue 4: CrowdSec Toggle Unresponsive + Config Button Grayed Out

Root Cause: Multiple issues combine to break the toggle:

A. Toggle Disabled Logic:

File: frontend/src/pages/Security.tsx#L127

const crowdsecToggleDisabled = cerberusDisabled || crowdsecPowerMutation.isPending

File: frontend/src/pages/Security.tsx#L126

const cerberusDisabled = !status.cerberus?.enabled

Since status.cerberus?.enabled is false due to Issue 2 (wrong settings key), cerberusDisabled is true, making the toggle disabled.

B. Config Button Disabled:

File: frontend/src/pages/Security.tsx#L128

const crowdsecControlsDisabled = cerberusDisabled || crowdsecPowerMutation.isPending

Same logic - the controls are disabled because Cerberus appears disabled.

C. Switch Component Event Handling:

File: frontend/src/components/ui/Switch.tsx#L17-L20

The Switch component passes disabled to the native checkbox input, which prevents click events. This is correct behavior - the issue is the disabled prop is incorrectly true.


Fix 1: Update Feature Flag Defaults

File: backend/internal/api/handlers/feature_flags_handler.go

// Change defaultFlagValues to include cerberus.enabled as false
var defaultFlagValues = map[string]bool{
    "feature.cerberus.enabled":            false, // ADD THIS
    "feature.crowdsec.console_enrollment": false,
    "feature.uptime.enabled":              true,  // Uptime can default ON
}

Fix 2: Align Settings Keys

Option A (Recommended): Update security_handler.go to read from feature flags key

File: backend/internal/api/handlers/security_handler.go

// Line 37: Change from
var settingKey = "security.cerberus.enabled"
// To
var settingKey = "feature.cerberus.enabled"

Option B: Create a sync mechanism between feature flags and security settings

Fix 3: Remove CrowdSec Mode Override from Docker Compose

Files:

  • docker-compose.override.yml
  • docker-compose.local.yml
# Remove or comment out:
# - CHARON_SECURITY_CROWDSEC_MODE=local
# Or change to:
- CHARON_SECURITY_CROWDSEC_MODE=disabled

Fix 4: No Additional Fix Needed

Issue 4 is a symptom of Issues 1-2. Once those are fixed:

  • cerberusDisabled will be false when Cerberus is enabled
  • crowdsecToggleDisabled will be false
  • crowdsecControlsDisabled will be false
  • Toggle and Config button will be interactive

Test Scenarios

Test 1: Fresh Install Default State

Given: Clean database, no env vars set
When: User loads the Settings > System page
Then: Cerberus toggle should be OFF
And: /api/v1/feature-flags returns { "feature.cerberus.enabled": false }

Test 2: Cerberus Toggle Sync

Given: User is on Settings > System page
When: User enables Cerberus toggle
Then: /api/v1/security/status returns { "cerberus": { "enabled": true } }
And: Security dashboard header banner is NOT displayed

Test 3: CrowdSec Toggle Interaction

Given: Cerberus is enabled
And: User is on Security dashboard
When: User clicks CrowdSec toggle
Then: Toggle should respond to click
And: CrowdSec enabled state should change
And: Toast notification should appear

Test 4: CrowdSec Config Button

Given: Cerberus is enabled
And: User is on Security dashboard
When: User clicks CrowdSec "Config" button
Then: User should navigate to /security/crowdsec
And: Button should NOT be grayed out

Test 5: Environment Variable Override

Given: CERBERUS_SECURITY_CERBERUS_ENABLED=true set
When: User loads Settings > System (fresh DB)
Then: Cerberus toggle should be ON (env override)

Implementation Priority

| Priority | Fix | Effort | Impact |
|----------|-----|--------|--------|
| P0 | Fix 2 (Key alignment) | Low | High - Fixes Issues 2, 4 |
| P1 | Fix 1 (Default values) | Low | High - Fixes Issue 1 |
| P2 | Fix 3 (Docker compose) | Low | Medium - Fixes Issue 3 |

Files to Modify

  1. backend/internal/api/handlers/feature_flags_handler.go - Add default value for cerberus
  2. backend/internal/api/handlers/security_handler.go - Change settings key to feature.cerberus.enabled
  3. docker-compose.override.yml - Remove or change CrowdSec mode
  4. docker-compose.local.yml - Remove or change CrowdSec mode

Additional Observations

  1. Dual Control Systems: There are two overlapping control systems:

    • Feature flags (feature.cerberus.enabled) - toggled in SystemSettings.tsx
    • Security config (SecurityConfig.Enabled in DB) - used by Enable/Disable endpoints

    Consider consolidating to one source of truth.

  2. Config vs Settings: The config.SecurityConfig struct loaded from env vars is separate from DB-backed SecurityConfig model. This creates confusion about which takes precedence.

  3. No Migration: When updating default values, existing users may need a migration or reset to see the new defaults.


Code Reference Summary

| File | Line | Purpose |
|------|------|---------|
| feature_flags_handler.go | L29-31 | Missing cerberus default |
| feature_flags_handler.go | L39 | defaultVal := true bug |
| security_handler.go | L37 | Wrong settings key |
| Security.tsx | L126-128 | Disabled state logic |
| SystemSettings.tsx | L99-105 | Feature toggle UI |
| docker-compose.override.yml | L21 | CrowdSec mode env var |
| config.go | L60 | Correct cerberus default |