CrowdSec Toggle Integration Fix Plan

Date: December 15, 2025
Issue: CrowdSec toggle stuck ON, reconciliation silently exits, process not starting
Root Cause: Database disconnect between frontend (Settings table) and reconciliation (SecurityConfig table)

Executive Summary

The CrowdSec toggle shows "ON" but the process is NOT running. The reconciliation function silently exits without starting CrowdSec because:

Frontend writes to Settings table (security.crowdsec.enabled)
Backend reconciliation reads from SecurityConfig table (crowdsec_mode = "local")
No synchronization between the two tables
Auto-initialization code EXISTS (lines 46-71 in crowdsec_startup.go) but creates config with crowdsec_mode = "disabled"
Reconciliation sees "disabled" and exits silently with no logs

Root Cause Analysis (DETAILED)

Evidence Trail

Container Logs Show Silent Exit:

{"bin_path":"crowdsec","data_dir":"/app/data/crowdsec","level":"info","msg":"CrowdSec reconciliation: starting startup check","time":"2025-12-14T23:32:33-05:00"}
[NO FURTHER LOGS - Function exited here]

Database State on Fresh Start:

SELECT * FROM security_configs → record not found
{"level":"info","msg":"CrowdSec reconciliation: no SecurityConfig found, creating default config"}

Process Check:

$ docker exec charon ps aux | grep -i crowdsec
[NO RESULTS - Process not running]

Why Reconciliation Exits Silently

FILE: backend/internal/services/crowdsec_startup.go

Execution Flow:

1. User clicks toggle ON in Security.tsx
2. Frontend calls updateSetting('security.crowdsec.enabled', 'true')
3. Settings table updated → security.crowdsec.enabled = "true"
4. Frontend calls startCrowdsec() → Handler updates SecurityConfig
5. CrowdSec starts successfully, toggle shows ON
6. Container restarts (docker restart or reboot)
7. ReconcileCrowdSecOnStartup() executes at line 26:

   Line 44: db.First(&cfg) → returns gorm.ErrRecordNotFound

   Lines 46-71: Auto-initialization block executes:
     - Creates SecurityConfig with crowdsec_mode = "disabled"
     - Logs "default SecurityConfig created successfully"
     - Returns early (line 70) WITHOUT checking Settings table
     - CrowdSec is NEVER started

   Result: Toggle shows "ON" (Settings table), but process is "OFF" (not running)

THE BUG (Lines 46-71):

if err == gorm.ErrRecordNotFound {
    // AUTO-INITIALIZE: Create default SecurityConfig on first startup
    logger.Log().Info("CrowdSec reconciliation: no SecurityConfig found, creating default config")

    defaultCfg := models.SecurityConfig{
        UUID:             "default",
        Name:             "Default Security Config",
        Enabled:          false,
        CrowdSecMode:     "disabled",  // ← PROBLEM: Ignores Settings table state
        WAFMode:          "disabled",
        WAFParanoiaLevel: 1,
        RateLimitMode:    "disabled",
        RateLimitBurst:   10,
        RateLimitRequests: 100,
        RateLimitWindowSec: 60,
    }

    if err := db.Create(&defaultCfg).Error; err != nil {
        logger.Log().WithError(err).Error("CrowdSec reconciliation: failed to create default SecurityConfig")
        return
    }

    logger.Log().Info("CrowdSec reconciliation: default SecurityConfig created successfully")
    // Don't start CrowdSec on fresh install - user must enable via UI
    return  // ← EXITS WITHOUT checking Settings table or starting process
}

Why This Causes the Issue:

First Container Start: User enables CrowdSec via toggle
- Settings: security.crowdsec.enabled = "true" ✅
- SecurityConfig: crowdsec_mode = "local" ✅ (via Start handler)
- Process: Running ✅
Container Restart: Database persists but SecurityConfig table may be empty (migration issue or corruption)
- Reconciliation runs
- SecurityConfig table: EMPTY (record lost or never migrated)
- Auto-init creates SecurityConfig with crowdsec_mode = "disabled"
- Returns early without checking Settings table
- Settings: Still shows "true" (UI says ON)
- SecurityConfig: Says "disabled" (reconciliation source)
- Process: NOT started ❌
Result: State Mismatch
- Frontend toggle: ON (reads Settings table)
- Backend reconciliation: OFF (reads SecurityConfig table)
- Process: NOT RUNNING (reconciliation didn't start it)
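
The three-way mismatch above can be expressed as a small consistency check. This is only an illustrative sketch: `State` and `Diagnose` are hypothetical names that do not exist in the Charon codebase. It assumes the Settings value, the SecurityConfig mode, and the process status should always agree.

```go
package main

import (
	"fmt"
	"strings"
)

// State is a hypothetical snapshot of the three sources of truth discussed above.
type State struct {
	SettingValue string // settings value for security.crowdsec.enabled ("true", "false", or "")
	CrowdSecMode string // security_configs.crowdsec_mode ("local" or "disabled")
	Running      bool   // whether the crowdsec process is running
}

// Diagnose returns human-readable inconsistencies; an empty result means the
// three sources agree.
func Diagnose(s State) []string {
	var problems []string
	settingOn := strings.EqualFold(s.SettingValue, "true")
	configOn := s.CrowdSecMode == "local"

	if settingOn != configOn {
		problems = append(problems, fmt.Sprintf(
			"Settings says enabled=%t but SecurityConfig mode is %q", settingOn, s.CrowdSecMode))
	}
	if (settingOn || configOn) && !s.Running {
		problems = append(problems, "CrowdSec is enabled in the database but the process is not running")
	}
	if !settingOn && !configOn && s.Running {
		problems = append(problems, "CrowdSec is disabled in the database but the process is running")
	}
	return problems
}

func main() {
	// The stuck-toggle state: the UI says ON, auto-init wrote "disabled", nothing runs.
	stuck := State{SettingValue: "true", CrowdSecMode: "disabled", Running: false}
	for _, p := range Diagnose(stuck) {
		fmt.Println(p)
	}
}
```

A check like this could back a health endpoint or a debug log line, but where it would live is left open here.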

Current Code Analysis

1. Reconciliation Function (crowdsec_startup.go)

Location: backend/internal/services/crowdsec_startup.go

Lines 44-71 (Auto-initialization - THE BUG):

var cfg models.SecurityConfig
if err := db.First(&cfg).Error; err != nil {
    if err == gorm.ErrRecordNotFound {
        // AUTO-INITIALIZE: Create default SecurityConfig on first startup
        logger.Log().Info("CrowdSec reconciliation: no SecurityConfig found, creating default config")

        defaultCfg := models.SecurityConfig{
            UUID:             "default",
            Name:             "Default Security Config",
            Enabled:          false,
            CrowdSecMode:     "disabled",  // ← IGNORES Settings table
            WAFMode:          "disabled",
            WAFParanoiaLevel: 1,
            RateLimitMode:    "disabled",
            RateLimitBurst:   10,
            RateLimitRequests: 100,
            RateLimitWindowSec: 60,
        }

        if err := db.Create(&defaultCfg).Error; err != nil {
            logger.Log().WithError(err).Error("CrowdSec reconciliation: failed to create default SecurityConfig")
            return
        }

        logger.Log().Info("CrowdSec reconciliation: default SecurityConfig created successfully")
        // Don't start CrowdSec on fresh install - user must enable via UI
        return  // ← EARLY EXIT - Never checks Settings table
    }
    logger.Log().WithError(err).Warn("CrowdSec reconciliation: failed to read SecurityConfig")
    return
}

Lines 74-90 (Runtime Setting Override - UNREACHABLE after auto-init):

// Also check for runtime setting override in settings table
var settingOverride struct{ Value string }
crowdSecEnabled := false
if err := db.Raw("SELECT value FROM settings WHERE key = ? LIMIT 1", "security.crowdsec.enabled").Scan(&settingOverride).Error; err == nil && settingOverride.Value != "" {
    crowdSecEnabled = strings.EqualFold(settingOverride.Value, "true")
    logger.Log().WithFields(map[string]interface{}{
        "setting_value":    settingOverride.Value,
        "crowdsec_enabled": crowdSecEnabled,
    }).Debug("CrowdSec reconciliation: found runtime setting override")
}

This code is NEVER REACHED when SecurityConfig doesn't exist because line 70 returns early!

Lines 91-98 (Decision Logic):

// Only auto-start if CrowdSecMode is "local" OR runtime setting is enabled
if cfg.CrowdSecMode != "local" && !crowdSecEnabled {
    logger.Log().WithFields(map[string]interface{}{
        "db_mode":         cfg.CrowdSecMode,
        "setting_enabled": crowdSecEnabled,
    }).Debug("CrowdSec reconciliation skipped: mode is not 'local' and setting not enabled")
    return
}

Also UNREACHABLE during auto-init scenario!

2. Start Handler (crowdsec_handler.go)

Location: backend/internal/api/handlers/crowdsec_handler.go

Lines 167-192 - CORRECT IMPLEMENTATION:

func (h *CrowdsecHandler) Start(c *gin.Context) {
    ctx := c.Request.Context()

    // UPDATE SecurityConfig to persist user's intent
    var cfg models.SecurityConfig
    if err := h.DB.First(&cfg).Error; err != nil {
        if err == gorm.ErrRecordNotFound {
            // Create default config with CrowdSec enabled
            cfg = models.SecurityConfig{
                UUID:         "default",
                Name:         "Default Security Config",
                Enabled:      true,
                CrowdSecMode: "local",  // ← CORRECT: Sets mode to "local"
            }
            if err := h.DB.Create(&cfg).Error; err != nil {
                logger.Log().WithError(err).Error("Failed to create SecurityConfig")
                c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to persist configuration"})
                return
            }
        } else {
            logger.Log().WithError(err).Error("Failed to read SecurityConfig")
            c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to read configuration"})
            return
        }
    } else {
        // Update existing config
        cfg.CrowdSecMode = "local"
        cfg.Enabled = true
        if err := h.DB.Save(&cfg).Error; err != nil {
            logger.Log().WithError(err).Error("Failed to update SecurityConfig")
            c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to persist configuration"})
            return
        }
    }

    // Start the process...
}

Analysis: This is CORRECT. The Start handler properly updates SecurityConfig when the user clicks "Start" on the CrowdSec config page (/security/crowdsec).

3. Frontend Toggle (Security.tsx)

Location: frontend/src/pages/Security.tsx

Lines 64-120 - THE DISCONNECT:

const crowdsecPowerMutation = useMutation({
  mutationFn: async (enabled: boolean) => {
    // Step 1: Update Settings table
    await updateSetting('security.crowdsec.enabled', enabled ? 'true' : 'false', 'security', 'bool')

    if (enabled) {
      // Step 2: Call Start() which updates SecurityConfig
      const result = await startCrowdsec()

      // Step 3: Verify running
      const status = await statusCrowdsec()
      if (!status.running) {
        await updateSetting('security.crowdsec.enabled', 'false', 'security', 'bool')
        throw new Error('CrowdSec process failed to start')
      }

      return result
    } else {
      // Step 2: Call Stop(), which updates SecurityConfig only if a record already exists
      await stopCrowdsec()

      // Step 3: Verify stopped
      await new Promise(resolve => setTimeout(resolve, 500))
      const status = await statusCrowdsec()
      if (status.running) {
        throw new Error('CrowdSec process still running')
      }

      return { enabled: false }
    }
  },
})

Analysis:

Enable Path: Updates Settings → Calls Start() → Start() updates SecurityConfig → ✅ Both tables synced
Disable Path: Updates Settings → Calls Stop() → Stop() does NOT always update SecurityConfig → ❌ Tables out of sync

Looking at the Stop handler:

func (h *CrowdsecHandler) Stop(c *gin.Context) {
    ctx := c.Request.Context()
    if err := h.Executor.Stop(ctx, h.DataDir); err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
        return
    }

    // UPDATE SecurityConfig to persist user's intent
    var cfg models.SecurityConfig
    if err := h.DB.First(&cfg).Error; err == nil {
        cfg.CrowdSecMode = "disabled"
        cfg.Enabled = false
        if err := h.DB.Save(&cfg).Error; err != nil {
            logger.Log().WithError(err).Warn("Failed to update SecurityConfig after stopping CrowdSec")
        }
    }

    c.JSON(http.StatusOK, gin.H{"status": "stopped"})
}

This IS CORRECT - Stop() handler updates SecurityConfig when it can find it. BUT:

Scenario Where It Fails:

SecurityConfig table gets corrupted/cleared/migrated incorrectly
User clicks toggle OFF
Stop() tries to update SecurityConfig → record not found → skips update
Settings table still updated to "false"
Container restarts → auto-init creates SecurityConfig with "disabled"
Both tables now indicate disabled (Settings = "false", SecurityConfig = "disabled"), but the UI might still show stale state

Comprehensive Fix Strategy

Phase 1: Fix Auto-Initialization (CRITICAL - IMMEDIATE)

FILE: backend/internal/services/crowdsec_startup.go

CHANGE: Lines 46-71 (auto-initialization block)

AFTER (with Settings table check):

if err == gorm.ErrRecordNotFound {
    // AUTO-INITIALIZE: Create default SecurityConfig by checking Settings table
    logger.Log().Info("CrowdSec reconciliation: no SecurityConfig found, checking Settings table for user preference")

    // Check if user has already enabled CrowdSec via Settings table (from toggle or legacy config)
    var settingOverride struct{ Value string }
    crowdSecEnabledInSettings := false
    if err := db.Raw("SELECT value FROM settings WHERE key = ? LIMIT 1", "security.crowdsec.enabled").Scan(&settingOverride).Error; err == nil && settingOverride.Value != "" {
        crowdSecEnabledInSettings = strings.EqualFold(settingOverride.Value, "true")
        logger.Log().WithFields(map[string]interface{}{
            "setting_value": settingOverride.Value,
            "enabled":       crowdSecEnabledInSettings,
        }).Info("CrowdSec reconciliation: found existing Settings table preference")
    }

    // Create SecurityConfig that matches Settings table state
    crowdSecMode := "disabled"
    if crowdSecEnabledInSettings {
        crowdSecMode = "local"
    }

    defaultCfg := models.SecurityConfig{
        UUID:               "default",
        Name:               "Default Security Config",
        Enabled:            crowdSecEnabledInSettings,
        CrowdSecMode:       crowdSecMode,  // ← NOW RESPECTS Settings table
        WAFMode:            "disabled",
        WAFParanoiaLevel:   1,
        RateLimitMode:      "disabled",
        RateLimitBurst:     10,
        RateLimitRequests:  100,
        RateLimitWindowSec: 60,
    }

    if err := db.Create(&defaultCfg).Error; err != nil {
        logger.Log().WithError(err).Error("CrowdSec reconciliation: failed to create default SecurityConfig")
        return
    }

    logger.Log().WithFields(map[string]interface{}{
        "crowdsec_mode": defaultCfg.CrowdSecMode,
        "enabled":       defaultCfg.Enabled,
        "source":        "settings_table",
    }).Info("CrowdSec reconciliation: default SecurityConfig created from Settings preference")

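    // Continue to process the config (DON'T return early)
    cfg = defaultCfg
} else {
    // Any other read error still aborts. This must live in an else branch:
    // if the existing Warn + return stayed directly after this block, the
    // auto-init path would fall into it and exit before CrowdSec is started.
    logger.Log().WithError(err).Warn("CrowdSec reconciliation: failed to read SecurityConfig")
    return
}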

KEY CHANGES:

Check Settings table during auto-initialization
Create SecurityConfig matching Settings state (not hardcoded "disabled")
Don't return early - let the rest of the function process the config
Assign to cfg variable so flow continues to line 74+
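
The mapping from the Settings value to the auto-created SecurityConfig fields can be isolated in a small pure function, which makes it easy to unit test. A minimal sketch, assuming a helper named `modeFromSetting` (hypothetical, not present in the codebase) and the same case-insensitive comparison used in the block above:

```go
package main

import (
	"fmt"
	"strings"
)

// modeFromSetting maps the raw settings value for security.crowdsec.enabled
// to the CrowdSecMode and Enabled fields used during auto-initialization.
// Hypothetical helper for illustration only.
func modeFromSetting(value string) (mode string, enabled bool) {
	if strings.EqualFold(value, "true") {
		return "local", true
	}
	// Anything else, including a missing or empty value, is treated as disabled.
	return "disabled", false
}

func main() {
	for _, v := range []string{"true", "TRUE", "false", ""} {
		mode, enabled := modeFromSetting(v)
		fmt.Printf("setting=%q -> crowdsec_mode=%q enabled=%t\n", v, mode, enabled)
	}
}
```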

Phase 2: Enhance Logging (IMMEDIATE)

FILE: backend/internal/services/crowdsec_startup.go

CHANGE: Lines 91-98 (decision logic - better logging)

AFTER:

// Start when EITHER SecurityConfig has mode="local" OR Settings table has enabled=true
// Exit only when BOTH are disabled
if cfg.CrowdSecMode != "local" && !crowdSecEnabled {
    logger.Log().WithFields(map[string]interface{}{
        "db_mode":         cfg.CrowdSecMode,
        "setting_enabled": crowdSecEnabled,
    }).Info("CrowdSec reconciliation skipped: both SecurityConfig and Settings indicate disabled")
    return
}

// Log which source triggered the start
if cfg.CrowdSecMode == "local" {
    logger.Log().WithField("mode", cfg.CrowdSecMode).Info("CrowdSec reconciliation: starting based on SecurityConfig mode='local'")
} else if crowdSecEnabled {
    logger.Log().WithField("setting", "true").Info("CrowdSec reconciliation: starting based on Settings table override")
}

KEY CHANGES:

Change log level from Debug to Info (so we see it in logs)
Add source attribution (which table triggered the start)
Clarify condition (exit only when BOTH are disabled)
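
The decision and its source attribution can be read as a truth table over the two inputs. The sketch below restates the Phase 2 condition as a standalone function; `startDecision` is a hypothetical name used only for illustration, and the reason strings paraphrase the log messages above.

```go
package main

import "fmt"

// startDecision mirrors the Phase 2 condition: start when SecurityConfig mode
// is "local" OR the Settings override is enabled; skip only when both are off.
// SecurityConfig takes precedence when reporting the reason, matching the
// if / else-if order in the logging code above.
func startDecision(mode string, settingEnabled bool) (start bool, reason string) {
	switch {
	case mode == "local":
		return true, "SecurityConfig mode='local'"
	case settingEnabled:
		return true, "Settings table override"
	default:
		return false, "both SecurityConfig and Settings indicate disabled"
	}
}

func main() {
	for _, mode := range []string{"local", "disabled"} {
		for _, setting := range []bool{true, false} {
			start, reason := startDecision(mode, setting)
			fmt.Printf("mode=%q setting_enabled=%t -> start=%t (%s)\n", mode, setting, start, reason)
		}
	}
}
```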

Phase 3: Add Unified Toggle Endpoint (OPTIONAL BUT RECOMMENDED)

WHY: Currently the toggle updates Settings, then calls Start/Stop, which updates SecurityConfig. Because these are separate, non-transactional requests, a failure or race between them can leave the two tables out of sync. A unified endpoint is safer.

FILE: backend/internal/api/handlers/crowdsec_handler.go

ADD: New method (after Stop(), around line 260)

// ToggleCrowdSec enables or disables CrowdSec, synchronizing Settings and SecurityConfig atomically
func (h *CrowdsecHandler) ToggleCrowdSec(c *gin.Context) {
    var payload struct {
        Enabled bool `json:"enabled"`
    }
    if err := c.ShouldBindJSON(&payload); err != nil {
        c.JSON(http.StatusBadRequest, gin.H{"error": "invalid payload"})
        return
    }

    logger.Log().WithField("enabled", payload.Enabled).Info("CrowdSec toggle: received request")

    // Use a transaction to ensure Settings and SecurityConfig stay in sync
    tx := h.DB.Begin()
    defer func() {
        if r := recover(); r != nil {
            tx.Rollback()
        }
    }()

    // STEP 1: Update Settings table
    settingKey := "security.crowdsec.enabled"
    settingValue := "false"
    if payload.Enabled {
        settingValue = "true"
    }

    var settingModel models.Setting
    if err := tx.Where("key = ?", settingKey).FirstOrCreate(&settingModel, models.Setting{
        Key:      settingKey,
        Value:    settingValue,
        Type:     "bool",
        Category: "security",
    }).Error; err != nil {
        tx.Rollback()
        logger.Log().WithError(err).Error("CrowdSec toggle: failed to update Settings table")
        c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to update settings"})
        return
    }
    settingModel.Value = settingValue
    if err := tx.Save(&settingModel).Error; err != nil {
        tx.Rollback()
        logger.Log().WithError(err).Error("CrowdSec toggle: failed to save Settings table")
        c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to update settings"})
        return
    }

    // STEP 2: Update SecurityConfig table
    var cfg models.SecurityConfig
    if err := tx.First(&cfg).Error; err != nil {
        if err == gorm.ErrRecordNotFound {
            // Create config matching toggle state
            crowdSecMode := "disabled"
            if payload.Enabled {
                crowdSecMode = "local"
            }

            cfg = models.SecurityConfig{
                UUID:               "default",
                Name:               "Default Security Config",
                Enabled:            payload.Enabled,
                CrowdSecMode:       crowdSecMode,
                WAFMode:            "disabled",
                WAFParanoiaLevel:   1,
                RateLimitMode:      "disabled",
                RateLimitBurst:     10,
                RateLimitRequests:  100,
                RateLimitWindowSec: 60,
            }
            if err := tx.Create(&cfg).Error; err != nil {
                tx.Rollback()
                logger.Log().WithError(err).Error("CrowdSec toggle: failed to create SecurityConfig")
                c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to persist configuration"})
                return
            }
        } else {
            tx.Rollback()
            logger.Log().WithError(err).Error("CrowdSec toggle: failed to read SecurityConfig")
            c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to read configuration"})
            return
        }
    } else {
        // Update existing config
        if payload.Enabled {
            cfg.CrowdSecMode = "local"
            cfg.Enabled = true
        } else {
            cfg.CrowdSecMode = "disabled"
            cfg.Enabled = false
        }
        if err := tx.Save(&cfg).Error; err != nil {
            tx.Rollback()
            logger.Log().WithError(err).Error("CrowdSec toggle: failed to update SecurityConfig")
            c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to persist configuration"})
            return
        }
    }

    // Commit the transaction before starting/stopping process
    if err := tx.Commit().Error; err != nil {
        logger.Log().WithError(err).Error("CrowdSec toggle: transaction commit failed")
        c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to commit changes"})
        return
    }

    logger.Log().WithFields(map[string]interface{}{
        "enabled":       cfg.Enabled,
        "crowdsec_mode": cfg.CrowdSecMode,
    }).Info("CrowdSec toggle: synchronized Settings and SecurityConfig successfully")

    // STEP 3: Start or stop the process
    ctx := c.Request.Context()
    if payload.Enabled {
        // Start CrowdSec
        pid, err := h.Executor.Start(ctx, h.BinPath, h.DataDir)
        if err != nil {
            logger.Log().WithError(err).Error("CrowdSec toggle: failed to start process, reverting DB changes")

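            // Revert both tables (in new transaction); log if the revert itself fails
            cfg.CrowdSecMode = "disabled"
            cfg.Enabled = false
            settingModel.Value = "false"
            if revertErr := h.DB.Transaction(func(revertTx *gorm.DB) error {
                if err := revertTx.Save(&cfg).Error; err != nil {
                    return err
                }
                return revertTx.Save(&settingModel).Error
            }); revertErr != nil {
                logger.Log().WithError(revertErr).Error("CrowdSec toggle: failed to revert DB changes")
            }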

            c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
            return
        }

        // Wait for LAPI readiness
        lapiReady := false
        maxWait := 30 * time.Second
        pollInterval := 500 * time.Millisecond
        deadline := time.Now().Add(maxWait)

        for time.Now().Before(deadline) {
            args := []string{"lapi", "status"}
            if _, err := os.Stat(filepath.Join(h.DataDir, "config.yaml")); err == nil {
                args = append([]string{"-c", filepath.Join(h.DataDir, "config.yaml")}, args...)
            }

            checkCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
            _, err := h.CmdExec.Execute(checkCtx, "cscli", args...)
            cancel()

            if err == nil {
                lapiReady = true
                break
            }

            time.Sleep(pollInterval)
        }

        logger.Log().WithFields(map[string]interface{}{
            "pid":        pid,
            "lapi_ready": lapiReady,
        }).Info("CrowdSec toggle: started successfully")

        c.JSON(http.StatusOK, gin.H{
            "enabled":    true,
            "pid":        pid,
            "lapi_ready": lapiReady,
        })
        return
    } else {
        // Stop CrowdSec
        if err := h.Executor.Stop(ctx, h.DataDir); err != nil {
            logger.Log().WithError(err).Error("CrowdSec toggle: failed to stop process")
            c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
            return
        }

        logger.Log().Info("CrowdSec toggle: stopped successfully")
        c.JSON(http.StatusOK, gin.H{"enabled": false})
        return
    }
}

Register Route:

// In RegisterRoutes() method
rg.POST("/admin/crowdsec/toggle", h.ToggleCrowdSec)

Frontend API Client (frontend/src/api/crowdsec.ts):

export async function toggleCrowdsec(enabled: boolean): Promise<{ enabled: boolean; pid?: number; lapi_ready?: boolean }> {
  const response = await client.post('/admin/crowdsec/toggle', { enabled })
  return response.data
}

Frontend Toggle Update (frontend/src/pages/Security.tsx):

const crowdsecPowerMutation = useMutation({
  mutationFn: async (enabled: boolean) => {
    if (enabled) {
      toast.info('Starting CrowdSec... This may take up to 30 seconds')
    }

    // Use unified toggle endpoint (handles Settings + SecurityConfig + Process)
    const result = await toggleCrowdsec(enabled)

    // Backend already verified state, just do final status check
    const status = await statusCrowdsec()
    if (enabled && !status.running) {
      throw new Error('CrowdSec process failed to start. Check server logs for details.')
    }
    if (!enabled && status.running) {
      throw new Error('CrowdSec process still running. Check server logs for details.')
    }

    return result
  },
  // ... rest remains the same
})

Testing Plan

Test 1: Fresh Install

Scenario: Brand new Charon installation

Start container: docker compose up -d
Navigate to Security page
Verify CrowdSec toggle shows OFF
Check status: curl http://localhost:8080/api/v1/admin/crowdsec/status
- Expected: {"running": false}
Check logs: docker logs charon 2>&1 | grep "reconciliation"
- Expected: "no SecurityConfig found, checking Settings table"
- Expected: "default SecurityConfig created from Settings preference"
- Expected: "crowdsec_mode: disabled"

Test 2: Toggle ON → Container Restart

Scenario: User enables CrowdSec, then restarts container

Enable toggle in UI (click ON)
Verify CrowdSec starts
Check status: {"running": true, "pid": xxx}
Restart: docker restart charon
Wait 10 seconds
Check status again: {"running": true, "pid": xxx} (NEW PID)
Check logs:
- Expected: "starting based on SecurityConfig mode='local'"
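
Rather than waiting a fixed 10 seconds, the restart check could poll the status endpoint until it reports the expected state. A rough sketch, assuming the endpoint returns JSON with a boolean `running` field as in the expected output above. The helper names are hypothetical, and the fake server only stands in for Charon so the example runs on its own.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// statusResponse models only the "running" field of the status payload.
type statusResponse struct {
	Running bool `json:"running"`
}

// fetchRunning performs one GET and decodes the running flag.
func fetchRunning(ctx context.Context, url string) (bool, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	var s statusResponse
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return false, err
	}
	return s.Running, nil
}

// waitForRunning polls url until running equals want or ctx expires.
func waitForRunning(ctx context.Context, url string, want bool, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if got, err := fetchRunning(ctx, url); err == nil && got == want {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("status did not reach running=%t: %w", want, ctx.Err())
		case <-ticker.C:
		}
	}
}

// newFakeStatusServer reports running=false for the first notReady requests,
// then running=true. It is a test double, not the real API.
func newFakeStatusServer(notReady int32) *httptest.Server {
	var calls int32
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		n := atomic.AddInt32(&calls, 1)
		_ = json.NewEncoder(w).Encode(statusResponse{Running: n > notReady})
	}))
}

// pollFake wires the poller to the fake server with a timeout in milliseconds.
func pollFake(notReady int32, timeoutMs int) error {
	srv := newFakeStatusServer(notReady)
	defer srv.Close()
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	return waitForRunning(ctx, srv.URL, true, 20*time.Millisecond)
}

func main() {
	if err := pollFake(2, 2000); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("status endpoint reported running=true")
}
```

Against a real container, the URL from Test 1 (http://localhost:8080/api/v1/admin/crowdsec/status) would replace the fake server's address. Any authentication the endpoint requires is not covered here.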

Test 3: Legacy Migration (Settings Table Only)

Scenario: Existing install with Settings table but no SecurityConfig

Manually set: INSERT INTO settings (key, value, type, category) VALUES ('security.crowdsec.enabled', 'true', 'bool', 'security');
Delete SecurityConfig: DELETE FROM security_configs;
Restart container
Check logs:
- Expected: "found existing Settings table preference"
- Expected: "default SecurityConfig created from Settings preference"
- Expected: "crowdsec_mode: local"
Check status: {"running": true}

Test 4: Toggle OFF → Container Restart

Scenario: User disables CrowdSec, then restarts container

Start with CrowdSec enabled and running
Click toggle OFF in UI
Verify process stops
Restart: docker restart charon
Wait 10 seconds
Check status: {"running": false}
Verify toggle still shows OFF

Test 5: Corrupted SecurityConfig Recovery

Scenario: SecurityConfig gets deleted but Settings exists

Enable CrowdSec via UI
Manually delete SecurityConfig: DELETE FROM security_configs;
Restart container
Verify auto-init recreates SecurityConfig matching Settings table
Verify CrowdSec auto-starts

Verification Checklist

Phase 1 (Auto-Initialization Fix)

Modified crowdsec_startup.go lines 46-71
Auto-init checks Settings table for existing preference
Auto-init creates SecurityConfig matching Settings state
Auto-init does NOT return early (continues to line 74+)
Test 1 (Fresh Install) passes
Test 3 (Legacy Migration) passes

Phase 2 (Logging Enhancement)

Modified crowdsec_startup.go lines 91-98
Changed log level from Debug to Info
Added source attribution logging
Test 2 (Toggle ON → Restart) shows correct log
Test 4 (Toggle OFF → Restart) shows correct log

Phase 3 (Unified Toggle - Optional)

Added ToggleCrowdSec() method to crowdsec_handler.go
Registered /admin/crowdsec/toggle route
Added toggleCrowdsec() to crowdsec.ts
Updated crowdsecPowerMutation in Security.tsx
Test 4 (Toggle OFF → Container Restart) passes
Test 5 (Corrupted SecurityConfig Recovery) passes

Pre-Deployment

Pre-commit linters pass: pre-commit run --all-files
Backend tests pass: cd backend && go test ./...
Frontend tests pass: cd frontend && npm run test
Docker build succeeds: docker build -t charon:local .
Integration test passes: scripts/crowdsec_integration.sh

Success Criteria

✅ Fix is complete when:

Toggle shows correct state (ON = running, OFF = stopped)
Toggle persists across container restarts
Reconciliation logs clearly show decision reason
Auto-initialization respects Settings table preference
No "stuck toggle" scenarios
All 5 test cases pass
Pre-commit checks pass
No regressions in existing CrowdSec functionality

Risk Assessment

Change	Risk Level	Mitigation
Phase 1 (Auto-init)	Low	Only affects fresh installs or corrupted state recovery
Phase 2 (Logging)	Very Low	Only changes log output, no logic changes
Phase 3 (Unified toggle)	Medium	New endpoint, requires thorough testing, but backward compatible

Rollback Plan

If issues arise:

Immediate Revert: git revert <commit-hash> (no DB changes needed)

Manual Fix (if toggle stuck):

-- Reset SecurityConfig
UPDATE security_configs
SET crowdsec_mode = 'disabled', enabled = 0
WHERE uuid = 'default';

-- Reset Settings
UPDATE settings
SET value = 'false'
WHERE key = 'security.crowdsec.enabled';

Force Stop CrowdSec: docker exec charon pkill -SIGTERM crowdsec

Dependency Impact Analysis

Phase 1: Auto-Initialization Changes (crowdsec_startup.go)

Files Directly Modified

backend/internal/services/crowdsec_startup.go (lines 46-71)

Dependencies and Required Updates

1. Unit Tests - MUST BE UPDATED

File: backend/internal/services/crowdsec_startup_test.go
Impact: Test TestReconcileCrowdSecOnStartup_NoSecurityConfig expects the function to skip/return early when no SecurityConfig exists
Required Change: Update test to:
- Create a Settings table entry with security.crowdsec.enabled = 'true'
- Verify that SecurityConfig is auto-created with crowdsec_mode = "local"
- Verify that CrowdSec process is started (not skipped)
Additional Tests Needed:
- TestReconcileCrowdSecOnStartup_NoSecurityConfig_SettingsDisabled - Settings='false' → creates config with mode="disabled", does NOT start
- TestReconcileCrowdSecOnStartup_NoSecurityConfig_SettingsEnabled - Settings='true' → creates config with mode="local", DOES start
- TestReconcileCrowdSecOnStartup_NoSecurityConfig_NoSettingsEntry - No Settings entry → creates config with mode="disabled", does NOT start
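
The three additional cases lend themselves to a table-driven layout. The sketch below models the post-fix reconciliation flow against an in-memory stand-in so it can run without a database. `fakeDB`, `reconcile`, and `runScenarios` are hypothetical names; the real tests would exercise ReconcileCrowdSecOnStartup with the project's own test database and executor doubles.

```go
package main

import (
	"fmt"
	"strings"
)

// fakeDB is an in-memory stand-in for the two tables; a nil pointer means "no row".
type fakeDB struct {
	setting *string // settings value for security.crowdsec.enabled
	mode    *string // security_configs.crowdsec_mode
}

// reconcile models the Phase 1 flow: when no SecurityConfig exists, create one
// from the Settings value, then fall through to the start decision instead of
// returning early. It reports whether CrowdSec would be started.
func reconcile(db *fakeDB) (started bool) {
	enabled := db.setting != nil && strings.EqualFold(*db.setting, "true")
	if db.mode == nil {
		mode := "disabled"
		if enabled {
			mode = "local"
		}
		db.mode = &mode
	}
	return *db.mode == "local" || enabled
}

func strPtr(s string) *string { return &s }

type scenario struct {
	name      string
	setting   *string
	wantMode  string
	wantStart bool
}

// scenarios lists the three cases named in the plan.
func scenarios() []scenario {
	return []scenario{
		{"NoSecurityConfig_SettingsEnabled", strPtr("true"), "local", true},
		{"NoSecurityConfig_SettingsDisabled", strPtr("false"), "disabled", false},
		{"NoSecurityConfig_NoSettingsEntry", nil, "disabled", false},
	}
}

// runScenarios returns the first mismatch between expected and modelled behavior.
func runScenarios() error {
	for _, sc := range scenarios() {
		db := &fakeDB{setting: sc.setting}
		started := reconcile(db)
		if *db.mode != sc.wantMode || started != sc.wantStart {
			return fmt.Errorf("%s: got mode=%q started=%t, want mode=%q started=%t",
				sc.name, *db.mode, started, sc.wantMode, sc.wantStart)
		}
	}
	return nil
}

func main() {
	if err := runScenarios(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("all scenarios behave as the plan expects")
}
```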

2. Integration Tests - VERIFICATION NEEDED

Files:
- scripts/crowdsec_integration.sh
- scripts/crowdsec_startup_test.sh
- scripts/crowdsec_decision_integration.sh
Impact: These scripts may assume specific startup behavior
Verification Required:
- Do any scripts pre-populate Settings table?
- Do any scripts expect reconciliation to skip on fresh DB?
- Do any scripts verify log output from reconciliation?
Action: Review scripts for assumptions about auto-initialization behavior

3. Migration/Upgrade Path - DATABASE CONCERN

Scenario: Existing installations with Settings='true' but missing SecurityConfig
Impact: After upgrade, reconciliation will auto-create SecurityConfig from Settings (POSITIVE)
Risk: Low - this is the intended fix
Documentation: Should document this as expected behavior in migration guide

4. Models - NO CHANGES REQUIRED

File: backend/internal/models/security_config.go
Analysis: SecurityConfig model structure unchanged
File: backend/internal/models/setting.go
Analysis: Setting model structure unchanged

5. Route Registration - NO CHANGES REQUIRED

File: backend/internal/api/routes/routes.go (line 360)
Analysis: Already calls ReconcileCrowdSecOnStartup, no signature changes

6. Handler Dependencies - NO CHANGES REQUIRED

File: backend/internal/api/handlers/crowdsec_handler.go
Analysis: Start/Stop handlers operate independently, no coupling to reconciliation logic

Phase 2: Logging Enhancement Changes (crowdsec_startup.go)

Files Directly Modified

backend/internal/services/crowdsec_startup.go (lines 91-98)

Dependencies and Required Updates

1. Log Aggregation/Parsing - DOCUMENTATION UPDATE

Concern: Changing log level from Debug → Info increases log volume
Impact:
- Logs will now appear in production (Info is default minimum level)
- Log aggregation tools may need filter updates if they parse specific messages
Required: Update any log parsing scripts or documentation about expected log output
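
For log parsing scripts, filtering on the message prefix is likely more robust than matching whole messages, since Phase 2 changes the wording. A minimal sketch, assuming one JSON object per line with `level` and `msg` fields as in the container log excerpt earlier. The function names, the non-JSON line, and the unrelated sample line are placeholders.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// logEntry captures only the fields this filter needs.
type logEntry struct {
	Level string `json:"level"`
	Msg   string `json:"msg"`
}

// reconciliationMessages returns "level: msg" for every JSON log line whose
// message starts with "CrowdSec reconciliation". Lines that are not valid
// JSON are skipped rather than treated as errors.
func reconciliationMessages(r io.Reader) ([]string, error) {
	var out []string
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		var e logEntry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue
		}
		if strings.HasPrefix(e.Msg, "CrowdSec reconciliation") {
			out = append(out, e.Level+": "+e.Msg)
		}
	}
	return out, sc.Err()
}

// sampleLogs reuses two log lines quoted in this plan plus two placeholder lines.
const sampleLogs = `{"bin_path":"crowdsec","data_dir":"/app/data/crowdsec","level":"info","msg":"CrowdSec reconciliation: starting startup check","time":"2025-12-14T23:32:33-05:00"}
not a JSON line
{"level":"info","msg":"CrowdSec reconciliation: no SecurityConfig found, creating default config"}
{"level":"info","msg":"some unrelated message"}`

// sampleMessages runs the filter over sampleLogs.
func sampleMessages() []string {
	msgs, err := reconciliationMessages(strings.NewReader(sampleLogs))
	if err != nil {
		return nil
	}
	return msgs
}

func main() {
	for _, m := range sampleMessages() {
		fmt.Println(m)
	}
}
```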

2. Integration Tests - POTENTIAL GREP PATTERNS

Files: scripts/crowdsec_*.sh
Impact: If scripts grep for specific log messages, they may need updates
Action: Search for log message expectations in scripts

3. Documentation - UPDATE REQUIRED

File: docs/features.md
Section: CrowdSec Integration (line 167+)

Required Change: Add note about reconciliation behavior:

#### Startup Behavior

CrowdSec automatically starts on container restart if:
- SecurityConfig has `crowdsec_mode = "local"` OR
- Settings table has `security.crowdsec.enabled = "true"`

Check container logs for reconciliation decisions:
- "CrowdSec reconciliation: starting based on SecurityConfig mode='local'"
- "CrowdSec reconciliation: starting based on Settings table override"
- "CrowdSec reconciliation skipped: both SecurityConfig and Settings indicate disabled"

4. Troubleshooting Guide - UPDATE RECOMMENDED

File: docs/troubleshooting/ (if exists) or docs/security.md
Required Change: Add section on "CrowdSec Not Starting After Restart"
- Explain reconciliation logic
- Show how to check Settings and SecurityConfig tables
- Show example log output

Phase 3: Unified Toggle Endpoint (OPTIONAL)

Files Directly Modified

backend/internal/api/handlers/crowdsec_handler.go (new method)
backend/internal/api/handlers/crowdsec_handler.go (RegisterRoutes)
frontend/src/api/crowdsec.ts (new function)
frontend/src/pages/Security.tsx (mutation update)

Dependencies and Required Updates

1. Handler Tests - NEW TESTS REQUIRED

File: backend/internal/api/handlers/crowdsec_handler_test.go
Required Tests:
- TestCrowdsecHandler_Toggle_EnableSuccess
- TestCrowdsecHandler_Toggle_DisableSuccess
- TestCrowdsecHandler_Toggle_TransactionRollback (if Start fails)
- TestCrowdsecHandler_Toggle_VerifyBothTablesUpdated

2. Existing Handlers - DEPRECATION CONSIDERATION

Files:
- Start handler (line ~167 in crowdsec_handler.go)
- Stop handler (line ~260 in crowdsec_handler.go)
Impact: New toggle endpoint duplicates Start/Stop functionality
Decision Required:
- Option A: Keep both for backward compatibility (RECOMMENDED)
- Option B: Deprecate Start/Stop, add deprecation warnings
- Option C: Remove Start/Stop entirely (BREAKING CHANGE - NOT RECOMMENDED)
Recommendation: Keep Start/Stop handlers unchanged, document toggle as "preferred method"

3. Frontend API Layer - MIGRATION PATH

File: frontend/src/api/crowdsec.ts
Current Exports: startCrowdsec, stopCrowdsec, statusCrowdsec
After Change: Add toggleCrowdsec to exports (line 75)
Backward Compatibility: Keep existing functions, don't remove them

4. Frontend Component - LIMITED SCOPE

File: frontend/src/pages/Security.tsx
Impact: Only crowdsecPowerMutation needs updating (lines 86-125)
Other Components: No other components import these functions (verified)
Risk: Low - isolated change

5. API Documentation - NEW ENDPOINT

File: docs/api.md (if it exists)
Required Addition: Document /admin/crowdsec/toggle endpoint

6. Integration Tests - NEW TEST CASE

Files: scripts/crowdsec_integration.sh
Required Addition: Test toggle endpoint directly

7. Backward Compatibility - ANALYSIS

Frontend: Existing /admin/crowdsec/start and /admin/crowdsec/stop endpoints remain functional
API Consumers: External tools using Start/Stop continue to work
Risk: None - purely additive change

Cross-Cutting Concerns

Database Migration

No schema changes required - both Settings and SecurityConfig tables already exist
Data migration: None needed - changes are behavioral only

Configuration Files

No changes required - no new environment variables or config files

Docker/Deployment

No Dockerfile changes - all changes are code-level
No docker-compose changes - no new services or volumes

Security Implications

Phase 1: Improves security by respecting user's intent across restarts
Phase 2: No security impact (logging only)
Phase 3: Transaction safety prevents partial updates (improvement)

Performance Considerations

Phase 1: Adds one SQL query during auto-initialization (one-time, on startup)
Phase 2: Minimal - only adds log statements
Phase 3: Minimal - wraps existing logic in transaction

Rollback Safety

All phases: No database schema changes, can be rolled back via git revert
Data safety: No data loss risk - only affects process startup behavior

Summary of Required File Updates

| Phase | Files to Modify | Files to Create | Tests to Add | Docs to Update |
|-------|-----------------|-----------------|--------------|----------------|
| Phase 1 | `crowdsec_startup.go` | None | 3 new unit tests | None (covered in Phase 2) |
| Phase 2 | `crowdsec_startup.go` | None | None | `features.md`, troubleshooting docs |
| Phase 3 | `crowdsec_handler.go`, `crowdsec.ts`, `Security.tsx` | None | 4 new handler tests | `api.md` (if exists) |

Testing Matrix

| Scenario | Phase 1 | Phase 2 | Phase 3 |
|----------|---------|---------|---------|
| Fresh install → toggle ON → restart | ✅ Fixes | ✅ Better logs | ✅ Cleaner code |
| Existing install with Settings='true', missing SecurityConfig | ✅ Fixes | ✅ Better logs | N/A |
| Toggle ON → restart → verify logs | ✅ Works | ✅ MUST verify new messages | ✅ Works |
| Toggle OFF → restart → verify logs | ✅ Works | ✅ MUST verify new messages | ✅ Works |
| Start/Stop handlers (backward compat) | N/A | N/A | ✅ MUST verify still work |

Missing from Original Plan

The original plan DID NOT explicitly mention:

Unit test updates required - Critical for Phase 1 (TestReconcileCrowdSecOnStartup_NoSecurityConfig needs major refactoring)
Integration script verification - Scripts may break if they grep for specific log messages or depend on the old startup behavior
Documentation updates - Features and troubleshooting guides need new reconciliation behavior documented
Backward compatibility analysis for Phase 3 - Need explicit decision on Start/Stop handler fate
API documentation - New endpoint needs docs
Testing matrix for all three phases together - Need to verify they work in combination

END OF SPECIFICATION
