# CrowdSec Toggle Integration Fix Plan

**Date**: December 15, 2025
**Issue**: CrowdSec toggle stuck ON, reconciliation silently exits, process not starting
**Root Cause**: Database disconnect between frontend (Settings table) and reconciliation (SecurityConfig table)

---

## Executive Summary

The CrowdSec toggle shows "ON" but the process is NOT running. The reconciliation function silently exits without starting CrowdSec because:

1. **Frontend writes to Settings table** (`security.crowdsec.enabled`)
2. **Backend reconciliation reads from SecurityConfig table** (`crowdsec_mode = "local"`)
3. **No synchronization** between the two tables
4. **Auto-initialization code EXISTS** (lines 46-71 in crowdsec_startup.go) but creates config with `crowdsec_mode = "disabled"`
5. **Reconciliation sees "disabled"** and exits silently with no logs

---

## Root Cause Analysis (DETAILED)

### Evidence Trail

**Container Logs Show Silent Exit**:

```
{"bin_path":"crowdsec","data_dir":"/app/data/crowdsec","level":"info","msg":"CrowdSec reconciliation: starting startup check","time":"2025-12-14T23:32:33-05:00"}
[NO FURTHER LOGS - Function exited here]
```

**Database State on Fresh Start**:

```
SELECT * FROM security_configs
→ record not found
{"level":"info","msg":"CrowdSec reconciliation: no SecurityConfig found, creating default config"}
```

**Process Check**:

```bash
$ docker exec charon ps aux | grep -i crowdsec
[NO RESULTS - Process not running]
```

### Why Reconciliation Exits Silently

**FILE**: `backend/internal/services/crowdsec_startup.go`

**Execution Flow**:

```
1. User clicks toggle ON in Security.tsx
2. Frontend calls updateSetting('security.crowdsec.enabled', 'true')
3. Settings table updated → security.crowdsec.enabled = "true"
4. Frontend calls startCrowdsec() → Handler updates SecurityConfig
5. CrowdSec starts successfully, toggle shows ON
6. Container restarts (docker restart or reboot)
7. ReconcileCrowdSecOnStartup() executes at line 26:
   Line 44: db.First(&cfg) → returns gorm.ErrRecordNotFound
   Lines 46-71: Auto-initialization block executes:
     - Creates SecurityConfig with crowdsec_mode = "disabled"
     - Logs "default SecurityConfig created successfully"
     - Returns early (line 70) WITHOUT checking Settings table
     - CrowdSec is NEVER started

Result: Toggle shows "ON" (Settings table), but process is "OFF" (not running)
```

**THE BUG (Lines 46-71)**:

```go
if err == gorm.ErrRecordNotFound {
    // AUTO-INITIALIZE: Create default SecurityConfig on first startup
    logger.Log().Info("CrowdSec reconciliation: no SecurityConfig found, creating default config")

    defaultCfg := models.SecurityConfig{
        UUID:               "default",
        Name:               "Default Security Config",
        Enabled:            false,
        CrowdSecMode:       "disabled", // ← PROBLEM: Ignores Settings table state
        WAFMode:            "disabled",
        WAFParanoiaLevel:   1,
        RateLimitMode:      "disabled",
        RateLimitBurst:     10,
        RateLimitRequests:  100,
        RateLimitWindowSec: 60,
    }

    if err := db.Create(&defaultCfg).Error; err != nil {
        logger.Log().WithError(err).Error("CrowdSec reconciliation: failed to create default SecurityConfig")
        return
    }

    logger.Log().Info("CrowdSec reconciliation: default SecurityConfig created successfully")
    // Don't start CrowdSec on fresh install - user must enable via UI
    return // ← EXITS WITHOUT checking Settings table or starting process
}
```

**Why This Causes the Issue**:

1. **First Container Start**: User enables CrowdSec via toggle
   - Settings: `security.crowdsec.enabled = "true"` ✅
   - SecurityConfig: `crowdsec_mode = "local"` ✅ (via Start handler)
   - Process: Running ✅
2. **Container Restart**: Database persists but SecurityConfig table may be empty (migration issue or corruption)
   - Reconciliation runs
   - SecurityConfig table: **EMPTY** (record lost or never migrated)
   - Auto-init creates SecurityConfig with `crowdsec_mode = "disabled"`
   - Returns early without checking Settings table
   - Settings: Still shows `"true"` (UI says ON)
   - SecurityConfig: Says `"disabled"` (reconciliation source)
   - Process: NOT started ❌
3. **Result**: **State Mismatch**
   - Frontend toggle: **ON** (reads Settings table)
   - Backend reconciliation: **OFF** (reads SecurityConfig table)
   - Process: **NOT RUNNING** (reconciliation didn't start it)

---

## Current Code Analysis

### 1. Reconciliation Function (crowdsec_startup.go)

**Location**: `backend/internal/services/crowdsec_startup.go`

**Lines 44-71 (Auto-initialization - THE BUG)**:

```go
var cfg models.SecurityConfig
if err := db.First(&cfg).Error; err != nil {
    if err == gorm.ErrRecordNotFound {
        // AUTO-INITIALIZE: Create default SecurityConfig on first startup
        logger.Log().Info("CrowdSec reconciliation: no SecurityConfig found, creating default config")

        defaultCfg := models.SecurityConfig{
            UUID:               "default",
            Name:               "Default Security Config",
            Enabled:            false,
            CrowdSecMode:       "disabled", // ← IGNORES Settings table
            WAFMode:            "disabled",
            WAFParanoiaLevel:   1,
            RateLimitMode:      "disabled",
            RateLimitBurst:     10,
            RateLimitRequests:  100,
            RateLimitWindowSec: 60,
        }

        if err := db.Create(&defaultCfg).Error; err != nil {
            logger.Log().WithError(err).Error("CrowdSec reconciliation: failed to create default SecurityConfig")
            return
        }

        logger.Log().Info("CrowdSec reconciliation: default SecurityConfig created successfully")
        // Don't start CrowdSec on fresh install - user must enable via UI
        return // ← EARLY EXIT - Never checks Settings table
    }
    logger.Log().WithError(err).Warn("CrowdSec reconciliation: failed to read SecurityConfig")
    return
}
```

**Lines 74-90 (Runtime Setting Override - UNREACHABLE after auto-init)**:

```go
// Also check for runtime setting override in settings table
var settingOverride struct{ Value string }
crowdSecEnabled := false
if err := db.Raw("SELECT value FROM settings WHERE key = ? LIMIT 1", "security.crowdsec.enabled").Scan(&settingOverride).Error; err == nil && settingOverride.Value != "" {
    crowdSecEnabled = strings.EqualFold(settingOverride.Value, "true")
    logger.Log().WithFields(map[string]interface{}{
        "setting_value":    settingOverride.Value,
        "crowdsec_enabled": crowdSecEnabled,
    }).Debug("CrowdSec reconciliation: found runtime setting override")
}
```

**This code is NEVER REACHED** when SecurityConfig doesn't exist because line 70 returns early!

**Lines 91-98 (Decision Logic)**:

```go
// Only auto-start if CrowdSecMode is "local" OR runtime setting is enabled
if cfg.CrowdSecMode != "local" && !crowdSecEnabled {
    logger.Log().WithFields(map[string]interface{}{
        "db_mode":         cfg.CrowdSecMode,
        "setting_enabled": crowdSecEnabled,
    }).Debug("CrowdSec reconciliation skipped: mode is not 'local' and setting not enabled")
    return
}
```

**Also UNREACHABLE** during auto-init scenario!

### 2. Start Handler (crowdsec_handler.go)

**Location**: `backend/internal/api/handlers/crowdsec_handler.go`

**Lines 167-192 - CORRECT IMPLEMENTATION**:

```go
func (h *CrowdsecHandler) Start(c *gin.Context) {
    ctx := c.Request.Context()

    // UPDATE SecurityConfig to persist user's intent
    var cfg models.SecurityConfig
    if err := h.DB.First(&cfg).Error; err != nil {
        if err == gorm.ErrRecordNotFound {
            // Create default config with CrowdSec enabled
            cfg = models.SecurityConfig{
                UUID:         "default",
                Name:         "Default Security Config",
                Enabled:      true,
                CrowdSecMode: "local", // ← CORRECT: Sets mode to "local"
            }
            if err := h.DB.Create(&cfg).Error; err != nil {
                logger.Log().WithError(err).Error("Failed to create SecurityConfig")
                c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to persist configuration"})
                return
            }
        } else {
            logger.Log().WithError(err).Error("Failed to read SecurityConfig")
            c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to read configuration"})
            return
        }
    } else {
        // Update existing config
        cfg.CrowdSecMode = "local"
        cfg.Enabled = true
        if err := h.DB.Save(&cfg).Error; err != nil {
            logger.Log().WithError(err).Error("Failed to update SecurityConfig")
            c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to persist configuration"})
            return
        }
    }

    // Start the process...
}
```

**Analysis**: This is CORRECT. The Start handler properly updates SecurityConfig when user clicks "Start" from the CrowdSec config page (/security/crowdsec).

### 3. Frontend Toggle (Security.tsx)

**Location**: `frontend/src/pages/Security.tsx`

**Lines 64-120 - THE DISCONNECT**:

```tsx
const crowdsecPowerMutation = useMutation({
  mutationFn: async (enabled: boolean) => {
    // Step 1: Update Settings table
    await updateSetting('security.crowdsec.enabled', enabled ? 'true' : 'false', 'security', 'bool')

    if (enabled) {
      // Step 2: Call Start() which updates SecurityConfig
      const result = await startCrowdsec()

      // Step 3: Verify running
      const status = await statusCrowdsec()
      if (!status.running) {
        await updateSetting('security.crowdsec.enabled', 'false', 'security', 'bool')
        throw new Error('CrowdSec process failed to start')
      }
      return result
    } else {
      // Step 2: Call Stop() which DOES NOT update SecurityConfig!
      await stopCrowdsec()

      // Step 3: Verify stopped
      await new Promise(resolve => setTimeout(resolve, 500))
      const status = await statusCrowdsec()
      if (status.running) {
        throw new Error('CrowdSec process still running')
      }
      return { enabled: false }
    }
  },
})
```

**Analysis**:

- **Enable Path**: Updates Settings → Calls Start() → Start() updates SecurityConfig → ✅ Both tables synced
- **Disable Path**: Updates Settings → Calls Stop() → Stop() **does NOT always update SecurityConfig** → ❌ Tables out of sync

Looking at the Stop handler:

```go
func (h *CrowdsecHandler) Stop(c *gin.Context) {
    ctx := c.Request.Context()
    if err := h.Executor.Stop(ctx, h.DataDir); err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
        return
    }

    // UPDATE SecurityConfig to persist user's intent
    var cfg models.SecurityConfig
    if err := h.DB.First(&cfg).Error; err == nil {
        cfg.CrowdSecMode = "disabled"
        cfg.Enabled = false
        if err := h.DB.Save(&cfg).Error; err != nil {
            logger.Log().WithError(err).Warn("Failed to update SecurityConfig after stopping CrowdSec")
        }
    }

    c.JSON(http.StatusOK, gin.H{"status": "stopped"})
}
```

**This IS CORRECT** - Stop() handler updates SecurityConfig when it can find it. BUT:

**Scenario Where It Fails**:

1. SecurityConfig table gets corrupted/cleared/migrated incorrectly
2. User clicks toggle OFF
3. Stop() tries to update SecurityConfig → record not found → skips update
4. Settings table still updated to "false"
5. Container restarts → auto-init creates SecurityConfig with "disabled"
6. Both tables say "disabled" but UI might show stale state

---

## Comprehensive Fix Strategy

### Phase 1: Fix Auto-Initialization (CRITICAL - IMMEDIATE)

**FILE**: `backend/internal/services/crowdsec_startup.go`

**CHANGE**: Lines 46-71 (auto-initialization block)

**AFTER** (with Settings table check):

```go
if err == gorm.ErrRecordNotFound {
    // AUTO-INITIALIZE: Create default SecurityConfig by checking Settings table
    logger.Log().Info("CrowdSec reconciliation: no SecurityConfig found, checking Settings table for user preference")

    // Check if user has already enabled CrowdSec via Settings table (from toggle or legacy config)
    var settingOverride struct{ Value string }
    crowdSecEnabledInSettings := false
    if err := db.Raw("SELECT value FROM settings WHERE key = ? LIMIT 1", "security.crowdsec.enabled").Scan(&settingOverride).Error; err == nil && settingOverride.Value != "" {
        crowdSecEnabledInSettings = strings.EqualFold(settingOverride.Value, "true")
        logger.Log().WithFields(map[string]interface{}{
            "setting_value": settingOverride.Value,
            "enabled":       crowdSecEnabledInSettings,
        }).Info("CrowdSec reconciliation: found existing Settings table preference")
    }

    // Create SecurityConfig that matches Settings table state
    crowdSecMode := "disabled"
    if crowdSecEnabledInSettings {
        crowdSecMode = "local"
    }

    defaultCfg := models.SecurityConfig{
        UUID:               "default",
        Name:               "Default Security Config",
        Enabled:            crowdSecEnabledInSettings,
        CrowdSecMode:       crowdSecMode, // ← NOW RESPECTS Settings table
        WAFMode:            "disabled",
        WAFParanoiaLevel:   1,
        RateLimitMode:      "disabled",
        RateLimitBurst:     10,
        RateLimitRequests:  100,
        RateLimitWindowSec: 60,
    }

    if err := db.Create(&defaultCfg).Error; err != nil {
        logger.Log().WithError(err).Error("CrowdSec reconciliation: failed to create default SecurityConfig")
        return
    }

    logger.Log().WithFields(map[string]interface{}{
        "crowdsec_mode": defaultCfg.CrowdSecMode,
        "enabled":       defaultCfg.Enabled,
        "source":        "settings_table",
    }).Info("CrowdSec reconciliation: default SecurityConfig created from Settings preference")

    // Continue to process the config (DON'T return early)
    cfg = defaultCfg
}
```

**KEY CHANGES**:

1. **Check Settings table** during auto-initialization
2. **Create SecurityConfig matching Settings state** (not hardcoded "disabled")
3. **Don't return early** - let the rest of the function process the config
4. **Assign to cfg variable** so flow continues to line 74+

### Phase 2: Enhance Logging (IMMEDIATE)

**FILE**: `backend/internal/services/crowdsec_startup.go`

**CHANGE**: Lines 91-98 (decision logic - better logging)

**AFTER**:

```go
// Start when EITHER SecurityConfig has mode="local" OR Settings table has enabled=true
// Exit only when BOTH are disabled
if cfg.CrowdSecMode != "local" && !crowdSecEnabled {
    logger.Log().WithFields(map[string]interface{}{
        "db_mode":         cfg.CrowdSecMode,
        "setting_enabled": crowdSecEnabled,
    }).Info("CrowdSec reconciliation skipped: both SecurityConfig and Settings indicate disabled")
    return
}

// Log which source triggered the start
if cfg.CrowdSecMode == "local" {
    logger.Log().WithField("mode", cfg.CrowdSecMode).Info("CrowdSec reconciliation: starting based on SecurityConfig mode='local'")
} else if crowdSecEnabled {
    logger.Log().WithField("setting", "true").Info("CrowdSec reconciliation: starting based on Settings table override")
}
```

**KEY CHANGES**:

1. **Change log level** from Debug to Info (so we see it in logs)
2. **Add source attribution** (which table triggered the start)
3. **Clarify condition** (exit only when BOTH are disabled)

### Phase 3: Add Unified Toggle Endpoint (OPTIONAL BUT RECOMMENDED)

**WHY**: Currently the toggle updates Settings, then calls Start/Stop which updates SecurityConfig. This creates potential race conditions. A unified endpoint is safer.

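
To make the failure mode concrete, here is a minimal, self-contained Go sketch. It is an illustration under stated assumptions, not the project's code: `store`, `toggleTwoStep`, `toggleAtomic`, and `inSync` are hypothetical names, and an in-memory struct stands in for the Settings and SecurityConfig tables. It shows how a two-step update can leave the two values disagreeing when the second write fails, while an all-or-nothing update leaves them consistent.

```go
package main

import (
	"errors"
	"fmt"
)

// store is a hypothetical in-memory stand-in for the two tables.
type store struct {
	settingEnabled string // models settings: security.crowdsec.enabled
	crowdSecMode   string // models security_configs: crowdsec_mode
}

var errWrite = errors.New("simulated write failure")

func boolString(b bool) string {
	if b {
		return "true"
	}
	return "false"
}

func modeFor(enabled bool) string {
	if enabled {
		return "local"
	}
	return "disabled"
}

// toggleTwoStep mimics two independent writes: Settings first, SecurityConfig second.
// failSecond simulates the second write failing after the first has landed.
func toggleTwoStep(s *store, enabled, failSecond bool) error {
	s.settingEnabled = boolString(enabled)
	if failSecond {
		return errWrite // Settings already changed; SecurityConfig is now stale
	}
	s.crowdSecMode = modeFor(enabled)
	return nil
}

// toggleAtomic applies both writes to a draft copy and commits only if both succeed,
// which is the property a database transaction would provide.
func toggleAtomic(s *store, enabled, failSecond bool) error {
	draft := *s
	draft.settingEnabled = boolString(enabled)
	if failSecond {
		return errWrite // draft is discarded; s is untouched
	}
	draft.crowdSecMode = modeFor(enabled)
	*s = draft
	return nil
}

// inSync reports whether both values agree on enabled vs. disabled.
func inSync(s store) bool {
	return (s.settingEnabled == "true") == (s.crowdSecMode == "local")
}

func main() {
	a := store{settingEnabled: "false", crowdSecMode: "disabled"}
	_ = toggleTwoStep(&a, true, true)
	fmt.Println("two-step in sync:", inSync(a))

	b := store{settingEnabled: "false", crowdSecMode: "disabled"}
	_ = toggleAtomic(&b, true, true)
	fmt.Println("atomic in sync:", inSync(b))
}
```

In the two-step case the Settings value flips to `"true"` while the mode stays `"disabled"`, which is the same mismatch described in the root cause analysis. The atomic variant keeps both values at their previous state when anything fails.
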
**FILE**: `backend/internal/api/handlers/crowdsec_handler.go`

**ADD**: New method (after Stop(), around line 260)

```go
// ToggleCrowdSec enables or disables CrowdSec, synchronizing Settings and SecurityConfig atomically
func (h *CrowdsecHandler) ToggleCrowdSec(c *gin.Context) {
    var payload struct {
        Enabled bool `json:"enabled"`
    }
    if err := c.ShouldBindJSON(&payload); err != nil {
        c.JSON(http.StatusBadRequest, gin.H{"error": "invalid payload"})
        return
    }

    logger.Log().WithField("enabled", payload.Enabled).Info("CrowdSec toggle: received request")

    // Use a transaction to ensure Settings and SecurityConfig stay in sync
    tx := h.DB.Begin()
    defer func() {
        if r := recover(); r != nil {
            tx.Rollback()
        }
    }()

    // STEP 1: Update Settings table
    settingKey := "security.crowdsec.enabled"
    settingValue := "false"
    if payload.Enabled {
        settingValue = "true"
    }

    // Look up by key only. Passing the full struct to FirstOrCreate as a condition would
    // also match on Value, miss the existing row when the value changes, and try to
    // insert a duplicate. Attrs supplies the fields used only when the row is created.
    var settingModel models.Setting
    if err := tx.Where("key = ?", settingKey).Attrs(models.Setting{
        Key:      settingKey,
        Value:    settingValue,
        Type:     "bool",
        Category: "security",
    }).FirstOrCreate(&settingModel).Error; err != nil {
        tx.Rollback()
        logger.Log().WithError(err).Error("CrowdSec toggle: failed to update Settings table")
        c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to update settings"})
        return
    }

    settingModel.Value = settingValue
    if err := tx.Save(&settingModel).Error; err != nil {
        tx.Rollback()
        logger.Log().WithError(err).Error("CrowdSec toggle: failed to save Settings table")
        c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to update settings"})
        return
    }

    // STEP 2: Update SecurityConfig table
    var cfg models.SecurityConfig
    if err := tx.First(&cfg).Error; err != nil {
        if err == gorm.ErrRecordNotFound {
            // Create config matching toggle state
            crowdSecMode := "disabled"
            if payload.Enabled {
                crowdSecMode = "local"
            }

            cfg = models.SecurityConfig{
                UUID:               "default",
                Name:               "Default Security Config",
                Enabled:            payload.Enabled,
                CrowdSecMode:       crowdSecMode,
                WAFMode:            "disabled",
                WAFParanoiaLevel:   1,
                RateLimitMode:      "disabled",
                RateLimitBurst:     10,
                RateLimitRequests:  100,
                RateLimitWindowSec: 60,
            }

            if err := tx.Create(&cfg).Error; err != nil {
                tx.Rollback()
                logger.Log().WithError(err).Error("CrowdSec toggle: failed to create SecurityConfig")
                c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to persist configuration"})
                return
            }
        } else {
            tx.Rollback()
            logger.Log().WithError(err).Error("CrowdSec toggle: failed to read SecurityConfig")
            c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to read configuration"})
            return
        }
    } else {
        // Update existing config
        if payload.Enabled {
            cfg.CrowdSecMode = "local"
            cfg.Enabled = true
        } else {
            cfg.CrowdSecMode = "disabled"
            cfg.Enabled = false
        }

        if err := tx.Save(&cfg).Error; err != nil {
            tx.Rollback()
            logger.Log().WithError(err).Error("CrowdSec toggle: failed to update SecurityConfig")
            c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to persist configuration"})
            return
        }
    }

    // Commit the transaction before starting/stopping process
    if err := tx.Commit().Error; err != nil {
        logger.Log().WithError(err).Error("CrowdSec toggle: transaction commit failed")
        c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to commit changes"})
        return
    }

    logger.Log().WithFields(map[string]interface{}{
        "enabled":       cfg.Enabled,
        "crowdsec_mode": cfg.CrowdSecMode,
    }).Info("CrowdSec toggle: synchronized Settings and SecurityConfig successfully")

    // STEP 3: Start or stop the process
    ctx := c.Request.Context()

    if payload.Enabled {
        // Start CrowdSec
        pid, err := h.Executor.Start(ctx, h.BinPath, h.DataDir)
        if err != nil {
            logger.Log().WithError(err).Error("CrowdSec toggle: failed to start process, reverting DB changes")

            // Revert both tables (in new transaction) and log if the revert itself fails
            cfg.CrowdSecMode = "disabled"
            cfg.Enabled = false
            settingModel.Value = "false"
            if revertErr := h.DB.Transaction(func(rtx *gorm.DB) error {
                if err := rtx.Save(&cfg).Error; err != nil {
                    return err
                }
                return rtx.Save(&settingModel).Error
            }); revertErr != nil {
                logger.Log().WithError(revertErr).Error("CrowdSec toggle: failed to revert DB changes")
            }

            c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
            return
        }

        // Wait for LAPI readiness
        lapiReady := false
        maxWait := 30 * time.Second
        pollInterval := 500 * time.Millisecond
        deadline := time.Now().Add(maxWait)

        for time.Now().Before(deadline) {
            args := []string{"lapi", "status"}
            if _, err := os.Stat(filepath.Join(h.DataDir, "config.yaml")); err == nil {
                args = append([]string{"-c", filepath.Join(h.DataDir, "config.yaml")}, args...)
            }

            checkCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
            _, err := h.CmdExec.Execute(checkCtx, "cscli", args...)
            cancel()

            if err == nil {
                lapiReady = true
                break
            }

            time.Sleep(pollInterval)
        }

        logger.Log().WithFields(map[string]interface{}{
            "pid":        pid,
            "lapi_ready": lapiReady,
        }).Info("CrowdSec toggle: started successfully")

        c.JSON(http.StatusOK, gin.H{
            "enabled":    true,
            "pid":        pid,
            "lapi_ready": lapiReady,
        })
        return
    } else {
        // Stop CrowdSec
        if err := h.Executor.Stop(ctx, h.DataDir); err != nil {
            logger.Log().WithError(err).Error("CrowdSec toggle: failed to stop process")
            c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
            return
        }

        logger.Log().Info("CrowdSec toggle: stopped successfully")
        c.JSON(http.StatusOK, gin.H{"enabled": false})
        return
    }
}
```

**Register Route**:

```go
// In RegisterRoutes() method
rg.POST("/admin/crowdsec/toggle", h.ToggleCrowdSec)
```

**Frontend API Client** (`frontend/src/api/crowdsec.ts`):

```typescript
export async function toggleCrowdsec(enabled: boolean): Promise<{ enabled: boolean; pid?: number; lapi_ready?: boolean }> {
  const response = await client.post('/admin/crowdsec/toggle', { enabled })
  return response.data
}
```

**Frontend Toggle Update** (`frontend/src/pages/Security.tsx`):

```tsx
const crowdsecPowerMutation = useMutation({
  mutationFn: async (enabled: boolean) => {
    if (enabled) {
      toast.info('Starting CrowdSec... This may take up to 30 seconds')
    }

    // Use unified toggle endpoint (handles Settings + SecurityConfig + Process)
    const result = await toggleCrowdsec(enabled)

    // Backend already verified state, just do final status check
    const status = await statusCrowdsec()
    if (enabled && !status.running) {
      throw new Error('CrowdSec process failed to start. Check server logs for details.')
    }
    if (!enabled && status.running) {
      throw new Error('CrowdSec process still running. Check server logs for details.')
    }

    return result
  },
  // ... rest remains the same
})
```

---

## Testing Plan

### Test 1: Fresh Install

**Scenario**: Brand new Charon installation

1. Start container: `docker compose up -d`
2. Navigate to Security page
3. Verify CrowdSec toggle shows OFF
4. Check status: `curl http://localhost:8080/api/v1/admin/crowdsec/status`
   - Expected: `{"running": false}`
5. Check logs: `docker logs charon 2>&1 | grep "reconciliation"`
   - Expected: "no SecurityConfig found, checking Settings table"
   - Expected: "default SecurityConfig created from Settings preference"
   - Expected: "crowdsec_mode: disabled"

### Test 2: Toggle ON → Container Restart

**Scenario**: User enables CrowdSec, then restarts container

1. Enable toggle in UI (click ON)
2. Verify CrowdSec starts
3. Check status: `{"running": true, "pid": xxx}`
4. Restart: `docker restart charon`
5. Wait 10 seconds
6. Check status again: `{"running": true, "pid": xxx}` (NEW PID)
7. Check logs:
   - Expected: "starting based on SecurityConfig mode='local'"

### Test 3: Legacy Migration (Settings Table Only)

**Scenario**: Existing install with Settings table but no SecurityConfig

1. Manually set: `INSERT INTO settings (key, value, type, category) VALUES ('security.crowdsec.enabled', 'true', 'bool', 'security');`
2. Delete SecurityConfig: `DELETE FROM security_configs;`
3. Restart container
4. Check logs:
   - Expected: "found existing Settings table preference"
   - Expected: "default SecurityConfig created from Settings preference"
   - Expected: "crowdsec_mode: local"
5. Check status: `{"running": true}`

### Test 4: Toggle OFF → Container Restart

**Scenario**: User disables CrowdSec, then restarts container

1. Start with CrowdSec enabled and running
2. Click toggle OFF in UI
3. Verify process stops
4. Restart: `docker restart charon`
5. Wait 10 seconds
6. Check status: `{"running": false}`
7. Verify toggle still shows OFF

### Test 5: Corrupted SecurityConfig Recovery

**Scenario**: SecurityConfig gets deleted but Settings exists

1. Enable CrowdSec via UI
2. Manually delete SecurityConfig: `DELETE FROM security_configs;`
3. Restart container
4. Verify auto-init recreates SecurityConfig matching Settings table
5. Verify CrowdSec auto-starts

---

## Verification Checklist

### Phase 1 (Auto-Initialization Fix)

- [ ] Modified `crowdsec_startup.go` lines 46-71
- [ ] Auto-init checks Settings table for existing preference
- [ ] Auto-init creates SecurityConfig matching Settings state
- [ ] Auto-init does NOT return early (continues to line 74+)
- [ ] Test 1 (Fresh Install) passes
- [ ] Test 3 (Legacy Migration) passes

### Phase 2 (Logging Enhancement)

- [ ] Modified `crowdsec_startup.go` lines 91-98
- [ ] Changed log level from Debug to Info
- [ ] Added source attribution logging
- [ ] Test 2 (Toggle ON → Restart) shows correct log
- [ ] Test 4 (Toggle OFF → Restart) shows correct log

### Phase 3 (Unified Toggle - Optional)

- [ ] Added `ToggleCrowdSec()` method to `crowdsec_handler.go`
- [ ] Registered `/admin/crowdsec/toggle` route
- [ ] Added `toggleCrowdsec()` to `crowdsec.ts`
- [ ] Updated `crowdsecPowerMutation` in `Security.tsx`
- [ ] Test 4 (Toggle OFF → Restart) passes
- [ ] Test 5 (Corrupted recovery) passes

### Pre-Deployment

- [ ] Pre-commit linters pass: `pre-commit run --all-files`
- [ ] Backend tests pass: `cd backend && go test ./...`
- [ ] Frontend tests pass: `cd frontend && npm run test`
- [ ] Docker build succeeds: `docker build -t charon:local .`
- [ ] Integration test passes: `scripts/crowdsec_integration.sh`

---

## Success Criteria

✅ **Fix is complete when**:

1. Toggle shows correct state (ON = running, OFF = stopped)
2. Toggle persists across container restarts
3. Reconciliation logs clearly show decision reason
4. Auto-initialization respects Settings table preference
5. No "stuck toggle" scenarios
6. All 5 test cases pass
7. Pre-commit checks pass
8. No regressions in existing CrowdSec functionality

---

## Risk Assessment

| Change | Risk Level | Mitigation |
|--------|------------|------------|
| Phase 1 (Auto-init) | **Low** | Only affects fresh installs or corrupted state recovery |
| Phase 2 (Logging) | **Very Low** | Only changes log output, no logic changes |
| Phase 3 (Unified toggle) | **Medium** | New endpoint, requires thorough testing, but backward compatible |

---

## Rollback Plan

If issues arise:

1. **Immediate Revert**: `git revert ` (no DB changes needed)
2. **Manual Fix** (if toggle stuck):

   ```sql
   -- Reset SecurityConfig
   UPDATE security_configs SET crowdsec_mode = 'disabled', enabled = 0 WHERE uuid = 'default';

   -- Reset Settings
   UPDATE settings SET value = 'false' WHERE key = 'security.crowdsec.enabled';
   ```

3. **Force Stop CrowdSec**: `docker exec charon pkill -SIGTERM crowdsec`

---

## Dependency Impact Analysis

### Phase 1: Auto-Initialization Changes (crowdsec_startup.go)

#### Files Directly Modified

- `backend/internal/services/crowdsec_startup.go` (lines 46-71)

#### Dependencies and Required Updates

**1. Unit Tests - MUST BE UPDATED**

- **File**: `backend/internal/services/crowdsec_startup_test.go`
- **Impact**: Test `TestReconcileCrowdSecOnStartup_NoSecurityConfig` expects the function to skip/return early when no SecurityConfig exists
- **Required Change**: Update test to:
  - Create a Settings table entry with `security.crowdsec.enabled = 'true'`
  - Verify that SecurityConfig is auto-created with `crowdsec_mode = "local"`
  - Verify that CrowdSec process is started (not skipped)
- **Additional Tests Needed**:
  - `TestReconcileCrowdSecOnStartup_NoSecurityConfig_SettingsDisabled` - Settings='false' → creates config with mode="disabled", does NOT start
  - `TestReconcileCrowdSecOnStartup_NoSecurityConfig_SettingsEnabled` - Settings='true' → creates config with mode="local", DOES start
  - `TestReconcileCrowdSecOnStartup_NoSecurityConfig_NoSettingsEntry` - No Settings entry → creates config with mode="disabled", does NOT start

**2. Integration Tests - VERIFICATION NEEDED**

- **Files**:
  - `scripts/crowdsec_integration.sh`
  - `scripts/crowdsec_startup_test.sh`
  - `scripts/crowdsec_decision_integration.sh`
- **Impact**: These scripts may assume specific startup behavior
- **Verification Required**:
  - Do any scripts pre-populate Settings table?
  - Do any scripts expect reconciliation to skip on fresh DB?
  - Do any scripts verify log output from reconciliation?
- **Action**: Review scripts for assumptions about auto-initialization behavior

**3. Migration/Upgrade Path - DATABASE CONCERN**

- **Scenario**: Existing installations with Settings='true' but missing SecurityConfig
- **Impact**: After upgrade, reconciliation will auto-create SecurityConfig from Settings (POSITIVE)
- **Risk**: Low - this is the intended fix
- **Documentation**: Should document this as expected behavior in migration guide

**4. Models - NO CHANGES REQUIRED**

- **File**: `backend/internal/models/security_config.go`
- **Analysis**: SecurityConfig model structure unchanged
- **File**: `backend/internal/models/setting.go`
- **Analysis**: Setting model structure unchanged

**5. Route Registration - NO CHANGES REQUIRED**

- **File**: `backend/internal/api/routes/routes.go` (line 360)
- **Analysis**: Already calls `ReconcileCrowdSecOnStartup`, no signature changes

**6. Handler Dependencies - NO CHANGES REQUIRED**

- **File**: `backend/internal/api/handlers/crowdsec_handler.go`
- **Analysis**: Start/Stop handlers operate independently, no coupling to reconciliation logic

### Phase 2: Logging Enhancement Changes (crowdsec_startup.go)

#### Files Directly Modified

- `backend/internal/services/crowdsec_startup.go` (lines 91-98)

#### Dependencies and Required Updates

**1. Log Aggregation/Parsing - DOCUMENTATION UPDATE**

- **Concern**: Changing log level from Debug → Info increases log volume
- **Impact**:
  - Logs will now appear in production (Info is default minimum level)
  - Log aggregation tools may need filter updates if they parse specific messages
- **Required**: Update any log parsing scripts or documentation about expected log output

**2. Integration Tests - POTENTIAL GREP PATTERNS**

- **Files**: `scripts/crowdsec_*.sh`
- **Impact**: If scripts `grep` for specific log messages, they may need updates
- **Action**: Search for log message expectations in scripts

**3. Documentation - UPDATE REQUIRED**

- **File**: `docs/features.md`
- **Section**: CrowdSec Integration (line 167+)
- **Required Change**: Add note about reconciliation behavior:

  ```markdown
  #### Startup Behavior

  CrowdSec automatically starts on container restart if:

  - SecurityConfig has `crowdsec_mode = "local"` OR
  - Settings table has `security.crowdsec.enabled = "true"`

  Check container logs for reconciliation decisions:

  - "CrowdSec reconciliation: starting based on SecurityConfig mode='local'"
  - "CrowdSec reconciliation: starting based on Settings table override"
  - "CrowdSec reconciliation skipped: both SecurityConfig and Settings indicate disabled"
  ```

**4. Troubleshooting Guide - UPDATE RECOMMENDED**

- **File**: `docs/troubleshooting/` (if exists) or `docs/security.md`
- **Required Change**: Add section on "CrowdSec Not Starting After Restart"
  - Explain reconciliation logic
  - Show how to check Settings and SecurityConfig tables
  - Show example log output

### Phase 3: Unified Toggle Endpoint (OPTIONAL)

#### Files Directly Modified

- `backend/internal/api/handlers/crowdsec_handler.go` (new method)
- `backend/internal/api/handlers/crowdsec_handler.go` (RegisterRoutes)
- `frontend/src/api/crowdsec.ts` (new function)
- `frontend/src/pages/Security.tsx` (mutation update)

#### Dependencies and Required Updates

**1. Handler Tests - NEW TESTS REQUIRED**

- **File**: `backend/internal/api/handlers/crowdsec_handler_test.go`
- **Required Tests**:
  - `TestCrowdsecHandler_Toggle_EnableSuccess`
  - `TestCrowdsecHandler_Toggle_DisableSuccess`
  - `TestCrowdsecHandler_Toggle_TransactionRollback` (if Start fails)
  - `TestCrowdsecHandler_Toggle_VerifyBothTablesUpdated`

**2. Existing Handlers - DEPRECATION CONSIDERATION**

- **Files**:
  - Start handler (line ~167 in crowdsec_handler.go)
  - Stop handler (line ~260 in crowdsec_handler.go)
- **Impact**: New toggle endpoint duplicates Start/Stop functionality
- **Decision Required**:
  - **Option A**: Keep both for backward compatibility (RECOMMENDED)
  - **Option B**: Deprecate Start/Stop, add deprecation warnings
  - **Option C**: Remove Start/Stop entirely (BREAKING CHANGE - NOT RECOMMENDED)
- **Recommendation**: Keep Start/Stop handlers unchanged, document toggle as "preferred method"

**3. Frontend API Layer - MIGRATION PATH**

- **File**: `frontend/src/api/crowdsec.ts`
- **Current Exports**: `startCrowdsec`, `stopCrowdsec`, `statusCrowdsec`
- **After Change**: Add `toggleCrowdsec` to exports (line 75)
- **Backward Compatibility**: Keep existing functions, don't remove them

**4. Frontend Component - LIMITED SCOPE**

- **File**: `frontend/src/pages/Security.tsx`
- **Impact**: Only `crowdsecPowerMutation` needs updating (lines 86-125)
- **Other Components**: No other components import these functions (verified)
- **Risk**: Low - isolated change

**5. API Documentation - NEW ENDPOINT**

- **File**: `docs/api.md` (if exists)
- **Required Addition**: Document `/admin/crowdsec/toggle` endpoint

**6. Integration Tests - NEW TEST CASE**

- **Files**: `scripts/crowdsec_integration.sh`
- **Required Addition**: Test toggle endpoint directly

**7. Backward Compatibility - ANALYSIS**

- **Frontend**: Existing `/admin/crowdsec/start` and `/admin/crowdsec/stop` endpoints remain functional
- **API Consumers**: External tools using Start/Stop continue to work
- **Risk**: None - purely additive change

### Cross-Cutting Concerns

#### Database Migration

- **No schema changes required** - both Settings and SecurityConfig tables already exist
- **Data migration**: None needed - changes are behavioral only

#### Configuration Files

- **No changes required** - no new environment variables or config files

#### Docker/Deployment

- **No Dockerfile changes** - all changes are code-level
- **No docker-compose changes** - no new services or volumes

#### Security Implications

- **Phase 1**: Improves security by respecting user's intent across restarts
- **Phase 2**: No security impact (logging only)
- **Phase 3**: Transaction safety prevents partial updates (improvement)

#### Performance Considerations

- **Phase 1**: Adds one SQL query during auto-initialization (one-time, on startup)
- **Phase 2**: Minimal - only adds log statements
- **Phase 3**: Minimal - wraps existing logic in transaction

#### Rollback Safety

- **All phases**: No database schema changes, can be rolled back via git revert
- **Data safety**: No data loss risk - only affects process startup behavior

### Summary of Required File Updates

| Phase | Files to Modify | Files to Create | Tests to Add | Docs to Update |
|-------|----------------|-----------------|--------------|----------------|
| **Phase 1** | `crowdsec_startup.go` | None | 3 new unit tests | None (covered in Phase 2) |
| **Phase 2** | `crowdsec_startup.go` | None | None | `features.md`, troubleshooting docs |
| **Phase 3** | `crowdsec_handler.go`, `crowdsec.ts`, `Security.tsx` | None | 4 new handler tests | `api.md` (if exists) |

### Testing Matrix

| Scenario | Phase 1 | Phase 2 | Phase 3 |
|----------|---------|---------|---------|
| Fresh install → toggle ON → restart | ✅ Fixes | ✅ Better logs | ✅ Cleaner code |
| Existing install with Settings='true', missing SecurityConfig | ✅ Fixes | ✅ Better logs | N/A |
| Toggle ON → restart → verify logs | ✅ Works | ✅ MUST verify new messages | ✅ Works |
| Toggle OFF → restart → verify logs | ✅ Works | ✅ MUST verify new messages | ✅ Works |
| Start/Stop handlers (backward compat) | N/A | N/A | ✅ MUST verify still work |

### Missing from Original Plan

The original plan DID NOT explicitly mention:

1. **Unit test updates required** - Critical for Phase 1 (`TestReconcileCrowdSecOnStartup_NoSecurityConfig` needs major refactoring)
2. **Integration script verification** - May break if they expect specific behavior
3. **Documentation updates** - Features and troubleshooting guides need new reconciliation behavior documented
4. **Backward compatibility analysis for Phase 3** - Need explicit decision on Start/Stop handler fate
5. **API documentation** - New endpoint needs docs
6. **Testing matrix for all three phases together** - Need to verify they work in combination

---

**END OF SPECIFICATION**