CrowdSec Critical Hotfix Remediation Plan
Date: December 15, 2025
Priority: CRITICAL
Issue Count: 4 reported issues after 17 failed commit attempts
Affected Components: Backend (handlers, services), Frontend (pages, hooks, components)
Executive Summary
After exhaustive analysis of the CrowdSec functionality across both backend and frontend, I have identified the root causes of all four reported issues. The core problem is a dual-state architecture conflict where CrowdSec's enabled state is managed by TWO independent systems that don't synchronize properly:
- Settings Table (security.crowdsec.enabled and security.crowdsec.mode) - Runtime overrides
- SecurityConfig Table (CrowdSecMode column) - User configuration
Additionally, the Live Log Viewer has a WebSocket lifecycle bug and the deprecated mode UI causes state conflicts.
The 4 Reported Issues
| # | Issue | Root Cause | Severity |
|---|---|---|---|
| 1 | CrowdSec card toggle broken - shows "active" but not actually on | Dual-state conflict: security.crowdsec.mode overrides security.crowdsec.enabled | CRITICAL |
| 2 | Live logs show "disconnected" but logs appear; navigation clears logs | WebSocket reconnection lifecycle bug + state not persisted | HIGH |
| 3 | Deprecated mode toggle still in UI causing confusion | UI component not removed after deprecation | MEDIUM |
| 4 | Enrollment shows "not running" when LAPI initializing | Race condition between process start and LAPI readiness | HIGH |
Current State Analysis
Backend Data Flow
1. SecurityConfig Model
File: backend/internal/models/security_config.go
type SecurityConfig struct {
    CrowdSecMode string `json:"crowdsec_mode"` // "disabled" or "local" - DEPRECATED
    Enabled      bool   `json:"enabled"`       // Cerberus master switch
    // ...
}
2. GetStatus Handler - THE BUG
File: backend/internal/api/handlers/security_handler.go#L75-175
The GetStatus endpoint has a three-tier priority chain that causes the bug:
// PRIORITY 1 (highest): Settings table overrides
// Line 135-140: Check security.crowdsec.enabled
if strings.EqualFold(setting.Value, "true") {
    crowdSecMode = "local"
} else {
    crowdSecMode = "disabled"
}
// Line 143-148: THEN check security.crowdsec.mode - THIS OVERRIDES THE ABOVE!
setting = struct{ Value string }{}
if err := h.db.Raw("SELECT value FROM settings WHERE key = ? LIMIT 1", "security.crowdsec.mode").Scan(&setting).Error; err == nil && setting.Value != "" {
    crowdSecMode = setting.Value // <-- BUG: This can override the enabled check!
}
The Bug Flow:
1. User toggles CrowdSec ON → security.crowdsec.enabled = "true" → crowdSecMode = "local" ✓
2. BUT if security.crowdsec.mode = "disabled" was previously set (by deprecated UI), it OVERRIDES step 1
3. Final result: crowdSecMode = "disabled" even though user just toggled it ON
3. CrowdSec Start Handler - INCONSISTENT STATE UPDATE
File: backend/internal/api/handlers/crowdsec_handler.go#L184-240
func (h *CrowdsecHandler) Start(c *gin.Context) {
    // Updates SecurityConfig table
    cfg.CrowdSecMode = "local"
    cfg.Enabled = true
    h.DB.Save(&cfg) // Saves to security_configs table
    // BUT: Does NOT update settings table!
    // Missing: h.DB.Create/Update(&models.Setting{Key: "security.crowdsec.enabled", Value: "true"})
}
Problem: Start() updates SecurityConfig.CrowdSecMode but the frontend toggle updates settings.security.crowdsec.enabled. These are TWO DIFFERENT tables that both affect CrowdSec state.
4. Feature Flags Handler
File: backend/internal/api/handlers/feature_flags_handler.go
Only manages THREE flags:
- feature.cerberus.enabled (Cerberus master switch)
- feature.uptime.enabled
- feature.crowdsec.console_enrollment
Missing: No feature.crowdsec.enabled. CrowdSec uses security.crowdsec.enabled in settings table, which is NOT a feature flag.
Frontend Data Flow
1. Security.tsx (Cerberus Dashboard)
File: frontend/src/pages/Security.tsx#L65-110
const crowdsecPowerMutation = useMutation({
  mutationFn: async (enabled: boolean) => {
    // Step 1: Update settings table
    await updateSetting('security.crowdsec.enabled', enabled ? 'true' : 'false', 'security', 'bool')
    if (enabled) {
      // Step 2: Start process (which updates SecurityConfig table)
      const result = await startCrowdsec()
      // ...
    }
  }
})
The mutation updates TWO places:
- settings table via updateSetting() → sets security.crowdsec.enabled
- security_configs table via startCrowdsec() backend → sets CrowdSecMode
But GetStatus reads from BOTH and can get conflicting values.
2. CrowdSecConfig.tsx - DEPRECATED MODE TOGGLE
File: frontend/src/pages/CrowdSecConfig.tsx#L69-90
const updateModeMutation = useMutation({
  mutationFn: async (mode: string) => updateSetting('security.crowdsec.mode', mode, 'security', 'string'),
  // This updates security.crowdsec.mode which OVERRIDES security.crowdsec.enabled!
})
This is the deprecated toggle that should not exist. It sets security.crowdsec.mode which takes precedence over security.crowdsec.enabled in GetStatus.
3. LiveLogViewer.tsx - WEBSOCKET BUGS
File: frontend/src/components/LiveLogViewer.tsx#L100-150
useEffect(() => {
  // Close existing connection
  if (closeConnectionRef.current) {
    closeConnectionRef.current();
    closeConnectionRef.current = null;
  }
  // ... reconnect logic
}, [currentMode, filters, securityFilters, isPaused, maxLogs, showBlockedOnly]);
//                                         ^^^^^^^^
// BUG: isPaused in dependencies causes reconnection when user just wants to pause!
Problems:
- isPaused in deps → toggling pause causes WebSocket disconnect/reconnect
- Navigation away unmounts component → logs state is lost
- isConnected is local state → lost on unmount, starts as false on remount
- No reconnection retry logic
4. Console Enrollment LAPI Check
File: frontend/src/pages/CrowdSecConfig.tsx#L85-120
// Wait 3 seconds before first LAPI check
const timer = setTimeout(() => {
  setInitialCheckComplete(true)
}, 3000)
Problem: 3 seconds may not be enough. CrowdSec LAPI typically takes 5-10 seconds to initialize. Users see "not running" error during this window.
Identified Problems
Problem 1: Dual-State Conflict (Toggle Shows Active But Not Working)
Evidence Chain:
User toggles ON → updateSetting('security.crowdsec.enabled', 'true')
  → startCrowdsec() → sets SecurityConfig.CrowdSecMode = 'local'
User refreshes page → getSecurityStatus()
  → Reads security.crowdsec.enabled = 'true' → crowdSecMode = 'local'
  → Reads security.crowdsec.mode (if exists) → OVERRIDES to whatever value
If security.crowdsec.mode = 'disabled' (from deprecated UI) → Final: crowdSecMode = 'disabled'
Locations:
- Backend: security_handler.go#L135-148
- Backend: crowdsec_handler.go#L195-215
- Frontend: Security.tsx#L65-110
Problem 2: Live Log Viewer State Issues
Evidence:
- Shows "Disconnected" immediately after page load (initial state = false)
- Logs appear because WebSocket connects quickly, but isConnected state update races
- Navigation away loses all log entries (component state)
- Pausing causes reconnection flicker
Location: LiveLogViewer.tsx#L100-150
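If keeping entries across navigation turns out to be wanted, one possible approach is to hold them outside component state. The sketch below is a minimal, framework-free illustration under that assumption. `createLogBuffer`, `BufferedLog` and `securityLogBuffer` are hypothetical names that do not exist in the current code, and the cap of 3 is only there to keep the example short.

```typescript
// Hypothetical helper: a capped buffer that lives at module scope,
// so its contents are not discarded when a component unmounts.
interface BufferedLog {
  timestamp: string;
  message: string;
}

function createLogBuffer<T>(maxEntries: number) {
  // Assumes maxEntries >= 1
  let entries: T[] = [];
  return {
    // Append one entry and drop the oldest entries beyond maxEntries
    push(entry: T): void {
      entries = [...entries, entry];
      if (entries.length > maxEntries) {
        entries = entries.slice(-maxEntries);
      }
    },
    // Return a copy of the current contents, usable as initial component state
    snapshot(): T[] {
      return [...entries];
    },
    clear(): void {
      entries = [];
    },
  };
}

// Created once when the module is loaded
const securityLogBuffer = createLogBuffer<BufferedLog>(3);

securityLogBuffer.push({ timestamp: "t1", message: "first" });
securityLogBuffer.push({ timestamp: "t2", message: "second" });
securityLogBuffer.push({ timestamp: "t3", message: "third" });
securityLogBuffer.push({ timestamp: "t4", message: "fourth" });

console.log(securityLogBuffer.snapshot().map((e) => e.message)); // [ 'second', 'third', 'fourth' ]
```

A component could seed its logs state from `snapshot()` on mount and call `push()` from its message handler. Whether that trade-off is worth the extra shared state is a design decision this plan leaves open.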
Problem 3: Deprecated Mode Toggle Still Present
Evidence: CrowdSecConfig.tsx still renders:
<Card>
  <h2>CrowdSec Mode</h2>
  <Switch checked={isLocalMode} onChange={(e) => handleModeToggle(e.target.checked)} />
  {/* Disabled/Local toggle - DEPRECATED */}
</Card>
Location: CrowdSecConfig.tsx#L395-420
Problem 4: Enrollment "Not Running" Error
Evidence: User enables CrowdSec, immediately tries to enroll, sees error because:
- Process starts (running=true)
- LAPI takes 5-10s to initialize (lapi_ready=false)
- Frontend shows "not running" because it checks lapi_ready
Locations:
- Frontend: CrowdSecConfig.tsx#L85-120
- Backend: console_enroll.go#L165-190
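The distinction between "process not running" and "process running, LAPI still initializing" can be expressed as a small pure function. This is a sketch only: the `running` and `lapi_ready` fields and the `initialCheckComplete` flag appear in the plan, while `LapiStatus`, `EnrollmentBanner` and `enrollmentBanner` are hypothetical names.

```typescript
// Hypothetical helper that decides which enrollment warning, if any, to show.
interface LapiStatus {
  running: boolean;
  lapi_ready: boolean;
}

type EnrollmentBanner = "none" | "initializing" | "not-running";

function enrollmentBanner(
  status: LapiStatus | undefined,
  initialCheckComplete: boolean,
): EnrollmentBanner {
  // Show nothing until the first status is available and the initial delay has passed
  if (status === undefined || !initialCheckComplete) {
    return "none";
  }
  // Process is down: the user has to enable CrowdSec first
  if (!status.running) {
    return "not-running";
  }
  // Process is up: distinguish "LAPI still starting" from "ready"
  return status.lapi_ready ? "none" : "initializing";
}

console.log(enrollmentBanner({ running: true, lapi_ready: false }, true)); // prints "initializing"
console.log(enrollmentBanner({ running: false, lapi_ready: false }, true)); // prints "not-running"
```

Keeping the decision in one function makes the two warning states mutually exclusive, which is the property the reported issue was missing.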
Remediation Plan
Phase 1: Backend Fixes (CRITICAL)
1.1 Fix GetStatus Priority Chain
File: backend/internal/api/handlers/security_handler.go
Lines: 143-148
Current Code (BUGGY):
// CrowdSec mode override (AFTER enabled check - causes override bug)
setting = struct{ Value string }{}
if err := h.db.Raw("SELECT value FROM settings WHERE key = ? LIMIT 1", "security.crowdsec.mode").Scan(&setting).Error; err == nil && setting.Value != "" {
    crowdSecMode = setting.Value
}
Fix: Remove the mode override OR make enabled take precedence:
// OPTION A: Remove mode override entirely (recommended)
// DELETE lines 143-148
// OPTION B: Make enabled take precedence over mode
setting = struct{ Value string }{}
if err := h.db.Raw("SELECT value FROM settings WHERE key = ? LIMIT 1", "security.crowdsec.mode").Scan(&setting).Error; err == nil && setting.Value != "" {
    // Only use mode if enabled wasn't explicitly set
    var enabledSetting struct{ Value string }
    if h.db.Raw("SELECT value FROM settings WHERE key = ? LIMIT 1", "security.crowdsec.enabled").Scan(&enabledSetting).Error != nil || enabledSetting.Value == "" {
        crowdSecMode = setting.Value
    }
    // If enabled was set, ignore deprecated mode setting
}
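The intent of Option B can be checked against a small truth table. The sketch below is a TypeScript model of that precedence rule, not the Go change itself. `SettingsTable` and `resolveModeEnabledFirst` are hypothetical names, and the `fallback` parameter is an assumption standing in for whatever value the handler held before it consulted the settings table.

```typescript
// Illustrative model of Option B: an explicit enabled setting always wins,
// and the deprecated mode setting is consulted only when enabled is absent or empty.
type SettingsTable = Record<string, string | undefined>;

function resolveModeEnabledFirst(settings: SettingsTable, fallback: string): string {
  const enabled = settings["security.crowdsec.enabled"];
  if (enabled !== undefined && enabled !== "") {
    return enabled.toLowerCase() === "true" ? "local" : "disabled";
  }
  const legacyMode = settings["security.crowdsec.mode"];
  if (legacyMode !== undefined && legacyMode !== "") {
    return legacyMode;
  }
  return fallback;
}

const cases: Array<[SettingsTable, string]> = [
  [{ "security.crowdsec.enabled": "true", "security.crowdsec.mode": "disabled" }, "local"],
  [{ "security.crowdsec.enabled": "false", "security.crowdsec.mode": "local" }, "disabled"],
  [{ "security.crowdsec.mode": "local" }, "local"],
  [{}, "disabled"],
];

for (const [settings, expected] of cases) {
  const actual = resolveModeEnabledFirst(settings, "disabled");
  console.log(JSON.stringify(settings), "->", actual, actual === expected ? "(ok)" : "(unexpected)");
}
```

Option A corresponds to dropping the `legacyMode` branch entirely, so the third case would return the fallback instead of "local".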
1.2 Update Start/Stop to Sync State
File: backend/internal/api/handlers/crowdsec_handler.go
In Start() after line 215:
// Sync settings table (source of truth for UI)
if h.DB != nil {
    settingEnabled := models.Setting{
        Key:      "security.crowdsec.enabled",
        Value:    "true",
        Type:     "bool",
        Category: "security",
    }
    h.DB.Where(models.Setting{Key: "security.crowdsec.enabled"}).Assign(settingEnabled).FirstOrCreate(&settingEnabled)
    // Clear deprecated mode setting to prevent conflicts
    h.DB.Where("key = ?", "security.crowdsec.mode").Delete(&models.Setting{})
}
In Stop() after line 260:
// Sync settings table
if h.DB != nil {
    settingEnabled := models.Setting{
        Key:      "security.crowdsec.enabled",
        Value:    "false",
        Type:     "bool",
        Category: "security",
    }
    h.DB.Where(models.Setting{Key: "security.crowdsec.enabled"}).Assign(settingEnabled).FirstOrCreate(&settingEnabled)
}
1.3 Add Deprecation Warning for Mode Setting
File: backend/internal/api/handlers/settings_handler.go
Add validation in the update handler:
func (h *SettingsHandler) UpdateSetting(c *gin.Context) {
    // ... existing code ...
    if setting.Key == "security.crowdsec.mode" {
        logger.Log().Warn("DEPRECATED: security.crowdsec.mode is deprecated and will be removed. Use security.crowdsec.enabled instead.")
    }
    // ... rest of existing code ...
}
Phase 2: Frontend Fixes
2.1 Remove Deprecated Mode Toggle
File: frontend/src/pages/CrowdSecConfig.tsx
Remove these sections:
- Lines 69-78 - Remove updateModeMutation:
// DELETE THIS ENTIRE MUTATION
const updateModeMutation = useMutation({
  mutationFn: async (mode: string) => updateSetting('security.crowdsec.mode', mode, 'security', 'string'),
  onSuccess: (_data, mode) => {
    queryClient.invalidateQueries({ queryKey: ['security-status'] })
    toast.success(mode === 'disabled' ? 'CrowdSec disabled' : 'CrowdSec set to Local mode')
  },
  onError: (err: unknown) => {
    const msg = err instanceof Error ? err.message : 'Failed to update mode'
    toast.error(msg)
  },
})
- Lines ~395-420 - Remove the Mode Card from render:
// DELETE THIS ENTIRE CARD
<Card>
  <div className="flex items-center justify-between gap-4 flex-wrap">
    <div className="space-y-1">
      <h2 className="text-lg font-semibold">CrowdSec Mode</h2>
      <p className="text-sm text-gray-400">...</p>
    </div>
    <div className="flex items-center gap-3">
      <span>Disabled</span>
      <Switch checked={isLocalMode} onChange={(e) => handleModeToggle(e.target.checked)} />
      <span>Local</span>
    </div>
  </div>
</Card>
- Replace with informational banner:
<Card>
  <div className="p-4 bg-blue-900/20 border border-blue-700/50 rounded-lg">
    <p className="text-sm text-blue-200">
      CrowdSec is controlled from the <Link to="/security" className="text-blue-400 underline">Security Dashboard</Link>.
      Use the toggle there to enable or disable CrowdSec protection.
    </p>
  </div>
</Card>
2.2 Fix Live Log Viewer
File: frontend/src/components/LiveLogViewer.tsx
Fix 1: Remove isPaused from dependencies (line 148):
// BEFORE:
}, [currentMode, filters, securityFilters, isPaused, maxLogs, showBlockedOnly]);
// AFTER:
}, [currentMode, filters, securityFilters, maxLogs, showBlockedOnly]);
Fix 2: Use ref for pause state in message handler:
// Add ref near other refs (around line 70):
const isPausedRef = useRef(isPaused);
// Sync ref with state (add useEffect around line 95):
useEffect(() => {
  isPausedRef.current = isPaused;
}, [isPaused]);
// Update message handler (lines 110-120):
const handleSecurityMessage = (entry: SecurityLogEntry) => {
  if (!isPausedRef.current) { // Use ref instead of state
    const displayEntry = toDisplayFromSecurity(entry);
    setLogs((prev) => {
      const updated = [...prev, displayEntry];
      return updated.length > maxLogs ? updated.slice(-maxLogs) : updated;
    });
  }
};
Fix 3: Add reconnection retry logic:
// Add state for retry (around line 50):
const [retryCount, setRetryCount] = useState(0);
const maxRetries = 5;
const retryDelay = 2000; // 2 seconds base delay
// Update connection effect (around line 100):
useEffect(() => {
  // ... existing close logic ...
  let retryTimer: ReturnType<typeof setTimeout> | undefined;
  const handleClose = () => {
    console.log(`${currentMode} log viewer disconnected`);
    setIsConnected(false);
    // Schedule retry with exponential backoff (factor 1.5)
    if (retryCount < maxRetries) {
      const delay = retryDelay * Math.pow(1.5, retryCount);
      retryTimer = setTimeout(() => setRetryCount((r) => r + 1), delay);
    }
  };
  // ... rest of effect ...
  return () => {
    // Cancel any pending retry so it cannot fire after unmount
    if (retryTimer !== undefined) {
      clearTimeout(retryTimer);
    }
    if (closeConnectionRef.current) {
      closeConnectionRef.current();
      closeConnectionRef.current = null;
    }
    setIsConnected(false);
  };
}, [currentMode, filters, securityFilters, maxLogs, showBlockedOnly, retryCount]);
// Reset retry count on successful connect:
const handleOpen = () => {
  console.log(`${currentMode} log viewer connected`);
  setIsConnected(true);
  // Reset retry counter. retryCount is an effect dependency, so a reset
  // from a non-zero value re-runs the connection effect once.
  setRetryCount(0);
};
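To see what the retry settings above imply in practice, the delay schedule can be computed on its own. This is a small sketch: `retryDelays` is a hypothetical helper, while the base delay of 2000 ms, the factor of 1.5 and the limit of 5 retries are the values used in Fix 3.

```typescript
// Hypothetical helper that lists the wait before each retry attempt,
// using the same formula as Fix 3: baseMs * factor^attempt.
function retryDelays(baseMs: number, factor: number, maxRetries: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(baseMs * Math.pow(factor, attempt));
  }
  return delays;
}

const schedule = retryDelays(2000, 1.5, 5);
console.log(schedule); // [ 2000, 3000, 4500, 6750, 10125 ]

const totalWaitMs = schedule.reduce((sum, delay) => sum + delay, 0);
console.log(totalWaitMs); // 26375
```

With these values the viewer stops retrying after about 26 seconds of accumulated waiting, which is worth keeping in mind when choosing `maxRetries` for backends that restart slowly.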
2.3 Improve Enrollment LAPI Messaging
File: frontend/src/pages/CrowdSecConfig.tsx
Fix 1: Increase initial delay (line 85):
// BEFORE:
}, 3000) // Wait 3 seconds
// AFTER:
}, 5000) // Wait 5 seconds for LAPI to initialize
Fix 2: Improve warning messages (around lines 200-250):
{/* Show LAPI initializing warning when process running but LAPI not ready */}
{lapiStatusQuery.data && lapiStatusQuery.data.running && !lapiStatusQuery.data.lapi_ready && initialCheckComplete && (
  <div className="flex items-start gap-3 p-4 bg-yellow-900/20 border border-yellow-700/50 rounded-lg">
    <AlertTriangle className="w-5 h-5 text-yellow-400 flex-shrink-0 mt-0.5" />
    <div className="flex-1">
      <p className="text-sm text-yellow-200 font-medium mb-2">
        CrowdSec Local API is initializing...
      </p>
      <p className="text-xs text-yellow-300 mb-3">
        The CrowdSec process is running but LAPI takes 5-10 seconds to become ready.
        Console enrollment will be available once LAPI is ready.
        {lapiStatusQuery.isRefetching && ' Checking status...'}
      </p>
      <Button variant="secondary" size="sm" onClick={() => lapiStatusQuery.refetch()} disabled={lapiStatusQuery.isRefetching}>
        Check Again
      </Button>
    </div>
  </div>
)}
{/* Show not running warning when process not running */}
{lapiStatusQuery.data && !lapiStatusQuery.data.running && initialCheckComplete && (
  <div className="flex items-start gap-3 p-4 bg-red-900/20 border border-red-700/50 rounded-lg">
    <AlertTriangle className="w-5 h-5 text-red-400 flex-shrink-0 mt-0.5" />
    <div className="flex-1">
      <p className="text-sm text-red-200 font-medium mb-2">
        CrowdSec is not running
      </p>
      <p className="text-xs text-red-300 mb-3">
        Enable CrowdSec from the <Link to="/security" className="text-red-400 underline">Security Dashboard</Link> first.
        The process typically takes 5-10 seconds to start and LAPI another 5-10 seconds to initialize.
      </p>
    </div>
  </div>
)}
Phase 3: Cleanup & Testing
3.1 Database Cleanup Migration (Optional)
Create a one-time migration to remove conflicting settings:
-- Remove deprecated mode setting to prevent conflicts
DELETE FROM settings WHERE key = 'security.crowdsec.mode';
3.2 Backend Test Updates
Add test cases for:
- GetStatus returns correct enabled state when only security.crowdsec.enabled is set
- GetStatus returns correct state when deprecated security.crowdsec.mode exists (should be ignored)
- Start() updates settings table
- Stop() updates settings table
3.3 Frontend Test Updates
Add test cases for:
- LiveLogViewer doesn't reconnect when pause toggled
- LiveLogViewer retries connection on disconnect
- CrowdSecConfig doesn't render mode toggle
Test Plan
Manual QA Checklist
- Toggle Test:
  - Go to Security Dashboard
  - Toggle CrowdSec ON
  - Verify card shows "Active"
  - Verify docker exec charon ps aux | grep crowdsec shows process
  - Toggle CrowdSec OFF
  - Verify card shows "Disabled"
  - Verify process stopped
- State Persistence Test:
  - Toggle CrowdSec ON
  - Refresh page
  - Verify toggle still shows ON
  - Check database: SELECT * FROM settings WHERE key LIKE '%crowdsec%'
- Live Logs Test:
  - Go to Security Dashboard
  - Verify "Connected" status appears
  - Generate some traffic
  - Verify logs appear
  - Click "Pause" - verify NO flicker/reconnect
  - Navigate to another page
  - Navigate back
  - Verify reconnection happens (status goes from Disconnected → Connected)
- Enrollment Test:
  - Enable CrowdSec
  - Go to CrowdSecConfig
  - Verify warning shows "LAPI initializing" (not "not running")
  - Wait for LAPI ready
  - Enter enrollment key
  - Click Enroll
  - Verify success
- Deprecated UI Removed:
  - Go to CrowdSecConfig page
  - Verify NO "CrowdSec Mode" card with Disabled/Local toggle
  - Verify informational banner points to Security Dashboard
Integration Test Commands
# Test 1: Backend state consistency
# Enable via API
curl -X POST http://localhost:8080/api/v1/admin/crowdsec/start
# Check settings table
sqlite3 data/charon.db "SELECT * FROM settings WHERE key = 'security.crowdsec.enabled'"
# Expected: value = "true"
# Check status endpoint
curl http://localhost:8080/api/v1/security/status | jq '.crowdsec'
# Expected: {"mode":"local","enabled":true,...}
# Test 2: No deprecated mode conflict
sqlite3 data/charon.db "SELECT * FROM settings WHERE key = 'security.crowdsec.mode'"
# Expected: No rows (or deprecated warning logged)
# Test 3: Disable and verify
curl -X POST http://localhost:8080/api/v1/admin/crowdsec/stop
curl http://localhost:8080/api/v1/security/status | jq '.crowdsec'
# Expected: {"mode":"disabled","enabled":false,...}
sqlite3 data/charon.db "SELECT * FROM settings WHERE key = 'security.crowdsec.enabled'"
# Expected: value = "false"
Implementation Order
| Order | Phase | Task | Priority | Est. Time |
|---|---|---|---|---|
| 1 | 1.1 | Fix GetStatus to ignore deprecated mode | CRITICAL | 15 min |
| 2 | 1.2 | Update Start/Stop to sync settings table | CRITICAL | 20 min |
| 3 | 2.1 | Remove deprecated mode toggle from UI | HIGH | 15 min |
| 4 | 2.2 | Fix LiveLogViewer pause/reconnection | HIGH | 30 min |
| 5 | 2.3 | Improve enrollment LAPI messaging | MEDIUM | 15 min |
| 6 | 1.3 | Add deprecation warning for mode setting | LOW | 10 min |
| 7 | 3.1 | Database cleanup migration | LOW | 10 min |
| 8 | 3.2-3.3 | Update tests | MEDIUM | 30 min |
Total Estimated Time: ~2.5 hours
Success Criteria
- ✅ Toggling CrowdSec ON shows "Active" AND process is actually running
- ✅ Toggling CrowdSec OFF shows "Disabled" AND process is stopped
- ✅ State persists across page refresh
- ✅ No deprecated mode toggle visible on CrowdSecConfig page
- ✅ Live logs show "Connected" when WebSocket connects
- ✅ Pausing logs does NOT cause reconnection
- ✅ Enrollment shows appropriate LAPI status message
- ✅ All existing tests pass
- ✅ No errors in browser console related to CrowdSec
Appendix: File Reference
| Issue | Backend Files | Frontend Files |
|---|---|---|
| Toggle Bug | security_handler.go#L135-148, crowdsec_handler.go#L184-265 | Security.tsx#L65-110 |
| Deprecated Mode | security_handler.go#L143-148 | CrowdSecConfig.tsx#L69-90, L395-420 |
| Live Logs | cerberus_logs_ws.go | LiveLogViewer.tsx#L100-150, logs.ts |
| Enrollment | console_enroll.go#L165-190 | CrowdSecConfig.tsx#L85-120 |