Charon/docs/plans/current_spec.md
GitHub Actions 71e44f79a7 fix: resolve CrowdSec state sync issues and remove deprecated mode toggle
- Backend: Start/Stop handlers now sync both settings and security_configs tables
- Frontend: CrowdSec toggle uses actual process status (crowdsecStatus.running)
- Frontend: Fixed LiveLogViewer WebSocket race condition by using isPausedRef
- Frontend: Removed deprecated mode toggle from CrowdSecConfig page
- Frontend: Added info banner directing users to Security Dashboard
- Frontend: Added "Start CrowdSec" button to enrollment warning panel

Fixes dual-source state conflict causing toggle to show incorrect state.
Fixes live log "disconnected" status appearing while logs stream.
Simplifies CrowdSec control to single source (Security Dashboard toggle).

Includes comprehensive test updates for new architecture.
2025-12-15 23:36:07 +00:00

Comprehensive Bug Analysis: CrowdSec & Live Logs Issues

Date: December 15, 2025
Status: Ready for Implementation


Executive Summary

Four user-reported issues all stem from configuration state synchronization problems between:

  1. The settings table (runtime toggles)
  2. The security_configs table (SecurityConfig model)
  3. The actual CrowdSec process state
  4. Frontend display state

Issue 1: CrowdSec Card Toggle Broken on Cerberus Dashboard

Symptoms

  • CrowdSec card shows "Active" but toggle doesn't work properly
  • Shows "on and active" but CrowdSec is NOT actually on

Root Cause Analysis

Files Involved:

The Problem:

  1. Dual-Source State Conflict: The GetStatus() endpoint in security_handler.go#L61-L137 combines state from TWO sources:

    • settings table: security.crowdsec.enabled and security.crowdsec.mode
    • security_configs table: CrowdSecMode field
  2. Toggle Updates Wrong Store: When the user toggles CrowdSec via crowdsecPowerMutation:

    • It calls updateSetting('security.crowdsec.enabled', ...) which updates the settings table
    • It calls startCrowdsec() / stopCrowdsec() which updates security_configs.CrowdSecMode
  3. State Priority Mismatch: In security_handler.go#L100-L108:

    // CrowdSec enabled override (from settings table)
    if err := h.db.Raw("SELECT value FROM settings WHERE key = ? LIMIT 1", "security.crowdsec.enabled").Scan(&setting).Error; err == nil && setting.Value != "" {
        if strings.EqualFold(setting.Value, "true") {
            crowdSecMode = "local"
        } else {
            crowdSecMode = "disabled"
        }
    }
    

    The settings table overrides security_configs, but the Start() handler updates only security_configs, so a stale settings value can mask the state that Start() just wrote.

  4. Process State Not Verified: The frontend shows "Active" based on status.crowdsec.enabled from the API, but that value is computed from DB settings, not from the actual process status. The crowdsecStatus state (lines 43-44) does fetch the real process status, but it comes from a separate query and is only displayed below the card.

The Fix

Backend (security_handler.go):

  • GetStatus() should check actual CrowdSec process status via the CrowdsecExecutor.Status() call, not just DB state

Frontend (Security.tsx):

  • The toggle's checked state should use crowdsecStatus?.running (actual process state) instead of status.crowdsec.enabled (DB setting)
  • Or sync both states properly after toggle
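
The frontend rule above can be sketched as a pair of pure functions. This is a minimal illustration under stated assumptions, not the project's code: resolveToggleChecked, isOutOfSync, and both interfaces are hypothetical names. Only the running and crowdsec.enabled fields come from this document.

```typescript
// Hypothetical shapes; only `running` and `crowdsec.enabled` appear in the spec.
interface ProcessStatus {
  running: boolean;
}

interface SecurityStatus {
  crowdsec: { enabled: boolean };
}

// Prefer the real process state. Fall back to the DB-derived flag only
// while the process status query has not returned yet.
function resolveToggleChecked(
  processStatus: ProcessStatus | undefined,
  status: SecurityStatus,
): boolean {
  return processStatus?.running ?? status.crowdsec.enabled;
}

// True when the DB-derived flag and the real process state disagree,
// which is the situation that made the card show "Active" incorrectly.
function isOutOfSync(
  processStatus: ProcessStatus | undefined,
  status: SecurityStatus,
): boolean {
  return processStatus !== undefined && processStatus.running !== status.crowdsec.enabled;
}
```

A component could use isOutOfSync to decide whether to refetch both queries after a toggle, though that wiring is left open here.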

Issue 2: Live Log Viewer Shows "Disconnected" But Logs Appear

Symptoms

  • Shows "Disconnected" status badge but logs ARE appearing
  • Navigating away and back causes logs to disappear

Root Cause Analysis

Files Involved:

The Problem:

  1. Connection State Race Condition: In LiveLogViewer.tsx#L165-L240:

    useEffect(() => {
      // Close existing connection
      if (closeConnectionRef.current) {
        closeConnectionRef.current();
        closeConnectionRef.current = null;
      }
      // ... setup handlers ...
      return () => {
        if (closeConnectionRef.current) {
          closeConnectionRef.current();
          closeConnectionRef.current = null;
        }
        setIsConnected(false);  // <-- Issue: cleanup runs AFTER effect re-runs
      };
    }, [currentMode, filters, securityFilters, isPaused, maxLogs, showBlockedOnly]);
    
  2. Dependency Array Includes isPaused: When isPaused changes, the entire effect re-runs, closing the WebSocket and creating a new one. The old connection's teardown can then set isConnected(false) after the new connection's onOpen has set isConnected(true), leaving the badge on "Disconnected" while logs keep streaming.

  3. Logs Disappear on Navigation: The logs state is stored locally in the component via useState<DisplayLogEntry[]>([]). When the component unmounts (navigation) and remounts, state resets to empty array. There's no persistence or caching.

The Fix

LiveLogViewer.tsx:

  1. Fix State Race: Use a ref to track connection state transitions:

    const connectionIdRef = useRef(0);
    // In effect: increment connectionId, check it in callbacks
    
  2. Remove isPaused from Dependencies: Pausing should NOT close/reopen the WebSocket. Instead, just skip adding messages when paused:

    // Current (wrong): isPaused is in the dependency array, so pausing reconnects
    // Fixed: keep the connection open and only filter/process messages based on the isPaused flag
    
  3. Persist Logs Across Navigation: Either:

    • Store logs in React Query cache
    • Use a global store (zustand/context)
    • Accept the limitation with a "Logs cleared on navigation" note
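
The connection-ID idea from step 1 can be illustrated without React. The sketch below is a hedged, framework-free model: ConnectionGuard, makeHandlers, and simulateReconnect are hypothetical names, and setConnected stands in for setIsConnected. It replays the ordering described above, where an old connection reports "closed" after its replacement has already opened.

```typescript
// Hypothetical helper: hands out one token per connection so callbacks can
// check whether they still belong to the newest connection.
class ConnectionGuard {
  private currentId = 0;

  // Call when opening a new connection; all earlier tokens become stale.
  next(): number {
    this.currentId += 1;
    return this.currentId;
  }

  isCurrent(id: number): boolean {
    return id === this.currentId;
  }
}

// Builds open/close callbacks bound to one connection token.
function makeHandlers(
  guard: ConnectionGuard,
  id: number,
  setConnected: (value: boolean) => void,
) {
  return {
    onOpen: () => { if (guard.isCurrent(id)) setConnected(true); },
    onClose: () => { if (guard.isCurrent(id)) setConnected(false); },
  };
}

// Replays the problematic ordering and returns the final connection flag.
function simulateReconnect(): boolean {
  const guard = new ConnectionGuard();
  const state = { connected: false };
  const set = (value: boolean): void => { state.connected = value; };

  const first = makeHandlers(guard, guard.next(), set);
  first.onOpen();
  const second = makeHandlers(guard, guard.next(), set);
  second.onOpen();
  first.onClose(); // stale: ignored because a newer token exists
  return state.connected;
}
```

In a component, the counter would live in a ref (as in connectionIdRef above) so that it survives re-renders.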

Issue 3: DEPRECATED CrowdSec Mode Toggle Still in UI

Symptoms

  • CrowdSec config page shows "Disabled/Local/External" mode toggle
  • This is confusing because CrowdSec should run based SOLELY on the Feature Flag in System Settings

Root Cause Analysis

Files Involved:

The Problem:

  1. Redundant Control Surfaces: There are THREE ways to control CrowdSec:

    • Feature Flag: feature.cerberus.enabled in Settings (System Settings page)
    • Per-Service Toggle: security.crowdsec.enabled in Settings (Security Dashboard)
    • Mode Toggle: CrowdSecMode in SecurityConfig (CrowdSec Config page)
  2. Deprecated UI Still Present: In CrowdSecConfig.tsx#L68-L100:

    <Card>
      <div className="flex items-center justify-between gap-4 flex-wrap">
        <div className="space-y-1">
          <h2 className="text-lg font-semibold">CrowdSec Mode</h2>
          <p className="text-sm text-gray-400">
            {isLocalMode ? 'CrowdSec runs locally...' : 'CrowdSec decisions are paused...'}
          </p>
        </div>
        <div className="flex items-center gap-3">
          <span className="text-sm text-gray-400">Disabled</span>
          <Switch
            checked={isLocalMode}
            onChange={(e) => handleModeToggle(e.target.checked)}
            ...
          />
          <span className="text-sm text-gray-200">Local</span>
        </div>
      </div>
    </Card>
    
  3. isLocalMode Derived from Wrong Source: Line 28:

    const isLocalMode = !!status && status.crowdsec?.mode !== 'disabled'
    

    This checks mode from security_configs.CrowdSecMode, not the feature flag.

  4. handleModeToggle Updates Wrong Setting: Lines 72-77:

    const handleModeToggle = (nextEnabled: boolean) => {
      const mode = nextEnabled ? 'local' : 'disabled'
      updateModeMutation.mutate(mode)  // Updates security.crowdsec.mode in settings
    }
    

The Fix

CrowdSecConfig.tsx:

  1. Remove the Mode Toggle Card entirely (lines 68-100)
  2. Add a notice: "CrowdSec is controlled via the toggle on the Security Dashboard or System Settings"

Backend Cleanup (optional future work):

  • Remove CrowdSecMode field from SecurityConfig model
  • Migrate all state to use only security.crowdsec.enabled setting

Issue 4: Enrollment Shows "CrowdSec is not running"

Symptoms

  • CrowdSec enrollment shows error even when enabled
  • Red warning box: "CrowdSec is not running"

Root Cause Analysis

Files Involved:

The Problem:

  1. LAPI Status Query Uses Wrong Condition: In CrowdSecConfig.tsx#L30-L40:

    const lapiStatusQuery = useQuery<CrowdSecStatus>({
      queryKey: ['crowdsec-lapi-status'],
      queryFn: statusCrowdsec,
      enabled: consoleEnrollmentEnabled && initialCheckComplete,
      refetchInterval: 5000,
      retry: false,
    })
    

    The query is enabled only when consoleEnrollmentEnabled (the feature flag for console enrollment) is true and initialCheckComplete is set.

  2. Warning Shows When Process Not Running: In CrowdSecConfig.tsx#L172-L196:

    {lapiStatusQuery.data && !lapiStatusQuery.data.running && initialCheckComplete && (
      <div className="..." data-testid="lapi-not-running-warning">
        <p>CrowdSec is not running</p>
        ...
      </div>
    )}
    

    This shows when lapiStatusQuery.data.running === false.

  3. Status Check May Return Stale Data: The Status() backend handler checks:

    • PID file existence
    • Process status via kill -0
    • LAPI health via cscli lapi status

    But if CrowdSec was just enabled, there may be a race condition where the settings say "enabled" but the process hasn't started yet.

  4. Startup Reconciliation Timing: ReconcileCrowdSecOnStartup() in crowdsec_startup.go runs at container start, but if the user enables CrowdSec AFTER startup, the process won't auto-start.

The Fix

CrowdSecConfig.tsx:

  1. Improve Warning Message: The "not running" warning should include:

    • A "Start CrowdSec" button that calls startCrowdsec() API
    • Or a link to the Security Dashboard where the toggle is
  2. Check Both States: Show the warning only when:

    • User has enabled CrowdSec (via either toggle)
    • AND the process is not running
  3. Add Auto-Retry: After enabling CrowdSec, poll status more aggressively for 30 seconds
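
Steps 2 and 3 reduce to two small decisions, sketched below. Treat this as an illustration rather than the final implementation: the function names are hypothetical, and the 1-second fast interval is an assumed value (the spec only says "more aggressively for 30 seconds"). The 5-second baseline matches the existing refetchInterval.

```typescript
// Baseline interval taken from the existing query (refetchInterval: 5000).
const NORMAL_INTERVAL_MS = 5000;
// Assumed value for illustration; the spec does not fix a fast interval.
const FAST_INTERVAL_MS = 1000;
// "Poll more aggressively for 30 seconds" after enabling.
const FAST_WINDOW_MS = 30000;

// Show the warning only when the user asked for CrowdSec and the process is down.
function shouldShowNotRunningWarning(userEnabled: boolean, processRunning: boolean): boolean {
  return userEnabled && !processRunning;
}

// Poll quickly right after an enable request, then fall back to the normal cadence.
// `enabledAtMs` is null when no enable request has been made in this session.
function refetchIntervalMs(
  nowMs: number,
  enabledAtMs: number | null,
  processRunning: boolean,
): number {
  if (enabledAtMs === null || processRunning) return NORMAL_INTERVAL_MS;
  const elapsed = nowMs - enabledAtMs;
  return elapsed >= 0 && elapsed < FAST_WINDOW_MS ? FAST_INTERVAL_MS : NORMAL_INTERVAL_MS;
}
```

If the installed version of the query library accepts a function for refetchInterval, refetchIntervalMs could be called from it. Otherwise the computed value can be kept in component state.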


Implementation Plan

Phase 1: Backend Fixes (Priority: High)

1.1 Unify State Source

File: backend/internal/api/handlers/security_handler.go

Change: Modify GetStatus() to include actual process status:

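// Add after line 137:
// Check actual CrowdSec process status.
// crowdsecProcessRunning is assumed to be declared earlier in GetStatus().
if h.crowdsecExecutor != nil {
    ctx := c.Request.Context()
    // Discard the PID: Go rejects unused variables, and only the running flag is needed here
    running, _, err := h.crowdsecExecutor.Status(ctx, h.dataDir)
    // Override enabled state based on actual process; a failed check counts as not running
    crowdsecProcessRunning = err == nil && running
}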

Add crowdsecExecutor field to SecurityHandler struct and inject it during initialization.

1.2 Consistent Mode Updates

File: backend/internal/api/handlers/crowdsec_handler.go

Change: In Start() and Stop(), also update the settings table:

// In Start(), after updating SecurityConfig (line ~165):
if h.DB != nil {
    setting := models.Setting{Key: "security.crowdsec.enabled", Value: "true", Category: "security", Type: "bool"}
    h.DB.Where(models.Setting{Key: "security.crowdsec.enabled"}).Assign(setting).FirstOrCreate(&setting)
}

// In Stop(), after updating SecurityConfig (line ~228):
if h.DB != nil {
    setting := models.Setting{Key: "security.crowdsec.enabled", Value: "false", Category: "security", Type: "bool"}
    h.DB.Where(models.Setting{Key: "security.crowdsec.enabled"}).Assign(setting).FirstOrCreate(&setting)
}

Phase 2: Frontend Fixes (Priority: High)

2.1 Fix CrowdSec Toggle State

File: frontend/src/pages/Security.tsx

Change 1: Use actual process status for toggle (around line 203):

// Replace: checked={status.crowdsec.enabled}
// With:
checked={crowdsecStatus?.running ?? status.crowdsec.enabled}

Change 2: After successful toggle, refetch both status and process status

2.2 Fix LiveLogViewer Connection State

File: frontend/src/components/LiveLogViewer.tsx

Change 1: Remove isPaused from useEffect dependencies (line 237):

// Change from:
}, [currentMode, filters, securityFilters, isPaused, maxLogs, showBlockedOnly]);
// To:
}, [currentMode, filters, securityFilters, maxLogs, showBlockedOnly]);

Change 2: Handle pause inside message handler (line 192):

const handleMessage = (entry: SecurityLogEntry) => {
  // isPaused check stays here, not in effect
  if (isPausedRef.current) return;  // Use ref instead of state
  // ... rest of handler
};

Change 3: Add ref for isPaused:

const isPausedRef = useRef(isPaused);
useEffect(() => { isPausedRef.current = isPaused; }, [isPaused]);
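
To show why the ref avoids reconnecting, here is a framework-free sketch of the same pattern. The Ref interface stands in for the object returned by useRef. createMessageHandler and demoPause are hypothetical names, and the trimming to maxLogs is an assumption about how the existing handler caps its buffer.

```typescript
// Stand-in for React's useRef result: a mutable box that callbacks can read.
interface Ref<T> {
  current: T;
}

// Creates the handler once. It reads the latest pause flag on every call,
// so toggling pause never requires rebuilding the handler or the connection.
function createMessageHandler<T>(
  isPausedRef: Ref<boolean>,
  sink: T[],
  maxLogs: number,
): (entry: T) => void {
  return (entry: T): void => {
    if (isPausedRef.current) return; // drop entries while paused
    sink.push(entry);
    // Assumed behavior: keep only the newest maxLogs entries.
    if (sink.length > maxLogs) sink.splice(0, sink.length - maxLogs);
  };
}

// Simulates pause and resume on a single long-lived handler.
function demoPause(): string[] {
  const isPausedRef: Ref<boolean> = { current: false };
  const logs: string[] = [];
  const handle = createMessageHandler(isPausedRef, logs, 100);
  handle("a");
  isPausedRef.current = true;
  handle("b"); // skipped while paused
  isPausedRef.current = false;
  handle("c");
  return logs;
}
```

Because the handler closes over the ref rather than the isPaused value, the effect that owns the WebSocket no longer needs isPaused in its dependency array.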

2.3 Remove Deprecated Mode Toggle

File: frontend/src/pages/CrowdSecConfig.tsx

Change: Remove the entire "CrowdSec Mode" Card (lines 291-311 in current render):

// DELETE: The entire <Card> block containing "CrowdSec Mode"

Add informational banner instead:

{/* Replace mode toggle with info banner */}
<div className="bg-blue-900/20 border border-blue-700 rounded-lg p-4">
  <p className="text-sm text-blue-200">
    <strong>Note:</strong> CrowdSec is controlled via the toggle on the{' '}
    <Link to="/security" className="underline">Security Dashboard</Link>.
    Enable/disable CrowdSec there, then configure presets and files here.
  </p>
</div>

2.4 Fix Enrollment Warning

File: frontend/src/pages/CrowdSecConfig.tsx

Change: Add "Start CrowdSec" button to the warning (around line 185):

<Button
  variant="primary"
  size="sm"
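  onClick={async () => {
    // Show progress before the request so the toast is visible while CrowdSec starts
    toast.info('Starting CrowdSec...');
    try {
      await startCrowdsec();
      // Refresh LAPI status so the warning clears once the process is up
      await lapiStatusQuery.refetch();
    } catch {
      toast.error('Failed to start CrowdSec');
    }
  }}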
>
  Start CrowdSec
</Button>

Phase 3: Remove Deprecated Mode (Priority: Medium)

3.1 Backend Model Cleanup (Future)

File: backend/internal/models/security_config.go

Mark CrowdSecMode as deprecated with migration path.

3.2 Settings Migration

Create migration to ensure all users have security.crowdsec.enabled setting derived from CrowdSecMode.
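
The mapping such a migration would apply can be sketched as follows. This is only an illustration of the decision logic, written in TypeScript for consistency with the other sketches. The real migration would live in the backend, and SettingRow, enabledFromMode, and migrateSetting are hypothetical names. Two assumptions are baked in: "local" and "external" both count as enabled, and an existing setting is never overwritten.

```typescript
// Hypothetical row shape for the settings table.
interface SettingRow {
  key: string;
  value: string;
}

// Assumption: "local" and "external" mean enabled; anything else
// (including a missing or empty mode) is treated as disabled.
function enabledFromMode(mode: string | undefined): boolean {
  return mode === "local" || mode === "external";
}

// Returns the row to insert, or null when an explicit setting already exists.
// Assumption: an existing value wins, since settings already override security_configs.
function migrateSetting(
  mode: string | undefined,
  existing: SettingRow | undefined,
): SettingRow | null {
  if (existing !== undefined) return null;
  return {
    key: "security.crowdsec.enabled",
    value: enabledFromMode(mode) ? "true" : "false",
  };
}
```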


Files to Modify Summary

Backend

| File | Changes |
|------|---------|
| backend/internal/api/handlers/security_handler.go | Add process status check to GetStatus() |
| backend/internal/api/handlers/crowdsec_handler.go | Sync settings table in Start()/Stop() |

Frontend

| File | Changes |
|------|---------|
| frontend/src/pages/Security.tsx | Use crowdsecStatus?.running for toggle state |
| frontend/src/components/LiveLogViewer.tsx | Fix isPaused dependency, use ref |
| frontend/src/pages/CrowdSecConfig.tsx | Remove mode toggle, add info banner, add "Start CrowdSec" button |

Testing Checklist

  • Toggle CrowdSec on Security Dashboard → verify process starts
  • Toggle CrowdSec off → verify process stops
  • Refresh page → verify toggle state matches process state
  • Open LiveLogViewer → verify "Connected" status
  • Pause logs → verify connection remains open
  • Navigate away and back → logs are cleared (expected) but connection re-establishes
  • CrowdSec Config page → no mode toggle, info banner present
  • Enrollment section → shows "Start CrowdSec" button when process not running