Charon/docs/plans/crowdsec_toggle_fix_plan.md

# CrowdSec Toggle Integration Fix Plan

**Date**: December 15, 2025
**Issue**: CrowdSec toggle stuck ON, reconciliation silently exits, process not starting
**Root Cause**: Database disconnect between frontend (Settings table) and reconciliation (SecurityConfig table)

---

## Executive Summary

The CrowdSec toggle shows "ON" but the process is NOT running. The reconciliation function silently exits without starting CrowdSec because:

1. **Frontend writes to Settings table** (`security.crowdsec.enabled`)
2. **Backend reconciliation reads from SecurityConfig table** (`crowdsec_mode = "local"`)
3. **No synchronization** between the two tables
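4. **Auto-initialization code exists** (lines 46-71 in `crowdsec_startup.go`) but always creates the config with `crowdsec_mode = "disabled"`
5. **Auto-initialization then returns early**, so reconciliation never reads the Settings table, never starts the process, and logs nothing that explains why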

---

## Root Cause Analysis (DETAILED)

### Evidence Trail

**Container Logs Show Silent Exit**:
```
{"bin_path":"crowdsec","data_dir":"/app/data/crowdsec","level":"info","msg":"CrowdSec reconciliation: starting startup check","time":"2025-12-14T23:32:33-05:00"}
[NO FURTHER LOGS - Function exited here]
```

**Database State on Fresh Start**:
```
SELECT * FROM security_configs → record not found
{"level":"info","msg":"CrowdSec reconciliation: no SecurityConfig found, creating default config"}
```

**Process Check**:
```bash
$ docker exec charon ps aux | grep -i crowdsec
[NO RESULTS - Process not running]
```

### Why Reconciliation Exits Silently

**FILE**: `backend/internal/services/crowdsec_startup.go`

**Execution Flow**:
```
1. User clicks toggle ON in Security.tsx
2. Frontend calls updateSetting('security.crowdsec.enabled', 'true')
3. Settings table updated → security.crowdsec.enabled = "true"
4. Frontend calls startCrowdsec() → Handler updates SecurityConfig
5. CrowdSec starts successfully, toggle shows ON
6. Container restarts (docker restart or reboot)
7. ReconcileCrowdSecOnStartup() executes at line 26:

   Line 44: db.First(&cfg) → returns gorm.ErrRecordNotFound

   Lines 46-71: Auto-initialization block executes:
     - Creates SecurityConfig with crowdsec_mode = "disabled"
     - Logs "default SecurityConfig created successfully"
     - Returns early (line 70) WITHOUT checking Settings table
     - CrowdSec is NEVER started

   Result: Toggle shows "ON" (Settings table), but process is "OFF" (not running)
```

**THE BUG (Lines 46-71)**:
```go
if err == gorm.ErrRecordNotFound {
    // AUTO-INITIALIZE: Create default SecurityConfig on first startup
    logger.Log().Info("CrowdSec reconciliation: no SecurityConfig found, creating default config")

    defaultCfg := models.SecurityConfig{
        UUID:             "default",
        Name:             "Default Security Config",
        Enabled:          false,
        CrowdSecMode:     "disabled",  // ← PROBLEM: Ignores Settings table state
        WAFMode:          "disabled",
        WAFParanoiaLevel: 1,
        RateLimitMode:    "disabled",
        RateLimitBurst:   10,
        RateLimitRequests: 100,
        RateLimitWindowSec: 60,
    }

    if err := db.Create(&defaultCfg).Error; err != nil {
        logger.Log().WithError(err).Error("CrowdSec reconciliation: failed to create default SecurityConfig")
        return
    }

    logger.Log().Info("CrowdSec reconciliation: default SecurityConfig created successfully")
    // Don't start CrowdSec on fresh install - user must enable via UI
    return  // ← EXITS WITHOUT checking Settings table or starting process
}
```

**Why This Causes the Issue**:

1. **First Container Start**: User enables CrowdSec via toggle
   - Settings: `security.crowdsec.enabled = "true"` ✅
   - SecurityConfig: `crowdsec_mode = "local"` ✅ (via Start handler)
   - Process: Running ✅

2. **Container Restart**: Database persists but SecurityConfig table may be empty (migration issue or corruption)
   - Reconciliation runs
   - SecurityConfig table: **EMPTY** (record lost or never migrated)
   - Auto-init creates SecurityConfig with `crowdsec_mode = "disabled"`
   - Returns early without checking Settings table
   - Settings: Still shows `"true"` (UI says ON)
   - SecurityConfig: Says `"disabled"` (reconciliation source)
   - Process: NOT started ❌

3. **Result**: **State Mismatch**
   - Frontend toggle: **ON** (reads Settings table)
   - Backend reconciliation: **OFF** (reads SecurityConfig table)
   - Process: **NOT RUNNING** (reconciliation didn't start it)
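
The mismatch above can be expressed as a small consistency check. The following is a minimal sketch, not code from the Charon repository: `toggleState` and `drift` are hypothetical names, and the sketch assumes the Settings value, the SecurityConfig mode, and the process status have already been read.

```go
package main

import "fmt"

// toggleState is a hypothetical snapshot of the three places this plan compares.
type toggleState struct {
	SettingEnabled bool   // settings: security.crowdsec.enabled == "true"
	CrowdSecMode   string // security_configs.crowdsec_mode ("local" or "disabled")
	Running        bool   // whether the crowdsec process is alive
}

// drift lists every disagreement between the Settings table,
// the SecurityConfig table, and the actual process state.
func drift(s toggleState) []string {
	var issues []string
	configEnabled := s.CrowdSecMode == "local"
	if s.SettingEnabled != configEnabled {
		issues = append(issues, fmt.Sprintf(
			"settings enabled=%t but crowdsec_mode=%q", s.SettingEnabled, s.CrowdSecMode))
	}
	if s.SettingEnabled != s.Running {
		issues = append(issues, fmt.Sprintf(
			"toggle shows enabled=%t but process running=%t", s.SettingEnabled, s.Running))
	}
	return issues
}

func main() {
	// State after the buggy auto-init: UI says ON, config says disabled, nothing runs.
	afterRestart := toggleState{SettingEnabled: true, CrowdSecMode: "disabled", Running: false}
	for _, issue := range drift(afterRestart) {
		fmt.Println(issue)
	}
}
```

An empty result means all three sources agree; the post-restart state described above yields two issues.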

---

## Current Code Analysis

### 1. Reconciliation Function (crowdsec_startup.go)

**Location**: `backend/internal/services/crowdsec_startup.go`

**Lines 44-71 (Auto-initialization - THE BUG)**:
```go
var cfg models.SecurityConfig
if err := db.First(&cfg).Error; err != nil {
    if err == gorm.ErrRecordNotFound {
        // AUTO-INITIALIZE: Create default SecurityConfig on first startup
        logger.Log().Info("CrowdSec reconciliation: no SecurityConfig found, creating default config")

        defaultCfg := models.SecurityConfig{
            UUID:             "default",
            Name:             "Default Security Config",
            Enabled:          false,
            CrowdSecMode:     "disabled",  // ← IGNORES Settings table
            WAFMode:          "disabled",
            WAFParanoiaLevel: 1,
            RateLimitMode:    "disabled",
            RateLimitBurst:   10,
            RateLimitRequests: 100,
            RateLimitWindowSec: 60,
        }

        if err := db.Create(&defaultCfg).Error; err != nil {
            logger.Log().WithError(err).Error("CrowdSec reconciliation: failed to create default SecurityConfig")
            return
        }

        logger.Log().Info("CrowdSec reconciliation: default SecurityConfig created successfully")
        // Don't start CrowdSec on fresh install - user must enable via UI
        return  // ← EARLY EXIT - Never checks Settings table
    }
    logger.Log().WithError(err).Warn("CrowdSec reconciliation: failed to read SecurityConfig")
    return
}
```

**Lines 74-90 (Runtime Setting Override - UNREACHABLE after auto-init)**:
```go
// Also check for runtime setting override in settings table
var settingOverride struct{ Value string }
crowdSecEnabled := false
if err := db.Raw("SELECT value FROM settings WHERE key = ? LIMIT 1", "security.crowdsec.enabled").Scan(&settingOverride).Error; err == nil && settingOverride.Value != "" {
    crowdSecEnabled = strings.EqualFold(settingOverride.Value, "true")
    logger.Log().WithFields(map[string]interface{}{
        "setting_value":    settingOverride.Value,
        "crowdsec_enabled": crowdSecEnabled,
    }).Debug("CrowdSec reconciliation: found runtime setting override")
}
```

**This code is NEVER REACHED** when SecurityConfig doesn't exist because line 70 returns early!

**Lines 91-98 (Decision Logic)**:
```go
// Only auto-start if CrowdSecMode is "local" OR runtime setting is enabled
if cfg.CrowdSecMode != "local" && !crowdSecEnabled {
    logger.Log().WithFields(map[string]interface{}{
        "db_mode":         cfg.CrowdSecMode,
        "setting_enabled": crowdSecEnabled,
    }).Debug("CrowdSec reconciliation skipped: mode is not 'local' and setting not enabled")
    return
}
```

**Also UNREACHABLE** during auto-init scenario!

### 2. Start Handler (crowdsec_handler.go)

**Location**: `backend/internal/api/handlers/crowdsec_handler.go`

**Lines 167-192 - CORRECT IMPLEMENTATION**:
```go
func (h *CrowdsecHandler) Start(c *gin.Context) {
    ctx := c.Request.Context()

    // UPDATE SecurityConfig to persist user's intent
    var cfg models.SecurityConfig
    if err := h.DB.First(&cfg).Error; err != nil {
        if err == gorm.ErrRecordNotFound {
            // Create default config with CrowdSec enabled
            cfg = models.SecurityConfig{
                UUID:         "default",
                Name:         "Default Security Config",
                Enabled:      true,
                CrowdSecMode: "local",  // ← CORRECT: Sets mode to "local"
            }
            if err := h.DB.Create(&cfg).Error; err != nil {
                logger.Log().WithError(err).Error("Failed to create SecurityConfig")
                c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to persist configuration"})
                return
            }
        } else {
            logger.Log().WithError(err).Error("Failed to read SecurityConfig")
            c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to read configuration"})
            return
        }
    } else {
        // Update existing config
        cfg.CrowdSecMode = "local"
        cfg.Enabled = true
        if err := h.DB.Save(&cfg).Error; err != nil {
            logger.Log().WithError(err).Error("Failed to update SecurityConfig")
            c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to persist configuration"})
            return
        }
    }

    // Start the process...
}
```

**Analysis**: This is CORRECT. The Start handler properly updates SecurityConfig when user clicks "Start" from the CrowdSec config page (/security/crowdsec).

### 3. Frontend Toggle (Security.tsx)

**Location**: `frontend/src/pages/Security.tsx`

**Lines 64-120 - THE DISCONNECT**:
```tsx
const crowdsecPowerMutation = useMutation({
  mutationFn: async (enabled: boolean) => {
    // Step 1: Update Settings table
    await updateSetting('security.crowdsec.enabled', enabled ? 'true' : 'false', 'security', 'bool')

    if (enabled) {
      // Step 2: Call Start() which updates SecurityConfig
      const result = await startCrowdsec()

      // Step 3: Verify running
      const status = await statusCrowdsec()
      if (!status.running) {
        await updateSetting('security.crowdsec.enabled', 'false', 'security', 'bool')
        throw new Error('CrowdSec process failed to start')
      }

      return result
    } else {
      // Step 2: Call Stop(), which updates SecurityConfig only if a record already exists
      await stopCrowdsec()

      // Step 3: Verify stopped
      await new Promise(resolve => setTimeout(resolve, 500))
      const status = await statusCrowdsec()
      if (status.running) {
        throw new Error('CrowdSec process still running')
      }

      return { enabled: false }
    }
  },
})
```

**Analysis**:
- **Enable Path**: Updates Settings → Calls Start() → Start() updates SecurityConfig → ✅ Both tables synced
- **Disable Path**: Updates Settings → Calls Stop() → Stop() **does NOT always update SecurityConfig** → ❌ Tables out of sync

Looking at the Stop handler:
```go
func (h *CrowdsecHandler) Stop(c *gin.Context) {
    ctx := c.Request.Context()
    if err := h.Executor.Stop(ctx, h.DataDir); err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
        return
    }

    // UPDATE SecurityConfig to persist user's intent
    var cfg models.SecurityConfig
    if err := h.DB.First(&cfg).Error; err == nil {
        cfg.CrowdSecMode = "disabled"
        cfg.Enabled = false
        if err := h.DB.Save(&cfg).Error; err != nil {
            logger.Log().WithError(err).Warn("Failed to update SecurityConfig after stopping CrowdSec")
        }
    }

    c.JSON(http.StatusOK, gin.H{"status": "stopped"})
}
```

**This IS CORRECT** - Stop() handler updates SecurityConfig when it can find it. BUT:

**Scenario Where It Fails**:
1. SecurityConfig table gets corrupted/cleared/migrated incorrectly
2. User clicks toggle OFF
3. Stop() tries to update SecurityConfig → record not found → skips update
4. Settings table still updated to "false"
5. Container restarts → auto-init creates SecurityConfig with "disabled"
6. Both tables say "disabled" but UI might show stale state

---

## Comprehensive Fix Strategy

### Phase 1: Fix Auto-Initialization (CRITICAL - IMMEDIATE)

**FILE**: `backend/internal/services/crowdsec_startup.go`

**CHANGE**: Lines 46-71 (auto-initialization block)

**AFTER** (with Settings table check):
```go
if err == gorm.ErrRecordNotFound {
    // AUTO-INITIALIZE: Create default SecurityConfig by checking Settings table
    logger.Log().Info("CrowdSec reconciliation: no SecurityConfig found, checking Settings table for user preference")

    // Check if user has already enabled CrowdSec via Settings table (from toggle or legacy config)
    var settingOverride struct{ Value string }
    crowdSecEnabledInSettings := false
    if err := db.Raw("SELECT value FROM settings WHERE key = ? LIMIT 1", "security.crowdsec.enabled").Scan(&settingOverride).Error; err == nil && settingOverride.Value != "" {
        crowdSecEnabledInSettings = strings.EqualFold(settingOverride.Value, "true")
        logger.Log().WithFields(map[string]interface{}{
            "setting_value": settingOverride.Value,
            "enabled":       crowdSecEnabledInSettings,
        }).Info("CrowdSec reconciliation: found existing Settings table preference")
    }

    // Create SecurityConfig that matches Settings table state
    crowdSecMode := "disabled"
    if crowdSecEnabledInSettings {
        crowdSecMode = "local"
    }

    defaultCfg := models.SecurityConfig{
        UUID:               "default",
        Name:               "Default Security Config",
        Enabled:            crowdSecEnabledInSettings,
        CrowdSecMode:       crowdSecMode,  // ← NOW RESPECTS Settings table
        WAFMode:            "disabled",
        WAFParanoiaLevel:   1,
        RateLimitMode:      "disabled",
        RateLimitBurst:     10,
        RateLimitRequests:  100,
        RateLimitWindowSec: 60,
    }

    if err := db.Create(&defaultCfg).Error; err != nil {
        logger.Log().WithError(err).Error("CrowdSec reconciliation: failed to create default SecurityConfig")
        return
    }

    logger.Log().WithFields(map[string]interface{}{
        "crowdsec_mode": defaultCfg.CrowdSecMode,
        "enabled":       defaultCfg.Enabled,
        "source":        "settings_table",
    }).Info("CrowdSec reconciliation: default SecurityConfig created from Settings preference")

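    // Continue to process the config (DON'T return early)
    cfg = defaultCfg
} else {
    // Any other read error still aborts reconciliation, as before
    logger.Log().WithError(err).Warn("CrowdSec reconciliation: failed to read SecurityConfig")
    return
}
```

**KEY CHANGES**:
1. **Check Settings table** during auto-initialization
2. **Create SecurityConfig matching Settings state** (not hardcoded "disabled")
3. **Don't return early** - let the rest of the function process the config
4. **Assign to cfg variable** so flow continues to line 74+
5. **Move the generic read-error handling into an `else` branch** - otherwise the existing warn-and-return that follows the not-found block would still exit on the auto-init path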
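
The mapping from the Settings value to the auto-created SecurityConfig can be isolated as a pure function, which makes the three cases (enabled, disabled, no entry) easy to unit test. This is a minimal sketch under assumptions: `modeFromSetting` is a hypothetical helper, not an existing function in `crowdsec_startup.go`.

```go
package main

import (
	"fmt"
	"strings"
)

// modeFromSetting maps the raw Settings value to the SecurityConfig fields
// used during auto-initialization. found is false when the Settings row
// does not exist. Hypothetical helper, for illustration only.
func modeFromSetting(value string, found bool) (mode string, enabled bool) {
	if found && strings.EqualFold(value, "true") {
		return "local", true
	}
	return "disabled", false
}

func main() {
	cases := []struct {
		value string
		found bool
	}{
		{"true", true},  // toggle was ON
		{"TRUE", true},  // comparison is case-insensitive, like strings.EqualFold above
		{"false", true}, // toggle was OFF
		{"", false},     // fresh install: no Settings row
	}
	for _, c := range cases {
		mode, enabled := modeFromSetting(c.value, c.found)
		fmt.Printf("value=%q found=%t -> mode=%s enabled=%t\n", c.value, c.found, mode, enabled)
	}
}
```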

### Phase 2: Enhance Logging (IMMEDIATE)

**FILE**: `backend/internal/services/crowdsec_startup.go`

**CHANGE**: Lines 91-98 (decision logic - better logging)

**AFTER**:
```go
// Start when EITHER SecurityConfig has mode="local" OR Settings table has enabled=true
// Exit only when BOTH are disabled
if cfg.CrowdSecMode != "local" && !crowdSecEnabled {
    logger.Log().WithFields(map[string]interface{}{
        "db_mode":         cfg.CrowdSecMode,
        "setting_enabled": crowdSecEnabled,
    }).Info("CrowdSec reconciliation skipped: both SecurityConfig and Settings indicate disabled")
    return
}

// Log which source triggered the start
if cfg.CrowdSecMode == "local" {
    logger.Log().WithField("mode", cfg.CrowdSecMode).Info("CrowdSec reconciliation: starting based on SecurityConfig mode='local'")
} else if crowdSecEnabled {
    logger.Log().WithField("setting", "true").Info("CrowdSec reconciliation: starting based on Settings table override")
}
```

**KEY CHANGES**:
1. **Change log level** from Debug to Info (so we see it in logs)
2. **Add source attribution** (which table triggered the start)
3. **Clarify condition** (exit only when BOTH are disabled)
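
The decision and its source attribution can also be captured in one testable function. The sketch below is illustrative only: `decide` and the source labels (`security_config`, `settings_override`, `none`) are assumed names, chosen to mirror the two log messages above.

```go
package main

import "fmt"

// decide mirrors the Phase 2 condition: start when EITHER source says enabled,
// and report which source triggered the start. SecurityConfig is checked first,
// matching the order of the log statements in the plan.
func decide(crowdSecMode string, settingEnabled bool) (start bool, source string) {
	switch {
	case crowdSecMode == "local":
		return true, "security_config"
	case settingEnabled:
		return true, "settings_override"
	default:
		return false, "none"
	}
}

func main() {
	inputs := []struct {
		mode    string
		setting bool
	}{
		{"local", false},
		{"disabled", true},
		{"local", true},
		{"disabled", false},
	}
	for _, in := range inputs {
		start, source := decide(in.mode, in.setting)
		fmt.Printf("mode=%s setting=%t -> start=%t source=%s\n", in.mode, in.setting, start, source)
	}
}
```

Only the last input (both sources disabled) skips the start, which is the single case the Info-level "skipped" message should report.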

### Phase 3: Add Unified Toggle Endpoint (OPTIONAL BUT RECOMMENDED)

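**WHY**: The toggle currently issues two separate requests: one updates Settings, then Start/Stop updates SecurityConfig. A failure or restart between the two leaves the tables out of sync. A single endpoint that writes both tables in one transaction closes that window.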

**FILE**: `backend/internal/api/handlers/crowdsec_handler.go`

**ADD**: New method (after Stop(), around line 260)

```go
// ToggleCrowdSec enables or disables CrowdSec, synchronizing Settings and SecurityConfig atomically
func (h *CrowdsecHandler) ToggleCrowdSec(c *gin.Context) {
    var payload struct {
        Enabled bool `json:"enabled"`
    }
    if err := c.ShouldBindJSON(&payload); err != nil {
        c.JSON(http.StatusBadRequest, gin.H{"error": "invalid payload"})
        return
    }

    logger.Log().WithField("enabled", payload.Enabled).Info("CrowdSec toggle: received request")

    // Use a transaction to ensure Settings and SecurityConfig stay in sync
    tx := h.DB.Begin()
    defer func() {
        if r := recover(); r != nil {
            tx.Rollback()
        }
    }()

    // STEP 1: Update Settings table
    settingKey := "security.crowdsec.enabled"
    settingValue := "false"
    if payload.Enabled {
        settingValue = "true"
    }

    // Look up by key only. Passing the full struct as a FirstOrCreate condition
    // would also match on value, miss the existing row, and try to insert a duplicate key.
    var settingModel models.Setting
    if err := tx.Where(models.Setting{Key: settingKey}).
        Attrs(models.Setting{Value: settingValue, Type: "bool", Category: "security"}).
        FirstOrCreate(&settingModel).Error; err != nil {
        tx.Rollback()
        logger.Log().WithError(err).Error("CrowdSec toggle: failed to update Settings table")
        c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to update settings"})
        return
    }
    // Attrs only applies on create, so set the value explicitly for existing rows
    settingModel.Value = settingValue
    if err := tx.Save(&settingModel).Error; err != nil {
        tx.Rollback()
        logger.Log().WithError(err).Error("CrowdSec toggle: failed to save Settings table")
        c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to update settings"})
        return
    }

    // STEP 2: Update SecurityConfig table
    var cfg models.SecurityConfig
    if err := tx.First(&cfg).Error; err != nil {
        if err == gorm.ErrRecordNotFound {
            // Create config matching toggle state
            crowdSecMode := "disabled"
            if payload.Enabled {
                crowdSecMode = "local"
            }

            cfg = models.SecurityConfig{
                UUID:               "default",
                Name:               "Default Security Config",
                Enabled:            payload.Enabled,
                CrowdSecMode:       crowdSecMode,
                WAFMode:            "disabled",
                WAFParanoiaLevel:   1,
                RateLimitMode:      "disabled",
                RateLimitBurst:     10,
                RateLimitRequests:  100,
                RateLimitWindowSec: 60,
            }
            if err := tx.Create(&cfg).Error; err != nil {
                tx.Rollback()
                logger.Log().WithError(err).Error("CrowdSec toggle: failed to create SecurityConfig")
                c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to persist configuration"})
                return
            }
        } else {
            tx.Rollback()
            logger.Log().WithError(err).Error("CrowdSec toggle: failed to read SecurityConfig")
            c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to read configuration"})
            return
        }
    } else {
        // Update existing config
        if payload.Enabled {
            cfg.CrowdSecMode = "local"
            cfg.Enabled = true
        } else {
            cfg.CrowdSecMode = "disabled"
            cfg.Enabled = false
        }
        if err := tx.Save(&cfg).Error; err != nil {
            tx.Rollback()
            logger.Log().WithError(err).Error("CrowdSec toggle: failed to update SecurityConfig")
            c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to persist configuration"})
            return
        }
    }

    // Commit the transaction before starting/stopping process
    if err := tx.Commit().Error; err != nil {
        logger.Log().WithError(err).Error("CrowdSec toggle: transaction commit failed")
        c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to commit changes"})
        return
    }

    logger.Log().WithFields(map[string]interface{}{
        "enabled":       cfg.Enabled,
        "crowdsec_mode": cfg.CrowdSecMode,
    }).Info("CrowdSec toggle: synchronized Settings and SecurityConfig successfully")

    // STEP 3: Start or stop the process
    ctx := c.Request.Context()
    if payload.Enabled {
        // Start CrowdSec
        pid, err := h.Executor.Start(ctx, h.BinPath, h.DataDir)
        if err != nil {
            logger.Log().WithError(err).Error("CrowdSec toggle: failed to start process, reverting DB changes")

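            // Revert both tables in a new transaction; log revert failures instead of ignoring them
            cfg.CrowdSecMode = "disabled"
            cfg.Enabled = false
            settingModel.Value = "false"
            if revertErr := h.DB.Transaction(func(rtx *gorm.DB) error {
                if err := rtx.Save(&cfg).Error; err != nil {
                    return err
                }
                return rtx.Save(&settingModel).Error
            }); revertErr != nil {
                logger.Log().WithError(revertErr).Error("CrowdSec toggle: failed to revert DB changes")
            }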

            c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
            return
        }

        // Wait for LAPI readiness
        lapiReady := false
        maxWait := 30 * time.Second
        pollInterval := 500 * time.Millisecond
        deadline := time.Now().Add(maxWait)

        for time.Now().Before(deadline) {
            args := []string{"lapi", "status"}
            if _, err := os.Stat(filepath.Join(h.DataDir, "config.yaml")); err == nil {
                args = append([]string{"-c", filepath.Join(h.DataDir, "config.yaml")}, args...)
            }

            checkCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
            _, err := h.CmdExec.Execute(checkCtx, "cscli", args...)
            cancel()

            if err == nil {
                lapiReady = true
                break
            }

            time.Sleep(pollInterval)
        }

        logger.Log().WithFields(map[string]interface{}{
            "pid":        pid,
            "lapi_ready": lapiReady,
        }).Info("CrowdSec toggle: started successfully")

        c.JSON(http.StatusOK, gin.H{
            "enabled":    true,
            "pid":        pid,
            "lapi_ready": lapiReady,
        })
        return
    } else {
        // Stop CrowdSec
        if err := h.Executor.Stop(ctx, h.DataDir); err != nil {
            logger.Log().WithError(err).Error("CrowdSec toggle: failed to stop process")
            c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
            return
        }

        logger.Log().Info("CrowdSec toggle: stopped successfully")
        c.JSON(http.StatusOK, gin.H{"enabled": false})
        return
    }
}
```

**Register Route**:
```go
// In RegisterRoutes() method
rg.POST("/admin/crowdsec/toggle", h.ToggleCrowdSec)
```

**Frontend API Client** (`frontend/src/api/crowdsec.ts`):
```typescript
export async function toggleCrowdsec(enabled: boolean): Promise<{ enabled: boolean; pid?: number; lapi_ready?: boolean }> {
  const response = await client.post('/admin/crowdsec/toggle', { enabled })
  return response.data
}
```

**Frontend Toggle Update** (`frontend/src/pages/Security.tsx`):
```tsx
const crowdsecPowerMutation = useMutation({
  mutationFn: async (enabled: boolean) => {
    if (enabled) {
      toast.info('Starting CrowdSec... This may take up to 30 seconds')
    }

    // Use unified toggle endpoint (handles Settings + SecurityConfig + Process)
    const result = await toggleCrowdsec(enabled)

    // Backend already verified state, just do final status check
    const status = await statusCrowdsec()
    if (enabled && !status.running) {
      throw new Error('CrowdSec process failed to start. Check server logs for details.')
    }
    if (!enabled && status.running) {
      throw new Error('CrowdSec process still running. Check server logs for details.')
    }

    return result
  },
  // ... rest remains the same
})
```

---

## Testing Plan

### Test 1: Fresh Install

**Scenario**: Brand new Charon installation

1. Start container: `docker compose up -d`
2. Navigate to Security page
3. Verify CrowdSec toggle shows OFF
4. Check status: `curl http://localhost:8080/api/v1/admin/crowdsec/status`
   - Expected: `{"running": false}`
5. Check logs: `docker logs charon 2>&1 | grep "reconciliation"`
   - Expected: "no SecurityConfig found, checking Settings table"
   - Expected: "default SecurityConfig created from Settings preference"
   - Expected: "crowdsec_mode: disabled"

### Test 2: Toggle ON → Container Restart

**Scenario**: User enables CrowdSec, then restarts container

1. Enable toggle in UI (click ON)
2. Verify CrowdSec starts
3. Check status: `{"running": true, "pid": xxx}`
4. Restart: `docker restart charon`
5. Wait 10 seconds
6. Check status again: `{"running": true, "pid": xxx}` (NEW PID)
7. Check logs:
   - Expected: "starting based on SecurityConfig mode='local'"

### Test 3: Legacy Migration (Settings Table Only)

**Scenario**: Existing install with Settings table but no SecurityConfig

1. Manually set: `INSERT INTO settings (key, value, type, category) VALUES ('security.crowdsec.enabled', 'true', 'bool', 'security');`
2. Delete SecurityConfig: `DELETE FROM security_configs;`
3. Restart container
4. Check logs:
   - Expected: "found existing Settings table preference"
   - Expected: "default SecurityConfig created from Settings preference"
   - Expected: "crowdsec_mode: local"
5. Check status: `{"running": true}`

### Test 4: Toggle OFF → Container Restart

**Scenario**: User disables CrowdSec, then restarts container

1. Start with CrowdSec enabled and running
2. Click toggle OFF in UI
3. Verify process stops
4. Restart: `docker restart charon`
5. Wait 10 seconds
6. Check status: `{"running": false}`
7. Verify toggle still shows OFF

### Test 5: Corrupted SecurityConfig Recovery

**Scenario**: SecurityConfig gets deleted but Settings exists

1. Enable CrowdSec via UI
2. Manually delete SecurityConfig: `DELETE FROM security_configs;`
3. Restart container
4. Verify auto-init recreates SecurityConfig matching Settings table
5. Verify CrowdSec auto-starts

---

## Verification Checklist

### Phase 1 (Auto-Initialization Fix)

- [ ] Modified `crowdsec_startup.go` lines 46-71
- [ ] Auto-init checks Settings table for existing preference
- [ ] Auto-init creates SecurityConfig matching Settings state
- [ ] Auto-init does NOT return early (continues to line 74+)
- [ ] Test 1 (Fresh Install) passes
- [ ] Test 3 (Legacy Migration) passes

### Phase 2 (Logging Enhancement)

- [ ] Modified `crowdsec_startup.go` lines 91-98
- [ ] Changed log level from Debug to Info
- [ ] Added source attribution logging
- [ ] Test 2 (Toggle ON → Restart) shows correct log
- [ ] Test 4 (Toggle OFF → Restart) shows correct log

### Phase 3 (Unified Toggle - Optional)

- [ ] Added `ToggleCrowdSec()` method to `crowdsec_handler.go`
- [ ] Registered `/admin/crowdsec/toggle` route
- [ ] Added `toggleCrowdsec()` to `crowdsec.ts`
- [ ] Updated `crowdsecPowerMutation` in `Security.tsx`
- [ ] Test 4 (Toggle OFF → Container Restart) passes
- [ ] Test 5 (Corrupted recovery) passes

### Pre-Deployment

- [ ] Pre-commit linters pass: `pre-commit run --all-files`
- [ ] Backend tests pass: `cd backend && go test ./...`
- [ ] Frontend tests pass: `cd frontend && npm run test`
- [ ] Docker build succeeds: `docker build -t charon:local .`
- [ ] Integration test passes: `scripts/crowdsec_integration.sh`

---

## Success Criteria

✅ **Fix is complete when**:

1. Toggle shows correct state (ON = running, OFF = stopped)
2. Toggle persists across container restarts
3. Reconciliation logs clearly show decision reason
4. Auto-initialization respects Settings table preference
5. No "stuck toggle" scenarios
6. All 5 test cases pass
7. Pre-commit checks pass
8. No regressions in existing CrowdSec functionality

---

## Risk Assessment

| Change | Risk Level | Mitigation |
|--------|------------|------------|
| Phase 1 (Auto-init) | **Low** | Only affects fresh installs or corrupted state recovery |
| Phase 2 (Logging) | **Very Low** | Only changes log output, no logic changes |
| Phase 3 (Unified toggle) | **Medium** | New endpoint, requires thorough testing, but backward compatible |

---

## Rollback Plan

If issues arise:

1. **Immediate Revert**: `git revert <commit-hash>` (no DB changes needed)
2. **Manual Fix** (if toggle stuck):
   ```sql
   -- Reset SecurityConfig
   UPDATE security_configs
   SET crowdsec_mode = 'disabled', enabled = 0
   WHERE uuid = 'default';

   -- Reset Settings
   UPDATE settings
   SET value = 'false'
   WHERE key = 'security.crowdsec.enabled';
   ```
3. **Force Stop CrowdSec**: `docker exec charon pkill -SIGTERM crowdsec`

---

## Dependency Impact Analysis

### Phase 1: Auto-Initialization Changes (crowdsec_startup.go)

#### Files Directly Modified
- `backend/internal/services/crowdsec_startup.go` (lines 46-71)

#### Dependencies and Required Updates

**1. Unit Tests - MUST BE UPDATED**
- **File**: `backend/internal/services/crowdsec_startup_test.go`
- **Impact**: Test `TestReconcileCrowdSecOnStartup_NoSecurityConfig` expects the function to skip/return early when no SecurityConfig exists
- **Required Change**: Update test to:
  - Create a Settings table entry with `security.crowdsec.enabled = 'true'`
  - Verify that SecurityConfig is auto-created with `crowdsec_mode = "local"`
  - Verify that CrowdSec process is started (not skipped)
- **Additional Tests Needed**:
  - `TestReconcileCrowdSecOnStartup_NoSecurityConfig_SettingsDisabled` - Settings='false' → creates config with mode="disabled", does NOT start
  - `TestReconcileCrowdSecOnStartup_NoSecurityConfig_SettingsEnabled` - Settings='true' → creates config with mode="local", DOES start
  - `TestReconcileCrowdSecOnStartup_NoSecurityConfig_NoSettingsEntry` - No Settings entry → creates config with mode="disabled", does NOT start

**2. Integration Tests - VERIFICATION NEEDED**
- **Files**:
  - `scripts/crowdsec_integration.sh`
  - `scripts/crowdsec_startup_test.sh`
  - `scripts/crowdsec_decision_integration.sh`
- **Impact**: These scripts may assume specific startup behavior
- **Verification Required**:
  - Do any scripts pre-populate Settings table?
  - Do any scripts expect reconciliation to skip on fresh DB?
  - Do any scripts verify log output from reconciliation?
- **Action**: Review scripts for assumptions about auto-initialization behavior

**3. Migration/Upgrade Path - DATABASE CONCERN**
- **Scenario**: Existing installations with Settings='true' but missing SecurityConfig
- **Impact**: After upgrade, reconciliation will auto-create SecurityConfig from Settings (POSITIVE)
- **Risk**: Low - this is the intended fix
- **Documentation**: Should document this as expected behavior in migration guide

**4. Models - NO CHANGES REQUIRED**
- **File**: `backend/internal/models/security_config.go`
- **Analysis**: SecurityConfig model structure unchanged
- **File**: `backend/internal/models/setting.go`
- **Analysis**: Setting model structure unchanged

**5. Route Registration - NO CHANGES REQUIRED**
- **File**: `backend/internal/api/routes/routes.go` (line 360)
- **Analysis**: Already calls `ReconcileCrowdSecOnStartup`, no signature changes

**6. Handler Dependencies - NO CHANGES REQUIRED**
- **File**: `backend/internal/api/handlers/crowdsec_handler.go`
- **Analysis**: Start/Stop handlers operate independently, no coupling to reconciliation logic

### Phase 2: Logging Enhancement Changes (crowdsec_startup.go)

#### Files Directly Modified
- `backend/internal/services/crowdsec_startup.go` (lines 91-98)

#### Dependencies and Required Updates

**1. Log Aggregation/Parsing - DOCUMENTATION UPDATE**
- **Concern**: Changing log level from Debug → Info increases log volume
- **Impact**:
  - Logs will now appear in production (Info is default minimum level)
  - Log aggregation tools may need filter updates if they parse specific messages
- **Required**: Update any log parsing scripts or documentation about expected log output

**2. Integration Tests - POTENTIAL GREP PATTERNS**
- **Files**: `scripts/crowdsec_*.sh`
- **Impact**: If scripts `grep` for specific log messages, they may need updates
- **Action**: Search for log message expectations in scripts

**3. Documentation - UPDATE REQUIRED**
- **File**: `docs/features.md`
- **Section**: CrowdSec Integration (line 167+)
- **Required Change**: Add note about reconciliation behavior:
  ```markdown
  #### Startup Behavior

  CrowdSec automatically starts on container restart if:
  - SecurityConfig has `crowdsec_mode = "local"` OR
  - Settings table has `security.crowdsec.enabled = "true"`

  Check container logs for reconciliation decisions:
  - "CrowdSec reconciliation: starting based on SecurityConfig mode='local'"
  - "CrowdSec reconciliation: starting based on Settings table override"
  - "CrowdSec reconciliation skipped: both SecurityConfig and Settings indicate disabled"
  ```
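
The startup rule documented above can be sketched as a small pure function. This is an illustration only, not the code in `crowdsec_startup.go`: the function name `reconcileDecision` and its parameters are hypothetical, and the exact string comparison against `"true"` is an assumption. The returned messages are the three log lines listed above.

```go
package main

import "fmt"

// reconcileDecision is a hypothetical helper that mirrors the documented rule:
// start CrowdSec if SecurityConfig says "local" OR the Settings value is "true".
// It returns the decision together with the log message that explains it.
func reconcileDecision(crowdsecMode, settingEnabled string) (bool, string) {
	switch {
	case crowdsecMode == "local":
		return true, "CrowdSec reconciliation: starting based on SecurityConfig mode='local'"
	case settingEnabled == "true": // assumption: the setting is stored as the exact string "true"
		return true, "CrowdSec reconciliation: starting based on Settings table override"
	default:
		return false, "CrowdSec reconciliation skipped: both SecurityConfig and Settings indicate disabled"
	}
}

func main() {
	// The failure case this plan targets: SecurityConfig defaults to "disabled"
	// while the user has switched the toggle on in the Settings table.
	start, msg := reconcileDecision("disabled", "true")
	fmt.Println(start, msg)
}
```

Keeping the decision separate from process management makes each branch easy to cover in unit tests without starting CrowdSec.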

**4. Troubleshooting Guide - UPDATE RECOMMENDED**
- **File**: `docs/troubleshooting/` (if it exists) or `docs/security.md`
- **Recommended Change**: Add a section on "CrowdSec Not Starting After Restart"
  - Explain reconciliation logic
  - Show how to check Settings and SecurityConfig tables
  - Show example log output

### Phase 3: Unified Toggle Endpoint (OPTIONAL)

#### Files Directly Modified
- `backend/internal/api/handlers/crowdsec_handler.go` (new method)
- `backend/internal/api/handlers/crowdsec_handler.go` (RegisterRoutes)
- `frontend/src/api/crowdsec.ts` (new function)
- `frontend/src/pages/Security.tsx` (mutation update)

#### Dependencies and Required Updates

**1. Handler Tests - NEW TESTS REQUIRED**
- **File**: `backend/internal/api/handlers/crowdsec_handler_test.go`
- **Required Tests**:
  - `TestCrowdsecHandler_Toggle_EnableSuccess`
  - `TestCrowdsecHandler_Toggle_DisableSuccess`
  - `TestCrowdsecHandler_Toggle_TransactionRollback` (if Start fails)
  - `TestCrowdsecHandler_Toggle_VerifyBothTablesUpdated`

**2. Existing Handlers - DEPRECATION CONSIDERATION**
- **Files**:
  - Start handler (line ~167 in crowdsec_handler.go)
  - Stop handler (line ~260 in crowdsec_handler.go)
- **Impact**: New toggle endpoint duplicates Start/Stop functionality
- **Decision Required**:
  - **Option A**: Keep both for backward compatibility (RECOMMENDED)
  - **Option B**: Deprecate Start/Stop, add deprecation warnings
  - **Option C**: Remove Start/Stop entirely (BREAKING CHANGE - NOT RECOMMENDED)
- **Recommendation**: Keep Start/Stop handlers unchanged, document toggle as "preferred method"

**3. Frontend API Layer - MIGRATION PATH**
- **File**: `frontend/src/api/crowdsec.ts`
- **Current Exports**: `startCrowdsec`, `stopCrowdsec`, `statusCrowdsec`
- **After Change**: Add `toggleCrowdsec` to exports (line 75)
- **Backward Compatibility**: Keep existing functions, don't remove them

**4. Frontend Component - LIMITED SCOPE**
- **File**: `frontend/src/pages/Security.tsx`
- **Impact**: Only `crowdsecPowerMutation` needs updating (lines 86-125)
- **Other Components**: No other components import these functions (verified)
- **Risk**: Low - isolated change

**5. API Documentation - NEW ENDPOINT**
- **File**: `docs/api.md` (if it exists)
- **Required Addition**: Document `/admin/crowdsec/toggle` endpoint

**6. Integration Tests - NEW TEST CASE**
- **Files**: `scripts/crowdsec_integration.sh`
- **Required Addition**: Test toggle endpoint directly

**7. Backward Compatibility - ANALYSIS**
- **Frontend**: Existing `/admin/crowdsec/start` and `/admin/crowdsec/stop` endpoints remain functional
- **API Consumers**: External tools using Start/Stop continue to work
- **Risk**: None - purely additive change

### Cross-Cutting Concerns

#### Database Migration
- **No schema changes required** - both Settings and SecurityConfig tables already exist
- **Data migration**: None needed - changes are behavioral only

#### Configuration Files
- **No changes required** - no new environment variables or config files

#### Docker/Deployment
- **No Dockerfile changes** - all changes are code-level
- **No docker-compose changes** - no new services or volumes

#### Security Implications
- **Phase 1**: Improves security by respecting the user's intent across restarts (CrowdSec no longer stays silently off after a restart)
- **Phase 2**: No security impact (logging only)
- **Phase 3**: Transaction safety prevents partial updates (improvement)

#### Performance Considerations
- **Phase 1**: Adds one SQL query during auto-initialization (one-time, on startup)
- **Phase 2**: Minimal - only adds log statements
- **Phase 3**: Minimal - wraps existing logic in transaction

#### Rollback Safety
- **All phases**: No database schema changes, can be rolled back via git revert
- **Data safety**: No data loss risk - only affects process startup behavior

### Summary of Required File Updates

| Phase | Files to Modify | Files to Create | Tests to Add | Docs to Update |
|-------|----------------|-----------------|--------------|----------------|
| **Phase 1** | `crowdsec_startup.go` | None | 3 new unit tests | None (covered in Phase 2) |
| **Phase 2** | `crowdsec_startup.go` | None | None | `features.md`, troubleshooting docs |
| **Phase 3** | `crowdsec_handler.go`, `crowdsec.ts`, `Security.tsx` | None | 4 new handler tests | `api.md` (if exists) |

### Testing Matrix

| Scenario | Phase 1 | Phase 2 | Phase 3 |
|----------|---------|---------|---------|
| Fresh install → toggle ON → restart | ✅ Fixes | ✅ Better logs | ✅ Cleaner code |
| Existing install with Settings='true', missing SecurityConfig | ✅ Fixes | ✅ Better logs | N/A |
| Toggle ON → restart → verify logs | ✅ Works | ✅ MUST verify new messages | ✅ Works |
| Toggle OFF → restart → verify logs | ✅ Works | ✅ MUST verify new messages | ✅ Works |
| Start/Stop handlers (backward compat) | N/A | N/A | ✅ MUST verify still work |

### Missing from Original Plan

The original plan DID NOT explicitly mention:

1. **Unit test updates required** - Critical for Phase 1 (`TestReconcileCrowdSecOnStartup_NoSecurityConfig` needs major refactoring)
2. **Integration script verification** - Scripts may break if they assume reconciliation skips on a fresh database or grep for specific log messages
3. **Documentation updates** - The features and troubleshooting guides need the new reconciliation behavior documented
4. **Backward compatibility analysis for Phase 3** - An explicit decision on the fate of the Start/Stop handlers is needed
5. **API documentation** - The new toggle endpoint needs docs
6. **Testing matrix for all three phases together** - The phases must be verified in combination

---

**END OF SPECIFICATION**