diff --git a/docs/plans/current_spec.md b/docs/plans/current_spec.md index 8ea6be0f..3f446f28 100644 --- a/docs/plans/current_spec.md +++ b/docs/plans/current_spec.md @@ -1,528 +1,369 @@ -# CrowdSec LAPI Status Bug - Diagnostic & Fix Plan +# Fix Plan: Critical Issues After Docker Rebuild **Date:** December 14, 2025 -**Issue:** CrowdSecConfig page persistently shows "LAPI is initializing" even when LAPI is running -**Status:** 🎯 **ROOT CAUSE IDENTIFIED** - Status endpoint checks process, not LAPI connectivity -**Priority:** HIGH (Blocks Console Enrollment Feature) -**Previous Issue:** [crowdsec_lapi_error_diagnostic.md](crowdsec_lapi_error_diagnostic.md) - Race condition fix introduced this regression +**Status:** Planning Phase +**Priority:** P0 - Urgent --- -## 🎯 Key Findings +## Issue Summary -### Critical Discovery +After a Docker container rebuild, three critical issues were identified: -After implementing fixes from `docs/plans/crowdsec_lapi_error_diagnostic.md`, the CrowdSecConfig page now persistently displays: - -> "CrowdSec Local API is initializing... -> The CrowdSec process is running but the Local API (LAPI) is still starting up." - -This message appears **even when LAPI is actually running and reachable**. The fix introduced a regression where the Status endpoint was not updated to match the new LAPI-aware Start endpoint. - -### Root Cause Chain - -1. `Start()` handler was correctly updated to wait for LAPI and return `lapi_ready: true/false` -2. **BUT** `Status()` handler was **NOT updated** - still only checks process status -3. Frontend expects `running` to mean "LAPI responding" -4. Backend returns `running: true` meaning only "process running" -5. 
**MISMATCH:** Frontend needs `lapi_ready` field to determine actual LAPI status - -### Why This is a Regression - -- The original fix added LAPI readiness check to `Start()` handler ✅ -- But forgot to add the same check to `Status()` handler ❌ -- Frontend now uses `statusCrowdsec()` for polling LAPI status -- This endpoint doesn't actually verify LAPI connectivity - -### Impact - -- Console enrollment section always shows "initializing" warning -- Enroll button is disabled even when LAPI is working -- Users cannot complete console enrollment despite CrowdSec being functional - ---- - -## Executive Summary - -The `Start()` handler was correctly updated to wait for LAPI readiness before returning (lines 201-236 in [crowdsec_handler.go](../../backend/internal/api/handlers/crowdsec_handler.go#L201-L236)): - -```go -// Start() now waits for LAPI and returns lapi_ready: true/false -c.JSON(http.StatusOK, gin.H{ - "status": "started", - "pid": pid, - "lapi_ready": true, // NEW: indicates LAPI is ready -}) -``` - -However, the `Status()` handler was **NOT updated** and still only checks process status (lines 287-294): - -```go -func (h *CrowdsecHandler) Status(c *gin.Context) { - ctx := c.Request.Context() - running, pid, err := h.Executor.Status(ctx, h.DataDir) // Only checks PID! - if err != nil { - c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()}) - return - } - c.JSON(http.StatusOK, gin.H{"running": running, "pid": pid}) // Missing lapi_ready! -} -``` +1. **500 error on CrowdSec Stop()** - Toggling CrowdSec OFF returns "Failed to stop CrowdSec: Request failed with status code 500" +2. **CrowdSec shows "not running"** - Despite database setting being enabled, the process isn't running after container restart +3. 
**Live Logs Disconnected** - WebSocket shows disconnected state even when logs are being generated --- ## Root Cause Analysis -### The Executor's Status() Method +### Issue 1: 500 Error on Stop() -The `DefaultCrowdsecExecutor.Status()` in [crowdsec_exec.go](../../backend/internal/api/handlers/crowdsec_exec.go#L65-L87) only checks: +**Location:** [crowdsec_exec.go](../../backend/internal/api/handlers/crowdsec_exec.go#L36-L51) -1. If PID file exists -2. If process with that PID is running (via signal 0) +**Root Cause:** The `Stop()` method in `DefaultCrowdsecExecutor` fails when there's no PID file. + +```go +func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error { + b, err := os.ReadFile(e.pidFile(configDir)) + if err != nil { + return fmt.Errorf("pid file read: %w", err) // ← FAILS HERE with 500 + } + // ... +} +``` + +**Problem:** When the container restarts: +1. The PID file (`/app/data/crowdsec/crowdsec.pid`) is deleted (ephemeral or process cleanup removes it) +2. The database still has CrowdSec "enabled" = true +3. User clicks "Disable" (which calls Stop()) +4. Stop() tries to read the PID file → fails → returns error → 500 response + +**Why it fails:** The code assumes a PID file always exists when stopping. But after container restart, there's no PID file because the CrowdSec process wasn't running (GUI-controlled lifecycle, NOT auto-started). + +### Issue 2: CrowdSec Shows "Not Running" After Restart + +**Location:** [docker-entrypoint.sh](../../docker-entrypoint.sh#L91-L97) + +**Root Cause:** CrowdSec is **intentionally NOT auto-started** in the entrypoint. The design is "GUI-controlled lifecycle." + +From the entrypoint: +```bash +# CrowdSec Lifecycle Management: +# CrowdSec configuration is initialized above (symlinks, directories, hub updates) +# However, the CrowdSec agent is NOT auto-started in the entrypoint. +# Instead, CrowdSec lifecycle is managed by the backend handlers via GUI controls. 
+``` + +**Problem:** The database stores the user's "enabled" preference, but there's no reconciliation at startup: +1. Container restarts +2. Database says CrowdSec `enabled = true` (user's previous preference) +3. But CrowdSec process is NOT started (by design) +4. UI shows "enabled" but status shows "not running" → confusing state mismatch + +**Missing Logic:** No startup reconciliation to check "if DB says enabled, start CrowdSec process." + +### Issue 3: Live Logs Disconnected + +**Location:** [logs_ws.go](../../backend/internal/api/handlers/logs_ws.go) and [log_watcher.go](../../backend/internal/services/log_watcher.go) + +**Root Cause:** There are **two separate WebSocket log systems** that may be misconfigured: + +1. **`/api/v1/logs/live`** (logs_ws.go) - Streams application logs via `logger.GetBroadcastHook()` +2. **`/api/v1/cerberus/logs/ws`** (cerberus_logs_ws.go) - Streams Caddy access logs via `LogWatcher` + +**Potential Issues:** + +a) **LogWatcher not started:** The `LogWatcher` must be explicitly started with `Start(ctx)`. If the watcher isn't started during server initialization, no logs are broadcast. + +b) **Log file doesn't exist:** The LogWatcher waits for `/var/log/caddy/access.log` to exist. After container restart with no traffic, this file may not exist yet. + +c) **WebSocket connection path mismatch:** Frontend might connect to wrong endpoint or with invalid token. + +d) **CSP blocking WebSocket:** Security middleware's Content-Security-Policy must allow `ws:` and `wss:` protocols. 
+ +--- + +## Detailed Code Analysis + +### Stop() Method - Full Code Review + +```go +// File: backend/internal/api/handlers/crowdsec_exec.go +func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error { + b, err := os.ReadFile(e.pidFile(configDir)) + if err != nil { + return fmt.Errorf("pid file read: %w", err) // ← CRITICAL: Returns error on ENOENT + } + pid, err := strconv.Atoi(string(b)) + if err != nil { + return fmt.Errorf("invalid pid: %w", err) + } + proc, err := os.FindProcess(pid) + if err != nil { + return err + } + if err := proc.Signal(syscall.SIGTERM); err != nil { + return err + } + _ = os.Remove(e.pidFile(configDir)) + return nil +} +``` + +The problem is clear: `os.ReadFile()` returns `os.ErrNotExist` when the PID file doesn't exist, and this is propagated as a 500 error. + +### Status() Method - Already Handles Missing PID Gracefully ```go func (e *DefaultCrowdsecExecutor) Status(ctx context.Context, configDir string) (running bool, pid int, err error) { b, err := os.ReadFile(e.pidFile(configDir)) if err != nil { - // Missing pid file is treated as not running + // Missing pid file is treated as not running ← GOOD PATTERN return false, 0, nil } - // ... check if process is alive via signal 0 ... - return true, pid, nil + // ... } ``` -It does **NOT** check if LAPI HTTP endpoint is responding. - -### Frontend Expectation Mismatch - -The frontend in [CrowdSecConfig.tsx](../../frontend/src/pages/CrowdSecConfig.tsx#L71-L77) queries LAPI status: - -```tsx -const lapiStatusQuery = useQuery({ - queryKey: ['crowdsec-lapi-status'], - queryFn: statusCrowdsec, - enabled: consoleEnrollmentEnabled && initialCheckComplete, - refetchInterval: 5000, // Poll every 5 seconds - retry: false, -}) -``` - -And displays a warning based on `running` field (lines 207-231): - -```tsx -{lapiStatusQuery.data && !lapiStatusQuery.data.running && initialCheckComplete && ( -
-  <div>
-    <div>
-      CrowdSec Local API is initializing...
-    </div>
-  </div>
-)} -``` - -**The Problem:** The frontend checks `lapiStatusQuery.data?.running` expecting it to indicate LAPI connectivity. But the backend returns `running: true` which only means "process is running", not "LAPI is responding". - -### Evidence Chain - -| Component | File | Line | Returns | Actually Checks | -|-----------|------|------|---------|-----------------| -| Backend Handler | crowdsec_handler.go | 287-294 | `{running, pid}` | Process running via PID | -| Backend Executor | crowdsec_exec.go | 65-87 | `(running, pid, err)` | PID file + signal 0 | -| Frontend API | crowdsec.ts | 18-21 | `resp.data` | N/A (passthrough) | -| Frontend Query | CrowdSecConfig.tsx | 71-77 | `lapiStatusQuery.data` | Checks `.running` field | -| Frontend UI | CrowdSecConfig.tsx | 207-231 | Shows warning | `!running` | - -**Bug:** Frontend interprets `running` as "LAPI responding" but backend returns "process running". +**Key Insight:** `Status()` already handles missing PID file gracefully by returning `(false, 0, nil)`. The `Stop()` method should follow the same pattern. --- -## Detailed Analysis: Why Warning Always Shows +## Fix Plan -Looking at the conditional again: +### Fix 1: Make Stop() Idempotent (No Error When Already Stopped) -```tsx -{lapiStatusQuery.data && !lapiStatusQuery.data.running && initialCheckComplete && ( -``` +**File:** `backend/internal/api/handlers/crowdsec_exec.go` -This shows the warning when: -- `lapiStatusQuery.data` is truthy ✓ -- `!lapiStatusQuery.data.running` is truthy (i.e., `running` is falsy) -- `initialCheckComplete` is truthy ✓ - -**Re-analyzing:** If `running: true`, then `!true = false`, so warning should NOT show. - -**But user reports it DOES show!** - -**Possible causes:** - -1. **Process not actually running:** The `Status()` endpoint returns `running: false` because CrowdSec process crashed or PID file is missing/stale -2. **Different `running` field:** Frontend might be checking a different property -3. 
**Query state issue:** React Query might be returning stale data - -**Most Likely:** Looking at the message being displayed: - -> "CrowdSec Local API is **initializing**..." - -This message was designed for the case where **process IS running** but **LAPI is NOT ready yet**. But the current conditional shows it when `running` is false! - -**The Fix Needed:** The conditional should check: -- Process running (`running: true`) AND -- LAPI not ready (`lapi_ready: false`) - -NOT just: -- Process not running (`running: false`) - ---- - -## The Complete Fix - -### Files to Modify - -1. **Backend:** [backend/internal/api/handlers/crowdsec_handler.go](../../backend/internal/api/handlers/crowdsec_handler.go#L287-L294) -2. **Frontend API:** [frontend/src/api/crowdsec.ts](../../frontend/src/api/crowdsec.ts#L18-L21) -3. **Frontend UI:** [frontend/src/pages/CrowdSecConfig.tsx](../../frontend/src/pages/CrowdSecConfig.tsx#L207-L231) -4. **Tests:** [backend/internal/api/handlers/crowdsec_handler_test.go](../../backend/internal/api/handlers/crowdsec_handler_test.go) - -### Change 1: Backend Status Handler - -**File:** `backend/internal/api/handlers/crowdsec_handler.go` -**Location:** Lines 287-294 - -**Before:** +**Current Code (lines 36-51):** ```go -// Status returns simple running state. 
-func (h *CrowdsecHandler) Status(c *gin.Context) { - ctx := c.Request.Context() - running, pid, err := h.Executor.Status(ctx, h.DataDir) +func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error { + b, err := os.ReadFile(e.pidFile(configDir)) if err != nil { - c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()}) - return + return fmt.Errorf("pid file read: %w", err) } - c.JSON(http.StatusOK, gin.H{"running": running, "pid": pid}) + pid, err := strconv.Atoi(string(b)) + if err != nil { + return fmt.Errorf("invalid pid: %w", err) + } + proc, err := os.FindProcess(pid) + if err != nil { + return err + } + if err := proc.Signal(syscall.SIGTERM); err != nil { + return err + } + // best-effort remove pid file + _ = os.Remove(e.pidFile(configDir)) + return nil } ``` -**After:** +**Fixed Code:** ```go -// Status returns running state including LAPI availability check. -func (h *CrowdsecHandler) Status(c *gin.Context) { - ctx := c.Request.Context() - running, pid, err := h.Executor.Status(ctx, h.DataDir) +func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error { + b, err := os.ReadFile(e.pidFile(configDir)) if err != nil { - c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()}) - return - } - - // Check LAPI connectivity if process is running - lapiReady := false - if running { - args := []string{"lapi", "status"} - if _, err := os.Stat(filepath.Join(h.DataDir, "config.yaml")); err == nil { - args = append([]string{"-c", filepath.Join(h.DataDir, "config.yaml")}, args...) + if os.IsNotExist(err) { + // No PID file means process isn't running - that's OK for Stop() + // This makes Stop() idempotent (safe to call multiple times) + return nil } - checkCtx, cancel := context.WithTimeout(ctx, 2*time.Second) - _, checkErr := h.CmdExec.Execute(checkCtx, "cscli", args...) 
- cancel() - lapiReady = (checkErr == nil) + return fmt.Errorf("pid file read: %w", err) } - - c.JSON(http.StatusOK, gin.H{ - "running": running, - "pid": pid, - "lapi_ready": lapiReady, - }) + pid, err := strconv.Atoi(string(b)) + if err != nil { + // Malformed PID file - remove it and treat as not running + _ = os.Remove(e.pidFile(configDir)) + return nil + } + proc, err := os.FindProcess(pid) + if err != nil { + // Process lookup failed - clean up PID file + _ = os.Remove(e.pidFile(configDir)) + return nil + } + if err := proc.Signal(syscall.SIGTERM); err != nil { + // Process may already be dead - clean up PID file + _ = os.Remove(e.pidFile(configDir)) + // Only return error if it's not "process doesn't exist" + if !errors.Is(err, os.ErrProcessDone) && !errors.Is(err, syscall.ESRCH) { + return err + } + return nil + } + // best-effort remove pid file + _ = os.Remove(e.pidFile(configDir)) + return nil } ``` -### Change 2: Frontend API Type +**Rationale:** Stop() should be idempotent. Stopping an already-stopped service shouldn't error. -**File:** `frontend/src/api/crowdsec.ts` -**Location:** Lines 18-21 +### Fix 2: Add CrowdSec Startup Reconciliation -**Before:** -```typescript -export async function statusCrowdsec() { - const resp = await client.get('/admin/crowdsec/status') - return resp.data +**File:** `backend/internal/api/routes/routes.go` (or create `backend/internal/services/crowdsec_startup.go`) + +**New Function:** +```go +// ReconcileCrowdSecOnStartup checks if CrowdSec should be running based on DB settings +// and starts it if necessary. This handles the case where the container restarts +// but the user's preference was to have CrowdSec enabled. 
+func ReconcileCrowdSecOnStartup(db *gorm.DB, executor handlers.CrowdsecExecutor, binPath, dataDir string) { + if db == nil { + return + } + + var secCfg models.SecurityConfig + if err := db.First(&secCfg).Error; err != nil { + // No config yet or error - nothing to reconcile + logger.Log().WithError(err).Debug("No security config found for CrowdSec reconciliation") + return + } + + // Check if CrowdSec should be running based on mode + if secCfg.CrowdSecMode != "local" { + logger.Log().WithField("mode", secCfg.CrowdSecMode).Debug("CrowdSec mode is not 'local', skipping auto-start") + return + } + + // Check if already running + ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) + defer cancel() + + running, _, _ := executor.Status(ctx, dataDir) + if running { + logger.Log().Info("CrowdSec already running, no action needed") + return + } + + // Start CrowdSec since DB says it should be enabled + logger.Log().Info("CrowdSec mode is 'local' but process not running, starting...") + _, err := executor.Start(ctx, binPath, dataDir) + if err != nil { + logger.Log().WithError(err).Warn("Failed to auto-start CrowdSec on startup reconciliation") + } else { + logger.Log().Info("CrowdSec started successfully via startup reconciliation") + } } ``` -**After:** -```typescript -export interface CrowdSecStatus { - running: boolean - pid: number - lapi_ready: boolean -} - -export async function statusCrowdsec(): Promise { - const resp = await client.get('/admin/crowdsec/status') - return resp.data +**Integration Point:** Call this function after database migration in server initialization: +```go +// In routes.go or server.go, after DB is ready and handlers are created +if crowdsecHandler != nil { + ReconcileCrowdSecOnStartup(db, crowdsecHandler.Executor, crowdsecHandler.BinPath, crowdsecHandler.DataDir) } ``` -### Change 3: Frontend CrowdSecConfig Conditional Logic +### Fix 3: Ensure LogWatcher is Started and Log File Exists -**File:** 
`frontend/src/pages/CrowdSecConfig.tsx` -**Location:** Lines 207-231 +**File:** `backend/internal/api/routes/routes.go` -**Before:** -```tsx -{/* Warning when CrowdSec LAPI is not running */} -{lapiStatusQuery.data && !lapiStatusQuery.data.running && initialCheckComplete && ( -
-  <div>
-    <div>
-      CrowdSec Local API is initializing...
-    </div>
-    <div>
-      The CrowdSec process is running but the Local API (LAPI) is still starting up.
-      This typically takes 5-10 seconds after enabling CrowdSec.
-      {lapiStatusQuery.isRefetching && ' Checking again in 5 seconds...'}
-    </div>
-    {!status?.crowdsec?.enabled && (
-      <div>{/* enable action */}</div>
-    )}
-  </div>
-)} -``` +**Check that LogWatcher.Start() is called:** +```go +// Ensure LogWatcher is started with proper log path +logPath := "/var/log/caddy/access.log" -**After:** -```tsx -{/* Warning when CrowdSec process is running but LAPI is not ready */} -{lapiStatusQuery.data && lapiStatusQuery.data.running && !lapiStatusQuery.data.lapi_ready && initialCheckComplete && ( -
-  <div>
-    <div>
-      CrowdSec Local API is initializing...
-    </div>
-    <div>
-      The CrowdSec process is running but the Local API (LAPI) is still starting up.
-      This typically takes 5-10 seconds after enabling CrowdSec.
-      {lapiStatusQuery.isRefetching && ' Checking again in 5 seconds...'}
-    </div>
-  </div>
-)} +// Ensure log directory exists +if err := os.MkdirAll(filepath.Dir(logPath), 0755); err != nil { + logger.Log().WithError(err).Warn("Failed to create log directory") +} -{/* Warning when CrowdSec is not running at all */} -{lapiStatusQuery.data && !lapiStatusQuery.data.running && initialCheckComplete && ( -
-  <div>
-    <div>
-      CrowdSec is not running
-    </div>
-    <div>
-      Please enable CrowdSec using the toggle switch in the Security dashboard before enrolling in the Console.
-    </div>
-  </div>
-)} -``` +// Create empty log file if it doesn't exist (allows LogWatcher to start tailing immediately) +if _, err := os.Stat(logPath); os.IsNotExist(err) { + if f, err := os.Create(logPath); err == nil { + f.Close() + logger.Log().WithField("path", logPath).Info("Created empty log file for LogWatcher") + } +} -### Change 4: Update Enrollment Button Disabled State - -**File:** `frontend/src/pages/CrowdSecConfig.tsx` -**Location:** Lines 255-289 (Enroll, Rotate key, and Retry enrollment buttons) - -**Before:** -```tsx -disabled={isConsolePending || (lapiStatusQuery.data && !lapiStatusQuery.data.running) || !enrollmentToken.trim()} -``` - -**After:** -```tsx -disabled={isConsolePending || (lapiStatusQuery.data && !lapiStatusQuery.data.lapi_ready) || !enrollmentToken.trim()} -``` - -Also update the `title` attributes: - -**Before:** -```tsx -title={ - lapiStatusQuery.data && !lapiStatusQuery.data.running - ? 'CrowdSec LAPI must be running to enroll' - : ... +// Create and start the LogWatcher +watcher := services.NewLogWatcher(logPath) +if err := watcher.Start(context.Background()); err != nil { + logger.Log().WithError(err).Error("Failed to start LogWatcher") } ``` -**After:** -```tsx -title={ - lapiStatusQuery.data && !lapiStatusQuery.data.lapi_ready - ? 'CrowdSec LAPI must be running to enroll' - : ... -} +**Additionally, verify CSP allows WebSocket:** +The security middleware in `backend/internal/api/middleware/security.go` already has: +```go +directives["connect-src"] = "'self' ws: wss:" // WebSocket for HMR +``` + +This should allow WebSocket connections. 
+ +--- + +## Files to Modify + +| File | Change | +|------|--------| +| `backend/internal/api/handlers/crowdsec_exec.go` | Make `Stop()` idempotent for missing PID file | +| `backend/internal/api/routes/routes.go` | Add CrowdSec startup reconciliation call | +| `backend/internal/services/crowdsec_startup.go` | (NEW) Create startup reconciliation function | +| `backend/internal/api/handlers/crowdsec_exec_test.go` | Add tests for Stop() idempotency | + +--- + +## Testing Plan + +### Test 1: Stop() Idempotency +```bash +# Start container fresh (no CrowdSec running) +docker compose down -v && docker compose up -d + +# Call Stop() without starting CrowdSec first +curl -X POST http://localhost:8080/api/v1/admin/crowdsec/stop \ + -H "Authorization: Bearer $TOKEN" + +# Expected: 200 OK {"status": "stopped"} +# NOT: 500 Internal Server Error +``` + +### Test 2: Startup Reconciliation +```bash +# Enable CrowdSec via API +curl -X POST http://localhost:8080/api/v1/admin/crowdsec/start \ + -H "Authorization: Bearer $TOKEN" + +# Verify running +curl http://localhost:8080/api/v1/admin/crowdsec/status \ + -H "Authorization: Bearer $TOKEN" +# Expected: {"running": true, "pid": xxx, "lapi_ready": true} + +# Restart container +docker compose restart charon + +# Wait for startup +sleep 10 + +# Verify CrowdSec auto-started +curl http://localhost:8080/api/v1/admin/crowdsec/status \ + -H "Authorization: Bearer $TOKEN" +# Expected: {"running": true, "pid": xxx, "lapi_ready": true} +``` + +### Test 3: Live Logs +```bash +# Connect to WebSocket +websocat "ws://localhost:8080/api/v1/logs/live?token=$TOKEN" + +# In another terminal, generate traffic +curl http://localhost:8080/api/v1/health + +# Verify log entries appear in WebSocket connection ``` --- -## Testing Steps +## Implementation Order -### Unit Test: Backend Status Handler - -Add test in `backend/internal/api/handlers/crowdsec_handler_test.go`: - -```go -func TestCrowdsecHandler_Status_IncludesLAPIReady(t *testing.T) { - mockExec := 
&fakeExec{running: true, pid: 1234} - mockCmdExec := &mockCommandExecutor{returnErr: nil} // cscli lapi status succeeds - - handler := &CrowdsecHandler{ - Executor: mockExec, - CmdExec: mockCmdExec, - DataDir: "/app/data", - } - - w := httptest.NewRecorder() - c, _ := gin.CreateTestContext(w) - c.Request = httptest.NewRequest(http.MethodGet, "/admin/crowdsec/status", nil) - - handler.Status(c) - - assert.Equal(t, http.StatusOK, w.Code) - - var response map[string]interface{} - json.Unmarshal(w.Body.Bytes(), &response) - - assert.True(t, response["running"].(bool)) - assert.Equal(t, float64(1234), response["pid"].(float64)) - assert.True(t, response["lapi_ready"].(bool)) // NEW: Check lapi_ready is present and true -} - -func TestCrowdsecHandler_Status_LAPINotReady(t *testing.T) { - mockExec := &fakeExec{running: true, pid: 1234} - mockCmdExec := &mockCommandExecutor{returnErr: errors.New("connection refused")} // cscli lapi status fails - - handler := &CrowdsecHandler{ - Executor: mockExec, - CmdExec: mockCmdExec, - DataDir: "/app/data", - } - - w := httptest.NewRecorder() - c, _ := gin.CreateTestContext(w) - c.Request = httptest.NewRequest(http.MethodGet, "/admin/crowdsec/status", nil) - - handler.Status(c) - - assert.Equal(t, http.StatusOK, w.Code) - - var response map[string]interface{} - json.Unmarshal(w.Body.Bytes(), &response) - - assert.True(t, response["running"].(bool)) - assert.Equal(t, float64(1234), response["pid"].(float64)) - assert.False(t, response["lapi_ready"].(bool)) // LAPI not ready -} - -func TestCrowdsecHandler_Status_ProcessNotRunning(t *testing.T) { - mockExec := &fakeExec{running: false, pid: 0} - mockCmdExec := &mockCommandExecutor{} - - handler := &CrowdsecHandler{ - Executor: mockExec, - CmdExec: mockCmdExec, - DataDir: "/app/data", - } - - w := httptest.NewRecorder() - c, _ := gin.CreateTestContext(w) - c.Request = httptest.NewRequest(http.MethodGet, "/admin/crowdsec/status", nil) - - handler.Status(c) - - assert.Equal(t, 
http.StatusOK, w.Code) - - var response map[string]interface{} - json.Unmarshal(w.Body.Bytes(), &response) - - assert.False(t, response["running"].(bool)) - assert.False(t, response["lapi_ready"].(bool)) // LAPI can't be ready if process not running -} -``` - -### Manual Testing Procedure - -1. **Start Fresh:** - ```bash - docker compose down -v - docker compose up -d - ``` - -2. **Enable CrowdSec:** - - Go to Security dashboard - - Toggle CrowdSec ON - - Wait for toast "CrowdSec started and LAPI is ready" - -3. **Navigate to Config:** - - Click "Config" button - - Verify NO "initializing" warning shows - - Console enrollment section should be enabled - -4. **Verify API Response:** - ```bash - curl -s http://localhost:8080/api/v1/admin/crowdsec/status | jq - ``` - Expected: - ```json - { - "running": true, - "pid": 123, - "lapi_ready": true - } - ``` - -5. **Test LAPI Down Scenario:** - - SSH into container: `docker exec -it charon bash` - - Stop CrowdSec: `pkill -f crowdsec` - - Call API: - ```bash - curl -s http://localhost:8080/api/v1/admin/crowdsec/status | jq - ``` - - Expected: `{"running": false, "pid": 0, "lapi_ready": false}` - - Refresh CrowdSecConfig page - - Should show "CrowdSec is not running" error (red) - -6. **Test Restart Scenario:** - - Re-enable CrowdSec via Security dashboard - - Immediately navigate to CrowdSecConfig - - Should show "initializing" briefly (yellow) then clear when `lapi_ready: true` +1. **Fix 1 (Stop Idempotency)** - Quick fix, high impact, unblocks users immediately +2. **Fix 2 (Startup Reconciliation)** - Core fix for state mismatch after restart +3. 
**Fix 3 (Live Logs)** - May need more investigation; likely already working if LogWatcher is started --- @@ -530,46 +371,16 @@ func TestCrowdsecHandler_Status_ProcessNotRunning(t *testing.T) { | Change | Risk | Mitigation | |--------|------|------------| -| Backend Status handler modification | Low | Status handler is read-only, adds 2s timeout check | -| LAPI check timeout (2s) | Low | Short timeout prevents blocking; async refresh handles retries | -| Frontend conditional logic change | Low | More precise state handling, clear error states | -| Type definition update | Low | TypeScript will catch any mismatches at compile time | -| Two separate warning states | Low | Better UX with distinct yellow (initializing) vs red (not running) | +| Stop() idempotency | Very Low | Makes existing code more robust | +| Startup reconciliation | Low | Only runs once at startup, respects DB state | +| Log file creation | Very Low | Standard file operations with error handling | --- -## Summary +## Notes -**Root Cause:** The `Status()` endpoint was not updated when `Start()` was modified to check LAPI readiness. The frontend expects the status endpoint to indicate LAPI availability, but it only returns process status. - -**Fix:** Add `lapi_ready` field to `Status()` response by checking `cscli lapi status`, update frontend to use this new field for the warning display logic. - -**Files Changed:** -1. `backend/internal/api/handlers/crowdsec_handler.go` - Add LAPI check to Status() -2. `frontend/src/api/crowdsec.ts` - Add TypeScript interface with `lapi_ready` -3. `frontend/src/pages/CrowdSecConfig.tsx` - Update conditional logic: - - Yellow warning: process running, LAPI not ready - - Red warning: process not running - - No warning: process running AND LAPI ready -4. 
`backend/internal/api/handlers/crowdsec_handler_test.go` - Add unit tests - -**Estimated Time:** 1-2 hours including testing - -**Commit Message:** -``` -fix: add LAPI readiness check to CrowdSec status endpoint - -The Status() handler was only checking if the CrowdSec process was -running, not if LAPI was actually responding. This caused the -CrowdSecConfig page to always show "LAPI is initializing" even when -LAPI was fully operational. - -Changes: -- Backend: Add `lapi_ready` field to /admin/crowdsec/status response -- Frontend: Add CrowdSecStatus TypeScript interface -- Frontend: Update conditional logic to check `lapi_ready` not `running` -- Frontend: Separate warnings for "initializing" vs "not running" -- Tests: Add unit tests for Status handler LAPI check - -Fixes regression from crowdsec_lapi_error_diagnostic.md fixes. -``` +- The CrowdSec PID file path is `${dataDir}/crowdsec.pid` (e.g., `/app/data/crowdsec/crowdsec.pid`) +- The LogWatcher monitors `/var/log/caddy/access.log` by default +- WebSocket ping interval is 30 seconds for keep-alive +- CrowdSec LAPI runs on port 8085 (not 8080) to avoid conflict with Charon +- The `Status()` handler already includes `lapi_ready` field (from previous fix) diff --git a/docs/plans/post_rebuild_diagnostic.md b/docs/plans/post_rebuild_diagnostic.md new file mode 100644 index 00000000..26389275 --- /dev/null +++ b/docs/plans/post_rebuild_diagnostic.md @@ -0,0 +1,526 @@ +# Diagnostic & Fix Plan: CrowdSec and Live Logs Issues Post Docker Rebuild + +**Date:** December 14, 2025 +**Investigator:** Planning Agent +**Scope:** Three user-reported issues after Docker rebuild +**Status:** ✅ **COMPLETE - Root causes identified with fixes ready** + +--- + +## Executive Summary + +After thorough investigation of the backend handlers, executor implementation, entrypoint script, and frontend code, I've identified the root causes for all three reported issues: + +1. 
**CrowdSec shows "not running"** - Process detection via the PID file is failing
2. **500 error when stopping CrowdSec** - PID file doesn't exist when CrowdSec wasn't started via the handlers
3. **Live log viewer disconnected** - LogWatcher can't find the access log file

---

## Issue 1: CrowdSec Shows "Not Running" Even Though Enabled in UI

### Root Cause Analysis

The mismatch occurs because:

1. **Database Setting vs Process State**: The UI toggle updates the setting `security.crowdsec.enabled` in the database, but **does not actually start the CrowdSec process**.

2. **Process Lifecycle Design**: Per [docker-entrypoint.sh](../../docker-entrypoint.sh) (lines 56-65), CrowdSec is explicitly **NOT auto-started** in the container entrypoint:
   ```bash
   # CrowdSec Lifecycle Management:
   # CrowdSec agent is NOT auto-started in the entrypoint.
   # Instead, CrowdSec lifecycle is managed by the backend handlers via GUI controls.
   ```

3. **Status() Handler Behavior** ([crowdsec_handler.go#L238-L266](../../backend/internal/api/handlers/crowdsec_handler.go)):
   - Calls `h.Executor.Status()`, which reads the PID file at `{configDir}/crowdsec.pid`
   - If the PID file doesn't exist (CrowdSec was never started), it returns `running: false`
   - The frontend correctly shows "Stopped" even when the setting is "enabled"

4. **The Disconnect**:
   - Setting `security.crowdsec.enabled = true` ≠ process running
   - The setting only tells the Cerberus middleware to "use CrowdSec for protection" IF the process is running
   - Actually starting the process requires clicking the toggle, which calls `crowdsecPowerMutation.mutate(true)`

### Why It Appears Broken

After a Docker rebuild:
- The fresh container may still have `security.crowdsec.enabled` set to `true` in the DB (persisted volume)
- But the PID file is gone (container restart)
- The CrowdSec process is not running
- The UI shows the setting as "enabled" while the status shows "not running"

### Status() Handler Already Fixed

Looking at the current implementation in [crowdsec_handler.go#L238-L266](../../backend/internal/api/handlers/crowdsec_handler.go), the `Status()` handler **already includes the LAPI readiness check**:

```go
func (h *CrowdsecHandler) Status(c *gin.Context) {
    ctx := c.Request.Context()
    running, pid, err := h.Executor.Status(ctx, h.DataDir)
    // ...
    // Check LAPI connectivity if process is running
    lapiReady := false
    if running {
        args := []string{"lapi", "status"}
        // ... LAPI check implementation ...
        lapiReady = (checkErr == nil)
    }

    c.JSON(http.StatusOK, gin.H{
        "running":    running,
        "pid":        pid,
        "lapi_ready": lapiReady,
    })
}
```

### Additional Enhancement Required

Add `setting_enabled` and `needs_start` fields to help the frontend show the correct state:

**File:** [backend/internal/api/handlers/crowdsec_handler.go](../../backend/internal/api/handlers/crowdsec_handler.go)

```go
func (h *CrowdsecHandler) Status(c *gin.Context) {
    ctx := c.Request.Context()
    running, pid, err := h.Executor.Status(ctx, h.DataDir)
    if err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
        return
    }

    // Check setting state
    settingEnabled := false
    if h.DB != nil {
        var setting models.Setting
        if err := h.DB.Where("key = ?", "security.crowdsec.enabled").First(&setting).Error; err == nil {
            settingEnabled = strings.EqualFold(strings.TrimSpace(setting.Value), "true")
        }
    }

    // Check LAPI connectivity if process is running
    lapiReady := false
    if running {
        // ... existing LAPI check ...
    }

    c.JSON(http.StatusOK, gin.H{
        "running":         running,
        "pid":             pid,
        "lapi_ready":      lapiReady,
        "setting_enabled": settingEnabled,
        "needs_start":     settingEnabled && !running, // NEW: hint for frontend
    })
}
```

---

## Issue 2: 500 Error When Stopping CrowdSec

### Root Cause Analysis

The 500 error occurs in [crowdsec_exec.go#L37-L53](../../backend/internal/api/handlers/crowdsec_exec.go):

```go
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
    b, err := os.ReadFile(e.pidFile(configDir))
    if err != nil {
        return fmt.Errorf("pid file read: %w", err) // <-- 500 error here
    }
    // ...
}
```

**The Problem:**
1. The PID file at `/app/data/crowdsec/crowdsec.pid` doesn't exist
2. This happens when:
   - CrowdSec was never started via the handlers
   - The container was restarted (PID file lost)
   - CrowdSec was started externally rather than via the Charon handlers

### Fix Required

Modify `Stop()` in [crowdsec_exec.go](../../backend/internal/api/handlers/crowdsec_exec.go) to handle a missing PID file gracefully:

```go
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
    b, err := os.ReadFile(e.pidFile(configDir))
    if err != nil {
        if os.IsNotExist(err) {
            // PID file doesn't exist - process likely not running or was started externally
            // Try to find and stop any running crowdsec process
            return e.stopByProcessName(ctx)
        }
        return fmt.Errorf("pid file read: %w", err)
    }
    pid, err := strconv.Atoi(string(b))
    if err != nil {
        return fmt.Errorf("invalid pid: %w", err)
    }
    proc, err := os.FindProcess(pid)
    if err != nil {
        return err
    }
    if err := proc.Signal(syscall.SIGTERM); err != nil {
        // Process might already be dead
        if errors.Is(err, os.ErrProcessDone) {
            _ = os.Remove(e.pidFile(configDir))
            return nil
        }
        return err
    }
    _ = os.Remove(e.pidFile(configDir))
    return nil
}

// stopByProcessName attempts to stop CrowdSec by finding it via process name
func (e *DefaultCrowdsecExecutor) stopByProcessName(ctx context.Context) error {
    // Use pkill to find and signal the crowdsec process
    cmd := exec.CommandContext(ctx, "pkill", "-TERM", "crowdsec")
    err := cmd.Run()
    if err != nil {
        // pkill returns exit code 1 if no processes matched - that's OK
        if exitErr, ok := err.(*exec.ExitError); ok && exitErr.ExitCode() == 1 {
            return nil // No process to kill, already stopped
        }
        return fmt.Errorf("failed to stop crowdsec by process name: %w", err)
    }
    return nil
}
```

**File:** [backend/internal/api/handlers/crowdsec_exec.go](../../backend/internal/api/handlers/crowdsec_exec.go)

---

## Issue 3: Live Log Viewer Disconnected on Cerberus Dashboard

### Root Cause Analysis

The Live Log Viewer uses two WebSocket endpoints:

1. **Application Logs** (`/api/v1/logs/live`) - Works via the `BroadcastHook` in the logger
2. **Security Logs** (`/api/v1/cerberus/logs/ws`) - Requires `LogWatcher` to tail the access log file

The Cerberus Security Logs WebSocket ([cerberus_logs_ws.go](../../backend/internal/api/handlers/cerberus_logs_ws.go)) depends on `LogWatcher`, which tails `/var/log/caddy/access.log`.

**The Problem:**

In [log_watcher.go#L102-L117](../../backend/internal/services/log_watcher.go):
```go
func (w *LogWatcher) tailFile() {
    for {
        // Wait for file to exist
        if _, err := os.Stat(w.logPath); os.IsNotExist(err) {
            logger.Log().WithField("path", w.logPath).Debug("Log file not found, waiting...")
            time.Sleep(time.Second)
            continue
        }
        // ...
    }
}
```

After a Docker rebuild:
1. Caddy may not have written any logs yet
2. `/var/log/caddy/access.log` doesn't exist
3. `LogWatcher` enters an infinite "waiting" loop
4. No log entries are ever sent to WebSocket clients
5. The frontend shows "disconnected" because no heartbeat or data is received

### Why "Disconnected" Appears

From [cerberus_logs_ws.go#L79-L83](../../backend/internal/api/handlers/cerberus_logs_ws.go):
```go
case <-ticker.C:
    // Send ping to keep connection alive
    if err := conn.WriteMessage(websocket.PingMessage, []byte{}); err != nil {
        return
    }
```

The ping is sent every 30 seconds, but if the frontend's WebSocket connection times out or hits an error before receiving any message, it shows "disconnected".

### Fix Required

**Fix 1:** Create the log file if it is missing in `LogWatcher.Start()`:

**File:** [backend/internal/services/log_watcher.go](../../backend/internal/services/log_watcher.go)

```go
import "path/filepath"

func (w *LogWatcher) Start(ctx context.Context) error {
    w.mu.Lock()
    if w.started {
        w.mu.Unlock()
        return nil
    }
    w.started = true
    w.mu.Unlock()

    // Ensure log file exists
    logDir := filepath.Dir(w.logPath)
    if err := os.MkdirAll(logDir, 0755); err != nil {
        logger.Log().WithError(err).Warn("Failed to create log directory")
    }
    if _, err := os.Stat(w.logPath); os.IsNotExist(err) {
        if f, err := os.Create(w.logPath); err == nil {
            f.Close()
            logger.Log().WithField("path", w.logPath).Info("Created empty log file for tailing")
        }
    }

    go w.tailFile()
    logger.Log().WithField("path", w.logPath).Info("LogWatcher started")
    return nil
}
```

**Fix 2:** Send an initial heartbeat message on WebSocket connect:

**File:** [backend/internal/api/handlers/cerberus_logs_ws.go](../../backend/internal/api/handlers/cerberus_logs_ws.go)

```go
func (h *CerberusLogsHandler) LiveLogs(c *gin.Context) {
    // ... existing upgrade code ...

    logger.Log().WithField("subscriber_id", subscriberID).Info("Cerberus logs WebSocket connected")

    // Send connection confirmation immediately
    _ = conn.WriteJSON(map[string]interface{}{
        "type":      "connected",
        "timestamp": time.Now().Format(time.RFC3339),
    })

    // ... rest unchanged ...
}
```

---

## Summary of Required Changes

### File 1: [backend/internal/api/handlers/crowdsec_exec.go](../../backend/internal/api/handlers/crowdsec_exec.go)

**Change:** Make `Stop()` handle a missing PID file gracefully

```go
// Add import for exec
import "os/exec"

// Add this method
func (e *DefaultCrowdsecExecutor) stopByProcessName(ctx context.Context) error {
    cmd := exec.CommandContext(ctx, "pkill", "-TERM", "crowdsec")
    err := cmd.Run()
    if err != nil {
        if exitErr, ok := err.(*exec.ExitError); ok && exitErr.ExitCode() == 1 {
            return nil
        }
        return fmt.Errorf("failed to stop crowdsec by process name: %w", err)
    }
    return nil
}

// Modify Stop()
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
    b, err := os.ReadFile(e.pidFile(configDir))
    if err != nil {
        if os.IsNotExist(err) {
            return e.stopByProcessName(ctx)
        }
        return fmt.Errorf("pid file read: %w", err)
    }
    // ... rest unchanged ...
}
```

### File 2: [backend/internal/services/log_watcher.go](../../backend/internal/services/log_watcher.go)

**Change:** Ensure the log file exists before starting the tail

```go
import "path/filepath"

func (w *LogWatcher) Start(ctx context.Context) error {
    w.mu.Lock()
    if w.started {
        w.mu.Unlock()
        return nil
    }
    w.started = true
    w.mu.Unlock()

    // Ensure log file exists
    logDir := filepath.Dir(w.logPath)
    if err := os.MkdirAll(logDir, 0755); err != nil {
        logger.Log().WithError(err).Warn("Failed to create log directory")
    }
    if _, err := os.Stat(w.logPath); os.IsNotExist(err) {
        if f, err := os.Create(w.logPath); err == nil {
            f.Close()
        }
    }

    go w.tailFile()
    logger.Log().WithField("path", w.logPath).Info("LogWatcher started")
    return nil
}
```

### File 3: [backend/internal/api/handlers/cerberus_logs_ws.go](../../backend/internal/api/handlers/cerberus_logs_ws.go)

**Change:** Send a connection confirmation on WebSocket connect

```go
func (h *CerberusLogsHandler) LiveLogs(c *gin.Context) {
    // ... existing upgrade code ...

    logger.Log().WithField("subscriber_id", subscriberID).Info("Cerberus logs WebSocket connected")

    // Send connection confirmation immediately
    _ = conn.WriteJSON(map[string]interface{}{
        "type":      "connected",
        "timestamp": time.Now().Format(time.RFC3339),
    })

    // ... rest unchanged ...
}
```

### File 4: [backend/internal/api/handlers/crowdsec_handler.go](../../backend/internal/api/handlers/crowdsec_handler.go)

**Change:** Add a setting reconciliation hint to the Status response

```go
func (h *CrowdsecHandler) Status(c *gin.Context) {
    ctx := c.Request.Context()
    running, pid, err := h.Executor.Status(ctx, h.DataDir)
    if err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
        return
    }

    // Check setting state
    settingEnabled := false
    if h.DB != nil {
        var setting models.Setting
        if err := h.DB.Where("key = ?", "security.crowdsec.enabled").First(&setting).Error; err == nil {
            settingEnabled = strings.EqualFold(strings.TrimSpace(setting.Value), "true")
        }
    }

    // Check LAPI connectivity if process is running
    lapiReady := false
    if running {
        // ... existing LAPI check ...
    }

    c.JSON(http.StatusOK, gin.H{
        "running":         running,
        "pid":             pid,
        "lapi_ready":      lapiReady,
        "setting_enabled": settingEnabled,
        "needs_start":     settingEnabled && !running,
    })
}
```

---

## Testing Steps

### Test Issue 1: CrowdSec Status Consistency

1. Start a fresh container
2. Check the Security dashboard - it should show CrowdSec as "Disabled"
3. Toggle CrowdSec on - it should start the process and show "Running"
4. Restart the container
5. Check the Security dashboard - it should show "needs restart" or auto-start

### Test Issue 2: Stop CrowdSec Without Error

1. With CrowdSec not running, try to stop it via the UI toggle
2. It should NOT return a 500 error
3. It should return success or "already stopped"
4. Check the logs for graceful handling

### Test Issue 3: Live Logs Connection

1. Start a fresh container
2. Navigate to the Cerberus Dashboard
3. The Live Log Viewer should show "Connected"
4. Make a request to trigger a log entry
5. The entry should appear in the viewer

### Integration Test

```bash
# Run in container
cd /projects/Charon/backend
go test ./internal/api/handlers/... -run TestCrowdsec -v
```

---

## Debug Commands

```bash
# Check if the CrowdSec PID file exists
ls -la /app/data/crowdsec/crowdsec.pid

# Check CrowdSec process status
pgrep -la crowdsec

# Check the access log file
ls -la /var/log/caddy/access.log

# Test LAPI health
curl http://127.0.0.1:8085/health

# Check the WebSocket endpoint
# In browser console:
# new WebSocket('ws://localhost:8080/api/v1/cerberus/logs/ws')
```

---

## Conclusion

All three issues stem from **state synchronization problems** after a container restart:

1. **CrowdSec**: The database setting doesn't match the process state
2. **Stop Error**: The handler assumes the PID file exists when it may not
3. **Live Logs**: The log file may not exist, causing LogWatcher to wait indefinitely

The fixes are defensive programming patterns:
- Handle a missing PID file gracefully
- Create log files if they don't exist
- Add reconciliation hints to status responses
- Send WebSocket heartbeats immediately on connect

---

## Commit Message Template

```
fix: handle container restart edge cases for CrowdSec and Live Logs

Issue 1 - CrowdSec "not running" status:
- Add setting_enabled and needs_start fields to Status() response
- Frontend can now show proper "needs restart" state

Issue 2 - 500 error on Stop:
- Handle missing PID file gracefully in Stop()
- Fall back to pkill if the PID file doesn't exist
- Return success if process already stopped

Issue 3 - Live Logs disconnected:
- Create log file if it doesn't exist on LogWatcher.Start()
- Send WebSocket connection confirmation immediately
- Ensure clients know connection is alive before first log entry

All fixes are defensive programming patterns for container restart scenarios.
```