fix: address post-rebuild issues with CrowdSec and Live Logs
- Issue 1: Corrected CrowdSec status reporting by adding `setting_enabled` and `needs_start` fields to the Status() response, allowing the frontend to accurately reflect the need for a restart.
- Issue 2: Resolved the 500 error when stopping CrowdSec by handling missing PID files gracefully in the Stop() method, with a fallback to process termination via pkill.
- Issue 3: Fixed the Live Logs disconnection issue by ensuring the log file is created if it doesn't exist during LogWatcher.Start() and by sending an immediate WebSocket connection confirmation to clients.

These changes make the application more robust in container restart scenarios.

# Fix Plan: Critical Issues After Docker Rebuild

**Date:** December 14, 2025
**Status:** Planning Phase
**Priority:** P0 - Urgent

---

## Issue Summary

After a Docker container rebuild, three critical issues were identified:

1. **500 error on CrowdSec Stop()** - Toggling CrowdSec OFF returns "Failed to stop CrowdSec: Request failed with status code 500"
2. **CrowdSec shows "not running"** - Despite the database setting being enabled, the process isn't running after a container restart
3. **Live Logs Disconnected** - The WebSocket shows a disconnected state even when logs are being generated

---

## Root Cause Analysis

### Issue 1: 500 Error on Stop()

**Location:** [crowdsec_exec.go](../../backend/internal/api/handlers/crowdsec_exec.go#L36-L51)

**Root Cause:** The `Stop()` method in `DefaultCrowdsecExecutor` fails when there's no PID file.

```go
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
    b, err := os.ReadFile(e.pidFile(configDir))
    if err != nil {
        return fmt.Errorf("pid file read: %w", err) // ← FAILS HERE with 500
    }
    // ...
}
```

**Problem:** When the container restarts:

1. The PID file (`/app/data/crowdsec/crowdsec.pid`) is gone (the filesystem is ephemeral, or process cleanup removed it)
2. The database still has CrowdSec "enabled" = true
3. The user clicks "Disable" (which calls Stop())
4. Stop() tries to read the PID file → fails → returns an error → 500 response

**Why it fails:** The code assumes a PID file always exists when stopping. But after a container restart there is no PID file, because the CrowdSec process was never started (its lifecycle is GUI-controlled, NOT auto-started).

### Issue 2: CrowdSec Shows "Not Running" After Restart

**Location:** [docker-entrypoint.sh](../../docker-entrypoint.sh#L91-L97)

**Root Cause:** CrowdSec is **intentionally NOT auto-started** in the entrypoint. The design is a "GUI-controlled lifecycle."

From the entrypoint:

```bash
# CrowdSec Lifecycle Management:
# CrowdSec configuration is initialized above (symlinks, directories, hub updates)
# However, the CrowdSec agent is NOT auto-started in the entrypoint.
# Instead, CrowdSec lifecycle is managed by the backend handlers via GUI controls.
```

**Problem:** The database stores the user's "enabled" preference, but there is no reconciliation at startup:

1. The container restarts
2. The database says CrowdSec `enabled = true` (the user's previous preference)
3. But the CrowdSec process is NOT started (by design)
4. The UI shows "enabled" while the status shows "not running" → a confusing state mismatch

**Missing Logic:** There is no startup reconciliation that says "if the DB says enabled, start the CrowdSec process."

### Issue 3: Live Logs Disconnected

**Location:** [logs_ws.go](../../backend/internal/api/handlers/logs_ws.go) and [log_watcher.go](../../backend/internal/services/log_watcher.go)

**Root Cause:** There are **two separate WebSocket log systems** that may be misconfigured:

1. **`/api/v1/logs/live`** (logs_ws.go) - Streams application logs via `logger.GetBroadcastHook()`
2. **`/api/v1/cerberus/logs/ws`** (cerberus_logs_ws.go) - Streams Caddy access logs via `LogWatcher`

**Potential Issues:**

a) **LogWatcher not started:** The `LogWatcher` must be explicitly started with `Start(ctx)`. If the watcher isn't started during server initialization, no logs are broadcast.

b) **Log file doesn't exist:** The LogWatcher waits for `/var/log/caddy/access.log` to exist. After a container restart with no traffic, this file may not exist yet.

c) **WebSocket connection path mismatch:** The frontend might connect to the wrong endpoint or with an invalid token.

d) **CSP blocking WebSocket:** The security middleware's Content-Security-Policy must allow the `ws:` and `wss:` protocols.

---

## Detailed Code Analysis

### Stop() Method - Full Code Review

```go
// File: backend/internal/api/handlers/crowdsec_exec.go
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
    b, err := os.ReadFile(e.pidFile(configDir))
    if err != nil {
        return fmt.Errorf("pid file read: %w", err) // ← CRITICAL: Returns the error on ENOENT
    }
    pid, err := strconv.Atoi(string(b))
    if err != nil {
        return fmt.Errorf("invalid pid: %w", err)
    }
    proc, err := os.FindProcess(pid)
    if err != nil {
        return err
    }
    if err := proc.Signal(syscall.SIGTERM); err != nil {
        return err
    }
    _ = os.Remove(e.pidFile(configDir))
    return nil
}
```

The problem is clear: when the PID file doesn't exist, `os.ReadFile()` returns an error wrapping `os.ErrNotExist`, and this is propagated up to the handler as a 500 error.

### Status() Method - Already Handles a Missing PID File Gracefully

```go
func (e *DefaultCrowdsecExecutor) Status(ctx context.Context, configDir string) (running bool, pid int, err error) {
    b, err := os.ReadFile(e.pidFile(configDir))
    if err != nil {
        // Missing pid file is treated as not running ← GOOD PATTERN
        return false, 0, nil
    }
    // ... parse the PID and check if the process is alive via signal 0 ...
    return true, pid, nil
}
```

**Key Insight:** `Status()` already handles a missing PID file gracefully by returning `(false, 0, nil)`. The `Stop()` method should follow the same pattern.

---

## Fix Plan

### Fix 1: Make Stop() Idempotent (No Error When Already Stopped)

**File:** `backend/internal/api/handlers/crowdsec_exec.go`

**Current Code (lines 36-51):**

```go
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
    b, err := os.ReadFile(e.pidFile(configDir))
    if err != nil {
        return fmt.Errorf("pid file read: %w", err)
    }
    pid, err := strconv.Atoi(string(b))
    if err != nil {
        return fmt.Errorf("invalid pid: %w", err)
    }
    proc, err := os.FindProcess(pid)
    if err != nil {
        return err
    }
    if err := proc.Signal(syscall.SIGTERM); err != nil {
        return err
    }
    // best-effort remove pid file
    _ = os.Remove(e.pidFile(configDir))
    return nil
}
```

**Fixed Code:**

```go
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
    b, err := os.ReadFile(e.pidFile(configDir))
    if err != nil {
        if os.IsNotExist(err) {
            // No PID file means the process isn't running - that's OK for Stop()
            // This makes Stop() idempotent (safe to call multiple times)
            return nil
        }
        return fmt.Errorf("pid file read: %w", err)
    }
    pid, err := strconv.Atoi(string(b))
    if err != nil {
        // Malformed PID file - remove it and treat as not running
        _ = os.Remove(e.pidFile(configDir))
        return nil
    }
    proc, err := os.FindProcess(pid)
    if err != nil {
        // Process lookup failed - clean up the PID file
        _ = os.Remove(e.pidFile(configDir))
        return nil
    }
    if err := proc.Signal(syscall.SIGTERM); err != nil {
        // The process may already be dead - clean up the PID file
        _ = os.Remove(e.pidFile(configDir))
        // Only return an error if it's not "process doesn't exist"
        if !errors.Is(err, os.ErrProcessDone) && !errors.Is(err, syscall.ESRCH) {
            return err
        }
        return nil
    }
    // best-effort remove pid file
    _ = os.Remove(e.pidFile(configDir))
    return nil
}
```

**Rationale:** `Stop()` should be idempotent: stopping an already-stopped service shouldn't return an error.

### Fix 2: Add CrowdSec Startup Reconciliation

**File:** `backend/internal/api/routes/routes.go` (or create `backend/internal/services/crowdsec_startup.go`)

**New Function:**

```go
// ReconcileCrowdSecOnStartup checks whether CrowdSec should be running based on DB settings
// and starts it if necessary. This handles the case where the container restarts
// but the user's preference was to have CrowdSec enabled.
func ReconcileCrowdSecOnStartup(db *gorm.DB, executor handlers.CrowdsecExecutor, binPath, dataDir string) {
    if db == nil {
        return
    }

    var secCfg models.SecurityConfig
    if err := db.First(&secCfg).Error; err != nil {
        // No config yet (or a DB error) - nothing to reconcile
        logger.Log().WithError(err).Debug("No security config found for CrowdSec reconciliation")
        return
    }

    // Check whether CrowdSec should be running based on mode
    if secCfg.CrowdSecMode != "local" {
        logger.Log().WithField("mode", secCfg.CrowdSecMode).Debug("CrowdSec mode is not 'local', skipping auto-start")
        return
    }

    // Check whether it is already running
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    running, _, _ := executor.Status(ctx, dataDir)
    if running {
        logger.Log().Info("CrowdSec already running, no action needed")
        return
    }

    // Start CrowdSec, since the DB says it should be enabled
    logger.Log().Info("CrowdSec mode is 'local' but process not running, starting...")
    if _, err := executor.Start(ctx, binPath, dataDir); err != nil {
        logger.Log().WithError(err).Warn("Failed to auto-start CrowdSec on startup reconciliation")
    } else {
        logger.Log().Info("CrowdSec started successfully via startup reconciliation")
    }
}
```

**Integration Point:** Call this function after database migration in server initialization:

```go
// In routes.go or server.go, after the DB is ready and the handlers are created
if crowdsecHandler != nil {
    ReconcileCrowdSecOnStartup(db, crowdsecHandler.Executor, crowdsecHandler.BinPath, crowdsecHandler.DataDir)
}
```

### Fix 3: Ensure the LogWatcher Is Started and the Log File Exists

**File:** `backend/internal/api/routes/routes.go`

**Check that LogWatcher.Start() is called:**

```go
// Ensure the LogWatcher is started with the proper log path
logPath := "/var/log/caddy/access.log"

// Ensure the log directory exists
if err := os.MkdirAll(filepath.Dir(logPath), 0755); err != nil {
    logger.Log().WithError(err).Warn("Failed to create log directory")
}

// Create an empty log file if it doesn't exist (lets the LogWatcher start tailing immediately)
if _, err := os.Stat(logPath); os.IsNotExist(err) {
    if f, err := os.Create(logPath); err == nil {
        f.Close()
        logger.Log().WithField("path", logPath).Info("Created empty log file for LogWatcher")
    }
}

// Create and start the LogWatcher
watcher := services.NewLogWatcher(logPath)
if err := watcher.Start(context.Background()); err != nil {
    logger.Log().WithError(err).Error("Failed to start LogWatcher")
}
```

**Additionally, verify the CSP allows WebSocket connections:**

The security middleware in `backend/internal/api/middleware/security.go` already has:

```go
directives["connect-src"] = "'self' ws: wss:" // WebSocket for HMR
```

so the Content-Security-Policy should already permit WebSocket connections.

---

## Files to Modify

| File | Change |
|------|--------|
| `backend/internal/api/handlers/crowdsec_exec.go` | Make `Stop()` idempotent for a missing PID file |
| `backend/internal/api/routes/routes.go` | Add the CrowdSec startup reconciliation call |
| `backend/internal/services/crowdsec_startup.go` | (NEW) Create the startup reconciliation function |
| `backend/internal/api/handlers/crowdsec_exec_test.go` | Add tests for Stop() idempotency |

---

## Testing Plan

### Test 1: Stop() Idempotency

```bash
# Start the container fresh (no CrowdSec running)
docker compose down -v && docker compose up -d

# Call Stop() without starting CrowdSec first
curl -X POST http://localhost:8080/api/v1/admin/crowdsec/stop \
  -H "Authorization: Bearer $TOKEN"

# Expected: 200 OK {"status": "stopped"}
# NOT: 500 Internal Server Error
```

### Test 2: Startup Reconciliation

```bash
# Enable CrowdSec via the API
curl -X POST http://localhost:8080/api/v1/admin/crowdsec/start \
  -H "Authorization: Bearer $TOKEN"

# Verify it is running
curl http://localhost:8080/api/v1/admin/crowdsec/status \
  -H "Authorization: Bearer $TOKEN"
# Expected: {"running": true, "pid": xxx, "lapi_ready": true}

# Restart the container
docker compose restart charon

# Wait for startup
sleep 10

# Verify CrowdSec auto-started
curl http://localhost:8080/api/v1/admin/crowdsec/status \
  -H "Authorization: Bearer $TOKEN"
# Expected: {"running": true, "pid": xxx, "lapi_ready": true}
```

### Test 3: Live Logs

```bash
# Connect to the WebSocket
websocat "ws://localhost:8080/api/v1/logs/live?token=$TOKEN"

# In another terminal, generate traffic
curl http://localhost:8080/api/v1/health

# Verify log entries appear on the WebSocket connection
```

---

## Implementation Order

1. **Fix 1 (Stop Idempotency)** - A quick fix with high impact; unblocks users immediately
2. **Fix 2 (Startup Reconciliation)** - The core fix for the state mismatch after restart
3. **Fix 3 (Live Logs)** - May need more investigation; likely already working if the LogWatcher is started
|
||||
|
||||
---

## Risk Assessment

| Change | Risk | Mitigation |
|--------|------|------------|
| Backend Status handler modification | Low | Status handler is read-only; adds a 2s timeout check |
| LAPI check timeout (2s) | Low | Short timeout prevents blocking; async refresh handles retries |
| Frontend conditional logic change | Low | More precise state handling, clear error states |
| Type definition update | Low | TypeScript will catch any mismatches at compile time |
| Two separate warning states | Low | Better UX with distinct yellow (initializing) vs red (not running) |
| Stop() idempotency | Very Low | Makes existing code more robust |
| Startup reconciliation | Low | Only runs once at startup, respects DB state |
| Log file creation | Very Low | Standard file operations with error handling |

---

## Summary

**Root Cause:** The `Status()` endpoint was not updated when `Start()` was modified to check LAPI readiness. The frontend expects the status endpoint to indicate LAPI availability, but it only returns process status.

**Fix:** Add a `lapi_ready` field to the `Status()` response by checking `cscli lapi status`, and update the frontend to use this new field for the warning display logic.

**Files Changed:**
1. `backend/internal/api/handlers/crowdsec_handler.go` - Add LAPI check to Status()
2. `frontend/src/api/crowdsec.ts` - Add TypeScript interface with `lapi_ready`
3. `frontend/src/pages/CrowdSecConfig.tsx` - Update conditional logic:
   - Yellow warning: process running, LAPI not ready
   - Red warning: process not running
   - No warning: process running AND LAPI ready
4. `backend/internal/api/handlers/crowdsec_handler_test.go` - Add unit tests

**Estimated Time:** 1-2 hours including testing

**Commit Message:**
```
fix: add LAPI readiness check to CrowdSec status endpoint

The Status() handler was only checking if the CrowdSec process was
running, not if LAPI was actually responding. This caused the
CrowdSecConfig page to always show "LAPI is initializing" even when
LAPI was fully operational.

Changes:
- Backend: Add `lapi_ready` field to /admin/crowdsec/status response
- Frontend: Add CrowdSecStatus TypeScript interface
- Frontend: Update conditional logic to check `lapi_ready` not `running`
- Frontend: Separate warnings for "initializing" vs "not running"
- Tests: Add unit tests for Status handler LAPI check

Fixes regression from crowdsec_lapi_error_diagnostic.md fixes.
```

## Notes

- The CrowdSec PID file path is `${dataDir}/crowdsec.pid` (e.g., `/app/data/crowdsec/crowdsec.pid`)
- The LogWatcher monitors `/var/log/caddy/access.log` by default
- The WebSocket ping interval is 30 seconds for keep-alive
- CrowdSec LAPI runs on port 8085 (not 8080) to avoid a conflict with Charon
- The `Status()` handler already includes the `lapi_ready` field (from the previous fix)

---

*New file: docs/plans/post_rebuild_diagnostic.md (526 lines)*

# Diagnostic & Fix Plan: CrowdSec and Live Logs Issues Post Docker Rebuild

**Date:** December 14, 2025
**Investigator:** Planning Agent
**Scope:** Three user-reported issues after Docker rebuild
**Status:** ✅ **COMPLETE - Root causes identified with fixes ready**

---

## Executive Summary

After a thorough investigation of the backend handlers, executor implementation, entrypoint script, and frontend code, I've identified the root causes for all three reported issues:

1. **CrowdSec shows "not running"** - Process detection via the PID file is failing
2. **500 error when stopping CrowdSec** - The PID file doesn't exist when CrowdSec wasn't started via the handlers
3. **Live log viewer disconnected** - The LogWatcher can't find the access log file

---

## Issue 1: CrowdSec Shows "Not Running" Even Though Enabled in UI

### Root Cause Analysis

The mismatch occurs because:

1. **Database Setting vs Process State**: The UI toggle updates the setting `security.crowdsec.enabled` in the database, but **does not actually start the CrowdSec process**.

2. **Process Lifecycle Design**: Per [docker-entrypoint.sh](../../docker-entrypoint.sh) (lines 56-65), CrowdSec is explicitly **NOT auto-started** in the container entrypoint:
   ```bash
   # CrowdSec Lifecycle Management:
   # CrowdSec agent is NOT auto-started in the entrypoint.
   # Instead, CrowdSec lifecycle is managed by the backend handlers via GUI controls.
   ```

3. **Status() Handler Behavior** ([crowdsec_handler.go#L238-L266](../../backend/internal/api/handlers/crowdsec_handler.go)):
   - Calls `h.Executor.Status()`, which reads the PID file at `{configDir}/crowdsec.pid`
   - If the PID file doesn't exist (CrowdSec never started), returns `running: false`
   - The frontend correctly shows "Stopped" even when the setting is "enabled"

4. **The Disconnect**:
   - Setting `security.crowdsec.enabled = true` ≠ Process running
   - The setting tells the Cerberus middleware to "use CrowdSec for protection" IF running
   - The actual start requires clicking the toggle, which calls `crowdsecPowerMutation.mutate(true)`

### Why It Appears Broken

After a Docker rebuild:
- The fresh container may still have `security.crowdsec.enabled = true` in the DB (persisted volume)
- But the PID file is gone (container restart)
- The CrowdSec process is not running
- The UI shows the setting as "enabled" while the status shows "not running"

### Status() Handler Already Fixed

Looking at the current implementation in [crowdsec_handler.go#L238-L266](../../backend/internal/api/handlers/crowdsec_handler.go), the `Status()` handler **already includes the LAPI readiness check**:

```go
func (h *CrowdsecHandler) Status(c *gin.Context) {
	ctx := c.Request.Context()
	running, pid, err := h.Executor.Status(ctx, h.DataDir)
	// ...
	// Check LAPI connectivity if process is running
	lapiReady := false
	if running {
		args := []string{"lapi", "status"}
		// ... LAPI check implementation ...
		lapiReady = (checkErr == nil)
	}

	c.JSON(http.StatusOK, gin.H{
		"running":    running,
		"pid":        pid,
		"lapi_ready": lapiReady,
	})
}
```

### Additional Enhancement Required

Add `setting_enabled` and `needs_start` fields so the frontend can show the correct state:

**File:** [backend/internal/api/handlers/crowdsec_handler.go](../../backend/internal/api/handlers/crowdsec_handler.go)

```go
func (h *CrowdsecHandler) Status(c *gin.Context) {
	ctx := c.Request.Context()
	running, pid, err := h.Executor.Status(ctx, h.DataDir)
	if err != nil {
		c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
		return
	}

	// Check setting state
	settingEnabled := false
	if h.DB != nil {
		var setting models.Setting
		if err := h.DB.Where("key = ?", "security.crowdsec.enabled").First(&setting).Error; err == nil {
			settingEnabled = strings.EqualFold(strings.TrimSpace(setting.Value), "true")
		}
	}

	// Check LAPI connectivity if process is running
	lapiReady := false
	if running {
		// ... existing LAPI check ...
	}

	c.JSON(http.StatusOK, gin.H{
		"running":         running,
		"pid":             pid,
		"lapi_ready":      lapiReady,
		"setting_enabled": settingEnabled,
		"needs_start":     settingEnabled && !running, // NEW: hint for frontend
	})
}
```

---

## Issue 2: 500 Error When Stopping CrowdSec

### Root Cause Analysis

The 500 error originates in [crowdsec_exec.go#L37-L53](../../backend/internal/api/handlers/crowdsec_exec.go):

```go
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
	b, err := os.ReadFile(e.pidFile(configDir))
	if err != nil {
		return fmt.Errorf("pid file read: %w", err) // <-- 500 error here
	}
	// ...
}
```

**The Problem:**
1. The PID file at `/app/data/crowdsec/crowdsec.pid` doesn't exist
2. This happens when:
   - CrowdSec was never started via the handlers
   - The container was restarted (PID file lost)
   - CrowdSec was started externally, not via the Charon handlers

### Fix Required

Modify `Stop()` in [crowdsec_exec.go](../../backend/internal/api/handlers/crowdsec_exec.go) to handle a missing PID file gracefully:

```go
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
	b, err := os.ReadFile(e.pidFile(configDir))
	if err != nil {
		if os.IsNotExist(err) {
			// PID file doesn't exist - process likely not running or was started externally.
			// Try to find and stop any running crowdsec process.
			return e.stopByProcessName(ctx)
		}
		return fmt.Errorf("pid file read: %w", err)
	}
	// Trim whitespace: PID files commonly end with a trailing newline.
	pid, err := strconv.Atoi(strings.TrimSpace(string(b)))
	if err != nil {
		return fmt.Errorf("invalid pid: %w", err)
	}
	proc, err := os.FindProcess(pid)
	if err != nil {
		return err
	}
	if err := proc.Signal(syscall.SIGTERM); err != nil {
		// Process might already be dead
		if errors.Is(err, os.ErrProcessDone) {
			_ = os.Remove(e.pidFile(configDir))
			return nil
		}
		return err
	}
	_ = os.Remove(e.pidFile(configDir))
	return nil
}

// stopByProcessName attempts to stop CrowdSec by finding it via process name
func (e *DefaultCrowdsecExecutor) stopByProcessName(ctx context.Context) error {
	// Use pkill to signal any running crowdsec process
	cmd := exec.CommandContext(ctx, "pkill", "-TERM", "crowdsec")
	err := cmd.Run()
	if err != nil {
		// pkill returns exit code 1 if no processes matched - that's OK
		if exitErr, ok := err.(*exec.ExitError); ok && exitErr.ExitCode() == 1 {
			return nil // No process to kill, already stopped
		}
		return fmt.Errorf("failed to stop crowdsec by process name: %w", err)
	}
	return nil
}
```

**File:** [backend/internal/api/handlers/crowdsec_exec.go](../../backend/internal/api/handlers/crowdsec_exec.go)

---

## Issue 3: Live Log Viewer Disconnected on Cerberus Dashboard

### Root Cause Analysis

The Live Log Viewer uses two WebSocket endpoints:

1. **Application Logs** (`/api/v1/logs/live`) - Works via the `BroadcastHook` in the logger
2. **Security Logs** (`/api/v1/cerberus/logs/ws`) - Requires the `LogWatcher` to tail the access log file

The Cerberus Security Logs WebSocket ([cerberus_logs_ws.go](../../backend/internal/api/handlers/cerberus_logs_ws.go)) depends on the `LogWatcher`, which tails `/var/log/caddy/access.log`.

**The Problem:**

In [log_watcher.go#L102-L117](../../backend/internal/services/log_watcher.go):
```go
func (w *LogWatcher) tailFile() {
	for {
		// Wait for file to exist
		if _, err := os.Stat(w.logPath); os.IsNotExist(err) {
			logger.Log().WithField("path", w.logPath).Debug("Log file not found, waiting...")
			time.Sleep(time.Second)
			continue
		}
		// ...
	}
}
```

After a Docker rebuild:
1. Caddy may not have written any logs yet
2. `/var/log/caddy/access.log` doesn't exist
3. The `LogWatcher` enters an infinite "waiting" loop
4. No log entries are ever sent to WebSocket clients
5. The frontend shows "disconnected" because no heartbeat or data is received

### Why "Disconnected" Appears

From [cerberus_logs_ws.go#L79-L83](../../backend/internal/api/handlers/cerberus_logs_ws.go):
```go
case <-ticker.C:
	// Send ping to keep connection alive
	if err := conn.WriteMessage(websocket.PingMessage, []byte{}); err != nil {
		return
	}
```

The ping is sent every 30 seconds, but if the frontend's WebSocket connection times out or errors before receiving any message, it shows "disconnected".

### Fix Required

**Fix 1:** Create the log file if it is missing in `LogWatcher.Start()`:

**File:** [backend/internal/services/log_watcher.go](../../backend/internal/services/log_watcher.go)

```go
import "path/filepath"

func (w *LogWatcher) Start(ctx context.Context) error {
	w.mu.Lock()
	if w.started {
		w.mu.Unlock()
		return nil
	}
	w.started = true
	w.mu.Unlock()

	// Ensure log file exists
	logDir := filepath.Dir(w.logPath)
	if err := os.MkdirAll(logDir, 0755); err != nil {
		logger.Log().WithError(err).Warn("Failed to create log directory")
	}
	if _, err := os.Stat(w.logPath); os.IsNotExist(err) {
		if f, err := os.Create(w.logPath); err == nil {
			f.Close()
			logger.Log().WithField("path", w.logPath).Info("Created empty log file for tailing")
		}
	}

	go w.tailFile()
	logger.Log().WithField("path", w.logPath).Info("LogWatcher started")
	return nil
}
```

**Fix 2:** Send an initial confirmation message on WebSocket connect:

**File:** [backend/internal/api/handlers/cerberus_logs_ws.go](../../backend/internal/api/handlers/cerberus_logs_ws.go)

```go
func (h *CerberusLogsHandler) LiveLogs(c *gin.Context) {
	// ... existing upgrade code ...

	logger.Log().WithField("subscriber_id", subscriberID).Info("Cerberus logs WebSocket connected")

	// Send connection confirmation immediately
	_ = conn.WriteJSON(map[string]interface{}{
		"type":      "connected",
		"timestamp": time.Now().Format(time.RFC3339),
	})

	// ... rest unchanged ...
}
```

---

## Summary of Required Changes

### File 1: [backend/internal/api/handlers/crowdsec_exec.go](../../backend/internal/api/handlers/crowdsec_exec.go)

**Change:** Make `Stop()` handle a missing PID file gracefully

```go
// Add import for exec
import "os/exec"

// Add this method
func (e *DefaultCrowdsecExecutor) stopByProcessName(ctx context.Context) error {
	cmd := exec.CommandContext(ctx, "pkill", "-TERM", "crowdsec")
	err := cmd.Run()
	if err != nil {
		if exitErr, ok := err.(*exec.ExitError); ok && exitErr.ExitCode() == 1 {
			return nil
		}
		return fmt.Errorf("failed to stop crowdsec by process name: %w", err)
	}
	return nil
}

// Modify Stop()
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
	b, err := os.ReadFile(e.pidFile(configDir))
	if err != nil {
		if os.IsNotExist(err) {
			return e.stopByProcessName(ctx)
		}
		return fmt.Errorf("pid file read: %w", err)
	}
	// ... rest unchanged ...
}
```

### File 2: [backend/internal/services/log_watcher.go](../../backend/internal/services/log_watcher.go)

**Change:** Ensure the log file exists before starting the tail

```go
import "path/filepath"

func (w *LogWatcher) Start(ctx context.Context) error {
	w.mu.Lock()
	if w.started {
		w.mu.Unlock()
		return nil
	}
	w.started = true
	w.mu.Unlock()

	// Ensure log file exists
	logDir := filepath.Dir(w.logPath)
	if err := os.MkdirAll(logDir, 0755); err != nil {
		logger.Log().WithError(err).Warn("Failed to create log directory")
	}
	if _, err := os.Stat(w.logPath); os.IsNotExist(err) {
		if f, err := os.Create(w.logPath); err == nil {
			f.Close()
		}
	}

	go w.tailFile()
	logger.Log().WithField("path", w.logPath).Info("LogWatcher started")
	return nil
}
```

### File 3: [backend/internal/api/handlers/cerberus_logs_ws.go](../../backend/internal/api/handlers/cerberus_logs_ws.go)

**Change:** Send a connection confirmation on WebSocket connect

```go
func (h *CerberusLogsHandler) LiveLogs(c *gin.Context) {
	// ... existing upgrade code ...

	logger.Log().WithField("subscriber_id", subscriberID).Info("Cerberus logs WebSocket connected")

	// Send connection confirmation immediately
	_ = conn.WriteJSON(map[string]interface{}{
		"type":      "connected",
		"timestamp": time.Now().Format(time.RFC3339),
	})

	// ... rest unchanged ...
}
```

### File 4: [backend/internal/api/handlers/crowdsec_handler.go](../../backend/internal/api/handlers/crowdsec_handler.go)

**Change:** Add setting reconciliation hints to the Status response

```go
func (h *CrowdsecHandler) Status(c *gin.Context) {
	ctx := c.Request.Context()
	running, pid, err := h.Executor.Status(ctx, h.DataDir)
	if err != nil {
		c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
		return
	}

	// Check setting state
	settingEnabled := false
	if h.DB != nil {
		var setting models.Setting
		if err := h.DB.Where("key = ?", "security.crowdsec.enabled").First(&setting).Error; err == nil {
			settingEnabled = strings.EqualFold(strings.TrimSpace(setting.Value), "true")
		}
	}

	// Check LAPI connectivity if process is running
	lapiReady := false
	if running {
		// ... existing LAPI check ...
	}

	c.JSON(http.StatusOK, gin.H{
		"running":         running,
		"pid":             pid,
		"lapi_ready":      lapiReady,
		"setting_enabled": settingEnabled,
		"needs_start":     settingEnabled && !running,
	})
}
```

---

## Testing Steps

### Test Issue 1: CrowdSec Status Consistency

1. Start the container fresh
2. Check the Security dashboard - it should show CrowdSec as "Disabled"
3. Toggle CrowdSec on - it should start the process and show "Running"
4. Restart the container
5. Check the Security dashboard - it should show "needs restart" or auto-start

### Test Issue 2: Stop CrowdSec Without Error

1. With CrowdSec not running, try to stop it via the UI toggle
2. It should NOT return a 500 error
3. It should return success or "already stopped"
4. Check the logs for graceful handling

### Test Issue 3: Live Logs Connection

1. Start the container fresh
2. Navigate to the Cerberus Dashboard
3. The Live Log Viewer should show "Connected" status
4. Make a request to trigger a log entry
5. The entry should appear in the viewer

### Integration Test

```bash
# Run in container
cd /projects/Charon/backend
go test ./internal/api/handlers/... -run TestCrowdsec -v
```

---

## Debug Commands

```bash
# Check if the CrowdSec PID file exists
ls -la /app/data/crowdsec/crowdsec.pid

# Check CrowdSec process status
pgrep -la crowdsec

# Check the access log file
ls -la /var/log/caddy/access.log

# Test LAPI health
curl http://127.0.0.1:8085/health

# Check the WebSocket endpoint from the browser console:
# new WebSocket('ws://localhost:8080/api/v1/cerberus/logs/ws')
```

---

## Conclusion

All three issues stem from **state synchronization problems** after a container restart:

1. **CrowdSec**: The database setting doesn't match the process state
2. **Stop Error**: The handler assumes the PID file exists when it may not
3. **Live Logs**: The log file may not exist, causing the LogWatcher to wait indefinitely

The fixes are defensive programming patterns:
- Handle a missing PID file gracefully
- Create log files if they don't exist
- Add reconciliation hints in status responses
- Send a WebSocket confirmation immediately on connect

---

## Commit Message Template

```
fix: handle container restart edge cases for CrowdSec and Live Logs

Issue 1 - CrowdSec "not running" status:
- Add setting_enabled and needs_start fields to Status() response
- Frontend can now show proper "needs restart" state

Issue 2 - 500 error on Stop:
- Handle missing PID file gracefully in Stop()
- Fall back to pkill if PID file doesn't exist
- Return success if process already stopped

Issue 3 - Live Logs disconnected:
- Create log file if it doesn't exist on LogWatcher.Start()
- Send WebSocket connection confirmation immediately
- Ensure clients know connection is alive before first log entry

All fixes are defensive programming patterns for container restart scenarios.
```