- Issue 1: Corrected CrowdSec status reporting by adding `setting_enabled` and `needs_start` fields to the Status() response, allowing the frontend to accurately reflect the need for a restart. - Issue 2: Resolved 500 error on stopping CrowdSec by implementing graceful handling of missing PID files in the Stop() method, with a fallback to process termination via pkill. - Issue 3: Fixed Live Logs disconnection issue by ensuring the log file is created if it doesn't exist during LogWatcher.Start() and sending an immediate WebSocket connection confirmation to clients. These changes enhance the robustness of the application in handling container restart scenarios.
387 lines
13 KiB
Markdown
387 lines
13 KiB
Markdown
# Fix Plan: Critical Issues After Docker Rebuild
|
|
|
|
**Date:** December 14, 2025
|
|
**Status:** Planning Phase
|
|
**Priority:** P0 - Urgent
|
|
|
|
---
|
|
|
|
## Issue Summary
|
|
|
|
After a Docker container rebuild, three critical issues were identified:
|
|
|
|
1. **500 error on CrowdSec Stop()** - Toggling CrowdSec OFF returns "Failed to stop CrowdSec: Request failed with status code 500"
|
|
2. **CrowdSec shows "not running"** - Despite database setting being enabled, the process isn't running after container restart
|
|
3. **Live Logs Disconnected** - WebSocket shows disconnected state even when logs are being generated
|
|
|
|
---
|
|
|
|
## Root Cause Analysis
|
|
|
|
### Issue 1: 500 Error on Stop()
|
|
|
|
**Location:** [crowdsec_exec.go](../../backend/internal/api/handlers/crowdsec_exec.go#L36-L51)
|
|
|
|
**Root Cause:** The `Stop()` method in `DefaultCrowdsecExecutor` fails when there's no PID file.
|
|
|
|
```go
|
|
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
|
|
b, err := os.ReadFile(e.pidFile(configDir))
|
|
if err != nil {
|
|
return fmt.Errorf("pid file read: %w", err) // ← FAILS HERE with 500
|
|
}
|
|
// ...
|
|
}
|
|
```
|
|
|
|
**Problem:** When the container restarts:
|
|
1. The PID file (`/app/data/crowdsec/crowdsec.pid`) is deleted (ephemeral or process cleanup removes it)
|
|
2. The database still has CrowdSec "enabled" = true
|
|
3. User clicks "Disable" (which calls Stop())
|
|
4. Stop() tries to read the PID file → fails → returns error → 500 response
|
|
|
|
**Why it fails:** The code assumes a PID file always exists when stopping. But after container restart, there's no PID file because the CrowdSec process wasn't running (GUI-controlled lifecycle, NOT auto-started).
|
|
|
|
### Issue 2: CrowdSec Shows "Not Running" After Restart
|
|
|
|
**Location:** [docker-entrypoint.sh](../../docker-entrypoint.sh#L91-L97)
|
|
|
|
**Root Cause:** CrowdSec is **intentionally NOT auto-started** in the entrypoint. The design is "GUI-controlled lifecycle."
|
|
|
|
From the entrypoint:
|
|
```bash
|
|
# CrowdSec Lifecycle Management:
|
|
# CrowdSec configuration is initialized above (symlinks, directories, hub updates)
|
|
# However, the CrowdSec agent is NOT auto-started in the entrypoint.
|
|
# Instead, CrowdSec lifecycle is managed by the backend handlers via GUI controls.
|
|
```
|
|
|
|
**Problem:** The database stores the user's "enabled" preference, but there's no reconciliation at startup:
|
|
1. Container restarts
|
|
2. Database says CrowdSec `enabled = true` (user's previous preference)
|
|
3. But CrowdSec process is NOT started (by design)
|
|
4. UI shows "enabled" but status shows "not running" → confusing state mismatch
|
|
|
|
**Missing Logic:** No startup reconciliation to check "if DB says enabled, start CrowdSec process."
|
|
|
|
### Issue 3: Live Logs Disconnected
|
|
|
|
**Location:** [logs_ws.go](../../backend/internal/api/handlers/logs_ws.go) and [log_watcher.go](../../backend/internal/services/log_watcher.go)
|
|
|
|
**Root Cause:** There are **two separate WebSocket log systems** that may be misconfigured:
|
|
|
|
1. **`/api/v1/logs/live`** (logs_ws.go) - Streams application logs via `logger.GetBroadcastHook()`
|
|
2. **`/api/v1/cerberus/logs/ws`** (cerberus_logs_ws.go) - Streams Caddy access logs via `LogWatcher`
|
|
|
|
**Potential Issues:**
|
|
|
|
a) **LogWatcher not started:** The `LogWatcher` must be explicitly started with `Start(ctx)`. If the watcher isn't started during server initialization, no logs are broadcast.
|
|
|
|
b) **Log file doesn't exist:** The LogWatcher waits for `/var/log/caddy/access.log` to exist. After container restart with no traffic, this file may not exist yet.
|
|
|
|
c) **WebSocket connection path mismatch:** Frontend might connect to wrong endpoint or with invalid token.
|
|
|
|
d) **CSP blocking WebSocket:** Security middleware's Content-Security-Policy must allow `ws:` and `wss:` protocols.
|
|
|
|
---
|
|
|
|
## Detailed Code Analysis
|
|
|
|
### Stop() Method - Full Code Review
|
|
|
|
```go
|
|
// File: backend/internal/api/handlers/crowdsec_exec.go
|
|
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
|
|
b, err := os.ReadFile(e.pidFile(configDir))
|
|
if err != nil {
|
|
return fmt.Errorf("pid file read: %w", err) // ← CRITICAL: Returns error on ENOENT
|
|
}
|
|
pid, err := strconv.Atoi(string(b))
|
|
if err != nil {
|
|
return fmt.Errorf("invalid pid: %w", err)
|
|
}
|
|
proc, err := os.FindProcess(pid)
|
|
if err != nil {
|
|
return err
|
|
}
|
|
if err := proc.Signal(syscall.SIGTERM); err != nil {
|
|
return err
|
|
}
|
|
_ = os.Remove(e.pidFile(configDir))
|
|
return nil
|
|
}
|
|
```
|
|
|
|
The problem is clear: `os.ReadFile()` returns `os.ErrNotExist` when the PID file doesn't exist, and this is propagated as a 500 error.
|
|
|
|
### Status() Method - Already Handles Missing PID Gracefully
|
|
|
|
```go
|
|
func (e *DefaultCrowdsecExecutor) Status(ctx context.Context, configDir string) (running bool, pid int, err error) {
|
|
b, err := os.ReadFile(e.pidFile(configDir))
|
|
if err != nil {
|
|
// Missing pid file is treated as not running ← GOOD PATTERN
|
|
return false, 0, nil
|
|
}
|
|
// ...
|
|
}
|
|
```
|
|
|
|
**Key Insight:** `Status()` already handles missing PID file gracefully by returning `(false, 0, nil)`. The `Stop()` method should follow the same pattern.
|
|
|
|
---
|
|
|
|
## Fix Plan
|
|
|
|
### Fix 1: Make Stop() Idempotent (No Error When Already Stopped)
|
|
|
|
**File:** `backend/internal/api/handlers/crowdsec_exec.go`
|
|
|
|
**Current Code (lines 36-51):**
|
|
```go
|
|
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
|
|
b, err := os.ReadFile(e.pidFile(configDir))
|
|
if err != nil {
|
|
return fmt.Errorf("pid file read: %w", err)
|
|
}
|
|
pid, err := strconv.Atoi(string(b))
|
|
if err != nil {
|
|
return fmt.Errorf("invalid pid: %w", err)
|
|
}
|
|
proc, err := os.FindProcess(pid)
|
|
if err != nil {
|
|
return err
|
|
}
|
|
if err := proc.Signal(syscall.SIGTERM); err != nil {
|
|
return err
|
|
}
|
|
// best-effort remove pid file
|
|
_ = os.Remove(e.pidFile(configDir))
|
|
return nil
|
|
}
|
|
```
|
|
|
|
**Fixed Code:**
|
|
```go
|
|
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
|
|
b, err := os.ReadFile(e.pidFile(configDir))
|
|
if err != nil {
|
|
if os.IsNotExist(err) {
|
|
// No PID file means process isn't running - that's OK for Stop()
|
|
// This makes Stop() idempotent (safe to call multiple times)
|
|
return nil
|
|
}
|
|
return fmt.Errorf("pid file read: %w", err)
|
|
}
|
|
pid, err := strconv.Atoi(string(b))
|
|
if err != nil {
|
|
// Malformed PID file - remove it and treat as not running
|
|
_ = os.Remove(e.pidFile(configDir))
|
|
return nil
|
|
}
|
|
proc, err := os.FindProcess(pid)
|
|
if err != nil {
|
|
// Process lookup failed - clean up PID file
|
|
_ = os.Remove(e.pidFile(configDir))
|
|
return nil
|
|
}
|
|
if err := proc.Signal(syscall.SIGTERM); err != nil {
|
|
// Process may already be dead - clean up PID file
|
|
_ = os.Remove(e.pidFile(configDir))
|
|
// Only return error if it's not "process doesn't exist"
|
|
if !errors.Is(err, os.ErrProcessDone) && !errors.Is(err, syscall.ESRCH) {
|
|
return err
|
|
}
|
|
return nil
|
|
}
|
|
// best-effort remove pid file
|
|
_ = os.Remove(e.pidFile(configDir))
|
|
return nil
|
|
}
|
|
```
|
|
|
|
**Rationale:** Stop() should be idempotent. Stopping an already-stopped service shouldn't error.
|
|
|
|
### Fix 2: Add CrowdSec Startup Reconciliation
|
|
|
|
**File:** `backend/internal/api/routes/routes.go` (or create `backend/internal/services/crowdsec_startup.go`)
|
|
|
|
**New Function:**
|
|
```go
|
|
// ReconcileCrowdSecOnStartup checks if CrowdSec should be running based on DB settings
|
|
// and starts it if necessary. This handles the case where the container restarts
|
|
// but the user's preference was to have CrowdSec enabled.
|
|
func ReconcileCrowdSecOnStartup(db *gorm.DB, executor handlers.CrowdsecExecutor, binPath, dataDir string) {
|
|
if db == nil {
|
|
return
|
|
}
|
|
|
|
var secCfg models.SecurityConfig
|
|
if err := db.First(&secCfg).Error; err != nil {
|
|
// No config yet or error - nothing to reconcile
|
|
logger.Log().WithError(err).Debug("No security config found for CrowdSec reconciliation")
|
|
return
|
|
}
|
|
|
|
// Check if CrowdSec should be running based on mode
|
|
if secCfg.CrowdSecMode != "local" {
|
|
logger.Log().WithField("mode", secCfg.CrowdSecMode).Debug("CrowdSec mode is not 'local', skipping auto-start")
|
|
return
|
|
}
|
|
|
|
// Check if already running
|
|
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
|
|
defer cancel()
|
|
|
|
running, _, _ := executor.Status(ctx, dataDir)
|
|
if running {
|
|
logger.Log().Info("CrowdSec already running, no action needed")
|
|
return
|
|
}
|
|
|
|
// Start CrowdSec since DB says it should be enabled
|
|
logger.Log().Info("CrowdSec mode is 'local' but process not running, starting...")
|
|
_, err := executor.Start(ctx, binPath, dataDir)
|
|
if err != nil {
|
|
logger.Log().WithError(err).Warn("Failed to auto-start CrowdSec on startup reconciliation")
|
|
} else {
|
|
logger.Log().Info("CrowdSec started successfully via startup reconciliation")
|
|
}
|
|
}
|
|
```
|
|
|
|
**Integration Point:** Call this function after database migration in server initialization:
|
|
```go
|
|
// In routes.go or server.go, after DB is ready and handlers are created
|
|
if crowdsecHandler != nil {
|
|
ReconcileCrowdSecOnStartup(db, crowdsecHandler.Executor, crowdsecHandler.BinPath, crowdsecHandler.DataDir)
|
|
}
|
|
```
|
|
|
|
### Fix 3: Ensure LogWatcher is Started and Log File Exists
|
|
|
|
**File:** `backend/internal/api/routes/routes.go`
|
|
|
|
**Check that LogWatcher.Start() is called:**
|
|
```go
|
|
// Ensure LogWatcher is started with proper log path
|
|
logPath := "/var/log/caddy/access.log"
|
|
|
|
// Ensure log directory exists
|
|
if err := os.MkdirAll(filepath.Dir(logPath), 0755); err != nil {
|
|
logger.Log().WithError(err).Warn("Failed to create log directory")
|
|
}
|
|
|
|
// Create empty log file if it doesn't exist (allows LogWatcher to start tailing immediately)
|
|
if _, err := os.Stat(logPath); os.IsNotExist(err) {
|
|
if f, err := os.Create(logPath); err == nil {
|
|
f.Close()
|
|
logger.Log().WithField("path", logPath).Info("Created empty log file for LogWatcher")
|
|
}
|
|
}
|
|
|
|
// Create and start the LogWatcher
|
|
watcher := services.NewLogWatcher(logPath)
|
|
if err := watcher.Start(context.Background()); err != nil {
|
|
logger.Log().WithError(err).Error("Failed to start LogWatcher")
|
|
}
|
|
```
|
|
|
|
**Additionally, verify CSP allows WebSocket:**
|
|
The security middleware in `backend/internal/api/middleware/security.go` already has:
|
|
```go
|
|
directives["connect-src"] = "'self' ws: wss:" // WebSocket for HMR
|
|
```
|
|
|
|
This should allow WebSocket connections.
|
|
|
|
---
|
|
|
|
## Files to Modify
|
|
|
|
| File | Change |
|
|
|------|--------|
|
|
| `backend/internal/api/handlers/crowdsec_exec.go` | Make `Stop()` idempotent for missing PID file |
|
|
| `backend/internal/api/routes/routes.go` | Add CrowdSec startup reconciliation call |
|
|
| `backend/internal/services/crowdsec_startup.go` | (NEW) Create startup reconciliation function |
|
|
| `backend/internal/api/handlers/crowdsec_exec_test.go` | Add tests for Stop() idempotency |
|
|
|
|
---
|
|
|
|
## Testing Plan
|
|
|
|
### Test 1: Stop() Idempotency
|
|
```bash
|
|
# Start container fresh (no CrowdSec running)
|
|
docker compose down -v && docker compose up -d
|
|
|
|
# Call Stop() without starting CrowdSec first
|
|
curl -X POST http://localhost:8080/api/v1/admin/crowdsec/stop \
|
|
-H "Authorization: Bearer $TOKEN"
|
|
|
|
# Expected: 200 OK {"status": "stopped"}
|
|
# NOT: 500 Internal Server Error
|
|
```
|
|
|
|
### Test 2: Startup Reconciliation
|
|
```bash
|
|
# Enable CrowdSec via API
|
|
curl -X POST http://localhost:8080/api/v1/admin/crowdsec/start \
|
|
-H "Authorization: Bearer $TOKEN"
|
|
|
|
# Verify running
|
|
curl http://localhost:8080/api/v1/admin/crowdsec/status \
|
|
-H "Authorization: Bearer $TOKEN"
|
|
# Expected: {"running": true, "pid": xxx, "lapi_ready": true}
|
|
|
|
# Restart container
|
|
docker compose restart charon
|
|
|
|
# Wait for startup
|
|
sleep 10
|
|
|
|
# Verify CrowdSec auto-started
|
|
curl http://localhost:8080/api/v1/admin/crowdsec/status \
|
|
-H "Authorization: Bearer $TOKEN"
|
|
# Expected: {"running": true, "pid": xxx, "lapi_ready": true}
|
|
```
|
|
|
|
### Test 3: Live Logs
|
|
```bash
|
|
# Connect to WebSocket
|
|
websocat "ws://localhost:8080/api/v1/logs/live?token=$TOKEN"
|
|
|
|
# In another terminal, generate traffic
|
|
curl http://localhost:8080/api/v1/health
|
|
|
|
# Verify log entries appear in WebSocket connection
|
|
```
|
|
|
|
---
|
|
|
|
## Implementation Order
|
|
|
|
1. **Fix 1 (Stop Idempotency)** - Quick fix, high impact, unblocks users immediately
|
|
2. **Fix 2 (Startup Reconciliation)** - Core fix for state mismatch after restart
|
|
3. **Fix 3 (Live Logs)** - May need more investigation; likely already working if LogWatcher is started
|
|
|
|
---
|
|
|
|
## Risk Assessment
|
|
|
|
| Change | Risk | Mitigation |
|
|
|--------|------|------------|
|
|
| Stop() idempotency | Very Low | Makes existing code more robust |
|
|
| Startup reconciliation | Low | Only runs once at startup, respects DB state |
|
|
| Log file creation | Very Low | Standard file operations with error handling |
|
|
|
|
---
|
|
|
|
## Notes
|
|
|
|
- The CrowdSec PID file path is `${dataDir}/crowdsec.pid` (e.g., `/app/data/crowdsec/crowdsec.pid`)
|
|
- The LogWatcher monitors `/var/log/caddy/access.log` by default
|
|
- WebSocket ping interval is 30 seconds for keep-alive
|
|
- CrowdSec LAPI runs on port 8085 (not 8080) to avoid conflict with Charon
|
|
- The `Status()` handler already includes `lapi_ready` field (from previous fix)
|