- Issue 1: Corrected CrowdSec status reporting by adding `setting_enabled` and `needs_start` fields to the Status() response, allowing the frontend to accurately reflect the need for a restart. - Issue 2: Resolved 500 error on stopping CrowdSec by implementing graceful handling of missing PID files in the Stop() method, with a fallback to process termination via pkill. - Issue 3: Fixed Live Logs disconnection issue by ensuring the log file is created if it doesn't exist during LogWatcher.Start() and sending an immediate WebSocket connection confirmation to clients. These changes enhance the robustness of the application in handling container restart scenarios.
13 KiB
Fix Plan: Critical Issues After Docker Rebuild
Date: December 14, 2025 Status: Planning Phase Priority: P0 - Urgent
Issue Summary
After a Docker container rebuild, three critical issues were identified:
- 500 error on CrowdSec Stop() - Toggling CrowdSec OFF returns "Failed to stop CrowdSec: Request failed with status code 500"
- CrowdSec shows "not running" - Despite database setting being enabled, the process isn't running after container restart
- Live Logs Disconnected - WebSocket shows disconnected state even when logs are being generated
Root Cause Analysis
Issue 1: 500 Error on Stop()
Location: crowdsec_exec.go
Root Cause: The Stop() method in DefaultCrowdsecExecutor fails when there's no PID file.
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
b, err := os.ReadFile(e.pidFile(configDir))
if err != nil {
return fmt.Errorf("pid file read: %w", err) // ← FAILS HERE with 500
}
// ...
}
Problem: When the container restarts:
- The PID file (
/app/data/crowdsec/crowdsec.pid) is deleted (ephemeral or process cleanup removes it) - The database still has CrowdSec "enabled" = true
- User clicks "Disable" (which calls Stop())
- Stop() tries to read the PID file → fails → returns error → 500 response
Why it fails: The code assumes a PID file always exists when stopping. But after container restart, there's no PID file because the CrowdSec process wasn't running (GUI-controlled lifecycle, NOT auto-started).
Issue 2: CrowdSec Shows "Not Running" After Restart
Location: docker-entrypoint.sh
Root Cause: CrowdSec is intentionally NOT auto-started in the entrypoint. The design is "GUI-controlled lifecycle."
From the entrypoint:
# CrowdSec Lifecycle Management:
# CrowdSec configuration is initialized above (symlinks, directories, hub updates)
# However, the CrowdSec agent is NOT auto-started in the entrypoint.
# Instead, CrowdSec lifecycle is managed by the backend handlers via GUI controls.
Problem: The database stores the user's "enabled" preference, but there's no reconciliation at startup:
- Container restarts
- Database says CrowdSec
enabled = true(user's previous preference) - But CrowdSec process is NOT started (by design)
- UI shows "enabled" but status shows "not running" → confusing state mismatch
Missing Logic: No startup reconciliation to check "if DB says enabled, start CrowdSec process."
Issue 3: Live Logs Disconnected
Location: logs_ws.go and log_watcher.go
Root Cause: There are two separate WebSocket log systems that may be misconfigured:
/api/v1/logs/live(logs_ws.go) - Streams application logs vialogger.GetBroadcastHook()/api/v1/cerberus/logs/ws(cerberus_logs_ws.go) - Streams Caddy access logs viaLogWatcher
Potential Issues:
a) LogWatcher not started: The LogWatcher must be explicitly started with Start(ctx). If the watcher isn't started during server initialization, no logs are broadcast.
b) Log file doesn't exist: The LogWatcher waits for /var/log/caddy/access.log to exist. After container restart with no traffic, this file may not exist yet.
c) WebSocket connection path mismatch: Frontend might connect to wrong endpoint or with invalid token.
d) CSP blocking WebSocket: Security middleware's Content-Security-Policy must allow ws: and wss: protocols.
Detailed Code Analysis
Stop() Method - Full Code Review
// File: backend/internal/api/handlers/crowdsec_exec.go
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
b, err := os.ReadFile(e.pidFile(configDir))
if err != nil {
return fmt.Errorf("pid file read: %w", err) // ← CRITICAL: Returns error on ENOENT
}
pid, err := strconv.Atoi(string(b))
if err != nil {
return fmt.Errorf("invalid pid: %w", err)
}
proc, err := os.FindProcess(pid)
if err != nil {
return err
}
if err := proc.Signal(syscall.SIGTERM); err != nil {
return err
}
_ = os.Remove(e.pidFile(configDir))
return nil
}
The problem is clear: os.ReadFile() returns os.ErrNotExist when the PID file doesn't exist, and this is propagated as a 500 error.
Status() Method - Already Handles Missing PID Gracefully
func (e *DefaultCrowdsecExecutor) Status(ctx context.Context, configDir string) (running bool, pid int, err error) {
b, err := os.ReadFile(e.pidFile(configDir))
if err != nil {
// Missing pid file is treated as not running ← GOOD PATTERN
return false, 0, nil
}
// ...
}
Key Insight: Status() already handles missing PID file gracefully by returning (false, 0, nil). The Stop() method should follow the same pattern.
Fix Plan
Fix 1: Make Stop() Idempotent (No Error When Already Stopped)
File: backend/internal/api/handlers/crowdsec_exec.go
Current Code (lines 36-51):
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
b, err := os.ReadFile(e.pidFile(configDir))
if err != nil {
return fmt.Errorf("pid file read: %w", err)
}
pid, err := strconv.Atoi(string(b))
if err != nil {
return fmt.Errorf("invalid pid: %w", err)
}
proc, err := os.FindProcess(pid)
if err != nil {
return err
}
if err := proc.Signal(syscall.SIGTERM); err != nil {
return err
}
// best-effort remove pid file
_ = os.Remove(e.pidFile(configDir))
return nil
}
Fixed Code:
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
b, err := os.ReadFile(e.pidFile(configDir))
if err != nil {
if os.IsNotExist(err) {
// No PID file means process isn't running - that's OK for Stop()
// This makes Stop() idempotent (safe to call multiple times)
return nil
}
return fmt.Errorf("pid file read: %w", err)
}
pid, err := strconv.Atoi(string(b))
if err != nil {
// Malformed PID file - remove it and treat as not running
_ = os.Remove(e.pidFile(configDir))
return nil
}
proc, err := os.FindProcess(pid)
if err != nil {
// Process lookup failed - clean up PID file
_ = os.Remove(e.pidFile(configDir))
return nil
}
if err := proc.Signal(syscall.SIGTERM); err != nil {
// Process may already be dead - clean up PID file
_ = os.Remove(e.pidFile(configDir))
// Only return error if it's not "process doesn't exist"
if !errors.Is(err, os.ErrProcessDone) && !errors.Is(err, syscall.ESRCH) {
return err
}
return nil
}
// best-effort remove pid file
_ = os.Remove(e.pidFile(configDir))
return nil
}
Rationale: Stop() should be idempotent. Stopping an already-stopped service shouldn't error.
Fix 2: Add CrowdSec Startup Reconciliation
File: backend/internal/api/routes/routes.go (or create backend/internal/services/crowdsec_startup.go)
New Function:
// ReconcileCrowdSecOnStartup checks if CrowdSec should be running based on DB settings
// and starts it if necessary. This handles the case where the container restarts
// but the user's preference was to have CrowdSec enabled.
func ReconcileCrowdSecOnStartup(db *gorm.DB, executor handlers.CrowdsecExecutor, binPath, dataDir string) {
if db == nil {
return
}
var secCfg models.SecurityConfig
if err := db.First(&secCfg).Error; err != nil {
// No config yet or error - nothing to reconcile
logger.Log().WithError(err).Debug("No security config found for CrowdSec reconciliation")
return
}
// Check if CrowdSec should be running based on mode
if secCfg.CrowdSecMode != "local" {
logger.Log().WithField("mode", secCfg.CrowdSecMode).Debug("CrowdSec mode is not 'local', skipping auto-start")
return
}
// Check if already running
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
running, _, _ := executor.Status(ctx, dataDir)
if running {
logger.Log().Info("CrowdSec already running, no action needed")
return
}
// Start CrowdSec since DB says it should be enabled
logger.Log().Info("CrowdSec mode is 'local' but process not running, starting...")
_, err := executor.Start(ctx, binPath, dataDir)
if err != nil {
logger.Log().WithError(err).Warn("Failed to auto-start CrowdSec on startup reconciliation")
} else {
logger.Log().Info("CrowdSec started successfully via startup reconciliation")
}
}
Integration Point: Call this function after database migration in server initialization:
// In routes.go or server.go, after DB is ready and handlers are created
if crowdsecHandler != nil {
ReconcileCrowdSecOnStartup(db, crowdsecHandler.Executor, crowdsecHandler.BinPath, crowdsecHandler.DataDir)
}
Fix 3: Ensure LogWatcher is Started and Log File Exists
File: backend/internal/api/routes/routes.go
Check that LogWatcher.Start() is called:
// Ensure LogWatcher is started with proper log path
logPath := "/var/log/caddy/access.log"
// Ensure log directory exists
if err := os.MkdirAll(filepath.Dir(logPath), 0755); err != nil {
logger.Log().WithError(err).Warn("Failed to create log directory")
}
// Create empty log file if it doesn't exist (allows LogWatcher to start tailing immediately)
if _, err := os.Stat(logPath); os.IsNotExist(err) {
if f, err := os.Create(logPath); err == nil {
f.Close()
logger.Log().WithField("path", logPath).Info("Created empty log file for LogWatcher")
}
}
// Create and start the LogWatcher
watcher := services.NewLogWatcher(logPath)
if err := watcher.Start(context.Background()); err != nil {
logger.Log().WithError(err).Error("Failed to start LogWatcher")
}
Additionally, verify CSP allows WebSocket:
The security middleware in backend/internal/api/middleware/security.go already has:
directives["connect-src"] = "'self' ws: wss:" // WebSocket for HMR
This should allow WebSocket connections.
Files to Modify
| File | Change |
|---|---|
backend/internal/api/handlers/crowdsec_exec.go |
Make Stop() idempotent for missing PID file |
backend/internal/api/routes/routes.go |
Add CrowdSec startup reconciliation call |
backend/internal/services/crowdsec_startup.go |
(NEW) Create startup reconciliation function |
backend/internal/api/handlers/crowdsec_exec_test.go |
Add tests for Stop() idempotency |
Testing Plan
Test 1: Stop() Idempotency
# Start container fresh (no CrowdSec running)
docker compose down -v && docker compose up -d
# Call Stop() without starting CrowdSec first
curl -X POST http://localhost:8080/api/v1/admin/crowdsec/stop \
-H "Authorization: Bearer $TOKEN"
# Expected: 200 OK {"status": "stopped"}
# NOT: 500 Internal Server Error
Test 2: Startup Reconciliation
# Enable CrowdSec via API
curl -X POST http://localhost:8080/api/v1/admin/crowdsec/start \
-H "Authorization: Bearer $TOKEN"
# Verify running
curl http://localhost:8080/api/v1/admin/crowdsec/status \
-H "Authorization: Bearer $TOKEN"
# Expected: {"running": true, "pid": xxx, "lapi_ready": true}
# Restart container
docker compose restart charon
# Wait for startup
sleep 10
# Verify CrowdSec auto-started
curl http://localhost:8080/api/v1/admin/crowdsec/status \
-H "Authorization: Bearer $TOKEN"
# Expected: {"running": true, "pid": xxx, "lapi_ready": true}
Test 3: Live Logs
# Connect to WebSocket
websocat "ws://localhost:8080/api/v1/logs/live?token=$TOKEN"
# In another terminal, generate traffic
curl http://localhost:8080/api/v1/health
# Verify log entries appear in WebSocket connection
Implementation Order
- Fix 1 (Stop Idempotency) - Quick fix, high impact, unblocks users immediately
- Fix 2 (Startup Reconciliation) - Core fix for state mismatch after restart
- Fix 3 (Live Logs) - May need more investigation; likely already working if LogWatcher is started
Risk Assessment
| Change | Risk | Mitigation |
|---|---|---|
| Stop() idempotency | Very Low | Makes existing code more robust |
| Startup reconciliation | Low | Only runs once at startup, respects DB state |
| Log file creation | Very Low | Standard file operations with error handling |
Notes
- The CrowdSec PID file path is
${dataDir}/crowdsec.pid(e.g.,/app/data/crowdsec/crowdsec.pid) - The LogWatcher monitors
/var/log/caddy/access.logby default - WebSocket ping interval is 30 seconds for keep-alive
- CrowdSec LAPI runs on port 8085 (not 8080) to avoid conflict with Charon
- The
Status()handler already includeslapi_readyfield (from previous fix)