Files

GitHub Actions 3f06fe850f fix: address post-rebuild issues with CrowdSec and Live Logs

- Issue 1: Corrected CrowdSec status reporting by adding `setting_enabled` and `needs_start` fields to the Status() response, allowing the frontend to accurately reflect the need for a restart.
- Issue 2: Resolved 500 error on stopping CrowdSec by implementing graceful handling of missing PID files in the Stop() method, with a fallback to process termination via pkill.
- Issue 3: Fixed Live Logs disconnection issue by ensuring the log file is created if it doesn't exist during LogWatcher.Start() and sending an immediate WebSocket connection confirmation to clients.

These changes enhance the robustness of the application in handling container restart scenarios.

2025-12-15 07:30:35 +00:00

13 KiB

Raw Blame History

Fix Plan: Critical Issues After Docker Rebuild

Date: December 14, 2025 Status: Planning Phase Priority: P0 - Urgent

Issue Summary

After a Docker container rebuild, three critical issues were identified:

500 error on CrowdSec Stop() - Toggling CrowdSec OFF returns "Failed to stop CrowdSec: Request failed with status code 500"
CrowdSec shows "not running" - Despite database setting being enabled, the process isn't running after container restart
Live Logs Disconnected - WebSocket shows disconnected state even when logs are being generated

Root Cause Analysis

Issue 1: 500 Error on Stop()

Location: crowdsec_exec.go

Root Cause: The Stop() method in DefaultCrowdsecExecutor fails when there's no PID file.

func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
    b, err := os.ReadFile(e.pidFile(configDir))
    if err != nil {
        return fmt.Errorf("pid file read: %w", err)  // ← FAILS HERE with 500
    }
    // ...
}

Problem: When the container restarts:

The PID file (/app/data/crowdsec/crowdsec.pid) is deleted (ephemeral or process cleanup removes it)
The database still has CrowdSec "enabled" = true
User clicks "Disable" (which calls Stop())
Stop() tries to read the PID file → fails → returns error → 500 response

Why it fails: The code assumes a PID file always exists when stopping. But after container restart, there's no PID file because the CrowdSec process wasn't running (GUI-controlled lifecycle, NOT auto-started).

Issue 2: CrowdSec Shows "Not Running" After Restart

Location: docker-entrypoint.sh

Root Cause: CrowdSec is intentionally NOT auto-started in the entrypoint. The design is "GUI-controlled lifecycle."

From the entrypoint:

# CrowdSec Lifecycle Management:
# CrowdSec configuration is initialized above (symlinks, directories, hub updates)
# However, the CrowdSec agent is NOT auto-started in the entrypoint.
# Instead, CrowdSec lifecycle is managed by the backend handlers via GUI controls.

Problem: The database stores the user's "enabled" preference, but there's no reconciliation at startup:

Container restarts
Database says CrowdSec enabled = true (user's previous preference)
But CrowdSec process is NOT started (by design)
UI shows "enabled" but status shows "not running" → confusing state mismatch

Missing Logic: No startup reconciliation to check "if DB says enabled, start CrowdSec process."

Issue 3: Live Logs Disconnected

Location: logs_ws.go and log_watcher.go

Root Cause: There are two separate WebSocket log systems that may be misconfigured:

/api/v1/logs/live (logs_ws.go) - Streams application logs via logger.GetBroadcastHook()
/api/v1/cerberus/logs/ws (cerberus_logs_ws.go) - Streams Caddy access logs via LogWatcher

Potential Issues:

a) LogWatcher not started: The LogWatcher must be explicitly started with Start(ctx). If the watcher isn't started during server initialization, no logs are broadcast.

b) Log file doesn't exist: The LogWatcher waits for /var/log/caddy/access.log to exist. After container restart with no traffic, this file may not exist yet.

c) WebSocket connection path mismatch: Frontend might connect to wrong endpoint or with invalid token.

d) CSP blocking WebSocket: Security middleware's Content-Security-Policy must allow ws: and wss: protocols.

Detailed Code Analysis

Stop() Method - Full Code Review

// File: backend/internal/api/handlers/crowdsec_exec.go
func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
    b, err := os.ReadFile(e.pidFile(configDir))
    if err != nil {
        return fmt.Errorf("pid file read: %w", err)  // ← CRITICAL: Returns error on ENOENT
    }
    pid, err := strconv.Atoi(string(b))
    if err != nil {
        return fmt.Errorf("invalid pid: %w", err)
    }
    proc, err := os.FindProcess(pid)
    if err != nil {
        return err
    }
    if err := proc.Signal(syscall.SIGTERM); err != nil {
        return err
    }
    _ = os.Remove(e.pidFile(configDir))
    return nil
}

The problem is clear: os.ReadFile() returns os.ErrNotExist when the PID file doesn't exist, and this is propagated as a 500 error.

Status() Method - Already Handles Missing PID Gracefully

func (e *DefaultCrowdsecExecutor) Status(ctx context.Context, configDir string) (running bool, pid int, err error) {
    b, err := os.ReadFile(e.pidFile(configDir))
    if err != nil {
        // Missing pid file is treated as not running ← GOOD PATTERN
        return false, 0, nil
    }
    // ...
}

Key Insight: Status() already handles missing PID file gracefully by returning (false, 0, nil). The Stop() method should follow the same pattern.

Fix Plan

Fix 1: Make Stop() Idempotent (No Error When Already Stopped)

File: backend/internal/api/handlers/crowdsec_exec.go

Current Code (lines 36-51):

func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
	b, err := os.ReadFile(e.pidFile(configDir))
	if err != nil {
		return fmt.Errorf("pid file read: %w", err)
	}
	pid, err := strconv.Atoi(string(b))
	if err != nil {
		return fmt.Errorf("invalid pid: %w", err)
	}
	proc, err := os.FindProcess(pid)
	if err != nil {
		return err
	}
	if err := proc.Signal(syscall.SIGTERM); err != nil {
		return err
	}
	// best-effort remove pid file
	_ = os.Remove(e.pidFile(configDir))
	return nil
}

Fixed Code:

func (e *DefaultCrowdsecExecutor) Stop(ctx context.Context, configDir string) error {
	b, err := os.ReadFile(e.pidFile(configDir))
	if err != nil {
		if os.IsNotExist(err) {
			// No PID file means process isn't running - that's OK for Stop()
			// This makes Stop() idempotent (safe to call multiple times)
			return nil
		}
		return fmt.Errorf("pid file read: %w", err)
	}
	pid, err := strconv.Atoi(string(b))
	if err != nil {
		// Malformed PID file - remove it and treat as not running
		_ = os.Remove(e.pidFile(configDir))
		return nil
	}
	proc, err := os.FindProcess(pid)
	if err != nil {
		// Process lookup failed - clean up PID file
		_ = os.Remove(e.pidFile(configDir))
		return nil
	}
	if err := proc.Signal(syscall.SIGTERM); err != nil {
		// Process may already be dead - clean up PID file
		_ = os.Remove(e.pidFile(configDir))
		// Only return error if it's not "process doesn't exist"
		if !errors.Is(err, os.ErrProcessDone) && !errors.Is(err, syscall.ESRCH) {
			return err
		}
		return nil
	}
	// best-effort remove pid file
	_ = os.Remove(e.pidFile(configDir))
	return nil
}

Rationale: Stop() should be idempotent. Stopping an already-stopped service shouldn't error.

Fix 2: Add CrowdSec Startup Reconciliation

File: backend/internal/api/routes/routes.go (or create backend/internal/services/crowdsec_startup.go)

New Function:

// ReconcileCrowdSecOnStartup checks if CrowdSec should be running based on DB settings
// and starts it if necessary. This handles the case where the container restarts
// but the user's preference was to have CrowdSec enabled.
func ReconcileCrowdSecOnStartup(db *gorm.DB, executor handlers.CrowdsecExecutor, binPath, dataDir string) {
    if db == nil {
        return
    }

    var secCfg models.SecurityConfig
    if err := db.First(&secCfg).Error; err != nil {
        // No config yet or error - nothing to reconcile
        logger.Log().WithError(err).Debug("No security config found for CrowdSec reconciliation")
        return
    }

    // Check if CrowdSec should be running based on mode
    if secCfg.CrowdSecMode != "local" {
        logger.Log().WithField("mode", secCfg.CrowdSecMode).Debug("CrowdSec mode is not 'local', skipping auto-start")
        return
    }

    // Check if already running
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    running, _, _ := executor.Status(ctx, dataDir)
    if running {
        logger.Log().Info("CrowdSec already running, no action needed")
        return
    }

    // Start CrowdSec since DB says it should be enabled
    logger.Log().Info("CrowdSec mode is 'local' but process not running, starting...")
    _, err := executor.Start(ctx, binPath, dataDir)
    if err != nil {
        logger.Log().WithError(err).Warn("Failed to auto-start CrowdSec on startup reconciliation")
    } else {
        logger.Log().Info("CrowdSec started successfully via startup reconciliation")
    }
}

Integration Point: Call this function after database migration in server initialization:

// In routes.go or server.go, after DB is ready and handlers are created
if crowdsecHandler != nil {
    ReconcileCrowdSecOnStartup(db, crowdsecHandler.Executor, crowdsecHandler.BinPath, crowdsecHandler.DataDir)
}

Fix 3: Ensure LogWatcher is Started and Log File Exists

File: backend/internal/api/routes/routes.go

Check that LogWatcher.Start() is called:

// Ensure LogWatcher is started with proper log path
logPath := "/var/log/caddy/access.log"

// Ensure log directory exists
if err := os.MkdirAll(filepath.Dir(logPath), 0755); err != nil {
    logger.Log().WithError(err).Warn("Failed to create log directory")
}

// Create empty log file if it doesn't exist (allows LogWatcher to start tailing immediately)
if _, err := os.Stat(logPath); os.IsNotExist(err) {
    if f, err := os.Create(logPath); err == nil {
        f.Close()
        logger.Log().WithField("path", logPath).Info("Created empty log file for LogWatcher")
    }
}

// Create and start the LogWatcher
watcher := services.NewLogWatcher(logPath)
if err := watcher.Start(context.Background()); err != nil {
    logger.Log().WithError(err).Error("Failed to start LogWatcher")
}

Additionally, verify CSP allows WebSocket: The security middleware in backend/internal/api/middleware/security.go already has:

directives["connect-src"] = "'self' ws: wss:" // WebSocket for HMR

This should allow WebSocket connections.

Files to Modify

File	Change
`backend/internal/api/handlers/crowdsec_exec.go`	Make `Stop()` idempotent for missing PID file
`backend/internal/api/routes/routes.go`	Add CrowdSec startup reconciliation call
`backend/internal/services/crowdsec_startup.go`	(NEW) Create startup reconciliation function
`backend/internal/api/handlers/crowdsec_exec_test.go`	Add tests for Stop() idempotency

Testing Plan

Test 1: Stop() Idempotency

# Start container fresh (no CrowdSec running)
docker compose down -v && docker compose up -d

# Call Stop() without starting CrowdSec first
curl -X POST http://localhost:8080/api/v1/admin/crowdsec/stop \
  -H "Authorization: Bearer $TOKEN"

# Expected: 200 OK {"status": "stopped"}
# NOT: 500 Internal Server Error

Test 2: Startup Reconciliation

# Enable CrowdSec via API
curl -X POST http://localhost:8080/api/v1/admin/crowdsec/start \
  -H "Authorization: Bearer $TOKEN"

# Verify running
curl http://localhost:8080/api/v1/admin/crowdsec/status \
  -H "Authorization: Bearer $TOKEN"
# Expected: {"running": true, "pid": xxx, "lapi_ready": true}

# Restart container
docker compose restart charon

# Wait for startup
sleep 10

# Verify CrowdSec auto-started
curl http://localhost:8080/api/v1/admin/crowdsec/status \
  -H "Authorization: Bearer $TOKEN"
# Expected: {"running": true, "pid": xxx, "lapi_ready": true}

Test 3: Live Logs

# Connect to WebSocket
websocat "ws://localhost:8080/api/v1/logs/live?token=$TOKEN"

# In another terminal, generate traffic
curl http://localhost:8080/api/v1/health

# Verify log entries appear in WebSocket connection

Implementation Order

Fix 1 (Stop Idempotency) - Quick fix, high impact, unblocks users immediately
Fix 2 (Startup Reconciliation) - Core fix for state mismatch after restart
Fix 3 (Live Logs) - May need more investigation; likely already working if LogWatcher is started

Risk Assessment

Change	Risk	Mitigation
Stop() idempotency	Very Low	Makes existing code more robust
Startup reconciliation	Low	Only runs once at startup, respects DB state
Log file creation	Very Low	Standard file operations with error handling

Notes

The CrowdSec PID file path is ${dataDir}/crowdsec.pid (e.g., /app/data/crowdsec/crowdsec.pid)
The LogWatcher monitors /var/log/caddy/access.log by default
WebSocket ping interval is 30 seconds for keep-alive
CrowdSec LAPI runs on port 8085 (not 8080) to avoid conflict with Charon
The Status() handler already includes lapi_ready field (from previous fix)

13 KiB Raw Blame History