diff --git a/docs/reports/HOTFIX_CROWDSEC_INTEGRATION_ISSUES.md b/docs/reports/HOTFIX_CROWDSEC_INTEGRATION_ISSUES.md new file mode 100644 index 00000000..1a38ca5b --- /dev/null +++ b/docs/reports/HOTFIX_CROWDSEC_INTEGRATION_ISSUES.md @@ -0,0 +1,627 @@ +# CrowdSec Integration Issues - Hotfix Plan + +**Date:** December 14, 2025 +**Priority:** HOTFIX - Critical +**Status:** Investigation Complete, Ready for Implementation + +## Executive Summary + +Three critical issues have been identified in the CrowdSec integration that prevent proper operation: + +1. **CrowdSec process not actually running** - Message displays but process isn't started +2. **Toggle state management broken** - CrowdSec toggle on Cerberus Dashboard won't turn off +3. **Security log viewer shows wrong logs** - Displays Plex/application logs instead of security logs + +## Investigation Findings + +### Container Status + +```bash +Container: charon (1cc717562976) +Status: Up 4 hours (healthy) +Processes Running: + - PID 1: /bin/sh /docker-entrypoint.sh + - PID 31: caddy run --config /config/caddy.json + - PID 43: /usr/local/bin/dlv exec /app/charon (debugger) + - PID 52: /app/charon (main process) + +CrowdSec Process: NOT RUNNING ❌ +No PID file found at: /app/data/crowdsec/crowdsec.pid +``` + +### Issue #1: CrowdSec Not Running + +**Root Cause:** +- The error message "CrowdSec is not running" is **accurate** +- `crowdsec` binary process is not executing in the container +- PID file `/app/data/crowdsec/crowdsec.pid` does not exist +- Process detection in `crowdsec_exec.go:Status()` correctly returns `running=false` + +**Code Path:** +``` +backend/internal/api/handlers/crowdsec_exec.go:85 +├── Status() checks PID file at: filepath.Join(configDir, "crowdsec.pid") +├── PID file missing → returns (running=false, pid=0, err=nil) +└── Frontend displays: "CrowdSec is not running" +``` + +**Why CrowdSec Isn't Starting:** +1. `ReconcileCrowdSecOnStartup()` runs at container boot (routes.go:360) +2. 
Checks `SecurityConfig` table for `crowdsec_mode = "local"` +3. **BUT**: The mode might not be set to "local" or the process start is failing silently +4. No error logs visible in container logs about CrowdSec startup failures + +**Files Involved:** +- `backend/internal/services/crowdsec_startup.go` - Reconciliation logic +- `backend/internal/api/handlers/crowdsec_exec.go` - Process executor +- `backend/internal/api/handlers/crowdsec_handler.go` - Status endpoint + +--- + +### Issue #2: Toggle Won't Turn Off + +**Root Cause:** +Frontend state management has optimistic updates that don't properly reconcile with backend state. + +**Code Path:** +```typescript +frontend/src/pages/Security.tsx:94-113 (crowdsecPowerMutation) +├── onMutate: Optimistically sets crowdsec.enabled = new value +├── mutationFn: Calls updateSetting() then startCrowdsec() or stopCrowdsec() +├── onError: Reverts optimistic update but may not fully sync +└── onSuccess: Calls fetchCrowdsecStatus() but state may be stale +``` + +**The Problem:** +```typescript +// Optimistic update sets enabled immediately +queryClient.setQueryData(['security-status'], (old) => { + copy.crowdsec = { ...copy.crowdsec, enabled } // ← State updated BEFORE API call +}) + +// If API fails or times out, toggle appears stuck +``` + +**Why Toggle Appears Stuck:** +1. User clicks toggle → Frontend immediately updates UI to "enabled" +2. Backend API is called to start CrowdSec +3. CrowdSec process fails to start (see Issue #1) +4. API returns success (because the *setting* was updated) +5. Frontend thinks CrowdSec is enabled, but `Status()` API says `running=false` +6. 
Toggle now in inconsistent state - shows "on" but status says "not running" + +**Files Involved:** +- `frontend/src/pages/Security.tsx:94-136` - Toggle mutation logic +- `frontend/src/pages/CrowdSecConfig.tsx:105` - Status check +- `backend/internal/api/handlers/security_handler.go:60-175` - GetStatus priority chain + +--- + +### Issue #3: Security Log Viewer Shows Wrong Logs + +**Root Cause:** +The `LiveLogViewer` component connects to the correct `/api/v1/cerberus/logs/ws` endpoint, but the `LogWatcher` service is reading from `/var/log/caddy/access.log` which may not exist or may contain the wrong logs. + +**Code Path:** +``` +frontend/src/pages/Security.tsx:411 +├── +└── Connects to: ws://localhost:8080/api/v1/cerberus/logs/ws + +backend/internal/api/routes/routes.go:362-390 +├── LogWatcher initialized with: accessLogPath = "/var/log/caddy/access.log" +├── File exists check: Creates empty file if missing +└── Starts tailing: services.LogWatcher.tailFile() + +backend/internal/services/log_watcher.go:139-186 +├── Opens /var/log/caddy/access.log +├── Seeks to end of file +└── Reads new lines, parses as Caddy JSON logs +``` + +**The Problem:** +The log file path `/var/log/caddy/access.log` is hardcoded and may not match where Caddy is actually writing logs. The user reports seeing Plex logs, which suggests: + +1. **Wrong log file** - The LogWatcher might be reading an old/wrong log file +2. **Parsing issue** - Caddy logs aren't properly formatted as expected +3. 
**Source detection broken** - Logs are being classified as "normal" instead of security events + +**Verification Needed:** +```bash +# Check where Caddy is actually logging +docker exec charon cat /config/caddy.json | jq '.logging' + +# Check if the access.log file exists and contains recent entries +docker exec charon tail -50 /var/log/caddy/access.log + +# Check Caddy data directory +docker exec charon ls -la /app/data/caddy/ +``` + +**Files Involved:** +- `backend/internal/api/routes/routes.go:366` - accessLogPath definition +- `backend/internal/services/log_watcher.go` - File tailing and parsing +- `backend/internal/api/handlers/cerberus_logs_ws.go` - WebSocket handler +- `frontend/src/components/LiveLogViewer.tsx` - Frontend component + +--- + +## Root Cause Summary + +| Issue | Root Cause | Impact | +|-------|------------|--------| +| CrowdSec not running | Process start fails silently OR mode not set to "local" in DB | User cannot use CrowdSec features | +| Toggle stuck | Optimistic UI updates + API success despite process failure | Confusing UX, user can't disable | +| Wrong logs displayed | LogWatcher reading wrong file OR parsing application logs | User can't monitor security events | + +--- + +## Proposed Fixes + +### Fix #1: CrowdSec Process Start Issues + +**Change X → Y Impact:** + +```diff +File: backend/internal/services/crowdsec_startup.go + +IF Change: Add detailed logging + retry mechanism +THEN Impact: + ✓ Startup failures become visible in logs + ✓ Transient failures (DB not ready) are retried + ✓ CrowdSec has better chance of starting on boot + ⚠ Retry logic could delay boot by a few seconds + +IF Change: Validate binPath exists before calling Start() +THEN Impact: + ✓ Prevent calling Start() if crowdsec binary missing + ✓ Clear error message to user + ⚠ Additional filesystem check on every reconcile +``` + +**Implementation:** + +```go +// backend/internal/services/crowdsec_startup.go + +func ReconcileCrowdSecOnStartup(db *gorm.DB, executor 
CrowdsecProcessManager, binPath, dataDir string) { + logger.Log().Info("Starting CrowdSec reconciliation on startup") + + // ... existing checks ... + + // VALIDATE: Ensure binary exists + if _, err := os.Stat(binPath); os.IsNotExist(err) { + logger.Log().WithField("path", binPath).Error("CrowdSec binary not found, cannot start") + return + } + + // VALIDATE: Ensure config directory exists + if _, err := os.Stat(dataDir); os.IsNotExist(err) { + logger.Log().WithField("path", dataDir).Error("CrowdSec config directory not found, cannot start") + return + } + + // ... existing status check ... + + // START with better error handling + logger.Log().WithFields(logrus.Fields{ + "bin_path": binPath, + "data_dir": dataDir, + }).Info("Attempting to start CrowdSec process") + + startCtx, startCancel := context.WithTimeout(context.Background(), 30*time.Second) + defer startCancel() + + newPid, err := executor.Start(startCtx, binPath, dataDir) + if err != nil { + logger.Log().WithError(err).WithFields(logrus.Fields{ + "bin_path": binPath, + "data_dir": dataDir, + }).Error("CrowdSec reconciliation: FAILED to start CrowdSec - check binary path and config") + return + } + + // VERIFY: Wait for PID file to be written, then re-check status + // using startCtx (the only context declared in this snippet) + time.Sleep(2 * time.Second) + running, pid, err := executor.Status(startCtx, dataDir) + if err != nil || !running { + logger.Log().WithError(err).WithFields(logrus.Fields{ + "expected_pid": newPid, + "actual_pid": pid, + "running": running, + }).Error("CrowdSec process started but not running - process may have crashed") + return + } + + logger.Log().WithField("pid", newPid).Info("CrowdSec reconciliation: successfully started and verified CrowdSec") +} +``` + +--- + +### Fix #2: Toggle State Management + +**Change X → Y Impact:** + +```diff +File: frontend/src/pages/Security.tsx + +IF Change: Remove optimistic updates, wait for API confirmation +THEN Impact: + ✓ Toggle always reflects actual backend state + ✓ No "stuck toggle" UX issue + ⚠ Toggle feels slightly slower (100-200ms delay) 
+ ⚠ User must wait for API response before seeing change + +IF Change: Add explicit error handling + status reconciliation +THEN Impact: + ✓ Errors are clearly shown to user + ✓ Toggle reverts on failure + ✓ Status check after mutation ensures consistency + ⚠ Additional API call overhead +``` + +**Implementation:** + +```typescript +// frontend/src/pages/Security.tsx + +const crowdsecPowerMutation = useMutation({ + mutationFn: async (enabled: boolean) => { + // Update setting first + await updateSetting('security.crowdsec.enabled', enabled ? 'true' : 'false', 'security', 'bool') + + if (enabled) { + toast.info('Starting CrowdSec... This may take up to 30 seconds') + const result = await startCrowdsec() + + // VERIFY: Check if it actually started + const status = await statusCrowdsec() + if (!status.running) { + throw new Error('CrowdSec setting enabled but process failed to start. Check server logs.') + } + + return result + } else { + await stopCrowdsec() + + // VERIFY: Check if it actually stopped + const status = await statusCrowdsec() + if (status.running) { + throw new Error('CrowdSec setting disabled but process still running. Check server logs.') + } + + return { enabled: false } + } + }, + + // REMOVE OPTIMISTIC UPDATES + onMutate: undefined, + + onError: (err: unknown, enabled: boolean) => { + const msg = err instanceof Error ? err.message : String(err) + toast.error(enabled ? 
`Failed to start CrowdSec: ${msg}` : `Failed to stop CrowdSec: ${msg}`) + + // Force refresh status from backend + queryClient.invalidateQueries({ queryKey: ['security-status'] }) + fetchCrowdsecStatus() + }, + + onSuccess: async () => { + // Refresh all related queries to ensure consistency + await Promise.all([ + queryClient.invalidateQueries({ queryKey: ['security-status'] }), + queryClient.invalidateQueries({ queryKey: ['settings'] }), + fetchCrowdsecStatus(), + ]) + + toast.success('CrowdSec status updated successfully') + }, +}) +``` + +--- + +### Fix #3: Security Log Viewer + +**Change X → Y Impact:** + +```diff +File: backend/internal/api/routes/routes.go + backend/internal/services/log_watcher.go + +IF Change: Make log path configurable + validate it exists +THEN Impact: + ✓ Can specify correct log file via env var + ✓ Graceful fallback if file doesn't exist + ✓ Clear error logging about file path issues + ⚠ Requires updating deployment/env vars + +IF Change: Improve log parsing + source detection +THEN Impact: + ✓ Better classification of security events + ✓ Clearer distinction between app logs and security logs + ⚠ More CPU overhead for regex matching +``` + +**Implementation Plan:** + +1. **Verify Current Log Configuration:** +```bash +# Check Caddy config for logging directive +docker exec charon cat /config/caddy.json | jq '.logging.logs' + +# Find where Caddy is actually writing logs +docker exec charon find /app/data /var/log -name "*.log" -type f 2>/dev/null + +# Check if access.log has recent entries +docker exec charon tail -20 /var/log/caddy/access.log +``` + +2. 
**Add Log Path Validation:** +```go +// backend/internal/api/routes/routes.go:366 + +accessLogPath := os.Getenv("CHARON_CADDY_ACCESS_LOG") +if accessLogPath == "" { + // Try multiple paths in order of preference + candidatePaths := []string{ + "/var/log/caddy/access.log", + filepath.Join(cfg.CaddyConfigDir, "logs", "access.log"), + filepath.Join(dataDir, "logs", "access.log"), + } + + for _, path := range candidatePaths { + if _, err := os.Stat(path); err == nil { + accessLogPath = path + logger.Log().WithField("path", path).Info("Found existing Caddy access log") + break + } + } + + // If none exist, use default and create it + if accessLogPath == "" { + accessLogPath = "/var/log/caddy/access.log" + logger.Log().WithField("path", accessLogPath).Warn("No existing access log found, will create at default path") + } +} + +logger.Log().WithField("path", accessLogPath).Info("Initializing LogWatcher with access log path") +``` + +3. **Improve Source Detection:** +```go +// backend/internal/services/log_watcher.go:221 + +func (w *LogWatcher) detectSecurityEvent(entry *models.SecurityLogEntry, caddyLog *models.CaddyAccessLog) { + // Enhanced logger name checking + loggerLower := strings.ToLower(caddyLog.Logger) + + // Check for WAF/Coraza + if caddyLog.Status == 403 && ( + strings.Contains(loggerLower, "waf") || + strings.Contains(loggerLower, "coraza") || + hasHeader(caddyLog.RespHeaders, "X-Coraza-Id")) { + entry.Blocked = true + entry.Source = "waf" + entry.Level = "warn" + entry.BlockReason = "WAF rule triggered" + // ... extract rule ID ... 
+ return + } + + // Check for CrowdSec + if caddyLog.Status == 403 && ( + strings.Contains(loggerLower, "crowdsec") || + strings.Contains(loggerLower, "bouncer") || + hasHeader(caddyLog.RespHeaders, "X-Crowdsec-Decision")) { + entry.Blocked = true + entry.Source = "crowdsec" + entry.Level = "warn" + entry.BlockReason = "CrowdSec decision" + return + } + + // Check for ACL + if caddyLog.Status == 403 && ( + strings.Contains(loggerLower, "acl") || + hasHeader(caddyLog.RespHeaders, "X-Acl-Denied")) { + entry.Blocked = true + entry.Source = "acl" + entry.Level = "warn" + entry.BlockReason = "Access list denied" + return + } + + // Check for rate limiting + if caddyLog.Status == 429 { + entry.Blocked = true + entry.Source = "ratelimit" + entry.Level = "warn" + entry.BlockReason = "Rate limit exceeded" + // ... extract rate limit headers ... + return + } + + // If it's a proxy log (reverse_proxy logger), mark as normal traffic + if strings.Contains(loggerLower, "reverse_proxy") || + strings.Contains(loggerLower, "access_log") { + entry.Source = "normal" + entry.Blocked = false + // Don't set level to warn for successful requests + if caddyLog.Status < 400 { + entry.Level = "info" + } + return + } + + // Default for unclassified 403s + if caddyLog.Status == 403 { + entry.Blocked = true + entry.Source = "cerberus" + entry.Level = "warn" + entry.BlockReason = "Access denied" + } +} +``` + +--- + +## Testing Plan + +### Pre-Checks +```bash +# 1. Verify container is running +docker ps | grep charon + +# 2. Check if crowdsec binary exists +docker exec charon which crowdsec +docker exec charon ls -la /usr/bin/crowdsec # Or wherever it's installed + +# 3. Check database config (charon.db is a binary SQLite file, so cat is not useful) +# Requires sqlite3 in the image; otherwise use a Go query. +# List the tables, then query the SecurityConfig table for crowdsec_mode. +docker exec charon sqlite3 /app/data/charon.db '.tables' + +# 4. Check Caddy log configuration +docker exec charon cat /config/caddy.json | jq '.logging' + +# 5. 
Find actual log files +docker exec charon find /var/log /app/data -name "*.log" -type f 2>/dev/null +``` + +### Test Scenario 1: CrowdSec Startup +```bash +# Given: Container restarts +docker restart charon + +# When: Container boots +# Then: +# - Check logs for CrowdSec reconciliation messages +# - Verify PID file created: /app/data/crowdsec/crowdsec.pid +# - Verify process running: docker exec charon ps aux | grep crowdsec +# - Verify status API returns running=true + +docker logs charon --tail 100 | grep -i "crowdsec" +docker exec charon ps aux | grep crowdsec +docker exec charon ls -la /app/data/crowdsec/crowdsec.pid +``` + +### Test Scenario 2: Toggle Behavior +```bash +# Given: CrowdSec is running +# When: User clicks toggle to disable +# Then: +# - Frontend shows loading state +# - API call succeeds +# - Process stops (no crowdsec in ps) +# - PID file removed +# - Toggle reflects OFF state +# - Status API returns running=false + +# When: User clicks toggle to enable +# Then: +# - Frontend shows loading state +# - API call succeeds +# - Process starts +# - PID file created +# - Toggle reflects ON state +# - Status API returns running=true +``` + +### Test Scenario 3: Security Log Viewer +```bash +# Given: CrowdSec is enabled and blocking traffic +# When: User opens Cerberus Dashboard +# Then: +# - WebSocket connects successfully (check browser console) +# - Logs appear in real-time +# - Blocked requests show with red indicator +# - Source badges show correct module (crowdsec, waf, etc.) + +# Test blocked request: +curl -H "User-Agent: BadBot" https://your-charon-instance.com +# Should see blocked log entry in dashboard +``` + +--- + +## Implementation Order + +1. **Phase 1: Diagnostics** (15 minutes) + - Run all pre-checks + - Document actual state of system + - Identify which issue is the primary blocker + +2. 
**Phase 2: CrowdSec Startup** (30 minutes) + - Implement enhanced logging in `crowdsec_startup.go` + - Add binary/config validation + - Test container restart + +3. **Phase 3: Toggle Fix** (20 minutes) + - Remove optimistic updates from `Security.tsx` + - Add status verification + - Test toggle on/off cycle + +4. **Phase 4: Log Viewer** (30 minutes) + - Verify log file path + - Implement log path detection + - Improve source detection + - Test with actual traffic + +5. **Phase 5: Integration Testing** (30 minutes) + - Full end-to-end test + - Verify all three issues resolved + - Check for regressions + +**Total Estimated Time:** Approximately 2 hours (125 minutes across the five phases) + +--- + +## Success Criteria + +✅ **CrowdSec Running:** +- `docker exec charon ps aux | grep crowdsec` shows running process +- PID file exists at `/app/data/crowdsec/crowdsec.pid` +- `/api/v1/admin/crowdsec/status` returns `{"running": true, "pid": }` + +✅ **Toggle Working:** +- Toggle can be turned on and off without getting stuck +- UI state matches backend process state +- Clear error messages if operations fail + +✅ **Logs Correct:** +- Security log viewer shows Caddy access logs +- Blocked requests appear with proper indicators +- Source badges correctly identify security module +- WebSocket stays connected + +--- + +## Rollback Plan + +If hotfix causes issues: + +1. **Revert Commits:** +```bash +git revert HEAD~3..HEAD # Revert last 3 commits +git push origin feature/beta-release +``` + +2. **Restart Container:** +```bash +docker restart charon +``` + +3. **Verify Basic Functionality:** +- Proxy hosts still work +- SSL still works +- No new errors in logs + +--- + +## Notes for QA + +- Test on clean container (no previous CrowdSec state) +- Test with existing CrowdSec config +- Test rapid toggle on/off cycles +- Monitor container logs during testing +- Check browser console for WebSocket errors +- Verify memory usage doesn't spike (log file tailing)
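
When testing Fix #1, note that the verification step relies on a fixed `time.Sleep(2 * time.Second)` before checking status. A bounded polling loop may be more robust on slow hosts. The following is a minimal sketch only: `statusFunc`, `waitForRunning`, and `demo` are hypothetical names introduced for illustration, and `statusFunc` merely mirrors the `(running, pid, err)` shape that this plan shows for `executor.Status`; it is not the project's actual interface.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// statusFunc is a hypothetical stand-in for a status check that returns
// (running, pid, err), mirroring the shape used by executor.Status in this plan.
type statusFunc func(ctx context.Context) (running bool, pid int, err error)

// waitForRunning polls check every interval until it reports running,
// or until ctx is cancelled or its deadline passes. Hypothetical helper.
func waitForRunning(ctx context.Context, check statusFunc, interval time.Duration) (int, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var lastErr error
	for {
		running, pid, err := check(ctx)
		if err == nil && running {
			return pid, nil
		}
		if err != nil {
			lastErr = err // remember the most recent failure for the final error
		}

		select {
		case <-ctx.Done():
			if lastErr != nil {
				return 0, fmt.Errorf("process not running before deadline: %w", lastErr)
			}
			return 0, errors.New("process not running before deadline")
		case <-ticker.C:
			// next poll
		}
	}
}

// demo runs waitForRunning against a fake status source that reports
// running on the third poll, and returns the result plus the poll count.
func demo() (int, error, int) {
	calls := 0
	fake := func(ctx context.Context) (bool, int, error) {
		calls++
		return calls >= 3, 1234, nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	pid, err := waitForRunning(ctx, fake, 10*time.Millisecond)
	return pid, err, calls
}

func main() {
	pid, err, calls := demo()
	fmt.Println(pid, err, calls)
}
```

In the reconciliation function, a call such as `waitForRunning(startCtx, ...)` wrapping the status check could stand in for the fixed sleep, so the 30-second start timeout also bounds the verification. Whether that fits depends on how the real executor behaves, which this sketch does not assume.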