- Added TestMigrateCommand_Succeeds to validate migration functionality. - Introduced TestStartupVerification_MissingTables to ensure proper handling of missing security tables. - Updated crowdsec_startup.go to log warnings for missing SecurityConfig table. - Enhanced documentation for database migrations during upgrades, including steps and expected outputs. - Created a detailed migration QA report outlining testing results and recommendations. - Added troubleshooting guidance for CrowdSec not starting after upgrades due to missing tables. - Established a new plan for addressing CrowdSec reconciliation failures, including root cause analysis and proposed fixes.
20 KiB
CrowdSec Integration Issues - Hotfix Plan
Date: December 14, 2025 Priority: HOTFIX - Critical Status: Investigation Complete, Ready for Implementation
Executive Summary
Three critical issues have been identified in the CrowdSec integration that prevent proper operation:
- CrowdSec process not actually running - Message displays but process isn't started
- Toggle state management broken - CrowdSec toggle on Cerberus Dashboard won't turn off
- Security log viewer shows wrong logs - Displays Plex/application logs instead of security logs
Investigation Findings
Container Status
Container: charon (1cc717562976)
Status: Up 4 hours (healthy)
Processes Running:
- PID 1: /bin/sh /docker-entrypoint.sh
- PID 31: caddy run --config /config/caddy.json
- PID 43: /usr/local/bin/dlv exec /app/charon (debugger)
- PID 52: /app/charon (main process)
CrowdSec Process: NOT RUNNING ❌
No PID file found at: /app/data/crowdsec/crowdsec.pid
Issue #1: CrowdSec Not Running
Root Cause:
- The error message "CrowdSec is not running" is accurate
crowdsecbinary process is not executing in the container- PID file
/app/data/crowdsec/crowdsec.piddoes not exist - Process detection in
crowdsec_exec.go:Status()correctly returnsrunning=false
Code Path:
backend/internal/api/handlers/crowdsec_exec.go:85
├── Status() checks PID file at: filepath.Join(configDir, "crowdsec.pid")
├── PID file missing → returns (running=false, pid=0, err=nil)
└── Frontend displays: "CrowdSec is not running"
Why CrowdSec Isn't Starting:
ReconcileCrowdSecOnStartup()runs at container boot (routes.go:360)- Checks
SecurityConfigtable forcrowdsec_mode = "local" - BUT: The mode might not be set to "local" or the process start is failing silently
- No error logs visible in container logs about CrowdSec startup failures
Files Involved:
backend/internal/services/crowdsec_startup.go- Reconciliation logicbackend/internal/api/handlers/crowdsec_exec.go- Process executorbackend/internal/api/handlers/crowdsec_handler.go- Status endpoint
Issue #2: Toggle Won't Turn Off
Root Cause: Frontend state management has optimistic updates that don't properly reconcile with backend state.
Code Path:
frontend/src/pages/Security.tsx:94-113 (crowdsecPowerMutation)
├── onMutate: Optimistically sets crowdsec.enabled = new value
├── mutationFn: Calls updateSetting() then startCrowdsec() or stopCrowdsec()
├── onError: Reverts optimistic update but may not fully sync
└── onSuccess: Calls fetchCrowdsecStatus() but state may be stale
The Problem:
// Optimistic update sets enabled immediately
queryClient.setQueryData(['security-status'], (old) => {
copy.crowdsec = { ...copy.crowdsec, enabled } // ← State updated BEFORE API call
})
// If API fails or times out, toggle appears stuck
Why Toggle Appears Stuck:
- User clicks toggle → Frontend immediately updates UI to "enabled"
- Backend API is called to start CrowdSec
- CrowdSec process fails to start (see Issue #1)
- API returns success (because the setting was updated)
- Frontend thinks CrowdSec is enabled, but
Status()API saysrunning=false - Toggle now in inconsistent state - shows "on" but status says "not running"
Files Involved:
frontend/src/pages/Security.tsx:94-136- Toggle mutation logicfrontend/src/pages/CrowdSecConfig.tsx:105- Status checkbackend/internal/api/handlers/security_handler.go:60-175- GetStatus priority chain
Issue #3: Security Log Viewer Shows Wrong Logs
Root Cause:
The LiveLogViewer component connects to the correct /api/v1/cerberus/logs/ws endpoint, but the LogWatcher service is reading from /var/log/caddy/access.log which may not exist or may contain the wrong logs.
Code Path:
frontend/src/pages/Security.tsx:411
├── <LiveLogViewer mode="security" securityFilters={{}} />
└── Connects to: ws://localhost:8080/api/v1/cerberus/logs/ws
backend/internal/api/routes/routes.go:362-390
├── LogWatcher initialized with: accessLogPath = "/var/log/caddy/access.log"
├── File exists check: Creates empty file if missing
└── Starts tailing: services.LogWatcher.tailFile()
backend/internal/services/log_watcher.go:139-186
├── Opens /var/log/caddy/access.log
├── Seeks to end of file
└── Reads new lines, parses as Caddy JSON logs
The Problem:
The log file path /var/log/caddy/access.log is hardcoded and may not match where Caddy is actually writing logs. The user reports seeing Plex logs, which suggests:
- Wrong log file - The LogWatcher might be reading an old/wrong log file
- Parsing issue - Caddy logs aren't properly formatted as expected
- Source detection broken - Logs are being classified as "normal" instead of security events
Verification Needed:
# Check where Caddy is actually logging
docker exec charon cat /config/caddy.json | jq '.logging'
# Check if the access.log file exists and contains recent entries
docker exec charon tail -50 /var/log/caddy/access.log
# Check Caddy data directory
docker exec charon ls -la /app/data/caddy/
Files Involved:
backend/internal/api/routes/routes.go:366- accessLogPath definitionbackend/internal/services/log_watcher.go- File tailing and parsingbackend/internal/api/handlers/cerberus_logs_ws.go- WebSocket handlerfrontend/src/components/LiveLogViewer.tsx- Frontend component
Root Cause Summary
| Issue | Root Cause | Impact |
|---|---|---|
| CrowdSec not running | Process start fails silently OR mode not set to "local" in DB | User cannot use CrowdSec features |
| Toggle stuck | Optimistic UI updates + API success despite process failure | Confusing UX, user can't disable |
| Wrong logs displayed | LogWatcher reading wrong file OR parsing application logs | User can't monitor security events |
Proposed Fixes
Fix #1: CrowdSec Process Start Issues
Change X → Y Impact:
File: backend/internal/services/crowdsec_startup.go
IF Change: Add detailed logging + retry mechanism
THEN Impact:
✓ Startup failures become visible in logs
✓ Transient failures (DB not ready) are retried
✓ CrowdSec has better chance of starting on boot
⚠ Retry logic could delay boot by a few seconds
IF Change: Validate binPath exists before calling Start()
THEN Impact:
✓ Prevent calling Start() if crowdsec binary missing
✓ Clear error message to user
⚠ Additional filesystem check on every reconcile
Implementation:
// backend/internal/services/crowdsec_startup.go
func ReconcileCrowdSecOnStartup(db *gorm.DB, executor CrowdsecProcessManager, binPath, dataDir string) {
logger.Log().Info("Starting CrowdSec reconciliation on startup")
// ... existing checks ...
// VALIDATE: Ensure binary exists
if _, err := os.Stat(binPath); os.IsNotExist(err) {
logger.Log().WithField("path", binPath).Error("CrowdSec binary not found, cannot start")
return
}
// VALIDATE: Ensure config directory exists
if _, err := os.Stat(dataDir); os.IsNotExist(err) {
logger.Log().WithField("path", dataDir).Error("CrowdSec config directory not found, cannot start")
return
}
// ... existing status check ...
// START with better error handling
logger.Log().WithFields(logrus.Fields{
"bin_path": binPath,
"data_dir": dataDir,
}).Info("Attempting to start CrowdSec process")
startCtx, startCancel := context.WithTimeout(context.Background(), 30*time.Second)
defer startCancel()
newPid, err := executor.Start(startCtx, binPath, dataDir)
if err != nil {
logger.Log().WithError(err).WithFields(logrus.Fields{
"bin_path": binPath,
"data_dir": dataDir,
}).Error("CrowdSec reconciliation: FAILED to start CrowdSec - check binary path and config")
return
}
// VERIFY: Wait for PID file to be written
time.Sleep(2 * time.Second)
running, pid, err := executor.Status(ctx, dataDir)
if err != nil || !running {
logger.Log().WithFields(logrus.Fields{
"expected_pid": newPid,
"actual_pid": pid,
"running": running,
}).Error("CrowdSec process started but not running - process may have crashed")
return
}
logger.Log().WithField("pid", newPid).Info("CrowdSec reconciliation: successfully started and verified CrowdSec")
}
Fix #2: Toggle State Management
Change X → Y Impact:
File: frontend/src/pages/Security.tsx
IF Change: Remove optimistic updates, wait for API confirmation
THEN Impact:
✓ Toggle always reflects actual backend state
✓ No "stuck toggle" UX issue
⚠ Toggle feels slightly slower (100-200ms delay)
⚠ User must wait for API response before seeing change
IF Change: Add explicit error handling + status reconciliation
THEN Impact:
✓ Errors are clearly shown to user
✓ Toggle reverts on failure
✓ Status check after mutation ensures consistency
⚠ Additional API call overhead
Implementation:
// frontend/src/pages/Security.tsx
const crowdsecPowerMutation = useMutation({
mutationFn: async (enabled: boolean) => {
// Update setting first
await updateSetting('security.crowdsec.enabled', enabled ? 'true' : 'false', 'security', 'bool')
if (enabled) {
toast.info('Starting CrowdSec... This may take up to 30 seconds')
const result = await startCrowdsec()
// VERIFY: Check if it actually started
const status = await statusCrowdsec()
if (!status.running) {
throw new Error('CrowdSec setting enabled but process failed to start. Check server logs.')
}
return result
} else {
await stopCrowdsec()
// VERIFY: Check if it actually stopped
const status = await statusCrowdsec()
if (status.running) {
throw new Error('CrowdSec setting disabled but process still running. Check server logs.')
}
return { enabled: false }
}
},
// REMOVE OPTIMISTIC UPDATES
onMutate: undefined,
onError: (err: unknown, enabled: boolean) => {
const msg = err instanceof Error ? err.message : String(err)
toast.error(enabled ? `Failed to start CrowdSec: ${msg}` : `Failed to stop CrowdSec: ${msg}`)
// Force refresh status from backend
queryClient.invalidateQueries({ queryKey: ['security-status'] })
fetchCrowdsecStatus()
},
onSuccess: async () => {
// Refresh all related queries to ensure consistency
await Promise.all([
queryClient.invalidateQueries({ queryKey: ['security-status'] }),
queryClient.invalidateQueries({ queryKey: ['settings'] }),
fetchCrowdsecStatus(),
])
toast.success('CrowdSec status updated successfully')
},
})
Fix #3: Security Log Viewer
Change X → Y Impact:
File: backend/internal/api/routes/routes.go + backend/internal/services/log_watcher.go
IF Change: Make log path configurable + validate it exists
THEN Impact:
✓ Can specify correct log file via env var
✓ Graceful fallback if file doesn't exist
✓ Clear error logging about file path issues
⚠ Requires updating deployment/env vars
IF Change: Improve log parsing + source detection
THEN Impact:
✓ Better classification of security events
✓ Clearer distinction between app logs and security logs
⚠ More CPU overhead for regex matching
Implementation Plan:
- Verify Current Log Configuration:
# Check Caddy config for logging directive
docker exec charon cat /config/caddy.json | jq '.logging.logs'
# Find where Caddy is actually writing logs
docker exec charon find /app/data /var/log -name "*.log" -type f 2>/dev/null
# Check if access.log has recent entries
docker exec charon tail -20 /var/log/caddy/access.log
- Add Log Path Validation:
// backend/internal/api/routes/routes.go:366
accessLogPath := os.Getenv("CHARON_CADDY_ACCESS_LOG")
if accessLogPath == "" {
// Try multiple paths in order of preference
candidatePaths := []string{
"/var/log/caddy/access.log",
filepath.Join(cfg.CaddyConfigDir, "logs", "access.log"),
filepath.Join(dataDir, "logs", "access.log"),
}
for _, path := range candidatePaths {
if _, err := os.Stat(path); err == nil {
accessLogPath = path
logger.Log().WithField("path", path).Info("Found existing Caddy access log")
break
}
}
// If none exist, use default and create it
if accessLogPath == "" {
accessLogPath = "/var/log/caddy/access.log"
logger.Log().WithField("path", accessLogPath).Warn("No existing access log found, will create at default path")
}
}
logger.Log().WithField("path", accessLogPath).Info("Initializing LogWatcher with access log path")
- Improve Source Detection:
// backend/internal/services/log_watcher.go:221
func (w *LogWatcher) detectSecurityEvent(entry *models.SecurityLogEntry, caddyLog *models.CaddyAccessLog) {
// Enhanced logger name checking
loggerLower := strings.ToLower(caddyLog.Logger)
// Check for WAF/Coraza
if caddyLog.Status == 403 && (
strings.Contains(loggerLower, "waf") ||
strings.Contains(loggerLower, "coraza") ||
hasHeader(caddyLog.RespHeaders, "X-Coraza-Id")) {
entry.Blocked = true
entry.Source = "waf"
entry.Level = "warn"
entry.BlockReason = "WAF rule triggered"
// ... extract rule ID ...
return
}
// Check for CrowdSec
if caddyLog.Status == 403 && (
strings.Contains(loggerLower, "crowdsec") ||
strings.Contains(loggerLower, "bouncer") ||
hasHeader(caddyLog.RespHeaders, "X-Crowdsec-Decision")) {
entry.Blocked = true
entry.Source = "crowdsec"
entry.Level = "warn"
entry.BlockReason = "CrowdSec decision"
return
}
// Check for ACL
if caddyLog.Status == 403 && (
strings.Contains(loggerLower, "acl") ||
hasHeader(caddyLog.RespHeaders, "X-Acl-Denied")) {
entry.Blocked = true
entry.Source = "acl"
entry.Level = "warn"
entry.BlockReason = "Access list denied"
return
}
// Check for rate limiting
if caddyLog.Status == 429 {
entry.Blocked = true
entry.Source = "ratelimit"
entry.Level = "warn"
entry.BlockReason = "Rate limit exceeded"
// ... extract rate limit headers ...
return
}
// If it's a proxy log (reverse_proxy logger), mark as normal traffic
if strings.Contains(loggerLower, "reverse_proxy") ||
strings.Contains(loggerLower, "access_log") {
entry.Source = "normal"
entry.Blocked = false
// Don't set level to warn for successful requests
if caddyLog.Status < 400 {
entry.Level = "info"
}
return
}
// Default for unclassified 403s
if caddyLog.Status == 403 {
entry.Blocked = true
entry.Source = "cerberus"
entry.Level = "warn"
entry.BlockReason = "Access denied"
}
}
Testing Plan
Pre-Checks
# 1. Verify container is running
docker ps | grep charon
# 2. Check if crowdsec binary exists
docker exec charon which crowdsec
docker exec charon ls -la /usr/bin/crowdsec # Or wherever it's installed
# 3. Check database config
docker exec charon cat /app/data/charon.db # Would need sqlite3 or Go query
# 4. Check Caddy log configuration
docker exec charon cat /config/caddy.json | jq '.logging'
# 5. Find actual log files
docker exec charon find /var/log /app/data -name "*.log" -type f 2>/dev/null
Test Scenario 1: CrowdSec Startup
# Given: Container restarts
docker restart charon
# When: Container boots
# Then:
# - Check logs for CrowdSec reconciliation messages
# - Verify PID file created: /app/data/crowdsec/crowdsec.pid
# - Verify process running: docker exec charon ps aux | grep crowdsec
# - Verify status API returns running=true
docker logs charon --tail 100 | grep -i "crowdsec"
docker exec charon ps aux | grep crowdsec
docker exec charon ls -la /app/data/crowdsec/crowdsec.pid
Test Scenario 2: Toggle Behavior
# Given: CrowdSec is running
# When: User clicks toggle to disable
# Then:
# - Frontend shows loading state
# - API call succeeds
# - Process stops (no crowdsec in ps)
# - PID file removed
# - Toggle reflects OFF state
# - Status API returns running=false
# When: User clicks toggle to enable
# Then:
# - Frontend shows loading state
# - API call succeeds
# - Process starts
# - PID file created
# - Toggle reflects ON state
# - Status API returns running=true
Test Scenario 3: Security Log Viewer
# Given: CrowdSec is enabled and blocking traffic
# When: User opens Cerberus Dashboard
# Then:
# - WebSocket connects successfully (check browser console)
# - Logs appear in real-time
# - Blocked requests show with red indicator
# - Source badges show correct module (crowdsec, waf, etc.)
# Test blocked request:
curl -H "User-Agent: BadBot" https://your-charon-instance.com
# Should see blocked log entry in dashboard
Implementation Order
-
Phase 1: Diagnostics (15 minutes)
- Run all pre-checks
- Document actual state of system
- Identify which issue is the primary blocker
-
Phase 2: CrowdSec Startup (30 minutes)
- Implement enhanced logging in
crowdsec_startup.go - Add binary/config validation
- Test container restart
- Implement enhanced logging in
-
Phase 3: Toggle Fix (20 minutes)
- Remove optimistic updates from
Security.tsx - Add status verification
- Test toggle on/off cycle
- Remove optimistic updates from
-
Phase 4: Log Viewer (30 minutes)
- Verify log file path
- Implement log path detection
- Improve source detection
- Test with actual traffic
-
Phase 5: Integration Testing (30 minutes)
- Full end-to-end test
- Verify all three issues resolved
- Check for regressions
Total Estimated Time: 2 hours
Success Criteria
✅ CrowdSec Running:
docker exec charon ps aux | grep crowdsecshows running process- PID file exists at
/app/data/crowdsec/crowdsec.pid /api/v1/admin/crowdsec/statusreturns{"running": true, "pid": <number>}
✅ Toggle Working:
- Toggle can be turned on and off without getting stuck
- UI state matches backend process state
- Clear error messages if operations fail
✅ Logs Correct:
- Security log viewer shows Caddy access logs
- Blocked requests appear with proper indicators
- Source badges correctly identify security module
- WebSocket stays connected
Rollback Plan
If hotfix causes issues:
- Revert Commits:
git revert HEAD~3..HEAD # Revert last 3 commits
git push origin feature/beta-release
- Restart Container:
docker restart charon
- Verify Basic Functionality:
- Proxy hosts still work
- SSL still works
- No new errors in logs
Notes for QA
- Test on clean container (no previous CrowdSec state)
- Test with existing CrowdSec config
- Test rapid toggle on/off cycles
- Monitor container logs during testing
- Check browser console for WebSocket errors
- Verify memory usage doesn't spike (log file tailing)
QA Testing Results (December 15, 2025)
Tester: QA_Security Build: charon:local (post-migration implementation) Test Date: 2025-12-15 03:24 UTC
Phase 1: Migration Implementation Testing
Test 1.1: Migration Command Execution
- Status: ✅ PASSED
- Command:
docker exec charon /app/charon migrate - Result: All 6 security tables created successfully
- Evidence: See crowdsec_migration_qa_report.md
Test 1.2: CrowdSec Auto-Start Behavior
- Status: ⚠️ EXPECTED BEHAVIOR (Not a Bug)
- Observation: CrowdSec did NOT auto-start after restart
- Reason: Fresh database has no SecurityConfig record, only table structure
- Resolution: This is correct first-boot behavior
Phase 2: Code Quality Validation
- Pre-commit: ✅ All hooks passed
- Backend Tests: ✅ 9/9 packages passed (including 3 new migration tests)
- Frontend Tests: ✅ 772 tests passed | 2 skipped
- Code Cleanliness: ✅ No debug statements, zero linter issues
Phase 3: Regression Testing
- Schema Impact: ✅ No changes to existing tables
- Feature Validation: ✅ All 772 tests passed, no regressions
Summary
QA Sign-Off: ✅ APPROVED FOR PRODUCTION
Detailed Report: crowdsec_migration_qa_report.md