# CrowdSec Reconciliation Failure Root Cause Analysis

**Date:** December 15, 2025
**Status:** CRITICAL - CrowdSec NOT starting despite 7+ commits attempting fixes
**Location:** `backend/internal/services/crowdsec_startup.go`

## Executive Summary

**The CrowdSec reconciliation function starts but exits silently** because the `security_configs` table **DOES NOT EXIST** in the production database. The table was added to AutoMigrate but the container was never rebuilt/restarted with a fresh database state after the migration code was added.

## The Silent Exit Point

Looking at the container logs:

```
{"bin_path":"crowdsec","data_dir":"/app/data/crowdsec","level":"info","msg":"CrowdSec reconciliation: starting startup check","time":"2025-12-14T20:55:39-05:00"}
```

Then... NOTHING. The function exits silently.

### Why It Exits

In `backend/internal/services/crowdsec_startup.go`, lines 33-36:

```go
// Check if SecurityConfig table exists and has a record with CrowdSecMode = "local"
if !db.Migrator().HasTable(&models.SecurityConfig{}) {
	logger.Log().Debug("CrowdSec reconciliation skipped: SecurityConfig table not found")
	return
}
```

**This guard clause triggers because the table doesn't exist**, but it logs at **DEBUG** level, not INFO/WARN/ERROR. Since the container is running in production mode (not debug), this log message is never shown.

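The effect of the log level can be reproduced in isolation. The following is a minimal sketch that uses Go's standard `log/slog` package as a stand-in for the project's `logger` package (an assumption: the real logger and its production threshold are not shown in this analysis). The helper name `visibleAtInfo` is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"strings"
)

// visibleAtInfo emits one DEBUG and one WARN record through a logger whose
// minimum level is INFO (assumed here to resemble a production setting) and
// reports which of the two records reached the output.
func visibleAtInfo() (debugShown, warnShown bool) {
	var buf bytes.Buffer
	handler := slog.NewTextHandler(&buf, &slog.HandlerOptions{Level: slog.LevelInfo})
	logger := slog.New(handler)

	logger.Debug("reconciliation skipped (debug)")
	logger.Warn("reconciliation skipped (warn)")

	out := buf.String()
	return strings.Contains(out, "(debug)"), strings.Contains(out, "(warn)")
}

func main() {
	debugShown, warnShown := visibleAtInfo()
	fmt.Println("debug record visible:", debugShown)
	fmt.Println("warn record visible:", warnShown)
}
```

With an INFO threshold the DEBUG record is dropped and the WARN record is written, which matches the behaviour seen in the container logs: the guard clause fires, but nothing is printed.
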
### Database Evidence

```bash
$ sqlite3 data/charon.db ".tables"
access_lists            remote_servers
caddy_configs           settings
domains                 ssl_certificates
import_sessions         uptime_heartbeats
locations               uptime_hosts
proxy_hosts             uptime_monitors
notification_providers  uptime_notification_events
notifications           users
```

**NO `security_configs` TABLE EXISTS.** Yet the code in `backend/internal/api/routes/routes.go` clearly calls:

```go
if err := db.AutoMigrate(
	// ... other models ...
	&models.SecurityConfig{},
	&models.SecurityDecision{},
	&models.SecurityAudit{},
	&models.SecurityRuleSet{},
	// ...
); err != nil {
	return fmt.Errorf("auto migrate: %w", err)
}
```

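The gap between the `.tables` output and the models passed to AutoMigrate can be checked mechanically. The sketch below is illustrative only: `missingTables` is a hypothetical helper, and the expected table names for the decision, audit, and rule-set models are taken from the manual SQL in Option 1 rather than confirmed against the model definitions.

```go
package main

import (
	"fmt"
	"sort"
)

// missingTables returns, in sorted order, every required table name that is
// absent from the existing list.
func missingTables(existing, required []string) []string {
	have := make(map[string]bool, len(existing))
	for _, t := range existing {
		have[t] = true
	}
	var missing []string
	for _, t := range required {
		if !have[t] {
			missing = append(missing, t)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Table names copied from the `.tables` output above.
	existing := []string{
		"access_lists", "caddy_configs", "domains", "import_sessions",
		"locations", "proxy_hosts", "notification_providers", "notifications",
		"remote_servers", "settings", "ssl_certificates", "uptime_heartbeats",
		"uptime_hosts", "uptime_monitors", "uptime_notification_events", "users",
	}
	// Assumed table names for the four security models.
	required := []string{
		"security_configs", "security_decisions",
		"security_audits", "security_rule_sets",
	}
	for _, t := range missingTables(existing, required) {
		fmt.Println("missing:", t)
	}
}
```

Run against the table list above, all four security tables are reported as missing.
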
## Why AutoMigrate Didn't Create the Tables

### Theory 1: Database Persistence Across Rebuilds ✅ MOST LIKELY

The `charon.db` file lives in `./data`, which is mounted as a volume in the Docker container:

```yaml
# docker-compose.yml
volumes:
  - ./data:/app/data
```

**What happened:**

1. SecurityConfig model was added to AutoMigrate in recent commits
2. Container was rebuilt with `docker build -t charon:local .`
3. Container started with `docker compose up -d`
4. **BUT** the existing `data/charon.db` file (from before the migration code existed) was reused
5. GORM's AutoMigrate is **non-destructive** - it creates missing tables and columns but never drops or rebuilds existing data
6. The security tables are still absent from this database, which predates the migration code, so the updated AutoMigrate call has apparently never completed against this file

### Theory 2: AutoMigrate Failed Silently

Looking at the logs, there is **NO** indication that AutoMigrate failed:

```
{"level":"info","msg":"starting Charon backend on version dev","time":"2025-12-14T20:55:39-05:00"}
{"bin_path":"crowdsec","data_dir":"/app/data/crowdsec","level":"info","msg":"CrowdSec reconciliation: starting startup check","time":"2025-12-14T20:55:39-05:00"}
{"level":"info","msg":"starting Charon backend on :8080","time":"2025-12-14T20:55:39-05:00"}
```

If AutoMigrate had failed, we would see an error from `routes.Register()` because it has:

```go
if err := db.AutoMigrate(...); err != nil {
	return fmt.Errorf("auto migrate: %w", err)
}
```

Since the server started successfully, AutoMigrate either:

- Ran successfully but found the DB already in sync (no new tables to add)
- Never ran because the DB was opened but the tables already existed from a previous run

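The argument above relies on a migration error being wrapped and returned to the caller. The following minimal sketch shows that propagation with the standard library only; `register` and `errMigrate` are hypothetical stand-ins, and whether the real startup code logs or aborts on the returned error is not shown in this analysis.

```go
package main

import (
	"errors"
	"fmt"
)

// errMigrate is a hypothetical stand-in for an error returned by db.AutoMigrate.
var errMigrate = errors.New("no such column")

// register mimics the shape of the AutoMigrate call shown above: any migration
// error is wrapped with %w and returned to the caller.
func register(migrate func() error) error {
	if err := migrate(); err != nil {
		return fmt.Errorf("auto migrate: %w", err)
	}
	return nil
}

func main() {
	failed := register(func() error { return errMigrate })
	fmt.Println(failed)                        // auto migrate: no such column
	fmt.Println(errors.Is(failed, errMigrate)) // true

	ok := register(func() error { return nil })
	fmt.Println(ok == nil) // true
}
```

Because `%w` preserves the original error, a failed migration would surface to whoever calls the registering function, provided that caller does not discard the return value.
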
## The Cascading Failures

Because `security_configs` doesn't exist:

1. ✅ Reconciliation exits at lines 33-36 (HasTable check)
2. ✅ CrowdSec is never started
3. ✅ Frontend shows "CrowdSec is not running" in Console Enrollment
4. ✅ Security page toggle is stuck ON (because there's no DB record to persist the state)
5. ✅ Log viewer shows "disconnected" (CrowdSec process doesn't exist)
6. ✅ All subsequent API calls fail because they expect the table to exist

## Why This Wasn't Caught During Development

Looking at the test files, **EVERY TEST** manually calls AutoMigrate:

```go
// backend/internal/services/crowdsec_startup_test.go:75
err = db.AutoMigrate(&models.SecurityConfig{})

// backend/internal/api/handlers/security_handler_coverage_test.go:25
require.NoError(t, db.AutoMigrate(&models.SecurityConfig{}, ...))
```

So tests **always create the table fresh**, hiding the issue that would occur in production with a persistent database.

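The difference between the two starting states can be modelled without a real database. The sketch below is an in-memory illustration only: `schema`, `migrate`, and `reconcile` are hypothetical stand-ins for the database, AutoMigrate, and the guard clause. It shows why a test that begins from a legacy fixture exercises a path that a freshly migrated test database never reaches.

```go
package main

import "fmt"

// schema is an in-memory stand-in for a database: the set of table names it holds.
type schema map[string]bool

// migrate creates any listed table that is missing and leaves existing ones alone.
func migrate(s schema, tables ...string) {
	for _, t := range tables {
		s[t] = true
	}
}

// reconcile mirrors the guard clause: it refuses to proceed without the table.
func reconcile(s schema) error {
	if !s["security_configs"] {
		return fmt.Errorf("security_configs table not found")
	}
	return nil
}

func main() {
	// Fresh database, as in the existing unit tests: migrate first, then reconcile.
	fresh := schema{}
	migrate(fresh, "security_configs")
	fmt.Println("fresh:", reconcile(fresh))

	// Legacy fixture: tables from an older release, with no security tables
	// and no migration step before reconciliation.
	legacy := schema{"users": true, "settings": true}
	fmt.Println("legacy:", reconcile(legacy))
}
```

A real integration test would replace the map with a SQLite file created by an older schema, but the structure of the check stays the same.
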
## The Fix

### Option 1: Manual Database Migration (IMMEDIATE FIX)

Run this on the production container:

```bash
# Connect to running container
docker exec -it charon /bin/sh

# Run the migration command (requires the migrate CLI command from Option 2)
./backend migrate

# OR manually create tables with sqlite3
sqlite3 /app/data/charon.db << EOF
CREATE TABLE IF NOT EXISTS security_configs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    name TEXT,
    enabled BOOLEAN DEFAULT false,
    admin_whitelist TEXT,
    break_glass_hash TEXT,
    crowdsec_mode TEXT DEFAULT 'disabled',
    crowdsec_api_url TEXT,
    waf_mode TEXT DEFAULT 'disabled',
    waf_rules_source TEXT,
    waf_learning BOOLEAN DEFAULT false,
    waf_paranoia_level INTEGER DEFAULT 1,
    waf_exclusions TEXT,
    rate_limit_mode TEXT DEFAULT 'disabled',
    rate_limit_enable BOOLEAN DEFAULT false,
    rate_limit_burst INTEGER DEFAULT 10,
    rate_limit_requests INTEGER DEFAULT 100,
    rate_limit_window_sec INTEGER DEFAULT 60,
    rate_limit_bypass_list TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS security_decisions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    ip TEXT NOT NULL,
    reason TEXT,
    action TEXT DEFAULT 'ban',
    duration INTEGER,
    expires_at DATETIME,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS security_audits (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    event_type TEXT,
    ip_address TEXT,
    details TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS security_rule_sets (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    name TEXT NOT NULL,
    type TEXT DEFAULT 'ip_list',
    content TEXT,
    enabled BOOLEAN DEFAULT true,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS crowdsec_preset_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    name TEXT NOT NULL,
    description TEXT,
    enabled BOOLEAN DEFAULT false,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS crowdsec_console_enrollments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    enrollment_key TEXT,
    organization_id TEXT,
    instance_name TEXT,
    enrolled_at DATETIME,
    status TEXT DEFAULT 'pending',
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
EOF

# Restart container
exit
docker restart charon
```

### Option 2: Add Migration CLI Command (CLEAN SOLUTION)

Add to `backend/cmd/api/main.go`:

```go
// Handle CLI commands
if len(os.Args) > 1 {
	switch os.Args[1] {
	case "migrate":
		cfg, err := config.Load()
		if err != nil {
			log.Fatalf("load config: %v", err)
		}

		db, err := database.Connect(cfg.DatabasePath)
		if err != nil {
			log.Fatalf("connect database: %v", err)
		}

		logger.Log().Info("Running database migrations...")
		if err := db.AutoMigrate(
			&models.SecurityConfig{},
			&models.SecurityDecision{},
			&models.SecurityAudit{},
			&models.SecurityRuleSet{},
			&models.CrowdsecPresetEvent{},
			&models.CrowdsecConsoleEnrollment{},
		); err != nil {
			log.Fatalf("migration failed: %v", err)
		}

		logger.Log().Info("Migration completed successfully")
		return

	case "reset-password":
		// existing reset-password code
	}
}
```

Then run:

```bash
docker exec charon /app/backend migrate
docker restart charon
```

### Option 3: Nuclear Option - Reset Database (DESTRUCTIVE)

```bash
# BACKUP FIRST (create the backups directory if it does not exist yet)
docker exec charon mkdir -p /app/data/backups
docker exec charon cp /app/data/charon.db /app/data/backups/charon-pre-security-migration.db

# Stop the container so SQLite is not writing while the files are removed
docker stop charon

# Remove database (-f: the -shm/-wal files may not exist)
rm -f data/charon.db data/charon.db-shm data/charon.db-wal

# Start container (will recreate fresh DB with all tables)
docker start charon
```

## Fix Verification Checklist

After applying any fix, verify:

1. ✅ Check table exists:

   ```bash
   docker exec charon sqlite3 /app/data/charon.db "SELECT name FROM sqlite_master WHERE type='table' AND name='security_configs';"
   ```

   Expected: `security_configs`

2. ✅ Check reconciliation logs:

   ```bash
   docker logs charon 2>&1 | grep -i "crowdsec reconciliation"
   ```

   Expected: "starting CrowdSec" or "already running" (NOT "skipped: SecurityConfig table not found")

3. ✅ Check CrowdSec is running:

   ```bash
   docker exec charon ps aux | grep crowdsec
   ```

   Expected: `crowdsec -c /app/data/crowdsec/config/config.yaml`

4. ✅ Check frontend Console Enrollment:
   - Navigate to `/security` page
   - Click "Console Enrollment" tab
   - Should show CrowdSec status as "Running"

5. ✅ Check toggle state persists:
   - Toggle CrowdSec OFF
   - Refresh page
   - Toggle should remain OFF

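The log check in step 2 can also be scripted. The sketch below parses JSON log lines of the shape shown earlier and reports whether any reconciliation message contains "skipped"; `reconciliationMessages` and `skipped` are hypothetical helpers, and only the `level` and `msg` fields visible in the sample logs are assumed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry holds the two fields of interest from the JSON log lines shown earlier.
type logEntry struct {
	Level string `json:"level"`
	Msg   string `json:"msg"`
}

// reconciliationMessages returns the msg field of every JSON log line that
// mentions CrowdSec reconciliation; lines that are not valid JSON are skipped.
func reconciliationMessages(lines []string) []string {
	var msgs []string
	for _, line := range lines {
		var e logEntry
		if err := json.Unmarshal([]byte(line), &e); err != nil {
			continue
		}
		if strings.Contains(strings.ToLower(e.Msg), "crowdsec reconciliation") {
			msgs = append(msgs, e.Msg)
		}
	}
	return msgs
}

// skipped reports whether any reconciliation message indicates the guard clause fired.
func skipped(msgs []string) bool {
	for _, m := range msgs {
		if strings.Contains(m, "skipped") {
			return true
		}
	}
	return false
}

func main() {
	lines := []string{
		`{"level":"info","msg":"starting Charon backend on version dev","time":"2025-12-14T20:55:39-05:00"}`,
		`{"bin_path":"crowdsec","data_dir":"/app/data/crowdsec","level":"info","msg":"CrowdSec reconciliation: starting startup check","time":"2025-12-14T20:55:39-05:00"}`,
	}
	msgs := reconciliationMessages(lines)
	fmt.Println("reconciliation messages:", len(msgs))
	fmt.Println("guard clause fired:", skipped(msgs))
}
```

Note that with the current DEBUG-level guard clause the "skipped" message never reaches the logs, so this check only becomes meaningful once the message is raised to WARN.
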
## Code Improvements Needed

### 1. Change Debug Log to Warning

**File:** `backend/internal/services/crowdsec_startup.go:35`

```go
// BEFORE (line 35)
logger.Log().Debug("CrowdSec reconciliation skipped: SecurityConfig table not found")

// AFTER
logger.Log().Warn("CrowdSec reconciliation skipped: SecurityConfig table not found - run migrations")
```

**Rationale:** This is NOT a debug-level issue. If the table doesn't exist, it's a critical setup problem that should always be logged, regardless of debug mode.

### 2. Add Startup Migration Check

**File:** `backend/cmd/api/main.go` (after `database.Connect()`)

```go
// Verify critical tables exist before starting server
requiredTables := []interface{}{
	&models.SecurityConfig{},
	&models.SecurityDecision{},
	&models.SecurityAudit{},
	&models.SecurityRuleSet{},
}

for _, model := range requiredTables {
	if !db.Migrator().HasTable(model) {
		logger.Log().Warnf("Missing table for %T - running migration", model)
		if err := db.AutoMigrate(model); err != nil {
			log.Fatalf("failed to migrate %T: %v", model, err)
		}
	}
}
```

### 3. Add Health Check for Tables

**File:** `backend/internal/api/handlers/health.go`

```go
func HealthHandler(c *gin.Context) {
	db := c.MustGet("db").(*gorm.DB)

	health := gin.H{
		"status":     "healthy",
		"database":   "connected",
		"migrations": checkMigrations(db),
	}

	c.JSON(200, health)
}

func checkMigrations(db *gorm.DB) map[string]bool {
	return map[string]bool{
		"security_configs":   db.Migrator().HasTable(&models.SecurityConfig{}),
		"security_decisions": db.Migrator().HasTable(&models.SecurityDecision{}),
		"security_audits":    db.Migrator().HasTable(&models.SecurityAudit{}),
		"security_rule_sets": db.Migrator().HasTable(&models.SecurityRuleSet{}),
	}
}
```

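One design consideration: the handler above reports `"status": "healthy"` with HTTP 200 even when a table is missing, so an automated health probe would not notice the problem. The sketch below is one possible variant, written against the standard `net/http` package rather than Gin to stay self-contained; `tableChecker`, `healthHandler`, and `probe` are hypothetical names, and the "degraded" status string is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// tableChecker is a hypothetical stand-in for db.Migrator().HasTable.
type tableChecker func(table string) bool

// healthHandler reports per-table migration state and answers 503 as soon as
// any required table is missing, so probes see the failure.
func healthHandler(hasTable tableChecker, required []string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		migrations := make(map[string]bool, len(required))
		status, code := "healthy", http.StatusOK
		for _, t := range required {
			ok := hasTable(t)
			migrations[t] = ok
			if !ok {
				status, code = "degraded", http.StatusServiceUnavailable
			}
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(map[string]any{
			"status":     status,
			"migrations": migrations,
		})
	}
}

// probe runs the handler once against the given set of existing tables and
// returns the HTTP status code it produced.
func probe(existing map[string]bool) int {
	required := []string{"security_configs", "security_decisions"}
	h := healthHandler(func(t string) bool { return existing[t] }, required)
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/api/v1/health", nil))
	return rec.Code
}

func main() {
	fmt.Println(probe(map[string]bool{"security_configs": true, "security_decisions": true}))
	fmt.Println(probe(map[string]bool{"users": true}))
}
```

The first call returns 200 and the second 503. Whether a missing security table should fail the whole health check or only be surfaced in the response body is a judgment call for the project.
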
## Related Issues

- Frontend toggle stuck in ON position → Database issue (no table to persist state)
- Console Enrollment says "not running" → CrowdSec never started (reconciliation exits)
- Log viewer disconnected → CrowdSec process doesn't exist
- All 7 previous commits failed because they addressed symptoms, not the root cause

## Lessons Learned

1. **Always log critical guard clauses at WARN level or higher** - Debug logs are invisible in production
2. **Verify database state matches code expectations** - AutoMigrate is non-destructive and won't fix missing tables from before the migration code existed
3. **Add database health checks** - Make missing tables visible in /api/v1/health endpoint
4. **Test with persistent databases** - All unit tests use fresh in-memory DBs, hiding this issue
5. **Add migration CLI command** - Allow operators to manually trigger migrations without container restart

## Recommended Action Plan

1. **IMMEDIATE:** Implement Option 2 (add the `migrate` CLI command) and run the migration
2. **SHORT-TERM:** Apply Code Improvements #1 and #2
3. **LONG-TERM:** Add health check endpoint and integration tests with persistent DBs
4. **DOCUMENTATION:** Update deployment docs to mention migration requirement

## Status

- [x] Root cause identified (missing tables due to persistent DB from before migration code)
- [x] Silent exit point found (HasTable check with DEBUG logging)
- [x] Fix options documented
- [ ] Fix implemented
- [ ] Fix verified
- [ ] Code improvements applied
- [ ] Documentation updated