- Added TestMigrateCommand_Succeeds to validate migration functionality.
- Introduced TestStartupVerification_MissingTables to ensure proper handling of missing security tables.
- Updated crowdsec_startup.go to log warnings for missing SecurityConfig table.
- Enhanced documentation for database migrations during upgrades, including steps and expected outputs.
- Created a detailed migration QA report outlining testing results and recommendations.
- Added troubleshooting guidance for CrowdSec not starting after upgrades due to missing tables.
- Established a new plan for addressing CrowdSec reconciliation failures, including root cause analysis and proposed fixes.
CrowdSec Reconciliation Failure Root Cause Analysis
Date: December 15, 2025
Status: CRITICAL - CrowdSec NOT starting despite 7+ commits attempting fixes
Location: backend/internal/services/crowdsec_startup.go
Executive Summary
The CrowdSec reconciliation function starts but exits silently because the security_configs table DOES NOT EXIST in the production database. The table was added to AutoMigrate but the container was never rebuilt/restarted with a fresh database state after the migration code was added.
The Silent Exit Point
Looking at the container logs:
{"bin_path":"crowdsec","data_dir":"/app/data/crowdsec","level":"info","msg":"CrowdSec reconciliation: starting startup check","time":"2025-12-14T20:55:39-05:00"}
Then... NOTHING. The function exits silently.
Why It Exits
In backend/internal/services/crowdsec_startup.go, lines 33-36:
// Check if SecurityConfig table exists and has a record with CrowdSecMode = "local"
if !db.Migrator().HasTable(&models.SecurityConfig{}) {
logger.Log().Debug("CrowdSec reconciliation skipped: SecurityConfig table not found")
return
}
This guard clause triggers because the table doesn't exist, but it logs at DEBUG level, not INFO/WARN/ERROR. Since the container is running in production mode (not debug), this log message is never shown.
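The level filtering described above can be reproduced in isolation. The sketch below uses Go's standard log/slog package as a stand-in for the project's logger (an assumption; the real logger.Log() wrapper and its configuration are not shown in this document) and checks whether the same message is written at Debug versus Warn when the minimum level is Info.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
)

const skipMsg = "CrowdSec reconciliation skipped: SecurityConfig table not found"

// emitted reports whether a record logged at msgLevel is actually written
// when the handler's minimum level is minLevel.
func emitted(minLevel, msgLevel slog.Level) bool {
	var buf bytes.Buffer
	h := slog.NewJSONHandler(&buf, &slog.HandlerOptions{Level: minLevel})
	slog.New(h).Log(context.Background(), msgLevel, skipMsg)
	return buf.Len() > 0
}

// debugVisibleInProduction models the current guard clause: a Debug record
// sent to a logger whose minimum level is Info (assumed production setting).
func debugVisibleInProduction() bool {
	return emitted(slog.LevelInfo, slog.LevelDebug)
}

// warnVisibleInProduction models the same message raised to Warn.
func warnVisibleInProduction() bool {
	return emitted(slog.LevelInfo, slog.LevelWarn)
}

func main() {
	fmt.Println("debug visible at info level:", debugVisibleInProduction())
	fmt.Println("warn visible at info level:", warnVisibleInProduction())
}
```

With an Info-level handler the Debug record is dropped before it reaches the output, which matches the silent exit seen in the container logs.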
Database Evidence
$ sqlite3 data/charon.db ".tables"
access_lists remote_servers
caddy_configs settings
domains ssl_certificates
import_sessions uptime_heartbeats
locations uptime_hosts
proxy_hosts uptime_monitors
notification_providers uptime_notification_events
notifications users
NO security_configs TABLE EXISTS. Yet the code in backend/internal/api/routes/routes.go clearly calls:
if err := db.AutoMigrate(
// ... other models ...
&models.SecurityConfig{},
&models.SecurityDecision{},
&models.SecurityAudit{},
&models.SecurityRuleSet{},
// ...
); err != nil {
return fmt.Errorf("auto migrate: %w", err)
}
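The comparison between the .tables output and the models passed to AutoMigrate can be automated. The following is a minimal sketch, not code from the repository: missingTables is a hypothetical helper that takes the raw sqlite3 output and a list of expected table names (taken from the table names used elsewhere in this document) and reports which ones are absent.

```go
package main

import (
	"fmt"
	"strings"
)

// missingTables returns the entries of required that do not appear in the
// whitespace-separated output of sqlite3's ".tables" command.
func missingTables(tablesOutput string, required []string) []string {
	present := make(map[string]bool)
	for _, name := range strings.Fields(tablesOutput) {
		present[name] = true
	}
	var missing []string
	for _, name := range required {
		if !present[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Abbreviated copy of the ".tables" output from the Database Evidence section.
	output := `access_lists   remote_servers
caddy_configs  settings
proxy_hosts    users`
	required := []string{
		"security_configs",
		"security_decisions",
		"security_audits",
		"security_rule_sets",
	}
	fmt.Println(missingTables(output, required))
}
```

Against the production output shown above, all four security tables would be reported as missing.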
Why AutoMigrate Didn't Create the Tables
Theory 1: Database Persistence Across Rebuilds ✅ MOST LIKELY
The charon.db file is mounted as a volume in the Docker container:
# docker-compose.yml
volumes:
- ./data:/app/data
What happened:
- SecurityConfig model was added to AutoMigrate in recent commits
- Container was rebuilt with docker build -t charon:local .
- Container started with docker compose up -d
- BUT the existing data/charon.db file (from before the migration code existed) was reused
- GORM's AutoMigrate is non-destructive - it only adds new tables if they don't exist
- The tables were never created because the database predates the migration code
Theory 2: AutoMigrate Failed Silently
Looking at the logs, there is NO indication that AutoMigrate failed:
{"level":"info","msg":"starting Charon backend on version dev","time":"2025-12-14T20:55:39-05:00"}
{"bin_path":"crowdsec","data_dir":"/app/data/crowdsec","level":"info","msg":"CrowdSec reconciliation: starting startup check","time":"2025-12-14T20:55:39-05:00"}
{"level":"info","msg":"starting Charon backend on :8080","time":"2025-12-14T20:55:39-05:00"}
If AutoMigrate had failed, we would see an error from routes.Register() because it has:
if err := db.AutoMigrate(...); err != nil {
return fmt.Errorf("auto migrate: %w", err)
}
Since the server started successfully, AutoMigrate either:
- Ran successfully but found the DB already in sync (no new tables to add)
- Never ran because the DB was opened but the tables already existed from a previous run
The Cascading Failures
Because security_configs doesn't exist:
- ✅ Reconciliation exits at lines 33-36 (HasTable check)
- ✅ CrowdSec is never started
- ✅ Frontend shows "CrowdSec is not running" in Console Enrollment
- ✅ Security page toggle is stuck ON (because there's no DB record to persist the state)
- ✅ Log viewer shows "disconnected" (CrowdSec process doesn't exist)
- ✅ All subsequent API calls fail because they expect the table to exist
Why This Wasn't Caught During Development
Looking at the test files, EVERY TEST manually calls AutoMigrate:
// backend/internal/services/crowdsec_startup_test.go:75
err = db.AutoMigrate(&models.SecurityConfig{})
// backend/internal/api/handlers/security_handler_coverage_test.go:25
require.NoError(t, db.AutoMigrate(&models.SecurityConfig{}, ...))
So tests always create the table fresh, hiding the issue that would occur in production with a persistent database.
The Fix
Option 1: Manual Database Migration (IMMEDIATE FIX)
Run this on the production container:
# Connect to running container
docker exec -it charon /bin/sh
# Run migration command (create a new CLI command in main.go)
./backend migrate
# OR manually create tables with sqlite3
sqlite3 /app/data/charon.db << EOF
CREATE TABLE IF NOT EXISTS security_configs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
uuid TEXT UNIQUE NOT NULL,
name TEXT,
enabled BOOLEAN DEFAULT false,
admin_whitelist TEXT,
break_glass_hash TEXT,
crowdsec_mode TEXT DEFAULT 'disabled',
crowdsec_api_url TEXT,
waf_mode TEXT DEFAULT 'disabled',
waf_rules_source TEXT,
waf_learning BOOLEAN DEFAULT false,
waf_paranoia_level INTEGER DEFAULT 1,
waf_exclusions TEXT,
rate_limit_mode TEXT DEFAULT 'disabled',
rate_limit_enable BOOLEAN DEFAULT false,
rate_limit_burst INTEGER DEFAULT 10,
rate_limit_requests INTEGER DEFAULT 100,
rate_limit_window_sec INTEGER DEFAULT 60,
rate_limit_bypass_list TEXT,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS security_decisions (
id INTEGER PRIMARY KEY AUTOINCREMENT,
uuid TEXT UNIQUE NOT NULL,
ip TEXT NOT NULL,
reason TEXT,
action TEXT DEFAULT 'ban',
duration INTEGER,
expires_at DATETIME,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS security_audits (
id INTEGER PRIMARY KEY AUTOINCREMENT,
uuid TEXT UNIQUE NOT NULL,
event_type TEXT,
ip_address TEXT,
details TEXT,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS security_rule_sets (
id INTEGER PRIMARY KEY AUTOINCREMENT,
uuid TEXT UNIQUE NOT NULL,
name TEXT NOT NULL,
type TEXT DEFAULT 'ip_list',
content TEXT,
enabled BOOLEAN DEFAULT true,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS crowdsec_preset_events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
uuid TEXT UNIQUE NOT NULL,
name TEXT NOT NULL,
description TEXT,
enabled BOOLEAN DEFAULT false,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS crowdsec_console_enrollments (
id INTEGER PRIMARY KEY AUTOINCREMENT,
uuid TEXT UNIQUE NOT NULL,
enrollment_key TEXT,
organization_id TEXT,
instance_name TEXT,
enrolled_at DATETIME,
status TEXT DEFAULT 'pending',
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
EOF
# Restart container
exit
docker restart charon
Option 2: Add Migration CLI Command (CLEAN SOLUTION)
Add to backend/cmd/api/main.go:
// Handle CLI commands
if len(os.Args) > 1 {
switch os.Args[1] {
case "migrate":
cfg, err := config.Load()
if err != nil {
log.Fatalf("load config: %v", err)
}
db, err := database.Connect(cfg.DatabasePath)
if err != nil {
log.Fatalf("connect database: %v", err)
}
logger.Log().Info("Running database migrations...")
if err := db.AutoMigrate(
&models.SecurityConfig{},
&models.SecurityDecision{},
&models.SecurityAudit{},
&models.SecurityRuleSet{},
&models.CrowdsecPresetEvent{},
&models.CrowdsecConsoleEnrollment{},
); err != nil {
log.Fatalf("migration failed: %v", err)
}
logger.Log().Info("Migration completed successfully")
return
case "reset-password":
// existing reset-password code
}
}
Then run:
docker exec charon /app/backend migrate
docker restart charon
Option 3: Nuclear Option - Reset Database (DESTRUCTIVE)
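# BACKUP FIRST (create the backups directory in case it does not exist)
docker exec charon mkdir -p /app/data/backups
docker exec charon cp /app/data/charon.db /app/data/backups/charon-pre-security-migration.db
# Stop the container so SQLite is not holding the database files open
docker stop charon
# Remove database (-f because the -shm/-wal files may not exist)
rm -f data/charon.db data/charon.db-shm data/charon.db-wal
# Start container (will recreate fresh DB with all tables)
docker start charon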
Fix Verification Checklist
After applying any fix, verify:
- ✅ Check table exists:
  docker exec charon sqlite3 /app/data/charon.db "SELECT name FROM sqlite_master WHERE type='table' AND name='security_configs';"
  Expected: security_configs
- ✅ Check reconciliation logs:
  docker logs charon 2>&1 | grep -i "crowdsec reconciliation"
  Expected: "starting CrowdSec" or "already running" (NOT "skipped: SecurityConfig table not found")
- ✅ Check CrowdSec is running:
  docker exec charon ps aux | grep crowdsec
  Expected: crowdsec -c /app/data/crowdsec/config/config.yaml
- ✅ Check frontend Console Enrollment:
  - Navigate to /security page
  - Click "Console Enrollment" tab
  - Should show CrowdSec status as "Running"
- ✅ Check toggle state persists:
  - Toggle CrowdSec OFF
  - Refresh page
  - Toggle should remain OFF
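The log check in the list above can also be scripted. The sketch below is illustrative only: classify is a hypothetical helper, and the substrings it matches are assumptions based on the log lines quoted in this document, not a specification of the messages crowdsec_startup.go emits. It reads JSON log lines and reports whether reconciliation was skipped, stopped after the initial check, or progressed further.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// classify scans JSON log lines for "CrowdSec reconciliation" messages and
// returns a coarse outcome. A "skipped" message wins immediately; otherwise
// the last matching message decides the result.
func classify(logs string) string {
	result := "no reconciliation messages"
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var entry struct {
			Msg string `json:"msg"`
		}
		if err := json.Unmarshal(sc.Bytes(), &entry); err != nil {
			continue // ignore lines that are not JSON
		}
		msg := strings.ToLower(entry.Msg)
		if !strings.Contains(msg, "crowdsec reconciliation") {
			continue
		}
		switch {
		case strings.Contains(msg, "skipped"):
			return "skipped"
		case strings.Contains(msg, "starting startup check"):
			result = "started check only"
		default:
			result = "progressed"
		}
	}
	return result
}

func main() {
	// Shortened versions of the log lines captured from the container.
	logs := `{"level":"info","msg":"starting Charon backend on version dev"}
{"level":"info","msg":"CrowdSec reconciliation: starting startup check"}
{"level":"info","msg":"starting Charon backend on :8080"}`
	fmt.Println(classify(logs))
}
```

A result of "started check only" corresponds to the silent exit described earlier: the startup message is present but nothing follows it.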
Code Improvements Needed
1. Change Debug Log to Warning
File: backend/internal/services/crowdsec_startup.go:35
// BEFORE (line 35)
logger.Log().Debug("CrowdSec reconciliation skipped: SecurityConfig table not found")
// AFTER
logger.Log().Warn("CrowdSec reconciliation skipped: SecurityConfig table not found - run migrations")
Rationale: This is NOT a debug-level issue. If the table doesn't exist, it's a critical setup problem that should always be logged, regardless of debug mode.
2. Add Startup Migration Check
File: backend/cmd/api/main.go (after database.Connect())
// Verify critical tables exist before starting server
requiredTables := []interface{}{
&models.SecurityConfig{},
&models.SecurityDecision{},
&models.SecurityAudit{},
&models.SecurityRuleSet{},
}
for _, model := range requiredTables {
if !db.Migrator().HasTable(model) {
logger.Log().Warnf("Missing table for %T - running migration", model)
if err := db.AutoMigrate(model); err != nil {
log.Fatalf("failed to migrate %T: %v", model, err)
}
}
}
3. Add Health Check for Tables
File: backend/internal/api/handlers/health.go
func HealthHandler(c *gin.Context) {
db := c.MustGet("db").(*gorm.DB)
health := gin.H{
"status": "healthy",
"database": "connected",
"migrations": checkMigrations(db),
}
c.JSON(200, health)
}
func checkMigrations(db *gorm.DB) map[string]bool {
return map[string]bool{
"security_configs": db.Migrator().HasTable(&models.SecurityConfig{}),
"security_decisions": db.Migrator().HasTable(&models.SecurityDecision{}),
"security_audits": db.Migrator().HasTable(&models.SecurityAudit{}),
"security_rule_sets": db.Migrator().HasTable(&models.SecurityRuleSet{}),
}
}
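One gap in the handler above is that "status" stays "healthy" even when a table is missing. A possible refinement, sketched here with only the standard library, derives the overall status from the map that checkMigrations returns. The summarize function and the "degraded" label are hypothetical and not part of the existing codebase.

```go
package main

import (
	"fmt"
	"sort"
)

// summarize derives an overall status from a table-presence map such as the
// one returned by checkMigrations: "healthy" only if every table exists.
func summarize(migrations map[string]bool) (status string, missing []string) {
	for table, ok := range migrations {
		if !ok {
			missing = append(missing, table)
		}
	}
	// Map iteration order is not deterministic; sort for stable output.
	sort.Strings(missing)
	if len(missing) > 0 {
		return "degraded", missing
	}
	return "healthy", nil
}

func main() {
	status, missing := summarize(map[string]bool{
		"security_configs":   false,
		"security_decisions": false,
		"security_audits":    true,
		"security_rule_sets": true,
	})
	fmt.Println(status, missing)
}
```

In the handler, the returned status could replace the hard-coded "healthy" value so that monitoring picks up a missing table without inspecting the nested map.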
Related Issues
- Frontend toggle stuck in ON position → Database issue (no table to persist state)
- Console Enrollment says "not running" → CrowdSec never started (reconciliation exits)
- Log viewer disconnected → CrowdSec process doesn't exist
- All 7 previous commits failed because they addressed symptoms, not the root cause
Lessons Learned
- Always log critical guard clauses at WARN level or higher - Debug logs are invisible in production
- Verify database state matches code expectations - AutoMigrate is non-destructive, and a database file that predates the migration code can still be missing tables the code expects
- Add database health checks - Make missing tables visible in /api/v1/health endpoint
- Test with persistent databases - All unit tests use fresh in-memory DBs, hiding this issue
- Add migration CLI command - Allow operators to manually trigger migrations without container restart
Recommended Action Plan
- IMMEDIATE: Run Option 2 (Add migrate CLI command) and execute migration
- SHORT-TERM: Apply Code Improvements #1 and #2
- LONG-TERM: Add health check endpoint and integration tests with persistent DBs
- DOCUMENTATION: Update deployment docs to mention migration requirement
Status
- Root cause identified (missing tables due to persistent DB from before migration code)
- Silent exit point found (HasTable check with DEBUG logging)
- Fix options documented
- Fix implemented
- Fix verified
- Code improvements applied
- Documentation updated