Charon/docs/plans/crowdsec_reconciliation_failure.md

CrowdSec Reconciliation Failure Root Cause Analysis

Date: December 15, 2025
Status: CRITICAL - CrowdSec NOT starting despite 7+ commits attempting fixes
Location: backend/internal/services/crowdsec_startup.go

Executive Summary

The CrowdSec reconciliation function starts but exits silently because the security_configs table DOES NOT EXIST in the production database. The table was added to AutoMigrate but the container was never rebuilt/restarted with a fresh database state after the migration code was added.

The Silent Exit Point

Looking at the container logs:

{"bin_path":"crowdsec","data_dir":"/app/data/crowdsec","level":"info","msg":"CrowdSec reconciliation: starting startup check","time":"2025-12-14T20:55:39-05:00"}

Then... NOTHING. The function exits silently.

Why It Exits

In backend/internal/services/crowdsec_startup.go, lines 33-36:

// Check if SecurityConfig table exists and has a record with CrowdSecMode = "local"
if !db.Migrator().HasTable(&models.SecurityConfig{}) {
    logger.Log().Debug("CrowdSec reconciliation skipped: SecurityConfig table not found")
    return
}

This guard clause triggers because the table doesn't exist, but it logs at DEBUG level, not INFO/WARN/ERROR. Since the container is running in production mode (not debug), this log message is never shown.

Database Evidence

$ sqlite3 data/charon.db ".tables"
access_lists                remote_servers
caddy_configs               settings
domains                     ssl_certificates
import_sessions             uptime_heartbeats
locations                   uptime_hosts
proxy_hosts                 uptime_monitors
notification_providers      uptime_notification_events
notifications               users

NO security_configs TABLE EXISTS. Yet the code in backend/internal/api/routes/routes.go clearly calls:

if err := db.AutoMigrate(
    // ... other models ...
    &models.SecurityConfig{},
    &models.SecurityDecision{},
    &models.SecurityAudit{},
    &models.SecurityRuleSet{},
    // ...
); err != nil {
    return fmt.Errorf("auto migrate: %w", err)
}

Why AutoMigrate Didn't Create the Tables

Theory 1: Database Persistence Across Rebuilds (MOST LIKELY)

The charon.db file is mounted as a volume in the Docker container:

# docker-compose.yml
volumes:
  - ./data:/app/data

What happened:

  1. SecurityConfig model was added to AutoMigrate in recent commits
  2. Container was rebuilt with docker build -t charon:local .
  3. Container started with docker compose up -d
  4. BUT the existing data/charon.db file (from before the migration code existed) was reused
  5. GORM's AutoMigrate is non-destructive - it only adds new tables if they don't exist
  6. The tables were never created because the database predates the migration code

Theory 2: AutoMigrate Failed Silently

Looking at the logs, there is NO indication that AutoMigrate failed:

{"level":"info","msg":"starting Charon backend on version dev","time":"2025-12-14T20:55:39-05:00"}
{"bin_path":"crowdsec","data_dir":"/app/data/crowdsec","level":"info","msg":"CrowdSec reconciliation: starting startup check","time":"2025-12-14T20:55:39-05:00"}
{"level":"info","msg":"starting Charon backend on :8080","time":"2025-12-14T20:55:39-05:00"}

If AutoMigrate had failed, we would see an error from routes.Register() because it has:

if err := db.AutoMigrate(...); err != nil {
    return fmt.Errorf("auto migrate: %w", err)
}

Since the server started successfully, AutoMigrate either:

  • Ran successfully but found the DB already in sync (no new tables to add)
  • Never ran because the DB was opened but the tables already existed from a previous run

The Cascading Failures

Because security_configs doesn't exist:

  1. Reconciliation exits at lines 33-36 (HasTable check)
  2. CrowdSec is never started
  3. Frontend shows "CrowdSec is not running" in Console Enrollment
  4. Security page toggle is stuck ON (because there's no DB record to persist the state)
  5. Log viewer shows "disconnected" (CrowdSec process doesn't exist)
  6. All subsequent API calls fail because they expect the table to exist

Why This Wasn't Caught During Development

Looking at the test files, EVERY TEST manually calls AutoMigrate:

// backend/internal/services/crowdsec_startup_test.go:75
err = db.AutoMigrate(&models.SecurityConfig{})

// backend/internal/api/handlers/security_handler_coverage_test.go:25
require.NoError(t, db.AutoMigrate(&models.SecurityConfig{}, ...))

So tests always create the table fresh, hiding the issue that would occur in production with a persistent database.

The Fix

Option 1: Manual Database Migration (IMMEDIATE FIX)

Run this on the production container:

# Connect to running container
docker exec -it charon /bin/sh

# Run migration command (create a new CLI command in main.go)
./backend migrate

# OR manually create tables with sqlite3
sqlite3 /app/data/charon.db << EOF
CREATE TABLE IF NOT EXISTS security_configs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    name TEXT,
    enabled BOOLEAN DEFAULT false,
    admin_whitelist TEXT,
    break_glass_hash TEXT,
    crowdsec_mode TEXT DEFAULT 'disabled',
    crowdsec_api_url TEXT,
    waf_mode TEXT DEFAULT 'disabled',
    waf_rules_source TEXT,
    waf_learning BOOLEAN DEFAULT false,
    waf_paranoia_level INTEGER DEFAULT 1,
    waf_exclusions TEXT,
    rate_limit_mode TEXT DEFAULT 'disabled',
    rate_limit_enable BOOLEAN DEFAULT false,
    rate_limit_burst INTEGER DEFAULT 10,
    rate_limit_requests INTEGER DEFAULT 100,
    rate_limit_window_sec INTEGER DEFAULT 60,
    rate_limit_bypass_list TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS security_decisions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    ip TEXT NOT NULL,
    reason TEXT,
    action TEXT DEFAULT 'ban',
    duration INTEGER,
    expires_at DATETIME,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS security_audits (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    event_type TEXT,
    ip_address TEXT,
    details TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS security_rule_sets (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    name TEXT NOT NULL,
    type TEXT DEFAULT 'ip_list',
    content TEXT,
    enabled BOOLEAN DEFAULT true,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS crowdsec_preset_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    name TEXT NOT NULL,
    description TEXT,
    enabled BOOLEAN DEFAULT false,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS crowdsec_console_enrollments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    enrollment_key TEXT,
    organization_id TEXT,
    instance_name TEXT,
    enrolled_at DATETIME,
    status TEXT DEFAULT 'pending',
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
EOF

# Restart container
exit
docker restart charon

Option 2: Add Migration CLI Command (CLEAN SOLUTION)

Add to backend/cmd/api/main.go:

// Handle CLI commands
if len(os.Args) > 1 {
    switch os.Args[1] {
    case "migrate":
        cfg, err := config.Load()
        if err != nil {
            log.Fatalf("load config: %v", err)
        }

        db, err := database.Connect(cfg.DatabasePath)
        if err != nil {
            log.Fatalf("connect database: %v", err)
        }

        logger.Log().Info("Running database migrations...")
        if err := db.AutoMigrate(
            &models.SecurityConfig{},
            &models.SecurityDecision{},
            &models.SecurityAudit{},
            &models.SecurityRuleSet{},
            &models.CrowdsecPresetEvent{},
            &models.CrowdsecConsoleEnrollment{},
        ); err != nil {
            log.Fatalf("migration failed: %v", err)
        }

        logger.Log().Info("Migration completed successfully")
        return

    case "reset-password":
        // existing reset-password code
    }
}

Then run:

docker exec charon /app/backend migrate
docker restart charon
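The switch above has no default branch, so a mistyped subcommand would fall through and start the server. One possible way to make the dispatch explicit and testable is sketched below using only the standard library. The command type, the dispatch function, and the placeholder handlers are all hypothetical; the real migrate and reset-password logic would be wired in by the application.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// command is a hypothetical handler type for one CLI subcommand.
type command func() error

// dispatch runs the subcommand named by args[1], if any. It returns
// handled=false when no subcommand was given, so the caller can fall
// through to starting the server. An unknown subcommand is reported as
// an error instead of being silently ignored.
func dispatch(args []string, commands map[string]command) (handled bool, err error) {
	if len(args) < 2 {
		return false, nil
	}
	cmd, ok := commands[args[1]]
	if !ok {
		return true, fmt.Errorf("unknown command %q", args[1])
	}
	return true, cmd()
}

func main() {
	commands := map[string]command{
		"migrate": func() error {
			fmt.Println("running database migrations (placeholder)")
			return nil
		},
		"reset-password": func() error {
			return errors.New("not implemented in this sketch")
		},
	}

	handled, err := dispatch(os.Args, commands)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if !handled {
		fmt.Println("no subcommand given; the server would start here")
	}
}
```

Rejecting unknown subcommands is a design choice of this sketch. If the binary also accepts flags as its first argument, the check would need to skip those.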

Option 3: Nuclear Option - Reset Database (DESTRUCTIVE)

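# BACKUP FIRST (create the backups directory if it is missing)
docker exec charon mkdir -p /app/data/backups
docker exec charon cp /app/data/charon.db /app/data/backups/charon-pre-security-migration.db

# Stop the container so the database files are not in use
docker stop charon

# Remove database (-f tolerates missing -shm/-wal files)
rm -f data/charon.db data/charon.db-shm data/charon.db-wal

# Start container (will recreate fresh DB with all tables)
docker start charon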

Fix Verification Checklist

After applying any fix, verify:

  1. Check table exists:

    docker exec charon sqlite3 /app/data/charon.db "SELECT name FROM sqlite_master WHERE type='table' AND name='security_configs';"
    

    Expected: security_configs

  2. Check reconciliation logs:

    docker logs charon 2>&1 | grep -i "crowdsec reconciliation"
    

    Expected: "starting CrowdSec" or "already running" (NOT "skipped: SecurityConfig table not found")

  3. Check CrowdSec is running:

    docker exec charon ps aux | grep crowdsec
    

    Expected: crowdsec -c /app/data/crowdsec/config/config.yaml

  4. Check frontend Console Enrollment:

    • Navigate to /security page
    • Click "Console Enrollment" tab
    • Should show CrowdSec status as "Running"

  5. Check toggle state persists:

    • Toggle CrowdSec OFF
    • Refresh page
    • Toggle should remain OFF
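
Step 2 of the checklist could be automated rather than eyeballed. The following is a minimal sketch that scans JSON log lines, such as the output of docker logs, for reconciliation messages and flags the missing-table exit. It assumes each line carries "level" and "msg" fields as in the log excerpts in this document; the function names and the sample WARN line are hypothetical.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry holds the two fields of interest from a JSON log line.
type logEntry struct {
	Level string `json:"level"`
	Msg   string `json:"msg"`
}

// reconciliationMessages returns the msg field of every JSON log line that
// mentions "CrowdSec reconciliation" (case-insensitive). Lines that are not
// valid JSON are skipped.
func reconciliationMessages(logs string) []string {
	var msgs []string
	scanner := bufio.NewScanner(strings.NewReader(logs))
	for scanner.Scan() {
		var entry logEntry
		if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
			continue
		}
		if strings.Contains(strings.ToLower(entry.Msg), "crowdsec reconciliation") {
			msgs = append(msgs, entry.Msg)
		}
	}
	return msgs
}

// skippedForMissingTable reports whether any message is the guard-clause
// exit caused by the missing SecurityConfig table.
func skippedForMissingTable(msgs []string) bool {
	for _, m := range msgs {
		if strings.Contains(m, "SecurityConfig table not found") {
			return true
		}
	}
	return false
}

func main() {
	// Sample input: the first line is from the container logs above; the
	// second is a hypothetical line as it might look after the WARN change.
	logs := `{"level":"info","msg":"CrowdSec reconciliation: starting startup check"}
{"level":"warning","msg":"CrowdSec reconciliation skipped: SecurityConfig table not found - run migrations"}`

	msgs := reconciliationMessages(logs)
	fmt.Println("reconciliation messages:", len(msgs))
	fmt.Println("skipped for missing table:", skippedForMissingTable(msgs))
}
```

Note that this check only works once the guard clause logs at WARN or higher. With the current DEBUG logging, the skip message never reaches the production logs.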

Code Improvements Needed

1. Change Debug Log to Warning

File: backend/internal/services/crowdsec_startup.go:35

// BEFORE (line 35)
logger.Log().Debug("CrowdSec reconciliation skipped: SecurityConfig table not found")

// AFTER
logger.Log().Warn("CrowdSec reconciliation skipped: SecurityConfig table not found - run migrations")

Rationale: This is NOT a debug-level issue. If the table doesn't exist, it's a critical setup problem that should always be logged, regardless of debug mode.

2. Add Startup Migration Check

File: backend/cmd/api/main.go (after database.Connect())

// Verify critical tables exist before starting server
requiredTables := []interface{}{
    &models.SecurityConfig{},
    &models.SecurityDecision{},
    &models.SecurityAudit{},
    &models.SecurityRuleSet{},
}

for _, model := range requiredTables {
    if !db.Migrator().HasTable(model) {
        logger.Log().Warnf("Missing table for %T - running migration", model)
        if err := db.AutoMigrate(model); err != nil {
            log.Fatalf("failed to migrate %T: %v", model, err)
        }
    }
}
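
Because this check runs at startup against a live database, it is hard to unit-test as written. One option is to express it against a narrow interface so that an in-memory fake can exercise it. The sketch below takes that approach; tableMigrator, ensureTables, and fakeMigrator are hypothetical names, and the interface works on table names rather than GORM model values to keep the example free of external dependencies.

```go
package main

import "fmt"

// tableMigrator is a hypothetical narrow interface covering the two
// operations the startup check needs. A thin adapter around the real
// database handle would implement it in the application.
type tableMigrator interface {
	HasTable(name string) bool
	CreateTable(name string) error
}

// ensureTables creates every required table that is missing and returns the
// names of the tables it created. It stops at the first failure and returns
// whatever was created up to that point.
func ensureTables(m tableMigrator, required []string) ([]string, error) {
	var created []string
	for _, name := range required {
		if m.HasTable(name) {
			continue
		}
		if err := m.CreateTable(name); err != nil {
			return created, fmt.Errorf("migrate %s: %w", name, err)
		}
		created = append(created, name)
	}
	return created, nil
}

// fakeMigrator is an in-memory stand-in used only to exercise the sketch.
type fakeMigrator map[string]bool

func (f fakeMigrator) HasTable(name string) bool     { return f[name] }
func (f fakeMigrator) CreateTable(name string) error { f[name] = true; return nil }

func main() {
	// Simulates a database that predates the security tables.
	db := fakeMigrator{"users": true, "settings": true}
	created, err := ensureTables(db, []string{"users", "security_configs", "security_decisions"})
	fmt.Println(created, err) // [security_configs security_decisions] <nil>
}
```

Returning the list of created tables lets the caller log a single WARN line naming everything that was missing, instead of one line per table.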

3. Add Health Check for Tables

File: backend/internal/api/handlers/health.go

func HealthHandler(c *gin.Context) {
    db := c.MustGet("db").(*gorm.DB)

    health := gin.H{
        "status": "healthy",
        "database": "connected",
        "migrations": checkMigrations(db),
    }

    c.JSON(200, health)
}

func checkMigrations(db *gorm.DB) map[string]bool {
    return map[string]bool{
        "security_configs": db.Migrator().HasTable(&models.SecurityConfig{}),
        "security_decisions": db.Migrator().HasTable(&models.SecurityDecision{}),
        "security_audits": db.Migrator().HasTable(&models.SecurityAudit{}),
        "security_rule_sets": db.Migrator().HasTable(&models.SecurityRuleSet{}),
    }
}

How the symptoms map to the root cause:

  • Frontend toggle stuck in ON position → Database issue (no table to persist state)
  • Console Enrollment says "not running" → CrowdSec never started (reconciliation exits)
  • Log viewer disconnected → CrowdSec process doesn't exist
  • All 7 previous commits failed because they addressed symptoms, not the root cause

Lessons Learned

  1. Always log critical guard clauses at WARN level or higher - Debug logs are invisible in production
  2. Verify database state matches code expectations - AutoMigrate is non-destructive and won't fix missing tables from before the migration code existed
  3. Add database health checks - Make missing tables visible in /api/v1/health endpoint
  4. Test with persistent databases - All unit tests use fresh in-memory DBs, hiding this issue
  5. Add migration CLI command - Allow operators to manually trigger migrations without container restart

Recommended Actions

  1. IMMEDIATE: Run Option 2 (Add migrate CLI command) and execute migration
  2. SHORT-TERM: Apply Code Improvements #1 and #2
  3. LONG-TERM: Add health check endpoint and integration tests with persistent DBs
  4. DOCUMENTATION: Update deployment docs to mention migration requirement

Status

  • Root cause identified (missing tables due to persistent DB from before migration code)
  • Silent exit point found (HasTable check with DEBUG logging)
  • Fix options documented
  • Fix implemented
  • Fix verified
  • Code improvements applied
  • Documentation updated