chore: clean .gitignore cache
@@ -1,426 +0,0 @@
# CrowdSec Reconciliation Failure Root Cause Analysis

**Date:** December 15, 2025
**Status:** CRITICAL - CrowdSec NOT starting despite 7+ commits attempting fixes
**Location:** `backend/internal/services/crowdsec_startup.go`

## Executive Summary

**The CrowdSec reconciliation function starts but exits silently** because the `security_configs` table **DOES NOT EXIST** in the production database. The table was added to AutoMigrate, but the container was never rebuilt/restarted with a fresh database state after the migration code was added.

## The Silent Exit Point

Looking at the container logs:

```
{"bin_path":"crowdsec","data_dir":"/app/data/crowdsec","level":"info","msg":"CrowdSec reconciliation: starting startup check","time":"2025-12-14T20:55:39-05:00"}
```

Then... NOTHING. The function exits silently.

### Why It Exits

In `backend/internal/services/crowdsec_startup.go`, lines 33-36:

```go
// Check if SecurityConfig table exists and has a record with CrowdSecMode = "local"
if !db.Migrator().HasTable(&models.SecurityConfig{}) {
	logger.Log().Debug("CrowdSec reconciliation skipped: SecurityConfig table not found")
	return
}
```

**This guard clause triggers because the table doesn't exist**, but it logs at **DEBUG** level, not INFO/WARN/ERROR. Since the container is running in production mode (not debug), this log message is never shown.
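The effect of the log level can be reproduced in isolation. The following is a minimal sketch, assuming a logger with a minimum level of INFO in production; it uses the standard library's `log/slog` as a stand-in because the project's own `logger` package is not shown here, and the function name `emitted` and the message texts are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"strings"
)

// emitted logs the same guard-clause message once at Debug and once at Warn,
// then reports which of the two actually reached the output.
// debugMode=false mimics a production logger that drops Debug records
// (an assumption about how the real logger is configured).
func emitted(debugMode bool) (debugShown, warnShown bool) {
	level := slog.LevelInfo
	if debugMode {
		level = slog.LevelDebug
	}

	var buf bytes.Buffer
	log := slog.New(slog.NewJSONHandler(&buf, &slog.HandlerOptions{Level: level}))

	log.Debug("reconciliation skipped: table not found (debug)")
	log.Warn("reconciliation skipped: table not found (warn)")

	out := buf.String()
	return strings.Contains(out, "(debug)"), strings.Contains(out, "(warn)")
}

func main() {
	d, w := emitted(false)
	fmt.Printf("production: debug shown=%v, warn shown=%v\n", d, w)
}
```

With `debugMode` set to `false`, only the Warn record is written, which matches the behaviour described above: the early return happens, but nothing about it appears in the container logs.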

### Database Evidence

```bash
$ sqlite3 data/charon.db ".tables"
access_lists            remote_servers
caddy_configs           settings
domains                 ssl_certificates
import_sessions         uptime_heartbeats
locations               uptime_hosts
proxy_hosts             uptime_monitors
notification_providers  uptime_notification_events
notifications           users
```

**NO `security_configs` TABLE EXISTS.** Yet the code in `backend/internal/api/routes/routes.go` clearly calls:

```go
if err := db.AutoMigrate(
	// ... other models ...
	&models.SecurityConfig{},
	&models.SecurityDecision{},
	&models.SecurityAudit{},
	&models.SecurityRuleSet{},
	// ...
); err != nil {
	return fmt.Errorf("auto migrate: %w", err)
}
```
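The gap between what the code expects and what the database contains is a plain set difference. The sketch below is illustrative only: `missingTables` is a hypothetical helper, the `existing` list is copied from the `.tables` output above, and the `required` names are the table names used elsewhere in this document.

```go
package main

import "fmt"

// missingTables returns the names in required that do not appear in existing,
// preserving the order of required.
func missingTables(existing, required []string) []string {
	have := make(map[string]bool, len(existing))
	for _, name := range existing {
		have[name] = true
	}

	var missing []string
	for _, name := range required {
		if !have[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Copied from the `.tables` output shown in this document.
	existing := []string{
		"access_lists", "caddy_configs", "domains", "import_sessions",
		"locations", "proxy_hosts", "notification_providers", "notifications",
		"remote_servers", "settings", "ssl_certificates", "uptime_heartbeats",
		"uptime_hosts", "uptime_monitors", "uptime_notification_events", "users",
	}
	// Tables the security features rely on.
	required := []string{
		"security_configs", "security_decisions",
		"security_audits", "security_rule_sets",
	}

	fmt.Println(missingTables(existing, required))
}
```

In the real application the `existing` list would come from the database itself (for example via GORM's migrator), which is not shown here.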

## Why AutoMigrate Didn't Create the Tables

### Theory 1: Database Persistence Across Rebuilds ✅ MOST LIKELY

The `charon.db` file is mounted as a volume in the Docker container:

```yaml
# docker-compose.yml
volumes:
  - ./data:/app/data
```

**What happened:**

1. SecurityConfig model was added to AutoMigrate in recent commits
2. Container was rebuilt with `docker build -t charon:local .`
3. Container started with `docker compose up -d`
4. **BUT** the existing `data/charon.db` file (from before the migration code existed) was reused
5. GORM's AutoMigrate is **non-destructive** - it creates tables that don't exist and leaves existing tables and data in place
6. The tables are still missing from this database, which predates the migration code, so the new AutoMigrate call has evidently not run to completion against it

### Theory 2: AutoMigrate Failed Silently

Looking at the logs, there is **NO** indication that AutoMigrate failed:

```
{"level":"info","msg":"starting Charon backend on version dev","time":"2025-12-14T20:55:39-05:00"}
{"bin_path":"crowdsec","data_dir":"/app/data/crowdsec","level":"info","msg":"CrowdSec reconciliation: starting startup check","time":"2025-12-14T20:55:39-05:00"}
{"level":"info","msg":"starting Charon backend on :8080","time":"2025-12-14T20:55:39-05:00"}
```

If AutoMigrate had failed, we would see an error from `routes.Register()` because it has:

```go
if err := db.AutoMigrate(...); err != nil {
	return fmt.Errorf("auto migrate: %w", err)
}
```

Since the server started successfully, AutoMigrate did not return an error. It either:

- Ran without error but did not create the security tables (it treated the DB as already in sync)
- Never ran against this database file at all

## The Cascading Failures

Because `security_configs` doesn't exist:

1. ✅ Reconciliation exits at lines 33-36 (HasTable check)
2. ✅ CrowdSec is never started
3. ✅ Frontend shows "CrowdSec is not running" in Console Enrollment
4. ✅ Security page toggle is stuck ON (because there's no DB record to persist the state)
5. ✅ Log viewer shows "disconnected" (CrowdSec process doesn't exist)
6. ✅ All subsequent API calls that touch the table fail because they expect it to exist

## Why This Wasn't Caught During Development

Looking at the test files, **EVERY TEST** manually calls AutoMigrate:

```go
// backend/internal/services/crowdsec_startup_test.go:75
err = db.AutoMigrate(&models.SecurityConfig{})

// backend/internal/api/handlers/security_handler_coverage_test.go:25
require.NoError(t, db.AutoMigrate(&models.SecurityConfig{}, ...))
```

So tests **always create the table fresh**, hiding the issue that would occur in production with a persistent database.

## The Fix

### Option 1: Manual Database Migration (IMMEDIATE FIX)

Run this on the production container:

```bash
# Connect to running container
docker exec -it charon /bin/sh

# Run migration command (requires the CLI command from Option 2)
./backend migrate

# OR manually create tables with sqlite3
sqlite3 /app/data/charon.db << EOF
CREATE TABLE IF NOT EXISTS security_configs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    name TEXT,
    enabled BOOLEAN DEFAULT false,
    admin_whitelist TEXT,
    break_glass_hash TEXT,
    crowdsec_mode TEXT DEFAULT 'disabled',
    crowdsec_api_url TEXT,
    waf_mode TEXT DEFAULT 'disabled',
    waf_rules_source TEXT,
    waf_learning BOOLEAN DEFAULT false,
    waf_paranoia_level INTEGER DEFAULT 1,
    waf_exclusions TEXT,
    rate_limit_mode TEXT DEFAULT 'disabled',
    rate_limit_enable BOOLEAN DEFAULT false,
    rate_limit_burst INTEGER DEFAULT 10,
    rate_limit_requests INTEGER DEFAULT 100,
    rate_limit_window_sec INTEGER DEFAULT 60,
    rate_limit_bypass_list TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS security_decisions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    ip TEXT NOT NULL,
    reason TEXT,
    action TEXT DEFAULT 'ban',
    duration INTEGER,
    expires_at DATETIME,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS security_audits (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    event_type TEXT,
    ip_address TEXT,
    details TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS security_rule_sets (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    name TEXT NOT NULL,
    type TEXT DEFAULT 'ip_list',
    content TEXT,
    enabled BOOLEAN DEFAULT true,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS crowdsec_preset_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    name TEXT NOT NULL,
    description TEXT,
    enabled BOOLEAN DEFAULT false,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS crowdsec_console_enrollments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    enrollment_key TEXT,
    organization_id TEXT,
    instance_name TEXT,
    enrolled_at DATETIME,
    status TEXT DEFAULT 'pending',
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
EOF

# Restart container
exit
docker restart charon
```

### Option 2: Add Migration CLI Command (CLEAN SOLUTION)

Add to `backend/cmd/api/main.go`:

```go
// Handle CLI commands
if len(os.Args) > 1 {
	switch os.Args[1] {
	case "migrate":
		cfg, err := config.Load()
		if err != nil {
			log.Fatalf("load config: %v", err)
		}

		db, err := database.Connect(cfg.DatabasePath)
		if err != nil {
			log.Fatalf("connect database: %v", err)
		}

		logger.Log().Info("Running database migrations...")
		if err := db.AutoMigrate(
			&models.SecurityConfig{},
			&models.SecurityDecision{},
			&models.SecurityAudit{},
			&models.SecurityRuleSet{},
			&models.CrowdsecPresetEvent{},
			&models.CrowdsecConsoleEnrollment{},
		); err != nil {
			log.Fatalf("migration failed: %v", err)
		}

		logger.Log().Info("Migration completed successfully")
		return

	case "reset-password":
		// existing reset-password code
	}
}
```

Then run:

```bash
docker exec charon /app/backend migrate
docker restart charon
```
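One detail worth considering with the `switch` above: it has no `default` case, so a mistyped subcommand would fall through to normal server startup. The sketch below shows one possible way to make the dispatch explicit and testable. It is an assumption about how this could be structured, not the project's actual code; `resolveCommand` and the `"serve"` action name are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// resolveCommand maps process arguments to an action name.
// With no subcommand it returns "serve" (normal server startup).
// Unknown subcommands are rejected instead of silently starting the server.
func resolveCommand(args []string) (string, error) {
	if len(args) < 2 {
		return "serve", nil
	}
	switch args[1] {
	case "migrate", "reset-password":
		return args[1], nil
	default:
		return "", fmt.Errorf("unknown command %q", args[1])
	}
}

func main() {
	action, err := resolveCommand(os.Args)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	// A real main would call the migration, password reset, or server here.
	fmt.Println("would run:", action)
}
```

Keeping the argument parsing in a pure function makes it easy to unit test without touching the database.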

### Option 3: Nuclear Option - Reset Database (DESTRUCTIVE)

```bash
# Stop the container first so SQLite is not holding the database files open
docker stop charon

# BACKUP FIRST (host path, since ./data is mounted at /app/data)
mkdir -p data/backups
cp data/charon.db data/backups/charon-pre-security-migration.db

# Remove database
rm -f data/charon.db data/charon.db-shm data/charon.db-wal

# Start container (will recreate fresh DB with all tables)
docker start charon
```

## Fix Verification Checklist

After applying any fix, verify:

1. ✅ Check table exists:

   ```bash
   docker exec charon sqlite3 /app/data/charon.db "SELECT name FROM sqlite_master WHERE type='table' AND name='security_configs';"
   ```

   Expected: `security_configs`

2. ✅ Check reconciliation logs:

   ```bash
   docker logs charon 2>&1 | grep -i "crowdsec reconciliation"
   ```

   Expected: "starting CrowdSec" or "already running" (NOT "skipped: SecurityConfig table not found")

3. ✅ Check CrowdSec is running:

   ```bash
   docker exec charon ps aux | grep crowdsec
   ```

   Expected: `crowdsec -c /app/data/crowdsec/config/config.yaml`

4. ✅ Check frontend Console Enrollment:
   - Navigate to `/security` page
   - Click "Console Enrollment" tab
   - Should show CrowdSec status as "Running"

5. ✅ Check toggle state persists:
   - Toggle CrowdSec OFF
   - Refresh page
   - Toggle should remain OFF
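The log check in step 2 can also be scripted, for example inside an integration test. The following is a rough sketch under the assumption that every log line is a JSON object with `level` and `msg` fields, as in the container logs quoted in this document; `logEntry` and `reconciliationEntries` are hypothetical names.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry holds the two fields of a structured log line that matter here.
type logEntry struct {
	Level string `json:"level"`
	Msg   string `json:"msg"`
}

// reconciliationEntries returns every log entry whose message starts with
// "CrowdSec reconciliation". Lines that are not valid JSON are skipped.
func reconciliationEntries(logs string) []logEntry {
	var out []logEntry
	scanner := bufio.NewScanner(strings.NewReader(logs))
	for scanner.Scan() {
		var e logEntry
		if err := json.Unmarshal(scanner.Bytes(), &e); err != nil {
			continue
		}
		if strings.HasPrefix(e.Msg, "CrowdSec reconciliation") {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	// Log lines copied from the startup logs quoted in this document.
	logs := `{"level":"info","msg":"starting Charon backend on version dev","time":"2025-12-14T20:55:39-05:00"}
{"bin_path":"crowdsec","data_dir":"/app/data/crowdsec","level":"info","msg":"CrowdSec reconciliation: starting startup check","time":"2025-12-14T20:55:39-05:00"}
{"level":"info","msg":"starting Charon backend on :8080","time":"2025-12-14T20:55:39-05:00"}`

	for _, e := range reconciliationEntries(logs) {
		fmt.Printf("[%s] %s\n", e.Level, e.Msg)
	}
}
```

Against the failing container's logs this yields only the "starting startup check" entry; after a successful fix, a follow-up reconciliation message should appear as well.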

## Code Improvements Needed

### 1. Change Debug Log to Warning

**File:** `backend/internal/services/crowdsec_startup.go:35`

```go
// BEFORE (line 35)
logger.Log().Debug("CrowdSec reconciliation skipped: SecurityConfig table not found")

// AFTER
logger.Log().Warn("CrowdSec reconciliation skipped: SecurityConfig table not found - run migrations")
```

**Rationale:** This is NOT a debug-level issue. If the table doesn't exist, it's a critical setup problem that should always be logged, regardless of debug mode.

### 2. Add Startup Migration Check

**File:** `backend/cmd/api/main.go` (after `database.Connect()`)

```go
// Verify critical tables exist before starting server
requiredTables := []interface{}{
	&models.SecurityConfig{},
	&models.SecurityDecision{},
	&models.SecurityAudit{},
	&models.SecurityRuleSet{},
}

for _, model := range requiredTables {
	if !db.Migrator().HasTable(model) {
		logger.Log().Warnf("Missing table for %T - running migration", model)
		if err := db.AutoMigrate(model); err != nil {
			log.Fatalf("failed to migrate %T: %v", model, err)
		}
	}
}
```

### 3. Add Health Check for Tables

**File:** `backend/internal/api/handlers/health.go`

```go
func HealthHandler(c *gin.Context) {
	db := c.MustGet("db").(*gorm.DB)

	health := gin.H{
		"status":     "healthy",
		"database":   "connected",
		"migrations": checkMigrations(db),
	}

	c.JSON(200, health)
}

func checkMigrations(db *gorm.DB) map[string]bool {
	return map[string]bool{
		"security_configs":   db.Migrator().HasTable(&models.SecurityConfig{}),
		"security_decisions": db.Migrator().HasTable(&models.SecurityDecision{}),
		"security_audits":    db.Migrator().HasTable(&models.SecurityAudit{}),
		"security_rule_sets": db.Migrator().HasTable(&models.SecurityRuleSet{}),
	}
}
```
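As written, the handler above reports `"status": "healthy"` even when a table is missing. One possible refinement is to derive the overall status from the per-table results. The sketch below assumes a `"degraded"` label and a helper named `overallStatus`, both of which are hypothetical and not part of the existing code.

```go
package main

import "fmt"

// overallStatus returns "healthy" only if every table check passed,
// and "degraded" (an assumed label) as soon as one table is missing.
func overallStatus(tables map[string]bool) string {
	for _, present := range tables {
		if !present {
			return "degraded"
		}
	}
	return "healthy"
}

func main() {
	// Example input shaped like the map returned by checkMigrations.
	checks := map[string]bool{
		"security_configs":   false,
		"security_decisions": true,
	}
	fmt.Println(overallStatus(checks))
}
```

The handler could then set `"status"` to the result of `overallStatus(checkMigrations(db))`, so that monitoring picks up a missing table without having to inspect the `migrations` map.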

## Related Issues

- Frontend toggle stuck in ON position → Database issue (no table to persist state)
- Console Enrollment says "not running" → CrowdSec never started (reconciliation exits)
- Log viewer disconnected → CrowdSec process doesn't exist
- All 7 previous commits failed because they addressed symptoms, not the root cause

## Lessons Learned

1. **Always log critical guard clauses at WARN level or higher** - Debug logs are invisible in production
2. **Verify database state matches code expectations** - AutoMigrate is non-destructive, and a database that predates the migration code can still be missing tables the code expects
3. **Add database health checks** - Make missing tables visible in the `/api/v1/health` endpoint
4. **Test with persistent databases** - All unit tests use fresh in-memory DBs, hiding this issue
5. **Add migration CLI command** - Allow operators to trigger migrations manually against an existing database

## Recommended Action Plan

1. **IMMEDIATE:** Implement Option 2 (add the `migrate` CLI command) and run the migration
2. **SHORT-TERM:** Apply Code Improvements #1 and #2
3. **LONG-TERM:** Add health check endpoint and integration tests with persistent DBs
4. **DOCUMENTATION:** Update deployment docs to mention migration requirement

## Status

- [x] Root cause identified (missing tables due to persistent DB from before migration code)
- [x] Silent exit point found (HasTable check with DEBUG logging)
- [x] Fix options documented
- [ ] Fix implemented
- [ ] Fix verified
- [ ] Code improvements applied
- [ ] Documentation updated