# Current Project Specification

## Active Issue: CrowdSec Non-Root Migration Fix - REVISED

**Status**: Implementation Ready - Supervisor Review Complete
**Priority**: CRITICAL
**Last Updated**: 2024-12-22 (Revised after supervisor review)

### Quick Summary

The container migration from root to non-root user broke CrowdSec. Supervisor review identified **7 critical issues** that would cause the original fix to fail. This revised plan addresses all issues.

**Root Cause**: Permission issues, missing symlink creation logic, and incomplete config template population.

### Changes Required

1. **Dockerfile** (before Line 330): Add config template population before final COPY
2. **Entrypoint Script** (Lines 68-73): Replace symlink verification with creation logic
3. **Entrypoint Script** (Line 100): Fix LOG variable to use directory-based path
4. **Entrypoint Script** (Line 51): Add hub_cache directory creation
5. **Entrypoint Script** (Line 99): Keep CFG pointing to `/etc/crowdsec` (resolves via symlink)
6. **Entrypoint Script** (Lines 56-62): Strengthen error handling in config initialization
7. **Verification Checklist**: Expand from 7 to 11 steps

---

## Detailed Implementation Plan

### Issue 1: Missing Config Template Population (HIGH PRIORITY)

**Location**: `Dockerfile` before line 330 (before final COPY commands)

**Problem**: The Dockerfile doesn't populate `/etc/crowdsec.dist/` with CrowdSec default configs (`config.yaml`, `user.yaml`, etc.). As a result, the entrypoint script has nothing to copy when initializing persistent storage.
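Whether a given image is affected can be confirmed before touching the entrypoint. The sketch below is a minimal, hypothetical helper: the function name `missing_templates` and its default file list are assumptions for illustration (only `config.yaml` and `user.yaml` are named in this plan). It prints each expected template that is absent from a `.dist` directory.

```shell
#!/bin/sh
# Hypothetical helper: report expected template files missing from a directory.
# Usage: missing_templates DIR [FILE...]
# Returns 0 when every file is present, 1 otherwise.
missing_templates() {
    dir="$1"
    shift
    # Default file list is an assumption based on the files named in this plan.
    [ "$#" -gt 0 ] || set -- config.yaml user.yaml
    rc=0
    for name in "$@"; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing: $dir/$name"
            rc=1
        fi
    done
    return "$rc"
}
```

Run inside the image against `/etc/crowdsec.dist`, a non-zero exit status would indicate the entrypoint has an incomplete template set to copy from.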
**Current Code** (Lines 330-332):

```dockerfile
# Copy CrowdSec configuration templates from source
COPY configs/crowdsec/acquis.yaml /etc/crowdsec.dist/acquis.yaml
COPY configs/crowdsec/install_hub_items.sh /usr/local/bin/install_hub_items.sh
```

**Required Fix** (Add BEFORE line 330):

```dockerfile
# Generate CrowdSec default configs to .dist directory
RUN if command -v cscli >/dev/null; then \
        mkdir -p /etc/crowdsec.dist && \
        cscli config restore /etc/crowdsec.dist/ || \
        cp -r /etc/crowdsec/* /etc/crowdsec.dist/ 2>/dev/null || true; \
    fi
```

**Rationale**: The `cscli config restore` command generates all required default configs (`config.yaml`, `user.yaml`, `local_api_credentials.yaml`, etc.). If that fails, we fall back to copying any existing configs. This ensures the `.dist` directory is always populated for the entrypoint to use.

**Risk**: Low - Command has multiple fallbacks and won't fail the build if CrowdSec is unavailable.

---

### Issue 2: Symlink Not Created (HIGH PRIORITY)

**Location**: `.docker/docker-entrypoint.sh` lines 68-73

**Problem**: The entrypoint only VERIFIES the symlink exists but never CREATES it. This is the root cause of CrowdSec failures.

**Current Code** (Lines 68-73):

```bash
# Link /etc/crowdsec to persistent config for runtime compatibility
# Note: This symlink is created at build time; verify it exists
if [ -L "/etc/crowdsec" ]; then
    echo "CrowdSec config symlink verified: /etc/crowdsec -> $CS_CONFIG_DIR"
else
    echo "Warning: /etc/crowdsec symlink not found. CrowdSec may use volume config directly."
fi
```

**Required Fix** (Replace lines 68-73):

```bash
# Migrate existing directory to persistent storage if needed
if [ -d "/etc/crowdsec" ] && [ ! -L "/etc/crowdsec" ]; then
    echo "Migrating /etc/crowdsec to persistent storage..."
    if [ -n "$(ls -A /etc/crowdsec 2>/dev/null)" ]; then
        cp -rn /etc/crowdsec/* "$CS_CONFIG_DIR/" || {
            echo "ERROR: Failed to migrate configs"
            exit 1
        }
    fi
    rm -rf /etc/crowdsec || {
        echo "ERROR: Failed to remove old directory"
        exit 1
    }
fi

# Create symlink if it doesn't exist
if [ ! -L "/etc/crowdsec" ]; then
    ln -sf "$CS_CONFIG_DIR" /etc/crowdsec || {
        echo "ERROR: Failed to create symlink"
        exit 1
    }
    echo "Created symlink: /etc/crowdsec -> $CS_CONFIG_DIR"
fi
```

**Rationale**: This implements proper migration logic with fail-fast error handling. If `/etc/crowdsec` exists as a directory, we migrate its contents before creating the symlink.

**Risk**: Medium - Changes startup flow. Must test with both fresh and existing volumes.

---

### Issue 3: Wrong LOG Environment Variable

**Location**: `.docker/docker-entrypoint.sh` line 100

**Problem**: The `LOG` variable points directly to a file instead of using the log directory variable, breaking consistency.

**Current Code** (Line 100):

```bash
export LOG=/var/log/crowdsec.log
```

**Required Fix** (Replace line 100):

```bash
export LOG="$CS_LOG_DIR/crowdsec.log"
```

**Required Addition** (Add after line 47 where other CS_* variables are defined):

```bash
CS_LOG_DIR="/var/log/crowdsec"
```

**Rationale**: Ensures all CrowdSec paths are consistently managed through variables, making future changes easier.

**Risk**: Low - Simple variable change. Note that the log file moves from `/var/log/crowdsec.log` to `/var/log/crowdsec/crowdsec.log`, so `$CS_LOG_DIR` must exist and be writable by the non-root user.

---

### Issue 4: Missing Hub Cache Directory

**Location**: `.docker/docker-entrypoint.sh` after line 51

**Problem**: The hub cache directory `/app/data/crowdsec/hub_cache/` is never explicitly created, causing hub operations to fail.
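A directory that exists but is not writable by the non-root user would produce the same symptom, so a creation step can usefully check both conditions. The following is a sketch only; `ensure_dir` is a hypothetical helper name and is not part of the current entrypoint.

```shell
#!/bin/sh
# Hypothetical helper: create a directory (with parents) and confirm the
# current user can write to it. Reports on stderr and returns non-zero
# instead of continuing silently.
ensure_dir() {
    dir="$1"
    if ! mkdir -p "$dir" 2>/dev/null; then
        echo "ERROR: cannot create $dir" >&2
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "ERROR: $dir is not writable by uid $(id -u)" >&2
        return 1
    fi
    return 0
}
```

A caller could then decide per directory whether a failure is fatal, for example `ensure_dir "$CS_PERSIST_DIR/hub_cache" || exit 1`.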
**Current Code** (Lines 49-51):

```bash
# Ensure persistent directories exist (within writable volume)
mkdir -p "$CS_CONFIG_DIR" 2>/dev/null || echo "Warning: Cannot create $CS_CONFIG_DIR"
mkdir -p "$CS_DATA_DIR" 2>/dev/null || echo "Warning: Cannot create $CS_DATA_DIR"
```

**Required Fix** (Add after line 51):

```bash
mkdir -p "$CS_PERSIST_DIR/hub_cache"
```

**Rationale**: CrowdSec stores hub metadata in a separate cache directory. Without this, `cscli hub update` fails silently.

**Risk**: Low - Simple directory creation with no side effects.

---

### Issue 5: CFG Variable Should Stay /etc/crowdsec

**Location**: `.docker/docker-entrypoint.sh` line 99

**Problem**: The original plan incorrectly suggested changing CFG to `$CS_CONFIG_DIR`, but it should remain `/etc/crowdsec` since it resolves to persistent storage via the symlink.

**Current Code** (Line 99):

```bash
export CFG=/etc/crowdsec
```

**Required Action**: **KEEP AS-IS** - Do NOT change this line.

**Rationale**: The CFG variable should point to `/etc/crowdsec`, which resolves to `$CS_CONFIG_DIR` via symlink. This maintains compatibility with CrowdSec's expected paths while still using persistent storage.

**Risk**: None - No change required.

---

### Issue 6: Weak Migration Error Handling

**Location**: `.docker/docker-entrypoint.sh` lines 56-62

**Problem**: The `2>/dev/null || echo "Warning: ..."` fallbacks suppress error output and let the script continue, so copy failures during config initialization pass almost unnoticed.

**Current Code** (Lines 56-62):

```bash
# Initialize persistent config if key files are missing
if [ ! -f "$CS_CONFIG_DIR/config.yaml" ]; then
    echo "Initializing persistent CrowdSec configuration..."
    if [ -d "/etc/crowdsec.dist" ]; then
        cp -r /etc/crowdsec.dist/* "$CS_CONFIG_DIR/" 2>/dev/null || echo "Warning: Could not copy dist config"
    elif [ -d "/etc/crowdsec" ] && [ ! \
            -L "/etc/crowdsec" ]; then
        # Fallback if .dist is missing
        cp -r /etc/crowdsec/* "$CS_CONFIG_DIR/" 2>/dev/null || echo "Warning: Could not copy config"
    fi
fi
```

**Required Fix** (Replace lines 56-62):

```bash
# Initialize persistent config if key files are missing
if [ ! -f "$CS_CONFIG_DIR/config.yaml" ]; then
    echo "Initializing persistent CrowdSec configuration..."
    if [ -d "/etc/crowdsec.dist" ] && [ -n "$(ls -A /etc/crowdsec.dist 2>/dev/null)" ]; then
        cp -r /etc/crowdsec.dist/* "$CS_CONFIG_DIR/" || {
            echo "ERROR: Failed to copy config from /etc/crowdsec.dist"
            exit 1
        }
        echo "Successfully initialized config from .dist directory"
    elif [ -d "/etc/crowdsec" ] && [ ! -L "/etc/crowdsec" ] && [ -n "$(ls -A /etc/crowdsec 2>/dev/null)" ]; then
        cp -r /etc/crowdsec/* "$CS_CONFIG_DIR/" || {
            echo "ERROR: Failed to copy config from /etc/crowdsec"
            exit 1
        }
        echo "Successfully initialized config from /etc/crowdsec"
    else
        echo "ERROR: No config source found (neither .dist nor /etc/crowdsec available)"
        exit 1
    fi
fi
```

**Rationale**: Fail-fast approach ensures we detect misconfigurations early. The non-empty checks keep an empty directory from being treated as a valid config source.

**Risk**: Medium - Strict error handling may reveal edge cases. Must test thoroughly.

---

### Issue 7: Incomplete Verification Checklist

**Problem**: Original checklist had only 7 steps and missed critical tests for volume replacement, permissions, config persistence, and hub updates.

**Original Checklist** (Steps 1-7):

1. Fresh container start with empty volumes
2. Container restart (data persists)
3. CrowdSec enable/disable via UI
4. Log file permissions and rotation
5. LAPI readiness and machine registration
6. Hub updates and parsers
7. Multi-architecture compatibility

**Required Additional Steps** (8-11):

8. **Volume Replacement Test**: Start container with volume → destroy volume → recreate volume. Verify configs regenerate correctly.
9.
**Permission Inheritance**: Create new files in persistent storage (e.g., `cscli decisions add`). Verify ownership is correct (1000:1000).
10. **Config Persistence**: Make config changes via `cscli` (e.g., add bouncer, modify settings). Restart container. Verify changes persist.
11. **Hub Update Test**: Run `cscli hub update && cscli hub upgrade`. Verify hub data is stored in persistent volume and survives restarts.

**Rationale**: These tests cover critical failure modes discovered in production: volume loss, permission issues on newly created files, config changes not persisting, and hub data being ephemeral.

**Risk**: None - This is documentation only.

---

## Implementation Order

Follow this sequence to apply changes safely:

### Phase 1: Dockerfile Changes (Low Risk)

1. Add config template population to `Dockerfile` before line 330
2. Build test image: `docker build -t charon:test .`
3. Verify `/etc/crowdsec.dist/` is populated: `docker run --rm charon:test ls -la /etc/crowdsec.dist/`
4. Expected output: `config.yaml`, `user.yaml`, `local_api_credentials.yaml`, `profiles.yaml`

### Phase 2: Entrypoint Script Changes (Medium Risk)

5. Apply all 5 entrypoint script fixes in a single commit (they're interdependent)
6. Rebuild image: `docker build -t charon:test .`
7.
Test with fresh volumes (see Phase 3)

### Phase 3: Testing Strategy

Run all 11 verification tests in order:

**Test 1: Fresh Start**

```bash
docker volume create charon_data_test
docker run -d --name charon_test -v charon_data_test:/app/data charon:test
docker logs charon_test | grep -E "(symlink|CrowdSec config)"
```

Expected: "Created symlink: /etc/crowdsec -> /app/data/crowdsec/config"

**Test 2: Container Restart**

The revised entrypoint no longer prints a "symlink verified" message, so check the link target directly after the restart:

```bash
docker restart charon_test
docker exec charon_test readlink /etc/crowdsec
```

Expected: `/app/data/crowdsec/config`

**Tests 3-7**: Follow existing test procedures from original plan

**Test 8: Volume Replacement**

```bash
docker stop charon_test
docker rm charon_test
docker volume rm charon_data_test
docker volume create charon_data_test
docker run -d --name charon_test -v charon_data_test:/app/data charon:test
docker exec charon_test ls -la /app/data/crowdsec/config/
```

Expected: `config.yaml` and other files regenerated

**Test 9: Permission Inheritance**

```bash
docker exec charon_test cscli decisions add -i 1.2.3.4
docker exec charon_test ls -ln /app/data/crowdsec/data/
```

Expected: All files owned by uid 1000, gid 1000

**Test 10: Config Persistence**

```bash
docker exec charon_test cscli config set api.server.log_level=debug
docker restart charon_test
docker exec charon_test cscli config show api.server.log_level
```

Expected: "debug"

**Test 11: Hub Update**

```bash
docker exec charon_test cscli hub update
docker exec charon_test ls -la /app/data/crowdsec/hub_cache/
docker restart charon_test
docker exec charon_test cscli hub list -o json
```

Expected: Hub cache persists, parsers/scenarios remain installed

### Phase 4: Rollback Procedure

If any test fails:

1. Tag working version: `docker tag charon:current charon:rollback`
2. Revert changes to `.docker/docker-entrypoint.sh` and `Dockerfile`
3. Rebuild: `docker build -t charon:current .`
4.
Document failure in issue tracker with test logs

---

## Summary of All Changes

| File | Line(s) | Change Type | Priority | Risk |
|------|---------|-------------|----------|------|
| `Dockerfile` | Before 330 | Add config restore RUN | HIGH | Low |
| `.docker/docker-entrypoint.sh` | 47 | Add CS_LOG_DIR variable | HIGH | Low |
| `.docker/docker-entrypoint.sh` | 51 | Add hub_cache mkdir | HIGH | Low |
| `.docker/docker-entrypoint.sh` | 56-62 | Strengthen config init | HIGH | Medium |
| `.docker/docker-entrypoint.sh` | 68-73 | Implement symlink creation | HIGH | Medium |
| `.docker/docker-entrypoint.sh` | 99 | Keep CFG=/etc/crowdsec | NONE | None |
| `.docker/docker-entrypoint.sh` | 100 | Fix LOG variable | HIGH | Low |
| Verification checklist | N/A | Add 4 new tests | HIGH | None |

---

## Risk Assessment

### Low Risk Changes (Can be applied immediately)

- Dockerfile config template population
- LOG variable fix
- Hub cache directory creation
- CFG variable (no change)

### Medium Risk Changes (Require thorough testing)

- Symlink creation logic (fundamental behavior change)
- Error handling strengthening (may expose edge cases)

### High Risk Scenarios to Test

- Existing installations upgrading from old version
- Corrupted/incomplete config directories
- Simultaneous volume and config failures
- Cross-architecture compatibility (arm64 especially)

---

## Acceptance Criteria

All 11 verification tests must pass before merging:

- [ ] Fresh container start
- [ ] Container restart
- [ ] CrowdSec enable/disable
- [ ] Log file permissions
- [ ] LAPI readiness
- [ ] Hub updates
- [ ] Multi-arch compatibility
- [ ] Volume replacement
- [ ] Permission inheritance
- [ ] Config persistence
- [ ] Hub update persistence

---

## References

- Original issue: CrowdSec non-root migration
- Supervisor review: 2024-12-22
- Related files: `Dockerfile`, `.docker/docker-entrypoint.sh`
- Testing environment: Docker 24.x, volumes with uid 1000

---

# Historical Analysis: CrowdSec Reconciliation
Failure Diagnostics

## Executive Summary

Investigation of why CrowdSec shows "not started" in the UI when it should **already be enabled**. This is NOT a first-time enable issue; it's a **reconciliation/runtime failure** after container restart or app startup.

---

## Problem Statement

User reports CrowdSec was previously enabled and working, but after container restart:

- UI shows CrowdSec as "not started"
- The setting in database says it should be enabled
- No obvious errors in the UI

---

## 1. Reconciliation Flow Overview

When Charon starts, `ReconcileCrowdSecOnStartup()` runs **asynchronously** (in a goroutine) to restore CrowdSec state.

### Flow Diagram

```
App Startup → go ReconcileCrowdSecOnStartup() → (async goroutine)
          │
          ▼
┌────────────────────────────────────────┐
│ 1. Validate: db != nil && exec != nil  │
└────────────────────────────────────────┘
          │ (fail → silent return)
          ▼
┌────────────────────────────────────────┐
│ 2. Check: SecurityConfig table exists  │
└────────────────────────────────────────┘
          │ (no table → WARN + return)
          ▼
┌────────────────────────────────────────┐
│ 3. Query: SecurityConfig record        │
└────────────────────────────────────────┘
          │ (not found → auto-create from Settings)
          │ (error → return)
          ▼
┌────────────────────────────────────────┐
│ 4. Query: Settings table override      │
│    key = "security.crowdsec.enabled"   │
└────────────────────────────────────────┘
          │
          ▼
┌────────────────────────────────────────┐
│ 5. Decide: Start if CrowdSecMode ==    │
│    "local" OR setting == "true"        │
└────────────────────────────────────────┘
          │ (both false → INFO skip)
          ▼
┌────────────────────────────────────────┐
│ 6. Validate: Binary exists at path     │
│    /usr/local/bin/crowdsec             │
└────────────────────────────────────────┘
          │ (not found → ERROR + return)
          ▼
┌────────────────────────────────────────┐
│ 7. Validate: Config dir exists         │
│    dataDir/config                      │
└────────────────────────────────────────┘
          │ (not found → ERROR + return)
          ▼
┌────────────────────────────────────────┐
│ 8. Check: Status (already running?)
│
└────────────────────────────────────────┘
          │ (running → INFO + done)
          │ (error → WARN + return!)
          ▼
┌────────────────────────────────────────┐
│ 9. Start: CrowdSec process             │
└────────────────────────────────────────┘
          │ (error → ERROR + return)
          ▼
┌────────────────────────────────────────┐
│ 10. Verify: Wait 2s + check status     │
└────────────────────────────────────────┘
```

---

## 2. Most Likely Failure Points (Priority Order)

### 2.1 Binary Not Found ⭐ HIGH LIKELIHOOD

**Code:** `backend/internal/services/crowdsec_startup.go:117-120`

```go
if _, err := os.Stat(binPath); os.IsNotExist(err) {
    logger.Log().WithField("path", binPath).Error("CrowdSec reconciliation: binary not found, cannot start")
    return
}
```

**Diagnosis:**

```bash
docker exec <container> ls -la /usr/local/bin/crowdsec
docker exec <container> printenv CHARON_CROWDSEC_BIN
```

---

### 2.2 Config Directory Missing ⭐ HIGH LIKELIHOOD

**Code:** `backend/internal/services/crowdsec_startup.go:122-126`

```go
configPath := filepath.Join(dataDir, "config")
if _, err := os.Stat(configPath); os.IsNotExist(err) {
    logger.Log().WithField("path", configPath).Error("CrowdSec reconciliation: config directory not found")
    return
}
```

**Diagnosis:**

```bash
docker exec <container> ls -la /data/crowdsec/config/
docker exec <container> cat /data/crowdsec/config/config.yaml
```

---

### 2.3 Database State Mismatch ⭐ MEDIUM LIKELIHOOD

Two sources must be checked:

1. `security_configs.crowdsec_mode = "local"`
2. `settings.key = "security.crowdsec.enabled"` with `value = "true"`

If **neither** source indicates enabled, reconciliation skips startup and logs only an INFO message.

**Diagnosis:**

```bash
docker exec <container> sqlite3 /data/charon.db "SELECT crowdsec_mode, enabled FROM security_configs LIMIT 1;"
docker exec <container> sqlite3 /data/charon.db "SELECT key, value FROM settings WHERE key LIKE '%crowdsec%';"
```

---

### 2.4 Stale PID File (PID Recycled) ⭐ MEDIUM LIKELIHOOD

**Code:** `backend/internal/api/handlers/crowdsec_exec.go:118-147`

The status check reads the PID file, checks whether the process exists, then verifies that `/proc/<pid>/cmdline` contains "crowdsec".
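The same three-step check can be reproduced by hand when diagnosing a suspected stale PID file. The sketch below is a shell approximation, not the Go handler itself: the function name `pid_file_status` and its return codes are assumptions made for illustration, and it relies on a Linux `/proc` filesystem.

```shell
#!/bin/sh
# Sketch of the PID-file status check described above (Linux only, needs /proc).
# Usage: pid_file_status PIDFILE [PATTERN]
# Return codes (a convention of this sketch, not of the Go handler):
#   0 = process exists and its cmdline matches PATTERN
#   1 = PID file missing or contains no digits
#   2 = no process with that PID
#   3 = process exists but cmdline does not match (likely a recycled PID)
pid_file_status() {
    pidfile="$1"
    pattern="${2:-crowdsec}"
    [ -f "$pidfile" ] || return 1
    # Keep digits only, which also strips the trailing newline.
    pid=$(tr -cd '0-9' < "$pidfile")
    [ -n "$pid" ] || return 1
    [ -d "/proc/$pid" ] || return 2
    # cmdline is NUL-separated; convert NULs to spaces before matching.
    if tr '\0' ' ' < "/proc/$pid/cmdline" 2>/dev/null | grep -q -- "$pattern"; then
        return 0
    fi
    return 3
}
```

A return code of 3 corresponds to the recycled-PID case named in this section's heading: the PID file points at a live process that is not CrowdSec.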
**Diagnosis:**

```bash
docker exec <container> cat /data/crowdsec/crowdsec.pid
docker exec <container> pgrep -a crowdsec
```

---

### 2.5 Process Crashes After Start ⭐ MEDIUM LIKELIHOOD

**Code:** `backend/internal/services/crowdsec_startup.go:146-159`

After starting, reconciliation waits 2 seconds and verifies the process. If it has crashed:

```
logger.Log().Error("CrowdSec reconciliation: process started but is no longer running - may have crashed")
```

**Diagnosis:**

```bash
# Try manual start to see errors
docker exec <container> /usr/local/bin/crowdsec -c /data/crowdsec/config/config.yaml

# Check for port conflicts (LAPI uses 8085)
docker exec <container> netstat -tlnp 2>/dev/null | grep 8085
```

---

### 2.6 Status Check Error (Silently Aborts) ⭐ LOW LIKELIHOOD

**Code:** `backend/internal/services/crowdsec_startup.go:129-134`

```go
if err != nil {
    logger.Log().WithError(err).Warn("CrowdSec reconciliation: failed to check status")
    return // ← Aborts without trying to start!
}
```

---

## 3. Status Handler Analysis

The UI calls `GET /api/v1/admin/crowdsec/status`:

**Code:** `backend/internal/api/handlers/crowdsec_handler.go:313-333`

Returns `running: false` when:

- PID file doesn't exist
- PID doesn't correspond to a running process
- PID is running but `/proc/<pid>/cmdline` doesn't contain "crowdsec"

---

## 4. Diagnostic Commands Summary

```bash
# 1. Check binary
docker exec <container> ls -la /usr/local/bin/crowdsec

# 2. Check config directory
docker exec <container> ls -la /data/crowdsec/config/

# 3. Check database state
docker exec <container> sqlite3 /data/charon.db \
  "SELECT crowdsec_mode, enabled FROM security_configs LIMIT 1;"
docker exec <container> sqlite3 /data/charon.db \
  "SELECT key, value FROM settings WHERE key LIKE '%crowdsec%';"

# 4. Check PID file
docker exec <container> cat /data/crowdsec/crowdsec.pid 2>/dev/null || echo "No PID file"

# 5. Check running processes
docker exec <container> pgrep -a crowdsec || echo "Not running"

# 6. Check logs for reconciliation
docker logs <container> 2>&1 | grep -i "crowdsec reconciliation"

# 7.
# Try manual start
docker exec <container> /usr/local/bin/crowdsec \
  -c /data/crowdsec/config/config.yaml &

# 8. Check port conflicts
docker exec <container> netstat -tlnp 2>/dev/null | grep -E "8085|8080"
```

---

## 5. Log Messages to Look For

| Priority | Cause | Log Message |
|----------|-------|-------------|
| 1 | Binary missing | `"CrowdSec reconciliation: binary not found"` |
| 2 | Config missing | `"CrowdSec reconciliation: config directory not found"` |
| 3 | DB says disabled | `"CrowdSec reconciliation skipped: both SecurityConfig and Settings indicate disabled"` |
| 4 | Crashed after start | `"process started but is no longer running"` |
| 5 | Start failed | `"CrowdSec reconciliation: FAILED to start CrowdSec"` |
| 6 | Status check failed | `"failed to check status"` |

---

## 6. Key Timeouts

| Operation | Timeout | Location |
|-----------|---------|----------|
| Status check | 5 seconds | crowdsec_startup.go:128 |
| Start timeout | 30 seconds | crowdsec_startup.go:146 |
| Post-start delay | 2 seconds | crowdsec_startup.go:153 |
| Verification check | 5 seconds | crowdsec_startup.go:156 |

---

# Original Analysis: First-Time Enable Issues

## Observed Browser Console Errors

```
- 401 Unauthorized on /api/v1/auth/me
- Multiple 400 Bad Request on /api/v1/settings/validate-url
- Auto-logging out due to inactivity
- Various ERR_NETWORK_CHANGED errors
- CrowdSec appears to not be running
```

---

## Relevant Code Files and Flow Analysis

### 2.1 CrowdSec Startup Flow

#### Entry Point: Frontend Toggle

**File:** [frontend/src/pages/Security.tsx](../../../frontend/src/pages/Security.tsx#L147-L183)

```typescript
const crowdsecPowerMutation = useMutation({
  mutationFn: async (enabled: boolean) => {
    await updateSetting('security.crowdsec.enabled', enabled ? 'true' : 'false', 'security', 'bool')
    if (enabled) {
      toast.info('Starting CrowdSec...
This may take up to 30 seconds')
      const result = await startCrowdsec()
      const status = await statusCrowdsec()
      if (!status.running) {
        await updateSetting('security.crowdsec.enabled', 'false', 'security', 'bool')
        throw new Error('CrowdSec process failed to start. Check server logs for details.')
      }
      return result
    } else {
      await stopCrowdsec()
      // ...
    }
  },
  // ...
})
```

#### API Client Configuration

**File:** [frontend/src/api/client.ts](../../../frontend/src/api/client.ts#L7-L11)

```typescript
const client = axios.create({
  baseURL: '/api/v1',
  withCredentials: true,
  timeout: 30000, // 30 second timeout
});
```

**Issue Identified:** The frontend has a **30-second timeout**, which aligns with the backend LAPI readiness timeout. However, the startup process involves multiple sequential steps that could exceed this total.

#### Backend Start Handler

**File:** [backend/internal/api/handlers/crowdsec_handler.go](../../../backend/internal/api/handlers/crowdsec_handler.go#L175-L252)

Key timeouts in `Start()`:

- LAPI readiness polling: **30 seconds max** (line 229: `maxWait := 30 * time.Second`)
- Poll interval: **500ms** (line 230: `pollInterval := 500 * time.Millisecond`)
- Individual LAPI check: **2 seconds** (line 237: `context.WithTimeout(ctx, 2*time.Second)`)

```go
// Wait for LAPI to be ready (with timeout)
lapiReady := false
maxWait := 30 * time.Second
pollInterval := 500 * time.Millisecond
deadline := time.Now().Add(maxWait)
for time.Now().Before(deadline) {
    checkCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
    _, err := h.CmdExec.Execute(checkCtx, "cscli", args...)
    cancel()
    if err == nil {
        lapiReady = true
        break
    }
    time.Sleep(pollInterval)
}
```

#### Backend Executor (Process Management)

**File:** [backend/internal/api/handlers/crowdsec_exec.go](../../../backend/internal/api/handlers/crowdsec_exec.go)

The `DefaultCrowdsecExecutor.Start()` method (lines 39-66):

- Uses `exec.Command` (not `CommandContext`) - process is detached
- Sets `Setpgid: true` to create new process group
- Writes PID file synchronously
- Returns immediately after starting the process

```go
func (e *DefaultCrowdsecExecutor) Start(ctx context.Context, binPath, configDir string) (int, error) {
    configFile := filepath.Join(configDir, "config", "config.yaml")
    cmd := exec.Command(binPath, "-c", configFile)
    cmd.SysProcAttr = &syscall.SysProcAttr{
        Setpgid: true, // Create new process group
    }
    // ...
    if err := cmd.Start(); err != nil {
        return 0, err
    }
    // ... writes PID file
    go func() {
        _ = cmd.Wait()
        _ = os.Remove(e.pidFile(configDir))
    }()
    return pid, nil
}
```

#### Background Reconciliation

**File:** [backend/internal/services/crowdsec_startup.go](../../../backend/internal/services/crowdsec_startup.go)

Key timeouts in `ReconcileCrowdSecOnStartup()`:

- Status check timeout: **5 seconds** (line 139)
- Start timeout: **30 seconds** (line 150)
- Verification delay: **2 seconds** (line 159: `time.Sleep(2 * time.Second)`)
- Verification check timeout: **5 seconds** (line 161)

```go
// Start context with 30 second timeout
startCtx, startCancel := context.WithTimeout(context.Background(), 30*time.Second)
defer startCancel()
newPid, err := executor.Start(startCtx, binPath, dataDir)
// ...
// VERIFY: Wait briefly and confirm process is actually running
time.Sleep(2 * time.Second)
verifyCtx, verifyCancel := context.WithTimeout(context.Background(), 5*time.Second)
defer verifyCancel()
```

---

## 3. Identified Potential Root Causes

### 3.1 Timeout Race Condition (HIGH PROBABILITY)

The frontend timeout (30s) and backend LAPI polling timeout (30s) are identical.
Combined with:

- Initial process start time
- Settings database update
- SecurityConfig database update
- Network latency

**Total time could easily exceed 30 seconds**, causing the frontend to time out before the backend responds.

### 3.2 CrowdSec Binary/Config Not Found

In [crowdsec_startup.go](../../../backend/internal/services/crowdsec_startup.go#L124-L135):

```go
// VALIDATE: Ensure binary exists
if _, err := os.Stat(binPath); os.IsNotExist(err) {
    logger.Log().WithField("path", binPath).Error("CrowdSec reconciliation: binary not found")
    return
}

// VALIDATE: Ensure config directory exists
configPath := filepath.Join(dataDir, "config")
if _, err := os.Stat(configPath); os.IsNotExist(err) {
    logger.Log().WithField("path", configPath).Error("CrowdSec reconciliation: config directory not found")
    return
}
```

**Check:** The binary path defaults to `/usr/local/bin/crowdsec` (from `routes.go` line 292) and config dir is `data/crowdsec`. If either is missing, reconciliation logs an ERROR and returns without starting CrowdSec, so nothing surfaces in the UI.

### 3.3 LAPI Never Becomes Ready

The handler waits for `cscli lapi status` to succeed. If CrowdSec starts but LAPI never initializes (e.g., database issues, missing configuration), the handler will time out.

### 3.4 Authentication Issues (401 on /auth/me)

The 401 errors suggest the user's session is expiring during the long-running operation. This is likely a **symptom, not the cause**:

**File:** [frontend/src/api/client.ts](../../../frontend/src/api/client.ts#L25-L33)

```typescript
client.interceptors.response.use(
  (response) => response,
  (error) => {
    if (error.response?.status === 401) {
      console.warn('Authentication failed:', error.config?.url);
    }
    return Promise.reject(error);
  }
);
```

The session timeout or network interruption during the 30+ second CrowdSec startup could cause parallel requests to `/auth/me` to fail.

### 3.5 ERR_NETWORK_CHANGED

This indicates network connectivity issues on the client side. If the network changes during the long-running request, it will fail.
This is external to the application but exacerbated by long timeouts.

---

## 4. Configuration Defaults

| Setting | Default Value | Source |
|---------|---------------|--------|
| CrowdSec Binary | `/usr/local/bin/crowdsec` | `CHARON_CROWDSEC_BIN` env or hardcoded |
| CrowdSec Config Dir | `data/crowdsec` | `CHARON_CROWDSEC_CONFIG_DIR` env |
| CrowdSec Mode | `disabled` | `CERBERUS_SECURITY_CROWDSEC_MODE` env |
| Frontend Timeout | 30 seconds | `client.ts` |
| LAPI Wait Timeout | 30 seconds | `crowdsec_handler.go` |
| Process Start Timeout | 30 seconds | `crowdsec_startup.go` |

---

## 5. Remediation Plan

### Phase 1: Immediate Fixes (Timeout Handling)

#### 5.1.1 Increase Frontend Timeout for CrowdSec Operations

**File:** `frontend/src/api/crowdsec.ts`

Create a dedicated request with extended timeout for CrowdSec start:

```typescript
export async function startCrowdsec(): Promise<{ status: string; pid: number; lapi_ready?: boolean }> {
  const resp = await client.post('/admin/crowdsec/start', {}, {
    timeout: 60000, // 60 second timeout for startup operations
  })
  return resp.data
}
```

#### 5.1.2 Add Progress/Status Feedback

Implement a polling-based status check instead of waiting for a single long request:

1. Backend: Return immediately after starting process, with status "starting"
2.
Frontend: Poll status endpoint until "running" or timeout

#### 5.1.3 Improve Error Messages

**File:** `backend/internal/api/handlers/crowdsec_handler.go`

Add detailed error responses:

```go
if !lapiReady {
    logger.Log().WithField("pid", pid).Warn("CrowdSec started but LAPI not ready within timeout")
    c.JSON(http.StatusOK, gin.H{
        "status":     "started",
        "pid":        pid,
        "lapi_ready": false,
        "warning":    "Process started but LAPI initialization may take additional time",
        "next_step":  "Poll /admin/crowdsec/status until lapi_ready is true",
    })
    return
}
```

### Phase 2: Diagnostic Improvements

#### 5.2.1 Add Health Check Endpoint

Create `/admin/crowdsec/health` that returns:

- Binary path and existence check
- Config directory and existence check
- Process status
- LAPI status
- Last error (if any)

#### 5.2.2 Enhanced Logging

Add structured logging for all CrowdSec operations with correlation IDs.

### Phase 3: Long-term Fixes

#### 5.3.1 Async Startup Pattern

Convert to async pattern:

1. `POST /admin/crowdsec/start` returns immediately with job ID
2. `GET /admin/crowdsec/jobs/{id}` returns job status
3. Frontend polls job status with exponential backoff

#### 5.3.2 WebSocket Status Updates

Use existing WebSocket infrastructure to push status updates during startup.

---

## 6. Diagnostic Commands

To investigate the issue on the running container:

```bash
# Check if CrowdSec binary exists
ls -la /usr/local/bin/crowdsec

# Check CrowdSec config directory
ls -la /app/data/crowdsec/config/

# Check if CrowdSec is running
pgrep -f crowdsec
ps aux | grep crowdsec

# Check CrowdSec logs (if running)
cat /var/log/crowdsec.log

# Test LAPI status
cscli lapi status

# Check PID file
cat /app/data/crowdsec/crowdsec.pid

# Check database for CrowdSec settings
sqlite3 /app/data/charon.db "SELECT * FROM settings WHERE key LIKE '%crowdsec%';"
sqlite3 /app/data/charon.db "SELECT * FROM security_configs;"
```

---

## 7.
Summary

| Issue | Probability | Impact | Fix Complexity |
|-------|-------------|--------|----------------|
| Timeout race condition | HIGH | Startup fails | Low |
| Missing binary/config | MEDIUM | Startup fails silently | Low |
| LAPI initialization slow | MEDIUM | Timeout | Medium |
| Session expiry during startup | LOW | User sees 401 | Low |
| Network instability | LOW | Request fails | N/A (external) |

**Recommended Immediate Action:** Increase frontend timeout for CrowdSec start operations to 60 seconds and add polling-based status verification.

---

## 8. Files to Modify

| File | Change |
|------|--------|
| `frontend/src/api/crowdsec.ts` | Extend timeout for start operation |
| `frontend/src/pages/Security.tsx` | Add polling for status after start |
| `backend/internal/api/handlers/crowdsec_handler.go` | Return partial success, add health endpoint |
| `backend/internal/services/crowdsec_startup.go` | Add more diagnostic logging |

---

*Investigation completed: December 22, 2025*
*Author: GitHub Copilot (Research Mode)*