# Current Project Specification
|
|
|
|
## Active Issue: CrowdSec Non-Root Migration Fix - REVISED
|
|
|
|
**Status**: Implementation Ready - Supervisor Review Complete
|
|
**Priority**: CRITICAL
|
|
**Last Updated**: 2024-12-22 (Revised after supervisor review)
|
|
|
|
### Quick Summary
|
|
|
|
The container migration from root to non-root user broke CrowdSec. Supervisor review identified **7 critical issues** that would cause the original fix to fail. This revised plan addresses all issues.
|
|
|
|
**Root Cause**: Permission issues, missing symlink creation logic, and incomplete config template population.
|
|
|
|
### Changes Required
|
|
|
|
1. **Dockerfile** (before line 330): Add config template population before the final COPY commands
|
|
2. **Entrypoint Script** (Lines 68-73): Replace symlink verification with creation logic
|
|
3. **Entrypoint Script** (Line 100): Fix LOG variable to use directory-based path
|
|
4. **Entrypoint Script** (Line 51): Add hub_cache directory creation
|
|
5. **Entrypoint Script** (Line 99): Keep CFG pointing to `/etc/crowdsec` (resolves via symlink)
|
|
6. **Entrypoint Script** (Lines 56-62): Strengthen error handling in config initialization
|
|
7. **Verification Checklist**: Expand from 7 to 11 steps
|
|
|
|
---
|
|
|
|
## Detailed Implementation Plan
|
|
|
|
### Issue 1: Missing Config Template Population (HIGH PRIORITY)
|
|
|
|
**Location**: `Dockerfile` before line 330 (before final COPY commands)
|
|
|
|
**Problem**: The Dockerfile doesn't populate `/etc/crowdsec.dist/` with CrowdSec default configs (`config.yaml`, `user.yaml`, etc.). This causes the entrypoint script to have nothing to copy when initializing persistent storage.
|
|
|
|
**Current Code** (Lines 330-332):
|
|
```dockerfile
# Copy CrowdSec configuration templates from source
COPY configs/crowdsec/acquis.yaml /etc/crowdsec.dist/acquis.yaml
COPY configs/crowdsec/install_hub_items.sh /usr/local/bin/install_hub_items.sh
```
|
|
|
|
**Required Fix** (Add BEFORE line 330):
|
|
```dockerfile
# Generate CrowdSec default configs to .dist directory
RUN if command -v cscli >/dev/null; then \
        mkdir -p /etc/crowdsec.dist && \
        cscli config restore /etc/crowdsec.dist/ || \
        cp -r /etc/crowdsec/* /etc/crowdsec.dist/ 2>/dev/null || true; \
    fi
```
|
|
|
|
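**Rationale**: The `.dist` directory must always be populated so the entrypoint has something to copy. The `cscli config restore` step is tried first to produce the default configs (`config.yaml`, `user.yaml`, `local_api_credentials.yaml`, etc.); if it fails, the `cp -r` fallback copies whatever already exists in `/etc/crowdsec`. Note that `cscli config restore` is documented as restoring from a directory written by `cscli config backup`, so on a clean build the fallback is likely the path that actually runs; confirm the result with the Phase 1 verification step.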
|
|
|
|
**Risk**: Low - Command has multiple fallbacks and won't fail the build if CrowdSec is unavailable.
|
|
|
|
---
|
|
|
|
### Issue 2: Symlink Not Created (HIGH PRIORITY)
|
|
|
|
**Location**: `.docker/docker-entrypoint.sh` lines 68-73
|
|
|
|
**Problem**: The entrypoint only VERIFIES the symlink exists but never CREATES it. This is the root cause of CrowdSec failures.
|
|
|
|
**Current Code** (Lines 68-73):
|
|
```bash
# Link /etc/crowdsec to persistent config for runtime compatibility
# Note: This symlink is created at build time; verify it exists
if [ -L "/etc/crowdsec" ]; then
    echo "CrowdSec config symlink verified: /etc/crowdsec -> $CS_CONFIG_DIR"
else
    echo "Warning: /etc/crowdsec symlink not found. CrowdSec may use volume config directly."
fi
```
|
|
|
|
**Required Fix** (Replace lines 68-73):
|
|
```bash
# Migrate existing directory to persistent storage if needed
if [ -d "/etc/crowdsec" ] && [ ! -L "/etc/crowdsec" ]; then
    echo "Migrating /etc/crowdsec to persistent storage..."
    if [ -n "$(ls -A /etc/crowdsec 2>/dev/null)" ]; then
        cp -rn /etc/crowdsec/* "$CS_CONFIG_DIR/" || {
            echo "ERROR: Failed to migrate configs"
            exit 1
        }
    fi
    rm -rf /etc/crowdsec || {
        echo "ERROR: Failed to remove old directory"
        exit 1
    }
fi

# Create symlink if it doesn't exist
if [ ! -L "/etc/crowdsec" ]; then
    ln -sf "$CS_CONFIG_DIR" /etc/crowdsec || {
        echo "ERROR: Failed to create symlink"
        exit 1
    }
    echo "Created symlink: /etc/crowdsec -> $CS_CONFIG_DIR"
fi
```
|
|
|
|
**Rationale**: This implements proper migration logic with fail-fast error handling. If `/etc/crowdsec` exists as a directory, we migrate its contents before creating the symlink.
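
The migrate-then-symlink sequence can be exercised outside the container by passing the two paths as parameters. The sketch below is a hypothetical helper (`link_config_dir` does not exist in the entrypoint script) that mirrors the logic of the fix. As a simplification it copies with `cp -R src/.`, which also picks up hidden files but overwrites existing ones, unlike the `cp -rn` in the fix.

```bash
# Hypothetical helper: migrate a real directory into persistent storage,
# then replace it with a symlink. Paths are parameters so the logic can be
# tested against temporary directories instead of /etc/crowdsec.
link_config_dir() {
    local link_path="$1" target_dir="$2"

    mkdir -p "$target_dir" || return 1

    # A real directory (not a symlink) must be migrated first.
    if [ -d "$link_path" ] && [ ! -L "$link_path" ]; then
        if [ -n "$(ls -A "$link_path" 2>/dev/null)" ]; then
            # The trailing "/." copies hidden files as well.
            cp -R "$link_path/." "$target_dir/" || return 1
        fi
        rm -rf "$link_path" || return 1
    fi

    # Create the symlink only if it is missing.
    if [ ! -L "$link_path" ]; then
        ln -s "$target_dir" "$link_path" || return 1
    fi
}
```

Calling the helper a second time is a no-op, which matches the restart case where the symlink already exists.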
|
|
|
|
**Risk**: Medium - Changes startup flow. Must test with both fresh and existing volumes.
|
|
|
|
---
|
|
|
|
### Issue 3: Wrong LOG Environment Variable
|
|
|
|
**Location**: `.docker/docker-entrypoint.sh` line 100
|
|
|
|
**Problem**: The `LOG` variable points directly to a file instead of using the log directory variable, breaking consistency.
|
|
|
|
**Current Code** (Line 100):
|
|
```bash
|
|
export LOG=/var/log/crowdsec.log
|
|
```
|
|
|
|
**Required Fix** (Replace line 100):
|
|
```bash
|
|
export LOG="$CS_LOG_DIR/crowdsec.log"
|
|
```
|
|
|
|
**Required Addition** (Add after line 47 where other CS_* variables are defined):
|
|
```bash
|
|
CS_LOG_DIR="/var/log/crowdsec"
|
|
```
|
|
|
|
**Rationale**: Ensures all CrowdSec paths are consistently managed through variables, making future changes easier.
|
|
|
|
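**Risk**: Low - The log file moves from `/var/log/crowdsec.log` to `/var/log/crowdsec/crowdsec.log`, so `$CS_LOG_DIR` must exist and be writable by the non-root user. Otherwise there is no behavioral impact.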
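
Because `LOG` now points inside a directory, that directory has to exist before CrowdSec opens the log. A minimal sketch of a guard, assuming a hypothetical helper named `ensure_log_dir` that is not part of the current entrypoint:

```bash
# Hypothetical helper: make sure the log directory exists and is writable
# before LOG is exported. Returns non-zero so the caller can decide to abort.
ensure_log_dir() {
    local dir="$1"
    mkdir -p "$dir" 2>/dev/null || { echo "ERROR: cannot create $dir" >&2; return 1; }
    [ -w "$dir" ] || { echo "ERROR: $dir is not writable" >&2; return 1; }
}

# Possible use in the entrypoint (CS_LOG_DIR as defined above):
#   ensure_log_dir "$CS_LOG_DIR" && export LOG="$CS_LOG_DIR/crowdsec.log"
```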
|
|
|
|
---
|
|
|
|
### Issue 4: Missing Hub Cache Directory
|
|
|
|
**Location**: `.docker/docker-entrypoint.sh` after line 51
|
|
|
|
**Problem**: The hub cache directory `/app/data/crowdsec/hub_cache/` is never explicitly created, causing hub operations to fail.
|
|
|
|
**Current Code** (Lines 49-51):
|
|
```bash
|
|
# Ensure persistent directories exist (within writable volume)
|
|
mkdir -p "$CS_CONFIG_DIR" 2>/dev/null || echo "Warning: Cannot create $CS_CONFIG_DIR"
|
|
mkdir -p "$CS_DATA_DIR" 2>/dev/null || echo "Warning: Cannot create $CS_DATA_DIR"
|
|
```
|
|
|
|
**Required Fix** (Add after line 51):
|
|
```bash
|
|
mkdir -p "$CS_PERSIST_DIR/hub_cache"
|
|
```
|
|
|
|
**Rationale**: CrowdSec stores hub metadata in a separate cache directory. Without this, `cscli hub update` fails silently.
|
|
|
|
**Risk**: Low - Simple directory creation with no side effects.
|
|
|
|
---
|
|
|
|
### Issue 5: CFG Variable Should Stay /etc/crowdsec
|
|
|
|
**Location**: `.docker/docker-entrypoint.sh` line 99
|
|
|
|
**Problem**: The original plan incorrectly suggested changing CFG to `$CS_CONFIG_DIR`, but it should remain `/etc/crowdsec` since it resolves to persistent storage via the symlink.
|
|
|
|
**Current Code** (Line 99):
|
|
```bash
|
|
export CFG=/etc/crowdsec
|
|
```
|
|
|
|
**Required Action**: **KEEP AS-IS** - Do NOT change this line.
|
|
|
|
**Rationale**: The CFG variable should point to `/etc/crowdsec` which resolves to `$CS_CONFIG_DIR` via symlink. This maintains compatibility with CrowdSec's expected paths while still using persistent storage.
|
|
|
|
**Risk**: None - No change required.
|
|
|
|
---
|
|
|
|
### Issue 6: Weak Migration Error Handling
|
|
|
|
**Location**: `.docker/docker-entrypoint.sh` lines 56-62
|
|
|
|
**Problem**: The `2>/dev/null || echo "Warning: ..."` fallbacks swallow copy errors, so a failed config initialization only prints a warning and startup continues.
|
|
|
|
**Current Code** (Lines 56-62):
|
|
```bash
# Initialize persistent config if key files are missing
if [ ! -f "$CS_CONFIG_DIR/config.yaml" ]; then
    echo "Initializing persistent CrowdSec configuration..."
    if [ -d "/etc/crowdsec.dist" ]; then
        cp -r /etc/crowdsec.dist/* "$CS_CONFIG_DIR/" 2>/dev/null || echo "Warning: Could not copy dist config"
    elif [ -d "/etc/crowdsec" ] && [ ! -L "/etc/crowdsec" ]; then
        # Fallback if .dist is missing
        cp -r /etc/crowdsec/* "$CS_CONFIG_DIR/" 2>/dev/null || echo "Warning: Could not copy config"
    fi
fi
```
|
|
|
|
**Required Fix** (Replace lines 56-62):
|
|
```bash
# Initialize persistent config if key files are missing
if [ ! -f "$CS_CONFIG_DIR/config.yaml" ]; then
    echo "Initializing persistent CrowdSec configuration..."
    if [ -d "/etc/crowdsec.dist" ] && [ -n "$(ls -A /etc/crowdsec.dist 2>/dev/null)" ]; then
        cp -r /etc/crowdsec.dist/* "$CS_CONFIG_DIR/" || {
            echo "ERROR: Failed to copy config from /etc/crowdsec.dist"
            exit 1
        }
        echo "Successfully initialized config from .dist directory"
    elif [ -d "/etc/crowdsec" ] && [ ! -L "/etc/crowdsec" ] && [ -n "$(ls -A /etc/crowdsec 2>/dev/null)" ]; then
        cp -r /etc/crowdsec/* "$CS_CONFIG_DIR/" || {
            echo "ERROR: Failed to copy config from /etc/crowdsec"
            exit 1
        }
        echo "Successfully initialized config from /etc/crowdsec"
    else
        echo "ERROR: No config source found (neither .dist nor /etc/crowdsec available)"
        exit 1
    fi
fi
```
|
|
|
|
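**Rationale**: Failing fast surfaces misconfigurations at container startup rather than later, when CrowdSec cannot load its config. The non-empty checks (`ls -A`) skip a source that exists but contains nothing; in that case the `*` glob would not expand and `cp` would fail on a literal `*`.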
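
The repeated "exists and is non-empty" test could be factored into one small function. This is only a sketch; `dir_has_files` is a hypothetical name, not something the entrypoint currently defines.

```bash
# Hypothetical helper: succeed only if the directory exists and contains
# at least one entry (hidden files included, as with `ls -A`).
dir_has_files() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Possible use when choosing a config source:
#   if dir_has_files /etc/crowdsec.dist; then ...; fi
```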
|
|
|
|
**Risk**: Medium - Strict error handling may reveal edge cases. Must test thoroughly.
|
|
|
|
---
|
|
|
|
### Issue 7: Incomplete Verification Checklist
|
|
|
|
**Problem**: Original checklist had only 7 steps and missed critical tests for volume replacement, permissions, config persistence, and hub updates.
|
|
|
|
**Original Checklist** (Steps 1-7):
|
|
1. Fresh container start with empty volumes
|
|
2. Container restart (data persists)
|
|
3. CrowdSec enable/disable via UI
|
|
4. Log file permissions and rotation
|
|
5. LAPI readiness and machine registration
|
|
6. Hub updates and parsers
|
|
7. Multi-architecture compatibility
|
|
|
|
**Required Additional Steps** (8-11):
|
|
8. **Volume Replacement Test**: Start container with volume → destroy volume → recreate volume. Verify configs regenerate correctly.
|
|
9. **Permission Inheritance**: Create new files in persistent storage (e.g., `cscli decisions add`). Verify ownership is correct (1000:1000).
|
|
10. **Config Persistence**: Make config changes via `cscli` (e.g., add bouncer, modify settings). Restart container. Verify changes persist.
|
|
11. **Hub Update Test**: Run `cscli hub update && cscli hub upgrade`. Verify hub data is stored in persistent volume and survives restarts.
|
|
|
|
**Rationale**: These tests cover critical failure modes discovered in production: volume loss, permission issues on newly created files, config changes not persisting, and hub data being ephemeral.
|
|
|
|
**Risk**: None - This is documentation only.
|
|
|
|
---
|
|
|
|
## Implementation Order
|
|
|
|
Follow this sequence to apply changes safely:
|
|
|
|
### Phase 1: Dockerfile Changes (Low Risk)
|
|
1. Add config template population to `Dockerfile` before line 330
|
|
2. Build test image: `docker build -t charon:test .`
|
|
3. Verify `/etc/crowdsec.dist/` is populated: `docker run --rm charon:test ls -la /etc/crowdsec.dist/`
|
|
4. Expected output: `config.yaml`, `user.yaml`, `local_api_credentials.yaml`, `profiles.yaml`
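
The check in steps 3 and 4 can be scripted rather than read by eye. The sketch below assumes the four file names listed in step 4 are the complete expected set; `missing_dist_files` is a hypothetical helper name.

```bash
# Hypothetical helper: print each expected config file that is missing from
# the given directory. Empty output means the directory is fully populated.
missing_dist_files() {
    local dir="$1" f
    for f in config.yaml user.yaml local_api_credentials.yaml profiles.yaml; do
        [ -f "$dir/$f" ] || echo "$f"
    done
}

# One way to run it against the test image (copies the directory out first):
#   cid=$(docker create charon:test)
#   docker cp "$cid":/etc/crowdsec.dist ./dist && docker rm "$cid"
#   missing_dist_files ./dist
```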
|
|
|
|
### Phase 2: Entrypoint Script Changes (Medium Risk)
|
|
5. Apply all 5 entrypoint script fixes in a single commit (they're interdependent)
|
|
6. Rebuild image: `docker build -t charon:test .`
|
|
7. Test with fresh volumes (see Phase 3)
|
|
|
|
### Phase 3: Testing Strategy
|
|
Run all 11 verification tests in order:
|
|
|
|
**Test 1: Fresh Start**
|
|
```bash
docker volume create charon_data_test
docker run -d --name charon_test -v charon_data_test:/app/data charon:test
docker logs charon_test | grep -E "(symlink|CrowdSec config)"
```
|
|
Expected: "Created symlink: /etc/crowdsec -> /app/data/crowdsec/config"
|
|
|
|
**Test 2: Container Restart**
|
|
```bash
docker restart charon_test
docker exec charon_test readlink /etc/crowdsec
```
Expected: `/app/data/crowdsec/config`. The revised entrypoint no longer prints a "symlink verified" message, so the test checks the symlink target directly instead of grepping the logs.
|
|
|
|
**Test 3-7**: Follow existing test procedures from original plan
|
|
|
|
**Test 8: Volume Replacement**
|
|
```bash
docker stop charon_test
docker rm charon_test
docker volume rm charon_data_test
docker volume create charon_data_test
docker run -d --name charon_test -v charon_data_test:/app/data charon:test
docker exec charon_test ls -la /app/data/crowdsec/config/
```
|
|
Expected: `config.yaml` and other files regenerated
|
|
|
|
**Test 9: Permission Inheritance**
|
|
```bash
docker exec charon_test cscli decisions add -i 1.2.3.4
docker exec charon_test ls -ln /app/data/crowdsec/data/
```
|
|
Expected: All files owned by uid 1000, gid 1000
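
Reading `ls -ln` output by eye is error-prone on a large tree. As a sketch, ownership could be checked with `find` instead; `list_wrong_owner` is a hypothetical helper, and the uid and gid are passed in rather than hard-coded.

```bash
# Hypothetical helper: list every path under a directory that is NOT owned by
# the expected numeric uid and gid. Empty output means ownership is correct.
list_wrong_owner() {
    local dir="$1" uid="$2" gid="$3"
    find "$dir" \( ! -user "$uid" -o ! -group "$gid" \) -print
}

# Inside the container, for the ownership expected by this test:
#   list_wrong_owner /app/data/crowdsec 1000 1000
```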
|
|
|
|
**Test 10: Config Persistence**
|
|
```bash
docker exec charon_test cscli bouncers add persistence-test
docker restart charon_test
docker exec charon_test cscli bouncers list
```
Expected: the `persistence-test` bouncer is still listed after the restart
|
|
|
|
**Test 11: Hub Update**
|
|
```bash
docker exec charon_test cscli hub update
docker exec charon_test ls -la /app/data/crowdsec/hub_cache/
docker restart charon_test
docker exec charon_test cscli hub list -o json
```
|
|
Expected: Hub cache persists, parsers/scenarios remain installed
|
|
|
|
### Phase 4: Rollback Procedure
|
|
If any test fails:
|
|
1. Fall back to the last working image, which must have been tagged before Phase 1 began: `docker tag charon:current charon:rollback`
|
|
2. Revert changes to `.docker/docker-entrypoint.sh` and `Dockerfile`
|
|
3. Rebuild: `docker build -t charon:current .`
|
|
4. Document failure in issue tracker with test logs
|
|
|
|
---
|
|
|
|
## Summary of All Changes
|
|
|
|
| File | Line(s) | Change Type | Priority | Risk |
|------|---------|-------------|----------|------|
| `Dockerfile` | Before 330 | Add config restore RUN | HIGH | Low |
| `.docker/docker-entrypoint.sh` | 47 | Add CS_LOG_DIR variable | HIGH | Low |
| `.docker/docker-entrypoint.sh` | 51 | Add hub_cache mkdir | HIGH | Low |
| `.docker/docker-entrypoint.sh` | 56-62 | Strengthen config init | HIGH | Medium |
| `.docker/docker-entrypoint.sh` | 68-73 | Implement symlink creation | HIGH | Medium |
| `.docker/docker-entrypoint.sh` | 99 | Keep CFG=/etc/crowdsec | NONE | None |
| `.docker/docker-entrypoint.sh` | 100 | Fix LOG variable | HIGH | Low |
| Verification checklist | N/A | Add 4 new tests | HIGH | None |
|
|
|
|
---
|
|
|
|
## Risk Assessment
|
|
|
|
### Low Risk Changes (Can be applied immediately)
|
|
- Dockerfile config template population
|
|
- LOG variable fix
|
|
- Hub cache directory creation
|
|
- CFG variable (no change)
|
|
|
|
### Medium Risk Changes (Require thorough testing)
|
|
- Symlink creation logic (fundamental behavior change)
|
|
- Error handling strengthening (may expose edge cases)
|
|
|
|
### High Risk Scenarios to Test
|
|
- Existing installations upgrading from old version
|
|
- Corrupted/incomplete config directories
|
|
- Simultaneous volume and config failures
|
|
- Cross-architecture compatibility (arm64 especially)
|
|
|
|
---
|
|
|
|
## Acceptance Criteria
|
|
|
|
All 11 verification tests must pass before merging:
|
|
- [ ] Fresh container start
|
|
- [ ] Container restart
|
|
- [ ] CrowdSec enable/disable
|
|
- [ ] Log file permissions
|
|
- [ ] LAPI readiness
|
|
- [ ] Hub updates
|
|
- [ ] Multi-arch compatibility
|
|
- [ ] Volume replacement
|
|
- [ ] Permission inheritance
|
|
- [ ] Config persistence
|
|
- [ ] Hub update persistence
|
|
|
|
---
|
|
|
|
## References
|
|
|
|
- Original issue: CrowdSec non-root migration
|
|
- Supervisor review: 2024-12-22
|
|
- Related files: `Dockerfile`, `.docker/docker-entrypoint.sh`
|
|
- Testing environment: Docker 24.x, volumes with uid 1000
|
|
|
|
---
|
|
|
|
# Historical Analysis: CrowdSec Reconciliation Failure Diagnostics
|
|
|
|
## Executive Summary
|
|
|
|
Investigation of why CrowdSec shows "not started" in the UI when it should **already be enabled**. This is NOT a first-time enable issue—it's a **reconciliation/runtime failure** after container restart or app startup.
|
|
|
|
---
|
|
|
|
## Problem Statement
|
|
|
|
User reports CrowdSec was previously enabled and working, but after container restart:
|
|
- UI shows CrowdSec as "not started"
|
|
- The setting in database says it should be enabled
|
|
- No obvious errors in the UI
|
|
|
|
---
|
|
|
|
## 1. Reconciliation Flow Overview
|
|
|
|
When Charon starts, `ReconcileCrowdSecOnStartup()` runs **asynchronously** (in a goroutine) to restore CrowdSec state.
|
|
|
|
### Flow Diagram
|
|
|
|
```
App Startup → go ReconcileCrowdSecOnStartup() → (async goroutine)
                    │
                    ▼
┌────────────────────────────────────────┐
│ 1. Validate: db != nil && exec != nil  │
└────────────────────────────────────────┘
                    │ (fail → silent return)
                    ▼
┌────────────────────────────────────────┐
│ 2. Check: SecurityConfig table exists  │
└────────────────────────────────────────┘
                    │ (no table → WARN + return)
                    ▼
┌────────────────────────────────────────┐
│ 3. Query: SecurityConfig record        │
└────────────────────────────────────────┘
                    │ (not found → auto-create from Settings)
                    │ (error → return)
                    ▼
┌────────────────────────────────────────┐
│ 4. Query: Settings table override      │
│    key = "security.crowdsec.enabled"   │
└────────────────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────┐
│ 5. Decide: Start if CrowdSecMode ==    │
│    "local" OR setting == "true"        │
└────────────────────────────────────────┘
                    │ (both false → INFO skip)
                    ▼
┌────────────────────────────────────────┐
│ 6. Validate: Binary exists at path     │
│    /usr/local/bin/crowdsec             │
└────────────────────────────────────────┘
                    │ (not found → ERROR + return)
                    ▼
┌────────────────────────────────────────┐
│ 7. Validate: Config dir exists         │
│    dataDir/config                      │
└────────────────────────────────────────┘
                    │ (not found → ERROR + return)
                    ▼
┌────────────────────────────────────────┐
│ 8. Check: Status (already running?)    │
└────────────────────────────────────────┘
                    │ (running → INFO + done)
                    │ (error → WARN + return!)
                    ▼
┌────────────────────────────────────────┐
│ 9. Start: CrowdSec process             │
└────────────────────────────────────────┘
                    │ (error → ERROR + return)
                    ▼
┌────────────────────────────────────────┐
│ 10. Verify: Wait 2s + check status     │
└────────────────────────────────────────┘
```
|
|
|
|
---
|
|
|
|
## 2. Most Likely Failure Points (Priority Order)
|
|
|
|
### 2.1 Binary Not Found ⭐ HIGH LIKELIHOOD
|
|
|
|
**Code:** `backend/internal/services/crowdsec_startup.go:117-120`
|
|
|
|
```go
|
|
if _, err := os.Stat(binPath); os.IsNotExist(err) {
|
|
logger.Log().WithField("path", binPath).Error("CrowdSec reconciliation: binary not found, cannot start")
|
|
return
|
|
}
|
|
```
|
|
|
|
**Diagnosis:**
|
|
```bash
|
|
docker exec <container> ls -la /usr/local/bin/crowdsec
|
|
docker exec <container> printenv CHARON_CROWDSEC_BIN
|
|
```
|
|
|
|
---
|
|
|
|
### 2.2 Config Directory Missing ⭐ HIGH LIKELIHOOD
|
|
|
|
**Code:** `backend/internal/services/crowdsec_startup.go:122-126`
|
|
|
|
```go
|
|
configPath := filepath.Join(dataDir, "config")
|
|
if _, err := os.Stat(configPath); os.IsNotExist(err) {
|
|
logger.Log().WithField("path", configPath).Error("CrowdSec reconciliation: config directory not found")
|
|
return
|
|
}
|
|
```
|
|
|
|
**Diagnosis:**
|
|
```bash
|
|
docker exec <container> ls -la /data/crowdsec/config/
|
|
docker exec <container> cat /data/crowdsec/config/config.yaml
|
|
```
|
|
|
|
---
|
|
|
|
### 2.3 Database State Mismatch ⭐ MEDIUM LIKELIHOOD
|
|
|
|
Two sources must be checked:
|
|
1. `security_configs.crowdsec_mode = "local"`
|
|
2. `settings.key = "security.crowdsec.enabled"` with `value = "true"`
|
|
|
|
If neither condition holds (the mode is not `"local"` and the setting is not `"true"`), reconciliation skips startup and logs only an INFO message.
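
The decision from step 5 of the flow diagram, restated as a shell predicate for use alongside the diagnostic queries. This is an illustration of the rule as described here, not the Go implementation; `should_start_crowdsec` is a hypothetical name.

```bash
# Illustrative restatement of the reconciliation decision: start when
# crowdsec_mode is "local" OR the settings override is "true".
should_start_crowdsec() {
    local mode="$1" setting="$2"
    [ "$mode" = "local" ] || [ "$setting" = "true" ]
}

# Possible use with values read from the database (paths as in this section):
#   mode=$(sqlite3 /data/charon.db "SELECT crowdsec_mode FROM security_configs LIMIT 1;")
#   setting=$(sqlite3 /data/charon.db "SELECT value FROM settings WHERE key = 'security.crowdsec.enabled';")
#   should_start_crowdsec "$mode" "$setting" && echo "reconciliation should start CrowdSec"
```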
|
|
|
|
**Diagnosis:**
|
|
```bash
|
|
docker exec <container> sqlite3 /data/charon.db "SELECT crowdsec_mode, enabled FROM security_configs LIMIT 1;"
|
|
docker exec <container> sqlite3 /data/charon.db "SELECT key, value FROM settings WHERE key LIKE '%crowdsec%';"
|
|
```
|
|
|
|
---
|
|
|
|
### 2.4 Stale PID File (PID Recycled) ⭐ MEDIUM LIKELIHOOD
|
|
|
|
**Code:** `backend/internal/api/handlers/crowdsec_exec.go:118-147`
|
|
|
|
Status check reads PID file, checks if process exists, then verifies `/proc/<pid>/cmdline` contains "crowdsec".
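
The same three-step check can be reproduced by hand when diagnosing a recycled PID. The sketch below is a shell approximation of the behavior described above, not the code in `crowdsec_exec.go`; `pid_is_crowdsec` is a hypothetical name and the function relies on Linux `/proc`.

```bash
# Illustrative shell version of the status check: PID file present, process
# alive, and its command line mentions the expected name. Linux-only (/proc).
pid_is_crowdsec() {
    local pid_file="$1" needle="${2:-crowdsec}" pid
    [ -f "$pid_file" ] || return 1                  # no PID file
    pid="$(tr -d '[:space:]' < "$pid_file")"
    case "$pid" in ''|*[!0-9]*) return 1 ;; esac     # not a number
    [ -d "/proc/$pid" ] || return 1                 # no such process
    # cmdline is NUL-separated; turn NULs into spaces before matching.
    tr '\0' ' ' < "/proc/$pid/cmdline" | grep -q -- "$needle"
}

# Inside the container:
#   pid_is_crowdsec /data/crowdsec/crowdsec.pid || echo "stale or missing PID file"
```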
|
|
|
|
**Diagnosis:**
|
|
```bash
|
|
docker exec <container> cat /data/crowdsec/crowdsec.pid
|
|
docker exec <container> pgrep -a crowdsec
|
|
```
|
|
|
|
---
|
|
|
|
### 2.5 Process Crashes After Start ⭐ MEDIUM LIKELIHOOD
|
|
|
|
**Code:** `backend/internal/services/crowdsec_startup.go:146-159`
|
|
|
|
After starting, waits 2 seconds and verifies. If crashed:
|
|
```
|
|
logger.Log().Error("CrowdSec reconciliation: process started but is no longer running - may have crashed")
|
|
```
|
|
|
|
**Diagnosis:**
|
|
```bash
|
|
# Try manual start to see errors
|
|
docker exec <container> /usr/local/bin/crowdsec -c /data/crowdsec/config/config.yaml
|
|
|
|
# Check for port conflicts (LAPI uses 8085)
|
|
docker exec <container> netstat -tlnp 2>/dev/null | grep 8085
|
|
```
|
|
|
|
---
|
|
|
|
### 2.6 Status Check Error (Silently Aborts) ⭐ LOW LIKELIHOOD
|
|
|
|
**Code:** `backend/internal/services/crowdsec_startup.go:129-134`
|
|
|
|
```go
|
|
if err != nil {
|
|
logger.Log().WithError(err).Warn("CrowdSec reconciliation: failed to check status")
|
|
return // ← Aborts without trying to start!
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## 3. Status Handler Analysis
|
|
|
|
The UI calls `GET /api/v1/admin/crowdsec/status`:
|
|
|
|
**Code:** `backend/internal/api/handlers/crowdsec_handler.go:313-333`
|
|
|
|
Returns `running: false` when:
|
|
- PID file doesn't exist
|
|
- PID doesn't correspond to a running process
|
|
- PID is running but `/proc/<pid>/cmdline` doesn't contain "crowdsec"
|
|
|
|
---
|
|
|
|
## 4. Diagnostic Commands Summary
|
|
|
|
```bash
# 1. Check binary
docker exec <container> ls -la /usr/local/bin/crowdsec

# 2. Check config directory
docker exec <container> ls -la /data/crowdsec/config/

# 3. Check database state
docker exec <container> sqlite3 /data/charon.db \
  "SELECT crowdsec_mode, enabled FROM security_configs LIMIT 1;"
docker exec <container> sqlite3 /data/charon.db \
  "SELECT key, value FROM settings WHERE key LIKE '%crowdsec%';"

# 4. Check PID file
docker exec <container> cat /data/crowdsec/crowdsec.pid 2>/dev/null || echo "No PID file"

# 5. Check running processes
docker exec <container> pgrep -a crowdsec || echo "Not running"

# 6. Check logs for reconciliation
docker logs <container> 2>&1 | grep -i "crowdsec reconciliation"

# 7. Try manual start
docker exec <container> /usr/local/bin/crowdsec \
  -c /data/crowdsec/config/config.yaml &

# 8. Check port conflicts
docker exec <container> netstat -tlnp 2>/dev/null | grep -E "8085|8080"
```
|
|
|
|
---
|
|
|
|
## 5. Log Messages to Look For
|
|
|
|
| Priority | Cause | Log Message |
|----------|-------|-------------|
| 1 | Binary missing | `"CrowdSec reconciliation: binary not found"` |
| 2 | Config missing | `"CrowdSec reconciliation: config directory not found"` |
| 3 | DB says disabled | `"CrowdSec reconciliation skipped: both SecurityConfig and Settings indicate disabled"` |
| 4 | Crashed after start | `"process started but is no longer running"` |
| 5 | Start failed | `"CrowdSec reconciliation: FAILED to start CrowdSec"` |
| 6 | Status check failed | `"failed to check status"` |
|
|
|
|
---
|
|
|
|
## 6. Key Timeouts
|
|
|
|
| Operation | Timeout | Location |
|-----------|---------|----------|
| Status check | 5 seconds | crowdsec_startup.go:128 |
| Start timeout | 30 seconds | crowdsec_startup.go:146 |
| Post-start delay | 2 seconds | crowdsec_startup.go:153 |
| Verification check | 5 seconds | crowdsec_startup.go:156 |
|
|
|
|
---
|
|
|
|
# Original Analysis: First-Time Enable Issues
|
|
|
|
## 1. Observed Browser Console Errors
|
|
|
|
```
|
|
- 401 Unauthorized on /api/v1/auth/me
|
|
- Multiple 400 Bad Request on /api/v1/settings/validate-url
|
|
- Auto-logging out due to inactivity
|
|
- Various ERR_NETWORK_CHANGED errors
|
|
- CrowdSec appears to not be running
|
|
```
|
|
|
|
---
|
|
|
|
## 2. Relevant Code Files and Flow Analysis
|
|
|
|
### 2.1 CrowdSec Startup Flow
|
|
|
|
#### Entry Point: Frontend Toggle
|
|
|
|
**File:** [frontend/src/pages/Security.tsx](../../../frontend/src/pages/Security.tsx#L147-L183)
|
|
|
|
```typescript
const crowdsecPowerMutation = useMutation({
  mutationFn: async (enabled: boolean) => {
    await updateSetting('security.crowdsec.enabled', enabled ? 'true' : 'false', 'security', 'bool')
    if (enabled) {
      toast.info('Starting CrowdSec... This may take up to 30 seconds')
      const result = await startCrowdsec()
      const status = await statusCrowdsec()
      if (!status.running) {
        await updateSetting('security.crowdsec.enabled', 'false', 'security', 'bool')
        throw new Error('CrowdSec process failed to start. Check server logs for details.')
      }
      return result
    } else {
      await stopCrowdsec()
      // ...
    }
  },
  // ...
})
```
|
|
|
|
#### API Client Configuration
|
|
|
|
**File:** [frontend/src/api/client.ts](../../../frontend/src/api/client.ts#L7-L11)
|
|
|
|
```typescript
|
|
const client = axios.create({
|
|
baseURL: '/api/v1',
|
|
withCredentials: true,
|
|
timeout: 30000, // 30 second timeout
|
|
});
|
|
```
|
|
|
|
**Issue Identified:** The frontend's **30-second timeout** equals the backend LAPI readiness timeout, which leaves no headroom. The start request also covers process launch and database updates, so the total can exceed 30 seconds and the frontend may give up before the backend responds.
|
|
|
|
#### Backend Start Handler
|
|
|
|
**File:** [backend/internal/api/handlers/crowdsec_handler.go](../../../backend/internal/api/handlers/crowdsec_handler.go#L175-L252)
|
|
|
|
Key timeouts in `Start()`:
|
|
|
|
- LAPI readiness polling: **30 seconds max** (line 229: `maxWait := 30 * time.Second`)
|
|
- Poll interval: **500ms** (line 230: `pollInterval := 500 * time.Millisecond`)
|
|
- Individual LAPI check: **2 seconds** (line 237: `context.WithTimeout(ctx, 2*time.Second)`)
|
|
|
|
```go
// Wait for LAPI to be ready (with timeout)
lapiReady := false
maxWait := 30 * time.Second
pollInterval := 500 * time.Millisecond
deadline := time.Now().Add(maxWait)

for time.Now().Before(deadline) {
	checkCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
	_, err := h.CmdExec.Execute(checkCtx, "cscli", args...)
	cancel()
	if err == nil {
		lapiReady = true
		break
	}
	time.Sleep(pollInterval)
}
```
|
|
|
|
#### Backend Executor (Process Management)
|
|
|
|
**File:** [backend/internal/api/handlers/crowdsec_exec.go](../../../backend/internal/api/handlers/crowdsec_exec.go)
|
|
|
|
The `DefaultCrowdsecExecutor.Start()` method (lines 39-66):
|
|
|
|
- Uses `exec.Command` (not `CommandContext`) - process is detached
|
|
- Sets `Setpgid: true` to create new process group
|
|
- Writes PID file synchronously
|
|
- Returns immediately after starting the process
|
|
|
|
```go
func (e *DefaultCrowdsecExecutor) Start(ctx context.Context, binPath, configDir string) (int, error) {
	configFile := filepath.Join(configDir, "config", "config.yaml")
	cmd := exec.Command(binPath, "-c", configFile)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Setpgid: true, // Create new process group
	}
	// ...
	if err := cmd.Start(); err != nil {
		return 0, err
	}
	// ... writes PID file
	go func() {
		_ = cmd.Wait()
		_ = os.Remove(e.pidFile(configDir))
	}()
	return pid, nil
}
```
|
|
|
|
#### Background Reconciliation
|
|
|
|
**File:** [backend/internal/services/crowdsec_startup.go](../../../backend/internal/services/crowdsec_startup.go)
|
|
|
|
Key timeouts in `ReconcileCrowdSecOnStartup()`:
|
|
|
|
- Status check timeout: **5 seconds** (line 139)
|
|
- Start timeout: **30 seconds** (line 150)
|
|
- Verification delay: **2 seconds** (line 159: `time.Sleep(2 * time.Second)`)
|
|
- Verification check timeout: **5 seconds** (line 161)
|
|
|
|
```go
|
|
// Start context with 30 second timeout
|
|
startCtx, startCancel := context.WithTimeout(context.Background(), 30*time.Second)
|
|
defer startCancel()
|
|
|
|
newPid, err := executor.Start(startCtx, binPath, dataDir)
|
|
// ...
|
|
|
|
// VERIFY: Wait briefly and confirm process is actually running
|
|
time.Sleep(2 * time.Second)
|
|
|
|
verifyCtx, verifyCancel := context.WithTimeout(context.Background(), 5*time.Second)
|
|
defer verifyCancel()
|
|
```
|
|
|
|
---
|
|
|
|
## 3. Identified Potential Root Causes
|
|
|
|
### 3.1 Timeout Race Condition (HIGH PROBABILITY)
|
|
|
|
The frontend timeout (30s) and backend LAPI polling timeout (30s) are identical. Combined with:
|
|
|
|
- Initial process start time
|
|
- Settings database update
|
|
- SecurityConfig database update
|
|
- Network latency
|
|
|
|
**Total time could easily exceed 30 seconds**, causing the frontend to timeout before the backend responds.
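
A rough upper bound for the polling loop alone makes the overlap concrete. The arithmetic below uses only the three values quoted from `Start()` and ignores process start, database updates, and network latency, which add to the total.

```bash
# Upper bound on the duration of the LAPI polling loop, in milliseconds.
# Values are the ones quoted from Start(): 30 s deadline, 2 s per check,
# 500 ms between checks.
max_wait_ms=30000
check_timeout_ms=2000
poll_interval_ms=500
frontend_timeout_ms=30000

# The deadline is only tested at the top of the loop, so an iteration that
# begins just before it can still run a full check plus one sleep.
worst_case_ms=$(( max_wait_ms + check_timeout_ms + poll_interval_ms ))
overrun_ms=$(( worst_case_ms - frontend_timeout_ms ))

echo "polling loop bound: ${worst_case_ms} ms (up to ${overrun_ms} ms past the frontend timeout)"
```

The loop by itself can therefore run for up to about 32.5 seconds, which is already past the 30-second axios timeout.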
|
|
|
|
### 3.2 CrowdSec Binary/Config Not Found
|
|
|
|
In [crowdsec_startup.go](../../../backend/internal/services/crowdsec_startup.go#L124-L135):
|
|
|
|
```go
|
|
// VALIDATE: Ensure binary exists
|
|
if _, err := os.Stat(binPath); os.IsNotExist(err) {
|
|
logger.Log().WithField("path", binPath).Error("CrowdSec reconciliation: binary not found")
|
|
return
|
|
}
|
|
|
|
// VALIDATE: Ensure config directory exists
|
|
configPath := filepath.Join(dataDir, "config")
|
|
if _, err := os.Stat(configPath); os.IsNotExist(err) {
|
|
logger.Log().WithField("path", configPath).Error("CrowdSec reconciliation: config directory not found")
|
|
return
|
|
}
|
|
```
|
|
|
|
**Check:** The binary path defaults to `/usr/local/bin/crowdsec` (from `routes.go` line 292) and the config directory defaults to `data/crowdsec`. If either is missing, reconciliation logs an error and returns without starting CrowdSec, so nothing is surfaced in the UI.
|
|
|
|
### 3.3 LAPI Never Becomes Ready
|
|
|
|
The handler waits for `cscli lapi status` to succeed. If CrowdSec starts but LAPI never initializes (e.g., database issues, missing configuration), the handler will timeout.
|
|
|
|
### 3.4 Authentication Issues (401 on /auth/me)
|
|
|
|
The 401 errors suggest the user's session is expiring during the long-running operation. This is likely a **symptom, not the cause**:
|
|
|
|
**File:** [frontend/src/api/client.ts](../../../frontend/src/api/client.ts#L25-L33)
|
|
|
|
```typescript
|
|
client.interceptors.response.use(
|
|
(response) => response,
|
|
(error) => {
|
|
if (error.response?.status === 401) {
|
|
console.warn('Authentication failed:', error.config?.url);
|
|
}
|
|
return Promise.reject(error);
|
|
}
|
|
);
|
|
```
|
|
|
|
The session timeout or network interruption during the 30+ second CrowdSec startup could cause parallel requests to `/auth/me` to fail.
|
|
|
|
### 3.5 ERR_NETWORK_CHANGED
|
|
|
|
This indicates network connectivity issues on the client side. If the network changes during the long-running request, it will fail. This is external to the application but exacerbated by long timeouts.
|
|
|
|
---
|
|
|
|
## 4. Configuration Defaults
|
|
|
|
| Setting | Default Value | Source |
|---------|---------------|--------|
| CrowdSec Binary | `/usr/local/bin/crowdsec` | `CHARON_CROWDSEC_BIN` env or hardcoded |
| CrowdSec Config Dir | `data/crowdsec` | `CHARON_CROWDSEC_CONFIG_DIR` env |
| CrowdSec Mode | `disabled` | `CERBERUS_SECURITY_CROWDSEC_MODE` env |
| Frontend Timeout | 30 seconds | `client.ts` |
| LAPI Wait Timeout | 30 seconds | `crowdsec_handler.go` |
| Process Start Timeout | 30 seconds | `crowdsec_startup.go` |
|
|
|
|
---
|
|
|
|
## 5. Remediation Plan
|
|
|
|
### Phase 1: Immediate Fixes (Timeout Handling)
|
|
|
|
#### 5.1.1 Increase Frontend Timeout for CrowdSec Operations
|
|
|
|
**File:** `frontend/src/api/crowdsec.ts`
|
|
|
|
Create a dedicated request with extended timeout for CrowdSec start:
|
|
|
|
```typescript
|
|
export async function startCrowdsec(): Promise<{ status: string; pid: number; lapi_ready?: boolean }> {
|
|
const resp = await client.post('/admin/crowdsec/start', {}, {
|
|
timeout: 60000, // 60 second timeout for startup operations
|
|
})
|
|
return resp.data
|
|
}
|
|
```
|
|
|
|
#### 5.1.2 Add Progress/Status Feedback
|
|
|
|
Implement polling-based status check instead of waiting for single long request:
|
|
|
|
1. Backend: Return immediately after starting process, with status "starting"
|
|
2. Frontend: Poll status endpoint until "running" or timeout
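
The polling pattern itself is language-independent. As a minimal sketch in shell (the real change would live in the TypeScript frontend), a generic retry loop looks like the following; `poll_until` is a hypothetical helper name.

```bash
# Hypothetical helper: run a check command repeatedly until it succeeds or the
# attempt budget is used up. Returns 0 on success, 1 if every attempt failed.
poll_until() {
    local attempts="$1" interval="$2"
    shift 2
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        "$@" && return 0
        i=$(( i + 1 ))
        sleep "$interval"
    done
    return 1
}

# Example from inside the container, using the LAPI check named in this document:
#   poll_until 60 1 cscli lapi status || echo "LAPI not ready after 60 attempts"
```

Bounding the number of attempts rather than holding one long request open keeps each individual call well inside the client timeout.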
|
|
|
|
#### 5.1.3 Improve Error Messages
|
|
|
|
**File:** `backend/internal/api/handlers/crowdsec_handler.go`
|
|
|
|
Add detailed error responses:
|
|
|
|
```go
|
|
if !lapiReady {
|
|
logger.Log().WithField("pid", pid).Warn("CrowdSec started but LAPI not ready within timeout")
|
|
c.JSON(http.StatusOK, gin.H{
|
|
"status": "started",
|
|
"pid": pid,
|
|
"lapi_ready": false,
|
|
"warning": "Process started but LAPI initialization may take additional time",
|
|
"next_step": "Poll /admin/crowdsec/status until lapi_ready is true",
|
|
})
|
|
return
|
|
}
|
|
```
|
|
|
|
### Phase 2: Diagnostic Improvements
|
|
|
|
#### 5.2.1 Add Health Check Endpoint
|
|
|
|
Create `/admin/crowdsec/health` that returns:
|
|
|
|
- Binary path and existence check
|
|
- Config directory and existence check
|
|
- Process status
|
|
- LAPI status
|
|
- Last error (if any)
|
|
|
|
#### 5.2.2 Enhanced Logging
|
|
|
|
Add structured logging for all CrowdSec operations with correlation IDs.
|
|
|
|
### Phase 3: Long-term Fixes
|
|
|
|
#### 5.3.1 Async Startup Pattern
|
|
|
|
Convert to async pattern:
|
|
|
|
1. `POST /admin/crowdsec/start` returns immediately with job ID
|
|
2. `GET /admin/crowdsec/jobs/{id}` returns job status
|
|
3. Frontend polls job status with exponential backoff
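
For step 3, the delay schedule is the only non-obvious part. The sketch below shows one common form, doubling with a cap; the base, cap, and helper name `backoff_delays` are assumptions for illustration, not values taken from the codebase.

```bash
# Hypothetical helper: print the delay (in seconds) before each poll, doubling
# from a base value and capping at a maximum.
backoff_delays() {
    local base="$1" max="$2" count="$3"
    local delay="$base" i=0
    while [ "$i" -lt "$count" ]; do
        echo "$delay"
        delay=$(( delay * 2 ))
        if [ "$delay" -gt "$max" ]; then
            delay="$max"
        fi
        i=$(( i + 1 ))
    done
}

# backoff_delays 1 8 6 prints 1, 2, 4, 8, 8, 8 (one value per line).
```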
|
|
|
|
#### 5.3.2 WebSocket Status Updates
|
|
|
|
Use existing WebSocket infrastructure to push status updates during startup.
|
|
|
|
---
|
|
|
|
## 6. Diagnostic Commands
|
|
|
|
To investigate the issue on the running container:
|
|
|
|
```bash
# Check if CrowdSec binary exists
ls -la /usr/local/bin/crowdsec

# Check CrowdSec config directory
ls -la /app/data/crowdsec/config/

# Check if CrowdSec is running
pgrep -f crowdsec
ps aux | grep crowdsec

# Check CrowdSec logs (if running)
cat /var/log/crowdsec.log

# Test LAPI status
cscli lapi status

# Check PID file
cat /app/data/crowdsec/crowdsec.pid

# Check database for CrowdSec settings
sqlite3 /app/data/charon.db "SELECT * FROM settings WHERE key LIKE '%crowdsec%';"
sqlite3 /app/data/charon.db "SELECT * FROM security_configs;"
```
|
|
|
|
---
|
|
|
|
## 7. Summary
|
|
|
|
| Issue | Probability | Impact | Fix Complexity |
|-------|-------------|--------|----------------|
| Timeout race condition | HIGH | Startup fails | Low |
| Missing binary/config | MEDIUM | Startup fails silently | Low |
| LAPI initialization slow | MEDIUM | Timeout | Medium |
| Session expiry during startup | LOW | User sees 401 | Low |
| Network instability | LOW | Request fails | N/A (external) |
|
|
|
|
**Recommended Immediate Action:** Increase frontend timeout for CrowdSec start operations to 60 seconds and add polling-based status verification.
|
|
|
|
---
|
|
|
|
## 8. Files to Modify
|
|
|
|
| File | Change |
|------|--------|
| `frontend/src/api/crowdsec.ts` | Extend timeout for start operation |
| `frontend/src/pages/Security.tsx` | Add polling for status after start |
| `backend/internal/api/handlers/crowdsec_handler.go` | Return partial success, add health endpoint |
| `backend/internal/services/crowdsec_startup.go` | Add more diagnostic logging |
|
|
|
|
---
|
|
|
|
*Investigation completed: December 22, 2025*
|
|
*Author: GitHub Copilot (Research Mode)*
|