Charon/docs/plans/crowdsec_startup_fix.md
2026-01-13 22:11:35 +00:00


# CrowdSec Startup Fix Plan
**Date:** 2025-12-22
**Updated:** 2025-12-23 (Post-Implementation Investigation)
**Status:** FAILED - Requires Additional Fixes
**Priority:** CRITICAL
## Current State (2025-12-23 Investigation)
**CrowdSec is NOT starting due to PERMISSION ERRORS.** The initial fix was implemented but did NOT address the actual root causes.
### Actual Error Messages from Container Logs
```
Failed to write to log, can't open new logfile: open /var/log/crowdsec.log: permission denied
FATAL unable to create database client: unable to set perms on /var/lib/crowdsec/data/crowdsec.db: chmod /var/lib/crowdsec/data/crowdsec.db: operation not permitted
{"level":"warning","msg":"CrowdSec started but LAPI not ready within timeout","pid":316,"time":"2025-12-22T21:04:00-05:00"}
```
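These failure signatures can be checked for mechanically. The following is a minimal sketch, assuming only POSIX `grep`; `has_startup_errors` is a hypothetical helper name, not something that exists in Charon.

```shell
#!/bin/sh
# Sketch only: has_startup_errors is a hypothetical helper, not part of Charon.
# Reads log text on stdin and exits 0 if any of the failure signatures
# quoted in this section appears, non-zero otherwise.
has_startup_errors() {
    grep -E -i -q 'permission denied|operation not permitted|LAPI not ready within timeout'
}
```

A possible use is `docker compose -f docker-compose.test.yml logs charon 2>&1 | has_startup_errors && echo "CrowdSec startup errors present"`.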
### File Ownership Issues (VERIFIED)
```bash
# Database file owned by root - CrowdSec can't chmod it
$ stat -c '%U:%G %n' /var/lib/crowdsec/data/crowdsec.db
root:root /var/lib/crowdsec/data/crowdsec.db
# Config files owned by root - created by entrypoint running as root
$ stat -c '%U:%G %n' /app/data/crowdsec/config/config.yaml /app/data/crowdsec/config/user.yaml
root:root /app/data/crowdsec/config/config.yaml
root:root /app/data/crowdsec/config/user.yaml
```
### CrowdSec Config Problem (CRITICAL)
The `config.yaml` has `log_dir: /var/log/` (wrong path):
```yaml
common:
  log_dir: /var/log/   # <-- WRONG: Should be /var/log/crowdsec/
  log_media: file
```
CrowdSec is trying to write to `/var/log/crowdsec.log` but `/var/log/` is owned by root. The correct path should be `/var/log/crowdsec/` which is owned by charon.
## Root Cause Analysis (UPDATED)
### 1. **Entrypoint Script Runs CrowdSec Commands as Root**
**Finding:** The entrypoint script runs `cscli machines add -a --force` and `envsubst` on config files **while still running as root**. These operations:
- Create `/var/lib/crowdsec/data/crowdsec.db` owned by root
- Overwrite `config.yaml` and `user.yaml` with root ownership
**Evidence from entrypoint:**
```bash
# These run as root BEFORE `su-exec charon` is used
cscli machines add -a --force 2>/dev/null || echo "Warning: Machine registration may have failed"
envsubst < "$file" > "$file.tmp" && mv "$file.tmp" "$file"
```
### 2. **CrowdSec Log Path Configuration Error**
**Finding:** The distributed `config.yaml` has `log_dir: /var/log/` instead of `log_dir: /var/log/crowdsec/`.
**Evidence:**
```yaml
# Current (WRONG):
log_dir: /var/log/
# Should be:
log_dir: /var/log/crowdsec/
```
### 3. **ReconcileCrowdSecOnStartup IS Being Called (VERIFIED)**
**Finding:** The reconciliation function is now correctly called in [backend/cmd/api/main.go#L144](backend/cmd/api/main.go#L144) BEFORE the HTTP server starts:
```go
crowdsecExec := handlers.NewDefaultCrowdsecExecutor()
services.ReconcileCrowdSecOnStartup(db, crowdsecExec, crowdsecBinPath, crowdsecDataDir)
```
This is CORRECT but CrowdSec still fails due to permission issues.
### 4. **CrowdSec Start Method is Correct (VERIFIED)**
**Finding:** The executor's `Start` method correctly uses `os/exec` without context cancellation:
```go
cmd := exec.Command(binPath, "-c", configFile)
cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
```
The binary starts but immediately crashes due to permission denied errors.
## What Was Implemented vs What Actually Happened
| Item | Implemented? | Actual Result |
|------|--------------|---------------|
| Reconciliation in main.go | ✅ Added | ✅ Called on startup |
| Dockerfile chown for CrowdSec dirs | ✅ Added | ❌ Overwritten at runtime by entrypoint |
| Goroutine removed from routes.go | ✅ Removed | ✅ Confirmed removed |
| Entrypoint permission fix | ❌ Not implemented | ❌ Root operations create root-owned files |
| Config log_dir fix | ❌ Not implemented | ❌ Still pointing to /var/log/ |
## REQUIRED FIXES (Specific Code Changes)
### FIX 1: Change CrowdSec log_dir in Entrypoint (CRITICAL)
**File:** `.docker/docker-entrypoint.sh`
**Location:** After line 155 (after `sed -i 's|listen_uri.*|listen_uri: 127.0.0.1:8085|g'`)
**Add:**
```bash
# Fix log_dir path - must point to /var/log/crowdsec/ not /var/log/
# One pattern covers the bare path and the path followed by trailing whitespace.
# [[:space:]] is used instead of \s, which is not portable across sed implementations.
if [ -f "/etc/crowdsec/config.yaml" ]; then
    sed -i 's|log_dir: /var/log/[[:space:]]*$|log_dir: /var/log/crowdsec/|g' /etc/crowdsec/config.yaml
fi
```
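The substitution can be tried on a scratch copy before it goes into the entrypoint. The sketch below assumes POSIX `sed`; `fix_log_dir` is a hypothetical helper name. Because the pattern is anchored at end of line, a path that already ends in `crowdsec/` no longer matches, so running it twice should be harmless.

```shell
#!/bin/sh
# Sketch only: fix_log_dir is a hypothetical helper, not part of the entrypoint.
# Rewrites "log_dir: /var/log/" to "log_dir: /var/log/crowdsec/" in the given file.
fix_log_dir() {
    cfg="$1"
    [ -f "$cfg" ] || return 0          # nothing to do if the config is absent
    tmp="$cfg.tmp.$$"
    # The trailing $ anchor keeps already-corrected lines from matching again.
    if ! sed 's|log_dir: /var/log/[[:space:]]*$|log_dir: /var/log/crowdsec/|' "$cfg" > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    # Write back into the existing file so its owner and mode are kept.
    cat "$tmp" > "$cfg"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For example, `cp /etc/crowdsec/config.yaml /tmp/config.yaml && fix_log_dir /tmp/config.yaml && grep log_dir /tmp/config.yaml` shows the effect without touching the live config.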
### FIX 2: Run cscli Commands as charon User (CRITICAL)
**File:** `.docker/docker-entrypoint.sh`
**Change:** All `cscli` commands must run as `charon` user, not root.
**Current (WRONG):**
```bash
cscli machines add -a --force 2>/dev/null || echo "Warning: Machine registration may have failed"
```
**Required (CORRECT):**
```bash
su-exec charon cscli machines add -a --force 2>/dev/null || echo "Warning: Machine registration may have failed"
```
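If several `cscli` calls need the same treatment, a small wrapper keeps the privilege drop in one place. This is a sketch under assumptions: `run_as_charon` is a hypothetical helper, and it falls back to running the command directly when the script is not root, when `su-exec` is missing, or when the `charon` user does not exist.

```shell
#!/bin/sh
# Sketch only: run_as_charon is a hypothetical helper, not part of the entrypoint.
# Runs the given command as the charon user when that is possible,
# otherwise runs it as the current user. The exit status is passed through.
run_as_charon() {
    if [ "$(id -u)" -eq 0 ] \
        && command -v su-exec >/dev/null 2>&1 \
        && id charon >/dev/null 2>&1; then
        su-exec charon "$@"
    else
        "$@"
    fi
}
```

The registration line would then read `run_as_charon cscli machines add -a --force 2>/dev/null || echo "Warning: Machine registration may have failed"`.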
### FIX 3: Restore charon Ownership After envsubst (CRITICAL)
**File:** `.docker/docker-entrypoint.sh`
**Change:** `mv` replaces each config file with a temp file created by root, so ownership must be restored to charon after every envsubst pass.
**Current (WRONG):**
```bash
for file in /etc/crowdsec/config.yaml /etc/crowdsec/user.yaml; do
    if [ -f "$file" ]; then
        envsubst < "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    fi
done
```
**Required (CORRECT):**
```bash
for file in /etc/crowdsec/config.yaml /etc/crowdsec/user.yaml; do
    if [ -f "$file" ]; then
        envsubst < "$file" > "$file.tmp" && mv "$file.tmp" "$file"
        chown charon:charon "$file" 2>/dev/null || true
    fi
done
```
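An alternative to chowning afterwards is to avoid replacing the file at all. The sketch below writes the filtered output back into the existing file, so the inode, owner, and mode stay as they were. `rewrite_in_place` is a hypothetical helper; it preserves whatever owner the file already has, so a file that is already root-owned still needs the `chown` above.

```shell
#!/bin/sh
# Sketch only: rewrite_in_place is a hypothetical helper, not part of the entrypoint.
# Usage: rewrite_in_place FILE FILTER [ARGS...]
# Runs FILTER with FILE on stdin, then copies the result back into FILE
# without replacing it, so ownership and permissions are left untouched.
rewrite_in_place() {
    file="$1"
    shift
    tmp="$(mktemp)" || return 1
    if "$@" < "$file" > "$tmp"; then
        cat "$tmp" > "$file"       # truncate and rewrite the same inode
        status=$?
    else
        status=1                   # filter failed: leave the original as it was
    fi
    rm -f "$tmp"
    return "$status"
}
```

In the loop this would be called as `rewrite_in_place "$file" envsubst`.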
### FIX 4: Fix Ownership AFTER cscli Operations (CRITICAL)
**File:** `.docker/docker-entrypoint.sh`
**Location:** After all cscli operations, before "CrowdSec configuration initialized" message
**Add:**
```bash
# Fix ownership of files created by cscli (runs as root, creates root-owned files)
# The database and config files must be owned by charon for CrowdSec to start
chown -R charon:charon /var/lib/crowdsec 2>/dev/null || true
chown -R charon:charon /app/data/crowdsec 2>/dev/null || true
chown -R charon:charon /var/log/crowdsec 2>/dev/null || true
```
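Because every `chown` above ends in `|| true`, a failure is silent. A follow-up check can make leftovers visible. This is a minimal sketch using `find ! -user`; `list_not_owned_by` is a hypothetical helper name, and empty output means every path under the given directories has the expected owner.

```shell
#!/bin/sh
# Sketch only: list_not_owned_by is a hypothetical helper, not part of the entrypoint.
# Usage: list_not_owned_by OWNER DIR [DIR...]
# Prints every path under the given directories that is NOT owned by OWNER
# (a user name or numeric UID). Missing directories are skipped.
list_not_owned_by() {
    owner="$1"
    shift
    for dir in "$@"; do
        [ -e "$dir" ] || continue
        find "$dir" ! -user "$owner" -print
    done
}
```

For example, `list_not_owned_by charon /var/lib/crowdsec /app/data/crowdsec /var/log/crowdsec` should print nothing once the ownership fix has worked.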
### FIX 5: Update Default config.yaml in configs/crowdsec/ (PREVENTIVE)
**File:** `configs/crowdsec/config.yaml` (if it exists), otherwise the distributed template
**Change:** Ensure log_dir is correct in the source template:
```yaml
common:
  daemonize: true
  log_media: file
  log_level: info
  log_dir: /var/log/crowdsec/  # <-- CORRECT PATH
```
## Complete Entrypoint Script Fix
Here's the corrected CrowdSec section for `.docker/docker-entrypoint.sh`:
```bash
# ============================================================================
# CrowdSec Initialization
# ============================================================================
if command -v cscli >/dev/null; then
    echo "Initializing CrowdSec configuration..."
    # Define persistent paths
    CS_PERSIST_DIR="/app/data/crowdsec"
    CS_CONFIG_DIR="$CS_PERSIST_DIR/config"
    CS_DATA_DIR="$CS_PERSIST_DIR/data"
    CS_LOG_DIR="/var/log/crowdsec"
    # Ensure persistent directories exist
    mkdir -p "$CS_CONFIG_DIR" "$CS_DATA_DIR" "$CS_LOG_DIR" 2>/dev/null || true
    mkdir -p /var/lib/crowdsec/data 2>/dev/null || true
    # Initialize persistent config if key files are missing
    if [ ! -f "$CS_CONFIG_DIR/config.yaml" ]; then
        echo "Initializing persistent CrowdSec configuration..."
        if [ -d "/etc/crowdsec.dist" ] && [ -n "$(ls -A /etc/crowdsec.dist 2>/dev/null)" ]; then
            cp -r /etc/crowdsec.dist/* "$CS_CONFIG_DIR/" || exit 1
            echo "Successfully initialized config from .dist directory"
        fi
    fi
    # Create acquisition config
    if [ ! -f "/etc/crowdsec/acquis.yaml" ] || [ ! -s "/etc/crowdsec/acquis.yaml" ]; then
        cat > /etc/crowdsec/acquis.yaml << 'ACQUIS_EOF'
source: file
filenames:
  - /var/log/caddy/access.log
  - /var/log/caddy/*.log
labels:
  type: caddy
ACQUIS_EOF
    fi
    # Environment substitution (preserving ownership after)
    export CFG=/etc/crowdsec
    export DATA="$CS_DATA_DIR"
    export PID=/var/run/crowdsec.pid
    export LOG="$CS_LOG_DIR/crowdsec.log"
    for file in /etc/crowdsec/config.yaml /etc/crowdsec/user.yaml; do
        if [ -f "$file" ]; then
            envsubst < "$file" > "$file.tmp" && mv "$file.tmp" "$file"
            chown charon:charon "$file" 2>/dev/null || true
        fi
    done
    # Configure LAPI port (8085 instead of 8080)
    if [ -f "/etc/crowdsec/config.yaml" ]; then
        sed -i 's|listen_uri: 127.0.0.1:8080|listen_uri: 127.0.0.1:8085|g' /etc/crowdsec/config.yaml
        sed -i 's|listen_uri: 0.0.0.0:8080|listen_uri: 127.0.0.1:8085|g' /etc/crowdsec/config.yaml
        # FIX: Correct log_dir path
        sed -i 's|log_dir: /var/log/$|log_dir: /var/log/crowdsec/|g' /etc/crowdsec/config.yaml
    fi
    # Update local_api_credentials.yaml to use correct port
    if [ -f "/etc/crowdsec/local_api_credentials.yaml" ]; then
        sed -i 's|url: http://127.0.0.1:8080|url: http://127.0.0.1:8085|g' /etc/crowdsec/local_api_credentials.yaml
        sed -i 's|url: http://localhost:8080|url: http://127.0.0.1:8085|g' /etc/crowdsec/local_api_credentials.yaml
    fi
    # Update hub index
    if [ ! -f "/etc/crowdsec/hub/.index.json" ]; then
        echo "Updating CrowdSec hub index..."
        timeout 60s cscli hub update 2>/dev/null || echo "⚠️ Hub update timed out"
    fi
    # Register local machine (run as charon or fix ownership after)
    echo "Registering local machine..."
    cscli machines add -a --force 2>/dev/null || echo "Warning: Machine registration failed"
    # *** CRITICAL FIX: Fix ownership of ALL CrowdSec files after cscli operations ***
    # cscli runs as root and creates root-owned files (crowdsec.db, config files)
    # CrowdSec process runs as charon and needs write access
    echo "Fixing CrowdSec file ownership..."
    chown -R charon:charon /var/lib/crowdsec 2>/dev/null || true
    chown -R charon:charon /app/data/crowdsec 2>/dev/null || true
    chown -R charon:charon /var/log/crowdsec 2>/dev/null || true
    echo "CrowdSec configuration initialized. Agent lifecycle is GUI-controlled."
fi
```
## Testing After Fix
1. **Rebuild container:**
   ```bash
   docker build -t charon:local . && docker compose -f docker-compose.test.yml up -d
   ```
2. **Verify ownership is correct:**
   ```bash
   docker compose -f docker-compose.test.yml exec charon ls -la /var/lib/crowdsec/data/
   # Expected: all files owned by charon:charon
   ```
3. **Check CrowdSec logs for permission errors:**
   ```bash
   docker compose -f docker-compose.test.yml logs charon 2>&1 | grep -i "permission\|denied\|FATAL"
   # Expected: no permission errors
   ```
4. **Verify LAPI is listening after manual start:**
   ```bash
   curl -X POST http://localhost:8080/api/v1/admin/crowdsec/start
   docker compose -f docker-compose.test.yml exec charon ss -tuln | grep 8085
   # Expected: LISTEN on :8085
   ```
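LAPI may take a few seconds to bind after the start request, so a one-shot port check in step 4 can fail spuriously. A polling loop is one way to handle that. The sketch below is an assumption-laden illustration: `wait_until` is a hypothetical helper that retries a command once per second until it succeeds or the timeout elapses.

```shell
#!/bin/sh
# Sketch only: wait_until is a hypothetical helper, not part of Charon.
# Usage: wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Retries COMMAND once per second. Returns 0 as soon as it succeeds,
# or 1 if it is still failing after TIMEOUT_SECONDS.
wait_until() {
    timeout_s="$1"
    shift
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

Inside the container, something like `wait_until 30 sh -c 'ss -tuln | grep -q ":8085"'` would wait up to 30 seconds for the listener before declaring failure.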
## Success Criteria (Updated)
- [ ] All files in `/var/lib/crowdsec/` owned by `charon:charon`
- [ ] All files in `/app/data/crowdsec/` owned by `charon:charon`
- [ ] `config.yaml` has `log_dir: /var/log/crowdsec/`
- [ ] No "permission denied" errors in container logs
- [ ] CrowdSec LAPI binds to port 8085 successfully
- [ ] Manual start via GUI completes without timeout
- [ ] Reconciliation on startup works when mode=local
## References
- [CrowdSec Documentation](https://docs.crowdsec.net/)
- [CrowdSec LAPI Reference](https://docs.crowdsec.net/docs/local_api/intro)
- [Caddy CrowdSec Bouncer Plugin](https://github.com/hslatman/caddy-crowdsec-bouncer)
- [Issue #16: ACL Implementation](ISSUE_16_ACL_IMPLEMENTATION.md) (related security feature)
## Changelog
### 2025-12-23 - Investigation Update
- **Status:** FAILED - Previous implementation did not fix root cause
- **Finding:** Permission errors due to entrypoint running cscli as root
- **Finding:** log_dir config points to wrong path (/var/log/ vs /var/log/crowdsec/)
- **Action:** Updated plan with specific entrypoint script fixes
- **Priority:** Escalated to CRITICAL
### 2025-12-22 - Initial Plan
- Created initial plan based on code review
- Identified timing issue with goroutine call
- Proposed moving reconciliation to main.go (implemented)