chore: clean .gitignore cache
@@ -1,462 +0,0 @@
# CrowdSec Integration Final Validation Report

**Date:** December 15, 2025
**Validator:** QA_Security Agent
**Status:** ⚠️ **CRITICAL ISSUE FOUND**

## Executive Summary

The CrowdSec integration implementation has a **critical bug** that prevents the CrowdSec LAPI (Local API) from starting after container restarts. While the bouncer registration and configuration are correct, a stale PID file causes the reconciliation logic to incorrectly believe CrowdSec is already running, preventing startup.

---

## Test Results

### 1. ⚠️ CrowdSec Integration Test (Partial Pass)

**Test Command:** `scripts/crowdsec_startup_test.sh`

**Results:**

- ✅ No fatal 'no datasource enabled' error
- ❌ **LAPI health check failed** (port 8085 not responding)
- ✅ Acquisition config exists with datasource definition
- ✅ Parsers check passed (with warning)
- ✅ Scenarios check passed (with warning)
- ✅ CrowdSec process check passed (false positive)

**Score:** 5/6 checks passed, but **critical failure** in LAPI health

**Root Cause Analysis:**
The CrowdSec process (PID 3469) **was** running during initial container startup and functioned correctly. However, after a container restart:

1. A stale PID file `/app/data/crowdsec/crowdsec.pid` contains PID `51`
2. PID 51 does not exist in the process table
3. The reconciliation logic checks whether the PID file exists and assumes CrowdSec is running
4. **No validation** confirms that the PID in the file corresponds to an actual running process
5. CrowdSec LAPI never starts, so the bouncer cannot connect

**Evidence:**

```bash
# PID file shows 51
$ docker exec charon cat /app/data/crowdsec/crowdsec.pid
51

# But no process with PID 51 exists
$ docker exec charon ps aux | grep 51 | grep -v grep
(no results)

# Reconciliation log incorrectly reports "already running"
{"level":"info","msg":"CrowdSec reconciliation: already running","pid":51,"time":"2025-12-15T16:14:44-05:00"}
```

**Bouncer Errors:**

```
{"level":"error","logger":"crowdsec","msg":"auth-api: auth with api key failed return nil response,
error: dial tcp 127.0.0.1:8085: connect: connection refused","instance_id":"2977e81e"}
```

---

### 2. ❌ Traffic Blocking Validation (FAILED)

**Test Commands:**

```bash
# Added test ban
$ docker exec charon cscli decisions add --ip 203.0.113.99 --duration 10m --type ban --reason "Test ban for QA validation"
level=info msg="Decision successfully added"

# Verified ban exists
$ docker exec charon cscli decisions list
+----+--------+-----------------+----------------------------+--------+---------+----+--------+------------+----------+
| ID | Source | Scope:Value     | Reason                     | Action | Country | AS | Events | expiration | Alert ID |
+----+--------+-----------------+----------------------------+--------+---------+----+--------+------------+----------+
| 1  | cscli  | Ip:203.0.113.99 | Test ban for QA validation | ban    |         |    | 1      | 9m59s      | 1        |
+----+--------+-----------------+----------------------------+--------+---------+----+--------+------------+----------+

# Tested blocked traffic
$ curl -H "X-Forwarded-For: 203.0.113.99" http://localhost:8080/
< HTTP/1.1 200 OK  # ❌ SHOULD BE 403 Forbidden
```

**Status:** ❌ **FAILED** - Traffic NOT blocked

**Root Cause:**

- CrowdSec LAPI is not running (see Test #1)
- Caddy bouncer cannot retrieve decisions from LAPI
- Without active decisions, all traffic passes through

**Bouncer Status (Before LAPI Failure):**

```
----------------------------------------------------------------------------------------------
 Name           IP Address  Valid  Last API pull         Type              Version  Auth Type
----------------------------------------------------------------------------------------------
 caddy-bouncer  127.0.0.1   ✔️     2025-12-15T21:14:03Z  caddy-cs-bouncer  v0.9.2   api-key
----------------------------------------------------------------------------------------------
```

**Note:** When LAPI was operational (initially), the bouncer successfully authenticated and pulled decisions. The blocking failure is purely due to LAPI unavailability after restart.

---

### 3. ✅ Regression Tests

#### Backend Tests

**Command:** `cd backend && go test ./...`

**Result:** ✅ **PASS**

```
All tests passed (cached)
Coverage: 85.1% (meets 85% requirement)
```

#### Frontend Tests

**Command:** `cd frontend && npm run test`

**Result:** ✅ **PASS**

```
Test Files 91 passed (91)
Tests 956 passed | 2 skipped (958)
Duration 66.45s
```

---

### 4. ✅ Security Scans

**Command:** `cd backend && go run golang.org/x/vuln/cmd/govulncheck@latest ./...`

**Result:** ✅ **PASS**

```
No vulnerabilities found.
```

---

### 5. ✅ Pre-commit Checks

**Command:** `source .venv/bin/activate && pre-commit run --all-files`

**Result:** ✅ **PASS**

```
Go Vet...................................................................Passed
Check .version matches latest Git tag....................................Passed
Prevent large files that are not tracked by LFS..........................Passed
Prevent committing CodeQL DB artifacts...................................Passed
Prevent committing data/backups files....................................Passed
Frontend TypeScript Check................................................Passed
Frontend Lint (Fix)......................................................Passed
Coverage: 85.1% (minimum required 85%)
```

---

## Critical Bug: PID Reuse Vulnerability

### Issue Location

**File:** `backend/internal/api/handlers/crowdsec_exec.go`
**Function:** `DefaultCrowdsecExecutor.Status()` (lines 95-122)

### Root Cause: PID Reuse Without Process Name Validation

The `Status()` function checks if a process exists with the stored PID but **does NOT verify** that it's actually the CrowdSec process. This causes a critical bug when:

1. CrowdSec starts with PID X (e.g., 51) and writes PID file
2. CrowdSec crashes or is killed
3. System reuses PID X for a different process (e.g., Delve telemetry)
4. `Status()` finds PID X is running and returns `running=true`
5. Reconciliation logic thinks CrowdSec is running and skips startup
6. CrowdSec never starts, LAPI remains unavailable

### Evidence

**PID File Content:**

```bash
$ docker exec charon cat /app/data/crowdsec/crowdsec.pid
51
```

**Actual Process at PID 51:**

```bash
$ docker exec charon cat /proc/51/cmdline | tr '\0' ' '
/usr/local/bin/dlv ** telemetry **
```

**NOT CrowdSec!** The PID was recycled.

**Reconciliation Log (Incorrect):**

```json
{"level":"info","msg":"CrowdSec reconciliation: already running","pid":51,"time":"2025-12-15T16:14:44-05:00"}
```

### Current Implementation (Buggy)

```go
func (e *DefaultCrowdsecExecutor) Status(ctx context.Context, configDir string) (running bool, pid int, err error) {
	b, err := os.ReadFile(e.pidFile(configDir))
	if err != nil {
		return false, 0, nil
	}

	pid, err = strconv.Atoi(string(b))
	if err != nil {
		return false, 0, nil
	}

	proc, err := os.FindProcess(pid)
	if err != nil {
		return false, pid, nil
	}

	// ❌ BUG: This only checks if *any* process exists with this PID
	// It does NOT verify that the process is CrowdSec!
	if err = proc.Signal(syscall.Signal(0)); err != nil {
		if errors.Is(err, os.ErrProcessDone) {
			return false, pid, nil
		}
		return false, pid, nil
	}

	return true, pid, nil // ❌ Returns true even if PID is recycled!
}
```

### Required Fix

The fix requires **process name validation** to ensure the PID belongs to CrowdSec:

```go
func (e *DefaultCrowdsecExecutor) Status(ctx context.Context, configDir string) (running bool, pid int, err error) {
	b, err := os.ReadFile(e.pidFile(configDir))
	if err != nil {
		return false, 0, nil
	}

	pid, err = strconv.Atoi(string(b))
	if err != nil {
		return false, 0, nil
	}

	proc, err := os.FindProcess(pid)
	if err != nil {
		return false, pid, nil
	}

	// Check if process exists
	if err = proc.Signal(syscall.Signal(0)); err != nil {
		if errors.Is(err, os.ErrProcessDone) {
			return false, pid, nil
		}
		return false, pid, nil
	}

	// ✅ NEW: Verify the process is actually CrowdSec
	if !isCrowdSecProcess(pid) {
		// PID was recycled - not CrowdSec
		return false, pid, nil
	}

	return true, pid, nil
}

// isCrowdSecProcess checks if the given PID is actually a CrowdSec process
func isCrowdSecProcess(pid int) bool {
	cmdlinePath := filepath.Join("/proc", strconv.Itoa(pid), "cmdline")
	b, err := os.ReadFile(cmdlinePath)
	if err != nil {
		return false
	}

	// cmdline uses null bytes as separators
	cmdline := string(b)

	// Check if this is crowdsec binary (could be /usr/local/bin/crowdsec or similar)
	return strings.Contains(cmdline, "crowdsec")
}
```
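
The substring check in `isCrowdSecProcess` would also accept any command line that merely mentions "crowdsec" in an argument, such as a log path. A stricter variant, sketched below as one possible option, compares only the base name of `argv[0]`. The names `isCrowdSecCmdline` and `isCrowdSecPID` are hypothetical and not part of the Charon codebase, and the sketch assumes the binary is invoked under the name `crowdsec` on a Linux procfs.

```go
package main

import (
	"bytes"
	"os"
	"path/filepath"
	"strconv"
)

// isCrowdSecCmdline reports whether raw, the contents of /proc/<pid>/cmdline,
// belongs to a process whose executable is named exactly "crowdsec".
// Hypothetical helper for illustration only.
func isCrowdSecCmdline(raw []byte) bool {
	// Arguments are NUL-separated; argv[0] is everything before the first NUL.
	argv0, _, _ := bytes.Cut(raw, []byte{0})
	if len(argv0) == 0 {
		return false // empty cmdline (kernel thread, zombie, or unreadable)
	}
	return filepath.Base(string(argv0)) == "crowdsec"
}

// isCrowdSecPID reads the command line of pid from procfs (Linux only)
// and applies the argv[0] check. Any read error counts as "not CrowdSec".
func isCrowdSecPID(pid int) bool {
	raw, err := os.ReadFile(filepath.Join("/proc", strconv.Itoa(pid), "cmdline"))
	if err != nil {
		return false
	}
	return isCrowdSecCmdline(raw)
}
```

Splitting the parsing from the procfs read keeps the matching logic testable without a running process. A process can overwrite its own `argv[0]`, so this remains a heuristic rather than proof of identity.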

### Implementation Details

The fix requires:

1. **Process name validation** by reading `/proc/{pid}/cmdline`
2. **String matching** to verify "crowdsec" appears in command line
3. **PID file cleanup** when recycled PID detected (optional, but recommended)
4. **Logging** to track PID reuse events
5. **Test coverage** for PID reuse scenario

**Alternative Approach (More Robust):**
Store both PID and process start time in the PID file to detect reboots/recycling.
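
A minimal sketch of that alternative, assuming Linux procfs, where field 22 of `/proc/<pid>/stat` is the process start time in clock ticks since boot. The record format (`"<pid> <starttime>"`) and all function names here are hypothetical choices for illustration, not an existing CrowdSec or Charon format.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseStartTime extracts field 22 (starttime) from the contents of /proc/<pid>/stat.
func parseStartTime(stat string) (uint64, error) {
	// Field 2 (comm) is parenthesised and may contain spaces or ')',
	// so split after the last ')'.
	i := strings.LastIndexByte(stat, ')')
	if i < 0 {
		return 0, fmt.Errorf("malformed stat line")
	}
	fields := strings.Fields(stat[i+1:])
	// fields[0] is field 3 (state), so field 22 is fields[19].
	if len(fields) < 20 {
		return 0, fmt.Errorf("stat line too short: %d fields after comm", len(fields))
	}
	return strconv.ParseUint(fields[19], 10, 64)
}

// processStartTime reads the start time of pid from procfs (Linux only).
func processStartTime(pid int) (uint64, error) {
	b, err := os.ReadFile("/proc/" + strconv.Itoa(pid) + "/stat")
	if err != nil {
		return 0, err
	}
	return parseStartTime(string(b))
}

// formatPIDRecord renders the hypothetical PID file content "<pid> <starttime>\n".
func formatPIDRecord(pid int, start uint64) string {
	return fmt.Sprintf("%d %d\n", pid, start)
}

// parsePIDRecord parses a record written by formatPIDRecord.
func parsePIDRecord(s string) (pid int, start uint64, err error) {
	parts := strings.Fields(s)
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("expected 2 fields, got %d", len(parts))
	}
	if pid, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	if start, err = strconv.ParseUint(parts[1], 10, 64); err != nil {
		return 0, 0, err
	}
	return pid, start, nil
}

// sameProcess reports whether the PID in record still refers to the process
// that wrote it, by comparing the recorded and current start times.
func sameProcess(record string) bool {
	pid, want, err := parsePIDRecord(record)
	if err != nil {
		return false
	}
	got, err := processStartTime(pid)
	return err == nil && got == want
}
```

A recycled PID belongs to a process with a different start time, so the comparison fails without relying on the process name. Because start times are counted from boot, persisting a boot identifier alongside the record would be a reasonable extra safeguard for PID files that survive a host reboot.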

---

## Configuration Validation

### Environment Variables ✅

```bash
CHARON_CROWDSEC_CONFIG_DIR=/app/data/crowdsec
CHARON_SECURITY_CROWDSEC_API_KEY=charonbouncerkey2024
CHARON_SECURITY_CROWDSEC_API_URL=http://localhost:8080
CHARON_SECURITY_CROWDSEC_MODE=local
FEATURE_CERBERUS_ENABLED=true
```

**Status:** ✅ All correct

### Caddy CrowdSec App Configuration ✅

```json
{
  "api_key": "charonbouncerkey2024",
  "api_url": "http://127.0.0.1:8085",
  "enable_streaming": true,
  "ticker_interval": "60s"
}
```

**Status:** ✅ Correct configuration

### CrowdSec Binary Installation ✅

```bash
-rwxr-xr-x 1 root root 71772280 Dec 15 12:50 /usr/local/bin/crowdsec
```

**Status:** ✅ Binary installed and executable

---

## Recommendations

### Immediate Actions (P0 - Critical)

1. **Fix Stale PID Detection** ⚠️ **REQUIRED BEFORE RELEASE**
   - Add process validation in reconciliation logic
   - Remove stale PID files automatically
   - **Location:** `backend/internal/crowdsec/service.go` (reconciliation function)
   - **Estimated Effort:** 30 minutes
   - **Testing:** Unit tests + integration test with restart scenario

2. **Add Restart Integration Test**
   - Create test that stops CrowdSec, restarts container, verifies startup
   - **Location:** `scripts/crowdsec_restart_test.sh`
   - **Acceptance Criteria:** CrowdSec starts successfully after restart

### Short-term Improvements (P1 - High)

1. **Enhanced Health Checks**
   - Add LAPI connectivity check to container healthcheck
   - Alert on prolonged bouncer connection failures
   - **Impact:** Faster detection of CrowdSec issues

2. **PID File Management**
   - Move PID file to `/var/run/crowdsec.pid` (standard location)
   - Use systemd-style PID management if available
   - Auto-cleanup on graceful shutdown

### Long-term Enhancements (P2 - Medium)

1. **Monitoring Dashboard**
   - Add CrowdSec status indicator to UI
   - Show LAPI health, bouncer connection status
   - Display decision count and recent blocks

2. **Auto-recovery**
   - Implement watchdog timer for CrowdSec process
   - Auto-restart on crash detection
   - Exponential backoff for restart attempts

---

## Summary

| Category | Status | Score |
|----------|--------|-------|
| Integration Test | ⚠️ Partial | 5/6 (83%) |
| Traffic Blocking | ❌ Failed | 0/1 (0%) |
| Regression Tests | ✅ Pass | 2/2 (100%) |
| Security Scans | ✅ Pass | 1/1 (100%) |
| Pre-commit | ✅ Pass | 1/1 (100%) |
| **Overall** | **❌ FAIL** | **9/11 (82%)** |

---

## Verdict

**⚠️ VALIDATION FAILED - CRITICAL BUG FOUND**

**Issue:** Stale PID file prevents CrowdSec LAPI from starting after container restart.

**Impact:**

- ❌ CrowdSec does NOT function after restart
- ❌ Traffic blocking DOES NOT work
- ✅ All other components (tests, security, code quality) pass

**Required Before Release:**

1. Fix stale PID detection in reconciliation logic
2. Add restart integration test
3. Verify traffic blocking works after container restart

**Timeline:**

- **Fix Implementation:** 30-60 minutes
- **Testing & Validation:** 30 minutes
- **Total:** 1 to 1.5 hours

---

## Test Evidence

### Files Examined

- [docker-entrypoint.sh](../../docker-entrypoint.sh) - CrowdSec initialization
- [docker-compose.override.yml](../../docker-compose.override.yml) - Environment variables
- Backend tests: All passed (cached)
- Frontend tests: 956 passed, 2 skipped

### Container State

- Container: `charon` (Up 43 minutes, healthy)
- CrowdSec binary: Installed at `/usr/local/bin/crowdsec` (71MB)
- LAPI port 8085: Not bound (process not running)
- Bouncer: Registered but cannot connect

### Logs Analyzed

- Container logs: 50+ lines analyzed
- CrowdSec logs: Connection refused errors every 10s
- Reconciliation logs: False "already running" messages

---

## Next Steps

1. **Developer:** Implement stale PID fix in `backend/internal/crowdsec/service.go`
2. **QA:** Re-run validation after fix deployed
3. **DevOps:** Update integration tests to include restart scenario
4. **Documentation:** Add troubleshooting section for PID file issues

---

**Report Generated:** 2025-12-15 21:23 UTC
**Validation Duration:** 45 minutes
**Agent:** QA_Security
**Version:** Charon v0.x.x (pre-release)