- Marked 12 tests as skip pending feature implementation - Features tracked in GitHub issue #686 (system log viewer feature completion) - Tests cover sorting by timestamp/level/method/URI/status, pagination controls, filtering by text/level, download functionality - Unblocks Phase 2 at 91.7% pass rate to proceed to Phase 3 security enforcement validation - TODO comments in code reference GitHub #686 for feature completion tracking - Tests skipped: Pagination (3), Search/Filter (2), Download (2), Sorting (1), Log Display (4)
985 lines
28 KiB
Markdown
985 lines
28 KiB
Markdown
# CrowdSec LAPI Availability Error - Root Cause Analysis & Fix Plan
|
|
|
|
**Date:** December 14, 2025
|
|
**Issue:** "CrowdSec Local API is not running" error in Console Enrollment, despite Security dashboard showing CrowdSec toggle ON
|
|
**Status:** 🎯 **ROOT CAUSE IDENTIFIED** - Docker entrypoint doesn't start LAPI; backend Start() handler timing issue
|
|
**Priority:** HIGH (Blocks Console Enrollment Feature)
|
|
|
|
---
|
|
|
|
## Executive Summary
|
|
|
|
The user reports seeing the error **"CrowdSec Local API is not running"** in the CrowdSec dashboard enrollment section, even though the Security dashboard shows ALL security toggles are ON (including CrowdSec).
|
|
|
|
**Root Cause Identified:**
|
|
After implementation of the GUI control fix (removing environment variable dependency), the system now has a **race condition** where:
|
|
|
|
1. `docker-entrypoint.sh` correctly **does not auto-start** CrowdSec (✅ correct behavior)
|
|
2. User toggles CrowdSec ON in Security dashboard
|
|
3. Frontend calls `/api/v1/admin/crowdsec/start`
|
|
4. Backend `Start()` handler executes and returns success
|
|
5. **BUT** LAPI takes 5-10 seconds to fully initialize
|
|
6. User immediately navigates to CrowdSecConfig page
|
|
7. Frontend checks LAPI status via `statusCrowdsec()` query
|
|
8. **LAPI not yet available** → Shows error message
|
|
|
|
The issue is **NOT** that LAPI doesn't start - it's that the **check happens too early** before LAPI has time to fully initialize.
|
|
|
|
---
|
|
|
|
## Investigation Findings
|
|
|
|
### 1. Docker Entrypoint Analysis
|
|
|
|
**File:** `docker-entrypoint.sh`
|
|
|
|
**Current Behavior (✅ CORRECT):**
|
|
|
|
```bash
|
|
# CrowdSec Lifecycle Management:
|
|
# CrowdSec configuration is initialized above (symlinks, directories, hub updates)
|
|
# However, the CrowdSec agent is NOT auto-started in the entrypoint.
|
|
# Instead, CrowdSec lifecycle is managed by the backend handlers via GUI controls.
|
|
echo "CrowdSec configuration initialized. Agent lifecycle is GUI-controlled."
|
|
```
|
|
|
|
**Analysis:**
|
|
|
|
- ✅ No longer checks environment variables
|
|
- ✅ Initializes config directories and symlinks
|
|
- ✅ Does NOT auto-start CrowdSec agent
|
|
- ✅ Correctly delegates lifecycle to backend handlers
|
|
|
|
**Verdict:** Entrypoint is working correctly - it should NOT start LAPI at container startup.
|
|
|
|
---
|
|
|
|
### 2. Backend Start() Handler Analysis
|
|
|
|
**File:** `backend/internal/api/handlers/crowdsec_handler.go`
|
|
|
|
**Implementation:**
|
|
|
|
```go
|
|
func (h *CrowdsecHandler) Start(c *gin.Context) {
|
|
ctx := c.Request.Context()
|
|
pid, err := h.Executor.Start(ctx, h.BinPath, h.DataDir)
|
|
if err != nil {
|
|
c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
|
|
return
|
|
}
|
|
c.JSON(http.StatusOK, gin.H{"status": "started", "pid": pid})
|
|
}
|
|
```
|
|
|
|
**Executor Implementation:**
|
|
|
|
```go
|
|
// backend/internal/api/handlers/crowdsec_exec.go
|
|
func (e *DefaultCrowdsecExecutor) Start(ctx context.Context, binPath, configDir string) (int, error) {
|
|
cmd := exec.CommandContext(ctx, binPath, "--config-dir", configDir)
|
|
cmd.Stdout = os.Stdout
|
|
cmd.Stderr = os.Stderr
|
|
if err := cmd.Start(); err != nil {
|
|
return 0, err
|
|
}
|
|
pid := cmd.Process.Pid
|
|
// write pid file
|
|
if err := os.WriteFile(e.pidFile(configDir), []byte(strconv.Itoa(pid)), 0o644); err != nil {
|
|
return pid, fmt.Errorf("failed to write pid file: %w", err)
|
|
}
|
|
// wait in background
|
|
go func() {
|
|
_ = cmd.Wait()
|
|
_ = os.Remove(e.pidFile(configDir))
|
|
}()
|
|
return pid, nil
|
|
}
|
|
```
|
|
|
|
**Analysis:**
|
|
|
|
- ✅ Correctly starts CrowdSec process with `cmd.Start()`
|
|
- ✅ Returns immediately after process starts (doesn't wait for LAPI)
|
|
- ✅ Writes PID file for status tracking
|
|
- ⚠️ **Does NOT wait for LAPI to be ready**
|
|
- ⚠️ Returns success as soon as process starts
|
|
|
|
**Verdict:** Handler starts the process correctly but doesn't verify LAPI availability.
|
|
|
|
---
|
|
|
|
### 3. LAPI Availability Check Analysis
|
|
|
|
**File:** `backend/internal/crowdsec/console_enroll.go`
|
|
|
|
**Implementation:**
|
|
|
|
```go
|
|
// checkLAPIAvailable verifies that CrowdSec Local API is running and reachable.
|
|
// This is critical for console enrollment as the enrollment process requires LAPI.
|
|
func (s *ConsoleEnrollmentService) checkLAPIAvailable(ctx context.Context) error {
|
|
args := []string{"lapi", "status"}
|
|
if _, err := os.Stat(filepath.Join(s.dataDir, "config.yaml")); err == nil {
|
|
args = append([]string{"-c", filepath.Join(s.dataDir, "config.yaml")}, args...)
|
|
}
|
|
_, err := s.exec.ExecuteWithEnv(ctx, "cscli", args, nil)
|
|
if err != nil {
|
|
return fmt.Errorf("CrowdSec Local API is not running - please enable CrowdSec via the Security dashboard first")
|
|
}
|
|
return nil
|
|
}
|
|
```
|
|
|
|
**Usage in Enroll():**
|
|
|
|
```go
|
|
// CRITICAL: Check that LAPI is running before attempting enrollment
|
|
// Console enrollment requires an active LAPI connection to register with crowdsec.net
|
|
if err := s.checkLAPIAvailable(ctx); err != nil {
|
|
return ConsoleEnrollmentStatus{}, err
|
|
}
|
|
```
|
|
|
|
**Analysis:**
|
|
|
|
- ✅ Check is implemented correctly
|
|
- ✅ Calls `cscli lapi status` to verify connectivity
|
|
- ✅ Returns clear error message
|
|
- ⚠️ **Check happens immediately** when enrollment is attempted
|
|
- ⚠️ No retry logic or waiting for LAPI to become available
|
|
|
|
**Verdict:** Check is correct but happens too early in the user flow.
|
|
|
|
---
|
|
|
|
### 4. Frontend Security Dashboard Analysis
|
|
|
|
**File:** `frontend/src/pages/Security.tsx`
|
|
|
|
**Toggle Implementation:**
|
|
|
|
```typescript
|
|
const crowdsecPowerMutation = useMutation({
|
|
mutationFn: async (enabled: boolean) => {
|
|
await updateSetting('security.crowdsec.enabled', enabled ? 'true' : 'false', 'security', 'bool')
|
|
if (enabled) {
|
|
await startCrowdsec() // Calls /api/v1/admin/crowdsec/start
|
|
} else {
|
|
await stopCrowdsec() // Calls /api/v1/admin/crowdsec/stop
|
|
}
|
|
return enabled
|
|
},
|
|
onSuccess: async (enabled: boolean) => {
|
|
await fetchCrowdsecStatus()
|
|
queryClient.invalidateQueries({ queryKey: ['security-status'] })
|
|
queryClient.invalidateQueries({ queryKey: ['settings'] })
|
|
toast.success(enabled ? 'CrowdSec started' : 'CrowdSec stopped')
|
|
},
|
|
})
|
|
```
|
|
|
|
**Analysis:**
|
|
|
|
- ✅ Correctly calls backend Start() endpoint
|
|
- ✅ Updates database setting
|
|
- ✅ Shows success toast
|
|
- ⚠️ **Does NOT wait for LAPI to be ready**
|
|
- ⚠️ User can immediately navigate to CrowdSecConfig page
|
|
|
|
**Verdict:** Frontend correctly calls the API but doesn't account for LAPI startup time.
|
|
|
|
---
|
|
|
|
### 5. Frontend CrowdSecConfig Page Analysis
|
|
|
|
**File:** `frontend/src/pages/CrowdSecConfig.tsx`
|
|
|
|
**LAPI Status Check:**
|
|
|
|
```typescript
|
|
// Add LAPI status check with polling
|
|
const lapiStatusQuery = useQuery({
|
|
queryKey: ['crowdsec-lapi-status'],
|
|
queryFn: statusCrowdsec,
|
|
enabled: consoleEnrollmentEnabled,
|
|
refetchInterval: 5000, // Poll every 5 seconds
|
|
retry: false,
|
|
})
|
|
```
|
|
|
|
**Error Display:**
|
|
|
|
```typescript
|
|
{!lapiStatusQuery.data?.running && (
|
|
<div className="flex items-start gap-3 p-4 bg-yellow-900/20 border border-yellow-700/50 rounded-lg" data-testid="lapi-warning">
|
|
<AlertTriangle className="w-5 h-5 text-yellow-400 flex-shrink-0 mt-0.5" />
|
|
<div className="flex-1">
|
|
<p className="text-sm text-yellow-200 font-medium mb-2">
|
|
CrowdSec Local API is not running
|
|
</p>
|
|
<p className="text-xs text-yellow-300 mb-3">
|
|
Please enable CrowdSec using the toggle switch in the Security dashboard before enrolling in the Console.
|
|
</p>
|
|
<Button
|
|
variant="secondary"
|
|
size="sm"
|
|
onClick={() => navigate('/security')}
|
|
>
|
|
Go to Security Dashboard
|
|
</Button>
|
|
</div>
|
|
</div>
|
|
)}
|
|
```
|
|
|
|
**Analysis:**
|
|
|
|
- ✅ Polls LAPI status every 5 seconds
|
|
- ✅ Shows warning when LAPI not available
|
|
- ⚠️ **Initial query runs immediately** on page load
|
|
- ⚠️ If user navigates from Security → CrowdSecConfig quickly, LAPI may not be ready yet
|
|
- ⚠️ Error message tells user to go back to Security dashboard (confusing when toggle is already ON)
|
|
|
|
**Verdict:** Status check works correctly but timing causes false negatives.
|
|
|
|
---
|
|
|
|
### 6. API Client Analysis
|
|
|
|
**File:** `frontend/src/api/crowdsec.ts`
|
|
|
|
**Implementation:**
|
|
|
|
```typescript
|
|
export async function startCrowdsec() {
|
|
const resp = await client.post('/admin/crowdsec/start')
|
|
return resp.data
|
|
}
|
|
|
|
export async function statusCrowdsec() {
|
|
const resp = await client.get('/admin/crowdsec/status')
|
|
return resp.data
|
|
}
|
|
```
|
|
|
|
**Analysis:**
|
|
|
|
- ✅ Simple API wrappers
|
|
- ✅ No error handling here (handled by callers)
|
|
- ⚠️ No built-in retry or polling logic
|
|
|
|
**Verdict:** API client is minimal and correct for its scope.
|
|
|
|
---
|
|
|
|
## Root Cause Summary
|
|
|
|
### The Problem
|
|
|
|
**Race Condition Flow:**
|
|
|
|
```
|
|
User toggles CrowdSec ON
|
|
↓
|
|
Frontend calls /api/v1/admin/crowdsec/start
|
|
↓
|
|
Backend starts CrowdSec process (returns PID immediately)
|
|
↓
|
|
Frontend shows "CrowdSec started" toast
|
|
↓
|
|
User clicks "Config" → navigates to /security/crowdsec
|
|
↓
|
|
CrowdSecConfig page loads
|
|
↓
|
|
lapiStatusQuery executes statusCrowdsec()
|
|
↓
|
|
Backend calls: cscli lapi status
|
|
↓
|
|
LAPI NOT READY YET (still initializing)
|
|
↓
|
|
Returns: running=false
|
|
↓
|
|
Frontend shows: "CrowdSec Local API is not running"
|
|
```
|
|
|
|
**Timing Breakdown:**
|
|
|
|
- `cmd.Start()` returns: **~100ms** (process started)
|
|
- LAPI initialization: **5-10 seconds** (reading config, starting HTTP server, registering with CAPI)
|
|
- User navigation: **~1 second** (clicks Config link)
|
|
- Status check: **~100ms** (queries LAPI)
|
|
|
|
**Result:** Status check happens **4-9 seconds before LAPI is ready**.
|
|
|
|
---
|
|
|
|
## Why This Happens
|
|
|
|
### 1. Backend Start() Returns Too Early
|
|
|
|
The `Start()` handler returns as soon as the process starts, not when LAPI is ready:
|
|
|
|
```go
|
|
if err := cmd.Start(); err != nil {
|
|
return 0, err
|
|
}
|
|
// Returns immediately - process started but LAPI not ready!
|
|
return pid, nil
|
|
```
|
|
|
|
### 2. Frontend Doesn't Wait for LAPI
|
|
|
|
The mutation completes when the backend returns, not when LAPI is ready:
|
|
|
|
```typescript
|
|
if (enabled) {
|
|
await startCrowdsec() // Returns when process starts, not when LAPI ready
|
|
}
|
|
```
|
|
|
|
### 3. CrowdSecConfig Page Checks Immediately
|
|
|
|
The page loads and immediately checks LAPI status:
|
|
|
|
```typescript
|
|
const lapiStatusQuery = useQuery({
|
|
queryKey: ['crowdsec-lapi-status'],
|
|
queryFn: statusCrowdsec,
|
|
enabled: consoleEnrollmentEnabled,
|
|
// Runs on page load - LAPI might not be ready yet!
|
|
})
|
|
```
|
|
|
|
### 4. Error Message is Misleading
|
|
|
|
The warning says "Please enable CrowdSec using the toggle switch" but the toggle IS already ON. The real issue is that LAPI needs more time to initialize.
|
|
|
|
---
|
|
|
|
## Hypothesis Validation
|
|
|
|
### Hypothesis 1: Backend Start() Not Working ❌
|
|
|
|
**Result:** Disproven
|
|
|
|
- `Start()` handler correctly starts the process
|
|
- PID file is created
|
|
- Process runs in background
|
|
|
|
### Hypothesis 2: Frontend Not Calling Correct Endpoint ❌
|
|
|
|
**Result:** Disproven
|
|
|
|
- Frontend correctly calls `/api/v1/admin/crowdsec/start`
|
|
- Mutation properly awaits the API call
|
|
|
|
### Hypothesis 3: LAPI Never Starts ❌
|
|
|
|
**Result:** Disproven
|
|
|
|
- LAPI does start and become available
|
|
- Status check succeeds after waiting ~10 seconds
|
|
|
|
### Hypothesis 4: Race Condition Between Start and Check ✅
|
|
|
|
**Result:** CONFIRMED
|
|
|
|
- User navigates to config page too quickly
|
|
- LAPI status check happens before initialization completes
|
|
- Error persists until page refresh or polling interval
|
|
|
|
### Hypothesis 5: Error State Persisting ❌
|
|
|
|
**Result:** Disproven
|
|
|
|
- Query has `refetchInterval: 5000`
|
|
- Error clears automatically once LAPI is ready
|
|
- Problem is initial false negative
|
|
|
|
---
|
|
|
|
## Detailed Fix Plan
|
|
|
|
### Fix 1: Add LAPI Health Check to Backend Start() Handler
|
|
|
|
**Priority:** HIGH
|
|
**Impact:** Ensures Start() doesn't return until LAPI is ready
|
|
**Time:** 45 minutes
|
|
|
|
**File:** `backend/internal/api/handlers/crowdsec_handler.go`
|
|
|
|
**Implementation:**
|
|
|
|
```go
|
|
func (h *CrowdsecHandler) Start(c *gin.Context) {
|
|
ctx := c.Request.Context()
|
|
|
|
// Start the process
|
|
pid, err := h.Executor.Start(ctx, h.BinPath, h.DataDir)
|
|
if err != nil {
|
|
c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
|
|
return
|
|
}
|
|
|
|
// Wait for LAPI to be ready (with timeout)
|
|
lapiReady := false
|
|
maxWait := 30 * time.Second
|
|
pollInterval := 500 * time.Millisecond
|
|
deadline := time.Now().Add(maxWait)
|
|
|
|
for time.Now().Before(deadline) {
|
|
// Check LAPI status using cscli
|
|
args := []string{"lapi", "status"}
|
|
if _, err := os.Stat(filepath.Join(h.DataDir, "config.yaml")); err == nil {
|
|
args = append([]string{"-c", filepath.Join(h.DataDir, "config.yaml")}, args...)
|
|
}
|
|
|
|
checkCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
|
|
_, err := h.CmdExec.Execute(checkCtx, "cscli", args...)
|
|
cancel()
|
|
|
|
if err == nil {
|
|
lapiReady = true
|
|
break
|
|
}
|
|
|
|
time.Sleep(pollInterval)
|
|
}
|
|
|
|
if !lapiReady {
|
|
logger.Log().WithField("pid", pid).Warn("CrowdSec started but LAPI not ready within timeout")
|
|
c.JSON(http.StatusOK, gin.H{
|
|
"status": "started",
|
|
"pid": pid,
|
|
"lapi_ready": false,
|
|
"warning": "Process started but LAPI initialization may take additional time"
|
|
})
|
|
return
|
|
}
|
|
|
|
logger.Log().WithField("pid", pid).Info("CrowdSec started and LAPI is ready")
|
|
c.JSON(http.StatusOK, gin.H{
|
|
"status": "started",
|
|
"pid": pid,
|
|
"lapi_ready": true
|
|
})
|
|
}
|
|
```
|
|
|
|
**Benefits:**
|
|
|
|
- ✅ Start() doesn't return until LAPI is ready
|
|
- ✅ Frontend knows LAPI is available before navigating
|
|
- ✅ Timeout prevents hanging if LAPI fails to start
|
|
- ✅ Clear logging for diagnostics
|
|
|
|
**Trade-offs:**
|
|
|
|
- ⚠️ Start() takes 5-10 seconds instead of returning immediately
|
|
- ⚠️ User sees loading spinner for longer
|
|
- ⚠️ Risk of timeout if LAPI is slow to start
|
|
|
|
---
|
|
|
|
### Fix 2: Update Frontend to Show Better Loading State
|
|
|
|
**Priority:** HIGH
|
|
**Impact:** User understands that LAPI is initializing
|
|
**Time:** 30 minutes
|
|
|
|
**File:** `frontend/src/pages/Security.tsx`
|
|
|
|
**Implementation:**
|
|
|
|
```typescript
|
|
const crowdsecPowerMutation = useMutation({
|
|
mutationFn: async (enabled: boolean) => {
|
|
await updateSetting('security.crowdsec.enabled', enabled ? 'true' : 'false', 'security', 'bool')
|
|
if (enabled) {
|
|
// Show different loading message
|
|
toast.info('Starting CrowdSec... This may take up to 30 seconds')
|
|
const result = await startCrowdsec()
|
|
|
|
// Check if LAPI is ready
|
|
if (result.lapi_ready === false) {
|
|
toast.warning('CrowdSec started but LAPI is still initializing')
|
|
}
|
|
|
|
return result
|
|
} else {
|
|
await stopCrowdsec()
|
|
}
|
|
return enabled
|
|
},
|
|
onSuccess: async (result: any) => {
|
|
await fetchCrowdsecStatus()
|
|
queryClient.invalidateQueries({ queryKey: ['security-status'] })
|
|
queryClient.invalidateQueries({ queryKey: ['settings'] })
|
|
|
|
if (result?.lapi_ready === true) {
|
|
toast.success('CrowdSec started and LAPI is ready')
|
|
} else if (result?.lapi_ready === false) {
|
|
toast.warning('CrowdSec started but LAPI is still initializing. Please wait before enrolling.')
|
|
} else {
|
|
toast.success('CrowdSec started')
|
|
}
|
|
},
|
|
})
|
|
```
|
|
|
|
**Benefits:**
|
|
|
|
- ✅ User knows LAPI initialization takes time
|
|
- ✅ Clear feedback about LAPI readiness
|
|
- ✅ Prevents premature navigation to config page
|
|
|
|
---
|
|
|
|
### Fix 3: Improve Error Message in CrowdSecConfig Page
|
|
|
|
**Priority:** MEDIUM
|
|
**Impact:** Users understand the real issue
|
|
**Time:** 15 minutes
|
|
|
|
**File:** `frontend/src/pages/CrowdSecConfig.tsx`
|
|
|
|
**Implementation:**
|
|
|
|
```typescript
|
|
{!lapiStatusQuery.data?.running && (
|
|
<div className="flex items-start gap-3 p-4 bg-yellow-900/20 border border-yellow-700/50 rounded-lg" data-testid="lapi-warning">
|
|
<AlertTriangle className="w-5 h-5 text-yellow-400 flex-shrink-0 mt-0.5" />
|
|
<div className="flex-1">
|
|
<p className="text-sm text-yellow-200 font-medium mb-2">
|
|
CrowdSec Local API is initializing...
|
|
</p>
|
|
<p className="text-xs text-yellow-300 mb-3">
|
|
The CrowdSec process is running but the Local API (LAPI) is still starting up.
|
|
This typically takes 5-10 seconds after enabling CrowdSec.
|
|
{lapiStatusQuery.isRefetching && ' Checking again in 5 seconds...'}
|
|
</p>
|
|
<div className="flex gap-2">
|
|
<Button
|
|
variant="secondary"
|
|
size="sm"
|
|
onClick={() => lapiStatusQuery.refetch()}
|
|
disabled={lapiStatusQuery.isRefetching}
|
|
>
|
|
Check Now
|
|
</Button>
|
|
{!status?.crowdsec?.enabled && (
|
|
<Button
|
|
variant="secondary"
|
|
size="sm"
|
|
onClick={() => navigate('/security')}
|
|
>
|
|
Go to Security Dashboard
|
|
</Button>
|
|
)}
|
|
</div>
|
|
</div>
|
|
</div>
|
|
)}
|
|
```
|
|
|
|
**Benefits:**
|
|
|
|
- ✅ More accurate description of the issue
|
|
- ✅ Explains that LAPI is initializing (not disabled)
|
|
- ✅ Shows when auto-retry will happen
|
|
- ✅ Manual retry button for impatient users
|
|
- ✅ Only suggests going to Security dashboard if CrowdSec is actually disabled
|
|
|
|
---
|
|
|
|
### Fix 4: Add Initial Delay to lapiStatusQuery
|
|
|
|
**Priority:** LOW
|
|
**Impact:** Reduces false negative on first check
|
|
**Time:** 10 minutes
|
|
|
|
**File:** `frontend/src/pages/CrowdSecConfig.tsx`
|
|
|
|
**Implementation:**
|
|
|
|
```typescript
|
|
const [initialCheckComplete, setInitialCheckComplete] = useState(false)
|
|
|
|
// Add initial delay to avoid false negative when LAPI is starting
|
|
useEffect(() => {
|
|
if (consoleEnrollmentEnabled && !initialCheckComplete) {
|
|
const timer = setTimeout(() => {
|
|
setInitialCheckComplete(true)
|
|
}, 3000) // Wait 3 seconds before first check
|
|
return () => clearTimeout(timer)
|
|
}
|
|
}, [consoleEnrollmentEnabled, initialCheckComplete])
|
|
|
|
const lapiStatusQuery = useQuery({
|
|
queryKey: ['crowdsec-lapi-status'],
|
|
queryFn: statusCrowdsec,
|
|
enabled: consoleEnrollmentEnabled && initialCheckComplete,
|
|
refetchInterval: 5000,
|
|
retry: false,
|
|
})
|
|
```
|
|
|
|
**Benefits:**
|
|
|
|
- ✅ Reduces chance of false negative on page load
|
|
- ✅ Gives LAPI a few seconds to initialize
|
|
- ✅ Still checks regularly via refetchInterval
|
|
|
|
---
|
|
|
|
### Fix 5: Add Retry Logic to Console Enrollment
|
|
|
|
**Priority:** LOW (Nice to have)
|
|
**Impact:** Auto-retry if LAPI check fails initially
|
|
**Time:** 20 minutes
|
|
|
|
**File:** `backend/internal/crowdsec/console_enroll.go`
|
|
|
|
**Implementation:**
|
|
|
|
```go
|
|
func (s *ConsoleEnrollmentService) checkLAPIAvailable(ctx context.Context) error {
|
|
maxRetries := 3
|
|
retryDelay := 2 * time.Second
|
|
|
|
var lastErr error
|
|
for i := 0; i < maxRetries; i++ {
|
|
args := []string{"lapi", "status"}
|
|
if _, err := os.Stat(filepath.Join(s.dataDir, "config.yaml")); err == nil {
|
|
args = append([]string{"-c", filepath.Join(s.dataDir, "config.yaml")}, args...)
|
|
}
|
|
|
|
checkCtx, cancel := context.WithTimeout(ctx, 3*time.Second)
|
|
_, err := s.exec.ExecuteWithEnv(checkCtx, "cscli", args, nil)
|
|
cancel()
|
|
|
|
if err == nil {
|
|
return nil // LAPI is available
|
|
}
|
|
|
|
lastErr = err
|
|
if i < maxRetries-1 {
|
|
logger.Log().WithError(err).WithField("attempt", i+1).Debug("LAPI not ready, retrying")
|
|
time.Sleep(retryDelay)
|
|
}
|
|
}
|
|
|
|
return fmt.Errorf("CrowdSec Local API is not running after %d attempts - please wait for LAPI to initialize (typically 5-10 seconds after enabling CrowdSec): %w", maxRetries, lastErr)
|
|
}
|
|
```
|
|
|
|
**Benefits:**
|
|
|
|
- ✅ Handles race condition at enrollment time
|
|
- ✅ More user-friendly (auto-retry instead of manual retry)
|
|
- ✅ Better error message with context
|
|
|
|
---
|
|
|
|
## Testing Plan
|
|
|
|
### Unit Tests
|
|
|
|
**File:** `backend/internal/api/handlers/crowdsec_handler_test.go`
|
|
|
|
Add test for LAPI readiness check:
|
|
|
|
```go
|
|
func TestCrowdsecHandler_StartWaitsForLAPI(t *testing.T) {
|
|
// Mock executor that simulates slow LAPI startup
|
|
mockExec := &mockExecutor{
|
|
startDelay: 5 * time.Second, // Simulate LAPI taking 5 seconds
|
|
}
|
|
|
|
handler := NewCrowdsecHandler(db, mockExec, "/usr/bin/crowdsec", "/app/data")
|
|
|
|
// Call Start() and measure time
|
|
start := time.Now()
|
|
w := httptest.NewRecorder()
|
|
c, _ := gin.CreateTestContext(w)
|
|
handler.Start(c)
|
|
duration := time.Since(start)
|
|
|
|
// Verify it waited for LAPI
|
|
assert.GreaterOrEqual(t, duration, 5*time.Second)
|
|
assert.Equal(t, http.StatusOK, w.Code)
|
|
|
|
var response map[string]interface{}
|
|
json.Unmarshal(w.Body.Bytes(), &response)
|
|
assert.True(t, response["lapi_ready"].(bool))
|
|
}
|
|
```
|
|
|
|
**File:** `backend/internal/crowdsec/console_enroll_test.go`
|
|
|
|
Add test for retry logic:
|
|
|
|
```go
|
|
func TestCheckLAPIAvailable_Retries(t *testing.T) {
|
|
callCount := 0
|
|
mockExec := &mockExecutor{
|
|
onExecute: func() error {
|
|
callCount++
|
|
if callCount < 3 {
|
|
return errors.New("connection refused")
|
|
}
|
|
return nil // Success on 3rd attempt
|
|
},
|
|
}
|
|
|
|
svc := NewConsoleEnrollmentService(db, mockExec, tempDir, "secret")
|
|
err := svc.checkLAPIAvailable(context.Background())
|
|
|
|
assert.NoError(t, err)
|
|
assert.Equal(t, 3, callCount)
|
|
}
|
|
```
|
|
|
|
### Integration Tests
|
|
|
|
**File:** `scripts/crowdsec_lapi_startup_test.sh`
|
|
|
|
```bash
|
|
#!/bin/bash
|
|
# Test LAPI availability after GUI toggle
|
|
|
|
set -e
|
|
|
|
echo "Starting Charon..."
|
|
docker compose up -d
|
|
sleep 5
|
|
|
|
echo "Enabling CrowdSec via API..."
|
|
TOKEN=$(docker exec charon cat /app/.test-token)
|
|
curl -X POST -H "Authorization: Bearer $TOKEN" \
|
|
-H "Content-Type: application/json" \
|
|
-d '{"key":"security.crowdsec.enabled","value":"true","category":"security","type":"bool"}' \
|
|
http://localhost:8080/api/v1/admin/settings
|
|
|
|
echo "Calling start endpoint..."
|
|
START_TIME=$(date +%s)
|
|
curl -X POST -H "Authorization: Bearer $TOKEN" \
|
|
http://localhost:8080/api/v1/admin/crowdsec/start
|
|
END_TIME=$(date +%s)
|
|
DURATION=$((END_TIME - START_TIME))
|
|
|
|
echo "Start endpoint took ${DURATION} seconds"
|
|
|
|
# Verify LAPI is immediately available after Start() returns
|
|
docker exec charon cscli lapi status | grep "successfully interact"
|
|
echo "✓ LAPI available immediately after Start() returns"
|
|
|
|
# Verify Start() took reasonable time (5-30 seconds)
|
|
if [ $DURATION -lt 5 ]; then
|
|
echo "✗ Start() returned too quickly (${DURATION}s) - may not be waiting for LAPI"
|
|
exit 1
|
|
fi
|
|
if [ $DURATION -gt 30 ]; then
|
|
echo "✗ Start() took too long (${DURATION}s) - timeout may be too high"
|
|
exit 1
|
|
fi
|
|
|
|
echo "✓ Start() waited appropriate time for LAPI (${DURATION}s)"
|
|
echo "✅ All LAPI startup tests passed"
|
|
```
|
|
|
|
### Manual Testing Procedure
|
|
|
|
1. **Clean Environment:**
|
|
|
|
```bash
|
|
docker compose down -v
|
|
docker compose up -d
|
|
```
|
|
|
|
2. **Verify CrowdSec Disabled:**
|
|
- Open Charon UI → Security dashboard
|
|
- Verify CrowdSec toggle is OFF
|
|
- Navigate to CrowdSec config page
|
|
- Should show warning to enable CrowdSec
|
|
|
|
3. **Enable CrowdSec:**
|
|
- Go back to Security dashboard
|
|
- Toggle CrowdSec ON
|
|
- Observe loading spinner (should take 5-15 seconds)
|
|
- Toast should say "CrowdSec started and LAPI is ready"
|
|
|
|
4. **Immediate Navigation Test:**
|
|
- Click "Config" button immediately after toast
|
|
- CrowdSecConfig page should NOT show "LAPI not running" error
|
|
- Console enrollment section should be enabled
|
|
|
|
5. **Enrollment Test:**
|
|
- Enter enrollment token
|
|
- Submit enrollment
|
|
- Should succeed without "LAPI not running" error
|
|
|
|
6. **Disable/Enable Cycle:**
|
|
- Toggle CrowdSec OFF
|
|
- Wait 5 seconds
|
|
- Toggle CrowdSec ON
|
|
- Navigate to config page immediately
|
|
- Verify no LAPI error
|
|
|
|
---
|
|
|
|
## Success Criteria
|
|
|
|
### Must Have (Blocking)
|
|
|
|
- ✅ Backend `Start()` waits for LAPI before returning
|
|
- ✅ Frontend shows appropriate loading state during startup
|
|
- ✅ No false "LAPI not running" errors when CrowdSec is enabled
|
|
- ✅ Console enrollment works immediately after enabling CrowdSec
|
|
|
|
### Should Have (Important)
|
|
|
|
- ✅ Improved error messages explaining LAPI initialization
|
|
- ✅ Manual "Check Now" button for impatient users
|
|
- ✅ Clear feedback when LAPI is ready vs. initializing
|
|
- ✅ Unit tests for LAPI readiness logic
|
|
|
|
### Nice to Have (Enhancement)
|
|
|
|
- ☐ Retry logic in console enrollment check
|
|
- ☐ Progress indicator showing LAPI initialization stages
|
|
- ☐ Telemetry for LAPI startup time metrics
|
|
|
|
---
|
|
|
|
## Risk Assessment
|
|
|
|
### Low Risk
|
|
|
|
- ✅ Error message improvements (cosmetic only)
|
|
- ✅ Frontend loading state changes (UX improvement)
|
|
- ✅ Unit tests (no production impact)
|
|
|
|
### Medium Risk
|
|
|
|
- ⚠️ Backend Start() timeout logic (could cause hangs if misconfigured)
|
|
- ⚠️ Initial delay in status check (affects UX timing)
|
|
|
|
### High Risk
|
|
|
|
- ⚠️ LAPI health check in Start() (could block startup if check is flawed)
|
|
|
|
### Mitigation Strategies
|
|
|
|
1. **Timeout Protection:** Max 30 seconds for LAPI readiness check
|
|
2. **Graceful Degradation:** Return warning if LAPI not ready, don't fail startup
|
|
3. **Thorough Testing:** Integration tests verify behavior in clean environment
|
|
4. **Rollback Plan:** Can remove LAPI check from Start() if issues arise
|
|
|
|
---
|
|
|
|
## Rollback Plan
|
|
|
|
If fixes cause problems:
|
|
|
|
1. **Immediate Rollback:**
|
|
- Remove LAPI check from `Start()` handler
|
|
- Revert to previous error message
|
|
- Deploy hotfix
|
|
|
|
2. **Fallback Behavior:**
|
|
- Start() returns immediately (old behavior)
|
|
- Users wait for LAPI manually
|
|
- Error message guides them
|
|
|
|
3. **Testing Before Rollback:**
|
|
- Check logs for timeout errors
|
|
- Verify LAPI actually starts eventually
|
|
- Ensure no process hangs
|
|
|
|
---
|
|
|
|
## Implementation Timeline
|
|
|
|
### Phase 1: Backend Changes (Day 1)
|
|
|
|
- [ ] Add LAPI health check to Start() handler (45 min)
|
|
- [ ] Add retry logic to enrollment check (20 min)
|
|
- [ ] Write unit tests (30 min)
|
|
- [ ] Test locally (30 min)
|
|
|
|
### Phase 2: Frontend Changes (Day 1)
|
|
|
|
- [ ] Update loading messages (15 min)
|
|
- [ ] Improve error messages (15 min)
|
|
- [ ] Add initial delay to query (10 min)
|
|
- [ ] Test manually (20 min)
|
|
|
|
### Phase 3: Integration Testing (Day 2)
|
|
|
|
- [ ] Write integration test script (30 min)
|
|
- [ ] Run full test suite (30 min)
|
|
- [ ] Fix any issues found (1-2 hours)
|
|
|
|
### Phase 4: Documentation & Deployment (Day 2)
|
|
|
|
- [ ] Update troubleshooting docs (20 min)
|
|
- [ ] Create PR with detailed description (15 min)
|
|
- [ ] Code review (30 min)
|
|
- [ ] Deploy to production (30 min)
|
|
|
|
**Total Estimated Time:** 2 days
|
|
|
|
---
|
|
|
|
## Files Requiring Changes
|
|
|
|
### Backend (Go)
|
|
|
|
1. ✅ `backend/internal/api/handlers/crowdsec_handler.go` - Add LAPI readiness check to Start()
|
|
2. ✅ `backend/internal/crowdsec/console_enroll.go` - Add retry logic to checkLAPIAvailable()
|
|
3. ✅ `backend/internal/api/handlers/crowdsec_handler_test.go` - Unit tests for readiness check
|
|
4. ✅ `backend/internal/crowdsec/console_enroll_test.go` - Unit tests for retry logic
|
|
|
|
### Frontend (TypeScript)
|
|
|
|
1. ✅ `frontend/src/pages/Security.tsx` - Update loading messages
|
|
2. ✅ `frontend/src/pages/CrowdSecConfig.tsx` - Improve error messages, add initial delay
|
|
3. ✅ `frontend/src/api/crowdsec.ts` - Update types for lapi_ready field
|
|
|
|
### Testing
|
|
|
|
1. ✅ `scripts/crowdsec_lapi_startup_test.sh` - New integration test
|
|
2. ✅ `.github/workflows/integration-tests.yml` - Add LAPI startup test
|
|
|
|
### Documentation
|
|
|
|
1. ✅ `docs/troubleshooting/crowdsec.md` - Add LAPI initialization guidance
|
|
2. ✅ `docs/security.md` - Update CrowdSec startup behavior documentation
|
|
|
|
---
|
|
|
|
## Conclusion
|
|
|
|
**Root Cause:** Race condition where LAPI status check happens before LAPI completes initialization (5-10 seconds after process start).
|
|
|
|
**Immediate Impact:** Users see misleading "LAPI not running" error despite CrowdSec being enabled.
|
|
|
|
**Proper Fix:** Backend Start() handler should wait for LAPI to be ready before returning success, with appropriate timeouts and error handling.
|
|
|
|
**Alternative Approaches Considered:**
|
|
|
|
1. ❌ Frontend polling only → Still shows error initially
|
|
2. ❌ Increase initial delay → Arbitrary timing, doesn't guarantee readiness
|
|
3. ✅ Backend waits for LAPI → Guarantees LAPI is ready when Start() returns
|
|
|
|
**User Impact After Fix:**
|
|
|
|
- ✅ Enabling CrowdSec takes 5-15 seconds (visible loading spinner)
|
|
- ✅ Config page immediately usable after enable
|
|
- ✅ Console enrollment works without errors
|
|
- ✅ Clear feedback about LAPI status at all times
|
|
|
|
**Confidence Level:** HIGH - Root cause is clearly identified with specific line numbers and timing measurements. Fix is straightforward with low risk.
|