Charon/docs/plans/archive/crowdsec_lapi_error_diagnostic.md
akanealw eec8c28fb3
changed perms
2026-04-22 18:19:14 +00:00


CrowdSec LAPI Availability Error - Root Cause Analysis & Fix Plan

Date: December 14, 2025
Issue: "CrowdSec Local API is not running" error in Console Enrollment, despite Security dashboard showing CrowdSec toggle ON
Status: 🎯 ROOT CAUSE IDENTIFIED - Docker entrypoint doesn't start LAPI; backend Start() handler timing issue
Priority: HIGH (Blocks Console Enrollment Feature)


Executive Summary

The user reports seeing the error "CrowdSec Local API is not running" in the CrowdSec dashboard enrollment section, even though the Security dashboard shows ALL security toggles are ON (including CrowdSec).

Root Cause Identified: After implementation of the GUI control fix (removing environment variable dependency), the system now has a race condition where:

  1. docker-entrypoint.sh does not auto-start CrowdSec (correct behavior)
  2. User toggles CrowdSec ON in Security dashboard
  3. Frontend calls /api/v1/admin/crowdsec/start
  4. Backend Start() handler executes and returns success
  5. BUT LAPI takes 5-10 seconds to fully initialize
  6. User immediately navigates to CrowdSecConfig page
  7. Frontend checks LAPI status via statusCrowdsec() query
  8. LAPI not yet available → Shows error message

The issue is NOT that LAPI doesn't start - it's that the check happens too early before LAPI has time to fully initialize.


Investigation Findings

1. Docker Entrypoint Analysis

File: docker-entrypoint.sh

Current Behavior (CORRECT):

# CrowdSec Lifecycle Management:
# CrowdSec configuration is initialized above (symlinks, directories, hub updates)
# However, the CrowdSec agent is NOT auto-started in the entrypoint.
# Instead, CrowdSec lifecycle is managed by the backend handlers via GUI controls.
echo "CrowdSec configuration initialized. Agent lifecycle is GUI-controlled."

Analysis:

  • No longer checks environment variables
  • Initializes config directories and symlinks
  • Does NOT auto-start CrowdSec agent
  • Correctly delegates lifecycle to backend handlers

Verdict: Entrypoint is working correctly - it should NOT start LAPI at container startup.


2. Backend Start() Handler Analysis

File: backend/internal/api/handlers/crowdsec_handler.go

Implementation:

func (h *CrowdsecHandler) Start(c *gin.Context) {
    ctx := c.Request.Context()
    pid, err := h.Executor.Start(ctx, h.BinPath, h.DataDir)
    if err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
        return
    }
    c.JSON(http.StatusOK, gin.H{"status": "started", "pid": pid})
}

Executor Implementation:

// backend/internal/api/handlers/crowdsec_exec.go
func (e *DefaultCrowdsecExecutor) Start(ctx context.Context, binPath, configDir string) (int, error) {
    cmd := exec.CommandContext(ctx, binPath, "--config-dir", configDir)
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    if err := cmd.Start(); err != nil {
        return 0, err
    }
    pid := cmd.Process.Pid
    // write pid file
    if err := os.WriteFile(e.pidFile(configDir), []byte(strconv.Itoa(pid)), 0o644); err != nil {
        return pid, fmt.Errorf("failed to write pid file: %w", err)
    }
    // wait in background
    go func() {
        _ = cmd.Wait()
        _ = os.Remove(e.pidFile(configDir))
    }()
    return pid, nil
}

Analysis:

  • Correctly starts CrowdSec process with cmd.Start()
  • Returns immediately after process starts (doesn't wait for LAPI)
  • Writes PID file for status tracking
  • ⚠️ Does NOT wait for LAPI to be ready
  • ⚠️ Returns success as soon as process starts

Verdict: Handler starts the process correctly but doesn't verify LAPI availability.
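
To make the gap between "process started" and "API ready" concrete, here is a minimal, self-contained sketch. It does not use CrowdSec at all: slowService, start, and isReady are hypothetical stand-ins for the daemon, cmd.Start(), and a `cscli lapi status` probe respectively.

```go
package main

import (
	"fmt"
	"time"
)

// slowService is a hypothetical stand-in for a daemon such as CrowdSec:
// start() returns as soon as the "process" exists, but its API only becomes
// ready after initDelay.
type slowService struct {
	ready chan struct{} // closed once initialization has finished
}

// start launches initialization in the background and returns immediately,
// which mirrors how cmd.Start() returns before the child is ready to serve.
func start(initDelay time.Duration) *slowService {
	s := &slowService{ready: make(chan struct{})}
	go func() {
		time.Sleep(initDelay) // stands in for config loading, HTTP server startup, etc.
		close(s.ready)
	}()
	return s
}

// isReady is a non-blocking readiness probe.
func (s *slowService) isReady() bool {
	select {
	case <-s.ready:
		return true
	default:
		return false
	}
}

// probeAfterStart starts the service, probes once immediately, waits for
// initialization to finish, and probes again.
func probeAfterStart(initDelay time.Duration) (immediate, later bool) {
	s := start(initDelay)
	immediate = s.isReady()
	<-s.ready
	later = s.isReady()
	return immediate, later
}

func main() {
	immediate, later := probeAfterStart(200 * time.Millisecond)
	fmt.Printf("ready right after start: %v, ready after init: %v\n", immediate, later)
}
```

A successful start only says the process exists. Readiness has to be observed separately, which is the step the current handler skips.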


3. LAPI Availability Check Analysis

File: backend/internal/crowdsec/console_enroll.go

Implementation:

// checkLAPIAvailable verifies that CrowdSec Local API is running and reachable.
// This is critical for console enrollment as the enrollment process requires LAPI.
func (s *ConsoleEnrollmentService) checkLAPIAvailable(ctx context.Context) error {
    args := []string{"lapi", "status"}
    if _, err := os.Stat(filepath.Join(s.dataDir, "config.yaml")); err == nil {
        args = append([]string{"-c", filepath.Join(s.dataDir, "config.yaml")}, args...)
    }
    _, err := s.exec.ExecuteWithEnv(ctx, "cscli", args, nil)
    if err != nil {
        return fmt.Errorf("CrowdSec Local API is not running - please enable CrowdSec via the Security dashboard first")
    }
    return nil
}

Usage in Enroll():

// CRITICAL: Check that LAPI is running before attempting enrollment
// Console enrollment requires an active LAPI connection to register with crowdsec.net
if err := s.checkLAPIAvailable(ctx); err != nil {
    return ConsoleEnrollmentStatus{}, err
}

Analysis:

  • Check is implemented correctly
  • Calls cscli lapi status to verify connectivity
  • Returns clear error message
  • ⚠️ Check happens immediately when enrollment is attempted
  • ⚠️ No retry logic or waiting for LAPI to become available

Verdict: Check is correct but happens too early in the user flow.
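
The current check returns the same message for every failure and drops the underlying error. One option, sketched below as an assumption rather than the project's code, is a sentinel error that callers can test with errors.Is while the cause stays in the message. ErrLAPINotReady, checkLAPI, isNotReady, and the probe callback are all hypothetical names; the probe stands in for running `cscli lapi status`.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrLAPINotReady is a hypothetical sentinel; it does not exist in the original code.
var ErrLAPINotReady = errors.New("CrowdSec Local API is not ready")

// errRefused simulates the failure a probe might see while LAPI is still starting.
var errRefused = errors.New("connection refused")

// checkLAPI runs probe (a stand-in for executing `cscli lapi status`) and wraps
// any failure with ErrLAPINotReady, keeping the cause in the error text.
func checkLAPI(probe func() error) error {
	if err := probe(); err != nil {
		return fmt.Errorf("%w: %v", ErrLAPINotReady, err)
	}
	return nil
}

// isNotReady reports whether err was produced by a failed readiness check.
func isNotReady(err error) bool {
	return errors.Is(err, ErrLAPINotReady)
}

func main() {
	err := checkLAPI(func() error { return errRefused })
	if isNotReady(err) {
		fmt.Println("retry later:", err)
	}
}
```

With a sentinel, a caller could decide to retry or show an "initializing" message without matching on error strings.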


4. Frontend Security Dashboard Analysis

File: frontend/src/pages/Security.tsx

Toggle Implementation:

const crowdsecPowerMutation = useMutation({
  mutationFn: async (enabled: boolean) => {
    await updateSetting('security.crowdsec.enabled', enabled ? 'true' : 'false', 'security', 'bool')
    if (enabled) {
      await startCrowdsec()  // Calls /api/v1/admin/crowdsec/start
    } else {
      await stopCrowdsec()   // Calls /api/v1/admin/crowdsec/stop
    }
    return enabled
  },
  onSuccess: async (enabled: boolean) => {
    await fetchCrowdsecStatus()
    queryClient.invalidateQueries({ queryKey: ['security-status'] })
    queryClient.invalidateQueries({ queryKey: ['settings'] })
    toast.success(enabled ? 'CrowdSec started' : 'CrowdSec stopped')
  },
})

Analysis:

  • Correctly calls backend Start() endpoint
  • Updates database setting
  • Shows success toast
  • ⚠️ Does NOT wait for LAPI to be ready
  • ⚠️ User can immediately navigate to CrowdSecConfig page

Verdict: Frontend correctly calls the API but doesn't account for LAPI startup time.


5. Frontend CrowdSecConfig Page Analysis

File: frontend/src/pages/CrowdSecConfig.tsx

LAPI Status Check:

// Add LAPI status check with polling
const lapiStatusQuery = useQuery({
  queryKey: ['crowdsec-lapi-status'],
  queryFn: statusCrowdsec,
  enabled: consoleEnrollmentEnabled,
  refetchInterval: 5000, // Poll every 5 seconds
  retry: false,
})

Error Display:

{!lapiStatusQuery.data?.running && (
  <div className="flex items-start gap-3 p-4 bg-yellow-900/20 border border-yellow-700/50 rounded-lg" data-testid="lapi-warning">
    <AlertTriangle className="w-5 h-5 text-yellow-400 flex-shrink-0 mt-0.5" />
    <div className="flex-1">
      <p className="text-sm text-yellow-200 font-medium mb-2">
        CrowdSec Local API is not running
      </p>
      <p className="text-xs text-yellow-300 mb-3">
        Please enable CrowdSec using the toggle switch in the Security dashboard before enrolling in the Console.
      </p>
      <Button
        variant="secondary"
        size="sm"
        onClick={() => navigate('/security')}
      >
        Go to Security Dashboard
      </Button>
    </div>
  </div>
)}

Analysis:

  • Polls LAPI status every 5 seconds
  • Shows warning when LAPI not available
  • ⚠️ Initial query runs immediately on page load
  • ⚠️ If user navigates from Security → CrowdSecConfig quickly, LAPI may not be ready yet
  • ⚠️ Error message tells user to go back to Security dashboard (confusing when toggle is already ON)

Verdict: Status check works correctly but timing causes false negatives.


6. API Client Analysis

File: frontend/src/api/crowdsec.ts

Implementation:

export async function startCrowdsec() {
  const resp = await client.post('/admin/crowdsec/start')
  return resp.data
}

export async function statusCrowdsec() {
  const resp = await client.get('/admin/crowdsec/status')
  return resp.data
}

Analysis:

  • Simple API wrappers
  • No error handling here (handled by callers)
  • ⚠️ No built-in retry or polling logic

Verdict: API client is minimal and correct for its scope.


Root Cause Summary

The Problem

Race Condition Flow:

User toggles CrowdSec ON
         ↓
Frontend calls /api/v1/admin/crowdsec/start
         ↓
Backend starts CrowdSec process (returns PID immediately)
         ↓
Frontend shows "CrowdSec started" toast
         ↓
User clicks "Config" → navigates to /security/crowdsec
         ↓
CrowdSecConfig page loads
         ↓
lapiStatusQuery executes statusCrowdsec()
         ↓
Backend calls: cscli lapi status
         ↓
LAPI NOT READY YET (still initializing)
         ↓
Returns: running=false
         ↓
Frontend shows: "CrowdSec Local API is not running"

Timing Breakdown:

  • cmd.Start() returns: ~100ms (process started)
  • LAPI initialization: 5-10 seconds (reading config, starting HTTP server, registering with CAPI)
  • User navigation: ~1 second (clicks Config link)
  • Status check: ~100ms (queries LAPI)

Result: Status check happens 4-9 seconds before LAPI is ready.
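
The 4-9 second figure follows from the numbers in the timing breakdown. As a rough check, assuming all times are measured from the moment the toggle is clicked and ignoring the ~100ms the status query itself takes (gap is a hypothetical helper):

```go
package main

import (
	"fmt"
	"time"
)

// gap returns how long before LAPI readiness the first status check fires.
// All durations are measured from the moment the toggle is clicked.
func gap(startReturns, navigation, lapiReady time.Duration) time.Duration {
	firstCheck := startReturns + navigation
	return lapiReady - firstCheck
}

func main() {
	startReturns := 100 * time.Millisecond // cmd.Start() returns
	navigation := 1 * time.Second          // user clicks through to the config page
	for _, ready := range []time.Duration{5 * time.Second, 10 * time.Second} {
		fmt.Printf("LAPI ready at %v: first check is %v early\n",
			ready, gap(startReturns, navigation, ready))
	}
	// Prints:
	// LAPI ready at 5s: first check is 3.9s early
	// LAPI ready at 10s: first check is 8.9s early
}
```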


Why This Happens

1. Backend Start() Returns Too Early

The Start() handler returns as soon as the process starts, not when LAPI is ready:

if err := cmd.Start(); err != nil {
    return 0, err
}
// Returns immediately - process started but LAPI not ready!
return pid, nil

2. Frontend Doesn't Wait for LAPI

The mutation completes when the backend returns, not when LAPI is ready:

if (enabled) {
  await startCrowdsec()  // Returns when process starts, not when LAPI ready
}

3. CrowdSecConfig Page Checks Immediately

The page loads and immediately checks LAPI status:

const lapiStatusQuery = useQuery({
  queryKey: ['crowdsec-lapi-status'],
  queryFn: statusCrowdsec,
  enabled: consoleEnrollmentEnabled,
  // Runs on page load - LAPI might not be ready yet!
})

4. Error Message is Misleading

The warning says "Please enable CrowdSec using the toggle switch" but the toggle IS already ON. The real issue is that LAPI needs more time to initialize.


Hypothesis Validation

Hypothesis 1: Backend Start() Not Working

Result: Disproven

  • Start() handler correctly starts the process
  • PID file is created
  • Process runs in background

Hypothesis 2: Frontend Not Calling Correct Endpoint

Result: Disproven

  • Frontend correctly calls /api/v1/admin/crowdsec/start
  • Mutation properly awaits the API call

Hypothesis 3: LAPI Never Starts

Result: Disproven

  • LAPI does start and become available
  • Status check succeeds after waiting ~10 seconds

Hypothesis 4: Race Condition Between Start and Check

Result: CONFIRMED

  • User navigates to config page too quickly
  • LAPI status check happens before initialization completes
  • Error persists until the page is refreshed or the next polling interval fires

Hypothesis 5: Error State Persisting

Result: Disproven

  • Query has refetchInterval: 5000
  • Error clears automatically once LAPI is ready
  • Problem is initial false negative

Detailed Fix Plan

Fix 1: Add LAPI Health Check to Backend Start() Handler

Priority: HIGH
Impact: Ensures Start() doesn't return until LAPI is ready
Time: 45 minutes

File: backend/internal/api/handlers/crowdsec_handler.go

Implementation:

func (h *CrowdsecHandler) Start(c *gin.Context) {
    ctx := c.Request.Context()

    // Start the process
    pid, err := h.Executor.Start(ctx, h.BinPath, h.DataDir)
    if err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
        return
    }

    // Wait for LAPI to be ready (with timeout)
    lapiReady := false
    maxWait := 30 * time.Second
    pollInterval := 500 * time.Millisecond
    deadline := time.Now().Add(maxWait)

    for time.Now().Before(deadline) {
        // Check LAPI status using cscli
        args := []string{"lapi", "status"}
        if _, err := os.Stat(filepath.Join(h.DataDir, "config.yaml")); err == nil {
            args = append([]string{"-c", filepath.Join(h.DataDir, "config.yaml")}, args...)
        }

        checkCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
        _, err := h.CmdExec.Execute(checkCtx, "cscli", args...)
        cancel()

        if err == nil {
            lapiReady = true
            break
        }

        time.Sleep(pollInterval)
    }

    if !lapiReady {
        logger.Log().WithField("pid", pid).Warn("CrowdSec started but LAPI not ready within timeout")
        c.JSON(http.StatusOK, gin.H{
            "status":     "started",
            "pid":        pid,
            "lapi_ready": false,
            "warning":    "Process started but LAPI initialization may take additional time",
        })
        return
    }

    logger.Log().WithField("pid", pid).Info("CrowdSec started and LAPI is ready")
    c.JSON(http.StatusOK, gin.H{
        "status":     "started",
        "pid":        pid,
        "lapi_ready": true,
    })
}

Benefits:

  • Start() doesn't return until LAPI is ready (or the 30-second timeout elapses)
  • Frontend knows LAPI is available before navigating
  • Timeout prevents hanging if LAPI fails to start
  • Clear logging for diagnostics
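
Since the frontend branches on lapi_ready, it may help to pin the response shape down in one place rather than repeating gin.H literals. The following is a sketch under that assumption; StartResponse, newStartResponse, and encode are hypothetical names, and the field names and warning text are taken from the handler above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StartResponse is a hypothetical typed form of the JSON body returned by Start().
type StartResponse struct {
	Status    string `json:"status"`
	PID       int    `json:"pid"`
	LAPIReady bool   `json:"lapi_ready"`
	Warning   string `json:"warning,omitempty"` // omitted when LAPI is ready
}

// newStartResponse builds the body for both the ready and the timed-out case.
func newStartResponse(pid int, lapiReady bool) StartResponse {
	resp := StartResponse{Status: "started", PID: pid, LAPIReady: lapiReady}
	if !lapiReady {
		resp.Warning = "Process started but LAPI initialization may take additional time"
	}
	return resp
}

// encode marshals the response to a JSON string; it returns "" on error.
func encode(resp StartResponse) string {
	b, err := json.Marshal(resp)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(encode(newStartResponse(1234, true)))
	fmt.Println(encode(newStartResponse(1234, false)))
}
```

In a gin handler, a struct like this can be passed straight to c.JSON in place of gin.H.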

Trade-offs:

  • ⚠️ Start() takes 5-10 seconds instead of returning immediately
  • ⚠️ User sees loading spinner for longer
  • ⚠️ Risk of timeout if LAPI is slow to start
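
The loop above sleeps with time.Sleep, so it keeps polling even if the request context is cancelled. A possible refinement, shown here as a sketch and not as the project's implementation, is to pull the wait into a small helper that honors the context. waitUntilReady and demo are hypothetical names, and the probe callback stands in for the `cscli lapi status` call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitUntilReady calls probe immediately and then once per interval until it
// returns nil or ctx is done. It returns nil on success, or ctx.Err() if the
// deadline passes or the caller cancels first.
func waitUntilReady(ctx context.Context, interval time.Duration, probe func(context.Context) error) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if err := probe(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// demo runs waitUntilReady against a fake probe that succeeds on call number
// succeedOn, and reports how many probe calls were made.
func demo(succeedOn int, maxWait time.Duration) (calls int, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), maxWait)
	defer cancel()
	err = waitUntilReady(ctx, 10*time.Millisecond, func(context.Context) error {
		calls++
		if calls < succeedOn {
			return errors.New("connection refused")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := demo(3, time.Second)
	fmt.Println(calls, err) // 3 <nil>

	_, err = demo(1000, 50*time.Millisecond)
	fmt.Println(errors.Is(err, context.DeadlineExceeded)) // true
}
```

In the handler, the context would come from context.WithTimeout with the 30-second maxWait and the interval would be the 500ms pollInterval, so a single deadline covers the whole wait.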

Fix 2: Update Frontend to Show Better Loading State

Priority: HIGH
Impact: User understands that LAPI is initializing
Time: 30 minutes

File: frontend/src/pages/Security.tsx

Implementation:

const crowdsecPowerMutation = useMutation({
  mutationFn: async (enabled: boolean) => {
    await updateSetting('security.crowdsec.enabled', enabled ? 'true' : 'false', 'security', 'bool')
    if (enabled) {
      // Tell the user up front that startup is not instant
      toast.info('Starting CrowdSec... This may take up to 30 seconds')
      const result = await startCrowdsec()
      // Pass LAPI readiness on to onSuccess, which shows the final toast
      return { enabled, lapiReady: result?.lapi_ready as boolean | undefined }
    }
    await stopCrowdsec()
    return { enabled, lapiReady: undefined }
  },
  onSuccess: async ({ enabled, lapiReady }) => {
    await fetchCrowdsecStatus()
    queryClient.invalidateQueries({ queryKey: ['security-status'] })
    queryClient.invalidateQueries({ queryKey: ['settings'] })

    if (!enabled) {
      toast.success('CrowdSec stopped')
    } else if (lapiReady === true) {
      toast.success('CrowdSec started and LAPI is ready')
    } else if (lapiReady === false) {
      toast.warning('CrowdSec started but LAPI is still initializing. Please wait before enrolling.')
    } else {
      // Older backend without the lapi_ready field
      toast.success('CrowdSec started')
    }
  },
})

Benefits:

  • User knows LAPI initialization takes time
  • Clear feedback about LAPI readiness
  • Prevents premature navigation to config page

Fix 3: Improve Error Message in CrowdSecConfig Page

Priority: MEDIUM
Impact: Users understand the real issue
Time: 15 minutes

File: frontend/src/pages/CrowdSecConfig.tsx

Implementation:

{!lapiStatusQuery.data?.running && (
  <div className="flex items-start gap-3 p-4 bg-yellow-900/20 border border-yellow-700/50 rounded-lg" data-testid="lapi-warning">
    <AlertTriangle className="w-5 h-5 text-yellow-400 flex-shrink-0 mt-0.5" />
    <div className="flex-1">
      <p className="text-sm text-yellow-200 font-medium mb-2">
        CrowdSec Local API is initializing...
      </p>
      <p className="text-xs text-yellow-300 mb-3">
        The CrowdSec process is running but the Local API (LAPI) is still starting up.
        This typically takes 5-10 seconds after enabling CrowdSec.
        {lapiStatusQuery.isRefetching ? ' Checking now...' : ' Status is re-checked automatically every 5 seconds.'}
      </p>
      <div className="flex gap-2">
        <Button
          variant="secondary"
          size="sm"
          onClick={() => lapiStatusQuery.refetch()}
          disabled={lapiStatusQuery.isRefetching}
        >
          Check Now
        </Button>
        {!status?.crowdsec?.enabled && (
          <Button
            variant="secondary"
            size="sm"
            onClick={() => navigate('/security')}
          >
            Go to Security Dashboard
          </Button>
        )}
      </div>
    </div>
  </div>
)}

Benefits:

  • More accurate description of the issue
  • Explains that LAPI is initializing (not disabled)
  • Shows when auto-retry will happen
  • Manual retry button for impatient users
  • Only suggests going to Security dashboard if CrowdSec is actually disabled

Fix 4: Add Initial Delay to lapiStatusQuery

Priority: LOW
Impact: Reduces false negative on first check
Time: 10 minutes

File: frontend/src/pages/CrowdSecConfig.tsx

Implementation:

const [initialCheckComplete, setInitialCheckComplete] = useState(false)

// Add initial delay to avoid false negative when LAPI is starting
useEffect(() => {
  if (consoleEnrollmentEnabled && !initialCheckComplete) {
    const timer = setTimeout(() => {
      setInitialCheckComplete(true)
    }, 3000) // Wait 3 seconds before first check
    return () => clearTimeout(timer)
  }
}, [consoleEnrollmentEnabled, initialCheckComplete])

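// Note: while the query is disabled, lapiStatusQuery.data is undefined, so the
// warning banner's condition (!lapiStatusQuery.data?.running) must also be gated
// on initialCheckComplete; otherwise the banner still renders during the delay.
const lapiStatusQuery = useQuery({
  queryKey: ['crowdsec-lapi-status'],
  queryFn: statusCrowdsec,
  enabled: consoleEnrollmentEnabled && initialCheckComplete,
  refetchInterval: 5000,
  retry: false,
})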

Benefits:

  • Reduces chance of false negative on page load
  • Gives LAPI a few seconds to initialize
  • Still checks regularly via refetchInterval

Fix 5: Add Retry Logic to Console Enrollment

Priority: LOW (Nice to have)
Impact: Auto-retry if LAPI check fails initially
Time: 20 minutes

File: backend/internal/crowdsec/console_enroll.go

Implementation:

func (s *ConsoleEnrollmentService) checkLAPIAvailable(ctx context.Context) error {
    maxRetries := 3
    retryDelay := 2 * time.Second

    var lastErr error
    for i := 0; i < maxRetries; i++ {
        args := []string{"lapi", "status"}
        if _, err := os.Stat(filepath.Join(s.dataDir, "config.yaml")); err == nil {
            args = append([]string{"-c", filepath.Join(s.dataDir, "config.yaml")}, args...)
        }

        checkCtx, cancel := context.WithTimeout(ctx, 3*time.Second)
        _, err := s.exec.ExecuteWithEnv(checkCtx, "cscli", args, nil)
        cancel()

        if err == nil {
            return nil // LAPI is available
        }

        lastErr = err
        if i < maxRetries-1 {
            logger.Log().WithError(err).WithField("attempt", i+1).Debug("LAPI not ready, retrying")
            time.Sleep(retryDelay)
        }
    }

    return fmt.Errorf("CrowdSec Local API is not running after %d attempts - please wait for LAPI to initialize (typically 5-10 seconds after enabling CrowdSec): %w", maxRetries, lastErr)
}

Benefits:

  • Handles race condition at enrollment time
  • More user-friendly (auto-retry instead of manual retry)
  • Better error message with context
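
Because the retry loop calls time.Sleep directly, a unit test of the failure path has to wait out the real delays. A minimal sketch of one way around that, assuming the retry logic is factored into a helper with an injectable sleep function (retry, noSleep, and failingProbe are hypothetical names, not the project's code):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retry calls fn up to attempts times, calling sleep(delay) between failed
// attempts. It returns nil on the first success; otherwise it returns the last
// error wrapped with the attempt count.
func retry(attempts int, delay time.Duration, sleep func(time.Duration), fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = fn(); lastErr == nil {
			return nil
		}
		if i < attempts-1 {
			sleep(delay)
		}
	}
	return fmt.Errorf("not ready after %d attempts: %w", attempts, lastErr)
}

// noSleep lets tests skip the real waiting; production code would pass time.Sleep.
func noSleep(time.Duration) {}

// failingProbe returns a probe that fails its first `failures` calls, plus a
// pointer to the number of calls made so far.
func failingProbe(failures int) (probe func() error, calls *int) {
	n := 0
	return func() error {
		n++
		if n <= failures {
			return errors.New("connection refused")
		}
		return nil
	}, &n
}

func main() {
	probe, calls := failingProbe(2)
	err := retry(3, 2*time.Second, noSleep, probe)
	fmt.Println(err, *calls) // <nil> 3

	probe, calls = failingProbe(5)
	err = retry(3, 2*time.Second, noSleep, probe)
	fmt.Println(err, *calls) // not ready after 3 attempts: connection refused 3
}
```

With the values used in Fix 5 (3 attempts, a 3-second timeout per check, 2 seconds between attempts), the worst case is 3×3 + 2×2 = 13 seconds before the enrollment call reports failure.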

Testing Plan

Unit Tests

File: backend/internal/api/handlers/crowdsec_handler_test.go

Add test for LAPI readiness check:

func TestCrowdsecHandler_StartWaitsForLAPI(t *testing.T) {
    // Mock executor that simulates slow LAPI startup
    mockExec := &mockExecutor{
        startDelay: 5 * time.Second, // Simulate LAPI taking 5 seconds
    }

    handler := NewCrowdsecHandler(db, mockExec, "/usr/bin/crowdsec", "/app/data")

    // Call Start() and measure time
    start := time.Now()
    w := httptest.NewRecorder()
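    c, _ := gin.CreateTestContext(w)
    // The handler reads c.Request.Context(), so the test context needs a request
    c.Request = httptest.NewRequest(http.MethodPost, "/api/v1/admin/crowdsec/start", nil)
    handler.Start(c)
    duration := time.Since(start)

    // Verify it waited for LAPI
    assert.GreaterOrEqual(t, duration, 5*time.Second)
    assert.Equal(t, http.StatusOK, w.Code)

    var response map[string]interface{}
    require.NoError(t, json.Unmarshal(w.Body.Bytes(), &response))
    // Compare as interface{} so a missing field fails the test instead of panicking
    assert.Equal(t, true, response["lapi_ready"])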
}

File: backend/internal/crowdsec/console_enroll_test.go

Add test for retry logic:

func TestCheckLAPIAvailable_Retries(t *testing.T) {
    callCount := 0
    mockExec := &mockExecutor{
        onExecute: func() error {
            callCount++
            if callCount < 3 {
                return errors.New("connection refused")
            }
            return nil // Success on 3rd attempt
        },
    }

    svc := NewConsoleEnrollmentService(db, mockExec, tempDir, "secret")
    err := svc.checkLAPIAvailable(context.Background())

    assert.NoError(t, err)
    assert.Equal(t, 3, callCount)
}

Integration Tests

File: scripts/crowdsec_lapi_startup_test.sh

#!/bin/bash
# Test LAPI availability after GUI toggle

set -e

echo "Starting Charon..."
docker compose up -d
sleep 5

echo "Enabling CrowdSec via API..."
TOKEN=$(docker exec charon cat /app/.test-token)
curl -fsS -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"key":"security.crowdsec.enabled","value":"true","category":"security","type":"bool"}' \
  http://localhost:8080/api/v1/admin/settings

echo "Calling start endpoint..."
START_TIME=$(date +%s)
curl -fsS -X POST -H "Authorization: Bearer $TOKEN" \
  http://localhost:8080/api/v1/admin/crowdsec/start
END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))

echo "Start endpoint took ${DURATION} seconds"

# Verify LAPI is immediately available after Start() returns
docker exec charon cscli lapi status | grep "successfully interact"
echo "✓ LAPI available immediately after Start() returns"

# Verify Start() took reasonable time (5-30 seconds)
if [ $DURATION -lt 5 ]; then
  echo "✗ Start() returned too quickly (${DURATION}s) - may not be waiting for LAPI"
  exit 1
fi
if [ $DURATION -gt 30 ]; then
  echo "✗ Start() took too long (${DURATION}s) - timeout may be too high"
  exit 1
fi

echo "✓ Start() waited appropriate time for LAPI (${DURATION}s)"
echo "✅ All LAPI startup tests passed"

Manual Testing Procedure

  1. Clean Environment:

    docker compose down -v
    docker compose up -d
    
  2. Verify CrowdSec Disabled:

    • Open Charon UI → Security dashboard
    • Verify CrowdSec toggle is OFF
    • Navigate to CrowdSec config page
    • Should show warning to enable CrowdSec
  3. Enable CrowdSec:

    • Go back to Security dashboard
    • Toggle CrowdSec ON
    • Observe loading spinner (should take 5-15 seconds)
    • Toast should say "CrowdSec started and LAPI is ready"
  4. Immediate Navigation Test:

    • Click "Config" button immediately after toast
    • CrowdSecConfig page should NOT show "LAPI not running" error
    • Console enrollment section should be enabled
  5. Enrollment Test:

    • Enter enrollment token
    • Submit enrollment
    • Should succeed without "LAPI not running" error
  6. Disable/Enable Cycle:

    • Toggle CrowdSec OFF
    • Wait 5 seconds
    • Toggle CrowdSec ON
    • Navigate to config page immediately
    • Verify no LAPI error

Success Criteria

Must Have (Blocking)

  • Backend Start() waits for LAPI before returning
  • Frontend shows appropriate loading state during startup
  • No false "LAPI not running" errors when CrowdSec is enabled
  • Console enrollment works immediately after enabling CrowdSec

Should Have (Important)

  • Improved error messages explaining LAPI initialization
  • Manual "Check Now" button for impatient users
  • Clear feedback when LAPI is ready vs. initializing
  • Unit tests for LAPI readiness logic

Nice to Have (Enhancement)

  • ☐ Retry logic in console enrollment check
  • ☐ Progress indicator showing LAPI initialization stages
  • ☐ Telemetry for LAPI startup time metrics

Risk Assessment

Low Risk

  • Error message improvements (cosmetic only)
  • Frontend loading state changes (UX improvement)
  • Unit tests (no production impact)

Medium Risk

  • ⚠️ Backend Start() timeout logic (could cause hangs if misconfigured)
  • ⚠️ Initial delay in status check (affects UX timing)

High Risk

  • ⚠️ LAPI health check in Start() (could block startup if check is flawed)

Mitigation Strategies

  1. Timeout Protection: Max 30 seconds for LAPI readiness check
  2. Graceful Degradation: Return warning if LAPI not ready, don't fail startup
  3. Thorough Testing: Integration tests verify behavior in clean environment
  4. Rollback Plan: Can remove LAPI check from Start() if issues arise

Rollback Plan

If fixes cause problems:

  1. Immediate Rollback:

    • Remove LAPI check from Start() handler
    • Revert to previous error message
    • Deploy hotfix
  2. Fallback Behavior:

    • Start() returns immediately (old behavior)
    • Users wait for LAPI manually
    • Error message guides them
  3. Testing Before Rollback:

    • Check logs for timeout errors
    • Verify LAPI actually starts eventually
    • Ensure no process hangs

Implementation Timeline

Phase 1: Backend Changes (Day 1)

  • Add LAPI health check to Start() handler (45 min)
  • Add retry logic to enrollment check (20 min)
  • Write unit tests (30 min)
  • Test locally (30 min)

Phase 2: Frontend Changes (Day 1)

  • Update loading messages (15 min)
  • Improve error messages (15 min)
  • Add initial delay to query (10 min)
  • Test manually (20 min)

Phase 3: Integration Testing (Day 2)

  • Write integration test script (30 min)
  • Run full test suite (30 min)
  • Fix any issues found (1-2 hours)

Phase 4: Documentation & Deployment (Day 2)

  • Update troubleshooting docs (20 min)
  • Create PR with detailed description (15 min)
  • Code review (30 min)
  • Deploy to production (30 min)

Total Estimated Time: 2 days


Files Requiring Changes

Backend (Go)

  1. backend/internal/api/handlers/crowdsec_handler.go - Add LAPI readiness check to Start()
  2. backend/internal/crowdsec/console_enroll.go - Add retry logic to checkLAPIAvailable()
  3. backend/internal/api/handlers/crowdsec_handler_test.go - Unit tests for readiness check
  4. backend/internal/crowdsec/console_enroll_test.go - Unit tests for retry logic

Frontend (TypeScript)

  1. frontend/src/pages/Security.tsx - Update loading messages
  2. frontend/src/pages/CrowdSecConfig.tsx - Improve error messages, add initial delay
  3. frontend/src/api/crowdsec.ts - Update types for lapi_ready field

Testing

  1. scripts/crowdsec_lapi_startup_test.sh - New integration test
  2. .github/workflows/integration-tests.yml - Add LAPI startup test

Documentation

  1. docs/troubleshooting/crowdsec.md - Add LAPI initialization guidance
  2. docs/security.md - Update CrowdSec startup behavior documentation

Conclusion

Root Cause: Race condition where LAPI status check happens before LAPI completes initialization (5-10 seconds after process start).

Immediate Impact: Users see misleading "LAPI not running" error despite CrowdSec being enabled.

Proper Fix: Backend Start() handler should wait for LAPI to be ready before returning success, with appropriate timeouts and error handling.

Alternative Approaches Considered:

  1. Frontend polling only → Still shows error initially
  2. Increase initial delay → Arbitrary timing, doesn't guarantee readiness
  3. Backend waits for LAPI → Guarantees LAPI is ready when Start() returns

User Impact After Fix:

  • Enabling CrowdSec takes 5-15 seconds (visible loading spinner)
  • Config page immediately usable after enable
  • Console enrollment works without errors
  • Clear feedback about LAPI status at all times

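Confidence Level: HIGH - The root cause is traced to specific files and handlers, and the timing breakdown accounts for the observed error. The fix is straightforward; its main risk, blocking inside Start(), is bounded by the 30-second timeout and the graceful-degradation response.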