Files
Charon/docs/plans/current_spec.md

468 lines
16 KiB
Markdown

# Security Dashboard Live Logs - Complete Trace Analysis
**Date:** December 16, 2025
**Status:** ✅ ALL ISSUES FIXED & VERIFIED
**Severity:** Was Critical (WebSocket reconnection loop) → Now Resolved
---
## 0. FULL TRACE ANALYSIS
### File-by-File Data Flow
| Step | File | Lines | Purpose | Status |
|------|------|-------|---------|--------|
| 1 | `frontend/src/pages/Security.tsx` | 36, 421 | Renders LiveLogViewer with memoized filters | ✅ Fixed |
| 2 | `frontend/src/components/LiveLogViewer.tsx` | 138-143, 183-268 | Manages WebSocket lifecycle in useEffect | ✅ Fixed |
| 3 | `frontend/src/api/logs.ts` | 177-237 | `connectSecurityLogs()` - builds WS URL with auth | ✅ Working |
| 4 | `backend/internal/api/routes/routes.go` | 373-394 | Registers `/cerberus/logs/ws` in protected group | ✅ Working |
| 5 | `backend/internal/api/middleware/auth.go` | 12-39 | Validates JWT from header/cookie/query param | ✅ Working |
| 6 | `backend/internal/api/handlers/cerberus_logs_ws.go` | 27-120 | WebSocket handler with filter parsing | ✅ Working |
| 7 | `backend/internal/services/log_watcher.go` | 44-237 | Tails Caddy access log, broadcasts to subscribers | ✅ Working |
### Authentication Flow
```text
Frontend Backend
──────── ───────
localStorage.getItem('charon_auth_token')
Query param: ?token=<jwt> ────────► AuthMiddleware:
1. Check Authorization header
2. Check auth_token cookie
3. Check token query param ◄── MATCHES
ValidateToken(jwt) → OK
Upgrade to WebSocket
```
### Logic Gap Analysis
**ANSWER: NO - There is NO logic gap between Frontend and Backend.**
| Question | Answer |
|----------|--------|
| Frontend auth method | Query param `?token=<jwt>` from `localStorage.getItem('charon_auth_token')` |
| Backend auth method | Accepts: Header → Cookie → Query param `token` ✅ |
| Filter params | Both use `source`, `level`, `ip`, `host`, `blocked_only` ✅ |
| Data format | `SecurityLogEntry` struct matches frontend TypeScript type ✅ |
---
## 1. VERIFICATION STATUS
### ✅ localStorage Key IS Correct
Both WebSocket functions in `frontend/src/api/logs.ts` correctly use `charon_auth_token`:
- **Line 119-122** (`connectLiveLogs`): `localStorage.getItem('charon_auth_token')`
- **Line 178-181** (`connectSecurityLogs`): `localStorage.getItem('charon_auth_token')`
---
## 2. ALL ISSUES FOUND (NOW FIXED)
### Issue #1: CRITICAL - Object Reference Instability in Props (ROOT CAUSE) ✅ FIXED
**Problem:** `Security.tsx` passed `securityFilters={{}}` inline, creating a new object on every render. This triggered useEffect cleanup/reconnection on every parent re-render.
**Fix Applied:**
```tsx
// frontend/src/pages/Security.tsx line 36
const emptySecurityFilters = useMemo(() => ({}), [])
// frontend/src/pages/Security.tsx line 421
<LiveLogViewer mode="security" securityFilters={emptySecurityFilters} className="w-full" />
```
### Issue #2: Default Props Had Same Problem ✅ FIXED
**Problem:** Default empty objects `filters = {}` in function params created new objects on each call.
**Fix Applied:**
```typescript
// frontend/src/components/LiveLogViewer.tsx lines 138-143
const EMPTY_LIVE_FILTER: LiveLogFilter = {};
const EMPTY_SECURITY_FILTER: SecurityLogFilter = {};
export function LiveLogViewer({
filters = EMPTY_LIVE_FILTER,
securityFilters = EMPTY_SECURITY_FILTER,
// ...
})
```
### Issue #3: `showBlockedOnly` Toggle (INTENTIONAL)
The `showBlockedOnly` state in useEffect dependencies causes reconnection when toggled. This is **intentional** for server-side filtering - not a bug.
---
## 3. ROOT CAUSE ANALYSIS
### The Reconnection Loop (Before Fix)
1. User navigates to Security Dashboard
2. `Security.tsx` renders with `<LiveLogViewer securityFilters={{}} />`
3. `LiveLogViewer` mounts → useEffect runs → WebSocket connects
4. React Query refetches security status
5. `Security.tsx` re-renders → **new `{}` object created**
6. `LiveLogViewer` re-renders → useEffect sees "changed" `securityFilters`
7. useEffect cleanup runs → **WebSocket closes**
8. useEffect body runs → **WebSocket opens**
9. Repeat steps 4-8 every ~100ms
### Evidence from Docker Logs (Before Fix)
```text
{"level":"info","msg":"Cerberus logs WebSocket connected","subscriber_id":"xxx"}
{"level":"info","msg":"Cerberus logs WebSocket client disconnected","subscriber_id":"xxx"}
{"level":"info","msg":"Cerberus logs WebSocket connected","subscriber_id":"yyy"}
{"level":"info","msg":"Cerberus logs WebSocket client disconnected","subscriber_id":"yyy"}
```
---
## 4. COMPONENT DEEP DIVE
### Frontend: Security.tsx
- Renders the Security Dashboard with 4 security layer cards (CrowdSec, ACL, Coraza, Rate Limiting)
- Contains multiple `useQuery`/`useMutation` hooks that trigger re-renders
- **Line 36:** Creates stable filter reference with `useMemo`
- **Line 421:** Passes stable reference to `LiveLogViewer`
### Frontend: LiveLogViewer.tsx
- Dual-mode log viewer (application logs vs security logs)
- **Lines 138-139:** Stable default filter objects defined outside component
- **Lines 183-268:** useEffect that manages WebSocket lifecycle
- **Line 268:** Dependencies: `[currentMode, filters, securityFilters, maxLogs, showBlockedOnly]`
- Uses `isPausedRef` to avoid reconnection when pausing
### Frontend: logs.ts (API Client)
- **`connectSecurityLogs()`** (lines 177-237):
- Builds URLSearchParams from filter object
- Gets auth token from `localStorage.getItem('charon_auth_token')`
- Appends token as query param
- Constructs URL: `wss://host/api/v1/cerberus/logs/ws?...&token=<jwt>`
### Backend: routes.go
- **Line 380-389:** Creates LogWatcher service pointing to `/var/log/caddy/access.log`
- **Line 393:** Creates `CerberusLogsHandler`
- **Line 394:** Registers route in protected group (auth required)
### Backend: auth.go (Middleware)
- **Lines 14-28:** Auth flow: Header → Cookie → Query param
- **Line 25-28:** Query param fallback: `if token := c.Query("token"); token != ""`
- WebSocket connections use query param auth (browsers can't set headers on WS)
### Backend: cerberus_logs_ws.go (Handler)
- **Lines 42-48:** Upgrades HTTP to WebSocket
- **Lines 53-59:** Parses filter query params
- **Lines 61-62:** Subscribes to LogWatcher
- **Lines 80-109:** Main loop broadcasting filtered entries
### Backend: log_watcher.go (Service)
- Singleton service tailing Caddy access log
- Parses JSON log lines into `SecurityLogEntry`
- Broadcasts to all WebSocket subscribers
- Detects security events (WAF, CrowdSec, ACL, rate limit)
---
## 5. SUMMARY TABLE
| Component | Status | Notes |
|-----------|--------|-------|
| localStorage key | ✅ Fixed | Now uses `charon_auth_token` |
| Auth middleware | ✅ Working | Accepts query param `token` |
| WebSocket endpoint | ✅ Working | Protected route, upgrades correctly |
| LogWatcher service | ✅ Working | Tails access.log successfully |
| **Frontend memoization** | ✅ Fixed | `useMemo` in Security.tsx |
| **Stable default props** | ✅ Fixed | Constants in LiveLogViewer.tsx |
---
## 6. VERIFICATION STEPS
After any changes, verify with:
```bash
# 1. Rebuild and restart
docker build -t charon:local . && docker compose -f docker-compose.override.yml up -d
# 2. Check for stable connection (should see ONE connect, no rapid cycling)
docker logs charon 2>&1 | grep -i "cerberus.*websocket" | tail -10
# 3. Browser DevTools → Console
# Should see: "Cerberus logs WebSocket connection established"
# Should NOT see repeated connection attempts
```
---
## 7. CONCLUSION
**Root Cause:** React reference instability (`{}` creates new object on every render)
**Solution Applied:** Memoize filter objects to maintain stable references
**Logic Gap Between Frontend/Backend:** **NO** - Both are correctly aligned
**Current Status:** ✅ All fixes applied and working
---
# Health Check 401 Auth Failures - Investigation Report
**Date:** December 16, 2025
**Status:** ✅ ANALYZED - NOT A BUG
**Severity:** Informational (Log Noise)
---
## 1. INVESTIGATION SUMMARY
### What the User Observed
The user reported recurring 401 auth failures in Docker logs:
```
01:03:10 AUTH 172.20.0.1 GET / → 401 [401] 133.6ms
{ "auth_failure": true }
01:04:10 AUTH 172.20.0.1 GET / → 401 [401] 112.9ms
{ "auth_failure": true }
```
### Initial Hypothesis vs Reality
| Hypothesis | Reality |
|------------|---------|
| Docker health check hitting `/` | ❌ Docker health check hits `/api/v1/health` and works correctly (200) |
| Charon backend auth issue | ❌ Charon backend auth is working fine |
| Missing health endpoint | ❌ `/api/v1/health` exists and is public |
---
## 2. ROOT CAUSE IDENTIFIED
### The 401s are FROM Plex, NOT Charon
**Evidence from logs:**
```json
{
"host": "plex.hatfieldhosted.com",
"uri": "/",
"status": 401,
"resp_headers": {
"X-Plex-Protocol": ["1.0"],
"X-Plex-Content-Compressed-Length": ["157"],
"Cache-Control": ["no-cache"]
}
}
```
The 401 responses contain **Plex-specific headers** (`X-Plex-Protocol`, `X-Plex-Content-Compressed-Length`). This proves:
1. The request goes through Caddy to **Plex backend**
2. **Plex** returns 401 because the request has no auth token
3. Caddy logs this as a handled request
### What's Making These Requests?
**Charon's Uptime Monitoring Service** (`backend/internal/services/uptime_service.go`)
The `checkMonitor()` function performs HTTP GET requests to proxied hosts:
```go
case "http", "https":
client := http.Client{Timeout: 10 * time.Second}
resp, err := client.Get(monitor.URL) // e.g., https://plex.hatfieldhosted.com/
```
Key behaviors:
- Runs every 60 seconds (`interval: 60`)
- Checks the **public URL** of each proxy host
- Uses `Go-http-client/2.0` User-Agent (visible in logs)
- **Correctly treats 401/403 as "service is up"** (lines 471-474 of uptime_service.go)
---
## 3. ARCHITECTURE FLOW
```text
┌─────────────────────────────────────────────────────────────┐
│ Charon Container (172.20.0.1 from Docker's perspective) │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ │
│ │ Uptime Service │ │
│ │ (Go-http-client/2.0)│ │
│ └──────────┬──────────┘ │
│ │ GET https://plex.hatfieldhosted.com/ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Caddy Reverse Proxy │ │
│ │ (ports 80/443) │ │
│ └──────────┬──────────┘ │
│ │ Logs request to access.log │
└─────────────┼───────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Plex Container (172.20.0.x) │
├─────────────────────────────────────────────────────────────┤
│ GET / → 401 Unauthorized (no X-Plex-Token) │
└─────────────────────────────────────────────────────────────┘
```
---
## 4. DOCKER HEALTH CHECK STATUS
### ✅ Docker Health Check is WORKING CORRECTLY
**Configuration** (from all docker-compose files):
```yaml
healthcheck:
test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:8080/api/v1/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
```
**Evidence:**
```
[GIN] 2025/12/16 - 01:04:45 | 200 | 304.212µs | ::1 | GET "/api/v1/health"
```
- Hits `/api/v1/health` (not `/`)
- Returns `200` (not `401`)
- Source IP is `::1` (localhost)
- Interval is 30s (matches config)
### Health Endpoint Details
**Route Registration** ([routes.go#L86](backend/internal/api/routes/routes.go#L86)):
```go
router.GET("/api/v1/health", handlers.HealthHandler)
```
This is registered **before** any auth middleware, making it a public endpoint.
**Handler Response** ([health_handler.go#L29-L37](backend/internal/api/handlers/health_handler.go#L29-L37)):
```go
func HealthHandler(c *gin.Context) {
c.JSON(http.StatusOK, gin.H{
"status": "ok",
"service": version.Name,
"version": version.Version,
"git_commit": version.GitCommit,
"build_time": version.BuildTime,
"internal_ip": getLocalIP(),
})
}
```
---
## 5. WHY THIS IS NOT A BUG
### Uptime Service Design is Correct
From [uptime_service.go#L471-L474](backend/internal/services/uptime_service.go#L471-L474):
```go
// Accept 2xx, 3xx, and 401/403 (Unauthorized/Forbidden often means the service is up but protected)
if (resp.StatusCode >= 200 && resp.StatusCode < 400) || resp.StatusCode == 401 || resp.StatusCode == 403 {
success = true
msg = fmt.Sprintf("HTTP %d", resp.StatusCode)
}
```
**Rationale:** A 401 response proves:
- The service is running
- The network path is functional
- The application is responding
This is industry-standard practice for uptime monitoring of auth-protected services.
---
## 6. RECOMMENDATIONS
### Option A: Do Nothing (Recommended)
The current behavior is correct:
- Docker health checks work ✅
- Uptime monitoring works ✅
- Plex is correctly marked as "up" despite 401 ✅
The 401s in Caddy access logs are informational noise, not errors.
### Option B: Reduce Log Verbosity (Optional)
If the log noise is undesirable, options include:
1. **Configure Caddy to not log uptime checks:**
Add a log filter for `Go-http-client` User-Agent
2. **Use backend health endpoints:**
Some services like Plex have health endpoints (`/identity`, `/status`) that don't require auth
3. **Add per-monitor health path option:**
Extend `UptimeMonitor` model to allow custom health check paths
### Option C: Already Implemented
The Uptime Service already logs status changes only, not every check:
```go
if statusChanged {
logger.Log().WithFields(map[string]interface{}{
"host_name": host.Name,
// ...
}).Info("Host status changed")
}
```
---
## 7. SUMMARY TABLE
| Question | Answer |
|----------|--------|
| What is making the requests? | Charon's Uptime Service (`Go-http-client/2.0`) |
| Should `/` be accessible without auth? | N/A - this is hitting proxied backends, not Charon |
| Is there a dedicated health endpoint? | Yes: `/api/v1/health` (public, returns 200) |
| Is Docker health check working? | ✅ Yes, every 30s, returns 200 |
| Are the 401s a bug? | ❌ No, they're expected from auth-protected backends |
| What's the fix? | None needed - working as designed |
---
## 8. CONCLUSION
**The 401s are NOT from Docker health checks or Charon auth failures.**
They are normal responses from **auth-protected backend services** (like Plex) being monitored by Charon's uptime service. The uptime service correctly interprets 401/403 as "service is up but requires authentication."
**No fix required.** The system is working as designed.