# Break Glass Protocol Redesign - Root Cause Analysis & 3-Tier Architecture

**Date:** January 26, 2026
**Status:** Analysis Complete - Implementation Pending
**Priority:** 🔴 CRITICAL - Emergency access is broken
**Estimated Timeline:** 2-4 hours implementation + testing

---

## Executive Summary

The emergency break glass token is **currently non-functional** due to a fundamental architectural flaw: the emergency reset endpoint is protected by the same Cerberus middleware it needs to bypass. This creates a deadlock scenario where administrators locked out by ACL/WAF cannot use the emergency token to regain access.

**Current State:** Emergency endpoint → Cerberus ACL blocks request → Emergency handler never executes
**Required State:** Emergency endpoint → Bypass all security → Emergency handler executes

This document provides:

1. Complete root cause analysis with evidence
2. 3-tier break glass architecture design
3. Actionable implementation plan
4. Comprehensive verification strategy

---

## Part 1: Root Cause Analysis

### 1.1 The Deadlock Problem

#### Evidence from Code Analysis

**File:** `backend/internal/api/routes/routes.go` (Lines 113-116)

```go
// Emergency endpoint - MUST be registered BEFORE Cerberus middleware
// This endpoint bypasses all security checks for lockout recovery
// Requires CHARON_EMERGENCY_TOKEN env var to be configured
emergencyHandler := handlers.NewEmergencyHandler(db)
router.POST("/api/v1/emergency/security-reset", emergencyHandler.SecurityReset)
```

**File:** `backend/internal/api/routes/routes.go` (Lines 118-122)

```go
api := router.Group("/api/v1")

// Cerberus middleware applies the optional security suite checks (WAF, ACL, CrowdSec)
cerb := cerberus.New(cfg.Security, db)
api.Use(cerb.Middleware())
```

#### The Critical Flaw

While the comment claims the emergency endpoint is registered "BEFORE Cerberus middleware," examination of the code reveals that **it is registered on the root router but still under the `/api/v1` path**. The issue is:

1. **Emergency endpoint registration:** `router.POST("/api/v1/emergency/security-reset", ...)`
2. **API group with Cerberus:** `api := router.Group("/api/v1")` followed by `api.Use(cerb.Middleware())`

**The problem:** Both routes share the `/api/v1` prefix. While there is an attempt to register the emergency endpoint on the root router before the API group is created with middleware, **Gin's routing may not guarantee this bypass behavior**. The `/api/v1/emergency/security-reset` path could still match routes within the `/api/v1` group depending on Gin's internal route resolution order.

### 1.2 Middleware Execution Order

#### Current Middleware Chain (from `routes.go`)

```
1. gzip.Gzip()                  - Global compression (Line 61)
2. middleware.SecurityHeaders() - Security headers (Line 68)
3. [Emergency endpoint registered here - Line 116]
4. cerb.Middleware()            - Cerberus ACL/WAF/CrowdSec (Line 122)
5. authMiddleware()             - JWT validation (Line 201)
6. [Protected endpoints]
```

#### The Cerberus Middleware ACL Logic

**File:** `backend/internal/cerberus/cerberus.go` (Lines 134-160)

```go
if aclEnabled {
	acls, err := c.accessSvc.List()
	if err == nil {
		clientIP := ctx.ClientIP()
		for _, acl := range acls {
			if !acl.Enabled {
				continue
			}
			allowed, _, err := c.accessSvc.TestIP(acl.ID, clientIP)
			if err == nil && !allowed {
				// Send security notification
				_ = c.securityNotifySvc.Send(context.Background(), models.SecurityEvent{
					EventType: "acl_deny",
					Severity:  "warn",
					Message:   "Access control list blocked request",
					ClientIP:  clientIP,
					Path:      ctx.Request.URL.Path,
					Timestamp: time.Now(),
					Metadata: map[string]any{
						"acl_name": acl.Name,
						"acl_id":   acl.ID,
					},
				})
				ctx.AbortWithStatusJSON(http.StatusForbidden, gin.H{"error": "Blocked by access control list"})
				return
			}
		}
	}
}
```

**Key observations:**

- The ACL check happens **before** any endpoint-specific logic
- It uses `ctx.AbortWithStatusJSON()`, which **terminates the request chain**
- The emergency token header is **never examined** by Cerberus
- No bypass mechanism exists for emergency scenarios

### 1.3 Layer 3 vs Layer 7 Analysis

#### CrowdSec Bouncer Investigation

**File:** `.docker/compose/docker-compose.e2e.yml` (Lines 1-31)

```yaml
services:
  charon-e2e:
    image: charon:local
    container_name: charon-e2e
    restart: "no"
    ports:
      - "8080:8080"  # Management UI (Charon)
    environment:
      - CHARON_ENV=development
      - CHARON_DEBUG=0
      - TZ=UTC
      - CHARON_ENCRYPTION_KEY=${CHARON_ENCRYPTION_KEY:?CHARON_ENCRYPTION_KEY is required}
      - CHARON_EMERGENCY_TOKEN=${CHARON_EMERGENCY_TOKEN:-test-emergency-token-for-e2e-32chars}
```

**Evidence from container inspection:**

```bash
$ docker exec charon-e2e sh -c "command -v cscli"
/usr/local/bin/cscli

$ docker exec charon-e2e sh -c "iptables -L -n -v 2>/dev/null"
[No output - iptables not available or no rules configured]
```

**Analysis:**

- The CrowdSec CLI (`cscli`) is present in the container
- iptables does not appear to have active rules
- **However:** the actual blocking may be happening at the **Caddy layer** via the `caddy-crowdsec-bouncer` plugin

**File:** `backend/internal/cerberus/cerberus.go` (Lines 162-170)

```go
// CrowdSec integration: The actual IP blocking is handled by the caddy-crowdsec-bouncer
// plugin at the Caddy layer. This middleware provides defense-in-depth tracking.
// When CrowdSec mode is "local", the bouncer communicates directly with the LAPI
// to receive ban decisions and block malicious IPs before they reach the application.
if c.cfg.CrowdSecMode == "local" {
	// Track that this request passed through CrowdSec evaluation
	// Note: Blocking decisions are made by Caddy bouncer, not here
	metrics.IncCrowdSecRequest()
	logger.Log().WithField("client_ip", ctx.ClientIP()).WithField("path", ctx.Request.URL.Path).Debug("Request evaluated by CrowdSec bouncer at Caddy layer")
}
```

**Critical finding:** CrowdSec blocking happens at the **Caddy layer (Layer 7 reverse proxy)** BEFORE the request reaches the Go application. This means:

1. **Layer 7 block (Caddy):** CrowdSec bouncer → IP banned → HTTP 403 response
2. **Layer 7 block (Go):** Cerberus ACL → IP not in whitelist → HTTP 403 response

**Neither blocking point examines the emergency token header.**

### 1.4 Test Environment Network Topology

#### Docker Network Analysis

- **Container:** `charon-e2e`
- **Port Mapping:** `8080:8080` (host → container)
- **Network Mode:** Docker bridge network (default)
- **Test Client:** Playwright running on the host machine

**Request Flow:**

```
[Playwright Test]
    ↓ (localhost:8080)
[Docker Bridge Network]
    ↓ (172.17.0.x → charon-e2e:8080)
[Caddy Reverse Proxy]
    ↓ (CrowdSec bouncer check - Layer 7)
[Charon Go Application]
    ↓ (Cerberus ACL middleware - Layer 7)
[Emergency Handler] ← NEVER REACHED
```

**Client IP as seen by the backend:**

- **Development:** `127.0.0.1` or `::1` (loopback)
- **Docker bridge:** `172.17.0.1` (Docker gateway)
- **E2E tests:** likely appears as a Docker-internal IP

**ACL whitelist issue:** If the ACL is enabled with a restrictive whitelist (e.g., only `10.0.0.0/8`), the test client's IP (`172.17.0.1`) is **blocked** before the emergency endpoint can execute.
### 1.5 Test Failure Scenario

**File:** `tests/global-setup.ts` (Lines 63-106)

```typescript
async function emergencySecurityReset(requestContext: APIRequestContext): Promise<void> {
  console.log('Performing emergency security reset...');

  const emergencyToken = 'test-emergency-token-for-e2e-32chars';
  const headers = {
    'Content-Type': 'application/json',
    'X-Emergency-Token': emergencyToken,
  };

  const modules = [
    { key: 'security.acl.enabled', value: 'false' },
    { key: 'security.waf.enabled', value: 'false' },
    { key: 'security.crowdsec.enabled', value: 'false' },
    { key: 'security.rate_limit.enabled', value: 'false' },
    { key: 'feature.cerberus.enabled', value: 'false' },
  ];

  for (const { key, value } of modules) {
    try {
      await requestContext.post('/api/v1/settings', {
        data: { key, value },
        headers,
      });
      console.log(`  ✓ Disabled: ${key}`);
    } catch (e) {
      console.log(`  ⚠ Could not disable ${key}: ${e}`);
    }
  }
  // ...
}
```

**Problem:** The test calls the `/api/v1/settings` endpoint (not the emergency endpoint!) while passing the emergency token header. This is **incorrect** because:

1. **Wrong endpoint:** `/api/v1/settings` requires authentication via `authMiddleware`
2. **Wrong endpoint (again):** the emergency endpoint is `/api/v1/emergency/security-reset`
3. **ACL blocks first:** if the ACL is enabled, the request is blocked by Cerberus before it reaches the settings handler

**Expected test flow:**

```typescript
await requestContext.post('/api/v1/emergency/security-reset', {
  headers: {
    'X-Emergency-Token': emergencyToken,
  },
});
```

### 1.6 Emergency Handler Validation

**File:** `backend/internal/api/handlers/emergency_handler.go` (Lines 1-312)

The emergency handler itself is **well-designed**, with:

- ✅ Timing-safe token comparison (constant-time)
- ✅ Rate limiting (5 attempts per minute per IP)
- ✅ Minimum token length validation (32 chars)
- ✅ Comprehensive audit logging
- ✅ Disables all security modules via settings
- ✅ Updates the `SecurityConfig` database record

**The handler works correctly IF it can be reached.**

---

## Part 2: 3-Tier Break Glass Architecture

### 2.1 Design Philosophy

**Defense in depth for recovery:**

- **Tier 1 (Digital Key):** fast, convenient, Layer 7 bypass within the application
- **Tier 2 (Sidecar Door):** separate ingress with minimal security, network-isolated
- **Tier 3 (Physical Key):** direct system access for catastrophic failures

Each tier provides a fallback if the previous tier fails.

### 2.2 Tier 1: Digital Key (Layer 7 Bypass)

#### Concept

A high-priority middleware that short-circuits the entire security stack when the emergency token is present and valid.
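The handler's timing-safe comparison (Section 1.6) is the same primitive the Tier 1 middleware relies on. A standalone stdlib sketch of the check (the function name is illustrative, not from the codebase):

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// tokenMatches compares a provided token against the configured secret
// without early exit, so the response time does not reveal how many
// leading bytes matched.
func tokenMatches(configured, provided string) bool {
	return subtle.ConstantTimeCompare([]byte(configured), []byte(provided)) == 1
}

func main() {
	secret := "test-emergency-token-for-e2e-32chars"
	fmt.Println(tokenMatches(secret, secret))        // true
	fmt.Println(tokenMatches(secret, "wrong-token")) // false
	// Note: ConstantTimeCompare returns 0 immediately when the lengths
	// differ; only the comparison of equal-length inputs is constant-time.
}
```

A plain `==` on strings can short-circuit at the first differing byte, which is why the design uses `crypto/subtle` instead.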
#### Design

**Middleware registration order (NEW):**

```go
// TOP OF CHAIN: Emergency bypass middleware (before gzip, before security headers).
// The token itself is read from CHARON_EMERGENCY_TOKEN inside the middleware.
router.Use(middleware.EmergencyBypass(cfg.Security.ManagementCIDRs, db))

// Then standard middleware
router.Use(gzip.Gzip(gzip.DefaultCompression))
router.Use(middleware.SecurityHeaders(securityHeadersCfg))

// Emergency handler registration on the root router
router.POST("/api/v1/emergency/security-reset", emergencyHandler.SecurityReset)

// API group with Cerberus (emergency requests skip this entirely)
api := router.Group("/api/v1")
api.Use(cerb.Middleware())
```

#### Implementation: Emergency Bypass Middleware

**File:** `backend/internal/api/middleware/emergency.go` (NEW)

```go
package middleware

import (
	"crypto/subtle"
	"net"
	"os"

	"github.com/gin-gonic/gin"
	"gorm.io/gorm"

	"github.com/Wikid82/charon/backend/internal/logger"
)

const (
	EmergencyTokenHeader = "X-Emergency-Token"
	EmergencyTokenEnvVar = "CHARON_EMERGENCY_TOKEN"
	MinTokenLength       = 32
)

// EmergencyBypass creates middleware that bypasses all security checks
// when a valid emergency token is present from an authorized source.
//
// Security conditions (ALL must be met):
//  1. Request from a management CIDR (RFC1918 private networks by default)
//  2. X-Emergency-Token header matches the configured token (timing-safe)
//  3. Token meets the minimum length requirement (32+ chars)
//
// This middleware must be registered FIRST in the middleware chain.
func EmergencyBypass(managementCIDRs []string, db *gorm.DB) gin.HandlerFunc {
	// Load the emergency token from the environment
	emergencyToken := os.Getenv(EmergencyTokenEnvVar)
	if emergencyToken == "" {
		logger.Log().Warn("CHARON_EMERGENCY_TOKEN not set - emergency bypass disabled")
		return func(c *gin.Context) { c.Next() } // noop
	}
	if len(emergencyToken) < MinTokenLength {
		logger.Log().Warn("CHARON_EMERGENCY_TOKEN too short - emergency bypass disabled")
		return func(c *gin.Context) { c.Next() } // noop
	}

	// Parse management CIDRs
	var managementNets []*net.IPNet
	for _, cidr := range managementCIDRs {
		_, ipnet, err := net.ParseCIDR(cidr)
		if err != nil {
			logger.Log().WithError(err).WithField("cidr", cidr).Warn("Invalid management CIDR")
			continue
		}
		managementNets = append(managementNets, ipnet)
	}

	// Default to RFC1918 private networks if none specified
	if len(managementNets) == 0 {
		managementNets = []*net.IPNet{
			mustParseCIDR("10.0.0.0/8"),
			mustParseCIDR("172.16.0.0/12"),
			mustParseCIDR("192.168.0.0/16"),
			mustParseCIDR("127.0.0.0/8"), // localhost for local development
		}
	}

	return func(c *gin.Context) {
		// Check whether the emergency token is present
		providedToken := c.GetHeader(EmergencyTokenHeader)
		if providedToken == "" {
			c.Next() // No emergency token - proceed normally
			return
		}

		// Validate that the source IP is from a management network
		clientIP := net.ParseIP(c.ClientIP())
		if clientIP == nil {
			logger.Log().WithField("ip", c.ClientIP()).Warn("Emergency bypass: invalid client IP")
			c.Next()
			return
		}

		inManagementNet := false
		for _, ipnet := range managementNets {
			if ipnet.Contains(clientIP) {
				inManagementNet = true
				break
			}
		}
		if !inManagementNet {
			logger.Log().WithField("ip", clientIP.String()).Warn("Emergency bypass: IP not in management network")
			c.Next()
			return
		}

		// Timing-safe token comparison
		if !constantTimeCompare(emergencyToken, providedToken) {
			logger.Log().WithField("ip", clientIP.String()).Warn("Emergency bypass: invalid token")
			c.Next()
			return
		}

		// Valid emergency token from an authorized source
		logger.Log().WithFields(map[string]interface{}{
			"ip":   clientIP.String(),
			"path": c.Request.URL.Path,
		}).Warn("EMERGENCY BYPASS ACTIVE: Request bypassing all security checks")

		// Set a flag so downstream handlers know this is an emergency request
		c.Set("emergency_bypass", true)

		// Strip the emergency token header to prevent it from reaching the
		// application. This is critical for security - it prevents token
		// exposure in logs.
		c.Request.Header.Del(EmergencyTokenHeader)

		c.Next()
	}
}

func mustParseCIDR(cidr string) *net.IPNet {
	_, ipnet, _ := net.ParseCIDR(cidr)
	return ipnet
}

func constantTimeCompare(a, b string) bool {
	return subtle.ConstantTimeCompare([]byte(a), []byte(b)) == 1
}
```

#### Cerberus Middleware Update

**File:** `backend/internal/cerberus/cerberus.go` (Line 106)

```go
func (c *Cerberus) Middleware() gin.HandlerFunc {
	return func(ctx *gin.Context) {
		// Check for the emergency bypass flag
		if bypass, exists := ctx.Get("emergency_bypass"); exists && bypass.(bool) {
			logger.Log().WithField("path", ctx.Request.URL.Path).Debug("Cerberus: Skipping security checks (emergency bypass)")
			ctx.Next()
			return
		}

		if !c.IsEnabled() {
			ctx.Next()
			return
		}

		// ... rest of existing logic
	}
}
```

#### Security Considerations

**Strengths:**

- ✅ Double authentication: IP CIDR + secret token
- ✅ Timing-safe comparison prevents timing attacks
- ✅ Token stripped before reaching the application (log safety)
- ✅ Comprehensive audit logging
- ✅ Bypass flag prevents any middleware from blocking

**Weaknesses:**

- ⚠️ Relies on `ClientIP()`, which can be spoofed behind misconfigured proxies
- ⚠️ Token travels in an HTTP header (use HTTPS only)
- ⚠️ If the Caddy bouncer blocks at Layer 7, the request never reaches the Go app

**Mitigations:**

- Configure Gin's `SetTrustedProxies()` correctly
- Document the HTTPS-only requirement
- Implement Tier 2 for Caddy-level blocks

### 2.3 Tier 2: Sidecar Door (Separate Entry Point)

#### Concept

A secondary HTTP port with minimal security, bound to localhost or VPN-only interfaces.
#### Design

**Architecture:**

```
[Public Traffic :443/80]
        ↓
[Caddy Reverse Proxy]
        ↓ (WAF, CrowdSec, ACL)
[Charon Main Port :8080]

[VPN/Localhost Only :2019] ← Sidecar Port
        ↓
[Emergency-Only Server]
        ↓ (Basic Auth or mTLS ONLY)
[Emergency Handlers]
```

#### Implementation

**File:** `backend/internal/server/emergency_server.go` (NEW)

```go
package server

import (
	"context"
	"net/http"
	"time"

	"github.com/gin-gonic/gin"
	"gorm.io/gorm"

	"github.com/Wikid82/charon/backend/internal/api/handlers"
	"github.com/Wikid82/charon/backend/internal/config"
	"github.com/Wikid82/charon/backend/internal/logger"
)

// EmergencyServer provides a minimal HTTP server for emergency operations.
// It runs on a separate port with minimal security for failsafe access.
type EmergencyServer struct {
	server *http.Server
	db     *gorm.DB
	cfg    config.EmergencyConfig
}

// NewEmergencyServer creates a new emergency server instance
func NewEmergencyServer(db *gorm.DB, cfg config.EmergencyConfig) *EmergencyServer {
	return &EmergencyServer{
		db:  db,
		cfg: cfg,
	}
}

// Start initializes and starts the emergency server
func (s *EmergencyServer) Start() error {
	if !s.cfg.Enabled {
		logger.Log().Info("Emergency server disabled")
		return nil
	}

	router := gin.New()
	router.Use(gin.Recovery())

	// Basic request logging (minimal)
	router.Use(func(c *gin.Context) {
		start := time.Now()
		c.Next()
		logger.Log().WithFields(map[string]interface{}{
			"method":  c.Request.Method,
			"path":    c.Request.URL.Path,
			"status":  c.Writer.Status(),
			"latency": time.Since(start).Milliseconds(),
		}).Info("Emergency server request")
	})

	// Basic auth middleware (if configured)
	if s.cfg.BasicAuthUsername != "" && s.cfg.BasicAuthPassword != "" {
		router.Use(gin.BasicAuth(gin.Accounts{
			s.cfg.BasicAuthUsername: s.cfg.BasicAuthPassword,
		}))
	} else {
		logger.Log().Warn("Emergency server has no authentication - use only on localhost!")
	}

	// Emergency endpoints
	emergencyHandler := handlers.NewEmergencyHandler(s.db)
	router.POST("/emergency/security-reset", emergencyHandler.SecurityReset)

	// Health check
	router.GET("/health", func(c *gin.Context) {
		c.JSON(http.StatusOK, gin.H{"status": "ok", "server": "emergency"})
	})

	// Start the server
	s.server = &http.Server{
		Addr:         s.cfg.BindAddress,
		Handler:      router,
		ReadTimeout:  10 * time.Second,
		WriteTimeout: 10 * time.Second,
	}

	logger.Log().WithField("address", s.cfg.BindAddress).Info("Starting emergency server")
	go func() {
		if err := s.server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			logger.Log().WithError(err).Error("Emergency server failed")
		}
	}()

	return nil
}

// Stop gracefully shuts down the emergency server
func (s *EmergencyServer) Stop(ctx context.Context) error {
	if s.server == nil {
		return nil
	}
	logger.Log().Info("Stopping emergency server")
	return s.server.Shutdown(ctx)
}
```

**Configuration:** `backend/internal/config/config.go`

```go
type EmergencyConfig struct {
	Enabled           bool   `env:"CHARON_EMERGENCY_SERVER_ENABLED" envDefault:"false"`
	BindAddress       string `env:"CHARON_EMERGENCY_BIND" envDefault:"127.0.0.1:2019"`
	BasicAuthUsername string `env:"CHARON_EMERGENCY_USERNAME" envDefault:""`
	BasicAuthPassword string `env:"CHARON_EMERGENCY_PASSWORD" envDefault:""`
}
```

**Docker Compose:** `.docker/compose/docker-compose.e2e.yml`

```yaml
services:
  charon-e2e:
    ports:
      - "8080:8080"  # Main application
      - "2019:2019"  # Emergency server (DO NOT expose publicly)
    environment:
      - CHARON_EMERGENCY_SERVER_ENABLED=true
      - CHARON_EMERGENCY_BIND=0.0.0.0:2019  # Bind to all interfaces in the container
      - CHARON_EMERGENCY_USERNAME=admin
      - CHARON_EMERGENCY_PASSWORD=${CHARON_EMERGENCY_PASSWORD:-changeme}
      - CHARON_EMERGENCY_TOKEN=${CHARON_EMERGENCY_TOKEN:-test-emergency-token-for-e2e-32chars}
```

#### Security Considerations

**Strengths:**

- ✅ Completely separate from the main application stack
- ✅ No WAF, no CrowdSec, no ACL
- ✅ Can bind to localhost only (unreachable from the network)
- ✅ Optional Basic Auth or mTLS

**Weaknesses:**

- ⚠️ If exposed publicly, it becomes attack surface
- ⚠️ Basic Auth is weak (prefer mTLS for production)

**Mitigations:**

- **NEVER expose the port publicly**
- Use firewall rules to restrict access
- Use a VPN or SSH tunneling to reach the port
- Implement mTLS for production

### 2.4 Tier 3: Physical Key (Direct System Access)

#### Concept

When all application-level recovery fails, administrators need direct system access to manually fix the problem.

#### Access Methods

**1. SSH to the Host Machine**

```bash
# SSH to the Docker host
ssh admin@docker-host.example.com

# View Charon logs
docker logs charon-e2e

# View CrowdSec decisions
docker exec charon-e2e cscli decisions list

# Delete all CrowdSec bans
docker exec charon-e2e cscli decisions delete --all

# Flush iptables (if CrowdSec uses netfilter)
docker exec charon-e2e iptables -F
docker exec charon-e2e iptables -X

# Stop Caddy to bypass the reverse proxy
docker exec charon-e2e pkill caddy

# Restart the container with security disabled
docker compose -f .docker/compose/docker-compose.e2e.yml down
export CHARON_SECURITY_DISABLED=true
docker compose -f .docker/compose/docker-compose.e2e.yml up -d
```

**2. Direct Database Access**

```bash
# Access the SQLite database directly
docker exec -it charon-e2e sqlite3 /app/data/charon.db
```

```sql
-- Disable all security modules
UPDATE settings SET value = 'false' WHERE key = 'feature.cerberus.enabled';
UPDATE settings SET value = 'false' WHERE key = 'security.acl.enabled';
UPDATE settings SET value = 'false' WHERE key = 'security.waf.enabled';
UPDATE security_configs SET enabled = 0 WHERE name = 'default';
```

**3. Docker Volume Inspection**

```bash
# Find the Charon data volume
docker volume ls | grep charon

# Inspect the volume
docker volume inspect charon_data

# Mount the volume in a temporary container
docker run --rm -v charon_data:/data -it alpine sh
cd /data
sqlite3 charon.db
```

#### Documentation: Emergency Runbooks

**File:** `docs/runbooks/emergency-lockout-recovery.md` (NEW)

````markdown
# Emergency Lockout Recovery Runbook

## Symptom

"Access Forbidden" or "Blocked by access control list" when trying to access
the Charon web interface.

## Tier 1: Digital Key (Emergency Token)

### Prerequisites

- Access to the `CHARON_EMERGENCY_TOKEN` value from the deployment configuration
- HTTPS connection to Charon (token security)
- Source IP in the management network (default: RFC1918 private IPs)

### Procedure

1. Send a POST request with the emergency token header:

   ```bash
   curl -X POST https://charon.example.com/api/v1/emergency/security-reset \
     -H "X-Emergency-Token: <token>" \
     -H "Content-Type: application/json"
   ```

2. Verify the response: `{"success": true, "disabled_modules": [...]}`
3. Wait 5 seconds for settings to propagate
4. Access the web interface

### Troubleshooting

- **403 Forbidden before reset:** Tier 1 failed - proceed to Tier 2
- **401 Unauthorized:** token mismatch - verify the token from the deployment config
- **429 Too Many Requests:** rate limited - wait 1 minute
- **501 Not Implemented:** token not configured in the environment

## Tier 2: Sidecar Door (Emergency Server)

### Prerequisites

- VPN or SSH access to the Docker host
- Knowledge of the emergency server port (default: 2019)
- Emergency server enabled in the configuration

### Procedure

1. SSH to the Docker host:

   ```bash
   ssh admin@docker-host.example.com
   ```

2. Create an SSH tunnel to the emergency port:

   ```bash
   ssh -L 2019:localhost:2019 admin@docker-host.example.com
   ```

3. From the local machine, call the emergency endpoint:

   ```bash
   curl -X POST http://localhost:2019/emergency/security-reset \
     -H "X-Emergency-Token: <token>" \
     -u admin:password
   ```

4. Verify the response and access the web interface

### Troubleshooting

- **Connection refused:** emergency server not enabled
- **401 Unauthorized:** Basic Auth credentials incorrect

## Tier 3: Physical Key (Direct System Access)

### Prerequisites

- root or sudo access to the Docker host
- Knowledge of the container name (default: charon-e2e or charon)

### Procedure

1. SSH to the Docker host:

   ```bash
   ssh admin@docker-host.example.com
   ```

2. Clear CrowdSec bans:

   ```bash
   docker exec charon cscli decisions delete --all
   ```

3. Disable security via the database (the UPDATE statements from
   "Direct Database Access" above):

   ```bash
   docker exec -i charon sqlite3 /app/data/charon.db <<'SQL'
   UPDATE settings SET value = 'false' WHERE key = 'feature.cerberus.enabled';
   UPDATE settings SET value = 'false' WHERE key = 'security.acl.enabled';
   UPDATE settings SET value = 'false' WHERE key = 'security.waf.enabled';
   SQL
   ```
````

---

## Part 4: Verification Strategy

### 4.3 E2E Tests

```typescript
import { test, expect } from '@playwright/test';
import { TestDataManager } from '../utils/TestDataManager';

test.describe('Emergency Token Bypass', () => {
  test('should bypass ACL when valid emergency token is provided', async ({ request }) => {
    const testData = new TestDataManager(request, 'emergency-token-bypass');

    // Step 1: Create a restrictive ACL (whitelist only 192.168.1.0/24)
    const { id: aclId } = await testData.createAccessList({
      name: 'test-restrictive-acl',
      type: 'whitelist',
      ipRules: [{ cidr: '192.168.1.0/24', description: 'Test network' }],
      enabled: true,
    });

    // Step 2: Enable the ACL globally
    await request.post('/api/v1/settings', {
      data: { key: 'security.acl.enabled', value: 'true' },
    });

    // Wait for settings to propagate
    await new Promise(resolve => setTimeout(resolve, 2000));

    // Step 3: Verify the ACL is blocking (a request without the emergency token should fail)
    const blockedResponse = await request.get('/api/v1/proxy-hosts');
    expect(blockedResponse.status()).toBe(403);
    const blockedBody = await blockedResponse.json();
    expect(blockedBody.error).toContain('Blocked by access control');

    // Step 4: Use the emergency token to disable security
    const emergencyToken = 'test-emergency-token-for-e2e-32chars';
    const emergencyResponse = await request.post('/api/v1/emergency/security-reset', {
      headers: {
        'X-Emergency-Token': emergencyToken,
      },
    });
    expect(emergencyResponse.status()).toBe(200);
    const emergencyBody = await emergencyResponse.json();
    expect(emergencyBody.success).toBe(true);
    expect(emergencyBody.disabled_modules).toContain('security.acl.enabled');

    // Wait for settings to propagate
    await new Promise(resolve => setTimeout(resolve, 2000));

    // Step 5: Verify the ACL is now disabled (the request should succeed)
    const allowedResponse = await request.get('/api/v1/proxy-hosts');
    expect(allowedResponse.ok()).toBeTruthy();

    // Cleanup
    await testData.cleanup();
  });

  test('should rate limit emergency token attempts', async ({ request }) => {
    const emergencyToken = 'wrong-token-for-rate-limit-test-32chars';

    // Make 6 rapid attempts with the wrong token
    const attempts = [];
    for (let i = 0; i < 6; i++) {
      attempts.push(
        request.post('/api/v1/emergency/security-reset', {
          headers: { 'X-Emergency-Token': emergencyToken },
        })
      );
    }
    const responses = await Promise.all(attempts);

    // The first 5 should be unauthorized (401)
    for (let i = 0; i < 5; i++) {
      expect(responses[i].status()).toBe(401);
    }

    // The 6th should be rate limited (429)
    expect(responses[5].status()).toBe(429);
    const body = await responses[5].json();
    expect(body.error).toBe('rate limit exceeded');
  });

  test('should log emergency token usage to audit trail', async ({ request }) => {
    const emergencyToken = 'test-emergency-token-for-e2e-32chars';

    // Use the emergency token
    const response = await request.post('/api/v1/emergency/security-reset', {
      headers: { 'X-Emergency-Token': emergencyToken },
    });
    expect(response.ok()).toBeTruthy();

    // Check the audit logs for the emergency event
    const auditResponse = await request.get('/api/v1/audit-logs');
    expect(auditResponse.ok()).toBeTruthy();

    const auditLogs = await auditResponse.json();
    const emergencyLog = auditLogs.find(
      (log: any) => log.action === 'emergency_reset_success'
    );
    expect(emergencyLog).toBeDefined();
    expect(emergencyLog.details).toContain('Disabled modules');
  });
});
```

### 4.4 Chaos Testing

**File:** `tests/chaos/security-lockout.spec.ts` (NEW)

```typescript
import { test, expect } from '@playwright/test';
import { TestDataManager } from '../utils/TestDataManager';

test.describe('Security Lockout Recovery - Chaos Testing', () => {
  test('should recover from complete lockout scenario', async ({ request }) => {
    // Simulate the worst-case scenario:
    // 1. ACL enabled with a restrictive whitelist
    // 2. WAF enabled and blocking patterns
    // 3. Rate limiting enabled
    // 4. CrowdSec enabled with bans
    const testData = new TestDataManager(request, 'chaos-lockout-recovery');

    // Enable all security modules with maximum restrictions
    await request.post('/api/v1/settings', {
      data: { key: 'security.acl.enabled', value: 'true' },
    });
    await request.post('/api/v1/settings', {
      data: { key: 'security.waf.enabled', value: 'true' },
    });
    await request.post('/api/v1/settings', {
      data: { key: 'security.rate_limit.enabled', value: 'true' },
    });
    await request.post('/api/v1/settings', {
      data: { key: 'feature.cerberus.enabled', value: 'true' },
    });

    // Create a restrictive ACL
    await testData.createAccessList({
      name: 'chaos-test-acl',
      type: 'whitelist',
      ipRules: [{ cidr: '10.0.0.0/8' }], // Only allow 10.x.x.x
      enabled: true,
    });

    // Wait for settings to propagate
    await new Promise(resolve => setTimeout(resolve, 3000));

    // Verify the complete lockout
    const lockedResponse = await request.get('/api/v1/health');
    expect(lockedResponse.status()).toBe(403);

    // RECOVERY: Use the emergency token
    const emergencyResponse = await request.post('/api/v1/emergency/security-reset', {
      headers: {
        'X-Emergency-Token': 'test-emergency-token-for-e2e-32chars',
      },
    });
    expect(emergencyResponse.status()).toBe(200);

    // Wait for settings to propagate
    await new Promise(resolve => setTimeout(resolve, 3000));

    // Verify full recovery
    const recoveredResponse = await request.get('/api/v1/health');
    expect(recoveredResponse.ok()).toBeTruthy();

    // Cleanup
    await testData.cleanup();
  });
});
```

---

## Part 5: Timeline & Dependencies

```
Day 1 (4 hours)
├─ Phase 3.1: Emergency Bypass Middleware (1h)
├─ Phase 3.2: Emergency Server (1.5h)
├─ Phase 3.3: Documentation (0.5h)
└─ Phase 3.4: Test Environment (1h)

Day 2 (2 hours)
├─ Phase 3.5: Production Deployment (0.5h)
├─ E2E Testing (1h)
└─ Documentation Review (0.5h)

Total: 6 hours (spread across 2 days)
```

**Dependencies:**

- Emergency Bypass Middleware → Cerberus update (sequential)
- Emergency Server → configuration updates (sequential)
- All phases → documentation (parallel after code complete)
- Production deployment → all tests passing (blocker)

---

## Part 6: Risk Assessment

### High Priority Risks

| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| Emergency token leaked | Critical | Low | Rotate token immediately, audit logs, require 2FA |
| Middleware ordering bug | Critical | Medium | Comprehensive integration tests, code review |
| Emergency port exposed publicly | High | Medium | Firewall rules, documentation warnings |
| ClientIP spoofing behind proxy | High | Medium | Configure `SetTrustedProxies()` correctly |
| Emergency server without auth | Critical | Low | Require Basic Auth or mTLS in production |

### Medium Priority Risks

| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| Token in logs (HTTP headers logged) | Medium | High | Strip header after validation, use HTTPS |
| Rate limiting too strict | Low | Medium | Adjust limits, provide bypass for Tier 2 |
| Emergency endpoint DoS | Medium | Low | Rate limiting, Web Application Firewall |
| Documentation outdated | Medium | Medium | Automated testing of runbook procedures |

---

## Part 7: Success Criteria

### Must Have (MVP)

- ✅ Emergency token bypasses the Cerberus ACL middleware
- ✅ Emergency endpoint accessible when the ACL is blocking
- ✅ Unit tests for the emergency bypass middleware (>80% coverage)
- ✅ Integration tests for the ACL bypass scenario
- ✅ E2E tests pass with security enabled
- ✅ Emergency runbook documented and tested

### Should Have (Production Ready)

- ✅ Emergency server (Tier 2) implemented and tested
- ✅ Management CIDR configuration
- ✅ Token rotation procedure documented
- ✅ Audit logging for all emergency access
- ✅ Monitoring alerts for emergency token usage
- ✅ Rate limiting with appropriate thresholds

### Nice to Have (Future Enhancements)

- ⏳ mTLS support for the emergency server
- ⏳ Multi-factor authentication for emergency access
- ⏳ Emergency access session tokens (time-limited)
- ⏳ Automated emergency token rotation
- ⏳ Emergency access approval workflow

---

## Appendix A: Configuration Reference

### Environment Variables

```bash
# Emergency Token (Required)
CHARON_EMERGENCY_TOKEN=<64-char-hex-token>  # openssl rand -hex 32

# Management Networks (Optional, defaults to RFC1918)
CHARON_MANAGEMENT_CIDRS=10.0.0.0/8,172.16.0.0/12,192.168.0.0/16

# Emergency Server (Optional)
CHARON_EMERGENCY_SERVER_ENABLED=true
CHARON_EMERGENCY_BIND=127.0.0.1:2019  # localhost only by default
CHARON_EMERGENCY_USERNAME=admin
CHARON_EMERGENCY_PASSWORD=<strong-password>
```

### Docker Compose Example

```yaml
services:
  charon:
    image: charon:latest
    ports:
      - "443:443"              # Main HTTPS
      - "127.0.0.1:2019:2019"  # Emergency port (localhost only)
    environment:
      - CHARON_EMERGENCY_TOKEN=${CHARON_EMERGENCY_TOKEN}
      - CHARON_MANAGEMENT_CIDRS=10.10.0.0/16,192.168.1.0/24
      - CHARON_EMERGENCY_SERVER_ENABLED=true
      - CHARON_EMERGENCY_USERNAME=admin
      - CHARON_EMERGENCY_PASSWORD=${EMERGENCY_PASSWORD}
```

---

## Appendix B: Testing Checklist

### Pre-Implementation Tests

- [x] Reproduce the current failure (global-setup.ts emergency reset fails with ACL enabled)
- [x] Document exact error messages
- [x] Verify Cerberus middleware execution order
- [x] Verify the CrowdSec layer (Caddy vs iptables)

### Post-Implementation Tests

- [ ] Unit tests for the emergency bypass middleware pass
- [ ] Integration tests for the ACL bypass pass
- [ ] E2E tests pass with all security modules enabled
- [ ] Emergency server unit tests pass
- [ ] Chaos testing scenarios pass
- [ ] Runbook procedures tested manually
- [ ] Emergency token rotation procedure tested

### Production Smoke Tests

- [ ] Health check endpoint responds
- [ ] Emergency endpoint responds to a valid token
- [ ] Emergency endpoint blocks invalid tokens
- [ ] Emergency endpoint rate limits excessive attempts
- [ ] Audit logs capture emergency access events
- [ ] Monitoring alerts trigger on emergency access

---

## Appendix C: Decision Records

### Decision 1: Why 3 Tiers Instead of a Single Break Glass?

**Date:** January 26, 2026
**Decision:** Implement the 3-tier break glass architecture instead of a single emergency endpoint

**Rationale:**

- **Single point of failure:** a single break glass mechanism can fail (blocked by Caddy, network issues, etc.)
- **Defense in depth:** multiple recovery paths increase resilience
- **Operational flexibility:** different scenarios may require different access methods

**Trade-offs:**

- More complexity to implement and maintain
- More attack surface (the emergency server port)
- More documentation and training required

**Mitigation:** comprehensive documentation, automated testing, clear runbooks

---

### Decision 2: Middleware First vs Endpoint Registration

**Date:** January 26, 2026
**Decision:** Use a middleware bypass flag instead of registering the endpoint before the middleware

**Rationale:**

- **Gin routing ambiguity:** `/api/v1/emergency/...` may still match `/api/v1` group routes
- **Explicit control:** the bypass flag gives clear control flow
- **Testability:** easier to test middleware behavior with context flags

**Trade-offs:**

- Requires checking the flag in all security middleware
- Slightly more code changes

**Mitigation:** comprehensive testing, clear documentation of the bypass mechanism

---

### Decision 3: Emergency Server Port 2019

**Date:** January 26, 2026
**Decision:** Use port 2019 for the emergency server (matching Caddy's admin API default)

**Rationale:**

- **Convention:** Caddy uses 2019 for its admin API, familiar to operators
- **Separation:** clearly separate from the main application ports (80/443/8080)
- **Non-standard:** less likely to conflict with other services

**Trade-offs:**

- Not a well-known port (requires documentation)

**Mitigation:** document in all deployment guides, include in runbooks

---

## Conclusion

This plan provides:

1. **Root cause analysis:** a complete understanding of why the emergency token currently fails
2. **3-tier architecture:** a robust break glass system with multiple recovery paths
3. **Implementation plan:** actionable tasks with time estimates and verification steps
4. **Testing strategy:** unit, integration, E2E, and chaos testing
5. **Documentation:** runbooks, configuration reference, decision records

**Next Steps:**

1. Review and approve this plan
2. Begin Phase 3.1 (Emergency Bypass Middleware)
3. Execute the implementation phases in order
4. Verify with comprehensive testing
5. Deploy to production with monitoring

**Estimated Completion:** 6 hours of implementation + 2 hours of testing = **8 hours total**