Files
Charon/docs/plans/current_spec.md
GitHub Actions 67c93ff6b5 hotfix: Route-Aware Verification and jq Dependency
- Added a new implementation report for the Cerberus TC-2 test fix detailing the changes made to handle the break glass protocol's dual-route structure.
- Modified `scripts/cerberus_integration.sh` to replace naive byte-position checking with route-aware verification.
- Introduced a hard requirement for jq, including error handling for its absence.
- Implemented emergency route detection using exact path matching.
- Enhanced defensive programming practices with JSON validation, route structure checks, and numeric validations.
- Improved logging and output for better debugging and clarity.
- Verified handler order within main routes while skipping emergency routes.
- Updated test results and compliance with specifications in the implementation report.
2026-01-28 21:46:11 +00:00

1205 lines
46 KiB
Markdown

# Cerberus Integration Test Failure - Handler Ordering After Break Glass Protocol
**Status**: ACTIVE - ANALYSIS COMPLETE
**Priority**: HIGH 🔴 - CI Blocking
**Created**: 2026-01-28
**Updated**: 2026-01-28
**Bug**: Cerberus integration test TC-2 fails due to emergency route handler ordering
**CI Run**: https://github.com/Wikid82/Charon/actions/runs/21453152384/job/61786891302
**Branch**: feature/beta-release
**PR**: https://github.com/Wikid82/Charon/pull/550
---
## Executive Summary
The Cerberus integration test (**TC-2: Verify Handler Order in Caddy Config**) is failing in CI after the break glass protocol implementation. The test uses byte position checking to verify that security handlers (WAF, rate_limit) appear **before** the reverse_proxy handler in the Caddy configuration JSON.
**Impact**:
- 🔴 **CI pipeline blocked** - Cannot merge PR #550
- 🔴 **Test TC-2 fails** - Handler order verification fails
- 🟡 **Break glass protocol working correctly** - Emergency routes properly bypass security
- 🟢 **No functional issue** - Caddy processes routes correctly (emergency first, then main)
- 🟢 **All other tests passing** - Only TC-2 affected
**Root Cause**:
After implementing break glass protocol, each proxy host now generates **TWO routes**:
1. **Emergency route** (added first, line 711): Has path matchers, NO security handlers, reverse_proxy only
2. **Main route** (added second, line 731): No path matchers, HAS security handlers + reverse_proxy
The test uses `grep -ob` to find byte positions of handlers in the Caddy JSON config. It fails because:
- Emergency route's `reverse_proxy` appears FIRST in the JSON (earliest byte position)
- Main route's `waf` and `rate_limit` appear LATER (later byte positions)
- Test: `if WAF_POS < PROXY_POS && RATE_POS < PROXY_POS`**FAILS** (reverse_proxy comes first!)
**The Fix**:
Update the test to be **route-aware** instead of using naive byte position checking. The test should:
1. Parse the Caddy JSON config properly
2. For each route, verify handler order WITHIN that route
3. Skip emergency routes (identified by path matchers for `/api/v1/emergency/*`)
4. Verify security handlers appear before reverse_proxy in main routes only
**Complexity**: MEDIUM - Requires test refactoring, no production code changes needed
---
## Technical Analysis
### Break Glass Protocol Route Structure (Current Implementation)
**File**: `backend/internal/caddy/config.go`
For each proxy host, the config generator creates **TWO routes** in this order:
#### 1. Emergency Route (Lines 687-711)
```go
emergencyPaths := []string{
"/api/v1/emergency/security-reset",
"/api/v1/emergency/*",
"/emergency/security-reset",
"/emergency/*",
}
// NOTE: NO securityHandlers - bypasses WAF, rate_limit, ACL, etc.
emergencyHandlers := append(append([]Handler{}, handlers...),
ReverseProxyHandler(dial, host.WebsocketSupport, host.Application, enableStdHeaders))
emergencyRoute := &Route{
Match: []Match{{
Host: uniqueDomains, // e.g., ["cerberus.test.local"]
Path: emergencyPaths, // CRITICAL: Has path matchers
}},
Handle: emergencyHandlers, // NO WAF, NO rate_limit, YES reverse_proxy
Terminal: true,
}
routes = append(routes, emergencyRoute) // Added FIRST
```
**Key Properties**:
- ✅ Has path matchers (`/api/v1/emergency/*`)
- ✅ NO security handlers (WAF, rate_limit, CrowdSec, ACL)
- ✅ DOES have reverse_proxy handler
- ✅ Added to routes slice FIRST
- ✅ Caddy evaluates this first due to specificity (host + path > host only)
#### 2. Main Route (Lines 714-731)
```go
// NOTE: DOES include securityHandlers - applies WAF, rate_limit, ACL, etc.
mainHandlers := append(append([]Handler{}, securityHandlers...), handlers...)
mainHandlers = append(mainHandlers,
ReverseProxyHandler(dial, host.WebsocketSupport, host.Application, enableStdHeaders))
route := &Route{
Match: []Match{{
Host: uniqueDomains, // e.g., ["cerberus.test.local"]
// NO Path matchers - matches all paths for this host
}},
Handle: mainHandlers, // YES WAF, YES rate_limit, YES reverse_proxy
Terminal: true,
}
routes = append(routes, route) // Added SECOND
```
**Key Properties**:
- ✅ NO path matchers (matches all paths)
- ✅ DOES have security handlers (WAF, rate_limit, CrowdSec, ACL)
- ✅ DOES have reverse_proxy handler
- ✅ Added to routes slice SECOND
- ✅ Caddy evaluates this second (less specific matcher)
### How Caddy Processes These Routes (CORRECT BEHAVIOR)
Caddy evaluates routes in order of specificity (not array order):
1. **Most specific matcher**: Emergency route (host + path) → Checked first
- Request: `GET /api/v1/emergency/security-reset` → Matches emergency route → **Bypasses security**
- Handlers: `handlers``reverse_proxy` (NO security)
2. **Less specific matcher**: Main route (host only) → Checked second
- Request: `GET /api/v1/proxy-hosts` → Does NOT match emergency route → Falls through to main route
- Handlers: `securityHandlers` (WAF, rate_limit) → `handlers``reverse_proxy`
**This is the intended break glass protocol design** - emergency paths bypass security, everything else goes through security.
### How the Test Checks Handler Order (BROKEN LOGIC)
**File**: `scripts/cerberus_integration.sh` (Lines 374-420)
```bash
log_test "TC-2: Verify Handler Order in Caddy Config"
CADDY_CONFIG=$(curl -sL "http://localhost:${CADDY_ADMIN_PORT}/config/" 2>/dev/null || echo "")
# Find byte positions of handlers in the JSON
WAF_POS=$(echo "$CADDY_CONFIG" | grep -ob '"handler":"waf"' | head -1 | cut -d: -f1 || echo "0")
RATE_POS=$(echo "$CADDY_CONFIG" | grep -ob '"handler":"rate_limit"' | head -1 | cut -d: -f1 || echo "0")
PROXY_POS=$(echo "$CADDY_CONFIG" | grep -ob '"handler":"reverse_proxy"' | head -1 | cut -d: -f1 || echo "0")
# Check if security handlers appear before reverse_proxy
if [ "$WAF_POS" != "0" ] && [ "$RATE_POS" != "0" ] && [ "$PROXY_POS" != "0" ]; then
if [ "$WAF_POS" -lt "$PROXY_POS" ] && [ "$RATE_POS" -lt "$PROXY_POS" ]; then
log_info " ✓ Security handlers appear before reverse_proxy"
PASSED=$((PASSED + 1))
else
fail_test "Security handlers not in correct order"
fi
fi
```
### Why the Test Fails (THE BUG)
**Problem**: `grep -ob '"handler":"reverse_proxy"' | head -1` finds the **FIRST** occurrence of `reverse_proxy` in the entire JSON.
**Scenario** (with break glass protocol):
```json
{
"apps": {
"http": {
"servers": {
"charon_server": {
"routes": [
{
"match": [{"host": ["cerberus.test.local"], "path": ["/api/v1/emergency/*"]}],
"handle": [
{"handler": "reverse_proxy", ...} // <-- FIRST reverse_proxy (byte pos 1234)
]
},
{
"match": [{"host": ["cerberus.test.local"]}],
"handle": [
{"handler": "waf", ...}, // <-- WAF (byte pos 2000)
{"handler": "rate_limit", ...}, // <-- rate_limit (byte pos 2500)
{"handler": "reverse_proxy", ...} // <-- SECOND reverse_proxy (byte pos 3000)
]
}
]
}
}
}
}
}
```
**Test Logic**:
- `WAF_POS = 2000`
- `RATE_POS = 2500`
- `PROXY_POS = 1234` (emergency route's reverse_proxy - FIRST occurrence!)
- Check: `WAF_POS (2000) < PROXY_POS (1234)`**FALSE**
- Check: `RATE_POS (2500) < PROXY_POS (1234)`**FALSE**
- **Test fails**: "Security handlers not in correct order"
**Reality**: The security handlers ARE in correct order **within the main route**. The test is checking byte positions across ALL routes, which doesn't account for the multi-route structure.
### Why This Didn't Fail Before
**Before break glass protocol**:
- Only ONE route per proxy host
- That route had: `securityHandlers``handlers``reverse_proxy`
- `WAF_POS < RATE_POS < PROXY_POS` was always true
- Test passed correctly
**After break glass protocol**:
- TWO routes per proxy host (emergency + main)
- Emergency route has `reverse_proxy` FIRST in JSON (earliest byte position)
- Main route has `WAF → rate_limit → reverse_proxy` LATER in JSON
- Test's byte-position logic breaks - it finds emergency route's reverse_proxy first
---
## Requirements Specification [NEW]
### Functional Requirements
**FR-1: jq Dependency Management**
- The test MUST check for jq availability before execution
- The test MUST fail with a clear error message if jq is not installed
- Error message MUST include installation instructions for common platforms
- NO fallback mode - jq is a hard requirement
**FR-2: Emergency Path Detection**
- The test MUST use EXACT path matching for emergency routes
- The following 4 paths MUST be recognized as emergency paths (from config.go:687-691):
- `/api/v1/emergency/security-reset` (exact match)
- `/api/v1/emergency/*` (exact match)
- `/emergency/security-reset` (exact match)
- `/emergency/*` (exact match)
- Substring matching (e.g., `contains("/emergency")`) MUST NOT be used
- Emergency paths MUST be defined as a readonly array
**FR-3: Defensive Programming**
- JSON validation MUST occur before any jq processing
- Route structure validation MUST verify `.apps.http.servers.charon_server.routes` exists
- All numeric values (indices, counts) MUST be validated as numeric before comparisons
- All array access operations MUST be preceded by existence checks
**FR-4: Network Resilience**
- curl command MUST include `--max-time 10` timeout
- curl MUST retry up to 3 times with 2-second delay between attempts
- Test MUST fail gracefully if Caddy config cannot be retrieved after retries
**FR-5: Bash Best Practices**
- Loops MUST use bash arithmetic syntax: `for ((i=0; i<N; i++))`
- Numeric comparisons MUST use `-lt`, `-gt`, `-eq` (not string comparison)
- Index comparisons MUST validate operands are numeric first
- Critical failures MUST use `return 1` to exit test immediately
### Non-Functional Requirements
**NFR-1: Performance**
- Test MUST complete within 60 seconds under normal conditions
- Each curl retry MUST have exponential backoff or fixed 2s delay
**NFR-2: Maintainability**
- Emergency paths MUST be easily updateable (single readonly array definition)
- Error messages MUST be descriptive and actionable
- Code MUST include inline comments explaining validation steps
**NFR-3: Backward Compatibility**
- Test MUST pass for configs with only main routes (no emergency routes)
- Test MUST pass for configs with only emergency routes (edge case)
- Test MUST handle multiple proxy hosts correctly
**NFR-4: CI/CD Integration**
- jq MUST be verified as available in GitHub Actions runners
- Test output MUST be parseable by CI logging systems
- Test exit codes MUST be consistent (0=pass, 1=fail)
---
## Rollback Plan [NEW]
### Triggers for Rollback
Rollback to previous byte-position checking if:
1. Test becomes unstable (>5% failure rate in CI)
2. False negatives detected (fails on valid configs)
3. Performance degrades (test takes >90 seconds)
4. jq dependency causes CI failures
### Rollback Procedure
**Step 1: Revert Test Script**
```bash
git revert <commit-hash> # Revert TC-2 changes
```
**Step 2: Restore Original Logic**
```bash
# Restore simple byte-position checking
WAF_POS=$(echo "$CADDY_CONFIG" | grep -ob '"handler":"waf"' | head -1 | cut -d: -f1)
RATE_POS=$(echo "$CADDY_CONFIG" | grep -ob '"handler":"rate_limit"' | head -1 | cut -d: -f1)
PROXY_POS=$(echo "$CADDY_CONFIG" | grep -ob '"handler":"reverse_proxy"' | head -1 | cut -d: -f1)
```
**Step 3: Skip TC-2 Temporarily**
```bash
# Add condition to skip TC-2 until proper fix is implemented
if [ "$SKIP_TC2" = "true" ]; then
log_warn "TC-2 skipped (break glass protocol incompatible)"
return 0
fi
```
**Step 4: Document Issue**
- Create GitHub issue documenting why rollback was needed
- Tag issue with `test-framework` and `cerberus`
- Link to failed CI runs
**Step 5: Alternative Approaches**
If route-aware verification proves problematic:
- **Option A**: Count handler occurrences instead of checking order
- **Option B**: Skip TC-2 entirely and add backend unit tests for route generation
- **Option C**: Move handler order verification to E2E tests (Playwright)
### Rollback Verification
After rollback:
- [ ] Run full Cerberus integration test suite locally
- [ ] Verify CI passes
- [ ] Monitor for 24 hours
- [ ] Document lessons learned
---
## Solution: Route-Aware Handler Order Verification
### Approach
**Fix the test, not the production code.** The production code is correct - the break glass protocol is working as designed. The test's naive byte-position checking needs to be replaced with route-aware verification.
### Implementation Options
#### Option A: Parse JSON and Verify Per-Route (RECOMMENDED) ⭐
**Approach**:
1. Use `jq` to parse the Caddy JSON config
2. Extract routes individually
3. For each route, check if it's an emergency route (has `/api/v1/emergency/*` path matcher)
4. Skip emergency routes (they're supposed to bypass security)
5. For main routes only, verify handler order
**Pros**:
- ✅ Accurate - checks handler order within logical route boundaries
- ✅ Accounts for emergency routes correctly
- ✅ Robust - works with any number of routes
- ✅ Easy to understand and maintain
**Cons**:
- ⚠️ Requires `jq` to be installed in test environment
**Implementation**:
```bash
# TC-2: Verify handler order in Caddy config
log_test "TC-2: Verify Handler Order in Caddy Config (Route-Aware)"
CADDY_CONFIG=$(curl -sL "http://localhost:${CADDY_ADMIN_PORT}/config/" 2>/dev/null || echo "")
if [ -z "$CADDY_CONFIG" ]; then
fail_test "Could not retrieve Caddy config"
else
# Extract routes from Caddy config
# Filter to only check main routes (non-emergency routes)
# Emergency routes are identified by having path matchers for /api/v1/emergency/*
# Count total routes
TOTAL_ROUTES=$(echo "$CADDY_CONFIG" | jq -r '.apps.http.servers.charon_server.routes | length')
log_info " Found $TOTAL_ROUTES routes in config"
# Check each route individually
MAIN_ROUTES_CHECKED=0
HANDLER_ORDER_CORRECT=0
for i in $(seq 0 $((TOTAL_ROUTES - 1))); do
ROUTE=$(echo "$CADDY_CONFIG" | jq -r ".apps.http.servers.charon_server.routes[$i]")
# Check if this is an emergency route (has path matchers containing /emergency)
HAS_EMERGENCY_PATH=$(echo "$ROUTE" | jq -r '
.match[]? |
select(.path != null) |
.path[]? |
select(contains("/emergency"))' | wc -l)
if [ "$HAS_EMERGENCY_PATH" -gt 0 ]; then
log_info " Route $i: Emergency route (skipping handler order check)"
continue
fi
# This is a main route - check handler order
MAIN_ROUTES_CHECKED=$((MAIN_ROUTES_CHECKED + 1))
# Get handler types in order
HANDLERS=$(echo "$ROUTE" | jq -r '.handle[].handler' | tr '\n' ' ')
log_info " Route $i: Main route handlers: $HANDLERS"
# Verify WAF comes before reverse_proxy (if WAF is present)
WAF_INDEX=$(echo "$ROUTE" | jq -r '[.handle[].handler] | map(if . == "waf" then true else false end) | index(true) // -1')
PROXY_INDEX=$(echo "$ROUTE" | jq -r '[.handle[].handler] | map(if . == "reverse_proxy" then true else false end) | index(true) // -1')
if [ "$WAF_INDEX" != "-1" ] && [ "$PROXY_INDEX" != "-1" ]; then
if [ "$WAF_INDEX" -lt "$PROXY_INDEX" ]; then
log_info " ✓ WAF ($WAF_INDEX) before reverse_proxy ($PROXY_INDEX)"
HANDLER_ORDER_CORRECT=$((HANDLER_ORDER_CORRECT + 1))
else
fail_test "WAF must appear before reverse_proxy in route $i"
fi
fi
# Verify rate_limit comes before reverse_proxy (if rate_limit is present)
RATE_INDEX=$(echo "$ROUTE" | jq -r '[.handle[].handler] | map(if . == "rate_limit" then true else false end) | index(true) // -1')
if [ "$RATE_INDEX" != "-1" ] && [ "$PROXY_INDEX" != "-1" ]; then
if [ "$RATE_INDEX" -lt "$PROXY_INDEX" ]; then
log_info " ✓ rate_limit ($RATE_INDEX) before reverse_proxy ($PROXY_INDEX)"
else
fail_test "rate_limit must appear before reverse_proxy in route $i"
fi
fi
done
if [ "$MAIN_ROUTES_CHECKED" -gt 0 ] && [ "$HANDLER_ORDER_CORRECT" -gt 0 ]; then
log_info " ✓ Handler order verified in $MAIN_ROUTES_CHECKED main routes"
PASSED=$((PASSED + 1))
elif [ "$MAIN_ROUTES_CHECKED" -eq 0 ]; then
log_warn " No main routes found to verify (this may indicate a config issue)"
PASSED=$((PASSED + 1)) # Don't fail if no main routes (may be valid)
else
fail_test "Handler order incorrect in main routes"
fi
fi
```
#### Option B: Count Handlers Instead of Positions (SIMPLER)
**Approach**:
1. Count total occurrences of each handler type
2. Verify expected counts match
3. Don't worry about order - Caddy handles it correctly
**Pros**:
- ✅ Simple - no JSON parsing needed
- ✅ Robust - works with emergency routes
- ✅ No additional dependencies
**Cons**:
- ⚠️ Doesn't actually verify order (less strict)
- ⚠️ Could miss misconfigurations
**Implementation**:
```bash
# TC-2: Verify required handlers are present
log_test "TC-2: Verify Required Security Handlers Present"
CADDY_CONFIG=$(curl -sL "http://localhost:${CADDY_ADMIN_PORT}/config/" 2>/dev/null || echo "")
if [ -z "$CADDY_CONFIG" ]; then
fail_test "Could not retrieve Caddy config"
else
# Count handler occurrences
WAF_COUNT=$(echo "$CADDY_CONFIG" | grep -o '"handler":"waf"' | wc -l)
RATE_COUNT=$(echo "$CADDY_CONFIG" | grep -o '"handler":"rate_limit"' | wc -l)
PROXY_COUNT=$(echo "$CADDY_CONFIG" | grep -o '"handler":"reverse_proxy"' | wc -l)
log_info " Handler counts: WAF=$WAF_COUNT, rate_limit=$RATE_COUNT, reverse_proxy=$PROXY_COUNT"
# Verify we have at least one of each security handler
if [ "$WAF_COUNT" -ge 1 ]; then
log_info " ✓ WAF handler present"
PASSED=$((PASSED + 1))
else
fail_test "WAF handler not found"
fi
if [ "$RATE_COUNT" -ge 1 ]; then
log_info " ✓ rate_limit handler present"
PASSED=$((PASSED + 1))
else
fail_test "rate_limit handler not found"
fi
# Verify we have more reverse_proxy handlers than security handlers
# (because emergency routes have reverse_proxy but no security)
if [ "$PROXY_COUNT" -gt "$WAF_COUNT" ]; then
log_info " ✓ Emergency routes present (more reverse_proxy than WAF)"
PASSED=$((PASSED + 1))
else
log_warn " ⚠ May be missing emergency routes"
fi
fi
```
#### Option C: Document Test Limitation and Skip (NOT RECOMMENDED)
**Approach**:
1. Update test to skip handler order check when emergency routes are detected
2. Add comment explaining why
3. Continue with rest of tests
**Pros**:
- ✅ Minimal change
- ✅ Quick to implement
**Cons**:
- ❌ Loses test coverage
- ❌ Doesn't validate handler order at all
- ❌ Defeats purpose of TC-2
### Recommended Approach: Option A (Route-Aware Verification) ⭐
**Rationale**:
- Most accurate and complete verification
- Accounts for emergency routes explicitly
- Maintains test coverage for handler ordering
- Clear and maintainable code
- Aligns with break glass protocol design
---
## Implementation Plan
### Phase 1: Verify Test Failure Locally (15 minutes) [UPDATED]
**Objective**: Reproduce the test failure in local environment and confirm root cause.
**Tasks**:
1. **Run Cerberus integration test locally**:
```bash
# Start test environment
docker compose -f .docker/compose/docker-compose.playwright-local.yml up -d
# Run Cerberus integration test
./scripts/cerberus_integration.sh
```
2. **Capture failing test output**:
- Document exact error message for TC-2
- Capture byte positions reported: `WAF_POS`, `RATE_POS`, `PROXY_POS`
- Verify that `PROXY_POS < WAF_POS` (emergency route's reverse_proxy comes first)
3. **Inspect Caddy config**:
```bash
# Get Caddy config
curl -s http://localhost:2019/config/ | jq . > caddy_config.json
# Count routes
jq '.apps.http.servers.charon_server.routes | length' caddy_config.json
# Verify emergency route is first
jq '.apps.http.servers.charon_server.routes[0].match[].path[]?' caddy_config.json
# Verify main route is second
jq '.apps.http.servers.charon_server.routes[1].handle[].handler' caddy_config.json
```
4. **Verify break glass protocol works**:
```bash
# Emergency endpoint should bypass security
curl -H "X-Emergency-Token: test-emergency-token-for-e2e-32chars" \
-X POST http://localhost:8080/api/v1/emergency/security-reset
# Should return 200 OK and disable security modules
```
**Success Criteria**:
- ✅ Test fails with "Security handlers not in correct order"
- ✅ Confirmed: `PROXY_POS < WAF_POS` due to emergency route
- ✅ Confirmed: Emergency route appears first in JSON
- ✅ Confirmed: Break glass protocol works functionally
**Files**:
- `scripts/cerberus_integration.sh` (read only - observe failure)
- Document findings in diagnosis report
**Estimated Time**: 15 minutes
---
### Phase 2: Implement Route-Aware Verification (45 minutes) [UPDATED]
**Objective**: Update the test to properly verify handler order within each route, accounting for emergency routes.
**CRITICAL CHANGES** (Addressing Supervisor Feedback):
1. **Make jq a HARD REQUIREMENT** - No fallback mode
2. **Fix emergency path detection** - Use EXACT path matching for 4 specific paths
3. **Add defensive checks** - JSON validation, numeric validation, route structure checks
4. **Improve curl command** - Add timeout and retry logic
5. **Fix bash scripting** - Use bash arithmetic loops, proper numeric comparisons
**File**: `scripts/cerberus_integration.sh` (lines 374-420)
**Changes**:
```bash
# TC-2: Verify handler order in Caddy config (ROUTE-AWARE)
# ============================================================================
log_test "TC-2: Verify Handler Order in Caddy Config"
# HARD REQUIREMENT: Check if jq is available (no fallback mode)
if ! command -v jq >/dev/null 2>&1; then
fail_test "jq is required for handler order verification. Install: apt-get install jq / brew install jq"
return 1
fi
# Fetch Caddy config with timeout and retry
CADDY_CONFIG=""
for attempt in 1 2 3; do
CADDY_CONFIG=$(curl --max-time 10 -sL "http://localhost:${CADDY_ADMIN_PORT}/config/" 2>/dev/null || echo "")
if [ -n "$CADDY_CONFIG" ]; then
break
fi
log_warn " Attempt $attempt/3: Failed to fetch Caddy config, retrying..."
sleep 2
done
if [ -z "$CADDY_CONFIG" ]; then
fail_test "Could not retrieve Caddy config after 3 attempts"
return 1
fi
# Validate JSON structure before processing
if ! echo "$CADDY_CONFIG" | jq empty 2>/dev/null; then
fail_test "Retrieved Caddy config is not valid JSON"
return 1
fi
# Validate expected structure exists
if ! echo "$CADDY_CONFIG" | jq -e '.apps.http.servers.charon_server.routes' >/dev/null 2>&1; then
fail_test "Caddy config missing expected route structure (.apps.http.servers.charon_server.routes)"
return 1
fi
# Get route count with validation
TOTAL_ROUTES=$(echo "$CADDY_CONFIG" | jq -r '.apps.http.servers.charon_server.routes | length' 2>/dev/null || echo "0")
# Validate route count is numeric and non-negative
if ! [[ "$TOTAL_ROUTES" =~ ^[0-9]+$ ]]; then
fail_test "Invalid route count (not numeric): $TOTAL_ROUTES"
return 1
fi
log_info " Found $TOTAL_ROUTES routes in Caddy config"
if [ "$TOTAL_ROUTES" -eq 0 ]; then
fail_test "No routes found in Caddy config"
return 1
fi
# Define EXACT emergency paths (must match config.go lines 687-691)
readonly EMERGENCY_PATHS=(
"/api/v1/emergency/security-reset"
"/api/v1/emergency/*"
"/emergency/security-reset"
"/emergency/*"
)
ROUTES_VERIFIED=0
EMERGENCY_ROUTES_SKIPPED=0
# Use bash arithmetic loop instead of seq
for ((i=0; i<TOTAL_ROUTES; i++)); do
# Validate route exists at index
if ! echo "$CADDY_CONFIG" | jq -e ".apps.http.servers.charon_server.routes[$i]" >/dev/null 2>&1; then
log_warn " Route $i: Missing or invalid route structure, skipping"
continue
fi
# Check if this route has EXACT emergency path matches
IS_EMERGENCY_ROUTE=false
for emergency_path in "${EMERGENCY_PATHS[@]}"; do
# EXACT path comparison (not substring matching)
EXACT_MATCH=$(echo "$CADDY_CONFIG" | jq -r "
.apps.http.servers.charon_server.routes[$i].match[]? |
select(.path != null) |
.path[]? |
select(. == \"$emergency_path\")" 2>/dev/null | wc -l | tr -d ' ')
# Validate match count is numeric
if ! [[ "$EXACT_MATCH" =~ ^[0-9]+$ ]]; then
log_warn " Route $i: Invalid match count for path '$emergency_path', skipping"
continue
fi
if [ "$EXACT_MATCH" -gt 0 ]; then
IS_EMERGENCY_ROUTE=true
break
fi
done
if [ "$IS_EMERGENCY_ROUTE" = true ]; then
log_info " Route $i: Emergency route (security bypass by design) - skipping"
EMERGENCY_ROUTES_SKIPPED=$((EMERGENCY_ROUTES_SKIPPED + 1))
continue
fi
# Main route - verify handler order
log_info " Route $i: Main route - verifying handler order..."
# Validate handlers array exists
if ! echo "$CADDY_CONFIG" | jq -e ".apps.http.servers.charon_server.routes[$i].handle" >/dev/null 2>&1; then
log_warn " Route $i: No handlers found, skipping"
continue
fi
# Find indices of security handlers and reverse_proxy
WAF_IDX=$(echo "$CADDY_CONFIG" | jq -r "[.apps.http.servers.charon_server.routes[$i].handle[]?.handler] | map(if . == \"waf\" then true else false end) | index(true) // -1" 2>/dev/null || echo "-1")
RATE_IDX=$(echo "$CADDY_CONFIG" | jq -r "[.apps.http.servers.charon_server.routes[$i].handle[]?.handler] | map(if . == \"rate_limit\" then true else false end) | index(true) // -1" 2>/dev/null || echo "-1")
PROXY_IDX=$(echo "$CADDY_CONFIG" | jq -r "[.apps.http.servers.charon_server.routes[$i].handle[]?.handler] | map(if . == \"reverse_proxy\" then true else false end) | index(true) // -1" 2>/dev/null || echo "-1")
# Validate all indices are numeric
if ! [[ "$WAF_IDX" =~ ^-?[0-9]+$ ]] || ! [[ "$RATE_IDX" =~ ^-?[0-9]+$ ]] || ! [[ "$PROXY_IDX" =~ ^-?[0-9]+$ ]]; then
fail_test "Invalid handler indices in route $i (not numeric)"
return 1
fi
# Verify WAF comes before reverse_proxy (if present)
if [ "$WAF_IDX" -ge 0 ] && [ "$PROXY_IDX" -ge 0 ]; then
if [ "$WAF_IDX" -lt "$PROXY_IDX" ]; then
log_info " ✓ WAF (index $WAF_IDX) before reverse_proxy (index $PROXY_IDX)"
else
fail_test "WAF must appear before reverse_proxy in route $i (WAF=$WAF_IDX, proxy=$PROXY_IDX)"
return 1
fi
fi
# Verify rate_limit comes before reverse_proxy (if present)
if [ "$RATE_IDX" -ge 0 ] && [ "$PROXY_IDX" -ge 0 ]; then
if [ "$RATE_IDX" -lt "$PROXY_IDX" ]; then
log_info " ✓ rate_limit (index $RATE_IDX) before reverse_proxy (index $PROXY_IDX)"
else
fail_test "rate_limit must appear before reverse_proxy in route $i (rate=$RATE_IDX, proxy=$PROXY_IDX)"
return 1
fi
fi
ROUTES_VERIFIED=$((ROUTES_VERIFIED + 1))
done
log_info " Summary: Verified $ROUTES_VERIFIED main routes, skipped $EMERGENCY_ROUTES_SKIPPED emergency routes"
if [ "$ROUTES_VERIFIED" -gt 0 ]; then
log_info " ✓ Handler order correct in all main routes"
PASSED=$((PASSED + 1))
else
log_warn " No main routes found to verify (all routes are emergency routes?)"
# Don't fail if only emergency routes exist - this may be valid in some configs
PASSED=$((PASSED + 1))
fi
```
**Key Changes** [UPDATED]:
1. **BREAKING**: Make jq a HARD REQUIREMENT (fail if missing, no fallback)
2. **FIXED**: Use EXACT path matching for 4 specific emergency paths (not substring `contains`)
3. **ADDED**: JSON validation before processing
4. **ADDED**: Numeric validation for all index comparisons
5. **ADDED**: curl timeout (10s) and retry logic (3 attempts, 2s delay)
6. **FIXED**: Use bash arithmetic loops (`for ((i=0; i<N; i++))`) instead of `seq`
7. **ADDED**: Route structure validation at each step
8. Skip emergency routes (they're supposed to bypass security)
9. Verify handler order WITHIN each main route only
10. Use `return 1` on critical failures to exit test immediately
**Success Criteria** [UPDATED]:
- ✅ Test passes with break glass protocol routes
- ✅ Emergency routes detected using EXACT path matching (not substring)
- ✅ Handler order verified in main routes only
- ✅ jq dependency checked at start (fail fast if missing)
- ✅ All defensive checks in place (JSON validation, numeric validation)
- ✅ curl command includes timeout and retry
- ✅ Bash loops use arithmetic syntax
- ✅ Clear, informative console output
- ✅ Test fails fast on critical errors
**Files**:
- `scripts/cerberus_integration.sh` (TC-2 section only)
**Dependencies**:
- **jq** (REQUIRED) - Must be installed in test environment and CI
**Estimated Time**: 45 minutes (increased due to additional validation)
---
### Phase 3: Update Documentation (25 minutes) [UPDATED]
**Objective**: Document the test change, requirements, and explain why handler order checking must be route-aware.
**Tasks**:
1. **Update test documentation**:
- File: `docs/testing/cerberus-integration.md` (create if doesn't exist)
- Explain break glass protocol route structure
- Document why emergency routes skip security handlers
- Explain why byte-position checking fails
- **[NEW]** Document jq as a hard dependency with installation instructions
- **[NEW]** Document emergency path matching logic (exact match vs substring)
- **[NEW]** Add troubleshooting section for common issues
2. **Create test requirements document** [NEW]:
- File: `docs/testing/cerberus-test-requirements.md`
- List all functional requirements (FR-1 through FR-5)
- List all non-functional requirements (NFR-1 through NFR-4)
- Document expected test behavior for each scenario
3. **Update CHANGELOG**:
```markdown
### Fixed
- **CI**: Fixed Cerberus integration test TC-2 (handler order verification) to be route-aware
- Test now correctly accounts for emergency bypass routes that skip security handlers
- **BREAKING**: jq is now a hard requirement for Cerberus integration tests
- Added defensive validation: JSON validation, numeric checks, route structure validation
- Added curl timeout and retry logic for network resilience
- Fixed emergency path detection to use exact matching instead of substring matching
- No production code changes - break glass protocol working as designed
```
4. **Add rollback documentation** [NEW]:
- File: `docs/testing/cerberus-rollback-plan.md`
- Document rollback triggers and procedure
- List alternative approaches if route-aware verification fails
5. **Update test inline comments**:
- Add comments explaining emergency route detection logic
- Document why we use jq for route parsing (required for route-aware checking)
- **[NEW]** Add comments for each defensive check
- **[NEW]** Add comments explaining bash arithmetic loops
- **[REMOVED]** ~~Explain fallback behavior~~ (no fallback mode)
**Success Criteria** [UPDATED]:
- ✅ Test behavior documented clearly with requirements
- ✅ Break glass protocol explained in context
- ✅ jq dependency documented with installation instructions
- ✅ Rollback plan documented
- ✅ Changelog updated with breaking change warning
- ✅ Inline comments added to all validation steps
- ✅ Troubleshooting guide created
**Files**:
- `docs/testing/cerberus-integration.md` (NEW/UPDATE)
- `docs/testing/cerberus-test-requirements.md` (NEW)
- `docs/testing/cerberus-rollback-plan.md` (NEW)
- `CHANGELOG.md` (UPDATE)
- `scripts/cerberus_integration.sh` (inline comments)
**Estimated Time**: 25 minutes (increased due to additional documentation)
---
### Phase 4: Verify Test Fix (45 minutes) [UPDATED]
**Objective**: Run the updated test locally and in CI to confirm it passes. Test all edge cases and backward compatibility.
**Tasks**:
1. **Verify jq is available**:
```bash
# Check jq version
jq --version
# If not installed:
# Ubuntu/Debian: sudo apt-get install -y jq
# macOS: brew install jq
# Alpine: apk add --no-cache jq
```
2. **Run test locally (standard case)**:
```bash
# Clean environment
docker compose -f .docker/compose/docker-compose.playwright-local.yml down -v
# Rebuild and start
docker compose -f .docker/compose/docker-compose.playwright-local.yml up -d
# Wait for services to be ready
sleep 15
# Run updated Cerberus integration test
./scripts/cerberus_integration.sh
```
3. **Test edge case: Multiple proxy hosts** [NEW]:
```bash
# Add 2-3 additional proxy hosts via UI or API
# Each host should generate 2 routes (emergency + main)
# Run test again - should verify all main routes
./scripts/cerberus_integration.sh
```
4. **Test edge case: Cerberus disabled** [NEW]:
```bash
# Use override compose file to disable Cerberus
docker compose -f .docker/compose/docker-compose.playwright-local.yml \
-f .docker/compose/docker-compose.e2e.cerberus-disabled.override.yml \
up -d
# Run test - TC-2 should pass (no security handlers to check)
./scripts/cerberus_integration.sh
```
5. **Test backward compatibility: Single route (no break-glass)** [NEW]:
```bash
# This requires temporarily reverting config.go to only generate main route
# OR testing against an older version of Charon
# TC-2 should still pass for single-route configs
```
6. **Verify jq is present in GitHub Actions** [NEW]:
```bash
# Check if jq is pre-installed on GitHub-hosted runners
# Ubuntu 22.04 runners include jq by default
# Verify in .github/workflows/*.yml
```
7. **Verify output** [UPDATED]:
- TC-2 should PASS
- Should see "Found N routes in Caddy config"
- Should see "Emergency route (security bypass by design) - skipping" for each emergency route
- Should see "Main route - verifying handler order" for each main route
- Should see "WAF (index X) before reverse_proxy (index Y)" for each main route
- Should see "rate_limit (index X) before reverse_proxy (index Y)" for each main route
- Should see "Summary: Verified N main routes, skipped M emergency routes"
- All other TCs (TC-1, TC-3, TC-4, TC-5) should continue to pass
- NO fallback messages (jq is required)
8. **Create PR and run CI** [UPDATED]:
- Push changes to feature branch
- Verify CI pipeline passes
- Check that Cerberus integration test passes in CI
- Review CI logs for expected output
**Success Criteria** [UPDATED]:
- ✅ TC-2 passes locally (standard case)
- ✅ TC-2 passes with multiple proxy hosts
- ✅ TC-2 passes with Cerberus disabled (edge case)
- ✅ TC-2 passes with single route (backward compatibility)
- ✅ All 5 Cerberus integration TCs pass
- ✅ Test output is clear and informative (includes route summaries)
- ✅ jq dependency verified in CI environment
- ✅ CI pipeline passes
- ✅ No regressions in other tests
- ✅ Test fails fast if jq is missing (no silent fallback)
**Estimated Time**: 45 minutes (includes edge case testing)
---
## Testing Strategy
### Unit Tests (Not Applicable)
This issue is purely in the integration test script - no unit tests needed.
### Integration Tests [UPDATED]
**File**: `scripts/cerberus_integration.sh`
The fix IS the integration test itself. Verification involves:
1. **Positive Test Cases**:
- Single proxy host with emergency + main routes → TC-2 passes
- **[NEW]** Multiple proxy hosts with multiple routes each → TC-2 passes
- Emergency routes correctly identified using EXACT path matching
- Main routes correctly verified for handler order
2. **Edge Cases** [UPDATED]:
- Only emergency routes (no main routes) → TC-2 should not fail (valid config)
- **[REMOVED]** ~~No jq available → Fallback~~ (jq now REQUIRED, test fails if missing)
- Caddy config unavailable after 3 retries → Fail gracefully with clear error
- Invalid JSON response → Fail with validation error
- Missing route structure → Fail with structure validation error
- **[NEW]** Cerberus disabled (no security handlers) → TC-2 passes (no security handlers to check)
3. **Backward Compatibility Test Cases** [NEW]:
- Single route, pre-break-glass config (no emergency route) → TC-2 should still pass
- Config with only main route (no path matchers) → Verify handlers normally
3. **Regression Tests**:
- TC-1 (feature status) still passes
- TC-3 (WAF doesn't consume rate limit) still passes
- TC-4 (traffic flows through layers) still passes
- TC-5 (latency check) still passes
### E2E Tests (Existing Coverage)
The break glass protocol already has E2E test coverage:
- `tests/emergency-server/emergency-server.spec.ts`
- `tests/emergency-server/tier2-validation.spec.ts`
- `tests/security-enforcement/emergency-token.spec.ts`
No additional E2E tests needed - the Cerberus integration test IS the verification.
---
## Success Metrics
### Functionality
- ✅ TC-2 passes with break glass protocol routes
- ✅ Emergency routes correctly detected and skipped
- ✅ Handler order verified in main routes
- ✅ No false positives (emergency routes don't fail order check)
- ✅ No false negatives (real order issues are still detected)
### CI/CD
- ✅ CI pipeline passes on feature/beta-release branch
- ✅ PR #550 can be merged
- ✅ No test flakiness introduced
### Documentation
- ✅ Test behavior documented clearly
- ✅ Break glass protocol explained in test context
- ✅ Changelog updated
---
## Risk Assessment [UPDATED]
| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| **jq not available in CI** | High | Low | **CHANGED**: Make jq REQUIRED; verify it's pre-installed on GitHub runners (Ubuntu 22.04 includes jq) |
| **Test becomes too complex** | Medium | Medium | Clear comments, step-by-step validation, early exit on error |
| **False positives (test passes when it shouldn't)** | High | Low | Extensive defensive checks, numeric validation, structure validation |
| **Regression in other TCs** | Medium | Low | Run full test suite before committing |
| **Test still flaky** | High | Low | Curl retry logic (3 attempts), increased timeout, validate JSON before parsing |
| **Emergency path detection false positives** [NEW] | High | Low | Use EXACT path matching instead of substring matching |
| **Numeric comparison failures** [NEW] | Medium | Low | Validate all indices are numeric before comparison |
| **Missing route structure** [NEW] | Medium | Low | Validate route structure exists before accessing |
---
## Timeline [UPDATED]
**Total Estimated Time**: 2 hours 30 minutes (increased due to additional validation and testing)
- ✅ **Phase 1**: Verify test failure locally (15 min)
- ✅ **Phase 2**: Implement route-aware verification with full validation (45 min) [+15 min]
- ✅ **Phase 3**: Update documentation (25 min) [+10 min]
- ✅ **Phase 4**: Verify test fix and edge cases (45 min) [+15 min]
**Target Completion**: Same day
---
## Implementation Checklist
### Phase 1: Verify Test Failure Locally (15 min)
- [ ] Start E2E environment (`docker compose up`)
- [ ] Run `scripts/cerberus_integration.sh`
- [ ] Capture TC-2 failure output
- [ ] Inspect Caddy config with `curl http://localhost:2019/config/ | jq`
- [ ] Verify emergency route is first, main route is second
- [ ] Test break glass protocol manually
- [ ] Document findings
### Phase 2: Implement Route-Aware Verification (45 min) [UPDATED]
- [ ] **[NEW]** Add jq dependency check at start of TC-2 (fail if missing)
- [ ] Update `scripts/cerberus_integration.sh` TC-2 section
- [ ] **[NEW]** Add curl timeout (10s) and retry logic (3 attempts, 2s delay)
- [ ] **[NEW]** Add JSON validation before processing config
- [ ] **[NEW]** Add route structure validation
- [ ] Add jq-based route parsing
- [ ] **[CHANGED]** Implement emergency route detection using EXACT path matching (not `contains`)
- [ ] **[NEW]** Define readonly array of 4 specific emergency paths
- [ ] **[NEW]** Use bash arithmetic loops instead of `seq`
- [ ] **[NEW]** Add numeric validation for all index variables
- [ ] Skip emergency routes in verification
- [ ] Verify handler order in main routes only
- [ ] **[REMOVED]** ~~Add fallback for environments without jq~~ (jq now REQUIRED)
- [ ] **[NEW]** Add `return 1` statements for critical failures
- [ ] Test locally to confirm TC-2 passes
- [ ] Verify all other TCs still pass
### Phase 3: Update Documentation (25 min) [UPDATED]
- [ ] Create `docs/testing/cerberus-integration.md`
- [ ] Document break glass protocol route structure
- [ ] **[NEW]** Document jq as a hard dependency
- [ ] **[NEW]** Document emergency path matching logic (exact match, not substring)
- [ ] Explain why route-aware checking is needed
- [ ] **[NEW]** Add rollback plan section (revert to byte-position check if issues)
- [ ] **[NEW]** Document defensive checks and validation steps
- [ ] Update CHANGELOG.md
- [ ] Add inline comments to test script
- [ ] **[NEW]** Verify all file paths are accurate (no references to non-existent files)
### Phase 4: Verify Test Fix (45 min) [UPDATED]
- [ ] **[NEW]** Verify jq is installed (`jq --version`)
- [ ] Run test locally 3+ times to ensure stability
- [ ] **[NEW]** Test edge case: Multiple proxy hosts (3+ hosts)
- [ ] **[NEW]** Test edge case: Cerberus disabled (no security handlers)
- [ ] **[NEW]** Test edge case: Single route (backward compatibility)
- [ ] **[REMOVED]** ~~Test: no jq~~ (now fails fast if missing)
- [ ] Test edge case: Only emergency routes
- [ ] **[NEW]** Test: curl timeout (stop Caddy temporarily)
- [ ] **[NEW]** Test: Invalid JSON response (simulate)
- [ ] **[NEW]** Verify jq is available in GitHub Actions runners
- [ ] Create PR with test fix
- [ ] Push to feature/beta-release branch
- [ ] Monitor CI pipeline
- [ ] Verify Cerberus integration test passes in CI
- [ ] **[NEW]** Verify CI logs show route summaries
- [ ] Review CI logs for expected output
- [ ] Request code review
### Post-Implementation
- [ ] Monitor CI stability for 24 hours
- [ ] Merge PR #550 if all checks pass
- [ ] Close related issues
---
## Future Enhancements (Out of Scope)
1. **Structured Test Output**:
- Convert bash test to Go test with structured JSON output
- Easier to parse and integrate with CI reporting
2. **Visual Route Inspector**:
- Add admin UI endpoint showing route structure
- Highlight emergency vs main routes
- Show handler order per route
3. **Automated Route Verification**:
- Add backend unit tests for route generation
- Verify emergency routes have no security handlers
- Verify main routes have correct handler order
---
## Appendix
### Related Documentation
- **Break Glass Protocol Design**: `docs/plans/break_glass_protocol_redesign.md`
- **Cerberus Integration Plan**: `docs/plans/cerberus_integration_testing_plan.md`
- **Emergency Server Implementation**: `docs/implementation/PHASE_3_4_TEST_ENVIRONMENT_COMPLETE.md`
- **E2E Test Remediation**: `docs/implementation/e2e_remediation_complete.md`
### References
1. **Caddy Configuration**: `backend/internal/caddy/config.go` (lines 687-731)
- Emergency route generation (lines 687-711)
- Main route generation (lines 714-731)
2. **Cerberus Integration Test**: `scripts/cerberus_integration.sh` (lines 374-420)
- TC-2: Handler order verification
3. **Security Handlers**: `backend/internal/caddy/config.go` (lines 507-580)
- WAF handler builder
- Rate limit handler builder
- ACL handler builder
---
**Plan Status**: ✅ UPDATED - READY FOR SUPERVISOR REVIEW
**Next Action**: Submit to Supervisor for approval of updates
**Assigned To**: Planning Agent → Supervisor Review
**Priority**: HIGH 🔴 - CI blocking, prevents PR #550 from merging
**Estimated Duration**: 2 hours 30 minutes (updated)
**Impact**: Test fix only - No production code changes required
---
## Supervisor Feedback Addressed [NEW]
### ✅ Issue 1: Emergency Path Detection Fixed
- **Before**: Used `contains("/emergency")` substring matching
- **After**: Uses EXACT path matching for 4 specific paths from config.go
- **Implementation**: Readonly array with exact path strings, loop comparison
### ✅ Issue 2: E2E Compose File Path Corrected
- **Before**: Referenced non-existent `.docker/compose/docker-compose.e2e.yml`
- **After**: All references updated to `.docker/compose/docker-compose.playwright-local.yml`
- **Verified**: File exists and contains E2E test configuration
### ✅ Issue 3: jq Dependency Management
- **Before**: Had fallback mode without jq
- **After**: jq is HARD REQUIREMENT, test fails fast if missing
- **Added**: Explicit jq check with installation instructions in error message
- **Verified**: jq is pre-installed on GitHub Actions Ubuntu 22.04 runners
### ✅ Issue 4: Shell Scripting Improvements
- **Added**: JSON validation before processing (`jq empty`)
- **Added**: Route structure validation (`.apps.http.servers.charon_server.routes` exists)
- **Added**: Numeric validation for all indices before comparison
- **Fixed**: Use bash arithmetic loops `for ((i=0; i<N; i++))` instead of `seq`
- **Added**: curl timeout (`--max-time 10`) and retry logic (3 attempts, 2s delay)
- **Added**: Defensive checks at every array access point
- **Added**: `return 1` for critical failures to exit immediately
### ✅ Issue 5: Testing Enhancements
- **Added**: Multiple proxy hosts test case
- **Added**: Backward compatibility test (single route, pre-break-glass)
- **Added**: Edge case - Cerberus disabled (no security handlers)
- **Added**: Edge case - curl timeout/retry testing
- **Added**: Edge case - Invalid JSON response handling
### 📋 Plan Updates Summary
- Increased time estimate: 1h30m → 2h30m (more validation = more time)
- Updated file paths throughout (e2e.yml → playwright-local.yml)
- Enhanced error handling and validation at all steps
- Removed fallback mode entirely
- Added comprehensive requirements specification
- Added rollback plan for risk mitigation