Files
Charon/docs/plans/current_spec.md
GitHub Actions 67c93ff6b5 hotfix: Route-Aware Verification and jq Dependency
- Added a new implementation report for the Cerberus TC-2 test fix detailing the changes made to handle the break glass protocol's dual-route structure.
- Modified `scripts/cerberus_integration.sh` to replace naive byte-position checking with route-aware verification.
- Introduced a hard requirement for jq, including error handling for its absence.
- Implemented emergency route detection using exact path matching.
- Enhanced defensive programming practices with JSON validation, route structure checks, and numeric validations.
- Improved logging and output for better debugging and clarity.
- Verified handler order within main routes while skipping emergency routes.
- Updated test results and compliance with specifications in the implementation report.
2026-01-28 21:46:11 +00:00

46 KiB

Cerberus Integration Test Failure - Handler Ordering After Break Glass Protocol

Status: ACTIVE - ANALYSIS COMPLETE Priority: HIGH 🔴 - CI Blocking Created: 2026-01-28 Updated: 2026-01-28 Bug: Cerberus integration test TC-2 fails due to emergency route handler ordering CI Run: https://github.com/Wikid82/Charon/actions/runs/21453152384/job/61786891302 Branch: feature/beta-release PR: https://github.com/Wikid82/Charon/pull/550


Executive Summary

The Cerberus integration test (TC-2: Verify Handler Order in Caddy Config) is failing in CI after the break glass protocol implementation. The test uses byte position checking to verify that security handlers (WAF, rate_limit) appear before the reverse_proxy handler in the Caddy configuration JSON.

Impact:

  • 🔴 CI pipeline blocked - Cannot merge PR #550
  • 🔴 Test TC-2 fails - Handler order verification fails
  • 🟡 Break glass protocol working correctly - Emergency routes properly bypass security
  • 🟢 No functional issue - Caddy processes routes correctly (emergency first, then main)
  • 🟢 All other tests passing - Only TC-2 affected

Root Cause: After implementing break glass protocol, each proxy host now generates TWO routes:

  1. Emergency route (added first, line 711): Has path matchers, NO security handlers, reverse_proxy only
  2. Main route (added second, line 731): No path matchers, HAS security handlers + reverse_proxy

The test uses grep -ob to find byte positions of handlers in the Caddy JSON config. It fails because:

  • Emergency route's reverse_proxy appears FIRST in the JSON (earliest byte position)
  • Main route's waf and rate_limit appear LATER (later byte positions)
  • Test: if WAF_POS < PROXY_POS && RATE_POS < PROXY_POSFAILS (reverse_proxy comes first!)

The Fix: Update the test to be route-aware instead of using naive byte position checking. The test should:

  1. Parse the Caddy JSON config properly
  2. For each route, verify handler order WITHIN that route
  3. Skip emergency routes (identified by path matchers for /api/v1/emergency/*)
  4. Verify security handlers appear before reverse_proxy in main routes only

Complexity: MEDIUM - Requires test refactoring, no production code changes needed


Technical Analysis

Break Glass Protocol Route Structure (Current Implementation)

File: backend/internal/caddy/config.go

For each proxy host, the config generator creates TWO routes in this order:

1. Emergency Route (Lines 687-711)

emergencyPaths := []string{
    "/api/v1/emergency/security-reset",
    "/api/v1/emergency/*",
    "/emergency/security-reset",
    "/emergency/*",
}
// NOTE: NO securityHandlers - bypasses WAF, rate_limit, ACL, etc.
emergencyHandlers := append(append([]Handler{}, handlers...),
    ReverseProxyHandler(dial, host.WebsocketSupport, host.Application, enableStdHeaders))

emergencyRoute := &Route{
    Match: []Match{{
        Host: uniqueDomains,    // e.g., ["cerberus.test.local"]
        Path: emergencyPaths,   // CRITICAL: Has path matchers
    }},
    Handle:   emergencyHandlers, // NO WAF, NO rate_limit, YES reverse_proxy
    Terminal: true,
}
routes = append(routes, emergencyRoute) // Added FIRST

Key Properties:

  • Has path matchers (/api/v1/emergency/*)
  • NO security handlers (WAF, rate_limit, CrowdSec, ACL)
  • DOES have reverse_proxy handler
  • Added to routes slice FIRST
  • Caddy evaluates this first due to specificity (host + path > host only)

2. Main Route (Lines 714-731)

// NOTE: DOES include securityHandlers - applies WAF, rate_limit, ACL, etc.
mainHandlers := append(append([]Handler{}, securityHandlers...), handlers...)
mainHandlers = append(mainHandlers,
    ReverseProxyHandler(dial, host.WebsocketSupport, host.Application, enableStdHeaders))

route := &Route{
    Match: []Match{{
        Host: uniqueDomains,    // e.g., ["cerberus.test.local"]
        // NO Path matchers - matches all paths for this host
    }},
    Handle:   mainHandlers, // YES WAF, YES rate_limit, YES reverse_proxy
    Terminal: true,
}
routes = append(routes, route) // Added SECOND

Key Properties:

  • NO path matchers (matches all paths)
  • DOES have security handlers (WAF, rate_limit, CrowdSec, ACL)
  • DOES have reverse_proxy handler
  • Added to routes slice SECOND
  • Caddy evaluates this second (less specific matcher)

How Caddy Processes These Routes (CORRECT BEHAVIOR)

Caddy evaluates routes in order of specificity (not array order):

  1. Most specific matcher: Emergency route (host + path) → Checked first

    • Request: GET /api/v1/emergency/security-reset → Matches emergency route → Bypasses security
    • Handlers: handlersreverse_proxy (NO security)
  2. Less specific matcher: Main route (host only) → Checked second

    • Request: GET /api/v1/proxy-hosts → Does NOT match emergency route → Falls through to main route
    • Handlers: securityHandlers (WAF, rate_limit) → handlersreverse_proxy

This is the intended break glass protocol design - emergency paths bypass security, everything else goes through security.

How the Test Checks Handler Order (BROKEN LOGIC)

File: scripts/cerberus_integration.sh (Lines 374-420)

log_test "TC-2: Verify Handler Order in Caddy Config"

CADDY_CONFIG=$(curl -sL "http://localhost:${CADDY_ADMIN_PORT}/config/" 2>/dev/null || echo "")

# Find byte positions of handlers in the JSON
WAF_POS=$(echo "$CADDY_CONFIG" | grep -ob '"handler":"waf"' | head -1 | cut -d: -f1 || echo "0")
RATE_POS=$(echo "$CADDY_CONFIG" | grep -ob '"handler":"rate_limit"' | head -1 | cut -d: -f1 || echo "0")
PROXY_POS=$(echo "$CADDY_CONFIG" | grep -ob '"handler":"reverse_proxy"' | head -1 | cut -d: -f1 || echo "0")

# Check if security handlers appear before reverse_proxy
if [ "$WAF_POS" != "0" ] && [ "$RATE_POS" != "0" ] && [ "$PROXY_POS" != "0" ]; then
    if [ "$WAF_POS" -lt "$PROXY_POS" ] && [ "$RATE_POS" -lt "$PROXY_POS" ]; then
        log_info "  ✓ Security handlers appear before reverse_proxy"
        PASSED=$((PASSED + 1))
    else
        fail_test "Security handlers not in correct order"
    fi
fi

Why the Test Fails (THE BUG)

Problem: grep -ob '"handler":"reverse_proxy"' | head -1 finds the FIRST occurrence of reverse_proxy in the entire JSON.

Scenario (with break glass protocol):

{
  "apps": {
    "http": {
      "servers": {
        "charon_server": {
          "routes": [
            {
              "match": [{"host": ["cerberus.test.local"], "path": ["/api/v1/emergency/*"]}],
              "handle": [
                {"handler": "reverse_proxy", ...}  // <-- FIRST reverse_proxy (byte pos 1234)
              ]
            },
            {
              "match": [{"host": ["cerberus.test.local"]}],
              "handle": [
                {"handler": "waf", ...},           // <-- WAF (byte pos 2000)
                {"handler": "rate_limit", ...},    // <-- rate_limit (byte pos 2500)
                {"handler": "reverse_proxy", ...}  // <-- SECOND reverse_proxy (byte pos 3000)
              ]
            }
          ]
        }
      }
    }
  }
}

Test Logic:

  • WAF_POS = 2000
  • RATE_POS = 2500
  • PROXY_POS = 1234 (emergency route's reverse_proxy - FIRST occurrence!)
  • Check: WAF_POS (2000) < PROXY_POS (1234)FALSE
  • Check: RATE_POS (2500) < PROXY_POS (1234)FALSE
  • Test fails: "Security handlers not in correct order"

Reality: The security handlers ARE in correct order within the main route. The test is checking byte positions across ALL routes, which doesn't account for the multi-route structure.

Why This Didn't Fail Before

Before break glass protocol:

  • Only ONE route per proxy host
  • That route had: securityHandlershandlersreverse_proxy
  • WAF_POS < RATE_POS < PROXY_POS was always true
  • Test passed correctly

After break glass protocol:

  • TWO routes per proxy host (emergency + main)
  • Emergency route has reverse_proxy FIRST in JSON (earliest byte position)
  • Main route has WAF → rate_limit → reverse_proxy LATER in JSON
  • Test's byte-position logic breaks - it finds emergency route's reverse_proxy first

Requirements Specification [NEW]

Functional Requirements

FR-1: jq Dependency Management

  • The test MUST check for jq availability before execution
  • The test MUST fail with a clear error message if jq is not installed
  • Error message MUST include installation instructions for common platforms
  • NO fallback mode - jq is a hard requirement

FR-2: Emergency Path Detection

  • The test MUST use EXACT path matching for emergency routes
  • The following 4 paths MUST be recognized as emergency paths (from config.go:687-691):
    • /api/v1/emergency/security-reset (exact match)
    • /api/v1/emergency/* (exact match)
    • /emergency/security-reset (exact match)
    • /emergency/* (exact match)
  • Substring matching (e.g., contains("/emergency")) MUST NOT be used
  • Emergency paths MUST be defined as a readonly array

FR-3: Defensive Programming

  • JSON validation MUST occur before any jq processing
  • Route structure validation MUST verify .apps.http.servers.charon_server.routes exists
  • All numeric values (indices, counts) MUST be validated as numeric before comparisons
  • All array access operations MUST be preceded by existence checks

FR-4: Network Resilience

  • curl command MUST include --max-time 10 timeout
  • curl MUST retry up to 3 times with 2-second delay between attempts
  • Test MUST fail gracefully if Caddy config cannot be retrieved after retries

FR-5: Bash Best Practices

  • Loops MUST use bash arithmetic syntax: for ((i=0; i<N; i++))
  • Numeric comparisons MUST use -lt, -gt, -eq (not string comparison)
  • Index comparisons MUST validate operands are numeric first
  • Critical failures MUST use return 1 to exit test immediately

Non-Functional Requirements

NFR-1: Performance

  • Test MUST complete within 60 seconds under normal conditions
  • Each curl retry MUST have exponential backoff or fixed 2s delay

NFR-2: Maintainability

  • Emergency paths MUST be easily updateable (single readonly array definition)
  • Error messages MUST be descriptive and actionable
  • Code MUST include inline comments explaining validation steps

NFR-3: Backward Compatibility

  • Test MUST pass for configs with only main routes (no emergency routes)
  • Test MUST pass for configs with only emergency routes (edge case)
  • Test MUST handle multiple proxy hosts correctly

NFR-4: CI/CD Integration

  • jq MUST be verified as available in GitHub Actions runners
  • Test output MUST be parseable by CI logging systems
  • Test exit codes MUST be consistent (0=pass, 1=fail)

Rollback Plan [NEW]

Triggers for Rollback

Rollback to previous byte-position checking if:

  1. Test becomes unstable (>5% failure rate in CI)
  2. False negatives detected (fails on valid configs)
  3. Performance degrades (test takes >90 seconds)
  4. jq dependency causes CI failures

Rollback Procedure

Step 1: Revert Test Script

git revert <commit-hash>  # Revert TC-2 changes

Step 2: Restore Original Logic

# Restore simple byte-position checking
WAF_POS=$(echo "$CADDY_CONFIG" | grep -ob '"handler":"waf"' | head -1 | cut -d: -f1)
RATE_POS=$(echo "$CADDY_CONFIG" | grep -ob '"handler":"rate_limit"' | head -1 | cut -d: -f1)
PROXY_POS=$(echo "$CADDY_CONFIG" | grep -ob '"handler":"reverse_proxy"' | head -1 | cut -d: -f1)

Step 3: Skip TC-2 Temporarily

# Add condition to skip TC-2 until proper fix is implemented
if [ "$SKIP_TC2" = "true" ]; then
    log_warn "TC-2 skipped (break glass protocol incompatible)"
    return 0
fi

Step 4: Document Issue

  • Create GitHub issue documenting why rollback was needed
  • Tag issue with test-framework and cerberus
  • Link to failed CI runs

Step 5: Alternative Approaches If route-aware verification proves problematic:

  • Option A: Count handler occurrences instead of checking order
  • Option B: Skip TC-2 entirely and add backend unit tests for route generation
  • Option C: Move handler order verification to E2E tests (Playwright)

Rollback Verification

After rollback:

  • Run full Cerberus integration test suite locally
  • Verify CI passes
  • Monitor for 24 hours
  • Document lessons learned

Solution: Route-Aware Handler Order Verification

Approach

Fix the test, not the production code. The production code is correct - the break glass protocol is working as designed. The test's naive byte-position checking needs to be replaced with route-aware verification.

Implementation Options

Approach:

  1. Use jq to parse the Caddy JSON config
  2. Extract routes individually
  3. For each route, check if it's an emergency route (has /api/v1/emergency/* path matcher)
  4. Skip emergency routes (they're supposed to bypass security)
  5. For main routes only, verify handler order

Pros:

  • Accurate - checks handler order within logical route boundaries
  • Accounts for emergency routes correctly
  • Robust - works with any number of routes
  • Easy to understand and maintain

Cons:

  • ⚠️ Requires jq to be installed in test environment

Implementation:

# TC-2: Verify handler order in Caddy config
log_test "TC-2: Verify Handler Order in Caddy Config (Route-Aware)"

CADDY_CONFIG=$(curl -sL "http://localhost:${CADDY_ADMIN_PORT}/config/" 2>/dev/null || echo "")

if [ -z "$CADDY_CONFIG" ]; then
    fail_test "Could not retrieve Caddy config"
else
    # Extract routes from Caddy config
    # Filter to only check main routes (non-emergency routes)
    # Emergency routes are identified by having path matchers for /api/v1/emergency/*

    # Count total routes
    TOTAL_ROUTES=$(echo "$CADDY_CONFIG" | jq -r '.apps.http.servers.charon_server.routes | length')
    log_info "  Found $TOTAL_ROUTES routes in config"

    # Check each route individually
    MAIN_ROUTES_CHECKED=0
    HANDLER_ORDER_CORRECT=0

    for i in $(seq 0 $((TOTAL_ROUTES - 1))); do
        ROUTE=$(echo "$CADDY_CONFIG" | jq -r ".apps.http.servers.charon_server.routes[$i]")

        # Check if this is an emergency route (has path matchers containing /emergency)
        HAS_EMERGENCY_PATH=$(echo "$ROUTE" | jq -r '
            .match[]? |
            select(.path != null) |
            .path[]? |
            select(contains("/emergency"))' | wc -l)

        if [ "$HAS_EMERGENCY_PATH" -gt 0 ]; then
            log_info "  Route $i: Emergency route (skipping handler order check)"
            continue
        fi

        # This is a main route - check handler order
        MAIN_ROUTES_CHECKED=$((MAIN_ROUTES_CHECKED + 1))

        # Get handler types in order
        HANDLERS=$(echo "$ROUTE" | jq -r '.handle[].handler' | tr '\n' ' ')
        log_info "  Route $i: Main route handlers: $HANDLERS"

        # Verify WAF comes before reverse_proxy (if WAF is present)
        WAF_INDEX=$(echo "$ROUTE" | jq -r '[.handle[].handler] | map(if . == "waf" then true else false end) | index(true) // -1')
        PROXY_INDEX=$(echo "$ROUTE" | jq -r '[.handle[].handler] | map(if . == "reverse_proxy" then true else false end) | index(true) // -1')

        if [ "$WAF_INDEX" != "-1" ] && [ "$PROXY_INDEX" != "-1" ]; then
            if [ "$WAF_INDEX" -lt "$PROXY_INDEX" ]; then
                log_info "    ✓ WAF ($WAF_INDEX) before reverse_proxy ($PROXY_INDEX)"
                HANDLER_ORDER_CORRECT=$((HANDLER_ORDER_CORRECT + 1))
            else
                fail_test "WAF must appear before reverse_proxy in route $i"
            fi
        fi

        # Verify rate_limit comes before reverse_proxy (if rate_limit is present)
        RATE_INDEX=$(echo "$ROUTE" | jq -r '[.handle[].handler] | map(if . == "rate_limit" then true else false end) | index(true) // -1')

        if [ "$RATE_INDEX" != "-1" ] && [ "$PROXY_INDEX" != "-1" ]; then
            if [ "$RATE_INDEX" -lt "$PROXY_INDEX" ]; then
                log_info "    ✓ rate_limit ($RATE_INDEX) before reverse_proxy ($PROXY_INDEX)"
            else
                fail_test "rate_limit must appear before reverse_proxy in route $i"
            fi
        fi
    done

    if [ "$MAIN_ROUTES_CHECKED" -gt 0 ] && [ "$HANDLER_ORDER_CORRECT" -gt 0 ]; then
        log_info "  ✓ Handler order verified in $MAIN_ROUTES_CHECKED main routes"
        PASSED=$((PASSED + 1))
    elif [ "$MAIN_ROUTES_CHECKED" -eq 0 ]; then
        log_warn "  No main routes found to verify (this may indicate a config issue)"
        PASSED=$((PASSED + 1)) # Don't fail if no main routes (may be valid)
    else
        fail_test "Handler order incorrect in main routes"
    fi
fi

Option B: Count Handlers Instead of Positions (SIMPLER)

Approach:

  1. Count total occurrences of each handler type
  2. Verify expected counts match
  3. Don't worry about order - Caddy handles it correctly

Pros:

  • Simple - no JSON parsing needed
  • Robust - works with emergency routes
  • No additional dependencies

Cons:

  • ⚠️ Doesn't actually verify order (less strict)
  • ⚠️ Could miss misconfigurations

Implementation:

# TC-2: Verify required handlers are present
log_test "TC-2: Verify Required Security Handlers Present"

CADDY_CONFIG=$(curl -sL "http://localhost:${CADDY_ADMIN_PORT}/config/" 2>/dev/null || echo "")

if [ -z "$CADDY_CONFIG" ]; then
    fail_test "Could not retrieve Caddy config"
else
    # Count handler occurrences
    WAF_COUNT=$(echo "$CADDY_CONFIG" | grep -o '"handler":"waf"' | wc -l)
    RATE_COUNT=$(echo "$CADDY_CONFIG" | grep -o '"handler":"rate_limit"' | wc -l)
    PROXY_COUNT=$(echo "$CADDY_CONFIG" | grep -o '"handler":"reverse_proxy"' | wc -l)

    log_info "  Handler counts: WAF=$WAF_COUNT, rate_limit=$RATE_COUNT, reverse_proxy=$PROXY_COUNT"

    # Verify we have at least one of each security handler
    if [ "$WAF_COUNT" -ge 1 ]; then
        log_info "  ✓ WAF handler present"
        PASSED=$((PASSED + 1))
    else
        fail_test "WAF handler not found"
    fi

    if [ "$RATE_COUNT" -ge 1 ]; then
        log_info "  ✓ rate_limit handler present"
        PASSED=$((PASSED + 1))
    else
        fail_test "rate_limit handler not found"
    fi

    # Verify we have more reverse_proxy handlers than security handlers
    # (because emergency routes have reverse_proxy but no security)
    if [ "$PROXY_COUNT" -gt "$WAF_COUNT" ]; then
        log_info "  ✓ Emergency routes present (more reverse_proxy than WAF)"
        PASSED=$((PASSED + 1))
    else
        log_warn "  ⚠ May be missing emergency routes"
    fi
fi

Approach:

  1. Update test to skip handler order check when emergency routes are detected
  2. Add comment explaining why
  3. Continue with rest of tests

Pros:

  • Minimal change
  • Quick to implement

Cons:

  • Loses test coverage
  • Doesn't validate handler order at all
  • Defeats purpose of TC-2

Rationale:

  • Most accurate and complete verification
  • Accounts for emergency routes explicitly
  • Maintains test coverage for handler ordering
  • Clear and maintainable code
  • Aligns with break glass protocol design

Implementation Plan

Phase 1: Verify Test Failure Locally (15 minutes) [UPDATED]

Objective: Reproduce the test failure in local environment and confirm root cause.

Tasks:

  1. Run Cerberus integration test locally:

    # Start test environment
    docker compose -f .docker/compose/docker-compose.playwright-local.yml up -d
    
    # Run Cerberus integration test
    ./scripts/cerberus_integration.sh
    
  2. Capture failing test output:

    • Document exact error message for TC-2
    • Capture byte positions reported: WAF_POS, RATE_POS, PROXY_POS
    • Verify that PROXY_POS < WAF_POS (emergency route's reverse_proxy comes first)
  3. Inspect Caddy config:

    # Get Caddy config
    curl -s http://localhost:2019/config/ | jq . > caddy_config.json
    
    # Count routes
    jq '.apps.http.servers.charon_server.routes | length' caddy_config.json
    
    # Verify emergency route is first
    jq '.apps.http.servers.charon_server.routes[0].match[].path[]?' caddy_config.json
    
    # Verify main route is second
    jq '.apps.http.servers.charon_server.routes[1].handle[].handler' caddy_config.json
    
  4. Verify break glass protocol works:

    # Emergency endpoint should bypass security
    curl -H "X-Emergency-Token: test-emergency-token-for-e2e-32chars" \
         -X POST http://localhost:8080/api/v1/emergency/security-reset
    
    # Should return 200 OK and disable security modules
    

Success Criteria:

  • Test fails with "Security handlers not in correct order"
  • Confirmed: PROXY_POS < WAF_POS due to emergency route
  • Confirmed: Emergency route appears first in JSON
  • Confirmed: Break glass protocol works functionally

Files:

  • scripts/cerberus_integration.sh (read only - observe failure)
  • Document findings in diagnosis report

Estimated Time: 15 minutes


Phase 2: Implement Route-Aware Verification (45 minutes) [UPDATED]

Objective: Update the test to properly verify handler order within each route, accounting for emergency routes.

CRITICAL CHANGES (Addressing Supervisor Feedback):

  1. Make jq a HARD REQUIREMENT - No fallback mode
  2. Fix emergency path detection - Use EXACT path matching for 4 specific paths
  3. Add defensive checks - JSON validation, numeric validation, route structure checks
  4. Improve curl command - Add timeout and retry logic
  5. Fix bash scripting - Use bash arithmetic loops, proper numeric comparisons

File: scripts/cerberus_integration.sh (lines 374-420)

Changes:

# TC-2: Verify handler order in Caddy config (ROUTE-AWARE)
# ============================================================================
log_test "TC-2: Verify Handler Order in Caddy Config"

# HARD REQUIREMENT: Check if jq is available (no fallback mode)
if ! command -v jq >/dev/null 2>&1; then
    fail_test "jq is required for handler order verification. Install: apt-get install jq / brew install jq"
    return 1
fi

# Fetch Caddy config with timeout and retry
CADDY_CONFIG=""
for attempt in 1 2 3; do
    CADDY_CONFIG=$(curl --max-time 10 -sL "http://localhost:${CADDY_ADMIN_PORT}/config/" 2>/dev/null || echo "")
    if [ -n "$CADDY_CONFIG" ]; then
        break
    fi
    log_warn "  Attempt $attempt/3: Failed to fetch Caddy config, retrying..."
    sleep 2
done

if [ -z "$CADDY_CONFIG" ]; then
    fail_test "Could not retrieve Caddy config after 3 attempts"
    return 1
fi

# Validate JSON structure before processing
if ! echo "$CADDY_CONFIG" | jq empty 2>/dev/null; then
    fail_test "Retrieved Caddy config is not valid JSON"
    return 1
fi

# Validate expected structure exists
if ! echo "$CADDY_CONFIG" | jq -e '.apps.http.servers.charon_server.routes' >/dev/null 2>&1; then
    fail_test "Caddy config missing expected route structure (.apps.http.servers.charon_server.routes)"
    return 1
fi

# Get route count with validation
TOTAL_ROUTES=$(echo "$CADDY_CONFIG" | jq -r '.apps.http.servers.charon_server.routes | length' 2>/dev/null || echo "0")

# Validate route count is numeric and non-negative
if ! [[ "$TOTAL_ROUTES" =~ ^[0-9]+$ ]]; then
    fail_test "Invalid route count (not numeric): $TOTAL_ROUTES"
    return 1
fi

log_info "  Found $TOTAL_ROUTES routes in Caddy config"

if [ "$TOTAL_ROUTES" -eq 0 ]; then
    fail_test "No routes found in Caddy config"
    return 1
fi

# Define EXACT emergency paths (must match config.go lines 687-691)
readonly EMERGENCY_PATHS=(
    "/api/v1/emergency/security-reset"
    "/api/v1/emergency/*"
    "/emergency/security-reset"
    "/emergency/*"
)

ROUTES_VERIFIED=0
EMERGENCY_ROUTES_SKIPPED=0

# Use bash arithmetic loop instead of seq
for ((i=0; i<TOTAL_ROUTES; i++)); do
    # Validate route exists at index
    if ! echo "$CADDY_CONFIG" | jq -e ".apps.http.servers.charon_server.routes[$i]" >/dev/null 2>&1; then
        log_warn "  Route $i: Missing or invalid route structure, skipping"
        continue
    fi

    # Check if this route has EXACT emergency path matches
    IS_EMERGENCY_ROUTE=false
    for emergency_path in "${EMERGENCY_PATHS[@]}"; do
        # EXACT path comparison (not substring matching)
        EXACT_MATCH=$(echo "$CADDY_CONFIG" | jq -r "
            .apps.http.servers.charon_server.routes[$i].match[]? |
            select(.path != null) |
            .path[]? |
            select(. == \"$emergency_path\")" 2>/dev/null | wc -l | tr -d ' ')

        # Validate match count is numeric
        if ! [[ "$EXACT_MATCH" =~ ^[0-9]+$ ]]; then
            log_warn "  Route $i: Invalid match count for path '$emergency_path', skipping"
            continue
        fi

        if [ "$EXACT_MATCH" -gt 0 ]; then
            IS_EMERGENCY_ROUTE=true
            break
        fi
    done

    if [ "$IS_EMERGENCY_ROUTE" = true ]; then
        log_info "  Route $i: Emergency route (security bypass by design) - skipping"
        EMERGENCY_ROUTES_SKIPPED=$((EMERGENCY_ROUTES_SKIPPED + 1))
        continue
    fi

    # Main route - verify handler order
    log_info "  Route $i: Main route - verifying handler order..."

    # Validate handlers array exists
    if ! echo "$CADDY_CONFIG" | jq -e ".apps.http.servers.charon_server.routes[$i].handle" >/dev/null 2>&1; then
        log_warn "  Route $i: No handlers found, skipping"
        continue
    fi

    # Find indices of security handlers and reverse_proxy
    WAF_IDX=$(echo "$CADDY_CONFIG" | jq -r "[.apps.http.servers.charon_server.routes[$i].handle[]?.handler] | map(if . == \"waf\" then true else false end) | index(true) // -1" 2>/dev/null || echo "-1")
    RATE_IDX=$(echo "$CADDY_CONFIG" | jq -r "[.apps.http.servers.charon_server.routes[$i].handle[]?.handler] | map(if . == \"rate_limit\" then true else false end) | index(true) // -1" 2>/dev/null || echo "-1")
    PROXY_IDX=$(echo "$CADDY_CONFIG" | jq -r "[.apps.http.servers.charon_server.routes[$i].handle[]?.handler] | map(if . == \"reverse_proxy\" then true else false end) | index(true) // -1" 2>/dev/null || echo "-1")

    # Validate all indices are numeric
    if ! [[ "$WAF_IDX" =~ ^-?[0-9]+$ ]] || ! [[ "$RATE_IDX" =~ ^-?[0-9]+$ ]] || ! [[ "$PROXY_IDX" =~ ^-?[0-9]+$ ]]; then
        fail_test "Invalid handler indices in route $i (not numeric)"
        return 1
    fi

    # Verify WAF comes before reverse_proxy (if present)
    if [ "$WAF_IDX" -ge 0 ] && [ "$PROXY_IDX" -ge 0 ]; then
        if [ "$WAF_IDX" -lt "$PROXY_IDX" ]; then
            log_info "    ✓ WAF (index $WAF_IDX) before reverse_proxy (index $PROXY_IDX)"
        else
            fail_test "WAF must appear before reverse_proxy in route $i (WAF=$WAF_IDX, proxy=$PROXY_IDX)"
            return 1
        fi
    fi

    # Verify rate_limit comes before reverse_proxy (if present)
    if [ "$RATE_IDX" -ge 0 ] && [ "$PROXY_IDX" -ge 0 ]; then
        if [ "$RATE_IDX" -lt "$PROXY_IDX" ]; then
            log_info "    ✓ rate_limit (index $RATE_IDX) before reverse_proxy (index $PROXY_IDX)"
        else
            fail_test "rate_limit must appear before reverse_proxy in route $i (rate=$RATE_IDX, proxy=$PROXY_IDX)"
            return 1
        fi
    fi

    ROUTES_VERIFIED=$((ROUTES_VERIFIED + 1))
done

log_info "  Summary: Verified $ROUTES_VERIFIED main routes, skipped $EMERGENCY_ROUTES_SKIPPED emergency routes"

if [ "$ROUTES_VERIFIED" -gt 0 ]; then
    log_info "  ✓ Handler order correct in all main routes"
    PASSED=$((PASSED + 1))
else
    log_warn "  No main routes found to verify (all routes are emergency routes?)"
    # Don't fail if only emergency routes exist - this may be valid in some configs
    PASSED=$((PASSED + 1))
fi

Key Changes [UPDATED]:

  1. BREAKING: Make jq a HARD REQUIREMENT (fail if missing, no fallback)
  2. FIXED: Use EXACT path matching for 4 specific emergency paths (not substring contains)
  3. ADDED: JSON validation before processing
  4. ADDED: Numeric validation for all index comparisons
  5. ADDED: curl timeout (10s) and retry logic (3 attempts, 2s delay)
  6. FIXED: Use bash arithmetic loops (for ((i=0; i<N; i++))) instead of seq
  7. ADDED: Route structure validation at each step
  8. Skip emergency routes (they're supposed to bypass security)
  9. Verify handler order WITHIN each main route only
  10. Use return 1 on critical failures to exit test immediately

Success Criteria [UPDATED]:

  • Test passes with break glass protocol routes
  • Emergency routes detected using EXACT path matching (not substring)
  • Handler order verified in main routes only
  • jq dependency checked at start (fail fast if missing)
  • All defensive checks in place (JSON validation, numeric validation)
  • curl command includes timeout and retry
  • Bash loops use arithmetic syntax
  • Clear, informative console output
  • Test fails fast on critical errors

Files:

  • scripts/cerberus_integration.sh (TC-2 section only)

Dependencies:

  • jq (REQUIRED) - Must be installed in test environment and CI

Estimated Time: 45 minutes (increased due to additional validation)


Phase 3: Update Documentation (25 minutes) [UPDATED]

Objective: Document the test change, requirements, and explain why handler order checking must be route-aware.

Tasks:

  1. Update test documentation:

    • File: docs/testing/cerberus-integration.md (create if doesn't exist)
    • Explain break glass protocol route structure
    • Document why emergency routes skip security handlers
    • Explain why byte-position checking fails
    • [NEW] Document jq as a hard dependency with installation instructions
    • [NEW] Document emergency path matching logic (exact match vs substring)
    • [NEW] Add troubleshooting section for common issues
  2. Create test requirements document [NEW]:

    • File: docs/testing/cerberus-test-requirements.md
    • List all functional requirements (FR-1 through FR-5)
    • List all non-functional requirements (NFR-1 through NFR-4)
    • Document expected test behavior for each scenario
  3. Update CHANGELOG:

    ### Fixed
    - **CI**: Fixed Cerberus integration test TC-2 (handler order verification) to be route-aware
    - Test now correctly accounts for emergency bypass routes that skip security handlers
    - **BREAKING**: jq is now a hard requirement for Cerberus integration tests
    - Added defensive validation: JSON validation, numeric checks, route structure validation
    - Added curl timeout and retry logic for network resilience
    - Fixed emergency path detection to use exact matching instead of substring matching
    - No production code changes - break glass protocol working as designed
    
  4. Add rollback documentation [NEW]:

    • File: docs/testing/cerberus-rollback-plan.md
    • Document rollback triggers and procedure
    • List alternative approaches if route-aware verification fails
  5. Update test inline comments:

    • Add comments explaining emergency route detection logic
    • Document why we use jq for route parsing (required for route-aware checking)
    • [NEW] Add comments for each defensive check
    • [NEW] Add comments explaining bash arithmetic loops
    • [REMOVED] Explain fallback behavior (no fallback mode)

Success Criteria [UPDATED]:

  • Test behavior documented clearly with requirements
  • Break glass protocol explained in context
  • jq dependency documented with installation instructions
  • Rollback plan documented
  • Changelog updated with breaking change warning
  • Inline comments added to all validation steps
  • Troubleshooting guide created

Files:

  • docs/testing/cerberus-integration.md (NEW/UPDATE)
  • docs/testing/cerberus-test-requirements.md (NEW)
  • docs/testing/cerberus-rollback-plan.md (NEW)
  • CHANGELOG.md (UPDATE)
  • scripts/cerberus_integration.sh (inline comments)

Estimated Time: 25 minutes (increased due to additional documentation)


Phase 4: Verify Test Fix (45 minutes) [UPDATED]

Objective: Run the updated test locally and in CI to confirm it passes. Test all edge cases and backward compatibility.

Tasks:

  1. Verify jq is available:

    # Check jq version
    jq --version
    
    # If not installed:
    # Ubuntu/Debian: sudo apt-get install -y jq
    # macOS: brew install jq
    # Alpine: apk add --no-cache jq
    
  2. Run test locally (standard case):

    # Clean environment
    docker compose -f .docker/compose/docker-compose.playwright-local.yml down -v
    
    # Rebuild and start
    docker compose -f .docker/compose/docker-compose.playwright-local.yml up -d
    
    # Wait for services to be ready
    sleep 15
    
    # Run updated Cerberus integration test
    ./scripts/cerberus_integration.sh
    
  3. Test edge case: Multiple proxy hosts [NEW]:

    # Add 2-3 additional proxy hosts via UI or API
    # Each host should generate 2 routes (emergency + main)
    # Run test again - should verify all main routes
    ./scripts/cerberus_integration.sh
    
  4. Test edge case: Cerberus disabled [NEW]:

    # Use override compose file to disable Cerberus
    docker compose -f .docker/compose/docker-compose.playwright-local.yml \
                   -f .docker/compose/docker-compose.e2e.cerberus-disabled.override.yml \
                   up -d
    
    # Run test - TC-2 should pass (no security handlers to check)
    ./scripts/cerberus_integration.sh
    
  5. Test backward compatibility: Single route (no break-glass) [NEW]:

    # This requires temporarily reverting config.go to only generate main route
    # OR testing against an older version of Charon
    # TC-2 should still pass for single-route configs
    
  6. Verify jq is present in GitHub Actions [NEW]:

    # Check if jq is pre-installed on GitHub-hosted runners
    # Ubuntu 22.04 runners include jq by default
    # Verify in .github/workflows/*.yml
    
  7. Verify output [UPDATED]:

    • TC-2 should PASS
    • Should see "Found N routes in Caddy config"
    • Should see "Emergency route (security bypass by design) - skipping" for each emergency route
    • Should see "Main route - verifying handler order" for each main route
    • Should see "WAF (index X) before reverse_proxy (index Y)" for each main route
    • Should see "rate_limit (index X) before reverse_proxy (index Y)" for each main route
    • Should see "Summary: Verified N main routes, skipped M emergency routes"
    • All other TCs (TC-1, TC-3, TC-4, TC-5) should continue to pass
    • NO fallback messages (jq is required)
  8. Create PR and run CI [UPDATED]:

    • Push changes to feature branch
    • Verify CI pipeline passes
    • Check that Cerberus integration test passes in CI
    • Review CI logs for expected output

Success Criteria [UPDATED]:

  • TC-2 passes locally (standard case)
  • TC-2 passes with multiple proxy hosts
  • TC-2 passes with Cerberus disabled (edge case)
  • TC-2 passes with single route (backward compatibility)
  • All 5 Cerberus integration TCs pass
  • Test output is clear and informative (includes route summaries)
  • jq dependency verified in CI environment
  • CI pipeline passes
  • No regressions in other tests
  • Test fails fast if jq is missing (no silent fallback)

Estimated Time: 45 minutes (includes edge case testing)


Testing Strategy

Unit Tests (Not Applicable)

This issue is purely in the integration test script - no unit tests needed.

Integration Tests [UPDATED]

File: scripts/cerberus_integration.sh

The fix IS the integration test itself. Verification involves:

  1. Positive Test Cases:

    • Single proxy host with emergency + main routes → TC-2 passes
    • [NEW] Multiple proxy hosts with multiple routes each → TC-2 passes
    • Emergency routes correctly identified using EXACT path matching
    • Main routes correctly verified for handler order
  2. Edge Cases [UPDATED]:

    • Only emergency routes (no main routes) → TC-2 should not fail (valid config)
    • [REMOVED] No jq available → Fallback (jq now REQUIRED, test fails if missing)
    • Caddy config unavailable after 3 retries → Fail gracefully with clear error
    • Invalid JSON response → Fail with validation error
    • Missing route structure → Fail with structure validation error
    • [NEW] Cerberus disabled (no security handlers) → TC-2 passes (no security handlers to check)
  3. Backward Compatibility Test Cases [NEW]:

    • Single route, pre-break-glass config (no emergency route) → TC-2 should still pass
    • Config with only main route (no path matchers) → Verify handlers normally
  4. Regression Tests:

    • TC-1 (feature status) still passes
    • TC-3 (WAF doesn't consume rate limit) still passes
    • TC-4 (traffic flows through layers) still passes
    • TC-5 (latency check) still passes

E2E Tests (Existing Coverage)

The break glass protocol already has E2E test coverage:

  • tests/emergency-server/emergency-server.spec.ts
  • tests/emergency-server/tier2-validation.spec.ts
  • tests/security-enforcement/emergency-token.spec.ts

No additional E2E tests needed - the Cerberus integration test IS the verification.


Success Metrics

Functionality

  • TC-2 passes with break glass protocol routes
  • Emergency routes correctly detected and skipped
  • Handler order verified in main routes
  • No false positives (emergency routes don't fail order check)
  • No false negatives (real order issues are still detected)

CI/CD

  • CI pipeline passes on feature/beta-release branch
  • PR #550 can be merged
  • No test flakiness introduced

Documentation

  • Test behavior documented clearly
  • Break glass protocol explained in test context
  • Changelog updated

Risk Assessment [UPDATED]

Risk Impact Likelihood Mitigation
jq not available in CI High Low CHANGED: Make jq REQUIRED; verify it's pre-installed on GitHub runners (Ubuntu 22.04 includes jq)
Test becomes too complex Medium Medium Clear comments, step-by-step validation, early exit on error
False positives (test passes when it shouldn't) High Low Extensive defensive checks, numeric validation, structure validation
Regression in other TCs Medium Low Run full test suite before committing
Test still flaky High Low Curl retry logic (3 attempts), increased timeout, validate JSON before parsing
Emergency path detection false positives [NEW] High Low Use EXACT path matching instead of substring matching
Numeric comparison failures [NEW] Medium Low Validate all indices are numeric before comparison
Missing route structure [NEW] Medium Low Validate route structure exists before accessing

Timeline [UPDATED]

Total Estimated Time: 2 hours 30 minutes (increased due to additional validation and testing)

  • Phase 1: Verify test failure locally (15 min)
  • Phase 2: Implement route-aware verification with full validation (45 min) [+15 min]
  • Phase 3: Update documentation (25 min) [+10 min]
  • Phase 4: Verify test fix and edge cases (45 min) [+15 min]

Target Completion: Same day


Implementation Checklist

Phase 1: Verify Test Failure Locally (15 min)

  • Start E2E environment (docker compose up)
  • Run scripts/cerberus_integration.sh
  • Capture TC-2 failure output
  • Inspect Caddy config with curl http://localhost:2019/config/ | jq
  • Verify emergency route is first, main route is second
  • Test break glass protocol manually
  • Document findings

Phase 2: Implement Route-Aware Verification (45 min) [UPDATED]

  • [NEW] Add jq dependency check at start of TC-2 (fail if missing)
  • Update scripts/cerberus_integration.sh TC-2 section
  • [NEW] Add curl timeout (10s) and retry logic (3 attempts, 2s delay)
  • [NEW] Add JSON validation before processing config
  • [NEW] Add route structure validation
  • Add jq-based route parsing
  • [CHANGED] Implement emergency route detection using EXACT path matching (not contains)
  • [NEW] Define readonly array of 4 specific emergency paths
  • [NEW] Use bash arithmetic loops instead of seq
  • [NEW] Add numeric validation for all index variables
  • Skip emergency routes in verification
  • Verify handler order in main routes only
  • [REMOVED] Add fallback for environments without jq (jq now REQUIRED)
  • [NEW] Add return 1 statements for critical failures
  • Test locally to confirm TC-2 passes
  • Verify all other TCs still pass

Phase 3: Update Documentation (25 min) [UPDATED]

  • Create docs/testing/cerberus-integration.md
  • Document break glass protocol route structure
  • [NEW] Document jq as a hard dependency
  • [NEW] Document emergency path matching logic (exact match, not substring)
  • Explain why route-aware checking is needed
  • [NEW] Add rollback plan section (revert to byte-position check if issues)
  • [NEW] Document defensive checks and validation steps
  • Update CHANGELOG.md
  • Add inline comments to test script
  • [NEW] Verify all file paths are accurate (no references to non-existent files)

Phase 4: Verify Test Fix (45 min) [UPDATED]

  • [NEW] Verify jq is installed (jq --version)
  • Run test locally 3+ times to ensure stability
  • [NEW] Test edge case: Multiple proxy hosts (3+ hosts)
  • [NEW] Test edge case: Cerberus disabled (no security handlers)
  • [NEW] Test edge case: Single route (backward compatibility)
  • [REMOVED] Test: no jq (now fails fast if missing)
  • Test edge case: Only emergency routes
  • [NEW] Test: curl timeout (stop Caddy temporarily)
  • [NEW] Test: Invalid JSON response (simulate)
  • [NEW] Verify jq is available in GitHub Actions runners
  • Create PR with test fix
  • Push to feature/beta-release branch
  • Monitor CI pipeline
  • Verify Cerberus integration test passes in CI
  • [NEW] Verify CI logs show route summaries
  • Review CI logs for expected output
  • Request code review

Post-Implementation

  • Monitor CI stability for 24 hours
  • Merge PR #550 if all checks pass
  • Close related issues

Future Enhancements (Out of Scope)

  1. Structured Test Output:

    • Convert bash test to Go test with structured JSON output
    • Easier to parse and integrate with CI reporting
  2. Visual Route Inspector:

    • Add admin UI endpoint showing route structure
    • Highlight emergency vs main routes
    • Show handler order per route
  3. Automated Route Verification:

    • Add backend unit tests for route generation
    • Verify emergency routes have no security handlers
    • Verify main routes have correct handler order

Appendix

  • Break Glass Protocol Design: docs/plans/break_glass_protocol_redesign.md
  • Cerberus Integration Plan: docs/plans/cerberus_integration_testing_plan.md
  • Emergency Server Implementation: docs/implementation/PHASE_3_4_TEST_ENVIRONMENT_COMPLETE.md
  • E2E Test Remediation: docs/implementation/e2e_remediation_complete.md

References

  1. Caddy Configuration: backend/internal/caddy/config.go (lines 687-731)

    • Emergency route generation (lines 687-711)
    • Main route generation (lines 714-731)
  2. Cerberus Integration Test: scripts/cerberus_integration.sh (lines 374-420)

    • TC-2: Handler order verification
  3. Security Handlers: backend/internal/caddy/config.go (lines 507-580)

    • WAF handler builder
    • Rate limit handler builder
    • ACL handler builder

Plan Status: UPDATED - READY FOR SUPERVISOR REVIEW Next Action: Submit to Supervisor for approval of updates Assigned To: Planning Agent → Supervisor Review Priority: HIGH 🔴 - CI blocking, prevents PR #550 from merging Estimated Duration: 2 hours 30 minutes (updated) Impact: Test fix only - No production code changes required


Supervisor Feedback Addressed [NEW]

Issue 1: Emergency Path Detection Fixed

  • Before: Used contains("/emergency") substring matching
  • After: Uses EXACT path matching for 4 specific paths from config.go
  • Implementation: Readonly array with exact path strings, loop comparison

Issue 2: E2E Compose File Path Corrected

  • Before: Referenced non-existent .docker/compose/docker-compose.e2e.yml
  • After: All references updated to .docker/compose/docker-compose.playwright-local.yml
  • Verified: File exists and contains E2E test configuration

Issue 3: jq Dependency Management

  • Before: Had fallback mode without jq
  • After: jq is HARD REQUIREMENT, test fails fast if missing
  • Added: Explicit jq check with installation instructions in error message
  • Verified: jq is pre-installed on GitHub Actions Ubuntu 22.04 runners

Issue 4: Shell Scripting Improvements

  • Added: JSON validation before processing (jq empty)
  • Added: Route structure validation (.apps.http.servers.charon_server.routes exists)
  • Added: Numeric validation for all indices before comparison
  • Fixed: Use bash arithmetic loops for ((i=0; i<N; i++)) instead of seq
  • Added: curl timeout (--max-time 10) and retry logic (3 attempts, 2s delay)
  • Added: Defensive checks at every array access point
  • Added: return 1 for critical failures to exit immediately

Issue 5: Testing Enhancements

  • Added: Multiple proxy hosts test case
  • Added: Backward compatibility test (single route, pre-break-glass)
  • Added: Edge case - Cerberus disabled (no security handlers)
  • Added: Edge case - curl timeout/retry testing
  • Added: Edge case - Invalid JSON response handling

📋 Plan Updates Summary

  • Increased time estimate: 1h30m → 2h30m (more validation = more time)
  • Updated file paths throughout (e2e.yml → playwright-local.yml)
  • Enhanced error handling and validation at all steps
  • Removed fallback mode entirely
  • Added comprehensive requirements specification
  • Added rollback plan for risk mitigation