Charon/docs/plans/archive/e2e-test-triage-quick-start.md
2026-03-04 18:34:49 +00:00

# E2E Test Triage - Quick Start Guide
## Status: ROOT CAUSE IDENTIFIED ✅
**Date:** February 3, 2026
**Test Suite:** Cross-browser Playwright (Chromium, Firefox, WebKit)
**Total Tests:** 2,737

---
## Critical Finding
### Design Intent (CONFIRMED)
Cerberus should be **ENABLED** during E2E tests to test the break glass feature:
- Cerberus framework stays **ON** throughout test suite
- All Cerberus tests run first (toggles, navigation, etc.)
- **Break glass test runs LAST** to validate emergency override
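
The ordering constraint above can be sketched as a small helper. This is only an illustration: `orderSpecs` and the `break-glass.spec.ts` filename are hypothetical names, and the suite may enforce the order differently (for example through Playwright project dependencies).

```typescript
// Hypothetical helper: moves every break glass spec to the end of the run
// while preserving the relative order of all other specs.
function orderSpecs(
  specs: string[],
  isBreakGlass: (spec: string) => boolean,
): string[] {
  const regular = specs.filter((spec) => !isBreakGlass(spec));
  const breakGlass = specs.filter((spec) => isBreakGlass(spec));
  return [...regular, ...breakGlass];
}

// Illustrative input; the break glass filename is an assumption.
const ordered = orderSpecs(
  [
    'tests/security/break-glass.spec.ts',
    'tests/security/security-dashboard.spec.ts',
    'tests/security/rate-limiting.spec.ts',
  ],
  (spec) => spec.includes('break-glass'),
);
console.log(ordered);
```

Running the break glass spec last matters because it exercises the emergency override, which changes the security state the other Cerberus tests depend on.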
### Problem
13 E2E tests are **conditionally skipping** at runtime because:
- Toggle buttons are **disabled** when Cerberus framework is off
- Emergency security reset is disabling **Cerberus itself** (bug)
- Tests check `toggle.isDisabled()` and skip when true
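
The chain from the bug to the skips can be modelled in a few lines. This is a simplified sketch, not the spec code: in the real tests the check is `toggle.isDisabled()` on a Playwright locator, whereas here the disabled state is derived directly from the setting. `toggleIsDisabled` and `toggleTestOutcome` are hypothetical names.

```typescript
interface FrameworkSettings {
  'feature.cerberus.enabled': boolean;
}

// UI rule described above: toggles are disabled whenever the framework is off.
function toggleIsDisabled(settings: FrameworkSettings): boolean {
  return !settings['feature.cerberus.enabled'];
}

// Conditional-skip pattern: a disabled toggle leads to a skip, not a failure.
function toggleTestOutcome(settings: FrameworkSettings): 'executed' | 'skipped' {
  return toggleIsDisabled(settings) ? 'skipped' : 'executed';
}

// State left behind by the buggy reset vs. the intended state.
const afterBuggyReset = toggleTestOutcome({ 'feature.cerberus.enabled': false });
const afterIntendedReset = toggleTestOutcome({ 'feature.cerberus.enabled': true });
console.log(afterBuggyReset, afterIntendedReset);
```

Because the outcome is a skip rather than a failure, the suite stays green and the problem is easy to miss.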
### Root Cause
The `/emergency/security-reset` endpoint (used in `tests/global-setup.ts`) is incorrectly disabling:
- `security.acl.enabled` = false ← CORRECT (module disabled)
- `security.waf.enabled` = false ← CORRECT (module disabled)
- `security.rate_limit.enabled` = false ← CORRECT (module disabled)
- `security.crowdsec.enabled` = false ← CORRECT (module disabled)
- **`feature.cerberus.enabled` = false** ← BUG (framework should stay enabled)
### Expected Behavior (CONFIRMED)
For E2E tests, Cerberus should be:
- **Framework Enabled:** `feature.cerberus.enabled` = true (allows testing)
- **Modules Disabled:** Individual security modules off for clean state
- **Test Order:** All Cerberus tests → Break glass test (LAST)
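
The expected post-reset state can be written down as a check. A minimal sketch, assuming the settings are available as a flat key/value map; how they are actually fetched is not shown, and `findResetViolations` is a hypothetical name. The setting keys are the ones listed under Root Cause.

```typescript
type Settings = Record<string, boolean>;

const FRAMEWORK_KEY = 'feature.cerberus.enabled';
const MODULE_KEYS = [
  'security.acl.enabled',
  'security.waf.enabled',
  'security.rate_limit.enabled',
  'security.crowdsec.enabled',
];

// Returns one message per setting that deviates from the expected E2E state:
// framework on, every module off. An empty array means the state is clean.
function findResetViolations(settings: Settings): string[] {
  const problems: string[] = [];
  if (settings[FRAMEWORK_KEY] !== true) {
    problems.push(`${FRAMEWORK_KEY} should be true`);
  }
  for (const key of MODULE_KEYS) {
    if (settings[key] !== false) {
      problems.push(`${key} should be false`);
    }
  }
  return problems;
}

// State produced by the buggy reset: modules off, but framework off too.
const buggyState: Settings = {
  'feature.cerberus.enabled': false,
  'security.acl.enabled': false,
  'security.waf.enabled': false,
  'security.rate_limit.enabled': false,
  'security.crowdsec.enabled': false,
};
console.log(findResetViolations(buggyState));
```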
---
## Affected Tests (13 Total)
### Category 1: Security Dashboard - Toggle Actions (5 tests)
- Test 77: Toggle ACL enabled/disabled
- Test 78: Toggle WAF enabled/disabled
- Test 79: Toggle Rate Limiting enabled/disabled
- Test 80/214: Persist toggle state after page reload
### Category 2: Security Dashboard - Navigation (4 tests)
- Test 81/250: Navigate to CrowdSec config
- Test 83/309: Navigate to WAF config
- Test 84/335: Navigate to Rate Limiting config
### Category 3: Rate Limiting Config (1 test)
- Test 57/70: Toggle rate limiting on/off
### Category 4: CrowdSec Decisions (13 tests - SKIP OK)
- Tests 42-53: Explicitly skipped with `test.describe.skip()`
- **No action needed** - these require CrowdSec running (integration tests)
---
## Immediate Action Plan
### Step 1: Verify Current State ✅ CONFIRMED
**Design Intent:** Cerberus should be enabled for break glass testing
**Test Flow:** Global setup → All Cerberus tests → Break glass test (LAST)
**Problem:** Emergency reset incorrectly disables Cerberus framework
Run diagnostic script:
```bash
./scripts/diagnose-test-env.sh
```
Expected output shows:
- ✓ Container running
- ✗ Cerberus state unknown (no settings endpoint on emergency server)
### Step 2: Check Cerberus State via Main API
```bash
# Requires authentication - use your test user credentials
curl -H "Authorization: Bearer <token>" http://localhost:8080/api/v1/security/config | jq '.cerberus // .feature.cerberus'
```
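
If the same check is needed inside a test helper, the jq filter can be mirrored in TypeScript. A sketch, assuming the response body is a JSON object; the actual shape of `/api/v1/security/config` is not confirmed here, which is why both paths are tried. `readCerberusState` is a hypothetical name.

```typescript
// Mirrors the jq filter `.cerberus // .feature.cerberus`:
// use the top-level value unless it is missing, null, or false,
// otherwise fall back to the nested one (null when absent).
function readCerberusState(config: Record<string, unknown>): unknown {
  const top = config['cerberus'];
  if (top !== undefined && top !== null && top !== false) {
    return top;
  }
  const feature = config['feature'];
  if (feature !== null && typeof feature === 'object') {
    return (feature as Record<string, unknown>)['cerberus'] ?? null;
  }
  return null;
}

// Illustrative payloads only; neither is a captured API response.
console.log(readCerberusState({ cerberus: { enabled: true } }));
console.log(readCerberusState({ feature: { cerberus: true } }));
```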
### Step 3: Review Emergency Handler Code (INVESTIGATE)
File: `backend/internal/api/handlers/emergency_handler.go`
Find the `SecurityReset` function and check what it's disabling:
```bash
grep -A 20 "func.*SecurityReset" backend/internal/api/handlers/emergency_handler.go
```
### Step 4: Fix Emergency Reset Bug
**Goal:** Keep Cerberus enabled while disabling security modules
**Option A: Backend Fix (Recommended)**
Modify `emergency_handler.go` SecurityReset to:
- **REMOVE:** `feature.cerberus.enabled` = false (this is the bug)
- **KEEP:** Disable individual security modules
- **KEEP:** `security.{acl,waf,rate_limit,crowdsec}.enabled` = false
Expected behavior:
- Framework stays enabled for testing
- Modules disabled for clean slate
- Break glass test can run last to validate emergency override
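
The intended behavior of Option A can be modelled as a pure function. This is a TypeScript model for reasoning about the fix, not the Go handler: `intendedSecurityReset` is a hypothetical name, and the real `SecurityReset` may touch other settings that are not shown here.

```typescript
type Settings = Record<string, boolean>;

const MODULE_KEYS = [
  'security.acl.enabled',
  'security.waf.enabled',
  'security.rate_limit.enabled',
  'security.crowdsec.enabled',
];

// Forces every module flag off and passes all other settings through
// untouched, so feature.cerberus.enabled keeps whatever value it had.
function intendedSecurityReset(current: Settings): Settings {
  const next: Settings = { ...current };
  for (const key of MODULE_KEYS) {
    next[key] = false;
  }
  return next;
}

// Illustrative starting state: framework on, two modules on.
const before: Settings = {
  'feature.cerberus.enabled': true,
  'security.acl.enabled': true,
  'security.waf.enabled': true,
};
const after = intendedSecurityReset(before);
console.log(after);
```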
**Option B: Frontend State Reset (Workaround)**
Add post-reset call in `tests/global-setup.ts`:
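```typescript
// After emergency reset, re-enable Cerberus framework
// (Workaround for backend bug where reset disables Cerberus)
const enableResponse = await requestContext.patch('/api/v1/settings', {
  data: { 'feature.cerberus.enabled': true },
});
// Fail fast so a broken setup does not surface later as silent skips
if (!enableResponse.ok()) {
  throw new Error(`Failed to re-enable Cerberus: HTTP ${enableResponse.status()}`);
}
```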
### Step 5: Validate Fix
```bash
# Rebuild E2E environment
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e
# Run affected tests
npm run test:e2e -- tests/security/security-dashboard.spec.ts --project=chromium
# Verify toggles are enabled (not disabled)
# Tests should now execute, not skip
```
---
## Files to Review/Modify
### Backend
- [ ] `backend/internal/api/handlers/emergency_handler.go` - SecurityReset function
- [ ] `backend/internal/services/settings_service.go` - Settings update logic
### Tests
- [ ] `tests/global-setup.ts` - Emergency reset call
- [ ] `tests/security/security-dashboard.spec.ts` - Toggle tests
- [ ] `tests/security/rate-limiting.spec.ts` - Toggle test
### Documentation
- [x] `docs/plans/e2e-test-triage-plan.md` - Full triage plan (COMPLETE)
- [x] `scripts/diagnose-test-env.sh` - Diagnostic script (CREATED)
- [ ] Update after fix is implemented
---
## Success Criteria
### Before Fix
```
Running 2737 tests using 2 workers
✓ pass - Tests that run successfully
- skip - Tests that conditionally skip (13 affected)
```
### After Fix
```
Running 2737 tests using 2 workers
✓ pass - All 13 previously-skipped tests now execute
- skip - Only explicitly skipped tests (test.describe.skip)
```
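
One way to confirm the before/after difference is to compare the skipped tests of two runs. A minimal sketch, assuming each run has been reduced to a list of titles and statuses; `TestResult` and `newlyExecuted` are hypothetical names, and the sample data is illustrative rather than real output.

```typescript
interface TestResult {
  title: string;
  status: 'passed' | 'failed' | 'skipped';
}

// Titles that were skipped in the first run and ran (passed or failed)
// in the second. Tests missing from the second run are not counted.
function newlyExecuted(before: TestResult[], after: TestResult[]): string[] {
  const afterStatus = new Map<string, TestResult['status']>();
  for (const result of after) {
    afterStatus.set(result.title, result.status);
  }
  return before
    .filter((result) => result.status === 'skipped')
    .filter((result) => {
      const status = afterStatus.get(result.title);
      return status !== undefined && status !== 'skipped';
    })
    .map((result) => result.title);
}

// Illustrative data only.
const beforeFix: TestResult[] = [
  { title: 'Toggle ACL enabled/disabled', status: 'skipped' },
  { title: 'Toggle WAF enabled/disabled', status: 'skipped' },
  { title: 'CrowdSec Decisions', status: 'skipped' },
];
const afterFix: TestResult[] = [
  { title: 'Toggle ACL enabled/disabled', status: 'passed' },
  { title: 'Toggle WAF enabled/disabled', status: 'failed' },
  { title: 'CrowdSec Decisions', status: 'skipped' },
];
console.log(newlyExecuted(beforeFix, afterFix));
```

A test that now fails still counts as executed here, which matches the checklist item "Tests pass (or have actionable failures)".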
### Validation Checklist
- [ ] Emergency reset keeps Cerberus enabled
- [ ] Emergency reset disables all security modules
- [ ] Toggle buttons are enabled (not disabled)
- [ ] Configure buttons are enabled (not disabled)
- [ ] Tests execute instead of skip
- [ ] Tests pass (or have actionable failures)
- [ ] CI/CD pipeline updated if needed
---
## Next Steps
1. **Investigate Backend** (30 min)
   - Read `emergency_handler.go` SecurityReset implementation
   - Determine what settings are being modified
   - Document current behavior
2. **Design Fix** (30 min)
   - Choose Option A (backend) or Option B (frontend)
   - Create implementation plan
   - Review with team if needed
3. **Implement Fix** (1-2 hours)
   - Make code changes
   - Add comments explaining the behavior
   - Test locally
4. **Validate** (30 min)
   - Run full E2E test suite
   - Check that skip count decreases
   - Verify tests pass
5. **Document** (15 min)
   - Update triage plan with resolution
   - Add decision record
   - Update any affected documentation
---
## Risk Assessment
### Low Risk Fix (Recommended)
- Modify emergency reset to keep Cerberus enabled
- Only affects test environment behavior
- No production impact
- Easy to roll back
### Rollback Plan
```bash
git checkout HEAD^ -- backend/internal/api/handlers/emergency_handler.go
git checkout HEAD^ -- tests/global-setup.ts
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e
```
---
## Questions for Investigation
1. **Why does emergency reset disable Cerberus?** ✅ ANSWERED
   - **CONFIRMED BUG:** This is incorrect behavior
   - **Design Intent:** Cerberus should stay enabled for break glass testing
   - **Fix Required:** Remove line that disables `feature.cerberus.enabled`
2. **What should the test environment look like?** ✅ ANSWERED
   - **Cerberus Framework:** ENABLED (`feature.cerberus.enabled` = true)
   - **Security Modules:** DISABLED (clean slate for testing)
   - **Test Order:** All Cerberus tests → Break glass test (LAST)
3. **Are there other tests affected?**
   - Run full suite after fix
   - Check for cascading test failures
   - Validate assumptions
---
## Resources
- **Full Triage Plan:** [docs/plans/e2e-test-triage-plan.md](../plans/e2e-test-triage-plan.md)
- **Diagnostic Script:** [scripts/diagnose-test-env.sh](../../scripts/diagnose-test-env.sh)
- **Global Setup:** [tests/global-setup.ts](../../tests/global-setup.ts)
- **Emergency Handler:** [backend/internal/api/handlers/emergency_handler.go](../../backend/internal/api/handlers/emergency_handler.go)
- **Testing Instructions:** [.github/instructions/testing.instructions.md](../../.github/instructions/testing.instructions.md)
---
## Contact
For questions or clarification, see:
- Triage Plan: Full analysis and categorization
- Testing protocols: E2E test execution guidelines
- Architecture docs: Cerberus security framework
**Status:** Ready for implementation - Root cause identified