# E2E Test Triage Plan
## Cross-Browser Playwright Test Suite Analysis
**Generated:** February 3, 2026
**Test Run Context:** Post-Docker service updates
**Environment:** E2E container rebuilt with latest code
**Browsers:** Chromium, Firefox, WebKit
---
## Executive Summary
This document provides a comprehensive triage plan for failing and skipped Playwright E2E tests that are NOT explicitly marked for skipping. The test suite contains **2,737 total tests** with mixed results requiring systematic investigation.
**CRITICAL FINDINGS (2026-02-03):**
- **Root Cause Identified:** Emergency reset (`/emergency/security-reset`) disables Cerberus framework
- **Design Intent:** Cerberus ON for testing, modules OFF to avoid ACL blocking
- **Current Bug:** Emergency reset disables `feature.cerberus.enabled` instead of just modules
- **Impact:** Toggle buttons become disabled, 13 tests skip conditionally
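- **Solution:** Modify emergency reset to disable MODULES but keep `feature.cerberus.enabled = true` (initial proposal; Category 1 below documents the admin whitelist bypass that was implemented instead)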
- **Files to Modify:** `backend/internal/api/handlers/emergency_handler.go`
- **Test Order:** Global setup → Cerberus tests → Break glass test (LAST)
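The reset behaviour proposed above (modules off, framework flag kept) can be sketched as a pure state transition. This is a minimal TypeScript illustration only: the real handler lives in `backend/internal/api/handlers/emergency_handler.go`, and the `SecuritySettings` shape and `applyEmergencyReset` name are assumptions made for this sketch.

```typescript
// Hypothetical settings shape; only the Cerberus flag name comes from this plan.
interface SecuritySettings {
  cerberusEnabled: boolean; // stands in for feature.cerberus.enabled
  modules: Record<string, boolean>; // e.g. acl, waf, rateLimit, crowdsec
}

// Proposed reset: switch every module off and leave the framework flag untouched.
function applyEmergencyReset(settings: SecuritySettings): SecuritySettings {
  const modules: Record<string, boolean> = {};
  for (const name of Object.keys(settings.modules)) {
    modules[name] = false;
  }
  // Return a new object so callers can compare before/after state.
  return { cerberusEnabled: settings.cerberusEnabled, modules };
}
```

With this shape, toggle buttons would stay usable after a reset because the framework flag is never cleared.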
**Key Findings:**
- Multiple conditionally-skipped tests (runtime decisions based on feature state)
- Explicitly skipped tests (marked with `test.skip()` or `test.describe.skip()`) that should NOT be triaged
- Tests dependent on Cerberus security being enabled
- Tests dependent on CrowdSec running and configured
**Testing Infrastructure:**
- ✓ E2E container running and healthy
- ✓ Emergency server responding (port 2020)
- ✓ Application server responding (port 8080)
- ✗ CrowdSec NOT running (expected - integration tests only)
- ⚠️ Cerberus state unknown (emergency server has no settings endpoint)
---
## Triage Categories
### Category 1: Conditional Skips (Runtime Environment Dependent)
**Priority:** HIGH
**Root Cause:** Emergency reset disables Cerberus framework, not just modules
**Impact:** Toggle buttons become disabled, tests skip at runtime
**Status:** ✅ SOLVED with Universal Admin Whitelist Bypass
**Design Intent (Confirmed):**
- Cerberus should be ENABLED to test break glass feature
- Security modules should be ENABLED for realistic testing
- Tests should bypass security using admin whitelist (0.0.0.0/0)
- Break glass test runs, then recovery test restores with bypass
**Solution Implemented:**
1. **Break Glass Test** (`emergency-reset.spec.ts`) - Tests emergency reset, disables Cerberus
2. **Break Glass Recovery** (`zzzz-break-glass-recovery.spec.ts`) - NEW TEST that:
   - Sets `admin_whitelist = "0.0.0.0/0"` (universal bypass for ANY IP)
   - Re-enables `feature.cerberus.enabled = true`
   - Enables ALL security modules (ACL, WAF, Rate Limit, CrowdSec)
   - Verifies full security stack is ON but bypassed
3. **Security Teardown** (`security-teardown.setup.ts`) - Verifies state (no longer modifies)
4. **Browser Tests** - Run with full security enabled, bypassed via whitelist
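A verification-only teardown (step 3) might reduce to a check that returns warnings and changes nothing. The sketch below assumes a hypothetical `SecurityState` snapshot; the real response shape of the security config API is not documented in this plan.

```typescript
// Hypothetical snapshot of the security state after the recovery test has run.
interface SecurityState {
  cerberusEnabled: boolean;
  modules: Record<string, boolean>;
  adminWhitelist: string;
}

// Collects warnings instead of modifying configuration (verification only).
function verifySecurityState(state: SecurityState): string[] {
  const warnings: string[] = [];
  if (!state.cerberusEnabled) {
    warnings.push("Cerberus framework is disabled");
  }
  for (const name of Object.keys(state.modules)) {
    if (!state.modules[name]) {
      warnings.push(`Security module "${name}" is disabled`);
    }
  }
  if (state.adminWhitelist !== "0.0.0.0/0") {
    warnings.push(`Admin whitelist is "${state.adminWhitelist}", expected "0.0.0.0/0"`);
  }
  return warnings;
}
```

An empty result means the browser projects can start with the full stack on and bypassed; any warning points at a recovery step that did not take effect.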
**Why 0.0.0.0/0 works well here:**
- ✅ Bypasses security for ANY IP address (CI-friendly, environment-agnostic)
- ✅ Tests the admin whitelist bypass feature itself
- ✅ More realistic testing (full security stack actually enabled)
- ✅ Simpler state management than selective module disabling
- ✅ Works in Docker, localhost, CI, anywhere
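To see why a `/0` prefix matches every address, consider a plain IPv4 CIDR check. This is an illustrative sketch, not Charon's whitelist implementation; `ipv4ToInt` and `cidrContains` are names invented here, and IPv6 is out of scope.

```typescript
// Convert dotted-quad IPv4 text to an unsigned 32-bit integer.
function ipv4ToInt(ip: string): number {
  const parts = ip.split(".").map(Number);
  if (parts.length !== 4 || parts.some((p) => !Number.isInteger(p) || p < 0 || p > 255)) {
    throw new Error(`invalid IPv4 address: ${ip}`);
  }
  return ((parts[0] << 24) | (parts[1] << 16) | (parts[2] << 8) | parts[3]) >>> 0;
}

// True when `ip` falls inside the IPv4 range described by `cidr`.
function cidrContains(cidr: string, ip: string): boolean {
  const [base, prefixText] = cidr.split("/");
  const prefix = Number(prefixText);
  if (!Number.isInteger(prefix) || prefix < 0 || prefix > 32) {
    throw new Error(`invalid prefix length in: ${cidr}`);
  }
  // A /0 prefix means an all-zero mask. JavaScript shifts wrap at 32 bits,
  // so the zero-prefix case is handled explicitly.
  const mask = prefix === 0 ? 0 : (0xffffffff << (32 - prefix)) >>> 0;
  return ((ipv4ToInt(base) & mask) >>> 0) === ((ipv4ToInt(ip) & mask) >>> 0);
}
```

With an all-zero mask both sides of the comparison collapse to 0, so `cidrContains("0.0.0.0/0", anyIpv4)` holds regardless of where the test runner connects from.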
**Files Modified:**
- `tests/security-enforcement/zzzz-break-glass-recovery.spec.ts` (NEW - recovery test)
- `tests/security-teardown.setup.ts` (MODIFIED - now verification only)
#### Tests Affected:
- **Security Dashboard - Module Toggle Actions** (Tests 77-80, 214)
  - ACL toggle (Test 77)
  - WAF toggle (Test 78)
  - Rate Limiting toggle (Test 79)
  - Persist state after reload (Test 80/214)
- **Security Dashboard - Navigation** (Tests 81, 83-84)
  - Navigate to CrowdSec config (Test 81/250)
  - Navigate to WAF config (Test 83/309)
  - Navigate to Rate Limiting config (Test 84/335)
- **Rate Limiting Configuration** (Test 57/70)
  - Toggle rate limiting on/off
#### Investigation Steps:
1. **Verify Test Environment Configuration**
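   ```bash
   # Check whether Cerberus is enabled in the test environment.
   # The emergency server (port 2020) has no settings endpoint, so query the main application API
   # and look for the Cerberus flag in the output.
   curl -s http://localhost:8080/api/v1/security/config | jq '.'
   ```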
2. **Review Emergency Server Reset Logic**
   - File: `tests/global-setup.ts`
   - Check whether the security reset disables Cerberus completely
   - Current behavior: disables all security modules and, per the confirmed finding under Investigation Priorities, also disables the Cerberus framework itself
3. **Determine Expected Behavior** ✅ CONFIRMED
   - ✅ Cerberus SHOULD be enabled during E2E tests (to test break glass)
   - ✅ Security modules SHOULD be enabled for realistic testing
   - ✅ Tests toggle modules on/off as needed (interactive testing)
   - ✅ Universal admin whitelist (0.0.0.0/0) bypasses security for all IPs
4. **Solution Implemented:** ✅ COMPLETE
   - **Created Break Glass Recovery Test** (`tests/security-enforcement/zzzz-break-glass-recovery.spec.ts`)
     - Step 1: Set `admin_whitelist = "0.0.0.0/0"` (universal bypass)
     - Step 2: Re-enable `feature.cerberus.enabled = true`
     - Step 3: Enable ALL security modules (ACL, WAF, Rate Limit, CrowdSec)
     - Step 4: Verify full security stack enabled with universal bypass
   - **Modified Security Teardown** (`tests/security-teardown.setup.ts`)
     - Now verification-only (no longer modifies configuration)
     - Checks Cerberus ON, modules ON, whitelist = 0.0.0.0/0
     - Logs warnings if state is incorrect
5. **Execution Order:**
   ```
   1. Global setup → auth.setup.ts
   2. Security-tests project (sequential, workers: 1):
      - All enforcement tests (ACL, WAF, Rate Limit, etc.)
      - emergency-reset.spec.ts (break glass test)
      - zzz-admin-whitelist-blocking.spec.ts (tests blocking)
      - zzzz-break-glass-recovery.spec.ts (NEW - restores with bypass)
   3. Security-teardown → verify state
   4. Browser tests (chromium/firefox/webkit) → Run with full security bypassed
   ```
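The ordering above is the kind of chain that project dependencies express. The sketch below uses a local stand-in type rather than the real Playwright config, and the project names (`setup`, `security-tests`, `security-teardown`) are assumptions based on the labels in this plan; it only shows how a dependency chain yields the listed order.

```typescript
// Local stand-in for a project entry: a name plus the projects it must wait for.
interface ProjectSketch {
  name: string;
  dependencies: string[];
}

// Assumed project names; the actual playwright.config.ts may differ.
const projects: ProjectSketch[] = [
  { name: "setup", dependencies: [] },
  { name: "security-tests", dependencies: ["setup"] },
  { name: "security-teardown", dependencies: ["security-tests"] },
  { name: "chromium", dependencies: ["security-teardown"] },
  { name: "firefox", dependencies: ["security-teardown"] },
  { name: "webkit", dependencies: ["security-teardown"] },
];

// Depth-first resolution: a project is emitted only after all of its dependencies.
function executionOrder(all: ProjectSketch[]): string[] {
  const byName = new Map<string, ProjectSketch>();
  for (const p of all) byName.set(p.name, p);
  const done = new Set<string>();
  const visiting = new Set<string>();
  const order: string[] = [];

  function visit(name: string): void {
    if (done.has(name)) return;
    if (visiting.has(name)) throw new Error(`dependency cycle at ${name}`);
    const project = byName.get(name);
    if (!project) throw new Error(`unknown project: ${name}`);
    visiting.add(name);
    for (const dep of project.dependencies) visit(dep);
    visiting.delete(name);
    done.add(name);
    order.push(name);
  }

  for (const p of all) visit(p.name);
  return order;
}
```

Inside the security-tests project the file order matters as well, which is presumably why the recovery spec carries the `zzzz-` prefix: it has to sort after the break glass test.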
---
### Category 2: CrowdSec Dependency Tests
**Priority:** MEDIUM
**Root Cause:** Tests require CrowdSec to be fully running and configured
**Status:** Explicitly skipped with `test.describe.skip()`
#### Tests Affected (Tests 42-53):
- **Banned IPs Data Operations** (Tests 42-43)
  - Show active decisions
  - Display decision columns (IP, type, duration, reason)
- **Add Decision (Ban IP)** (Tests 44-46)
  - Add ban button
  - Open ban modal
  - Validate IP address format
- **Remove Decision (Unban)** (Tests 47-48)
  - Show unban action
  - Confirm before unbanning
- **Filtering and Search** (Tests 49-50)
  - Search/filter input
  - Filter decisions by type
- **Refresh and Sync** (Test 51)
  - Refresh button functionality
- **Navigation** (Test 52)
  - Navigate back to CrowdSec config
- **Accessibility** (Test 53)
  - Keyboard navigation
#### Investigation Steps:
1. **Determine CrowdSec Test Strategy**
   - These tests are marked `test.describe.skip()` with comment "Requires CrowdSec Running"
   - Is CrowdSec intended to run in E2E environment?
   - Should these be integration tests instead?
2. **Review CrowdSec Architecture**
   - File: `backend/internal/security/crowdsec/`
   - Check if CrowdSec can be mocked for E2E tests
   - Review CrowdSec initialization in Docker container
3. **Fix Options:**
   - **Option A:** Keep skipped - move to integration tests
   - **Option B:** Enable CrowdSec in E2E environment with test data
   - **Option C:** Mock CrowdSec API responses for UI testing only
4. **Decision Criteria:**
   - **Keep Skipped If:** CrowdSec requires external dependencies, takes long to start, or is resource-intensive
   - **Enable If:** CrowdSec can run in lightweight mode for E2E testing
   - **Mock If:** Only testing UI interactions, not actual CrowdSec functionality
5. **Files to Review:**
   ```
   tests/security/crowdsec-decisions.spec.ts    # Skipped tests
   .docker/docker-entrypoint.sh                 # CrowdSec startup
   backend/internal/security/crowdsec/          # Implementation
   docs/implementation/CROWDSEC_*.md            # Architecture docs
   ```
---
### Category 3: Explicitly Skipped Tests (NO TRIAGE NEEDED)
**Priority:** N/A (Intentionally Skipped)
**Action:** Document skip reason, track in backlog
#### Tests in This Category:
- **Caddy Import - Session Restoration** (Tests in `caddy-import-gaps.spec.ts`)
  - Test 4.1: Show pending session banner
  - Test 4.2: Restore review table with previous content
  - **Reason:** Known functionality gaps pending implementation
- **Emergency Server Tests** (Tests in `emergency-server.spec.ts`)
  - Test 3: Emergency server bypasses main app security
  - Test 4: Emergency server security reset works
  - **Reason:** May be redundant with other emergency server tests
#### Recommendation:
- Create GitHub issues for each explicitly skipped test
- Link issues to implementation plans
- Schedule for future sprint/milestone
- No immediate triage needed
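One way to keep explicit skips tied to issues is a small source scan that flags any skip call without an issue reference. The sketch below is an assumption about how such a check could work, not an existing script in this repository; it treats a `#<number>` on the skip line or the line directly above it as the tracking reference.

```typescript
// Returns 1-based line numbers of skip calls that carry no "#<number>" reference
// on the same line or on the line immediately above.
function findUntrackedSkips(source: string): number[] {
  const lines = source.split("\n");
  const skipCall = /\btest(\.describe)?\.skip\s*\(/;
  const issueRef = /#\d+/;
  const untracked: number[] = [];

  lines.forEach((line, index) => {
    if (!skipCall.test(line)) return;
    const previous = index > 0 ? lines[index - 1] : "";
    if (!issueRef.test(line) && !issueRef.test(previous)) {
      untracked.push(index + 1);
    }
  });
  return untracked;
}
```

Run over each spec file, a non-empty result lists the skips that still need an issue before they can be considered documented.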
---
## Triage Workflow
### Phase 1: Data Collection (COMPLETE)
- [x] Run complete cross-browser test suite
- [x] Identify all failing and skipped tests
- [x] Categorize skips (explicit vs conditional)
- [x] Document test patterns and dependencies
### Phase 2: Environment Analysis (NEXT STEPS)
**Timeline:** 1-2 hours
1. **Analyze Emergency Server Reset**
   ```bash
   # Check current emergency reset behavior
   npm run test:e2e:setup -- --grep "emergency reset"
   # Find the emergency reset log statement in global setup
   grep -n "Emergency reset" tests/global-setup.ts
   ```
2. **Check Cerberus Configuration**
   ```bash
   # Inspect test environment settings
   docker exec charon-e2e cat /config/settings.json | jq '.feature.cerberus'
   # Query security state via the main application API
   # (the emergency server has no /emergency/settings endpoint)
   curl -s http://localhost:8080/api/v1/security/config | jq '.'
   ```
3. **Document Current State**
   - What is enabled/disabled in test environment?
   - What SHOULD be enabled/disabled?
   - What are the gaps between current and desired state?
### Phase 3: Fix Planning (AFTER ANALYSIS)
**Timeline:** 2-4 hours
For each category, create detailed fix plan with:
- Root cause
- Proposed solution
- Implementation estimate
- Testing approach
- Rollback plan
### Phase 4: Implementation (PER FIX)
**Timeline:** Varies by fix
1. **Implement fixes in priority order:**
   - HIGH priority first (Category 1 - Conditional Skips)
   - MEDIUM priority second (Category 2 - CrowdSec)
   - Document skip reasons (Category 3 - Explicit Skips)
2. **Validation approach:**
   ```bash
   # Test specific category
   npm run test:e2e -- tests/security/security-dashboard.spec.ts --project=chromium
   # Verify fix across all browsers
   npm run test:e2e:all -- tests/security/security-dashboard.spec.ts
   # Full regression test
   npm run test:e2e:all
   ```
### Phase 5: Documentation (CONTINUOUS)
**Timeline:** Ongoing
- [ ] Update test documentation with skip reasons
- [ ] Add comments to conditionally-skipped tests explaining when they should run
- [ ] Create decision log for each triage decision
- [ ] Update CI/CD pipeline configuration if needed
---
## Investigation Priorities
### Immediate Actions (Hour 1)
1. **COMPLETED:** Created diagnostic script at `scripts/diagnose-test-env.sh`
   ```bash
   ./scripts/diagnose-test-env.sh
   ```
2. **COMPLETED:** Identified Root Cause
   - Emergency server API has LIMITED endpoints:
     - `GET /health` (no auth)
     - `POST /emergency/security-reset` (with auth + token)
   - NO `/emergency/settings` endpoint exists
   - Cannot query Cerberus state via emergency server
   - Must use main application API (`http://localhost:8080/api/v1/security/config`)
3. **KEY FINDING:** Emergency Reset Disables Cerberus ✅ CONFIRMED
   - The `/emergency/security-reset` endpoint disables **Cerberus framework itself**
   - This causes toggle buttons/configure buttons to become disabled
   - Tests skip when `toggle.isDisabled()` returns true
   - **Design Intent:** Cerberus ON + Modules OFF (safe testing, toggles work)
   - **Current Bug:** Emergency reset disables Cerberus framework too
   - **Test Flow:** Global setup → All Cerberus tests → Break glass test (LAST)
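Since the emergency server cannot report Cerberus state, a helper that reads it from the main application API could look roughly like this. The sketch takes the fetch function as a parameter so it carries no runtime dependencies; the response body is left as `unknown` because its shape is not documented here, and any auth the endpoint may require is not shown.

```typescript
// Minimal fetch-compatible signature, so the sketch needs no DOM or Node typings.
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}>;

// Reads the security config from the main application API (not the emergency server).
async function readSecurityConfig(fetchImpl: FetchLike, baseUrl: string): Promise<unknown> {
  const response = await fetchImpl(`${baseUrl}/api/v1/security/config`);
  if (!response.ok) {
    throw new Error(`security config request failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

On Node 18 or later the global `fetch` can be passed directly, for example `readSecurityConfig(fetch, "http://localhost:8080")`.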
### Short-Term Actions (Hours 2-4)
1. Decide on Cerberus enablement strategy for tests
2. Implement fix for Category 1 (Conditional Skips)
3. Run targeted test validation
### Medium-Term Actions (This Week)
1. Evaluate CrowdSec testing strategy (Category 2)
2. Create GitHub issues for explicitly skipped tests (Category 3)
3. Update test documentation
4. Add CI/CD checks for skip patterns
---
## Success Criteria
### Definition of Done for Triage:
- [ ] All conditionally-skipped tests have clear run conditions documented
- [ ] Tests run successfully when conditions are met
- [ ] Tests fail gracefully with clear skip messages when conditions not met
- [ ] Decision documented for each explicitly-skipped test category
- [ ] CI/CD pipeline updated to handle skip scenarios
- [ ] Test coverage maintained or improved
### Metrics to Track:
- **Before Triage:** X tests skipped (conditional + explicit)
- **After Triage:** Y tests skipped (explicit only) + Z tests passing
- **Target:** Minimize conditional skips, maintain explicit skips with issues
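The before/after numbers can be derived from a list of per-test outcomes. The outcome labels below are hypothetical: the split between conditional and explicit skips is assumed to come from annotations or naming conventions, since a plain skipped status does not carry that distinction.

```typescript
// Hypothetical outcome labels for one test run.
type Outcome = "passed" | "failed" | "skipped-conditional" | "skipped-explicit";

interface SkipMetrics {
  passing: number;
  failing: number;
  conditionalSkips: number;
  explicitSkips: number;
}

// Tally outcomes into the counters used by the before/after comparison.
function summarize(outcomes: Outcome[]): SkipMetrics {
  const metrics: SkipMetrics = { passing: 0, failing: 0, conditionalSkips: 0, explicitSkips: 0 };
  for (const outcome of outcomes) {
    switch (outcome) {
      case "passed":
        metrics.passing += 1;
        break;
      case "failed":
        metrics.failing += 1;
        break;
      case "skipped-conditional":
        metrics.conditionalSkips += 1;
        break;
      case "skipped-explicit":
        metrics.explicitSkips += 1;
        break;
    }
  }
  return metrics;
}

// Triage progress: fewer conditional skips than before, explicit skips unchanged or fewer.
function triageImproved(before: SkipMetrics, after: SkipMetrics): boolean {
  return after.conditionalSkips < before.conditionalSkips && after.explicitSkips <= before.explicitSkips;
}
```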
---
## Risk Assessment
### High Risk:
- **Enabling Cerberus in tests** - May cause cascade of failures if not properly configured
- **Modifying emergency reset logic** - Could break other tests or test isolation
### Medium Risk:
- **Changing test environment variables** - May affect multiple test suites
- **Enabling CrowdSec** - Resource intensive, may slow test execution
### Low Risk:
- **Adding explicit skip annotations** - No functional impact
- **Creating GitHub issues** - Tracking only
---
## Rollback Plan
If implementation causes regression:
1. **Immediate Rollback:**
   ```bash
   git checkout HEAD^ -- tests/global-setup.ts
   npm run test:e2e:all -- --project=chromium
   ```
2. **Emergency Reset to Known Good State:**
   ```bash
   # Stash local changes first so the rebuild uses committed code
   git stash
   .github/skills/scripts/skill-runner.sh docker-rebuild-e2e
   npm run test:e2e:all
   ```
3. **Document Failure:**
   - Capture test output
   - Document what went wrong
   - Update triage plan with lessons learned
---
## Next Steps
1. **Run Diagnostic Script** (created above)
2. **Analyze Results** - Fill in data collection gaps
3. **Make Decision** - Cerberus enablement strategy
4. **Implement Fix** - Start with Category 1
5. **Validate** - Run targeted tests
6. **Iterate** - Move to next category
---
## Appendix A: Test Output Patterns
### Pattern 1: Conditional Skip with Cerberus Check
```typescript
const isDisabled = await toggle.isDisabled();
if (isDisabled) {
  test.info().annotations.push({
    type: 'skip-reason',
    description: 'Toggle is disabled because Cerberus security is not enabled',
  });
  test.skip();
  return;
}
```
**Recommendation:** Add feature flag check before test execution instead of during test.
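A pre-check along these lines would move the decision out of the test body. The helper below is a sketch: `cerberusSkipReason` and the `cerberusEnabled` field are hypothetical names, and how the config is fetched is left open.

```typescript
// Returns a skip reason, or null when the test should run.
// `config` is null when the security config could not be read at all.
function cerberusSkipReason(config: { cerberusEnabled: boolean } | null): string | null {
  if (config === null) {
    return "Security config could not be read";
  }
  if (!config.cerberusEnabled) {
    return "Cerberus security is not enabled";
  }
  return null;
}

// In a spec file the result could feed Playwright's conditional skip, e.g.:
//   const reason = cerberusSkipReason(config);
//   test.skip(reason !== null, reason ?? "");
```

Checking the flag before the page is exercised keeps the skip reason in one place instead of repeating the `isDisabled()` branch in every toggle test.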
### Pattern 2: Explicit Skip with Description
```typescript
test.describe.skip('Banned IPs Data Operations (Requires CrowdSec Running)', () => {
  // Tests here
});
```
**Recommendation:** Keep as-is, create tracking issue.
---
## Appendix B: Useful Commands
### Test Execution
```bash
# Run specific test file
npm run test:e2e -- tests/security/security-dashboard.spec.ts
# Run with debug output
DEBUG=pw:api npm run test:e2e -- tests/security/security-dashboard.spec.ts
# Run in headed mode
npm run test:e2e:headed -- tests/security/security-dashboard.spec.ts
# Run specific test by name
npm run test:e2e -- -g "should toggle ACL"
```
### Environment Inspection
```bash
# Check container logs
docker logs charon-e2e --tail 100
# Check settings
docker exec charon-e2e cat /config/settings.json | jq '.'
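# Check security state via the main application API
# (the emergency server exposes no /emergency/settings endpoint)
curl -s http://localhost:8080/api/v1/security/config | jq '.'
# Force security reset (the endpoint also expects auth; add credentials as your setup requires)
curl -X POST http://localhost:2020/emergency/security-reset \
  -H "X-Emergency-Token: $(grep '^EMERGENCY_TOKEN=' .env | cut -d= -f2-)"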
```
### Test Reporting
```bash
# View HTML report
npx playwright show-report
# Generate custom report
npx playwright test --reporter=html,json
```
---
## Change Log
| Date | Author | Changes |
|------|--------|---------|
| 2026-02-03 | GitHub Copilot | Initial triage plan created |
---
## References
- [Playwright Testing Instructions](../../.github/instructions/playwright-typescript.instructions.md)
- [Testing Protocols](../../.github/instructions/testing.instructions.md)
- [Security Dashboard Implementation](../implementation/CERBERUS_SECURITY_DASHBOARD_COMPLETE.md)
- [CrowdSec Implementation](../implementation/CROWDSEC_*.md)
- [Global Setup File](../../tests/global-setup.ts)
- [Emergency Server Spec](../../tests/emergency-server/)