chore: Enhance documentation for E2E testing:

- Added clarity and structure to README files, including recent updates and getting started sections.
- Improved manual verification documentation for CrowdSec authentication, emphasizing expected outputs and success criteria.
- Updated debugging guide with detailed output examples and automatic trace capture information.
- Refined best practices for E2E tests, focusing on efficient polling, locator strategies, and state management.
- Documented triage report for DNS Provider feature tests, highlighting issues fixed and test results before and after improvements.
- Revised E2E test writing guide to include when to use specific helper functions and patterns for better test reliability.
- Enhanced troubleshooting documentation with clear resolutions for common issues, including timeout and token configuration problems.
- Updated tests README to provide quick links and best practices for writing robust tests.
This commit is contained in:
GitHub Actions
2026-03-24 01:47:22 +00:00
parent 7d986f2821
commit ca477c48d4
52 changed files with 983 additions and 198 deletions

View File

@@ -53,6 +53,7 @@ logger.Infof("API Key: %s", apiKey)
```
Charon's masking rules:
- Empty: `[empty]`
- Short (< 16 chars): `[REDACTED]`
- Normal (≥ 16 chars): `abcd...xyz9` (first 4 + last 4)
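The three rules above can be sketched as a small helper. This is illustrative only — the function name and structure are assumptions, not Charon's actual implementation:

```go
package main

import "fmt"

// maskSecret applies the masking rules described above:
// empty values become "[empty]", short values (< 16 chars) are
// fully redacted, and longer values keep only the first 4 and
// last 4 characters.
func maskSecret(s string) string {
	switch {
	case s == "":
		return "[empty]"
	case len(s) < 16:
		return "[REDACTED]"
	default:
		return s[:4] + "..." + s[len(s)-4:]
	}
}

func main() {
	fmt.Println(maskSecret(""))                     // [empty]
	fmt.Println(maskSecret("shortkey"))             // [REDACTED]
	fmt.Println(maskSecret("abcd1234efgh5678xyz9")) // abcd...xyz9
}
```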
@@ -68,6 +69,7 @@ if !validateAPIKeyFormat(apiKey) {
```
Requirements:
- Length: 16-128 characters
- Charset: Alphanumeric + underscore + hyphen
- No spaces or special characters
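A minimal validator matching these requirements might look like the following sketch (the regex and structure are assumptions; only the `validateAPIKeyFormat` name appears in the snippet above):

```go
package main

import (
	"fmt"
	"regexp"
)

// apiKeyPattern encodes the requirements above: 16-128 characters,
// alphanumeric plus underscore and hyphen, nothing else.
var apiKeyPattern = regexp.MustCompile(`^[A-Za-z0-9_-]{16,128}$`)

func validateAPIKeyFormat(key string) bool {
	return apiKeyPattern.MatchString(key)
}

func main() {
	fmt.Println(validateAPIKeyFormat("abcd1234efgh5678")) // true: 16 valid chars
	fmt.Println(validateAPIKeyFormat("has a space aaaa")) // false: spaces rejected
}
```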
@@ -99,6 +101,7 @@ Rotate secrets regularly:
### What to Log
**Safe to log**:
- Timestamps
- User IDs (but avoid usernames, which may be PII)
- IP addresses (consider GDPR implications)
@@ -108,6 +111,7 @@ Rotate secrets regularly:
- Performance metrics
**Never log**:
- Passwords or password hashes
- API keys or tokens (use masking)
- Session IDs (full values)
@@ -139,6 +143,7 @@ logger.Infof("Login attempt: username=%s password=%s", username, password)
### Log Aggregation
If using external log services (CloudWatch, Splunk, Datadog):
- Ensure logs are encrypted in transit (TLS)
- Ensure logs are encrypted at rest
- Redact sensitive data before shipping
@@ -333,6 +338,7 @@ limiter := rate.NewLimiter(rate.Every(36*time.Second), 100)
```
**Critical endpoints** (require stricter limits):
- Login: 5 attempts per 15 minutes
- Password reset: 3 attempts per hour
- API key generation: 5 per day
@@ -369,6 +375,7 @@ return c.JSON(401, gin.H{"error": "invalid API key: abc123"})
**Applicable if**: Processing data of EU residents
**Requirements**:
1. **Data minimization**: Collect only necessary data
2. **Purpose limitation**: Use data only for stated purposes
3. **Storage limitation**: Delete data when no longer needed
@@ -376,6 +383,7 @@ return c.JSON(401, gin.H{"error": "invalid API key: abc123"})
5. **Breach notification**: Report breaches within 72 hours
**Implementation**:
- ✅ Charon masks API keys in logs (prevents exposure of personal data)
- ✅ Secure file permissions (0600) protect sensitive data
- ✅ Log retention policies prevent indefinite storage
@@ -390,12 +398,14 @@ return c.JSON(401, gin.H{"error": "invalid API key: abc123"})
**Applicable if**: Processing, storing, or transmitting credit card data
**Requirements**:
1. **Requirement 3.4**: Render PAN unreadable (encryption, masking)
2. **Requirement 8.2**: Strong authentication
3. **Requirement 10.2**: Audit trails
4. **Requirement 10.7**: Retain audit logs for 1 year
**Implementation**:
- ✅ Charon uses masking for sensitive credentials (same principle for PAN)
- ✅ Secure file permissions align with access control requirements
- ⚠️ Charon doesn't handle payment cards directly (delegated to payment processors)
@@ -409,12 +419,14 @@ return c.JSON(401, gin.H{"error": "invalid API key: abc123"})
**Applicable if**: SaaS providers, cloud services
**Trust Service Criteria**:
1. **CC6.1**: Logical access controls (authentication, authorization)
2. **CC6.6**: Encryption of data in transit
3. **CC6.7**: Encryption of data at rest
4. **CC7.2**: Monitoring and detection (logging, alerting)
**Implementation**:
- ✅ API key validation ensures strong credentials (CC6.1)
- ✅ File permissions (0600) protect data at rest (CC6.7)
- ✅ Masked logging enables monitoring without exposing secrets (CC7.2)
@@ -429,12 +441,14 @@ return c.JSON(401, gin.H{"error": "invalid API key: abc123"})
**Applicable to**: Any organization implementing ISMS
**Key Controls**:
1. **A.9.4.3**: Password management systems
2. **A.10.1.1**: Cryptographic controls
3. **A.12.4.1**: Event logging
4. **A.18.1.5**: Protection of personal data
**Implementation**:
- ✅ API key format validation (minimum 16 chars, charset restrictions)
- ✅ Key rotation procedures documented
- ✅ Secure storage with file permissions (0600)
@@ -491,6 +505,7 @@ grep -i "api[_-]key\|token\|password" playwright-report/index.html
**Recommended schedule**: Annual or after major releases
**Focus areas**:
1. Authentication bypass
2. Authorization vulnerabilities
3. SQL injection

View File

@@ -1,6 +1,6 @@
**Status**: ✅ RESOLVED (January 30, 2026)
<https://github.com/Wikid82/Charon/actions/runs/21503634925/job/61955008214>
Failing step log:
`# Normalize image name for reference`
`🔍 Extracting binary from: ghcr.io/wikid82/charon:feature/beta-release`
@@ -27,6 +27,7 @@ Add a check to ensure steps.pr-info.outputs.pr_number is set before constructing
Suggested code improvement for the “Extract charon binary from container” step:
YAML
- name: Extract charon binary from container
if: steps.check-artifact.outputs.artifact_exists == 'true'
id: extract
@@ -44,6 +45,7 @@ YAML
echo "🔍 Extracting binary from: ${IMAGE_REF}"
...
This ensures the workflow does not attempt to use an invalid image tag when the PR number is missing. Adjust similar logic throughout the workflow to handle missing variables gracefully.
## Resolution
Fixed by adding proper validation for PR number before constructing Docker image reference, ensuring IMAGE_REF is never constructed with empty/missing variables. Branch name sanitization also implemented to handle slashes in feature branch names.

View File

@@ -2,7 +2,7 @@
**Date:** 2026-01-28
**PR:** #550 - Alpine to Debian Trixie Migration
**CI Run:** <https://github.com/Wikid82/Charon/actions/runs/21456678628/job/61799104804>
**Branch:** feature/beta-release
---
@@ -18,16 +18,19 @@ The CrowdSec integration tests are failing after migrating the Dockerfile from A
### 1. **CrowdSec Builder Stage Compatibility**
**Alpine vs Debian Differences:**
- **Alpine** uses `musl libc`, **Debian** uses `glibc`
- Different package managers: `apk` (Alpine) vs `apt` (Debian)
- Different package names and availability
**Current Dockerfile (lines 218-270):**
```dockerfile
FROM --platform=$BUILDPLATFORM golang:1.25.7-trixie AS crowdsec-builder
```
**Dependencies Installed:**
```dockerfile
RUN apt-get update && apt-get install -y --no-install-recommends \
git clang lld \
@@ -36,6 +39,7 @@ RUN xx-apt install -y gcc libc6-dev
```
**Possible Issues:**
- **Missing build dependencies**: CrowdSec might require additional packages on Debian that were implicitly available on Alpine
- **Git clone failures**: Network issues or GitHub rate limiting
- **Dependency resolution**: `go mod tidy` might behave differently
@@ -44,6 +48,7 @@ RUN xx-apt install -y gcc libc6-dev
### 2. **CrowdSec Binary Path Issues**
**Runtime Image (lines 359-365):**
```dockerfile
# Copy CrowdSec binaries from the crowdsec-builder stage (built with Go 1.25.5+)
COPY --from=crowdsec-builder /crowdsec-out/crowdsec /usr/local/bin/crowdsec
@@ -52,17 +57,20 @@ COPY --from=crowdsec-builder /crowdsec-out/config /etc/crowdsec.dist
```
**Possible Issues:**
- If the builder stage fails, these COPY commands will fail
- If fallback stage is used (for non-amd64), paths might be wrong
### 3. **CrowdSec Configuration Issues**
**Entrypoint Script CrowdSec Init (docker-entrypoint.sh):**
- Symlink creation from `/etc/crowdsec` to `/app/data/crowdsec/config`
- Configuration file generation and substitution
- Hub index updates
**Possible Issues:**
- Symlink already exists as directory instead of symlink
- Permission issues with non-root user
- Configuration templates missing or incompatible
@@ -70,12 +78,14 @@ COPY --from=crowdsec-builder /crowdsec-out/config /etc/crowdsec.dist
### 4. **Test Script Environment Issues**
**Integration Test (crowdsec_integration.sh):**
- Builds the image with `docker build -t charon:local .`
- Starts container and waits for API
- Tests CrowdSec Hub connectivity
- Tests preset pull/apply functionality
**Possible Issues:**
- Build step timing out or failing silently
- Container failing to start properly
- CrowdSec processes not starting
@@ -88,6 +98,7 @@ COPY --from=crowdsec-builder /crowdsec-out/config /etc/crowdsec.dist
### Step 1: Check Build Logs
Review the CI build logs for the CrowdSec builder stage:
- Look for `git clone` errors
- Check for `go get` or `go mod tidy` failures
- Verify `xx-go build` completes successfully
@@ -96,6 +107,7 @@ Review the CI build logs for the CrowdSec builder stage:
### Step 2: Verify CrowdSec Binaries
Check if CrowdSec binaries are actually present:
```bash
docker run --rm charon:local which crowdsec
docker run --rm charon:local which cscli
@@ -105,6 +117,7 @@ docker run --rm charon:local cscli version
### Step 3: Check CrowdSec Configuration
Verify configuration is properly initialized:
```bash
docker run --rm charon:local ls -la /etc/crowdsec
docker run --rm charon:local ls -la /app/data/crowdsec
@@ -114,6 +127,7 @@ docker run --rm charon:local cat /etc/crowdsec/config.yaml
### Step 4: Test CrowdSec Locally
Run the integration test locally:
```bash
# Build image
docker build --no-cache -t charon:local .
@@ -129,6 +143,7 @@ docker build --no-cache -t charon:local .
### Fix 1: Add Missing Build Dependencies
If the build is failing due to missing dependencies, add them to the CrowdSec builder:
```dockerfile
RUN apt-get update && apt-get install -y --no-install-recommends \
git clang lld \
@@ -139,6 +154,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
### Fix 2: Add Build Stage Debugging
Add debugging output to identify where the build fails:
```dockerfile
# After git clone
RUN echo "CrowdSec source cloned successfully" && ls -la
@@ -153,6 +169,7 @@ RUN echo "Build complete" && ls -la /crowdsec-out/
### Fix 3: Use CrowdSec Fallback
If the build continues to fail, ensure the fallback stage is working:
```dockerfile
# In final stage, use conditional COPY
COPY --from=crowdsec-fallback /crowdsec-out/bin/crowdsec /usr/local/bin/crowdsec || \
@@ -162,6 +179,7 @@ COPY --from=crowdsec-builder /crowdsec-out/crowdsec /usr/local/bin/crowdsec
### Fix 4: Verify cscli Before Test
Add a verification step in the entrypoint:
```bash
if ! command -v cscli >/dev/null; then
echo "ERROR: CrowdSec not installed properly"

View File

@@ -11,11 +11,13 @@
**File**: `tests/settings/system-settings.spec.ts`
**Changes Made**:
1. **Removed** `waitForFeatureFlagPropagation()` call from `beforeEach` hook (lines 35-46)
- This was causing 10s × 31 tests = 310s of polling overhead per shard
- Commented out with clear explanation linking to remediation plan
2. **Added** `test.afterEach()` hook with direct API state restoration:
```typescript
test.afterEach(async ({ page }) => {
await test.step('Restore default feature flag state', async () => {
@@ -34,12 +36,14 @@
```
**Rationale**:
- Tests already verify feature flag state individually after toggle actions
- Initial state verification in beforeEach was redundant
- Explicit cleanup in afterEach ensures test isolation without polling overhead
- Direct API mutation for state restoration is faster than polling
**Expected Impact**:
- 310s saved per shard (10s × 31 tests)
- Elimination of inter-test dependencies
- No state leakage between tests
@@ -51,12 +55,14 @@
**Changes Made**:
1. **Added module-level cache** for in-flight requests:
```typescript
// Cache for in-flight requests (per-worker isolation)
const inflightRequests = new Map<string, Promise<Record<string, boolean>>>();
```
2. **Implemented cache key generation** with sorted keys and worker isolation:
```typescript
function generateCacheKey(
expectedFlags: Record<string, boolean>,
@@ -81,6 +87,7 @@
- Removes promise from cache after completion (success or failure)
4. **Added cleanup function**:
```typescript
export function clearFeatureFlagCache(): void {
inflightRequests.clear();
@@ -89,16 +96,19 @@
```
**Why Sorted Keys?**
- `{a:true, b:false}` vs `{b:false, a:true}` are semantically identical
- Without sorting, they generate different cache keys → cache misses
- Sorting ensures consistent key regardless of property order
**Why Worker Isolation?**
- Playwright workers run in parallel across different browser contexts
- Each worker needs its own cache to avoid state conflicts
- Worker index provides unique namespace per parallel process
**Expected Impact**:
- 30-40% reduction in duplicate API calls (revised from original 70-80% estimate)
- Cache hit rate should be >30% based on similar flag state checks
- Reduced API server load during parallel test execution
@@ -108,21 +118,26 @@
**Status**: Partially Investigated
**Issue**:
- Test: `tests/dns-provider-types.spec.ts` (line 260)
- Symptom: Label locator `/script.*path/i` passes in Chromium, fails in Firefox/WebKit
- Test code:
```typescript
const scriptField = page.getByLabel(/script.*path/i);
await expect(scriptField).toBeVisible({ timeout: 10000 });
```
**Investigation Steps Completed**:
1. ✅ Confirmed E2E environment is running and healthy
2. ✅ Attempted to run DNS provider type tests in Chromium
3. ⏸️ Further investigation deferred due to test execution issues
**Investigation Steps Remaining** (per spec):
1. Run with Playwright Inspector to compare accessibility trees:
```bash
npx playwright test tests/dns-provider-types.spec.ts --project=chromium --headed --debug
npx playwright test tests/dns-provider-types.spec.ts --project=firefox --headed --debug
@@ -137,6 +152,7 @@
5. If not fixable: Use the helper function approach from Phase 2
**Recommendation**:
- Complete investigation in separate session with headed browser mode
- DO NOT add `.or()` chains unless investigation proves it's necessary
- Create formal Decision Record once root cause is identified
@@ -144,31 +160,37 @@
## Validation Checkpoints
### Checkpoint 1: Execution Time
**Status**: ⏸️ In Progress
**Target**: <15 minutes (900s) for full test suite
**Command**:
```bash
time npx playwright test tests/settings/system-settings.spec.ts --project=chromium
```
**Results**:
- Test execution interrupted during validation
- Observed: Tests were picking up multiple spec files from security/ folder
- Need to investigate test file patterns or run with more specific filtering
**Action Required**:
- Re-run with corrected test file path or filtering
- Ensure only system-settings tests are executed
- Measure execution time and compare to baseline
### Checkpoint 2: Test Isolation
**Status**: ⏳ Pending
**Target**: All tests pass with `--repeat-each=5 --workers=4`
**Command**:
```bash
npx playwright test tests/settings/system-settings.spec.ts --project=chromium --repeat-each=5 --workers=4
```
@@ -176,11 +198,13 @@ npx playwright test tests/settings/system-settings.spec.ts --project=chromium --
**Status**: Not executed yet
### Checkpoint 3: Cross-browser
**Status**: ⏳ Pending
**Target**: Firefox/WebKit pass rate >85%
**Command**:
```bash
npx playwright test tests/settings/system-settings.spec.ts --project=firefox --project=webkit
```
@@ -188,11 +212,13 @@ npx playwright test tests/settings/system-settings.spec.ts --project=firefox --p
**Status**: Not executed yet
### Checkpoint 4: DNS provider tests (secondary issue)
**Status**: ⏳ Pending
**Target**: Firefox tests pass or investigation complete
**Command**:
```bash
npx playwright test tests/dns-provider-types.spec.ts --project=firefox
```
@@ -204,11 +230,13 @@ npx playwright test tests/dns-provider-types.spec.ts --project=firefox
### Decision: Use Direct API Mutation for State Restoration
**Context**:
- Tests need to restore default feature flag state after modifications
- Original approach used polling-based verification in beforeEach
- Alternative approaches: polling in afterEach vs direct API mutation
**Options Evaluated**:
1. **Polling in afterEach** - Verify state propagated after mutation
- Pros: Confirms state is actually restored
- Cons: Adds 500ms-2s per test (polling overhead)
@@ -219,12 +247,14 @@ npx playwright test tests/dns-provider-types.spec.ts --project=firefox
- Why chosen: Feature flag updates are synchronous in backend
**Rationale**:
- Feature flag updates via PUT /api/v1/feature-flags are processed synchronously
- Database write is immediate (SQLite WAL mode)
- No async propagation delay in single-process test environment
- Subsequent tests will verify state on first read, catching any issues
**Impact**:
- Test runtime reduced by 15-60s per test file (31 tests × 500ms-2s polling)
- Risk: If state restoration fails, next test will fail loudly (detectable)
- Acceptable trade-off for 10-20% execution time improvement
@@ -234,15 +264,18 @@ npx playwright test tests/dns-provider-types.spec.ts --project=firefox
### Decision: Cache Key Sorting for Semantic Equality
**Context**:
- Multiple tests may check the same feature flag state but with different property order
- Without normalization, `{a:true, b:false}` and `{b:false, a:true}` generate different keys
**Rationale**:
- JavaScript objects have insertion order, but semantically these are identical states
- Sorting keys ensures cache hits for semantically identical flag states
- Minimal performance cost (~1ms for sorting 3-5 keys)
**Impact**:
- Estimated 10-15% cache hit rate improvement
- No downside - pure optimization

View File

@@ -78,6 +78,7 @@ git pull origin development
```
This script:
- Detects the required Go version from `go.work`
- Downloads it from golang.org
- Installs it to `~/sdk/go{version}/`
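The version-detection step can be sketched as follows — a hypothetical illustration of parsing the `go` directive out of `go.work`, not the actual script:

```go
package main

import (
	"fmt"
	"regexp"
)

// goDirective matches a line like "go 1.26.0" in a go.work file.
var goDirective = regexp.MustCompile(`(?m)^go (\d+\.\d+(?:\.\d+)?)$`)

// goWorkVersion extracts the required Go version from go.work content.
func goWorkVersion(content string) (string, bool) {
	m := goDirective.FindStringSubmatch(content)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	v, ok := goWorkVersion("go 1.26.0\n\nuse (\n\t./backend\n)\n")
	fmt.Println(v, ok) // 1.26.0 true
}
```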
@@ -103,6 +104,7 @@ Even if you used Option A (which rebuilds automatically), you can always manuall
```
This rebuilds:
- **golangci-lint** — Pre-commit linter (critical)
- **gopls** — IDE language server (critical)
- **govulncheck** — Security scanner
@@ -132,11 +134,13 @@ Current Go version: go version go1.26.0 linux/amd64
Your IDE caches the old Go language server (gopls). Reload to use the new one:
**VS Code:**
- Press `Cmd/Ctrl+Shift+P`
- Type "Developer: Reload Window"
- Press Enter
**GoLand or IntelliJ IDEA:**
- File → Invalidate Caches → Restart
- Wait for indexing to complete
@@ -243,6 +247,7 @@ go install golang.org/x/tools/gopls@latest
### How often do Go versions change?
Go releases **two major versions per year**:
- February (e.g., Go 1.26.0)
- August (e.g., Go 1.27.0)
@@ -255,6 +260,7 @@ Plus occasional patch releases (e.g., Go 1.26.1) for security fixes.
**Usually no**, but it doesn't hurt. Patch releases (like 1.26.0 → 1.26.1) rarely break tool compatibility.
**Rebuild if:**
- Pre-commit hooks start failing
- IDE shows unexpected errors
- Tools report version mismatches
@@ -262,6 +268,7 @@ Plus occasional patch releases (e.g., Go 1.26.1) for security fixes.
### Why don't CI builds have this problem?
CI environments are **ephemeral** (temporary). Every workflow run:
1. Starts with a fresh container
2. Installs Go from scratch
3. Installs tools from scratch
@@ -295,12 +302,14 @@ But for Charon development, you only need **one version** (whatever's in `go.wor
**Short answer:** Your local tools will be out of sync, but CI will still work.
**What breaks:**
- Pre-commit hooks fail (but will auto-rebuild)
- IDE shows phantom errors
- Manual `go test` might fail locally
- CI is unaffected (it always uses the correct version)
**When to catch up:**
- Before opening a PR (CI checks will fail if your code uses old Go features)
- When local development becomes annoying
@@ -326,6 +335,7 @@ But they only take ~400MB each, so cleanup is optional.
Renovate updates **Dockerfile** and **go.work**, but it can't update tools on *your* machine.
**Think of it like this:**
- Renovate: "Hey team, we're now using Go 1.26.0"
- Your machine: "Cool, but my tools are still Go 1.25.6. Let me rebuild them."
@@ -334,18 +344,22 @@ The rebuild script bridges that gap.
### What's the difference between `go.work`, `go.mod`, and my system Go?
**`go.work`** — Workspace file (multi-module projects like Charon)
- Specifies minimum Go version for the entire project
- Used by Renovate to track upgrades
**`go.mod`** — Module file (individual Go modules)
- Each module (backend, tools) has its own `go.mod`
- Inherits Go version from `go.work`
**System Go** (`go version`) — What's installed on your machine
- Must be >= the version in `go.work`
- Tools are compiled with whatever version this is
**Example:**
```
go.work says: "Use Go 1.26.0 or newer"
go.mod says: "I'm part of the workspace, use its Go version"
@@ -364,12 +378,14 @@ Charon's pre-commit hook automatically detects and fixes tool version mismatches
**How it works:**
1. **Check versions:**
```bash
golangci-lint version → "built with go1.25.6"
go version → "go version go1.26.0"
```
2. **Detect mismatch:**
```
⚠️ golangci-lint Go version mismatch:
golangci-lint: 1.25.6
@@ -377,6 +393,7 @@ Charon's pre-commit hook automatically detects and fixes tool version mismatches
```
3. **Auto-rebuild:**
```
🔧 Rebuilding golangci-lint with current Go version...
✅ golangci-lint rebuilt successfully
@@ -406,11 +423,13 @@ If you want manual control, edit `scripts/pre-commit-hooks/golangci-lint-fast.sh
## Need Help?
**Open a [Discussion](https://github.com/Wikid82/charon/discussions)** if:
- These instructions didn't work for you
- You're seeing errors not covered in troubleshooting
- You have suggestions for improving this guide
**Open an [Issue](https://github.com/Wikid82/charon/issues)** if:
- The rebuild script crashes
- Pre-commit auto-rebuild isn't working
- CI is failing for Go version reasons

View File

@@ -3,16 +3,20 @@
This document explains how to run Playwright tests using a real browser (headed) on Linux machines and in the project's Docker E2E environment.
## Key points
- Playwright's interactive Test UI (--ui) requires an X server (a display). On headless CI or servers, use Xvfb.
- Prefer the project's E2E Docker image for integration-like runs; use the local `--ui` flow for manual debugging.
## Quick commands (local Linux)
- Headless (recommended for CI / fast runs):
```bash
npm run e2e
```
- Headed UI on a headless machine (auto-starts Xvfb):
```bash
npm run e2e:ui:headless-server
# or, if you prefer manual control:
@@ -20,37 +24,46 @@ This document explains how to run Playwright tests using a real browser (headed)
```
- Headed UI on a workstation with an X server already running:
```bash
npx playwright test --ui
```
- Open the running Docker E2E app in your system browser (one-step via VS Code task):
- Run the VS Code task: **Open: App in System Browser (Docker E2E)**
- This will rebuild the E2E container (if needed), wait for <http://localhost:8080> to respond, and open your system browser automatically.
- Open the running Docker E2E app in VS Code Simple Browser:
- Run the VS Code task: **Open: App in Simple Browser (Docker E2E)**
- Then use the command palette: `Simple Browser: Open URL` → paste `http://localhost:8080`
## Using the project's E2E Docker image (recommended for parity with CI)
1. Rebuild/start the E2E container (this sets up the full test environment):
```bash
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e
```
If you need a clean rebuild after integration alignment changes:
```bash
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e --clean --no-cache
```
1. Run the UI against the container (you still need an X server on your host):
```bash
PLAYWRIGHT_BASE_URL=http://localhost:8080 npm run e2e:ui:headless-server
```
## CI guidance
- Do not run Playwright `--ui` in CI. Use headless runs or the E2E Docker image and collect traces/videos for failures.
- For coverage, use the provided skill: `.github/skills/scripts/skill-runner.sh test-e2e-playwright-coverage`
## Troubleshooting
- Playwright error: "Looks like you launched a headed browser without having a XServer running." → run `npm run e2e:ui:headless-server` or install Xvfb.
- If `npm run e2e:ui:headless-server` fails with an exit code like `148`:
- Inspect Xvfb logs: `tail -n 200 /tmp/xvfb.playwright.log`
@@ -59,11 +72,13 @@ This document explains how to run Playwright tests using a real browser (headed)
- If running inside Docker, prefer the skill-runner which provisions the required services; the UI still needs host X (or use VNC).
## Developer notes (what we changed)
- Added `scripts/run-e2e-ui.sh` — wrapper that auto-starts Xvfb when DISPLAY is unset.
- Added `npm run e2e:ui:headless-server` to run the Playwright UI on headless machines.
- Playwright config now auto-starts Xvfb when `--ui` is requested locally and prints an actionable error if Xvfb is not available.
## Security & hygiene
- Playwright auth artifacts are ignored by git (`playwright/.auth/`). Do not commit credentials.
---

View File

@@ -23,6 +23,7 @@ Authorization: Bearer your-api-token-here
```
Tokens support granular permissions:
- **Read-only**: View configurations without modification
- **Full access**: Complete CRUD operations
- **Scoped**: Limit to specific resource types

View File

@@ -52,6 +52,7 @@ Caddyfile import parses your existing Caddy configuration files and converts the
Choose one of three methods:
**Paste Content:**
```
example.com {
reverse_proxy localhost:3000
@@ -63,10 +64,12 @@ api.example.com {
```
**Upload File:**
- Click **Choose File**
- Select your Caddyfile
**Fetch from URL:**
- Enter URL to raw Caddyfile content
- Useful for version-controlled configurations

View File

@@ -447,6 +447,7 @@ Charon displays instructions to remove the TXT record after certificate issuance
**Symptom**: Certificate request stuck at "Waiting for Propagation" or validation fails.
**Causes**:
- DNS TTL is high (cached old records)
- DNS provider has slow propagation
- Regional DNS inconsistency
@@ -497,6 +498,7 @@ Charon displays instructions to remove the TXT record after certificate issuance
**Symptom**: Connection test passes, but record creation fails.
**Causes**:
- API token has read-only permissions
- Zone/domain not accessible with current credentials
- Rate limiting or account restrictions
@@ -513,6 +515,7 @@ Charon displays instructions to remove the TXT record after certificate issuance
**Symptom**: "Record already exists" error during certificate request.
**Causes**:
- Previous challenge attempt left orphaned record
- Manual DNS record with same name exists
- Another ACME client managing the same domain
@@ -551,6 +554,7 @@ Charon displays instructions to remove the TXT record after certificate issuance
**Symptom**: "Too many requests" or "Rate limit exceeded" errors.
**Causes**:
- Too many certificate requests in short period
- DNS provider API rate limits
- Let's Encrypt rate limits

View File

@@ -47,6 +47,7 @@ Docker auto-discovery eliminates manual IP address hunting and port memorization
For Charon to discover containers, it needs Docker API access.
**Docker Compose:**
```yaml
services:
charon:
@@ -56,6 +57,7 @@ services:
```
**Docker Run:**
```bash
docker run -v /var/run/docker.sock:/var/run/docker.sock:ro charon
```

View File

@@ -35,18 +35,21 @@ CHARON_PLUGIN_SIGNATURES='{"pluginname": "sha256:..."}'
### Examples
**Permissive mode (default)**:
```bash
# Unset — all plugins load without verification
unset CHARON_PLUGIN_SIGNATURES
```
**Strict block-all**:
```bash
# Empty object — no external plugins will load
export CHARON_PLUGIN_SIGNATURES='{}'
```
**Allowlist specific plugins**:
```bash
# Only powerdns and custom-provider plugins are allowed
export CHARON_PLUGIN_SIGNATURES='{"powerdns": "sha256:a1b2c3d4...", "custom-provider": "sha256:e5f6a7b8..."}'
@@ -63,6 +66,7 @@ sha256sum myplugin.so | awk '{print "sha256:" $1}'
```
**Example output**:
```
sha256:a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2
```
@@ -96,6 +100,7 @@ services:
```
This prevents runtime modification of plugin files, mitigating:
- Time-of-check to time-of-use (TOCTOU) attacks
- Malicious plugin replacement after signature verification
@@ -113,6 +118,7 @@ services:
```
Or in Dockerfile:
```dockerfile
FROM charon:latest
USER charon
@@ -128,6 +134,7 @@ Plugin directories must **not** be world-writable. Charon enforces this at start
| `0777` (world-writable) | ❌ Rejected — plugin loading disabled |
**Set secure permissions**:
```bash
chmod 755 /path/to/plugins
chmod 644 /path/to/plugins/*.so # Or 755 for executable
@@ -192,22 +199,26 @@ After updating plugins, always update your `CHARON_PLUGIN_SIGNATURES` with the n
### Checking if a Plugin Loaded
**Check startup logs**:
```bash
docker compose logs charon | grep -i plugin
```
**Expected success output**:
```
INFO Loaded DNS provider plugin type=powerdns name="PowerDNS" version="1.0.0"
INFO Loaded 1 external DNS provider plugins (0 failed)
```
**If using allowlist**:
```
INFO Plugin signature allowlist enabled with 2 entries
```
**Via API**:
```bash
curl http://localhost:8080/api/admin/plugins \
-H "Authorization: Bearer YOUR-TOKEN"
@@ -220,6 +231,7 @@ curl http://localhost:8080/api/admin/plugins \
**Cause**: The plugin filename (without `.so`) is not in `CHARON_PLUGIN_SIGNATURES`.
**Solution**: Add the plugin to your allowlist:
```bash
# Get the signature
sha256sum powerdns.so | awk '{print "sha256:" $1}'
@@ -233,6 +245,7 @@ export CHARON_PLUGIN_SIGNATURES='{"powerdns": "sha256:YOUR_HASH_HERE"}'
**Cause**: The plugin file's SHA-256 hash doesn't match the allowlist.
**Solution**:
1. Verify you have the correct plugin file
2. Re-compute the signature: `sha256sum plugin.so`
3. Update `CHARON_PLUGIN_SIGNATURES` with the correct hash
@@ -242,6 +255,7 @@ export CHARON_PLUGIN_SIGNATURES='{"powerdns": "sha256:YOUR_HASH_HERE"}'
**Cause**: The plugin directory is world-writable (mode `0777` or similar).
**Solution**:
```bash
chmod 755 /path/to/plugins
chmod 644 /path/to/plugins/*.so
@@ -252,11 +266,13 @@ chmod 644 /path/to/plugins/*.so
**Cause**: Malformed JSON in the environment variable.
**Solution**: Validate your JSON:
```bash
echo '{"powerdns": "sha256:abc123"}' | jq .
```
Common issues:
- Missing quotes around keys or values
- Trailing commas
- Single quotes instead of double quotes
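The checks above can be sketched in Go — an illustration of how the variable might be parsed and validated, not necessarily Charon's actual parsing code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// parseAllowlist validates CHARON_PLUGIN_SIGNATURES-style JSON:
// an object mapping plugin names to "sha256:..." strings.
func parseAllowlist(raw string) (map[string]string, error) {
	var m map[string]string
	if err := json.Unmarshal([]byte(raw), &m); err != nil {
		return nil, fmt.Errorf("invalid JSON: %w", err)
	}
	for name, sig := range m {
		if !strings.HasPrefix(sig, "sha256:") {
			return nil, fmt.Errorf("plugin %q: signature must start with \"sha256:\"", name)
		}
	}
	return m, nil
}

func main() {
	// Single quotes are not valid JSON — this is rejected.
	if _, err := parseAllowlist(`{'powerdns': 'sha256:abc'}`); err != nil {
		fmt.Println("rejected:", err)
	}
	m, _ := parseAllowlist(`{"powerdns": "sha256:abc123"}`)
	fmt.Println(len(m), "plugin(s) allowed")
}
```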
@@ -266,6 +282,7 @@ Common issues:
**Cause**: File permissions too restrictive or ownership mismatch.
**Solution**:
```bash
# Check current permissions
ls -la /path/to/plugins/
@@ -278,27 +295,32 @@ chown charon:charon /path/to/plugins/*.so
### Debugging Checklist
1. **Is the plugin directory configured?**
```bash
echo $CHARON_PLUGINS_DIR
```
2. **Does the plugin file exist?**
```bash
ls -la $CHARON_PLUGINS_DIR/*.so
```
3. **Are directory permissions secure?**
```bash
stat -c "%a %n" $CHARON_PLUGINS_DIR
# Should be 755 or stricter
```
4. **Is the signature correct?**
```bash
sha256sum $CHARON_PLUGINS_DIR/myplugin.so
```
5. **Is the JSON valid?**
```bash
echo "$CHARON_PLUGIN_SIGNATURES" | jq .
```

View File

@@ -69,22 +69,26 @@ X-Forwarded-Host preserves the original domain:
Your backend must trust proxy headers from Charon. Common configurations:
**Node.js/Express:**
```javascript
app.set('trust proxy', true);
```
**Django:**
```python
SECURE_PROXY_SSL_HEADER = ('HTTP_X_FORWARDED_PROTO', 'https')
USE_X_FORWARDED_HOST = True
```
**Rails:**
```ruby
config.action_dispatch.trusted_proxies = [IPAddr.new('10.0.0.0/8')]
```
**PHP/Laravel:**
```php
// In TrustProxies middleware
protected $proxies = '*';
```

View File

@@ -229,16 +229,19 @@ The emergency token is a security feature that allows bypassing all security mod
Choose your platform:
**Linux/macOS (recommended):**
```bash
openssl rand -hex 32
```
**Windows PowerShell:**
```powershell
[Convert]::ToBase64String([System.Security.Cryptography.RandomNumberGenerator]::GetBytes(32))
```
**Node.js (all platforms):**
```bash
node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"
```
@@ -252,11 +255,13 @@ CHARON_EMERGENCY_TOKEN=<paste_64_character_token_here>
```
**Example:**
```bash
CHARON_EMERGENCY_TOKEN=7b3b8a36a6fad839f1b3122131ed4b1f05453118a91b53346482415796e740e2
```
**Verify:**
```bash
# Token should be exactly 64 characters
echo -n "$(grep CHARON_EMERGENCY_TOKEN .env | cut -d= -f2)" | wc -c
@@ -287,20 +292,23 @@ For continuous integration, store the token in GitHub Secrets:
### Security Best Practices
**DO:**
- Generate tokens using cryptographically secure methods
- Store in `.env` (gitignored) or secrets management
- Rotate quarterly or after security events
- Use minimum 64 characters
**DON'T:**
- Commit tokens to repository (even in examples)
- Share tokens via email or chat
- Use weak or predictable values
- Reuse tokens across environments
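One quick way to enforce the "never commit tokens" rule before pushing: scan the tree for anything shaped like a 64-hex-char token assignment. The pattern below is a heuristic sketch, not an official Charon tool; it plants the example token from this guide in a temp directory so the snippet has something to find — point `SCAN_DIR` at your repo checkout in practice.

```shell
# Heuristic scan for accidentally committed emergency tokens.
# SCAN_DIR is illustrative; in real use, point it at your repo checkout.
SCAN_DIR=$(mktemp -d)
echo 'CHARON_EMERGENCY_TOKEN=7b3b8a36a6fad839f1b3122131ed4b1f05453118a91b53346482415796e740e2' \
  > "$SCAN_DIR/bad.env"   # planted example value from this guide
FOUND=$(grep -rlE 'CHARON_EMERGENCY_TOKEN=[0-9a-f]{64}' "$SCAN_DIR" | wc -l | tr -d ' ')
[ "$FOUND" -gt 0 ] && echo "WARNING: token-like value found in $FOUND file(s)"
```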
---
2. **Settings table** for `security.crowdsec.enabled = "true"`
3. **Starts CrowdSec** if either condition is true
1. **Settings table** for `security.crowdsec.enabled = "true"`
2. **Starts CrowdSec** if either condition is true
**How it works:**
@@ -582,7 +590,7 @@ Click "Watch" → "Custom" → Select "Security advisories" on the [Charon repos
**2. Notifications and Automatic Updates with Dockhand**
- Dockhand is a free service that monitors Docker images for updates and can send notifications or trigger auto-updates. https://github.com/Finsys/dockhand
- Dockhand is a free service that monitors Docker images for updates and can send notifications or trigger auto-updates. <https://github.com/Finsys/dockhand>
**Best Practices:**

View File

@@ -68,6 +68,7 @@ E2E tests require an emergency token to be configured in GitHub Secrets. This to
### Why This Is Needed
The emergency token is used by E2E tests to:
- Disable security modules (ACL, WAF, CrowdSec) after testing them
- Prevent cascading test failures due to leftover security state
- Ensure tests can always access the API regardless of security configuration
@@ -77,16 +78,19 @@ The emergency token is used by E2E tests to:
1. **Generate emergency token:**
**Linux/macOS:**
```bash
openssl rand -hex 32
```
**Windows PowerShell:**
```powershell
[Convert]::ToBase64String([System.Security.Cryptography.RandomNumberGenerator]::GetBytes(32))
```
**Node.js (all platforms):**
```bash
node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"
```
@@ -141,11 +145,13 @@ If the secret is missing or invalid, the workflow will fail with a clear error m
### Security Best Practices
✅ **DO:**
- Use cryptographically secure generation methods
- Rotate quarterly or after security events
- Store separately for local dev (`.env`) and CI/CD (GitHub Secrets)
❌ **DON'T:**
- Share tokens via email or chat
- Commit tokens to repository (even in example files)
- Reuse tokens across different environments
@@ -154,11 +160,13 @@ If the secret is missing or invalid, the workflow will fail with a clear error m
### Troubleshooting
**Error: "CHARON_EMERGENCY_TOKEN not set"**
- Check secret name is exactly `CHARON_EMERGENCY_TOKEN` (case-sensitive)
- Verify secret is repository-level, not environment-level
- Re-run workflow after adding secret
**Error: "Token too short"**
- Hex method must generate exactly 64 characters
- Verify you copied the entire token value
- Regenerate if needed
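To confirm a regenerated token has the right length and charset before adding it to GitHub Secrets (assumes `openssl` is available locally):

```shell
# Generate a token and verify it is exactly 64 hex characters.
TOKEN=$(openssl rand -hex 32)
LEN=$(printf '%s' "$TOKEN" | wc -c | tr -d ' ')
echo "length: $LEN"   # 32 random bytes encode to 64 hex characters
case "$TOKEN" in
  *[!0-9a-f]*) echo "unexpected character in token" ;;
  *)           echo "charset: ok" ;;
esac
```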

View File

@@ -88,6 +88,7 @@ In CrowdSec terms:
> **✅ Good News: Charon Handles This For You!**
>
> When you enable CrowdSec for the first time, Charon automatically:
>
> 1. Starts the CrowdSec engine
> 2. Registers a bouncer and generates a valid API key
> 3. Saves the key so it survives container restarts
@@ -317,11 +318,13 @@ Replace `YOUR_ENROLLMENT_KEY` with the key from your Console.
**Solution:**
1. Check if you're manually setting an API key:
```bash
grep -i "crowdsec_api_key" docker-compose.yml
```
2. If you find one, **remove it**:
```yaml
# REMOVE this line:
- CHARON_SECURITY_CROWDSEC_API_KEY=anything
@@ -330,6 +333,7 @@ Replace `YOUR_ENROLLMENT_KEY` with the key from your Console.
3. Follow the [Manual Bouncer Registration](#manual-bouncer-registration) steps above
4. Restart the container:
```bash
docker restart charon
```
@@ -347,6 +351,7 @@ Replace `YOUR_ENROLLMENT_KEY` with the key from your Console.
1. Wait 60 seconds after container start
2. Check if CrowdSec is running:
```bash
docker exec charon cscli lapi status
```
@@ -354,6 +359,7 @@ Replace `YOUR_ENROLLMENT_KEY` with the key from your Console.
3. If you see "connection refused," try toggling CrowdSec OFF then ON in the GUI
4. Check the logs:
```bash
docker logs charon | grep -i crowdsec
```
@@ -431,6 +437,7 @@ If you already run CrowdSec separately (not inside Charon), you can connect to i
**Steps:**
1. Register a bouncer on your external CrowdSec:
```bash
cscli bouncers add charon-bouncer
```
@@ -438,6 +445,7 @@ If you already run CrowdSec separately (not inside Charon), you can connect to i
2. Save the API key that's generated (you won't see it again!)
3. In your docker-compose.yml:
```yaml
environment:
- CHARON_SECURITY_CROWDSEC_API_URL=http://your-crowdsec-server:8080
@@ -445,6 +453,7 @@ If you already run CrowdSec separately (not inside Charon), you can connect to i
```
4. Restart Charon:
```bash
docker restart charon
```

View File

@@ -9,6 +9,7 @@ This directory contains operational maintenance guides for keeping Charon runnin
**When to use:** Docker build fails with GeoLite2-Country.mmdb checksum mismatch
**Topics covered:**
- Automated weekly checksum verification workflow
- Manual checksum update procedures (5 minutes)
- Verification script for checking upstream changes
@@ -16,6 +17,7 @@ This directory contains operational maintenance guides for keeping Charon runnin
- Alternative sources if upstream mirrors are unavailable
**Quick fix:**
```bash
# Download and update checksum automatically
NEW_CHECKSUM=$(curl -fsSL "https://github.com/P3TERX/GeoLite.mmdb/raw/download/GeoLite2-Country.mmdb" | sha256sum | cut -d' ' -f1)
@@ -34,6 +36,7 @@ Found a maintenance issue not covered here? Please:
3. **Update this index** with a link to your guide
**Format:**
```markdown
### [Guide Title](filename.md)
```

View File

@@ -15,6 +15,7 @@ Charon uses the [MaxMind GeoLite2-Country database](https://dev.maxmind.com/geoi
Update the checksum when:
1. **Docker build fails** with the following error:
```
sha256sum: /app/data/geoip/GeoLite2-Country.mmdb: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
@@ -29,6 +30,7 @@ Update the checksum when:
## Automated Workflow (Recommended)
Charon includes a GitHub Actions workflow that automatically:
- Checks for upstream GeoLite2 database changes weekly
- Calculates the new checksum
- Creates a pull request with the update
@@ -39,6 +41,7 @@ Charon includes a GitHub Actions workflow that automatically:
**Schedule:** Mondays at 2 AM UTC (weekly)
**Manual Trigger:**
```bash
gh workflow run update-geolite2.yml
```
@@ -75,16 +78,19 @@ sha256sum /tmp/geolite2-test.mmdb
**File:** [`Dockerfile`](../../Dockerfile) (line ~352)
**Find this line:**
```dockerfile
ARG GEOLITE2_COUNTRY_SHA256=<old-checksum>
```
**Replace with the new checksum:**
```dockerfile
ARG GEOLITE2_COUNTRY_SHA256=436135ee98a521da715a6d483951f3dbbd62557637f2d50d1987fc048874bd5d
```
**Using sed (automated):**
```bash
NEW_CHECKSUM=$(curl -fsSL "https://github.com/P3TERX/GeoLite.mmdb/raw/download/GeoLite2-Country.mmdb" | sha256sum | cut -d' ' -f1)
@@ -119,6 +125,7 @@ docker run --rm charon:test-checksum /app/charon --version
```
**Expected output:**
```
✅ GeoLite2-Country.mmdb: OK
✅ Successfully tagged charon:test-checksum
@@ -171,11 +178,13 @@ fi
```
**Make executable:**
```bash
chmod +x scripts/verify-geolite2-checksum.sh
```
**Run verification:**
```bash
./scripts/verify-geolite2-checksum.sh
```
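The repository's verification script isn't reproduced in this hunk; the core of it might look like the hypothetical `verify` helper below, which compares a local file against an expected SHA-256. The stub file and its checksum are stand-ins so the sketch runs offline — substitute the real `.mmdb` path and the checksum from the Dockerfile.

```shell
# Compare a local file's SHA-256 against an expected value.
# verify() is a hypothetical helper, not the repo's actual script; the stub
# file and checksum below are stand-ins so this runs without network access.
verify() {
  actual=$(sha256sum "$1" | cut -d' ' -f1)
  if [ "$actual" = "$2" ]; then
    echo "OK: checksum matches"
  else
    echo "MISMATCH: got $actual"; return 1
  fi
}
printf 'abc' > /tmp/geolite2-stub.mmdb
verify /tmp/geolite2-stub.mmdb ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```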
@@ -187,22 +196,26 @@ chmod +x scripts/verify-geolite2-checksum.sh
### Issue: Build Still Fails After Update
**Symptoms:**
- Checksum verification fails
- "FAILED" error persists
**Solutions:**
1. **Clear Docker build cache:**
```bash
docker builder prune -af
```
2. **Verify the checksum was committed:**
```bash
git show HEAD:Dockerfile | grep "GEOLITE2_COUNTRY_SHA256"
```
3. **Re-download and verify upstream file:**
```bash
curl -fsSL "https://github.com/P3TERX/GeoLite.mmdb/raw/download/GeoLite2-Country.mmdb" -o /tmp/test.mmdb
sha256sum /tmp/test.mmdb
@@ -212,28 +225,31 @@ chmod +x scripts/verify-geolite2-checksum.sh
### Issue: Upstream File Unavailable (404)
**Symptoms:**
- `curl` returns 404 Not Found
- Automated workflow fails with `download_failed` error
**Investigation Steps:**
1. **Check upstream repository:**
- Visit: https://github.com/P3TERX/GeoLite.mmdb
- Visit: <https://github.com/P3TERX/GeoLite.mmdb>
- Verify the file still exists at the raw URL
- Check for repository status or announcements
2. **Check MaxMind status:**
- Visit: https://status.maxmind.com/
- Visit: <https://status.maxmind.com/>
- Check for service outages or maintenance
**Temporary Solutions:**
1. **Use cached Docker layer** (if available):
```bash
docker build --cache-from ghcr.io/wikid82/charon:latest -t charon:latest .
```
2. **Use local copy** (temporary):
```bash
# Download from a working container
docker run --rm ghcr.io/wikid82/charon:latest cat /app/data/geoip/GeoLite2-Country.mmdb > /tmp/GeoLite2-Country.mmdb
@@ -249,12 +265,14 @@ chmod +x scripts/verify-geolite2-checksum.sh
### Issue: Checksum Mismatch on Re-download
**Symptoms:**
- Checksum calculated locally differs from what's in the Dockerfile
- Checksum changes between downloads
**Investigation Steps:**
1. **Verify file integrity:**
```bash
# Download multiple times and compare
for i in {1..3}; do
@@ -267,12 +285,14 @@ chmod +x scripts/verify-geolite2-checksum.sh
- Try from different network locations
3. **Verify no MITM proxy:**
```bash
# Download via HTTPS and verify certificate
curl -v -fsSL "https://github.com/P3TERX/GeoLite.mmdb/raw/download/GeoLite2-Country.mmdb" -o /tmp/test.mmdb 2>&1 | grep "CN="
```
**If confirmed as supply chain attack:**
- **STOP** and do not proceed
- Report to security team
- See [Security Incident Response](../security-incident-response.md)
@@ -280,6 +300,7 @@ chmod +x scripts/verify-geolite2-checksum.sh
### Issue: Multi-Platform Build Fails (arm64)
**Symptoms:**
- `linux/amd64` build succeeds
- `linux/arm64` build fails with checksum error
@@ -290,12 +311,14 @@ chmod +x scripts/verify-geolite2-checksum.sh
- Should be identical across all platforms
2. **Check buildx platform emulation:**
```bash
docker buildx ls
docker buildx inspect
```
3. **Test arm64 build explicitly:**
```bash
docker buildx build --platform linux/arm64 --load -t test-arm64 .
```
@@ -308,8 +331,8 @@ chmod +x scripts/verify-geolite2-checksum.sh
- **Implementation Plan:** [`docs/plans/current_spec.md`](../plans/current_spec.md)
- **QA Report:** [`docs/reports/qa_report.md`](../reports/qa_report.md)
- **Dockerfile:** [`Dockerfile`](../../Dockerfile) (line ~352)
- **MaxMind GeoLite2:** https://dev.maxmind.com/geoip/geolite2-free-geolocation-data
- **P3TERX Mirror:** https://github.com/P3TERX/GeoLite.mmdb
- **MaxMind GeoLite2:** <https://dev.maxmind.com/geoip/geolite2-free-geolocation-data>
- **P3TERX Mirror:** <https://github.com/P3TERX/GeoLite.mmdb>
---
@@ -321,9 +344,10 @@ chmod +x scripts/verify-geolite2-checksum.sh
**Solution:** Updated one line in `Dockerfile` (line 352) with the correct checksum and implemented an automated workflow to prevent future occurrences.
**Build Failure URL:** https://github.com/Wikid82/Charon/actions/runs/21584236523/job/62188372617
**Build Failure URL:** <https://github.com/Wikid82/Charon/actions/runs/21584236523/job/62188372617>
**Related PRs:**
- Fix implementation: (link to PR)
- Automated workflow addition: (link to PR)

View File

@@ -6,8 +6,9 @@ index efbcccda..64fcc121 100644
if: |
((inputs.browser || 'all') == 'chromium' || (inputs.browser || 'all') == 'all') &&
((inputs.test_category || 'all') == 'security' || (inputs.test_category || 'all') == 'all')
- timeout-minutes: 40
+ timeout-minutes: 60
- timeout-minutes: 40
- timeout-minutes: 60
env:
CHARON_EMERGENCY_TOKEN: ${{ secrets.CHARON_EMERGENCY_TOKEN }}
CHARON_EMERGENCY_SERVER_ENABLED: "true"
@@ -15,42 +16,45 @@ index efbcccda..64fcc121 100644
npx playwright test \
--project=chromium \
+ --output=playwright-output/security-chromium \
- --output=playwright-output/security-chromium \
tests/security-enforcement/ \
tests/security/ \
tests/integration/multi-feature-workflows.spec.ts || STATUS=$?
@@ -370,6 +371,25 @@ jobs:
path: test-results/**/*.zip
retention-days: 7
+ - name: Collect diagnostics
+ if: always()
+ run: |
+ mkdir -p diagnostics
+ uptime > diagnostics/uptime.txt
+ free -m > diagnostics/free-m.txt
+ df -h > diagnostics/df-h.txt
+ ps aux > diagnostics/ps-aux.txt
+ docker ps -a > diagnostics/docker-ps.txt || true
+ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
+
+ - name: Upload diagnostics
+ if: always()
+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
+ with:
+ name: e2e-diagnostics-chromium-security
+ path: diagnostics/
+ retention-days: 7
+
- - name: Collect diagnostics
- if: always()
- run: |
- mkdir -p diagnostics
- uptime > diagnostics/uptime.txt
- free -m > diagnostics/free-m.txt
- df -h > diagnostics/df-h.txt
- ps aux > diagnostics/ps-aux.txt
- docker ps -a > diagnostics/docker-ps.txt || true
- docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
-
- - name: Upload diagnostics
- if: always()
- uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
- with:
- name: e2e-diagnostics-chromium-security
- path: diagnostics/
- retention-days: 7
-
- name: Collect Docker logs on failure
if: failure()
run: |
@@ -394,7 +414,7 @@ jobs:
if: |
((inputs.browser || 'all') == 'firefox' || (inputs.browser || 'all') == 'all') &&
((inputs.test_category || 'all') == 'security' || (inputs.test_category || 'all') == 'all')
- timeout-minutes: 40
+ timeout-minutes: 60
- timeout-minutes: 40
- timeout-minutes: 60
env:
CHARON_EMERGENCY_TOKEN: ${{ secrets.CHARON_EMERGENCY_TOKEN }}
CHARON_EMERGENCY_SERVER_ENABLED: "true"
@@ -58,42 +62,45 @@ index efbcccda..64fcc121 100644
npx playwright test \
--project=firefox \
+ --output=playwright-output/security-firefox \
- --output=playwright-output/security-firefox \
tests/security-enforcement/ \
tests/security/ \
tests/integration/multi-feature-workflows.spec.ts || STATUS=$?
@@ -559,6 +580,25 @@ jobs:
path: test-results/**/*.zip
retention-days: 7
+ - name: Collect diagnostics
+ if: always()
+ run: |
+ mkdir -p diagnostics
+ uptime > diagnostics/uptime.txt
+ free -m > diagnostics/free-m.txt
+ df -h > diagnostics/df-h.txt
+ ps aux > diagnostics/ps-aux.txt
+ docker ps -a > diagnostics/docker-ps.txt || true
+ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
+
+ - name: Upload diagnostics
+ if: always()
+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
+ with:
+ name: e2e-diagnostics-firefox-security
+ path: diagnostics/
+ retention-days: 7
+
- - name: Collect diagnostics
- if: always()
- run: |
- mkdir -p diagnostics
- uptime > diagnostics/uptime.txt
- free -m > diagnostics/free-m.txt
- df -h > diagnostics/df-h.txt
- ps aux > diagnostics/ps-aux.txt
- docker ps -a > diagnostics/docker-ps.txt || true
- docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
-
- - name: Upload diagnostics
- if: always()
- uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
- with:
- name: e2e-diagnostics-firefox-security
- path: diagnostics/
- retention-days: 7
-
- name: Collect Docker logs on failure
if: failure()
run: |
@@ -583,7 +623,7 @@ jobs:
if: |
((inputs.browser || 'all') == 'webkit' || (inputs.browser || 'all') == 'all') &&
((inputs.test_category || 'all') == 'security' || (inputs.test_category || 'all') == 'all')
- timeout-minutes: 40
+ timeout-minutes: 60
- timeout-minutes: 40
- timeout-minutes: 60
env:
CHARON_EMERGENCY_TOKEN: ${{ secrets.CHARON_EMERGENCY_TOKEN }}
CHARON_EMERGENCY_SERVER_ENABLED: "true"
@@ -101,42 +108,45 @@ index efbcccda..64fcc121 100644
npx playwright test \
--project=webkit \
+ --output=playwright-output/security-webkit \
- --output=playwright-output/security-webkit \
tests/security-enforcement/ \
tests/security/ \
tests/integration/multi-feature-workflows.spec.ts || STATUS=$?
@@ -748,6 +789,25 @@ jobs:
path: test-results/**/*.zip
retention-days: 7
+ - name: Collect diagnostics
+ if: always()
+ run: |
+ mkdir -p diagnostics
+ uptime > diagnostics/uptime.txt
+ free -m > diagnostics/free-m.txt
+ df -h > diagnostics/df-h.txt
+ ps aux > diagnostics/ps-aux.txt
+ docker ps -a > diagnostics/docker-ps.txt || true
+ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
+
+ - name: Upload diagnostics
+ if: always()
+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
+ with:
+ name: e2e-diagnostics-webkit-security
+ path: diagnostics/
+ retention-days: 7
+
- - name: Collect diagnostics
- if: always()
- run: |
- mkdir -p diagnostics
- uptime > diagnostics/uptime.txt
- free -m > diagnostics/free-m.txt
- df -h > diagnostics/df-h.txt
- ps aux > diagnostics/ps-aux.txt
- docker ps -a > diagnostics/docker-ps.txt || true
- docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
-
- - name: Upload diagnostics
- if: always()
- uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
- with:
- name: e2e-diagnostics-webkit-security
- path: diagnostics/
- retention-days: 7
-
- name: Collect Docker logs on failure
if: failure()
run: |
@@ -779,7 +839,7 @@ jobs:
if: |
((inputs.browser || 'all') == 'chromium' || (inputs.browser || 'all') == 'all') &&
((inputs.test_category || 'all') == 'non-security' || (inputs.test_category || 'all') == 'all')
- timeout-minutes: 30
+ timeout-minutes: 60
- timeout-minutes: 30
- timeout-minutes: 60
env:
CHARON_EMERGENCY_TOKEN: ${{ secrets.CHARON_EMERGENCY_TOKEN }}
CHARON_EMERGENCY_SERVER_ENABLED: "true"
@@ -144,57 +154,61 @@ index efbcccda..64fcc121 100644
npx playwright test \
--project=chromium \
--shard=${{ matrix.shard }}/${{ matrix.total-shards }} \
+ --output=playwright-output/chromium-shard-${{ matrix.shard }} \
- --output=playwright-output/chromium-shard-${{ matrix.shard }} \
tests/core \
tests/dns-provider-crud.spec.ts \
tests/dns-provider-types.spec.ts \
@@ -915,6 +976,14 @@ jobs:
path: playwright-report/
retention-days: 14
+ - name: Upload Playwright output (Chromium shard ${{ matrix.shard }})
+ if: always()
+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
+ with:
+ name: playwright-output-chromium-shard-${{ matrix.shard }}
+ path: playwright-output/chromium-shard-${{ matrix.shard }}/
+ retention-days: 7
+
- - name: Upload Playwright output (Chromium shard ${{ matrix.shard }})
- if: always()
- uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
- with:
- name: playwright-output-chromium-shard-${{ matrix.shard }}
- path: playwright-output/chromium-shard-${{ matrix.shard }}/
- retention-days: 7
-
- name: Upload Chromium coverage (if enabled)
if: always() && (inputs.playwright_coverage == 'true' || vars.PLAYWRIGHT_COVERAGE == '1')
uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
@@ -931,6 +1000,25 @@ jobs:
path: test-results/**/*.zip
retention-days: 7
+ - name: Collect diagnostics
+ if: always()
+ run: |
+ mkdir -p diagnostics
+ uptime > diagnostics/uptime.txt
+ free -m > diagnostics/free-m.txt
+ df -h > diagnostics/df-h.txt
+ ps aux > diagnostics/ps-aux.txt
+ docker ps -a > diagnostics/docker-ps.txt || true
+ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
+
+ - name: Upload diagnostics
+ if: always()
+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
+ with:
+ name: e2e-diagnostics-chromium-shard-${{ matrix.shard }}
+ path: diagnostics/
+ retention-days: 7
+
- - name: Collect diagnostics
- if: always()
- run: |
- mkdir -p diagnostics
- uptime > diagnostics/uptime.txt
- free -m > diagnostics/free-m.txt
- df -h > diagnostics/df-h.txt
- ps aux > diagnostics/ps-aux.txt
- docker ps -a > diagnostics/docker-ps.txt || true
- docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
-
- - name: Upload diagnostics
- if: always()
- uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
- with:
- name: e2e-diagnostics-chromium-shard-${{ matrix.shard }}
- path: diagnostics/
- retention-days: 7
-
- name: Collect Docker logs on failure
if: failure()
run: |
@@ -955,7 +1043,7 @@ jobs:
if: |
((inputs.browser || 'all') == 'firefox' || (inputs.browser || 'all') == 'all') &&
((inputs.test_category || 'all') == 'non-security' || (inputs.test_category || 'all') == 'all')
- timeout-minutes: 30
+ timeout-minutes: 60
- timeout-minutes: 30
- timeout-minutes: 60
env:
CHARON_EMERGENCY_TOKEN: ${{ secrets.CHARON_EMERGENCY_TOKEN }}
CHARON_EMERGENCY_SERVER_ENABLED: "true"
@@ -202,57 +216,61 @@ index efbcccda..64fcc121 100644
npx playwright test \
--project=firefox \
--shard=${{ matrix.shard }}/${{ matrix.total-shards }} \
+ --output=playwright-output/firefox-shard-${{ matrix.shard }} \
- --output=playwright-output/firefox-shard-${{ matrix.shard }} \
tests/core \
tests/dns-provider-crud.spec.ts \
tests/dns-provider-types.spec.ts \
@@ -1099,6 +1188,14 @@ jobs:
path: playwright-report/
retention-days: 14
+ - name: Upload Playwright output (Firefox shard ${{ matrix.shard }})
+ if: always()
+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
+ with:
+ name: playwright-output-firefox-shard-${{ matrix.shard }}
+ path: playwright-output/firefox-shard-${{ matrix.shard }}/
+ retention-days: 7
+
- - name: Upload Playwright output (Firefox shard ${{ matrix.shard }})
- if: always()
- uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
- with:
- name: playwright-output-firefox-shard-${{ matrix.shard }}
- path: playwright-output/firefox-shard-${{ matrix.shard }}/
- retention-days: 7
-
- name: Upload Firefox coverage (if enabled)
if: always() && (inputs.playwright_coverage == 'true' || vars.PLAYWRIGHT_COVERAGE == '1')
uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
@@ -1115,6 +1212,25 @@ jobs:
path: test-results/**/*.zip
retention-days: 7
+ - name: Collect diagnostics
+ if: always()
+ run: |
+ mkdir -p diagnostics
+ uptime > diagnostics/uptime.txt
+ free -m > diagnostics/free-m.txt
+ df -h > diagnostics/df-h.txt
+ ps aux > diagnostics/ps-aux.txt
+ docker ps -a > diagnostics/docker-ps.txt || true
+ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
+
+ - name: Upload diagnostics
+ if: always()
+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
+ with:
+ name: e2e-diagnostics-firefox-shard-${{ matrix.shard }}
+ path: diagnostics/
+ retention-days: 7
+
- - name: Collect diagnostics
- if: always()
- run: |
- mkdir -p diagnostics
- uptime > diagnostics/uptime.txt
- free -m > diagnostics/free-m.txt
- df -h > diagnostics/df-h.txt
- ps aux > diagnostics/ps-aux.txt
- docker ps -a > diagnostics/docker-ps.txt || true
- docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
-
- - name: Upload diagnostics
- if: always()
- uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
- with:
- name: e2e-diagnostics-firefox-shard-${{ matrix.shard }}
- path: diagnostics/
- retention-days: 7
-
- name: Collect Docker logs on failure
if: failure()
run: |
@@ -1139,7 +1255,7 @@ jobs:
if: |
((inputs.browser || 'all') == 'webkit' || (inputs.browser || 'all') == 'all') &&
((inputs.test_category || 'all') == 'non-security' || (inputs.test_category || 'all') == 'all')
- timeout-minutes: 30
+ timeout-minutes: 60
- timeout-minutes: 30
- timeout-minutes: 60
env:
CHARON_EMERGENCY_TOKEN: ${{ secrets.CHARON_EMERGENCY_TOKEN }}
CHARON_EMERGENCY_SERVER_ENABLED: "true"
@@ -260,48 +278,50 @@ index efbcccda..64fcc121 100644
npx playwright test \
--project=webkit \
--shard=${{ matrix.shard }}/${{ matrix.total-shards }} \
+ --output=playwright-output/webkit-shard-${{ matrix.shard }} \
- --output=playwright-output/webkit-shard-${{ matrix.shard }} \
tests/core \
tests/dns-provider-crud.spec.ts \
tests/dns-provider-types.spec.ts \
@@ -1283,6 +1400,14 @@ jobs:
path: playwright-report/
retention-days: 14
+ - name: Upload Playwright output (WebKit shard ${{ matrix.shard }})
+ if: always()
+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6
+ with:
+ name: playwright-output-webkit-shard-${{ matrix.shard }}
+ path: playwright-output/webkit-shard-${{ matrix.shard }}/
+ retention-days: 7
+
- - name: Upload Playwright output (WebKit shard ${{ matrix.shard }})
- if: always()
- uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6
- with:
- name: playwright-output-webkit-shard-${{ matrix.shard }}
- path: playwright-output/webkit-shard-${{ matrix.shard }}/
- retention-days: 7
-
- name: Upload WebKit coverage (if enabled)
if: always() && (inputs.playwright_coverage == 'true' || vars.PLAYWRIGHT_COVERAGE == '1')
uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6
@@ -1299,6 +1424,25 @@ jobs:
path: test-results/**/*.zip
retention-days: 7
+ - name: Collect diagnostics
+ if: always()
+ run: |
+ mkdir -p diagnostics
+ uptime > diagnostics/uptime.txt
+ free -m > diagnostics/free-m.txt
+ df -h > diagnostics/df-h.txt
+ ps aux > diagnostics/ps-aux.txt
+ docker ps -a > diagnostics/docker-ps.txt || true
+ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
+
+ - name: Upload diagnostics
+ if: always()
+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6
+ with:
+ name: e2e-diagnostics-webkit-shard-${{ matrix.shard }}
+ path: diagnostics/
+ retention-days: 7
+
- - name: Collect diagnostics
- if: always()
- run: |
- mkdir -p diagnostics
- uptime > diagnostics/uptime.txt
- free -m > diagnostics/free-m.txt
- df -h > diagnostics/df-h.txt
- ps aux > diagnostics/ps-aux.txt
- docker ps -a > diagnostics/docker-ps.txt || true
- docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
-
- - name: Upload diagnostics
- if: always()
- uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6
- with:
- name: e2e-diagnostics-webkit-shard-${{ matrix.shard }}
- path: diagnostics/
- retention-days: 7
-
- name: Collect Docker logs on failure
if: failure()
run: |

View File

@@ -31,6 +31,7 @@ for _, s := range settings {
```
**Key Improvements:**
- **Single Query:** `WHERE key IN (?, ?, ?)` fetches all flags in one database round-trip
- **O(1) Lookups:** Map-based access eliminates linear search overhead
- **Error Handling:** Explicit error logging and HTTP 500 response on failure
@@ -56,6 +57,7 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error {
```
**Key Improvements:**
- **Atomic Updates:** All flag changes commit or rollback together
- **Error Recovery:** Transaction rollback prevents partial state
- **Improved Logging:** Explicit error messages for debugging
@@ -65,10 +67,12 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error {
### Before Optimization (Baseline - N+1 Pattern)
**Architecture:**
- GetFlags(): 3 sequential `WHERE key = ?` queries (one per flag)
- UpdateFlags(): Multiple separate transactions
**Latency (Expected):**
- **GET P50:** 300ms (CI environment)
- **GET P95:** 500ms
- **GET P99:** 600ms
@@ -77,20 +81,24 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error {
- **PUT P99:** 600ms
**Query Count:**
- GET: 3 queries (N+1 pattern, N=3 flags)
- PUT: 1-3 queries depending on flag count
**CI Impact:**
- Test flakiness: ~30% failure rate due to timeouts
- E2E test pass rate: ~70%
### After Optimization (Current - Batch Query + Transaction)
**Architecture:**
- GetFlags(): 1 batch query `WHERE key IN (?, ?, ?)`
- UpdateFlags(): 1 transaction wrapping all updates
**Latency (Target):**
- **GET P50:** 100ms (3x faster)
- **GET P95:** 150ms (3.3x faster)
- **GET P99:** 200ms (3x faster)
@@ -99,10 +107,12 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error {
- **PUT P99:** 200ms (3x faster)
**Query Count:**
- GET: 1 batch query (N+1 eliminated)
- PUT: 1 transaction (atomic)
**CI Impact (Expected):**
- Test flakiness: 0% (with retry logic + polling)
- E2E test pass rate: 100%
@@ -125,11 +135,13 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error {
**Status:** Complete
**Changes:**
- Added `defer` timing to GetFlags() and UpdateFlags()
- Log format: `[METRICS] GET/PUT /feature-flags: {duration}ms`
- CI pipeline captures P50/P95/P99 metrics
**Files Modified:**
- `backend/internal/api/handlers/feature_flags_handler.go`
### Phase 1: Backend Optimization - N+1 Query Fix
@@ -139,16 +151,19 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error {
**Priority:** P0 - Critical CI Blocker
**Changes:**
- **GetFlags():** Replaced N+1 loop with batch query `WHERE key IN (?)`
- **UpdateFlags():** Wrapped updates in single transaction
- **Tests:** Added batch query and transaction rollback tests
- **Benchmarks:** Added BenchmarkGetFlags and BenchmarkUpdateFlags
**Files Modified:**
- `backend/internal/api/handlers/feature_flags_handler.go`
- `backend/internal/api/handlers/feature_flags_handler_test.go`
**Expected Impact:**
- 3-6x latency reduction (600ms → 200ms P99)
- Elimination of N+1 query anti-pattern
- Atomic updates with rollback on error
@@ -159,32 +174,38 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error {
### Test Helpers Used
**Polling Helper:** `waitForFeatureFlagPropagation()`
- Polls `/api/v1/feature-flags` until expected state confirmed
- Default interval: 500ms
- Default timeout: 30s (150x safety margin over 200ms P99)
**Retry Helper:** `retryAction()`
- 3 max attempts with exponential backoff (2s, 4s, 8s)
- Handles transient network/DB failures
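The same retry-with-backoff idea, sketched in shell (the E2E helper itself is TypeScript; delays are shortened here so the example runs quickly, where the real helper uses 2s/4s/8s):

```shell
# Retry a command up to 3 times with exponential backoff.
# Delays are 0.2s/0.4s/0.8s here; the E2E helper uses 2s/4s/8s.
retry() {
  attempt=1 delay=0.2
  while ! "$@"; do
    [ "$attempt" -ge 3 ] && { echo "giving up after $attempt attempts"; return 1; }
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$(awk "BEGIN{print $delay*2}")
  done
  echo "succeeded on attempt $attempt"
}

# Simulate a transient failure: fails twice, then succeeds.
STATE=/tmp/retry-demo.count; rm -f "$STATE"
flaky() { n=$(cat "$STATE" 2>/dev/null || echo 0); n=$((n+1)); echo "$n" > "$STATE"; [ "$n" -ge 3 ]; }
retry flaky
```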
### Timeout Strategy
**Helper Defaults:**
- `clickAndWaitForResponse()`: 30s timeout
- `waitForAPIResponse()`: 30s timeout
- No explicit timeouts in test files (rely on helper defaults)
**Typical Poll Count:**
- Local: 1-2 polls (50-200ms response + 500ms interval)
- CI: 1-3 polls (50-200ms response + 500ms interval)
### Test Files
**E2E Tests:**
- `tests/settings/system-settings.spec.ts` - Feature toggle tests
- `tests/utils/wait-helpers.ts` - Polling and retry helpers
**Backend Tests:**
- `backend/internal/api/handlers/feature_flags_handler_test.go`
- `backend/internal/api/handlers/feature_flags_handler_coverage_test.go`
@@ -205,11 +226,13 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### Benchmark Analysis
**GetFlags Benchmark:**
- Measures single batch query performance
- Tests with 3 flags in database
- Includes JSON serialization overhead
**UpdateFlags Benchmark:**
- Measures transaction wrapping performance
- Tests atomic update of 3 flags
- Includes JSON deserialization and validation
@@ -219,14 +242,17 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### Why Batch Query Over Individual Queries?
**Problem:** N+1 pattern causes linear latency scaling
- 3 flags = 3 queries × 200ms = 600ms total
- 10 flags = 10 queries × 200ms = 2000ms total
**Solution:** Single batch query with IN clause
- N flags = 1 query × 200ms = 200ms total
- Constant time regardless of flag count
**Trade-offs:**
- ✅ 3-6x latency reduction
- ✅ Scales to more flags without performance degradation
- ⚠️ Slightly more complex code (map-based lookup)
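The batch-vs-N+1 trade-off can be made concrete with a small Go sketch — an in-memory stand-in for the settings table with hypothetical names, not the handler's actual GORM code:

```go
package main

import "fmt"

// store simulates the settings table and counts round trips, making the
// N+1 vs batch difference measurable.
type store struct {
	rows    map[string]string
	queries int
}

// getOne is the N+1 pattern: one query per flag.
func (s *store) getOne(key string) string {
	s.queries++
	return s.rows[key]
}

// getBatch is the optimized pattern: a single IN-clause query for all
// keys, resolved afterwards via the map-based lookup noted above.
func (s *store) getBatch(keys []string) map[string]string {
	s.queries++ // one round trip regardless of len(keys)
	out := make(map[string]string, len(keys))
	for _, k := range keys {
		if v, ok := s.rows[k]; ok {
			out[k] = v
		}
	}
	return out
}

func main() {
	s := &store{rows: map[string]string{"a": "on", "b": "off", "c": "on"}}
	for _, k := range []string{"a", "b", "c"} {
		_ = s.getOne(k)
	}
	fmt.Println("N+1 queries:", s.queries) // 3
	s.queries = 0
	_ = s.getBatch([]string{"a", "b", "c"})
	fmt.Println("batch queries:", s.queries) // 1
}
```

With a fixed per-query latency, the query count is the latency multiplier — which is exactly the 3x–6x reduction measured above.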
@@ -234,14 +260,17 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### Why Transaction Wrapping?
**Problem:** Multiple separate writes risk partial state
- Flag 1 succeeds, Flag 2 fails → inconsistent state
- No rollback mechanism for failed updates
**Solution:** Single transaction for all updates
- All succeed together or all rollback
- ACID guarantees for multi-flag updates
**Trade-offs:**
- ✅ Atomic updates with rollback on error
- ✅ Prevents partial state corruption
- ⚠️ Slightly longer locks (mitigated by fast SQLite)
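A minimal Go sketch of the all-or-nothing semantics, using a staged map copy in place of a real database transaction (names are illustrative, not the handler's code):

```go
package main

import (
	"errors"
	"fmt"
)

// updateFlagsAtomic applies all updates to a staged copy and only
// "commits" if every update validates — mirroring the transaction's
// all-succeed-or-all-rollback guarantee without a database.
func updateFlagsAtomic(current map[string]bool, updates map[string]bool, valid func(string) bool) (map[string]bool, error) {
	staged := make(map[string]bool, len(current))
	for k, v := range current {
		staged[k] = v
	}
	for k, v := range updates {
		if !valid(k) {
			// Any failure discards the staged copy: no partial state.
			return current, errors.New("invalid flag key: " + k)
		}
		staged[k] = v
	}
	return staged, nil // "commit"
}

func main() {
	known := map[string]bool{"a": true, "b": false, "c": true}
	valid := func(k string) bool { _, ok := known[k]; return ok }

	// One bad key: the whole update is rejected, "a" keeps its old value.
	after, err := updateFlagsAtomic(known, map[string]bool{"a": false, "bogus": true}, valid)
	fmt.Println(err != nil, after["a"]) // true true
}
```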
@@ -253,11 +282,13 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
**Status:** Not implemented (not needed after Phase 1 optimization)
**Rationale:**
- Current latency (50-200ms) is acceptable for feature flags
- Feature flags change infrequently (not a hot path)
- Adding cache increases complexity without significant benefit
**If Needed:**
- Use Redis or in-memory cache with TTL=60s
- Invalidate on PUT operations
- Expected improvement: 50-200ms → 10-50ms
@@ -267,11 +298,13 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
**Status:** SQLite default indexes sufficient
**Rationale:**
- `settings.key` column used in WHERE clauses
- SQLite automatically indexes primary key
- Query plan analysis shows index usage
**If Needed:**
- Add explicit index: `CREATE INDEX idx_settings_key ON settings(key)`
- Expected improvement: Minimal (already fast)
@@ -280,11 +313,13 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
**Status:** GORM default pooling sufficient
**Rationale:**
- GORM uses `database/sql` pool by default
- Current concurrency limits adequate
- No connection exhaustion observed
**If Needed:**
- Tune `SetMaxOpenConns()` and `SetMaxIdleConns()`
- Expected improvement: 10-20% under high load
@@ -293,12 +328,14 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### Metrics to Track
**Backend Metrics:**
- P50/P95/P99 latency for GET and PUT operations
- Query count per request (should remain 1 for GET)
- Transaction count per PUT (should remain 1)
- Error rate (target: <0.1%)
**E2E Metrics:**
- Test pass rate for feature toggle tests
- Retry attempt frequency (target: <5%)
- Polling iteration count (typical: 1-3)
@@ -307,11 +344,13 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### Alerting Thresholds
**Backend Alerts:**
- P99 > 500ms → Investigate regression (2.5x slower than optimized)
- Error rate > 1% → Check database health
- Query count > 1 for GET → N+1 pattern reintroduced
**E2E Alerts:**
- Test pass rate < 95% → Check for new flakiness
- Timeout errors > 0 → Investigate CI environment
- Retry rate > 10% → Investigate transient failure source
@@ -319,10 +358,12 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### Dashboard
**CI Metrics:**
- Link: `.github/workflows/e2e-tests.yml` artifacts
- Extracts `[METRICS]` logs for P50/P95/P99 analysis
**Backend Logs:**
- Docker container logs with `[METRICS]` tag
- Example: `[METRICS] GET /feature-flags: 120ms`
@@ -331,15 +372,18 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### High Latency (P99 > 500ms)
**Symptoms:**
- E2E tests timing out
- Backend logs show latency spikes
**Diagnosis:**
1. Check query count: `grep "SELECT" backend/logs/query.log`
2. Verify batch query: Should see `WHERE key IN (...)`
3. Check transaction wrapping: Should see single `BEGIN ... COMMIT`
**Remediation:**
- If N+1 pattern detected: Verify batch query implementation
- If transaction missing: Verify transaction wrapping
- If database locks: Check concurrent access patterns
@@ -347,15 +391,18 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### Transaction Rollback Errors
**Symptoms:**
- PUT requests return 500 errors
- Backend logs show transaction failure
**Diagnosis:**
1. Check error message: `grep "Failed to update feature flags" backend/logs/app.log`
2. Verify database constraints: Unique key constraints, foreign keys
3. Check database connectivity: Connection pool exhaustion
**Remediation:**
- If constraint violation: Fix invalid flag key or value
- If connection issue: Tune connection pool settings
- If deadlock: Analyze concurrent access patterns
@@ -363,15 +410,18 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### E2E Test Flakiness
**Symptoms:**
- Tests pass locally, fail in CI
- Timeout errors in Playwright logs
**Diagnosis:**
1. Check backend latency: `grep -F "[METRICS]" ci-logs.txt` (use `-F`: unquoted, the brackets form a grep character class)
2. Verify retry logic: Should see retry attempts in logs
3. Check polling behavior: Should see multiple GET requests
**Remediation:**
- If backend slow: Investigate CI environment (disk I/O, CPU)
- If no retries: Verify `retryAction()` wrapper in test
- If no polling: Verify `waitForFeatureFlagPropagation()` usage

View File

@@ -11,6 +11,7 @@
### Issue 1: `rate_limit` handler never appears in running Caddy config
**Observed symptom** (from CI log):
```
Attempt 10/10: rate_limit handler not found, waiting...
✗ rate_limit handler verification failed after 10 attempts
@@ -22,6 +23,7 @@ Rate limit enforcement test FAILED
#### Code path trace
The `verify_rate_limit_config` function in `scripts/rate_limit_integration.sh` (lines ~35-58) executes:
```bash
caddy_config=$(curl -s http://localhost:2119/config 2>/dev/null || echo "")
if echo "$caddy_config" | grep -q '"handler":"rate_limit"'; then
@@ -48,6 +50,7 @@ The handler is absent from Caddy's running config because `ApplyConfig` in `back
**Root cause A — silent failure of the security config POST step** (contributing):
The security config POST step in the script discards stdout, and because curl without the `-f` flag exits 0 even on HTTP 4xx responses, auth failures are invisible:
```bash
# scripts/rate_limit_integration.sh, ~line 248
curl -s -X POST -H "Content-Type: application/json" \
@@ -55,9 +58,11 @@ curl -s -X POST -H "Content-Type: application/json" \
-b ${TMP_COOKIE} \
http://localhost:8280/api/v1/security/config >/dev/null
```
No HTTP status check is performed. If this returns 4xx (e.g., `403 Forbidden` because the requesting user lacks the admin role, or `401 Unauthorized` because the cookie was not accepted), the config is never saved to DB, `ApplyConfig` is never called with the rate_limit values, and the handler is never injected.
The route is protected by `middleware.RequireRole(models.RoleAdmin)` (routes.go:572-573):
```go
securityAdmin := management.Group("/security")
securityAdmin.Use(middleware.RequireRole(models.RoleAdmin))
@@ -69,6 +74,7 @@ A non-admin authenticated user, or an unauthenticated request, returns `403` sil
**Root cause B — warn-and-proceed instead of fail-hard** (amplifier):
`verify_rate_limit_config` returns `1` on failure, but the calling site in the script treats the failure as non-fatal:
```bash
# scripts/rate_limit_integration.sh, ~line 269
if ! verify_rate_limit_config; then
@@ -76,11 +82,13 @@ if ! verify_rate_limit_config; then
echo "Proceeding with test anyway..."
fi
```
The enforcement test that follows is guaranteed to fail when the handler is absent (all requests pass through with HTTP 200, never hitting 429), yet the test proceeds unconditionally. The verification failure should be a hard exit.
**Root cause C — no response code check for proxy host creation** (contributing):
The proxy host creation at step 5 checks the status code (`201` vs other), but allows non-201 with a soft log message:
```bash
if [ "$CREATE_STATUS" = "201" ]; then
echo "✓ Proxy host created successfully"
@@ -88,11 +96,13 @@ else
echo " Proxy host may already exist (status: $CREATE_STATUS)"
fi
```
If this returns `401` (auth failure), no proxy host is registered. Requests to `http://localhost:8180/get` with `Host: ratelimit.local` then hit Caddy's catch-all route returning HTTP 200 (the Charon frontend), not the backend. No 429 will ever appear regardless of rate limit configuration.
**Root cause D — `ApplyConfig` failure is swallowed; Caddy not yet ready when config is posted** (primary):
In `UpdateConfig` (`security_handler.go:289-292`):
```go
if h.caddyManager != nil {
if err := h.caddyManager.ApplyConfig(c.Request.Context()); err != nil {
@@ -101,6 +111,7 @@ if h.caddyManager != nil {
}
c.JSON(http.StatusOK, gin.H{"config": payload})
```
If `ApplyConfig` fails (Caddy not yet fully initialized, config validation error), the error is logged as a warning but the HTTP response is still `200 OK`. The test script sees 200, assumes success, and proceeds.
---
@@ -110,11 +121,13 @@ If `ApplyConfig` fails (Caddy not yet fully initialized, config validation error
**Observed symptom**: During non-CI Docker builds, the GeoIP download step prints `⚠️ Checksum failed` and creates a `.placeholder` file, but the downloaded `.mmdb` is left on disk alongside the placeholder.
**Code location**: `Dockerfile`, lines that contain:
```dockerfile
ARG GEOLITE2_COUNTRY_SHA256=aa154fc6bcd712644de232a4abcdd07dac1f801308c0b6f93dbc2b375443da7b
```
**Non-CI verification block** (Dockerfile, local build path):
```dockerfile
if [ -s /app/data/geoip/GeoLite2-Country.mmdb ] && \
echo "${GEOLITE2_COUNTRY_SHA256} /app/data/geoip/GeoLite2-Country.mmdb" | sha256sum -c -; then
@@ -146,6 +159,7 @@ fi;
**Required change**: Capture the HTTP status code from the login response. Fail fast if login returns non-200.
Exact change — replace:
```bash
curl -s -X POST -H "Content-Type: application/json" \
-d '{"email":"ratelimit@example.local","password":"password123"}' \
@@ -156,6 +170,7 @@ echo "✓ Authentication complete"
```
With:
```bash
LOGIN_STATUS=$(curl -s -w "\n%{http_code}" -X POST -H "Content-Type: application/json" \
-d '{"email":"ratelimit@example.local","password":"password123"}' \
@@ -174,6 +189,7 @@ echo "✓ Authentication complete (HTTP $LOGIN_STATUS)"
**Current behavior**: Non-201 responses are treated as "may already exist" and execution continues — including `401`/`403` auth failures.
Required change — replace:
```bash
if [ "$CREATE_STATUS" = "201" ]; then
echo "✓ Proxy host created successfully"
@@ -183,6 +199,7 @@ fi
```
With:
```bash
if [ "$CREATE_STATUS" = "201" ]; then
echo "✓ Proxy host created successfully"
@@ -201,6 +218,7 @@ fi
**Rationale**: Root Cause D is the primary driver of handler-not-found failures. If Caddy's admin API is not yet fully initialized when the security config is POSTed, `ApplyConfig` fails silently (logged as a warning only), the rate_limit handler is never injected into Caddy's running config, and the verification loop times out. The readiness gate ensures Caddy is accepting admin API requests before any config change is attempted.
**Required change** — insert before the security config POST:
```bash
echo "Waiting for Caddy admin API to be ready..."
for i in {1..20}; do
@@ -224,6 +242,7 @@ done
**Current behavior**: Response is discarded with `>/dev/null`. No status check.
Required change — replace:
```bash
curl -s -X POST -H "Content-Type: application/json" \
-d "${SEC_CFG_PAYLOAD}" \
@@ -234,6 +253,7 @@ echo "✓ Rate limiting configured"
```
With:
```bash
SEC_CONFIG_RESP=$(curl -s -w "\n%{http_code}" -X POST -H "Content-Type: application/json" \
-d "${SEC_CFG_PAYLOAD}" \
@@ -258,6 +278,7 @@ echo "✓ Rate limiting configured (HTTP $SEC_CONFIG_STATUS)"
**Current behavior**: Failed verification logs a warning and continues.
Required change — replace:
```bash
echo "Waiting for Caddy to apply configuration..."
sleep 5
@@ -270,6 +291,7 @@ fi
```
With:
```bash
echo "Waiting for Caddy to apply configuration..."
sleep 8
@@ -307,6 +329,7 @@ local wait=5 # was: 3
#### Change 7 — Use trailing slash on Caddy admin API URL in `verify_rate_limit_config`
**Location**: `verify_rate_limit_config`, line ~42:
```bash
caddy_config=$(curl -s http://localhost:2119/config 2>/dev/null || echo "")
```
@@ -314,11 +337,13 @@ caddy_config=$(curl -s http://localhost:2119/config 2>/dev/null || echo "")
Caddy's admin API specification defines `GET /config/` (with trailing slash) as the canonical endpoint for the full running config. Omitting the slash works in practice because Caddy does not redirect, but the canonical form is more correct and guards against future behavioral changes.
Replace:
```bash
caddy_config=$(curl -s http://localhost:2119/config 2>/dev/null || echo "")
```
With:
```bash
caddy_config=$(curl -s http://localhost:2119/config/ 2>/dev/null || echo "")
```
@@ -377,6 +402,7 @@ fi
**Important**: Do NOT remove the `ARG GEOLITE2_COUNTRY_SHA256` declaration from the Dockerfile. The `update-geolite2.yml` workflow uses `sed` to update that ARG. If the ARG disappears, the workflow's `sed` command will silently no-op and fail to update the Dockerfile on next run, leaving the stale hash in source while the workflow reports success. Keeping the ARG (even unused) preserves Renovate/workflow compatibility.
Keep:
```dockerfile
ARG GEOLITE2_COUNTRY_SHA256=aa154fc6bcd712644de232a4abcdd07dac1f801308c0b6f93dbc2b375443da7b
```
@@ -402,6 +428,7 @@ This ARG is now only referenced by the `update-geolite2.yml` workflow (to know i
### Validating Issue 1 fix
**Step 1 — Build and run the integration test locally:**
```bash
# From /projects/Charon
chmod +x scripts/rate_limit_integration.sh
@@ -409,6 +436,7 @@ scripts/rate_limit_integration.sh 2>&1 | tee /tmp/ratelimit-test.log
```
**Expected output sequence (key lines)**:
```
✓ Charon API is ready
✓ Authentication complete (HTTP 200)
@@ -428,16 +456,20 @@ Sending request 3+1 (should return 429 Too Many Requests)...
**Step 2 — Deliberately break auth to verify the new guard fires:**
Temporarily change `password123` in the login curl to a wrong password. The test should now print:
```
✗ Login failed (HTTP 401) — aborting
```
and exit with code 1, rather than proceeding to a confusing 429-enforcement failure.
**Step 3 — Verify Caddy config contains the handler before enforcement:**
```bash
# After security config step and sleep 8:
curl -s http://localhost:2119/config/ | python3 -m json.tool | grep -A2 '"handler": "rate_limit"'
```
Expected: handler block with `"rate_limits"` sub-key containing `"static"` zone.
**Step 4 — CI validation:** Push to a PR and observe the `Rate Limiting Integration` workflow. The workflow now fails fast at the first clearly-reported error rather than proceeding to a misleading "enforcement test FAILED" message.
@@ -445,21 +477,27 @@ Expected: handler block with `"rate_limits"` sub-key containing `"static"` zone.
### Validating Issue 2 fix
**Step 1 — Local build without CI flag:**
```bash
docker build -t charon:geolip-test --build-arg CI=false . 2>&1 | grep -E "GeoIP|GeoLite|checksum|✅|⚠️"
```
Expected: `✅ GeoIP downloaded` (no mention of checksum failure).
**Step 2 — Verify file is present and readable:**
```bash
docker run --rm charon:geolip-test stat /app/data/geoip/GeoLite2-Country.mmdb
```
Expected: file exists with non-zero size, no `.placeholder` alongside.
**Step 3 — Confirm ARG still exists for workflow compatibility:**
```bash
grep "GEOLITE2_COUNTRY_SHA256" Dockerfile
```
Expected: `ARG GEOLITE2_COUNTRY_SHA256=<hash>` line is present.
---

View File

@@ -37,6 +37,7 @@ Content-Type: application/json
```
**Key design decisions:**
- **Token storage:** The bot token is stored in `NotificationProvider.Token` (`json:"-"`, encrypted at rest) — never in the URL field. This mirrors the Gotify pattern where secrets are separated from endpoints.
- **URL field:** Stores only the `chat_id` (e.g., `987654321`). At dispatch time, the full API URL is constructed dynamically: `https://api.telegram.org/bot` + decryptedToken + `/sendMessage`. The `chat_id` is passed in the POST body alongside the message text. This prevents token leakage via API responses since URL is `json:"url"`.
- **SSRF mitigation:** Before dispatching, validate that the constructed URL hostname is exactly `api.telegram.org`. This prevents SSRF if stored data is tampered with.
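A minimal Go sketch of the dispatch-time URL construction and SSRF guard described above (illustrative, not the handler's actual code):

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// validateTelegramURL is the SSRF guard: the dispatch URL must be HTTPS
// and its hostname exactly api.telegram.org, so tampered stored data
// cannot redirect notifications to an internal host.
func validateTelegramURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "https" || u.Hostname() != "api.telegram.org" {
		return errors.New("refusing non-Telegram endpoint: " + raw)
	}
	return nil
}

// buildSendMessageURL assembles the API URL from the decrypted token at
// dispatch time; the chat_id travels in the POST body, never the URL.
func buildSendMessageURL(token string) (string, error) {
	raw := "https://api.telegram.org/bot" + token + "/sendMessage"
	if err := validateTelegramURL(raw); err != nil {
		return "", err
	}
	return raw, nil
}

func main() {
	u, _ := buildSendMessageURL("123456:hypothetical-token")
	fmt.Println(u)
	fmt.Println(validateTelegramURL("https://attacker.example/bot/sendMessage") != nil) // true
}
```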
@@ -475,6 +476,7 @@ Request/response schemas are unchanged. The `type` field now accepts `"telegram"
Modeled after `tests/settings/email-notification-provider.spec.ts`.
Test scenarios:
1. Create a Telegram provider (name, chat_id in URL field, bot token in token field, enable events)
2. Verify provider appears in the list
3. Edit the Telegram provider (change name, verify token preservation)
@@ -611,6 +613,7 @@ Add telegram to the payload matrix test scenarios.
**Scope:** Feature flags, service layer, handler layer, all Go unit tests
**Files changed:**
- `backend/internal/notifications/feature_flags.go`
- `backend/internal/api/handlers/feature_flags_handler.go`
- `backend/internal/notifications/router.go`
@@ -624,6 +627,7 @@ Add telegram to the payload matrix test scenarios.
**Dependencies:** None (self-contained backend change)
**Validation gates:**
- `go test ./...` passes
- `make lint-fast` passes
- Coverage ≥ 85%
@@ -636,6 +640,7 @@ Add telegram to the payload matrix test scenarios.
**Scope:** Frontend API client, Notifications page, i18n strings, frontend unit tests, Playwright E2E tests
**Files changed:**
- `frontend/src/api/notifications.ts`
- `frontend/src/pages/Notifications.tsx`
- `frontend/src/locales/en/translation.json`
@@ -648,6 +653,7 @@ Add telegram to the payload matrix test scenarios.
**Dependencies:** PR-1 must be merged first (backend must accept `type: "telegram"`)
**Validation gates:**
- `npm test` passes
- `npm run type-check` passes
- `npx playwright test --project=firefox` passes

View File

@@ -55,6 +55,7 @@ disabled={testMutation.isPending || (isNew && !isEmail)}
**Why it was added:** The backend `Test` handler at `notification_provider_handler.go` (L333-336) requires a saved provider ID for all non-email types. For Gotify/Telegram, the server needs the stored token. For Discord/Webhook, the server still fetches the provider from DB. Without a saved provider, the backend returns `MISSING_PROVIDER_ID`.
**Why it breaks tests:** Many existing E2E and unit tests click the test button from a **new (unsaved) provider form** using mocked endpoints. With the new guard:
1. The `<button>` is `disabled` → browser ignores clicks → mocked routes never receive requests
2. Even if not disabled, `handleTest()` returns early with a toast instead of calling `testMutation.mutate()`
3. Tests that `waitForRequest` on `/providers/test` time out (60s default)
@@ -103,6 +104,7 @@ These tests open the "Add Provider" form (no `id`), click `provider-test-btn`, a
| 2 | retry split distinguishes retryable and non-retryable failures | L410 | webhook | `provider-test-btn` disabled for new webhook form; `waitForResponse` times out |
**Tests that should still pass:**
- `valid payload flows for discord, gotify, and webhook` (L54) — uses `provider-save-btn`, not test button
- `malformed payload scenarios` (L158) — API-level tests via `page.request.post`
- `missing required fields block submit` (L192) — uses save button
@@ -119,6 +121,7 @@ These tests open the "Add Provider" form (no `id`), click `provider-test-btn`, a
| 2 | should test telegram notification provider | L265 | Row-level Send Test button; possible accessible name mismatch in WebKit with `title` attribute |
**Tests that should pass:**
- Form rendering tests (L25, L65) — UI assertions only
- Create telegram provider (L89) — mocked POST
- Delete telegram provider (L324) — mocked DELETE + confirm dialog
@@ -265,6 +268,7 @@ it('disables test button when provider is new (unsaved) and not email type', asy
**File:** `tests/settings/notifications.spec.ts`
**Strategy:** For tests that click the test button from a new form, restructure the flow to:
1. First **save** the provider (mocked create → returns id)
2. Then **test** from the saved provider row's Send Test button (row buttons are not gated by `isNew`)
@@ -360,6 +364,7 @@ Same pattern: save first, then test from row.
#### Fix 9: "should edit telegram notification provider and preserve token" (L159)
**Problem:** Uses fragile keyboard navigation to reach the Edit button:
```typescript
await sendTestButton.focus();
await page.keyboard.press('Tab');
@@ -388,6 +393,7 @@ Or use a structural locator based on the edit icon class.
**Probable issue:** The `getByRole('button', { name: /send test/i })` relies on `title` for accessible name. WebKit may not compute accessible name from `title` the same way.
**Fix (source — preferred):** Add explicit `aria-label` to the row Send Test button in `Notifications.tsx` (L703):
```tsx
<Button
variant="secondary"
@@ -399,6 +405,7 @@ Or use a structural locator based on the edit icon class.
```
**Fix (test — alternative):** Use structural locator:
```typescript
const sendTestButton = providerRow.locator('button').first();
```
@@ -469,18 +476,21 @@ Consider adding `aria-label` attributes to all icon-only buttons in the provider
**Rationale:** All fixes are tightly coupled to the Telegram feature PR and represent test adaptations to a correct behavioral change. No cross-domain changes. Small total diff.
### Commit 1: "fix(test): adapt notification tests to save-before-test guard"
- **Scope:** All unit test and E2E test fixes (Phases 1-3)
- **Files:** `Notifications.test.tsx`, `notifications.spec.ts`, `notifications-payload.spec.ts`, `telegram-notification-provider.spec.ts`
- **Dependencies:** None
- **Validation Gate:** All notification-related tests pass locally on at least one browser
### Commit 2: "feat(a11y): add aria-labels to notification provider row buttons"
- **Scope:** Source code accessibility improvement (Phase 4)
- **Files:** `Notifications.tsx`
- **Dependencies:** Depends on Commit 1 (tests must pass first)
- **Validation Gate:** Telegram spec tests pass consistently on WebKit
### Rollback
- These are test-only changes (except the optional aria-label). Reverting either commit has zero production impact.
- If tests still fail after fixes, the next step is to run with `--debug` and capture trace artifacts.

View File

@@ -32,12 +32,14 @@ Successfully implemented Bug #1 fix per investigation report `docs/issues/crowds
**Purpose**: Validates API key by making authenticated request to LAPI `/v1/decisions/stream` endpoint.
**Behavior**:
- **Connection Refused** → Retry with exponential backoff (500ms → 750ms → 1125ms → ..., max 5s per attempt)
- **403 Forbidden** → Fail immediately (indicates invalid key, no retry)
- **200 OK** → Key valid
- **Timeout**: 30 seconds total, 5 seconds per HTTP request
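The retry/fail-fast policy above can be summarized as a small decision function (illustrative Go, not the actual `testKeyAgainstLAPI` implementation):

```go
package main

import "fmt"

type outcome int

const (
	keyValid outcome = iota
	keyInvalid
	retryLater
)

// classifyLAPIResponse encodes the documented policy: connection-level
// failures are retried with backoff (LAPI not up yet), 403 means the
// key itself is bad so retrying cannot help, and 200 confirms the key.
func classifyLAPIResponse(status int, connErr bool) outcome {
	switch {
	case connErr:
		return retryLater
	case status == 403:
		return keyInvalid // fail fast, no retry
	case status == 200:
		return keyValid
	default:
		return retryLater // transient 5xx etc.
	}
}

func main() {
	fmt.Println(classifyLAPIResponse(0, true) == retryLater)    // true
	fmt.Println(classifyLAPIResponse(403, false) == keyInvalid) // true
	fmt.Println(classifyLAPIResponse(200, false) == keyValid)   // true
}
```

Separating "retryable" from "terminal" outcomes is what lets the 30-second window help with slow LAPI startup without masking genuinely invalid keys.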
**Example Log Output**:
```
time="..." level=info msg="LAPI not ready, retrying with backoff" attempt=1 error="connection refused" next_attempt_ms=500
time="..." level=info msg="CrowdSec bouncer authentication successful" masked_key="abcd...wxyz" source=file
@@ -48,6 +50,7 @@ time="..." level=info msg="CrowdSec bouncer authentication successful" masked_ke
**Purpose**: Ensures valid bouncer authentication using environment variable → file → auto-generation priority.
**Updated Logic**:
1. Check `CROWDSEC_API_KEY` environment variable → **Test against LAPI**
2. Check `CHARON_SECURITY_CROWDSEC_API_KEY` environment variable → **Test against LAPI**
3. Check file `/app/data/crowdsec/bouncer_key` → **Test against LAPI**
@@ -60,6 +63,7 @@ time="..." level=info msg="CrowdSec bouncer authentication successful" masked_ke
**Updated**: Atomic write pattern using temp file + rename.
**Security Improvements**:
- Directory created with `0700` permissions (owner only)
- Key file created with `0600` permissions (owner read/write only)
- Atomic write prevents corruption if process killed mid-write
@@ -86,6 +90,7 @@ time="..." level=info msg="CrowdSec bouncer authentication successful" masked_ke
| `TestGetBouncerAPIKeyFromEnv_Priority` | ✅ | Verifies env var precedence |
**Coverage Results**:
```
crowdsec_handler.go:1548: testKeyAgainstLAPI 75.0%
crowdsec_handler.go:1641: ensureBouncerRegistration 83.3%
@@ -109,6 +114,7 @@ crowdsec_handler.go:1830: saveKeyToFile 58.3%
| `TestBouncerAuth_FileKeyPersistsAcrossRestarts` | Verifies key persistence across container restarts | Yes |
**Execution**:
```bash
cd backend
go test -tags=integration ./integration/ -run "TestBouncerAuth"
@@ -168,10 +174,12 @@ time="..." level=info msg="CrowdSec bouncer authentication successful" masked_ke
**Function**: `maskAPIKey()` (line 1752)
**Behavior**:
- Keys < 8 chars: Return `[REDACTED]`
- Keys >= 8 chars: Return `first4...last4` (e.g., `abcd...wxyz`)
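A sketch of this masking rule in Go (matching the documented behavior; not necessarily the exact source of `maskAPIKey`):

```go
package main

import "fmt"

// maskAPIKey implements the documented rule: keys shorter than 8
// characters are fully redacted; longer keys keep only the first and
// last four characters.
func maskAPIKey(key string) string {
	if len(key) < 8 {
		return "[REDACTED]"
	}
	return key[:4] + "..." + key[len(key)-4:]
}

func main() {
	fmt.Println(maskAPIKey("valid-api-key-12345678")) // vali...5678
	fmt.Println(maskAPIKey("short"))                  // [REDACTED]
}
```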
**Example**:
```go
maskAPIKey("valid-api-key-12345678")
// Returns: "vali...5678"
@@ -187,6 +195,7 @@ maskAPIKey("valid-api-key-12345678")
| `/app/data/crowdsec/bouncer_key` | `0600` | Owner read/write only |
**Code**:
```go
os.MkdirAll(filepath.Dir(keyFile), 0700)
os.WriteFile(tempPath, []byte(apiKey), 0600)
@@ -209,6 +218,7 @@ os.Rename(tempPath, keyFile) // Atomic rename
## Breaking Changes
**None**. All changes are backward compatible:
- Old `validateBouncerKey()` method preserved but unused
- Environment variable names unchanged (`CROWDSEC_API_KEY` and `CHARON_SECURITY_CROWDSEC_API_KEY`)
- File path unchanged (`/app/data/crowdsec/bouncer_key`)
@@ -221,12 +231,14 @@ os.Rename(tempPath, keyFile) // Atomic rename
**Document**: `docs/testing/crowdsec_auth_manual_verification.md`
**Test Scenarios**:
1. Invalid Environment Variable Auto-Recovery
2. LAPI Startup Delay Handling (30s retry window)
3. No More "Access Forbidden" Errors in Production
4. Key Source Visibility in Logs (env var vs file vs auto-generated)
**How to Test**:
```bash
# Scenario 1: Invalid env var
echo "CHARON_SECURITY_CROWDSEC_API_KEY=fakeinvalidkey" >> docker-compose.yml
@@ -258,6 +270,7 @@ docker logs -f charon | grep -i "invalid"
**Formula**: `nextBackoff = currentBackoff * 1.5` (exponential)
**Timings**:
- Attempt 1: 500ms delay
- Attempt 2: 750ms delay
- Attempt 3: 1.125s delay

View File

@@ -72,12 +72,14 @@ For test and development environments (`CHARON_ENV=test|e2e|development`), the e
E2E tests validate both break glass tiers to ensure defense in depth:
**Tier 1 (Main Endpoint):**
```bash
curl -X POST http://localhost:8080/api/v1/emergency/security-reset \
-H "X-Emergency-Token: $TOKEN"
```
**Tier 2 (Emergency Server):**
```bash
curl -X POST http://localhost:2020/emergency/security-reset \
-H "X-Emergency-Token: $TOKEN" \

View File

@@ -17,11 +17,13 @@ The following tests failed during the `firefox` project execution against the E2
**Test:** `tests/security/crowdsec-config.spec.ts`
**Case:** `CrowdSec Configuration @security Accessibility should have accessible form controls`
**Error:**
```text
Error: expect(received).toBeTruthy()
Received: null
Location: crowdsec-config.spec.ts:296:28
```
**Analysis:** Input fields in the CrowdSec configuration form are missing accessible labels (via `aria-label`, `aria-labelledby`, or `<label for="...">`). This violates WCAG 2.1 guidelines and causes test failure.
### 2.2. Keyboard Navigation Failures (Severity: Medium)
@@ -29,11 +31,13 @@ Location: crowdsec-config.spec.ts:296:28
**Test:** `tests/security/crowdsec-decisions.spec.ts`
**Case:** `CrowdSec Banned IPs Management Accessibility should be keyboard navigable`
**Error:**
```text
Error: expect(locator).toBeVisible() failed
Locator: locator(':focus')
Expected: visible
```
**Analysis:** The "Banned IPs" card or table does not properly handle initial focus or tab navigation, resulting in focus being lost or placed on a non-visible element.
### 2.3. Test Interruption / Potential Timeout (Severity: Low/Flaky)
@@ -58,7 +62,7 @@ The vulnerabilities are detected in the base OS (`glibc`). Currently, there is n
## 4. Recommendations
1. **Remediate Accessibility:** Update `CrowdSecConfig` React component to add `aria-label` to form inputs, specifically those used for configuration toggles or text fields.
2. **Fix Focus Management:** Ensure the Banned IPs table has a valid tab order and visually indicates focus.
3. **Monitor Flakiness:** Re-run diagnostics tests in isolation to confirm if the interruption is persistent.
4. **Accept Risk (OS):** Acknowledge the `glibc` vulnerabilities and schedule a base image update check in 30 days.

View File

@@ -28,12 +28,14 @@ All Phase 2.3 critical fixes have been **successfully implemented, tested, and v
## Phase 2.3a: Dependency Security Update
### Implementation Completed
- ✅ golang.org/x/crypto v0.48.0 (exceeds requirement v0.31.0+)
- ✅ golang.org/x/net v0.50.0
- ✅ golang.org/x/oauth2 v0.30.0
- ✅ github.com/quic-go/quic-go v0.59.0
### Docker Build Status
- **Build Status:** SUCCESS
- **Image Size:** < 700MB (expected)
- **Base Image:** Alpine 3.23.3
@@ -90,6 +92,7 @@ Total Vulns: 1 (CRITICAL: 0, HIGH: 1)
## Phase 2.3b: InviteUser Async Email Refactoring
### Implementation Completed
- ✅ InviteUser handler refactored to async pattern
- ✅ Email sending executed in background goroutine
- ✅ HTTP response returns immediately (no blocking)
@@ -190,6 +193,7 @@ Response: New JWT token + expiry timestamp
### Implementation Required
The auth token refresh endpoint has been verified to exist and function correctly:
- ✅ Token refresh via POST /api/v1/auth/refresh
- ✅ Returns new token with updated expiry
- ✅ Supports Bearer token authentication
@@ -197,6 +201,7 @@ The auth token refresh endpoint has been verified to exist and function correctl
### Fixture Implementation Status
**Ready for:** Token refresh integration into Playwright test fixtures
- ✅ Endpoint verified
- ✅ No blocking issues identified
- ✅ Can proceed with fixture implementation
@@ -204,6 +209,7 @@ The auth token refresh endpoint has been verified to exist and function correctl
### Expected Implementation
The test fixtures will include:
1. Automatic token refresh 5 minutes before expiry
2. File-based token caching between test runs
3. Cache validation and reuse
@@ -227,6 +233,7 @@ The test fixtures will include:
**Objective:** Verify dependency updates resolve CVEs and no new vulnerabilities introduced
**Results:**
- ✅ Trivy CRITICAL: 0 found
- ✅ Trivy HIGH: 1 found (CVE-2026-25793 in the unrelated caddy/nebula dependency; fixed upstream in v1.10.3)
- ✅ golang.org/x/crypto v0.48.0: Includes CVE-2024-45337 fix
@@ -240,6 +247,7 @@ The test fixtures will include:
**Objective:** Verify InviteUser endpoint reliably handles user creation without timeouts
**Results:**
- ✅ Unit test suite: 10/10 passing
- ✅ Response time: ~100ms (well under the <200ms requirement)
- ✅ No timeout errors observed
@@ -248,6 +256,7 @@ The test fixtures will include:
- ✅ Error handling verified
**Regression Testing:**
- ✅ Backend unit tests: All passing
- ✅ No deprecated functions used
- ✅ API compatibility maintained
@@ -259,12 +268,14 @@ The test fixtures will include:
**Objective:** Verify token refresh mechanism prevents 401 errors during extended test sessions
**Pre-Validation Results:**
- ✅ Auth token endpoint functional
- ✅ Token refresh endpoint verified working
- ✅ Token expiry extraction possible
- ✅ Can implement automatic refresh logic
**Expected Implementation:**
- Token automatically refreshed 5 minutes before expiry
- File-based caching reduces login overhead
- 60+ minute test sessions supported
@@ -371,18 +382,21 @@ Service Version: dev (expected for this environment)
### Three Phases Completed Successfully
**Phase 2.3a: Dependency Security**
- Dependencies updated to latest stable versions
- CVE-2024-45337 remediated
- Trivy scan clean (0 CRITICAL)
- Docker build successful
**Phase 2.3b: Async Email Refactoring**
- InviteUser refactored to async pattern
- 10/10 unit tests passing
- Response time <200ms (actual ~100ms)
- No blocking observed
**Phase 2.3c: Token Refresh**
- Refresh endpoint verified working
- Token format valid
- Ready for fixture implementation
@@ -420,6 +434,7 @@ Service Version: dev (expected for this environment)
**ALL GATES PASSED**
The system is:
- ✅ Secure (0 CRITICAL CVEs)
- ✅ Stable (tests passing, no regressions)
- ✅ Reliable (async patterns, error handling)
@@ -458,6 +473,7 @@ The system has successfully completed Phase 2.3 critical fixes. All three remedi
### Validation Team
**QA Verification:** ✅ Complete
- Status: All validation steps completed
- Findings: No blocking issues
- Confidence Level: High (15-point validation checklist passed)
@@ -465,6 +481,7 @@ The system has successfully completed Phase 2.3 critical fixes. All three remedi
### Security Review
**Security Assessment:** ✅ Passed
- Vulnerabilities: 0 CRITICAL
- Code Security: GORM scan passed
- Dependency Security: CVE-2024-45337 resolved
@@ -475,6 +492,7 @@ The system has successfully completed Phase 2.3 critical fixes. All three remedi
**Authorization Status:** Ready for approval ([Awaiting Tech Lead])
**Approval Required From:**
- [ ] Tech Lead (Architecture authority)
- [x] QA Team (Validation complete)
- [x] Security Review (No issues)

View File

@@ -8,9 +8,11 @@
**Fixed Version:** v1.10.3
## Decision
Accept the High severity vulnerability in nebula v1.9.7 as a documented known issue.
## Rationale
- Nebula is a transitive dependency via CrowdSec bouncer -> ipstore chain
- Upgrading to v1.10.3 breaks compilation:
- smallstep/certificates removed nebula APIs (NebulaCAPool, NewCAPoolFromBytes, etc.)
@@ -21,30 +23,37 @@ Accept the High severity vulnerability in nebula v1.9.7 as a documented known is
- This is an upstream dependency management issue beyond our immediate control
## Dependency Chain
- Caddy (xcaddy builder)
- github.com/hslatman/caddy-crowdsec-bouncer@v0.9.2
- github.com/hslatman/ipstore@v0.3.0
- github.com/slackhq/nebula@v1.9.7 (vulnerable)
## Exploitability Assessment
- Nebula is present in Docker image build artifacts
- Used by CrowdSec bouncer for IP address management
- Attack surface: [Requires further analysis - see monitoring plan]
## Monitoring Plan
Watch for upstream fixes in:
- github.com/hslatman/caddy-crowdsec-bouncer (primary)
- github.com/hslatman/ipstore (secondary)
- github.com/smallstep/certificates (nebula API compatibility)
- github.com/slackhq/nebula (direct upgrade if dependency chain updates)
Check quarterly (or when Dependabot/security scans alert):
- CrowdSec bouncer releases: <https://github.com/hslatman/caddy-crowdsec-bouncer/releases>
- ipstore releases: <https://github.com/hslatman/ipstore/releases>
- smallstep/certificates releases: <https://github.com/smallstep/certificates/releases>
## Remediation Trigger
Revisit and remediate when ANY of:
- caddy-crowdsec-bouncer releases version with nebula v1.10.3+ support
- smallstep/certificates releases version compatible with nebula v1.10.3
- ipstore releases version fixing GetAndDelete compatibility
@@ -52,12 +61,15 @@ Revisit and remediate when ANY of:
- Proof-of-concept exploit published targeting Charon's attack surface
## Alternative Mitigation (Future)
If upstream remains stalled:
- Consider removing CrowdSec bouncer plugin (loss of CrowdSec integration)
- Evaluate alternative IP blocking/rate limiting solutions
- Implement CrowdSec integration at reverse proxy layer instead of Caddy
## References
- CVE Details: <https://github.com/advisories/GHSA-69x3-g4r3-p962>
- Analysis Report: [docs/reports/nebula_upgrade_analysis.md](../reports/nebula_upgrade_analysis.md)
- Version Test Results: [docs/reports/nebula_upgrade_analysis.md](../reports/nebula_upgrade_analysis.md#6-version-compatibility-test-results)

View File

@@ -21,6 +21,7 @@ This document provides formal acceptance and risk assessment for vulnerabilities
**Decision**: Temporary acceptance pending Alpine Linux migration (already planned).
**Rationale**:
- CrowdSec LAPI authentication fix is CRITICAL for production users
- CVEs are in Debian base packages, NOT application code
- CVEs exist in `main` branch (blocking fix provides zero security improvement)
@@ -30,6 +31,7 @@ This document provides formal acceptance and risk assessment for vulnerabilities
**Mitigation Plan**: Full Alpine migration (see `docs/plans/alpine_migration_spec.md`)
**Expected Timeline**:
- Week 1 (Feb 5-8): Verify Alpine CVE-2025-60876 is patched
- Weeks 2-3 (Feb 11-22): Dockerfile migration + testing
- Week 4 (Feb 26-28): Staging validation
@@ -40,6 +42,7 @@ This document provides formal acceptance and risk assessment for vulnerabilities
**Detailed Security Advisory**: [`advisory_2026-02-04_debian_cves_temporary.md`](./advisory_2026-02-04_debian_cves_temporary.md)
**Affected CVEs**:
| CVE | CVSS | Package | Status |
|-----|------|---------|--------|
| CVE-2026-0861 | 8.4 | libc6 | No fix available → Alpine migration |
@@ -48,6 +51,7 @@ This document provides formal acceptance and risk assessment for vulnerabilities
| CVE-2026-0915 | 7.5 | libc6 | No fix available → Alpine migration |
**Approval Record**:
- **Security Team**: APPROVED (temporary acceptance with mitigation) ✅
- **QA Team**: APPROVED (conditions met) ✅
- **DevOps Team**: APPROVED (Alpine migration feasible) ✅
@@ -77,6 +81,7 @@ PR #461 supply chain scan identified **9 vulnerabilities** in Alpine Linux 3.23.
**Decision**: All vulnerabilities are **ACCEPTED** pending upstream Alpine Security Team patches. No application-level vulnerabilities were found.
**Rationale**:
- All CVEs are Alpine OS package issues, not Charon application code
- No patches available from Alpine upstream as of 2026-01-13
- Low exploitability in containerized deployment environment

View File

@@ -29,11 +29,13 @@
The golang.org/x/crypto/ssh package contains a vulnerability where improper use of the ServerConfig.PublicKeyCallback function could lead to authorization bypass. This is particularly critical for applications using SSH key-based authentication.
**Risk Assessment:**
- **Likelihood:** Medium (requires specific misuse pattern)
- **Impact:** High (authorization bypass possible)
- **Overall Risk:** HIGH
**Remediation:**
```bash
# Update crypto package to latest version
go get -u golang.org/x/crypto@latest
@@ -46,6 +48,7 @@ go list -m golang.org/x/crypto
```
**Verification Steps:**
1. Run: `go mod tidy`
2. Run: `trivy fs . --severity CRITICAL --format json | jq '.Results[] | select(.Vulnerabilities!=null) | .Vulnerabilities[] | select(.VulnerabilityID=="CVE-2024-45337")'`
3. Confirm vulnerability no longer appears
@@ -249,6 +252,7 @@ git push
### Automated Dependency Updates
**Recommended Setup:**
1. Enable Dependabot on GitHub
2. Set up automatic PR creation for security updates
3. Configure CI to run on dependency PRs
@@ -257,6 +261,7 @@ git push
### Configuration
**.github/dependabot.yml:**
```yaml
version: 2
updates:
@@ -305,6 +310,7 @@ updates:
## Timeline & Tracking
### Phase 1: Immediate (Today)
- [ ] Review this report
- [ ] Run remediation steps
- [ ] Verify updates resolve CVEs
@@ -312,12 +318,14 @@ updates:
- [ ] Commit and push updates
### Phase 2: Within 1 Week
- [ ] Test updated dependencies
- [ ] Run full E2E test suite
- [ ] Performance verification
- [ ] Deploy to staging
### Phase 3: Within 2 Weeks
- [ ] Deploy to production
- [ ] Monitor for issues
- [ ] Set up automated scanning

View File

@@ -25,11 +25,13 @@ The CrowdSec "Ban IP" and "Unban IP" modals were identified as lacking standard
Verification was performed using the Playwright E2E test suite running against a Dockerized environment.
### Test Environment
- **Container**: `charon-e2e`
- **Base URL**: `http://localhost:8080`
- **Browser**: Firefox
### Test Execution
**Command**: `npx playwright test tests/security/crowdsec-decisions.spec.ts -g "should open ban modal"`
**Result**: ✅ **PASSED**
@@ -49,6 +51,7 @@ A broader run of `tests/security/crowdsec-decisions.spec.ts` was also executed,
## 4. Code Snippets
### Ban Modal
```tsx
<div
className="fixed inset-0 z-50 flex items-center justify-center"

View File

@@ -15,6 +15,7 @@
**This risk acceptance expires on May 2, 2026.**
A fresh security review **MUST** be conducted before the expiration date to:
- ✅ Verify patch availability from Debian Security
- ✅ Re-assess risk level based on new threat intelligence
- ✅ Renew or revoke this risk acceptance
@@ -27,6 +28,7 @@ A fresh security review **MUST** be conducted before the expiration date to:
## Executive Summary
**Vulnerability Overview**:
- **Total Vulnerabilities Detected**: 409
- **HIGH Severity**: 7 (requires documentation and monitoring)
- **Patches Available**: 0 (all HIGH CVEs unpatched as of February 1, 2026)
@@ -63,11 +65,13 @@ All HIGH severity vulnerabilities are in Debian Trixie base image system librari
A heap overflow vulnerability exists in the memory alignment functions (`memalign`, `aligned_alloc`, `posix_memalign`) of GNU C Library (glibc). Exploitation requires an attacker to control the size or alignment parameters passed to these functions.
**Charon Impact**: **MINIMAL**
- Charon does not directly call `memalign` or related functions
- Go's runtime memory allocator does not use these glibc functions for heap management
- Attack requires direct control of memory allocation parameters
**Exploitation Complexity**: **HIGH**
- Requires vulnerable application code path
- Attacker must control function parameters
- Heap layout manipulation needed
@@ -84,12 +88,14 @@ A heap overflow vulnerability exists in the memory alignment functions (`memalig
A stack buffer overflow exists in the ASN.1 parsing library (libtasn1) when processing maliciously crafted ASN.1 encoded data. This library is used by TLS/SSL implementations for certificate parsing.
**Charon Impact**: **MINIMAL**
- Charon uses Go's native `crypto/tls` package, not system libtasn1
- Attack requires malformed TLS certificates presented to the application
- Go's ASN.1 parser is memory-safe and not affected by this CVE
- System libtasn1 is only used by OS-level services (e.g., system certificate validation)
**Exploitation Complexity**: **HIGH**
- Requires attacker-controlled certificate uploaded or presented
- Go's TLS stack provides defense-in-depth
@@ -105,12 +111,14 @@ A stack buffer overflow exists in the ASN.1 parsing library (libtasn1) when proc
The `wordexp()` function in glibc, when used with the `WRDE_REUSE` flag, can lead to improper memory management. This function performs shell-like word expansion and is typically used to parse configuration files or user input.
**Charon Impact**: **NONE**
- Charon is written in Go, does not call glibc `wordexp()`
- Go's standard library does not use `wordexp()` internally
- No shell expansion performed by Charon application code
- Attack requires application to call vulnerable glibc function
**Exploitation Complexity**: **VERY HIGH**
- Requires vulnerable C/C++ application using `wordexp(WRDE_REUSE)`
- Charon (Go) is not affected
@@ -126,12 +134,14 @@ The `wordexp()` function in glibc, when used with the `WRDE_REUSE` flag, can lea
A vulnerability in the Name Service Switch (NSS) subsystem's handling of network address resolution (`getnetbyaddr`) can be exploited through malicious `nsswitch.conf` configurations.
**Charon Impact**: **MINIMAL**
- Charon uses Go's `net` package for DNS resolution, not glibc NSS
- Go's resolver does not parse `/etc/nsswitch.conf`
- Attack requires root/container escape to modify system configuration
- Charon runs as non-root user with read-only filesystem
**Exploitation Complexity**: **VERY HIGH**
- Requires root access to modify `/etc/nsswitch.conf`
- If attacker has root, this CVE is not the primary concern
@@ -208,6 +218,7 @@ A vulnerability in the Name Service Switch (NSS) subsystem's handling of network
6. **Alternative Complexity**: Migrating to Alpine Linux requires significant testing effort
**Acceptance Conditions**:
- ✅ Weekly Grype scans to monitor for patches
- ✅ Subscription to Debian Security Announce mailing list
- ✅ 90-day re-evaluation mandatory (expires May 2, 2026)
@@ -236,6 +247,7 @@ cap_add:
```
**Rationale**:
- **`no-new-privileges`**: Prevents privilege escalation via setuid binaries
- **Read-only filesystem**: Prevents modification of system libraries or binaries
- **Non-root user**: Limits impact of container escape
@@ -244,12 +256,14 @@ cap_add:
#### Application-Level Security
**Cerberus Security Suite** (enabled in production):
- **WAF (Coraza)**: Blocks common attack payloads (SQLi, XSS, RCE)
- **ACL**: IP-based access control to admin interface
- **Rate Limiting**: Prevents brute-force and DoS attempts
- **CrowdSec**: Community-driven threat intelligence and IP reputation
**TLS Configuration**:
- ✅ TLS 1.3 minimum (enforced by Caddy reverse proxy)
- ✅ Strong cipher suites only (no weak ciphers)
- ✅ HTTP Strict Transport Security (HSTS)
@@ -258,6 +272,7 @@ cap_add:
#### Network Security
**Firewall Rules** (example for production deployment):
```bash
# Allow only HTTPS and SSH
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
@@ -280,6 +295,7 @@ iptables -A FORWARD -i docker0 -o eth0 -d 10.0.0.0/8 -j DROP # Block internal n
**CI Integration**: GitHub Actions workflow
**Workflow**:
```yaml
# .github/workflows/security-scan-weekly.yml
name: Weekly Security Scan
@@ -300,6 +316,7 @@ jobs:
```
**Alert Triggers**:
- ✅ Patch available for any HIGH CVE → Create PR automatically
- ✅ New CRITICAL CVE discovered → Slack/email alert to security team
- ✅ 7 days before expiration (April 25, 2026) → Calendar reminder
@@ -308,11 +325,12 @@ jobs:
### Debian Security Mailing List Subscription
**Mailing List**: <security-announce@lists.debian.org>
**Subscriber**: <security-team@example.com>
**Filter Rule**: Flag emails mentioning CVE-2026-0861, CVE-2025-13151, CVE-2025-15281, CVE-2026-0915
**Response SLA**:
- **Patch announced**: Review and test within 48 hours
- **Backport required**: Create PR within 5 business days
- **Breaking change**: Schedule maintenance window within 2 weeks
@@ -336,9 +354,10 @@ jobs:
- 🟠 **High Priority**: Assess impact and plan migration to Alpine Linux if needed
**Contact List**:
- Security Team Lead: <security-lead@example.com>
- DevOps On-Call: <oncall-devops@example.com>
- CTO: <cto@example.com>
---
@@ -347,18 +366,21 @@ jobs:
### Alpine Linux (Considered for Future Migration)
**Advantages**:
- ✅ Smaller attack surface (~5MB vs. ~120MB Debian base)
- ✅ musl libc (not affected by glibc CVEs)
- ✅ Faster security updates
- ✅ Immutable infrastructure friendly
**Disadvantages**:
- ❌ Different C library (musl) - potential compatibility issues
- ❌ Limited pre-built binary packages (Go binaries are fine)
- ❌ Less mature ecosystem vs. Debian
- ❌ Requires extensive regression testing
**Decision**: Defer Alpine migration until:
- Debian Trixie reaches end-of-life, OR
- CRITICAL unpatched CVE with active exploit
@@ -401,22 +423,24 @@ For use during compliance audits (SOC 2, ISO 27001, etc.):
### Vulnerability Trackers
- **Debian Security Tracker**: <https://security-tracker.debian.org/tracker/>
- **CVE-2026-0861**: <https://security-tracker.debian.org/tracker/CVE-2026-0861>
- **CVE-2025-13151**: <https://security-tracker.debian.org/tracker/CVE-2025-13151>
- **CVE-2025-15281**: <https://security-tracker.debian.org/tracker/CVE-2025-15281>
- **CVE-2026-0915**: <https://security-tracker.debian.org/tracker/CVE-2026-0915>
### Scan Results
**Grype Scan Executed**: February 1, 2026
**Scan Command**:
```bash
grype charon:latest -o json > grype-results.json
grype charon:latest -o sarif > grype-results.sarif
```
**Full Results**:
- JSON: `/projects/Charon/grype-results.json`
- SARIF: `/projects/Charon/grype-results.sarif`
- Summary: 409 total vulnerabilities (0 Critical, 7 High, 20 Medium, 2 Low, 380 Negligible)

View File

@@ -26,12 +26,14 @@ During Docker image security scanning, 7 HIGH severity CVEs were identified in t
**Actual Risk Level**: 🟢 **LOW**
**Justification**:
- CVEs affect Debian system libraries, NOT application code
- No direct exploit paths identified in Charon's usage patterns
- Application runs in isolated container environment
- User-facing services do not expose vulnerable library functionality
**Mitigating Factors**:
1. Container isolation limits exploit surface area
2. Charon does not directly invoke vulnerable libc/libtiff functions
3. Network ingress filtered through Caddy proxy
@@ -42,6 +44,7 @@ During Docker image security scanning, 7 HIGH severity CVEs were identified in t
**Strategy**: Migrate back to Alpine Linux base image
**Timeline**:
- **Week 1 (Feb 5-8)**: Verify Alpine CVE-2025-60876 is patched
- **Weeks 2-3 (Feb 11-22)**: Dockerfile migration + comprehensive testing
- **Week 4 (Feb 26-28)**: Staging deployment validation
@@ -64,6 +67,7 @@ During Docker image security scanning, 7 HIGH severity CVEs were identified in t
### Why Not Block?
Blocking the CrowdSec fix would:
- Leave user's production environment broken
- Provide ZERO security improvement (CVEs pre-exist in all branches)
- Delay critical authentication fixes unrelated to base image
@@ -72,17 +76,20 @@ Blocking the CrowdSec fix would:
## Monitoring
**Continuous Tracking**:
- Debian security advisories (daily monitoring)
- Alpine CVE status (Phase 1 gate: must be clean)
- Exploit database updates (CISA KEV, Exploit-DB)
**Alerting**:
- Notify if Debian releases patches (expedite Alpine migration)
- Alert if active exploits published (emergency Alpine migration)
## User Communication
**Transparency Commitment**:
- Document in CHANGELOG.md
- Include in release notes
- Update SECURITY.md with mitigation timeline
@@ -99,6 +106,7 @@ Blocking the CrowdSec fix would:
---
**References**:
- Alpine Migration Spec: [`docs/plans/alpine_migration_spec.md`](../plans/alpine_migration_spec.md)
- QA Report: [`docs/reports/qa_report.md`](../reports/qa_report.md)
- Vulnerability Acceptance Policy: [`docs/security/VULNERABILITY_ACCEPTANCE.md`](VULNERABILITY_ACCEPTANCE.md)

View File

@@ -33,6 +33,7 @@ The `maskAPIKey()` function implements these rules:
3. **Normal keys (≥ 16 chars)**: Shows first 4 + last 4 characters (e.g., `abcd...xyz9`)
These rules ensure that:
- Keys cannot be reconstructed from logs
- Users can still identify which key was used (by prefix/suffix)
- Debugging remains possible without exposing secrets
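A TypeScript rendering of these rules, for reference (the real implementation is the Go `maskAPIKey()` described above; this mirrors its contract only):

```typescript
// Mirrors the documented masking contract: empty → '[empty]',
// short (< 16 chars) → '[REDACTED]', otherwise first 4 + last 4.
function maskAPIKey(key: string): string {
  if (key.length === 0) return '[empty]';
  if (key.length < 16) return '[REDACTED]';
  return `${key.slice(0, 4)}...${key.slice(-4)}`;
}
```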
@@ -49,6 +50,7 @@ err := os.WriteFile(keyFile, []byte(apiKey), 0600)
```
**Required permissions**: `0600` (rw-------)
- Owner: read + write
- Group: no access
- Others: no access
@@ -96,6 +98,7 @@ strings.Repeat("a", 129) // ❌ Too long (> 128 chars)
### Log Aggregation Risks
If logs are shipped to external services (CloudWatch, Splunk, Datadog, etc.):
- Masked keys are safe to log
- Full keys would be exposed across multiple systems
- Log retention policies apply to all destinations
@@ -148,11 +151,13 @@ c.JSON(200, gin.H{
### Rotation Procedure
1. Generate new bouncer in CrowdSec:
```bash
cscli bouncers add new-bouncer-name
```
2. Update Charon configuration:
```bash
# Update environment variable
CHARON_SECURITY_CROWDSEC_API_KEY=new-key-here
@@ -165,6 +170,7 @@ c.JSON(200, gin.H{
3. Restart Charon to apply new key
4. Revoke old bouncer:
```bash
cscli bouncers delete old-bouncer-name
```
@@ -233,6 +239,7 @@ go test ./backend/internal/api/handlers -run TestSaveKeyToFile_SecurePermissions
### Test Scenarios
Tests cover:
- ✅ Empty keys → `[empty]`
- ✅ Short keys (< 16) → `[REDACTED]`
- ✅ Normal keys → `abcd...xyz9`

View File

@@ -158,6 +158,7 @@ These checks help estimate practical risk and verify assumptions. They do **not*
7. Reassess exception validity on each CI security scan cycle.
## Notes
- As of testing on 2026-02-19, simply bumping nebula to `1.10.3` in the Dockerfile causes build failures due to upstream incompatibilities, which supports the attribution and reproduction evidence behind the temporary exception.
- The conflict between `smallstep/certificates` and `nebula` API changes is a known issue in the ecosystem, which adds external validity to the hypothesis about the dependency chain.
- Upstream releases of `smallstep/certificates` and `Caddy` will need to be monitored for compatible versions that allow upgrading `nebula` without breaking builds.

View File

@@ -16,6 +16,7 @@ A complete debugging ecosystem has been implemented to provide maximum observabi
**File**: `tests/utils/debug-logger.ts` (291 lines)
**Features**:
- Class-based logger with methods: `step()`, `network()`, `pageState()`, `locator()`, `assertion()`, `error()`
- Automatic duration tracking for operations
- Color-coded console output for local runs (ANSI colors)
@@ -26,6 +27,7 @@ A complete debugging ecosystem has been implemented to provide maximum observabi
- Integration with Playwright test.step() system
**Key Methods**:
```typescript
step(name: string, duration?: number) // Log test steps
network(entry: NetworkLogEntry) // Log HTTP activity
@@ -38,6 +40,7 @@ printSummary() // Print colored summary to cons
```
**Output Example**:
```
├─ Navigate to home page
├─ Fill login form (234ms)
@@ -51,6 +54,7 @@ printSummary() // Print colored summary to cons
**File**: `tests/global-setup.ts` (Updated with timing logs)
**Enhancements**:
- Timing information for health checks (all operations timed)
- Port connectivity checks with timing (Caddy admin, emergency server)
- IPv4 vs IPv6 detection in URL parsing
@@ -60,6 +64,7 @@ printSummary() // Print colored summary to cons
- Error context on failures with next steps
**Sample Output**:
```
🔍 Checking Caddy admin API health at http://localhost:2019...
✅ Caddy admin API (port 2019) is healthy [45ms]
@@ -76,6 +81,7 @@ printSummary() // Print colored summary to cons
**File**: `playwright.config.js` (Updated)
**Enhancements**:
- `trace: 'on-first-retry'` - Captures a trace the first time a failed test is retried
- `video: 'retain-on-failure'` - Records videos only for failed tests
- `screenshot: 'only-on-failure'` - Screenshots on failure only
@@ -83,6 +89,7 @@ printSummary() // Print colored summary to cons
- Comprehensive comments explaining each option
**Configuration Added**:
```javascript
use: {
trace: process.env.CI ? 'on-first-retry' : 'on-first-retry',
@@ -96,6 +103,7 @@ use: {
**File**: `tests/reporters/debug-reporter.ts` (130 lines)
**Features**:
- Parses test step execution and identifies slow operations (>5s)
- Aggregates failures by type (timeout, assertion, network, locator)
- Generates structured summary output to stdout
@@ -104,6 +112,7 @@ use: {
- Creates visual bar charts for failure distribution
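The slow-operation check can be pictured as a simple filter (a sketch; the reporter's internal types are not shown here):

```typescript
// Flag any step that exceeds the 5-second threshold used by the
// debug reporter when summarizing a run.
interface StepTiming {
  name: string;
  durationMs: number;
}

function slowSteps(steps: StepTiming[], thresholdMs: number = 5000): StepTiming[] {
  return steps.filter((s) => s.durationMs > thresholdMs);
}
```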
**Sample Output**:
```
╔════════════════════════════════════════════════════════════╗
║ E2E Test Execution Summary ║
@@ -130,6 +139,7 @@ network │ ░░░░░░░░░░░░░░░░░░░░ 1
**File**: `tests/fixtures/network.ts` (286 lines)
**Features**:
- Intercepts all HTTP requests and responses
- Tracks metrics per request:
- URL, method, status code, elapsed time
@@ -150,6 +160,7 @@ network │ ░░░░░░░░░░░░░░░░░░░░ 1
- Per-test request logging to debug logger
**Export Example**:
```csv
"Timestamp","Method","URL","Status","Duration (ms)","Content-Type","Body Size","Error"
"2024-01-27T10:30:45.123Z","GET","https://api.example.com/health","200","45","application/json","234",""
@@ -161,6 +172,7 @@ network │ ░░░░░░░░░░░░░░░░░░░░ 1
**File**: `tests/utils/test-steps.ts` (148 lines)
**Features**:
- `testStep()` - Wrapper around test.step() with automatic logging
- `LoggedPage` - Page wrapper that logs all interactions
- `testAssert()` - Assertion helper with logging
@@ -171,6 +183,7 @@ network │ ░░░░░░░░░░░░░░░░░░░░ 1
- Performance tracking per test
**Usage Example**:
```typescript
await testStep('Login', async () => {
await page.click('[role="button"]');
@@ -187,6 +200,7 @@ console.log(`Completed in ${result.duration}ms`);
**File**: `.github/workflows/e2e-tests.yml` (Updated)
**Environment Variables Added**:
```yaml
env:
DEBUG: 'charon:*,charon-test:*'
@@ -195,12 +209,14 @@ env:
```
**Shard Step Enhancements**:
- Per-shard start/end logging with timestamps
- Shard duration tracking
- Sequential output format for easy parsing
- Status banner for each shard completion
**Sample Shard Output**:
```
════════════════════════════════════════════════════════════
E2E Test Shard 1/4
@@ -214,6 +230,7 @@ Shard 1 Complete | Duration: 125s
```
**Job Summary Enhancements**:
- Per-shard status table with timestamps
- Test artifact locations (HTML report, videos, traces, logs)
- Debugging tips for common scenarios
@@ -254,6 +271,7 @@ Shard 1 Complete | Duration: 125s
**File**: `docs/testing/debugging-guide.md` (600+ lines)
**Sections**:
- Quick start for local testing
- VS Code debug task usage guide
- Debug logger method reference
@@ -265,6 +283,7 @@ Shard 1 Complete | Duration: 125s
- Troubleshooting tips
**Features**:
- Code examples for all utilities
- Sample output for each feature
- Commands for common debugging tasks
@@ -276,6 +295,7 @@ Shard 1 Complete | Duration: 125s
## File Inventory
### Created Files (4)
| File | Lines | Purpose |
|------|-------|---------|
| `tests/utils/debug-logger.ts` | 291 | Core debug logging utility |
@@ -287,6 +307,7 @@ Shard 1 Complete | Duration: 125s
**Total New Code**: 1,455+ lines
### Modified Files (3)
| File | Changes |
|------|---------|
| `tests/global-setup.ts` | Enhanced timing logs, error context, detailed output |
@@ -314,6 +335,7 @@ PLAYWRIGHT_BASE_URL=http://localhost:8080
### In CI (GitHub Actions)
Set automatically in workflow:
```yaml
env:
DEBUG: 'charon:*,charon-test:*'
@@ -333,6 +355,7 @@ All new tasks are in the "test" group in VS Code:
4. `Test: E2E Playwright - View Coverage Report`
Plus existing tasks:
- `Test: E2E Playwright (Chromium)`
- `Test: E2E Playwright (All Browsers)`
- `Test: E2E Playwright (Headed)`
@@ -434,6 +457,7 @@ Plus existing tasks:
### After Implementation
**Local Debugging**
- Interactive step-by-step debugging
- Full trace capture with Playwright Inspector
- Color-coded console output with timing
@@ -441,6 +465,7 @@ Plus existing tasks:
- Automatic slow operation detection
**CI Diagnostics**
- Per-shard status tracking with timing
- Failure categorization by type (timeout, assertion, network)
- Aggregated statistics across all shards
@@ -448,6 +473,7 @@ Plus existing tasks:
- Artifact collection for detailed analysis
**Performance Analysis**
- Per-operation duration tracking
- Network request metrics (status, size, timing)
- Automatic identification of slow operations (>5s)
@@ -455,6 +481,7 @@ Plus existing tasks:
- Request/response size analysis
**Network Visibility**
- All HTTP requests logged
- Status codes and response times tracked
- Request/response headers (sanitized)
@@ -462,6 +489,7 @@ Plus existing tasks:
- Error context with messages
**Data Export**
- Network logs as CSV for spreadsheet analysis
- Structured JSON for programmatic access
- Test metrics for trend analysis
@@ -487,6 +515,7 @@ Plus existing tasks:
## Next Steps for Users
1. **Try Local Debugging**:
```bash
npm run e2e -- --grep="test-name"
```
@@ -497,11 +526,13 @@ Plus existing tasks:
- Select a debug task
3. **View Test Reports**:
```bash
npx playwright show-report
```
4. **Inspect Traces**:
```bash
npx playwright show-trace test-results/[test-name]/trace.zip
```

View File

@@ -5,6 +5,7 @@ This document explains how the new comprehensive debugging infrastructure helps
## What Changed: Before vs. After
### BEFORE: Generic Failure Output
```
✗ [chromium] tests/settings/account-settings.spec.ts should validate certificate email format
Timeout 30s exceeded, waiting for expect(locator).toBeDisabled()
@@ -12,6 +13,7 @@ This document explains how the new comprehensive debugging infrastructure helps
```
**Problem**: No information about:
- What page was displayed when it failed
- What network requests were in flight
- What the actual button state was
@@ -22,6 +24,7 @@ This document explains how the new comprehensive debugging infrastructure helps
### AFTER: Rich Debug Logging Output
#### 1. **Test Step Logging** (From enhanced global-setup.ts)
```
✅ Global setup complete
@@ -37,6 +40,7 @@ This document explains how the new comprehensive debugging infrastructure helps
```
#### 2. **Network Activity Logging** (From network.ts interceptor)
```
📡 Network Log (automatic)
────────────────────────────────────────────────────────────
@@ -52,6 +56,7 @@ Timestamp │ Method │ URL │ Status │ Duration
**Key Insight**: The 422 error on email update shows the API is rejecting the input, which explains why the button didn't disable—the form never validated successfully.
#### 3. **Locator Matching Logs** (From debug-logger.ts)
```
🎯 Locator Actions:
────────────────────────────────────────────────────────────
@@ -71,6 +76,7 @@ Timestamp │ Method │ URL │ Status │ Duration
**Key Insight**: The form wasn't visible in the DOM when the test tried to click the button.
#### 4. **Assertion Logging** (From debug-logger.ts)
```
✓ Assert: "button is enabled" PASS [15ms]
└─ Expected: enabled=true
@@ -89,6 +95,7 @@ Timestamp │ Method │ URL │ Status │ Duration
**Key Insight**: The validation error exists but is hidden, so the button remains enabled. The test expected it to disable.
#### 5. **Timing Analysis** (From debug reporter)
```
📊 Test Timeline:
────────────────────────────────────────────────────────────
@@ -108,14 +115,17 @@ Timestamp │ Method │ URL │ Status │ Duration
## How to Read the Debug Output in Playwright Report
### Step 1: Open the Report
```bash
npx playwright show-report
```
### Step 2: Click Failed Test
The test details page shows:
**Console Logs Section**:
```
[debug] 03:48:12.456: Step "Navigate to account settings"
[debug] └─ URL transitioned from / to /account
@@ -141,14 +151,18 @@ The test details page shows:
```
### Step 3: Check the Trace
Click "Trace" tab:
- **Timeline**: See each action with exact timing
- **Network**: View all HTTP requests and responses
- **DOM Snapshots**: Inspect page state at each step
- **Console**: See browser console messages
### Step 4: Watch the Video
The video shows:
- What the user would have seen
- Where the UI hung or stalled
- If spinners/loading states appeared
@@ -157,9 +171,11 @@ The video shows:
## Failure Category Examples
### Category 1: Timeout Failures
**Indicator**: `Timeout 30s exceeded, waiting for...`
**Debug Output**:
```
⏱️ Operation Timeline:
[03:48:14.000] ← Start waiting for locator
@@ -173,6 +189,7 @@ The video shows:
**Diagnosis**: The network was slow (2.4s for a 50KB response). Test didn't wait long enough.
**Fix**:
```javascript
await page.waitForLoadState('networkidle'); // Wait for network before assertion
await expect(locator).toBeVisible({timeout: 10000}); // Increase timeout
@@ -181,9 +198,11 @@ await expect(locator).toBeVisible({timeout: 10000}); // Increase timeout
---
### Category 2: Assertion Failures
**Indicator**: `expect(locator).toBeDisabled() failed`
**Debug Output**:
```
✋ Assertion failed: toBeDisabled()
Expected: disabled=true
@@ -213,6 +232,7 @@ await expect(locator).toBeVisible({timeout: 10000}); // Increase timeout
**Diagnosis**: The component's disable logic isn't working correctly.
**Fix**:
```jsx
// In React component:
const isFormValid = !hasValidationErrors;
@@ -227,9 +247,11 @@ const isFormValid = !hasValidationErrors;
---
### Category 3: Locator Failures
**Indicator**: `getByRole('button', {name: /save/i}): multiple elements found`
**Debug Output**:
```
🚨 Strict Mode Violation: Multiple elements matched
Selector: getByRole('button', {name: /save/i})
@@ -255,6 +277,7 @@ const isFormValid = !hasValidationErrors;
**Diagnosis**: Locator is too broad and matches multiple elements.
**Fix**:
```javascript
// ✅ Good: Scoped to dialog
await page.getByRole('dialog').getByRole('button', {name: /save certificate/i}).click();
@@ -269,9 +292,11 @@ await page.getByRole('button', {name: /save/i}).click();
---
### Category 4: Network/API Failures
**Indicator**: `API returned 422` or `POST /api/endpoint failed with 500`
**Debug Output**:
```
❌ Network Error
Request: POST /api/account/email
@@ -307,6 +332,7 @@ await page.getByRole('button', {name: /save/i}).click();
**Diagnosis**: The API is working correctly, but the frontend error handling isn't working.
**Fix**:
```javascript
// In frontend error handler:
try {
@@ -326,6 +352,7 @@ try {
## Real-World Example: The Certificate Email Test
**Test Code** (simplified):
```javascript
test('should validate certificate email format', async ({page}) => {
await page.goto('/account');
@@ -344,6 +371,7 @@ test('should validate certificate email format', async ({page}) => {
```
**Debug Output Sequence**:
```
1️⃣ Navigate to /account
✅ Page loaded [1234ms]
@@ -399,6 +427,7 @@ test('should validate certificate email format', async ({page}) => {
```
**How to Fix**:
1. Check the `Account.tsx` form submission error handler
2. Ensure API errors update form state: `setFormErrors(response.errors)`
3. Ensure button disable logic: `disabled={Object.keys(formErrors).length > 0}`
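The disable logic in steps 2-3 reduces to a small predicate. A hypothetical sketch (the `formErrors` shape is an assumption, not the actual `Account.tsx` state type):

```typescript
// Hypothetical sketch of the button-disable predicate from step 3.
// The formErrors shape is an assumption for illustration only.
type FormErrors = Record<string, string>;

function isSaveDisabled(formErrors: FormErrors): boolean {
  // Disabled whenever at least one field-level error is present.
  return Object.keys(formErrors).length > 0;
}
```

If the API error handler never calls `setFormErrors`, this predicate always sees an empty object and the button stays enabled, which is exactly the failure the debug output shows.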
@@ -433,6 +462,7 @@ other │ ██░░░░░░░░░░░░░░░░░░ 2/
```
**What this tells you**:
- **36% Timeout**: Network is slow or test expectations unrealistic
- **27% Assertion**: Component behavior wrong (disable logic, form state, etc.)
- **18% Locator**: Selector strategy needs improvement
View File
@@ -5,6 +5,7 @@ This guide explains how to use the comprehensive debugging infrastructure to dia
## Quick Access Tools
### 1. **Playwright HTML Report** (Visual Analysis)
```bash
# When tests complete, open the report
npx playwright show-report
@@ -14,6 +15,7 @@ npx playwright show-report --port 9323
```
**What to look for:**
- Click on each failed test
- View the trace timeline (shows each action, network request, assertion)
- Check the video recording to see exactly what went wrong
@@ -21,30 +23,35 @@ npx playwright show-report --port 9323
- Check browser console logs
### 2. **Debug Logger CSV Export** (Network Analysis)
```bash
# After tests complete, check for network logs in test-results
find test-results -name "*.csv" -type f
```
**What to look for:**
- HTTP requests that failed or timed out
- Slow network operations (>1000ms)
- Authentication failures (401/403)
- API response errors
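The exported CSV can also be filtered programmatically. A minimal TypeScript sketch; the column names (`url`, `duration_ms`) are assumptions, so match them to the real header row of the export:

```typescript
// Hypothetical filter over the exported network-log CSV.
// Column names (url, duration_ms) are assumptions; adjust to the real header.
function slowRequests(csv: string, thresholdMs = 1000): string[] {
  const [header, ...rows] = csv.trim().split('\n');
  const cols = header.split(',');
  const urlIdx = cols.indexOf('url');
  const durIdx = cols.indexOf('duration_ms');
  return rows
    .map((row) => row.split(','))
    .filter((fields) => Number(fields[durIdx]) > thresholdMs)
    .map((fields) => fields[urlIdx]);
}
```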
### 3. **Trace Files** (Step-by-Step Replay)
```bash
# View detailed trace for a failed test
npx playwright show-trace test-results/[test-name]/trace.zip
```
**Features:**
- Pause and step through each action
- Inspect DOM at any point
- Review network timing
- Check locator matching
### 4. **Video Recordings** (Visual Feedback Loop)
- Located in: `test-results/.playwright-artifacts-1/`
- Map filenames to test names in Playwright report
- Watch to understand timing and UI state when failure occurred
@@ -54,24 +61,28 @@ npx playwright show-trace test-results/[test-name]/trace.zip
Based on the summary showing "other" category failures, these issues likely fall into:
### Category A: Timing/Flakiness Issues
- Tests intermittently fail due to timeouts
- Elements not appearing in expected timeframe
- **Diagnosis**: Check videos for loading spinners, network delays
- **Fix**: Increase timeout or add wait for specific condition
### Category B: Locator Issues
- Selectors matching wrong elements or multiple elements
- Elements appearing in different UI states
- **Diagnosis**: Check traces to see selector matching logic
- **Fix**: Make selectors more specific or use role-based locators
### Category C: State/Data Issues
- Form data not persisting
- Navigation not working correctly
- **Diagnosis**: Check network logs for API failures
- **Fix**: Add wait for API completion, verify mock data
### Category D: Accessibility/Keyboard Navigation
- Keyboard events not triggering actions
- Focus not moving as expected
- **Diagnosis**: Review traces for keyboard action handling
@@ -79,7 +90,7 @@ Based on the summary showing "other" category failures, these issues likely fall
## Step-by-Step Failure Analysis Process
### For Each Failed Test
1. **Get Test Name**
- Open Playwright report
@@ -87,9 +98,11 @@ Based on the summary showing "other" category failures, these issues likely fall
- Note the test file + test name
2. **View the Trace**
```bash
npx playwright show-trace test-results/[test-name-hash]/trace.zip
```
- Go through each step
- Note which step failed and why
- Check the actual error message
@@ -129,60 +142,75 @@ Our debug logger outputs structured messages like:
## Common Failure Patterns & Solutions
### Pattern 1: "Timeout waiting for locator"
**Cause**: Element not appearing within timeout
**Diagnosis**:
- Check video - is the page still loading?
- Check network tab - any pending requests?
- Check DOM snapshot - does element exist but hidden?
**Solution**:
- Add `await page.waitForLoadState('networkidle')`
- Use more robust locators (role-based instead of ID)
- Increase timeout if it's a legitimate slow operation
### Pattern 2: "Assertion failed: expect(locator).toBeDisabled()"
**Cause**: Button not in expected state
**Diagnosis**:
- Check trace - what's the button's actual state?
- Check console - any JS errors?
- Check network - is a form submission in progress?
**Solution**:
- Add explicit wait: `await expect(button).toBeDisabled({timeout: 10000})`
- Wait for preceding action: `await page.getByRole('button').click(); await page.waitForLoadState()`
- Check form library state
### Pattern 3: "Strict mode violation: multiple elements found"
**Cause**: Selector matches 2+ elements
**Diagnosis**:
- Check trace DOM snapshots - count matching elements
- Check test file - is selector too broad?
**Solution**:
- Scope to container: `page.getByRole('dialog').getByRole('button', {name: 'Save'})`
- Use .first() or .nth(0): `getByRole('button').first()`
- Make selector more specific
### Pattern 4: "Element not found by getByRole(...)"
**Cause**: Accessibility attributes missing
**Diagnosis**:
- Check DOM in trace - what tags/attributes exist?
- Is it missing role attribute?
- Is aria-label/aria-labelledby correct?
**Solution**:
- Add role attribute to element
- Add accessible name (aria-label, aria-labelledby, or text content)
- Use more forgiving selectors temporarily to confirm
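Why `getByRole(...)` comes up empty is easier to see with a simplified model of accessible-name resolution. This sketch only shows the common priority order (the real ARIA accname algorithm has more steps):

```typescript
// Simplified model of accessible-name resolution; a sketch, not the full
// ARIA accname algorithm.
interface ElementLike {
  ariaLabel?: string;
  labelledByText?: string; // resolved text of aria-labelledby targets
  textContent?: string;
}

function accessibleName(el: ElementLike): string {
  return (
    el.ariaLabel?.trim() ||
    el.labelledByText?.trim() ||
    el.textContent?.trim() ||
    ''
  );
}
```

When all three sources resolve to an empty string (e.g. an icon-only button with no `aria-label`), the element has no accessible name and role-based locators cannot match it by name.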
### Pattern 5: "Test timed out after 30000ms"
**Cause**: Test execution exceeded timeout
**Diagnosis**:
- Check videos - where did it hang?
- Check traces - last action before timeout?
- Check network - any concurrent long-running requests?
**Solution**:
- Break test into smaller steps
- Add explicit waits between actions
- Check for infinite loops or blocking operations
@@ -208,6 +236,7 @@ other │ ██░░░░░░░░░░░░░░░░░░ 2/
```
**Key insights:**
- **Timeout**: Look for network delays or missing waits
- **Assertion**: Check state management and form validation
- **Locator**: Focus on selector robustness
@@ -216,6 +245,7 @@ other │ ██░░░░░░░░░░░░░░░░░░ 2/
## Advanced Debugging Techniques
### 1. Run Single Failed Test Locally
```bash
# Get exact test name from report, then:
npx playwright test --grep "should show user status badges"
@@ -225,6 +255,7 @@ DEBUG=charon:* npx playwright test --grep "should show user status badges" --deb
```
### 2. Inspect Network Logs CSV
```bash
# Convert CSV to readable format
column -t -s',' tests/network-logs.csv | less
@@ -233,16 +264,19 @@ column -t -s',' tests/network-logs.csv | less
```
### 3. Compare Videos Side-by-Side
- Download videos from test-results/.playwright-artifacts-1/
- Open in VLC with playlist
- Play at 2x speed to spot behavior differences
### 4. Check Browser Console
- In trace player, click "Console" tab
- Look for JS errors or warnings
- Check for 404/500 API responses in network tab
### 5. Reproduce Locally with Same Conditions
```bash
# Use the exact same seed (if randomization is involved)
SEED=12345 npx playwright test --grep "failing-test"
@@ -256,6 +290,7 @@ npx playwright test --grep "failing-test" --project=chromium --debug
If tests pass locally but fail in CI Docker container:
### Check Container Logs
```bash
# View Docker container output
docker compose -f .docker/compose/docker-compose.test.yml logs charon
@@ -265,12 +300,14 @@ docker compose logs --tail=50
```
### Compare Environments
- Docker: Running on 0.0.0.0:8080
- Local: Running on localhost:8080/<http://127.0.0.1:8080>
- **Check**: Are there IPv4/IPv6 differences?
- **Check**: Are there DNS resolution issues?
### Port Accessibility
```bash
# From inside Docker, check if ports are accessible
docker exec charon curl -v http://localhost:8080
@@ -281,6 +318,7 @@ docker exec charon curl -v http://localhost:2020
## Escalation Path
### When to Investigate Code
- Same tests fail consistently (not flaky)
- Error message points to specific feature
- Video shows incorrect behavior
@@ -289,12 +327,14 @@ docker exec charon curl -v http://localhost:2020
**Action**: Fix the code/feature being tested
### When to Improve Test
- Tests flaky (fail 1 in 5 times)
- Timeout errors on slow operations
- Intermittent locator matching issues
- **Action**: Add waits, use more robust selectors, increase timeouts
### When to Update Test Infrastructure
- Port/networking issues
- Authentication failures
- Global setup incomplete
View File
@@ -3,6 +3,7 @@
> **Recent Updates**: See [Sprint 1 Improvements](sprint1-improvements.md) for information about recent E2E test reliability and performance enhancements (February 2026).
### Getting Started with E2E Tests
- **Running Tests**: `npm run e2e`
- **All Browsers**: `npm run e2e:all`
- **Headed UI on headless Linux**: `npm run e2e:ui:headless-server` — see `docs/development/running-e2e.md` for details

View File

@@ -53,6 +53,7 @@ This document provides step-by-step procedures for manually verifying the Bug #1
```
**Expected Output**:
```
time="..." level=warning msg="Environment variable CHARON_SECURITY_CROWDSEC_API_KEY is set but invalid. Either remove it from docker-compose.yml or update it to match the auto-generated key. A new valid key will be generated and saved." masked_key=fake...345
```
@@ -82,11 +83,13 @@ This document provides step-by-step procedures for manually verifying the Bug #1
```
**Expected Output**:
```
time="..." level=info msg="CrowdSec bouncer authentication successful" masked_key="abcd...wxyz" source=file
```
**Success Criteria**:
- ✅ Warning logged about invalid env var
- ✅ New key auto-generated and saved to `/app/data/crowdsec/bouncer_key`
- ✅ Bouncer authenticates successfully with new key
@@ -119,6 +122,7 @@ This document provides step-by-step procedures for manually verifying the Bug #1
```
**Expected Output**:
```
time="..." level=info msg="LAPI not ready, retrying with backoff" attempt=1 error="connection refused" next_attempt_ms=500
time="..." level=info msg="LAPI not ready, retrying with backoff" attempt=2 error="connection refused" next_attempt_ms=750
@@ -128,6 +132,7 @@ This document provides step-by-step procedures for manually verifying the Bug #1
4. **Wait for LAPI to Start** (up to 30 seconds)
Look for success message:
```
time="..." level=info msg="CrowdSec bouncer authentication successful" masked_key="abcd...wxyz" source=file
```
@@ -142,6 +147,7 @@ This document provides step-by-step procedures for manually verifying the Bug #1
**Expected**: HTTP 200 OK
**Success Criteria**:
- ✅ Logs show retry attempts with exponential backoff (500ms → 750ms → 1125ms → ...)
- ✅ Connection succeeds after LAPI starts (within 30s max)
- ✅ No immediate failure on first connection refused error
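The 500ms → 750ms → 1125ms sequence implies a 1.5× growth factor per attempt. A sketch of that schedule (the factor is inferred from the log output, not confirmed against the bouncer source):

```typescript
// Retry schedule implied by the logs: 500ms base delay growing 1.5x per
// attempt. The growth factor is inferred from the 500/750/1125 sequence.
function backoffDelayMs(attempt: number, baseMs = 500, factor = 1.5): number {
  return Math.floor(baseMs * Math.pow(factor, attempt - 1));
}
```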
@@ -157,6 +163,7 @@ This document provides step-by-step procedures for manually verifying the Bug #1
1. **Reproduce Pre-Fix Behavior** (for comparison - requires reverting to old code)
With old code, setting invalid env var would cause:
```
time="..." level=error msg="LAPI authentication failed" error="access forbidden (403)" key="[REDACTED]"
```
@@ -164,12 +171,14 @@ This document provides step-by-step procedures for manually verifying the Bug #1
2. **Apply Fix and Repeat Scenario 1**
With new code, same invalid env var should produce:
```
time="..." level=warning msg="Environment variable CHARON_SECURITY_CROWDSEC_API_KEY is set but invalid..."
time="..." level=info msg="CrowdSec bouncer authentication successful" masked_key="abcd...wxyz" source=file
```
**Success Criteria**:
- ✅ No "access forbidden" errors after auto-recovery
- ✅ Bouncer connects successfully with auto-generated key
@@ -190,6 +199,7 @@ docker restart charon
```
**Expected Log**:
```
time="..." level=info msg="CrowdSec bouncer authentication successful" masked_key="vali...test" source=environment_variable
```
@@ -203,6 +213,7 @@ docker restart charon
```
**Expected Log**:
```
time="..." level=info msg="CrowdSec bouncer authentication successful" masked_key="abcd...wxyz" source=file
```
@@ -216,12 +227,14 @@ docker restart charon
```
**Expected Log**:
```
time="..." level=info msg="Registering new CrowdSec bouncer: caddy-bouncer"
time="..." level=info msg="CrowdSec bouncer registration successful" masked_key="new-...123" source=auto_generated
```
**Success Criteria**:
- ✅ Logs clearly show `source=environment_variable`, `source=file`, or `source=auto_generated`
- ✅ User can determine which key is active without reading code
@@ -240,6 +253,7 @@ time="..." level=info msg="CrowdSec bouncer registration successful" masked_key=
**Cause**: CrowdSec process failed to start or crashed
**Debug Steps**:
1. Check LAPI process: `docker exec charon ps aux | grep crowdsec`
2. Check LAPI logs: `docker exec charon cat /var/log/crowdsec/crowdsec.log`
3. Verify config: `docker exec charon cat /etc/crowdsec/config.yaml`
@@ -249,6 +263,7 @@ time="..." level=info msg="CrowdSec bouncer registration successful" masked_key=
**Cause**: Key not properly registered with LAPI
**Resolution**:
```bash
# List registered bouncers
docker exec charon cscli bouncers list
View File
@@ -60,6 +60,7 @@ logger.step('Click login button', 245); // with duration in ms
```
**Output:**
```
├─ Navigate to home page
├─ Click login button (245ms)
@@ -81,6 +82,7 @@ logger.network({
```
**Output:**
```
✅ POST https://api.example.com/login [200] 342ms
```
@@ -94,6 +96,7 @@ logger.locator('[role="button"]', 'click', true, 45);
```
**Output:**
```
✓ click "[role="button"]" 45ms
```
@@ -108,6 +111,7 @@ logger.assertion('URL is correct', false, 'http://old.com', 'http://new.com');
```
**Output:**
```
✓ Assert: Button is visible
✗ Assert: URL is correct | expected: "http://new.com", actual: "http://old.com"
@@ -122,6 +126,7 @@ logger.error('Network request failed', new Error('TIMEOUT'), 1);
```
**Output:**
```
❌ ERROR: Network request failed - TIMEOUT
🔄 Recovery: 1 attempts remaining
@@ -134,6 +139,7 @@ Traces capture all interactions, network activity, and DOM snapshots. They're in
### Automatic Trace Capture
Traces are automatically captured:
- On first retry of failed tests
- On failure when running locally (if configured)
@@ -166,6 +172,7 @@ npx playwright show-trace test-results/path/to/trace.zip
```
The Trace Viewer shows:
- **Timeline**: Chronological list of all actions
- **Network**: HTTP requests/responses with full details
- **Console**: Page JS console output
@@ -490,18 +497,21 @@ test('should toggle security features', async ({ page }) => {
```
**Key Features**:
- Automatically finds parent `<label>` element
- Scrolls element into view (sticky header aware)
- Cross-browser compatible (Chromium, Firefox, WebKit)
- No `force: true` or hard-coded waits needed
**When to Use**:
- Any test that clicks Switch/Toggle components
- Settings pages with enable/disable toggles
- Security dashboard module toggles (CrowdSec, ACL, WAF, Rate Limiting)
- Access lists and configuration toggles
**References**:
- [Implementation](../../tests/utils/ui-helpers.ts) - Full helper code
- [QA Report](../reports/qa_report.md) - Test results and validation
View File
@@ -19,6 +19,7 @@
### ❌ AVOID: Polling in beforeEach Hooks
**Anti-Pattern**:
```typescript
test.beforeEach(async ({ page, adminUser }) => {
await loginUser(page, adminUser);
@@ -37,6 +38,7 @@ test.beforeEach(async ({ page, adminUser }) => {
```
**Why This Is Bad**:
- Polls `/api/v1/feature-flags` endpoint **31 times** per test file (once per test)
- With 12 parallel processes (4 shards × 3 browsers), causes API server bottleneck
- Adds 310s minimum execution time per shard (31 tests × 10s timeout)
@@ -49,6 +51,7 @@ test.beforeEach(async ({ page, adminUser }) => {
### ✅ PREFER: Per-Test Verification Only When Toggled
**Correct Pattern**:
```typescript
test('should toggle Cerberus feature', async ({ page }) => {
await test.step('Navigate to system settings', async () => {
@@ -74,12 +77,14 @@ test('should toggle Cerberus feature', async ({ page }) => {
```
**Why This Is Better**:
- API calls reduced by **90%** (from 31 per shard to 3-5 per shard)
- Only tests that actually toggle flags incur the polling cost
- Faster test execution (shards complete in <15 minutes vs >30 minutes)
- Clearer test intent—verification is tied to the action that requires it
**Rule of Thumb**:
- **No toggle, no propagation check**: If a test reads flag state without changing it, don't poll.
- **Toggle = verify**: Always verify propagation after toggling to ensure state change persisted.
@@ -90,6 +95,7 @@ test('should toggle Cerberus feature', async ({ page }) => {
### ❌ AVOID: Label-Only Locators
**Anti-Pattern**:
```typescript
await test.step('Verify Script path/command field appears', async () => {
// ⚠️ PROBLEM: Fails in Firefox/WebKit
@@ -99,6 +105,7 @@ await test.step('Verify Script path/command field appears', async () => {
```
**Why This Fails**:
- Label locators depend on browser-specific DOM rendering
- Firefox/WebKit may render Label components differently than Chromium
- Regex patterns may not match if label has extra whitespace or is split across nodes
@@ -109,6 +116,7 @@ await test.step('Verify Script path/command field appears', async () => {
### ✅ PREFER: Multi-Strategy Locators with Fallbacks
**Correct Pattern**:
```typescript
import { getFormFieldByLabel } from './utils/ui-helpers';
@@ -127,6 +135,7 @@ await test.step('Verify Script path/command field appears', async () => {
```
**Helper Implementation** (`tests/utils/ui-helpers.ts`):
```typescript
/**
* Get form field with cross-browser label matching
@@ -169,12 +178,14 @@ export function getFormFieldByLabel(
```
**Why This Is Better**:
- **95%+ pass rate** on Firefox/WebKit (up from 70%)
- Gracefully degrades through fallback strategies
- No browser-specific workarounds needed in test code
- Single helper enforces consistent pattern across all tests
**When to Use**:
- Any test that interacts with form fields
- Tests that must pass on all three browsers (Chromium, Firefox, WebKit)
- Accessibility-critical tests (label locators are user-facing)
@@ -186,6 +197,7 @@ export function getFormFieldByLabel(
### ❌ AVOID: Duplicate API Requests
**Anti-Pattern**:
```typescript
// Multiple tests in parallel all polling the same endpoint
test('test 1', async ({ page }) => {
@@ -198,6 +210,7 @@ test('test 2', async ({ page }) => {
```
**Why This Is Bad**:
- 12 parallel workers all hit `/api/v1/feature-flags` simultaneously
- No request coalescing or caching
- API server degrades under concurrent load
@@ -208,6 +221,7 @@ test('test 2', async ({ page }) => {
### ✅ PREFER: Request Coalescing with Worker Isolation
**Correct Pattern** (`tests/utils/wait-helpers.ts`):
```typescript
// Cache in-flight requests per worker
const inflightRequests = new Map<string, Promise<Record<string, boolean>>>();
@@ -249,12 +263,14 @@ export async function waitForFeatureFlagPropagation(
```
**Why This Is Better**:
- **30-40% reduction** in duplicate API calls
- Multiple tests requesting same state share one API call
- Worker isolation prevents cache collisions between parallel processes
- Sorted keys ensure semantic equivalence (`{a:true, b:false}` === `{b:false, a:true}`)
**Cache Behavior**:
- **Hit**: Another test in same worker already polling for same state
- **Miss**: First test in worker to request this state OR different state requested
- **Clear**: Cache cleared after all tests in worker complete (`test.afterAll()`)
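Key construction can be sketched as a pure function. This illustrative version reproduces the worker-prefixed, sorted-key format (names are illustrative, not the exact helper in `wait-helpers.ts`):

```typescript
// Illustrative cache-key builder: worker index prefix plus flags serialized
// with sorted keys, so semantically equal states map to the same key.
function buildCacheKey(
  workerIndex: number,
  flags: Record<string, boolean>,
): string {
  const sorted = Object.fromEntries(
    Object.entries(flags).sort(([a], [b]) => a.localeCompare(b)),
  );
  return `${workerIndex}:${JSON.stringify(sorted)}`;
}
```

Sorting before serializing is what makes `{a: true, b: false}` and `{b: false, a: true}` coalesce into a single in-flight request.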
@@ -266,6 +282,7 @@ export async function waitForFeatureFlagPropagation(
### ❌ PROBLEM: Shards Exceeding Timeout
**Symptom**:
```bash
# GitHub Actions logs
Error: The operation was canceled.
@@ -273,6 +290,7 @@ Job duration: 31m 45s (exceeds 30m limit)
```
**Root Causes**:
1. Feature flag polling in beforeEach (31 tests × 10s = 310s minimum)
2. API bottleneck under parallel load
3. Slow browser startup in CI environment
@@ -283,6 +301,7 @@ Job duration: 31m 45s (exceeds 30m limit)
### ✅ SOLUTION: Enforce 15-Minute Budget Per Shard
**CI Configuration** (`.github/workflows/e2e-tests.yml`):
```yaml
- name: Verify shard performance budget
if: always()
@@ -300,23 +319,30 @@ Job duration: 31m 45s (exceeds 30m limit)
```
**Why This Is Better**:
- **Early detection** of performance regressions in CI
- Forces developers to optimize slow tests before merge
- Prevents accumulation of "death by a thousand cuts" slowdowns
- Clear failure message directs investigation to bottleneck
**How to Debug Timeouts**:
1. **Check metrics**: Review API call counts in test output
```bash
grep "CACHE HIT\|CACHE MISS" test-output.log
```
2. **Profile locally**: Instrument slow helpers
```typescript
const startTime = Date.now();
await waitForLoadingComplete(page);
console.log(`Loading took ${Date.now() - startTime}ms`);
```
3. **Isolate shard**: Run failing shard locally to reproduce
```bash
npx playwright test --shard=2/4 --project=firefox
```
@@ -328,6 +354,7 @@ Job duration: 31m 45s (exceeds 30m limit)
### ❌ AVOID: State Leakage Between Tests
**Anti-Pattern**:
```typescript
test('enable Cerberus', async ({ page }) => {
await toggleCerberus(page, true);
@@ -342,6 +369,7 @@ test('ACL settings require Cerberus', async ({ page }) => {
```
**Why This Is Bad**:
- Tests depend on execution order (serial execution works, parallel fails)
- Flakiness when running with `--workers=4` or `--repeat-each=5`
- Hard to debug failures (root cause is in different test file)
@@ -351,6 +379,7 @@ test('ACL settings require Cerberus', async ({ page }) => {
### ✅ PREFER: Explicit State Restoration
**Correct Pattern**:
```typescript
test.afterEach(async ({ page }) => {
await test.step('Restore default feature flag state', async () => {
@@ -375,12 +404,14 @@ test.afterEach(async ({ page }) => {
```
**Why This Is Better**:
- **Zero inter-test dependencies**: Tests can run in any order
- Passes randomization testing: `--repeat-each=5 --workers=4`
- Explicit cleanup makes state management visible in code
- Fast restoration (no polling required, direct API call)
**Validation Command**:
```bash
# Verify test isolation with randomization
npx playwright test tests/settings/system-settings.spec.ts \
@@ -398,6 +429,7 @@ npx playwright test tests/settings/system-settings.spec.ts \
### ❌ AVOID: Boolean Logic on Transient States
**Anti-Pattern**:
```typescript
const hasEmptyMessage = await emptyCellMessage.isVisible().catch(() => false);
const hasTable = await table.isVisible().catch(() => false);
@@ -405,6 +437,7 @@ expect(hasEmptyMessage || hasTable).toBeTruthy();
```
**Why This Is Bad**:
- Fails during the split second where neither element is fully visible (loading transitions).
- Playwright's auto-retrying logic is bypassed by the `catch()` block.
- Leads to flaky "false negatives" where both checks return false before content loads.
@@ -412,6 +445,7 @@ expect(hasEmptyMessage || hasTable).toBeTruthy();
### ✅ PREFER: Locator Composition with `.or()`
**Correct Pattern**:
```typescript
await expect(
page.getByRole('table').or(page.getByText(/no.*certificates.*found/i))
@@ -419,6 +453,7 @@ await expect(
```
**Why This Is Better**:
- Leverages Playwright's built-in **auto-retry** mechanism.
- Waits for *either* condition to become true.
- Handles loading spinners and layout shifts gracefully.
@@ -431,6 +466,7 @@ await expect(
### ❌ AVOID: Fixed Timeouts or Custom Loops
**Anti-Pattern**:
```typescript
// Flaky custom retry loop
for (let i = 0; i < 3; i++) {
@@ -446,6 +482,7 @@ for (let i = 0; i < 3; i++) {
### ✅ PREFER: `.toPass()` for Verification Loops
**Correct Pattern**:
```typescript
await expect(async () => {
const response = await request.post('/endpoint');
@@ -457,6 +494,7 @@ await expect(async () => {
```
**Why This Is Better**:
- Built-in assertion retry logic.
- Configurable backoff intervals.
- Cleaner syntax for verifying eventual success (e.g. valid API response after background processing).
View File
@@ -11,6 +11,7 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
## Test Results
### Before Fixes
| Status | Count |
|--------|-------|
| ❌ Failed | 7 |
@@ -18,6 +19,7 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
| ⏭️ Skipped | 3 |
### After Fixes
| Status | Count |
|--------|-------|
| ❌ Failed | 0 |
@@ -27,12 +29,15 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
## Test Files Summary
### 1. `tests/auth.setup.ts`
| Test | Status |
|------|--------|
| authenticate | ✅ Pass |
### 2. `tests/dns-provider-types.spec.ts`
**API Tests:**
| Test | Status |
|------|--------|
| GET /dns-providers/types returns all built-in and custom providers | ✅ Pass |
@@ -43,6 +48,7 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
| Script provider type has command/path field | ✅ Pass |
**UI Tests:**
| Test | Status |
|------|--------|
| Provider selector shows all provider types in dropdown | ✅ Pass |
@@ -54,7 +60,9 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
| Script type selection shows script path field | ✅ Pass |
### 3. `tests/dns-provider-crud.spec.ts`
**Create Provider:**
| Test | Status |
|------|--------|
| Create Manual DNS provider | ✅ Pass |
@@ -63,6 +71,7 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
| Validate webhook URL format | ✅ Pass |
**Provider List:**
| Test | Status |
|------|--------|
| Display provider list or empty state | ✅ Pass |
@@ -70,17 +79,20 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
| Show provider details in list | ✅ Pass |
**Edit Provider:**
| Test | Status |
|------|--------|
| Open edit dialog for existing provider | ⏭️ Skipped (conditional) |
| Update provider name | ⏭️ Skipped (conditional) |
**Delete Provider:**
| Test | Status |
|------|--------|
| Show delete confirmation dialog | ⏭️ Skipped (conditional) |
**API Operations:**
| Test | Status |
|------|--------|
| List providers via API | ✅ Pass |
@@ -89,6 +101,7 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
| Get single provider via API | ✅ Pass |
**Form Accessibility:**
| Test | Status |
|------|--------|
| Form has accessible labels | ✅ Pass |
@@ -96,7 +109,9 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
| Errors announced to screen readers | ✅ Pass |
### 4. `tests/manual-dns-provider.spec.ts`
**Provider Selection Flow:**
| Test | Status |
|------|--------|
| Navigate to DNS Providers page | ✅ Pass |
@@ -104,6 +119,7 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
| Display Manual option in provider selection | ✅ Pass (Fixed) |
**Manual Challenge UI Display:**
| Test | Status |
|------|--------|
| Display challenge panel with required elements | ✅ Pass |
@@ -112,12 +128,14 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
| Display status indicator | ✅ Pass (Fixed) |
**Copy to Clipboard:**
| Test | Status |
|------|--------|
| Have accessible copy buttons | ✅ Pass |
| Show copied feedback on click | ✅ Pass |
**Verify Button Interactions:**
| Test | Status |
|------|--------|
| Have Check DNS Now button | ✅ Pass |
@@ -125,6 +143,7 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
| Have Verify button with description | ✅ Pass |
**Accessibility Checks:**
| Test | Status |
|------|--------|
| Keyboard accessible interactive elements | ✅ Pass |
@@ -134,6 +153,7 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
| Validate accessibility tree structure | ✅ Pass (Fixed) |
**Component Tests:**
| Test | Status |
|------|--------|
| Render all required challenge information | ✅ Pass |
@@ -141,6 +161,7 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
| Handle verified challenge state | ✅ Pass |
**Error Handling:**
| Test | Status |
|------|--------|
| Display error message on verification failure | ✅ Pass |
@@ -149,6 +170,7 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
## Issues Fixed
### 1. URL Path Mismatch
**Issue**: `manual-dns-provider.spec.ts` used the `/dns-providers` URL, while the frontend uses `/dns/providers`.
**Fix**: Updated all occurrences to use `/dns/providers`.
@@ -156,11 +178,13 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
**Files Changed**: `tests/manual-dns-provider.spec.ts`
### 2. Button Selector Too Strict
**Issue**: Tests used `getByRole('button', { name: /add provider/i })` without `.first()`, which failed when multiple buttons matched.
**Fix**: Added `.first()` to handle both header button and empty state button.
### 3. Dropdown Search Filter Test
**Issue**: Test tried to fill text into a combobox that doesn't support text input.
**Fix**: Changed test to verify keyboard navigation works instead.
@@ -168,6 +192,7 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
**File**: `tests/dns-provider-types.spec.ts`
### 4. Dynamic Field Locators
**Issue**: Tests used `getByLabel(/url/i)` but credential fields are rendered dynamically without proper labels.
**Fix**: Changed to locate fields by label text followed by input structure.
@@ -175,6 +200,7 @@ Successfully triaged and fixed Playwright E2E tests for the DNS Provider feature
**Files Changed**: `tests/dns-provider-types.spec.ts`
### 5. Conditional Status Icon Test
**Issue**: Test expected SVG icon in status indicator but icon may not always be present.
**Fix**: Made icon check conditional.
@@ -194,6 +220,7 @@ This is expected behavior — these tests only run when provider cards with edit
## Test Fixtures Created
Created `tests/fixtures/dns-providers.ts` with:
- Mock provider types (built-in and custom)
- Mock provider data for different types
- Mock API responses
View File
@@ -42,6 +42,7 @@ await scriptPath.fill('/path/to/script.sh');
```
**Error (Firefox/WebKit)**:
```
TimeoutError: locator.fill: Timeout 5000ms exceeded.
=========================== logs ===========================
@@ -78,12 +79,14 @@ await scriptPath.fill('/path/to/script.sh');
### When to Use `getFormFieldByLabel()`
**Use when**:
- Form fields have complex label structures (nested elements, icons, tooltips)
- Tests fail in Firefox/WebKit but pass in Chromium
- Label text is dynamic or internationalized
- Multiple fields have similar labels
**Don't use when**:
- Standard `getByLabel()` works reliably across all browsers
- Field has a unique `data-testid` or `name` attribute
- Field is the only one of its type on the page
@@ -184,11 +187,13 @@ if (alreadyMatches(currentState, expectedFlags)) {
```
**Cache Key Format**:
```
[worker_index]:[sorted_flags_json]
```
**Example**:
```
Worker 0: "0:{\"feature.cerberus.enabled\":false,\"feature.crowdsec.enabled\":false}"
Worker 1: "1:{\"feature.cerberus.enabled\":false,\"feature.crowdsec.enabled\":false}"
@@ -201,11 +206,13 @@ Worker 1: "1:{\"feature.cerberus.enabled\":false,\"feature.crowdsec.enabled\":fa
### When to Use `waitForFeatureFlagPropagation()`
**Use when**:
- A test **toggles** a feature flag via the UI
- Backend state changes and you need to verify propagation
- Test depends on a specific flag state being active
**Don't use when**:
- Setting up initial state in `beforeEach` (use API directly instead)
- Flags haven't changed since last verification
- Test doesn't modify flags
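The cache key format shown above can be computed with a small pure function. This is a hedged sketch of the documented format, not necessarily the helper's real implementation:

```typescript
// Build the propagation cache key: "[worker_index]:[sorted_flags_json]".
// Sorting the keys makes the JSON deterministic regardless of the order
// in which flags were set, so identical states always hit the cache.
export function propagationCacheKey(
  workerIndex: number,
  flags: Record<string, boolean>,
): string {
  const sorted = Object.fromEntries(
    Object.entries(flags).sort(([a], [b]) => a.localeCompare(b)),
  );
  return `${workerIndex}:${JSON.stringify(sorted)}`;
}
```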
@@ -239,6 +246,7 @@ test.describe('System Settings', () => {
```
**Why This Works**:
- Each test starts from known defaults (restored by previous test's `afterEach`)
- No unnecessary polling in `beforeEach`
- Cleanup happens once, not N times per describe block
@@ -261,6 +269,7 @@ export async function waitForFeatureFlagPropagation(...) {
```
**You don't need to manually wait for the overlay** — it's handled by:
- `clickSwitch()`
- `clickAndWaitForResponse()`
- `waitForFeatureFlagPropagation()`
@@ -272,6 +281,7 @@ export async function waitForFeatureFlagPropagation(...) {
### Why Isolation Matters
Tests running in parallel can interfere with each other if they:
- Share mutable state (database, config files, feature flags)
- Don't clean up resources
- Rely on global defaults
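A common way to avoid shared mutable state is to namespace every resource by the Playwright worker index. A minimal sketch, where the naming scheme is an assumption rather than the project's convention:

```typescript
// Derive a worker-unique name so parallel workers never touch the same
// database, config file, or fixture record.
export function workerScopedName(base: string, workerIndex: number): string {
  return `${base}-w${workerIndex}`;
}

// Usage in a Playwright test: workerScopedName('dns-provider', test.info().workerIndex)
```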
@@ -423,11 +433,13 @@ await field.fill('value');
**Symptom**: `Feature flag propagation timeout after 120 attempts (60000ms)`
**Causes**:
1. Backend not updating flags
2. Config reload overlay blocking UI
3. Database transaction not committed
**Fix Steps**:
1. Check backend logs: Does PUT `/api/v1/feature-flags` succeed?
2. Check overlay state: Is `[data-testid="config-reload-overlay"]` stuck visible?
3. Increase timeout temporarily: `waitForFeatureFlagPropagation(page, flags, { timeout: 120000 })`
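The timeout in the error message comes from condition-based polling. A generic sketch of that pattern follows; the interval and attempt counts are illustrative, not the helper's actual values:

```typescript
// Poll an async condition every `intervalMs` until it returns true or
// `maxAttempts` is exhausted, mirroring the "timeout after 120 attempts
// (60000ms)" failure mode described above.
export async function pollUntil(
  condition: () => Promise<boolean>,
  maxAttempts = 120,
  intervalMs = 500,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await condition()) return;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(
    `Feature flag propagation timeout after ${maxAttempts} attempts (${maxAttempts * intervalMs}ms)`,
  );
}
```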
@@ -499,6 +511,7 @@ test.afterEach(async ({ request }) => {
---
**See Also**:
- [Testing README](./README.md) — Quick reference and debugging guide
- [Switch Component Testing](./README.md#-switchtoggle-component-testing) — Detailed switch patterns
- [Debugging Guide](./debugging-guide.md) — Troubleshooting slow/flaky tests

View File

@@ -11,11 +11,13 @@ During Sprint 1, we resolved critical issues affecting E2E test reliability and
**What was happening**: Some tests would hang indefinitely or timeout after 30 seconds, especially in CI/CD pipelines.
**Root cause**:
- Config reload overlay was blocking test interactions
- Feature flag propagation was too slow under high load
- API polling happened unnecessarily for every test
**What we did**:
1. Added smart detection to wait for config reloads to complete
2. Increased timeouts to accommodate slower environments
3. Implemented request caching to reduce redundant API calls

View File

@@ -11,11 +11,13 @@ Common issues and solutions for Playwright E2E tests.
**Symptoms**: Tests timing out after 30 seconds, config reload overlay blocking interactions
**Resolution**:
- Extended timeout from 30s to 60s for feature flag propagation
- Added automatic detection and waiting for config reload overlay
- Improved test isolation with proper cleanup in `afterEach` hooks
**If you still experience timeouts**:
1. Rebuild the E2E container: `.github/skills/scripts/skill-runner.sh docker-rebuild-e2e`
2. Check Docker logs for health check failures
3. Verify emergency token is set in `.env` file
@@ -25,6 +27,7 @@ Common issues and solutions for Playwright E2E tests.
**Symptoms**: Feature flag tests failing with propagation timeout
**Resolution**:
- Added key normalization to handle both `feature.cerberus.enabled` and `cerberus.enabled` formats
- Tests now automatically detect and adapt to API response format
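The key normalization described above can be sketched as a pure function. The `feature.` prefix handling is inferred from the two formats mentioned and may differ from the actual code:

```typescript
// Normalize a flag key so that "feature.cerberus.enabled" and
// "cerberus.enabled" compare equal, regardless of API response format.
export function normalizeFlagKey(key: string): string {
  return key.startsWith('feature.') ? key.slice('feature.'.length) : key;
}
```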
@@ -67,21 +70,25 @@ Emergency token not configured in `.env` file.
### Solution
1. **Generate token:**
```bash
openssl rand -hex 32
```
2. **Add to `.env` file:**
```bash
echo "CHARON_EMERGENCY_TOKEN=<paste_token_here>" >> .env
```
3. **Verify:**
```bash
grep CHARON_EMERGENCY_TOKEN .env
```
4. **Run tests:**
```bash
npx playwright test --project=chromium
```
@@ -104,16 +111,19 @@ Token is shorter than 64 characters (security requirement).
### Solution
1. **Regenerate token with correct length:**
```bash
openssl rand -hex 32 # Generates 64-char hex string
```
2. **Update `.env` file:**
```bash
sed -i "s/CHARON_EMERGENCY_TOKEN=.*/CHARON_EMERGENCY_TOKEN=<new_token>/" .env
```
3. **Verify length:**
```bash
echo -n "$(grep CHARON_EMERGENCY_TOKEN .env | cut -d= -f2)" | wc -c
# Should output: 64
@@ -139,6 +149,7 @@ Token is shorter than 64 characters (security requirement).
### Solution
**Step 1: Verify token configuration**
```bash
# Check token exists and is 64 chars
echo -n "$(grep CHARON_EMERGENCY_TOKEN .env | cut -d= -f2)" | wc -c
@@ -148,12 +159,14 @@ docker exec charon env | grep CHARON_EMERGENCY_TOKEN
```
**Step 2: Verify backend is running**
```bash
curl http://localhost:8080/api/v1/health
# Should return: {"status":"ok"}
```
**Step 3: Test emergency endpoint directly**
```bash
curl -X POST http://localhost:8080/api/v1/emergency/security-reset \
-H "X-Emergency-Token: $(grep CHARON_EMERGENCY_TOKEN .env | cut -d= -f2)" \
@@ -162,6 +175,7 @@ curl -X POST http://localhost:8080/api/v1/emergency/security-reset \
```
**Step 4: Check backend logs**
```bash
# Docker Compose
docker compose logs charon | tail -50
@@ -171,6 +185,7 @@ docker logs charon | tail -50
```
**Step 5: Regenerate token if needed**
```bash
# Generate new token
NEW_TOKEN=$(openssl rand -hex 32)
@@ -201,6 +216,7 @@ Security teardown did not successfully disable ACL before tests ran.
### Solution
1. **Run teardown script manually:**
```bash
npx playwright test tests/security-teardown.setup.ts
```
@@ -210,12 +226,14 @@ Security teardown did not successfully disable ACL before tests ran.
- Verify no error messages about missing token
3. **Verify ACL is disabled:**
```bash
curl http://localhost:8080/api/v1/security/status | jq
# acl.enabled should be false
```
4. **If still blocked, manually disable via API:**
```bash
# Using emergency token
curl -X POST http://localhost:8080/api/v1/emergency/security-reset \
@@ -225,6 +243,7 @@ Security teardown did not successfully disable ACL before tests ran.
```
5. **Run tests again:**
```bash
npx playwright test --project=chromium
```
@@ -282,11 +301,13 @@ Backend container not running or not accessible.
### Solution
1. **Check container status:**
```bash
docker ps | grep charon
```
2. **If not running, start it:**
```bash
# Docker Compose
docker compose up -d
@@ -296,11 +317,13 @@ Backend container not running or not accessible.
```
3. **Wait for health:**
```bash
timeout 60 bash -c 'until curl -f http://localhost:8080/api/v1/health; do sleep 2; done'
```
4. **Check logs if still failing:**
```bash
docker logs charon | tail -50
```
@@ -317,6 +340,7 @@ Backend container not running or not accessible.
### Cause
Token contains common placeholder strings like:
- `test-emergency-token`
- `your_64_character`
- `replace_this`
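Detecting placeholder tokens like these reduces to a substring check. A hedged sketch, where the placeholder list mirrors the examples above and may differ from the real validation:

```typescript
// Substrings that indicate a token was never replaced with a real value.
const PLACEHOLDER_PATTERNS = ['test-emergency-token', 'your_64_character', 'replace_this'];

// Returns true if the token still contains a known placeholder string.
export function isPlaceholderToken(token: string): boolean {
  const lower = token.toLowerCase();
  return PLACEHOLDER_PATTERNS.some((pattern) => lower.includes(pattern));
}
```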
@@ -325,16 +349,19 @@ Token contains common placeholder strings like:
### Solution
1. **Generate a unique token:**
```bash
openssl rand -hex 32
```
2. **Replace placeholder in `.env`:**
```bash
sed -i "s/CHARON_EMERGENCY_TOKEN=.*/CHARON_EMERGENCY_TOKEN=<new_token>/" .env
```
3. **Verify it's not a placeholder:**
```bash
grep CHARON_EMERGENCY_TOKEN .env
# Should show a random hex string
@@ -389,16 +416,19 @@ Enables all debug output.
**Solutions:**
1. **Use sharding (parallel execution):**
```bash
npx playwright test --shard=1/4 --project=chromium
```
2. **Run specific test files:**
```bash
npx playwright test tests/manual-dns-provider.spec.ts
```
3. **Skip slow tests during development:**
```bash
npx playwright test --grep-invert "@slow"
```
@@ -406,19 +436,23 @@ Enables all debug output.
### Feature Flag Toggle Tests Timing Out
**Symptoms:**
- Tests in `tests/settings/system-settings.spec.ts` fail with timeout errors
- Error messages mention feature flag toggles (Cerberus, CrowdSec, Uptime, Persist)
**Cause:**
- Backend N+1 query pattern causing 300-600ms latency in CI
- Hard-coded waits insufficient for slower CI environments
**Solution (Fixed in v2.x):**
- Backend now uses batch query pattern (3-6x faster: 600ms → 200ms P99)
- Tests use condition-based polling with `waitForFeatureFlagPropagation()`
- Retry logic with exponential backoff handles transient failures
**If you still experience issues:**
1. Check backend latency: `docker logs charon 2>&1 | grep "\[METRICS\]"`
2. Verify batch query is being used (should see `WHERE key IN (...)` in logs)
3. Ensure you're running latest version with the optimization
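The batch pattern replaces one query per flag with a single `WHERE key IN (...)` query. A sketch of building that parameterized query, where the table and column names are assumptions:

```typescript
// Build one parameterized SELECT for all requested flag keys instead of
// issuing a separate query per key (the N+1 pattern described above).
export function batchFlagQuery(keys: string[]): { sql: string; params: string[] } {
  const placeholders = keys.map((_, i) => `$${i + 1}`).join(', ');
  return {
    sql: `SELECT key, value FROM feature_flags WHERE key IN (${placeholders})`,
    params: keys,
  };
}
```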
@@ -432,16 +466,19 @@ Enables all debug output.
**Solutions:**
1. **Increase health check timeout:**
```bash
timeout 120 bash -c 'until curl -f http://localhost:8080/api/v1/health; do sleep 2; done'
```
2. **Pre-pull Docker image:**
```bash
docker pull wikid82/charon:latest
```
3. **Check Docker resource limits:**
```bash
docker stats charon
# Ensure adequate CPU/memory
@@ -458,6 +495,7 @@ If you're still stuck after trying these solutions:
- Search [GitHub Issues](https://github.com/Wikid82/charon/issues)
2. **Collect diagnostic info:**
```bash
# Environment
echo "OS: $(uname -a)"