chore: Enhance documentation for E2E testing:

- Added clarity and structure to README files, including recent updates and getting started sections.
- Improved manual verification documentation for CrowdSec authentication, emphasizing expected outputs and success criteria.
- Updated debugging guide with detailed output examples and automatic trace capture information.
- Refined best practices for E2E tests, focusing on efficient polling, locator strategies, and state management.
- Documented triage report for DNS Provider feature tests, highlighting issues fixed and test results before and after improvements.
- Revised E2E test writing guide to include when to use specific helper functions and patterns for better test reliability.
- Enhanced troubleshooting documentation with clear resolutions for common issues, including timeout and token configuration problems.
- Updated tests README to provide quick links and best practices for writing robust tests.
GitHub Actions
2026-03-24 01:47:22 +00:00
parent 7d986f2821
commit ca477c48d4
52 changed files with 983 additions and 198 deletions

@@ -31,6 +31,7 @@ for _, s := range settings {
```
**Key Improvements:**
- **Single Query:** `WHERE key IN (?, ?, ?)` fetches all flags in one database round-trip
- **O(1) Lookups:** Map-based access eliminates linear search overhead
- **Error Handling:** Explicit error logging and HTTP 500 response on failure
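A minimal sketch of the single-query and map-lookup ideas above, using only the standard library rather than GORM. The `Setting` struct, `inClause`, and `indexByKey` are hypothetical names chosen for illustration and are not taken from the handler:

```go
package main

import "strings"

// Setting mirrors one key/value row. The struct and its field names are
// assumptions for this sketch, not the project's actual model.
type Setting struct {
	Key   string
	Value string
}

// inClause builds "<column> IN (?, ?, ?)" with one placeholder per key.
func inClause(column string, n int) string {
	if n <= 0 {
		return "1 = 0" // no keys requested: match nothing
	}
	placeholders := strings.TrimSuffix(strings.Repeat("?, ", n), ", ")
	return column + " IN (" + placeholders + ")"
}

// indexByKey converts the batch result into a map so each flag can be
// looked up in O(1) instead of scanning the slice.
func indexByKey(rows []Setting) map[string]string {
	byKey := make(map[string]string, len(rows))
	for _, s := range rows {
		byKey[s.Key] = s.Value
	}
	return byKey
}
```

With three flags, `inClause("key", 3)` yields `key IN (?, ?, ?)`, and a missing flag shows up as an absent map key, which the caller can treat as "use the default".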
@@ -56,6 +57,7 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error {
```
**Key Improvements:**
- **Atomic Updates:** All flag changes commit or rollback together
- **Error Recovery:** Transaction rollback prevents partial state
- **Improved Logging:** Explicit error messages for debugging
@@ -65,10 +67,12 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error {
### Before Optimization (Baseline - N+1 Pattern)
**Architecture:**
- GetFlags(): 3 sequential `WHERE key = ?` queries (one per flag)
- UpdateFlags(): Multiple separate transactions
**Expected Latency (estimated, not measured):**
- **GET P50:** 300ms (CI environment)
- **GET P95:** 500ms
- **GET P99:** 600ms
@@ -77,20 +81,24 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error {
- **PUT P99:** 600ms
**Query Count:**
- GET: 3 queries (one per flag; the N+1-style pattern with N=3)
- PUT: 1-3 queries depending on flag count
**CI Impact:**
- Test flakiness: ~30% failure rate due to timeouts
- E2E test pass rate: ~70%
### After Optimization (Current - Batch Query + Transaction)
**Architecture:**
- GetFlags(): 1 batch query `WHERE key IN (?, ?, ?)`
- UpdateFlags(): 1 transaction wrapping all updates
**Target Latency:**
- **GET P50:** 100ms (3x faster)
- **GET P95:** 150ms (3.3x faster)
- **GET P99:** 200ms (3x faster)
@@ -99,10 +107,12 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error {
- **PUT P99:** 200ms (3x faster)
**Query Count:**
- GET: 1 batch query (N+1 eliminated)
- PUT: 1 transaction (atomic)
**CI Impact (Expected):**
- Test flakiness: 0% (with retry logic + polling)
- E2E test pass rate: 100%
@@ -125,11 +135,13 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error {
**Status:** Complete
**Changes:**
- Added `defer` timing to GetFlags() and UpdateFlags()
- Log format: `[METRICS] GET/PUT /feature-flags: {duration}ms`
- CI pipeline captures P50/P95/P99 metrics
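The `defer` timing described above might look roughly like the following. This is a sketch under assumptions: `trackLatency`, `formatMetric`, and the stand-in `getFlags` are hypothetical names, and only the log shape (`[METRICS] GET /feature-flags: 120ms`) comes from this document:

```go
package main

import (
	"fmt"
	"log"
	"time"
)

// formatMetric renders a latency in the "[METRICS] <METHOD> <path>: <n>ms" shape.
func formatMetric(method, path string, ms int64) string {
	return fmt.Sprintf("[METRICS] %s %s: %dms", method, path, ms)
}

// trackLatency records the start time and returns a function that logs the
// elapsed time when called. Deferring the returned function times the handler.
func trackLatency(method, path string) func() {
	start := time.Now()
	return func() {
		log.Println(formatMetric(method, path, time.Since(start).Milliseconds()))
	}
}

// getFlags is a stand-in for the real handler body.
func getFlags() {
	defer trackLatency("GET", "/feature-flags")()
	// ... query the database and write the response ...
}
```

Note the trailing `()` on the `defer` line: `trackLatency` runs immediately to capture the start time, and only the function it returns is deferred.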
**Files Modified:**
- `backend/internal/api/handlers/feature_flags_handler.go`
### Phase 1: Backend Optimization - N+1 Query Fix
@@ -139,16 +151,19 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error {
**Priority:** P0 - Critical CI Blocker
**Changes:**
- **GetFlags():** Replaced N+1 loop with batch query `WHERE key IN (?)`
- **UpdateFlags():** Wrapped updates in single transaction
- **Tests:** Added batch query and transaction rollback tests
- **Benchmarks:** Added BenchmarkGetFlags and BenchmarkUpdateFlags
**Files Modified:**
- `backend/internal/api/handlers/feature_flags_handler.go`
- `backend/internal/api/handlers/feature_flags_handler_test.go`
**Expected Impact:**
- 3-6x latency reduction (600ms → 200ms P99)
- Elimination of N+1 query anti-pattern
- Atomic updates with rollback on error
@@ -159,32 +174,38 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error {
### Test Helpers Used
**Polling Helper:** `waitForFeatureFlagPropagation()`
- Polls `/api/v1/feature-flags` until expected state confirmed
- Default interval: 500ms
- Default timeout: 30s (150x safety margin over 200ms P99)
**Retry Helper:** `retryAction()`
- 3 max attempts with exponential backoff (2s, 4s, 8s)
- Handles transient network/DB failures
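The real helpers live in `tests/utils/wait-helpers.ts`; the Go sketch below only illustrates the retry schedule described above and does not mirror their implementation. `backoffDelays`, `retry`, and `errTransient` are hypothetical names. This version sleeps only between attempts, so with 3 attempts the third delay (8s) is never used; whether the actual helper behaves the same way is an assumption.

```go
package main

import (
	"errors"
	"time"
)

// errTransient is an example error for exercising the sketch.
var errTransient = errors.New("transient failure")

// backoffDelays returns a doubling schedule: base, 2*base, 4*base, ...
func backoffDelays(base time.Duration, n int) []time.Duration {
	if n < 0 {
		n = 0
	}
	delays := make([]time.Duration, 0, n)
	d := base
	for i := 0; i < n; i++ {
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

// retry calls fn up to maxAttempts times and returns nil on the first success.
// It sleeps between failed attempts and returns the last error if all fail.
func retry(maxAttempts int, base time.Duration, fn func() error) error {
	delays := backoffDelays(base, maxAttempts)
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if attempt < maxAttempts-1 {
			time.Sleep(delays[attempt])
		}
	}
	return err
}
```

Calling `backoffDelays(2*time.Second, 3)` gives the 2s, 4s, 8s schedule listed above.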
### Timeout Strategy
**Helper Defaults:**
- `clickAndWaitForResponse()`: 30s timeout
- `waitForAPIResponse()`: 30s timeout
- No explicit timeouts in test files (rely on helper defaults)
**Typical Poll Count:**
- Local: 1-2 polls (50-200ms response + 500ms interval)
- CI: 1-3 polls (50-200ms response + 500ms interval)
### Test Files
**E2E Tests:**
- `tests/settings/system-settings.spec.ts` - Feature toggle tests
- `tests/utils/wait-helpers.ts` - Polling and retry helpers
**Backend Tests:**
- `backend/internal/api/handlers/feature_flags_handler_test.go`
- `backend/internal/api/handlers/feature_flags_handler_coverage_test.go`
@@ -205,11 +226,13 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### Benchmark Analysis
**GetFlags Benchmark:**
- Measures single batch query performance
- Tests with 3 flags in database
- Includes JSON serialization overhead
**UpdateFlags Benchmark:**
- Measures transaction wrapping performance
- Tests atomic update of 3 flags
- Includes JSON deserialization and validation
@@ -219,14 +242,17 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### Why Batch Query Over Individual Queries?
**Problem:** N+1 pattern causes linear latency scaling
- 3 flags = 3 queries × 200ms = 600ms total
- 10 flags = 10 queries × 200ms = 2000ms total
**Solution:** Single batch query with IN clause
- N flags = 1 query × 200ms = 200ms total
- One round-trip regardless of flag count, so latency stays roughly constant
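The arithmetic above can be written as a small model. It is a simplification that assumes a fixed per-round-trip cost (200ms in the examples) and ignores the small extra work a larger `IN` list adds; the function names are hypothetical:

```go
package main

// sequentialLatencyMs models the old pattern: one round-trip per flag.
func sequentialLatencyMs(flags, perQueryMs int) int {
	return flags * perQueryMs
}

// batchLatencyMs models the batch query: a single round-trip, so the flag
// count does not multiply the cost.
func batchLatencyMs(flags, perQueryMs int) int {
	if flags == 0 {
		return 0 // nothing to fetch
	}
	return perQueryMs
}
```

For 3 flags at 200ms per query this gives 600ms sequentially versus 200ms batched; for 10 flags, 2000ms versus 200ms.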
**Trade-offs:**
- ✅ 3-6x latency reduction
- ✅ Scales to more flags without performance degradation
- ⚠️ Slightly more complex code (map-based lookup)
@@ -234,14 +260,17 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### Why Transaction Wrapping?
**Problem:** Multiple separate writes risk partial state
- Flag 1 succeeds, Flag 2 fails → inconsistent state
- No rollback mechanism for failed updates
**Solution:** Single transaction for all updates
- All succeed together or all rollback
- ACID guarantees for multi-flag updates
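The all-or-nothing behaviour can be illustrated without a database. The sketch below is an in-memory analogy, not the GORM transaction used by the handler: updates are staged on a copy and returned only if every one is valid. `applyAtomically` and the rule that unknown flags are rejected are assumptions made for the example:

```go
package main

import "fmt"

// applyAtomically stages every update on a copy of store and returns the copy
// only if all updates are valid. On any failure it returns the original map
// unchanged, which plays the role of a rollback.
func applyAtomically(store, updates map[string]bool) (map[string]bool, error) {
	staged := make(map[string]bool, len(store))
	for k, v := range store {
		staged[k] = v
	}
	for k, v := range updates {
		if _, known := store[k]; !known {
			// "Rollback": discard the staged copy, keep the original.
			return store, fmt.Errorf("unknown flag %q", k)
		}
		staged[k] = v
	}
	return staged, nil // "commit"
}
```

If the second of two updates fails, the caller still sees the original values for both flags, which is the partial-state problem the transaction is meant to prevent.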
**Trade-offs:**
- ✅ Atomic updates with rollback on error
- ✅ Prevents partial state corruption
- ⚠️ Locks are held slightly longer (acceptable because SQLite writes are fast)
@@ -253,11 +282,13 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
**Status:** Not implemented (not needed after Phase 1 optimization)
**Rationale:**
- Current latency (50-200ms) is acceptable for feature flags
- Feature flags change infrequently (not a hot path)
- Adding cache increases complexity without significant benefit
**If Needed:**
- Use Redis or in-memory cache with TTL=60s
- Invalidate on PUT operations
- Expected improvement: 50-200ms → 10-50ms
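If caching ever becomes necessary, the in-memory variant could be sketched as follows. This is a hypothetical design, not existing code: `flagCache`, `newFlagCache`, and the `load` callback are illustrative names, and the TTL is supplied by the caller (60s in the suggestion above).

```go
package main

import (
	"sync"
	"time"
)

// flagCache holds one snapshot of all flags and expires it after ttl.
type flagCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	flags   map[string]bool
	expires time.Time
}

func newFlagCache(ttl time.Duration) *flagCache {
	return &flagCache{ttl: ttl}
}

// Get returns the cached snapshot and calls load only on a miss or after
// expiry. The returned map is shared, so callers should treat it as read-only.
func (c *flagCache) Get(load func() (map[string]bool, error)) (map[string]bool, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.flags != nil && time.Now().Before(c.expires) {
		return c.flags, nil
	}
	flags, err := load()
	if err != nil {
		return nil, err
	}
	c.flags, c.expires = flags, time.Now().Add(c.ttl)
	return flags, nil
}

// Invalidate drops the snapshot; call it after a successful PUT.
func (c *flagCache) Invalidate() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.flags = nil
}
```

A GET handler would call `Get` with a loader that runs the batch query, and the PUT handler would call `Invalidate` after its transaction commits, so readers see stale data only between a commit and the next `Get`.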
@@ -267,11 +298,13 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
**Status:** SQLite default indexes sufficient
**Rationale:**
- `settings.key` column used in WHERE clauses
- SQLite automatically indexes primary key
- Query plan analysis shows index usage
**If Needed:**
- Add explicit index: `CREATE INDEX idx_settings_key ON settings(key)`
- Expected improvement: Minimal (already fast)
@@ -280,11 +313,13 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
**Status:** GORM default pooling sufficient
**Rationale:**
- GORM uses `database/sql` pool by default
- Current concurrency limits adequate
- No connection exhaustion observed
**If Needed:**
- Tune `SetMaxOpenConns()` and `SetMaxIdleConns()`
- Expected improvement: 10-20% under high load
@@ -293,12 +328,14 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### Metrics to Track
**Backend Metrics:**
- P50/P95/P99 latency for GET and PUT operations
- Query count per request (should remain 1 for GET)
- Transaction count per PUT (should remain 1)
- Error rate (target: <0.1%)
**E2E Metrics:**
- Test pass rate for feature toggle tests
- Retry attempt frequency (target: <5%)
- Polling iteration count (typical: 1-3)
@@ -307,11 +344,13 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### Alerting Thresholds
**Backend Alerts:**
- P99 > 500ms → Investigate regression (2.5x slower than optimized)
- Error rate > 1% → Check database health
- Query count > 1 for GET → N+1 pattern reintroduced
**E2E Alerts:**
- Test pass rate < 95% → Check for new flakiness
- Timeout errors > 0 → Investigate CI environment
- Retry rate > 10% → Investigate transient failure source
@@ -319,10 +358,12 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### Dashboard
**CI Metrics:**
- Link: `.github/workflows/e2e-tests.yml` artifacts
- Extracts `[METRICS]` logs for P50/P95/P99 analysis
**Backend Logs:**
- Docker container logs with `[METRICS]` tag
- Example: `[METRICS] GET /feature-flags: 120ms`
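A rough sketch of how P50/P95/P99 could be derived from captured logs. It assumes every metrics line has exactly the shape of the example above and uses the nearest-rank percentile definition; the CI pipeline's actual extraction step is not shown in this document, and `parseLatencies` and `percentile` are hypothetical names:

```go
package main

import (
	"math"
	"regexp"
	"sort"
	"strconv"
)

// metricLine matches e.g. "[METRICS] GET /feature-flags: 120ms".
// The brackets are escaped so they are matched literally.
var metricLine = regexp.MustCompile(`\[METRICS\] (GET|PUT) /feature-flags: (\d+)ms`)

// parseLatencies collects the millisecond values logged for one HTTP method.
func parseLatencies(lines []string, method string) []int {
	var out []int
	for _, line := range lines {
		m := metricLine.FindStringSubmatch(line)
		if m == nil || m[1] != method {
			continue
		}
		if ms, err := strconv.Atoi(m[2]); err == nil {
			out = append(out, ms)
		}
	}
	return out
}

// percentile returns the nearest-rank percentile p (0 < p <= 100) of samples,
// or 0 when there are no samples. The input slice is not modified.
func percentile(samples []int, p float64) int {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]int(nil), samples...)
	sort.Ints(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}
```

With very few samples the higher percentiles collapse onto the maximum, so P99 is only meaningful once a run has logged a reasonable number of requests.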
@@ -331,15 +372,18 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### High Latency (P99 > 500ms)
**Symptoms:**
- E2E tests timing out
- Backend logs show latency spikes
**Diagnosis:**
1. Check query count: `grep -c "SELECT" backend/logs/query.log`
2. Verify batch query: Should see `WHERE key IN (...)`
3. Check transaction wrapping: Should see single `BEGIN ... COMMIT`
**Remediation:**
- If N+1 pattern detected: Verify batch query implementation
- If transaction missing: Verify transaction wrapping
- If database locks: Check concurrent access patterns
@@ -347,15 +391,18 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### Transaction Rollback Errors
**Symptoms:**
- PUT requests return 500 errors
- Backend logs show transaction failure
**Diagnosis:**
1. Check error message: `grep "Failed to update feature flags" backend/logs/app.log`
2. Verify database constraints: Unique key constraints, foreign keys
3. Check database connectivity: Connection pool exhaustion
**Remediation:**
- If constraint violation: Fix invalid flag key or value
- If connection issue: Tune connection pool settings
- If deadlock: Analyze concurrent access patterns
@@ -363,15 +410,18 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$
### E2E Test Flakiness
**Symptoms:**
- Tests pass locally, fail in CI
- Timeout errors in Playwright logs
**Diagnosis:**
1. Check backend latency: `grep -F "[METRICS]" ci-logs.txt`
2. Verify retry logic: Should see retry attempts in logs
3. Check polling behavior: Should see multiple GET requests
**Remediation:**
- If backend slow: Investigate CI environment (disk I/O, CPU)
- If no retries: Verify `retryAction()` wrapper in test
- If no polling: Verify `waitForFeatureFlagPropagation()` usage