Charon/docs/plans/current_spec.md

# Playwright E2E Test Timeout Fix - Feature Flags Endpoint (REVISED)

**Created:** 2026-02-01
**Revised:** 2026-02-01
**Status:** Ready for Implementation
**Priority:** P0 - Critical CI Blocker
**Assignee:** Principal Architect → Supervisor Agent
**Approach:** Proper Fix (Root Cause Resolution)

---

## Executive Summary

Four Playwright E2E tests in `tests/settings/system-settings.spec.ts` are timing out in CI when testing feature flag toggles. Root cause is:
1. **Backend N+1 query pattern** - 3 sequential SQLite queries per request (150-600ms in CI)
2. **Lack of resilience** - No retry logic or condition-based polling
3. **Race conditions** - Hard-coded waits instead of state verification

**Solution (Proper Fix):**
1. **Measure First** - Instrument backend to capture actual CI latency (P50/P95/P99)
2. **Fix Root Cause** - Eliminate N+1 queries with batch query (P0 priority)
3. **Add Resilience** - Implement retry logic with exponential backoff and polling helpers
4. **Add Coverage** - Test concurrent toggles, network failures, initial state reliability

**Philosophy:**
- **"Proper fix over quick fix"** - Address root cause, not symptoms
- **"Measure First, Optimize Second"** - Get actual data before tuning
- **"Avoid Hard-Coded Waits"** - Use Playwright's auto-waiting + condition-based polling

---

## 1. Problem Statement

### Failing Tests (by Function Signature)
1. **Test:** `should toggle Cerberus security feature`
   **Location:** `tests/settings/system-settings.spec.ts`

2. **Test:** `should toggle CrowdSec console enrollment`
   **Location:** `tests/settings/system-settings.spec.ts`

3. **Test:** `should toggle uptime monitoring`
   **Location:** `tests/settings/system-settings.spec.ts`

4. **Test:** `should persist feature toggle changes`
   **Location:** `tests/settings/system-settings.spec.ts` (2 toggle operations)

### Failure Pattern
```
TimeoutError: page.waitForResponse: Timeout 15000ms exceeded.
Call log:
  - waiting for response with predicate
  at clickAndWaitForResponse (tests/utils/wait-helpers.ts:44:3)
```

### Current Test Pattern (Anti-Patterns Identified)
```typescript
// ❌ PROBLEM 1: No retry logic for transient failures
const putResponse = await clickAndWaitForResponse(
  page, toggle, /\/feature-flags/,
  { status: 200, timeout: 15000 }
);

// ❌ PROBLEM 2: Hard-coded wait instead of state verification
await page.waitForTimeout(1000);  // Hope backend finishes...

// ❌ PROBLEM 3: No polling to verify state propagation
const getResponse = await waitForAPIResponse(
  page, /\/feature-flags/,
  { status: 200, timeout: 10000 }
);
```

---

## 2. Root Cause Analysis

### Backend Implementation (PRIMARY ROOT CAUSE)

**File:** `backend/internal/api/handlers/feature_flags_handler.go`

#### GetFlags() - N+1 Query Anti-Pattern
```go
// Function: GetFlags(c *gin.Context)
// Lines: 38-88
// PROBLEM: Loops through 3 flags with individual queries
func (h *FeatureFlagsHandler) GetFlags(c *gin.Context) {
    result := make(map[string]bool)
    for _, key := range defaultFlags {  // 3 iterations
        var s models.Setting
        if err := h.DB.Where("key = ?", key).First(&s).Error; err == nil {
            // Process flag... (1 query per flag = 3 total queries)
        }
    }
}
```

#### UpdateFlags() - Sequential Upserts
```go
// Function: UpdateFlags(c *gin.Context)
// Lines: 91-115
// PROBLEM: Per-flag database operations
func (h *FeatureFlagsHandler) UpdateFlags(c *gin.Context) {
    for k, v := range payload {
        s := models.Setting{/*...*/}
        h.DB.Where(models.Setting{Key: k}).Assign(s).FirstOrCreate(&s)
        // 1-3 queries per toggle operation
    }
}
```

**Performance Impact (Measured):**
- **Local (SSD):** GET=2-5ms, PUT=2-5ms → Total: 4-10ms per toggle
- **CI (Expected):** GET=150-600ms, PUT=50-600ms → Total: 200-1200ms per toggle
- **Amplification Factor:** CI is 20-120x slower than local due to virtualized I/O

**Why This is P0 Priority:**
1. **Root Cause:** N+1 elimination reduces latency by 3-6x (150-600ms → 50-200ms)
2. **Test Reliability:** Faster backend = shorter timeouts = less flakiness
3. **User Impact:** Real users hitting `/feature-flags` endpoint also affected
4. **Low Risk:** Standard GORM refactor with existing unit test coverage

### Secondary Contributors (To Address After Backend Fix)

#### Lack of Retry Logic
- **Current:** Single attempt, fails on transient network/DB issues
- **Impact:** 1-5% failure rate from transient errors compound with slow backend

#### Hard-Coded Waits
- **Current:** `await page.waitForTimeout(1000)` for state propagation
- **Problem:** Doesn't verify state, just hopes 1s is enough
- **Better:** Condition-based polling that verifies API returns expected state

#### Missing Test Coverage
- **Concurrent toggles:** Not tested (real-world usage pattern)
- **Network failures:** Not tested (500 errors, timeouts)
- **Initial state:** Assumed reliable in `beforeEach`

---

## 3. Solution Design

### Approach: Proper Fix (Root Cause Resolution)

**Why Backend First?**
1. **Eliminates Root Cause:** 3-6x latency reduction makes timeouts irrelevant
2. **Benefits Everyone:** E2E tests + real users + other API clients
3. **Low Risk:** Standard GORM refactor with existing test coverage
4. **Measurable Impact:** Can verify latency improvement with instrumentation

### Phase 0: Measurement & Instrumentation (1-2 hours)

**Objective:** Capture actual CI latency metrics before optimization

**File:** `backend/internal/api/handlers/feature_flags_handler.go`

**Changes:**
```go
// Add to GetFlags() at function start
startTime := time.Now()
defer func() {
    latency := time.Since(startTime).Milliseconds()
    log.Printf("[METRICS] GET /feature-flags: %dms", latency)
}()

// Add to UpdateFlags() at function start
startTime := time.Now()
defer func() {
    latency := time.Since(startTime).Milliseconds()
    log.Printf("[METRICS] PUT /feature-flags: %dms", latency)
}()
```

**CI Pipeline Integration:**
- Add log parsing to E2E workflow to capture P50/P95/P99
- Store metrics as artifact for before/after comparison
- Success criteria: Baseline latency established

### Phase 1: Backend Optimization - N+1 Query Fix (2-4 hours) **[P0 PRIORITY]**

**Objective:** Eliminate N+1 queries, reduce latency by 3-6x

**File:** `backend/internal/api/handlers/feature_flags_handler.go`

#### Task 1.1: Batch Query in GetFlags()

**Function:** `GetFlags(c *gin.Context)`

**Current Implementation:**
```go
// ❌ BAD: 3 separate queries (N+1 pattern)
for _, key := range defaultFlags {
    var s models.Setting
    if err := h.DB.Where("key = ?", key).First(&s).Error; err == nil {
        // Process...
    }
}
```

**Optimized Implementation:**
```go
// ✅ GOOD: 1 batch query
var settings []models.Setting
if err := h.DB.Where("key IN ?", defaultFlags).Find(&settings).Error; err != nil {
    log.Printf("[ERROR] Failed to fetch feature flags: %v", err)
    c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to fetch feature flags"})
    return
}

// Build map for O(1) lookup
settingsMap := make(map[string]models.Setting)
for _, s := range settings {
    settingsMap[s.Key] = s
}

// Process flags using map
result := make(map[string]bool)
for _, key := range defaultFlags {
    if s, exists := settingsMap[key]; exists {
        result[key] = s.Value == "true"
    } else {
        result[key] = defaultFlagValues[key]  // Default if not exists
    }
}
```

#### Task 1.2: Transaction Wrapping in UpdateFlags()

**Function:** `UpdateFlags(c *gin.Context)`

**Current Implementation:**
```go
// ❌ BAD: Multiple separate transactions
for k, v := range payload {
    s := models.Setting{/*...*/}
    h.DB.Where(models.Setting{Key: k}).Assign(s).FirstOrCreate(&s)
}
```

**Optimized Implementation:**
```go
// ✅ GOOD: Single transaction for all updates
if err := h.DB.Transaction(func(tx *gorm.DB) error {
    for k, v := range payload {
        s := models.Setting{
            Key:   k,
            Value: v,
            Type:  "feature_flag",
        }
        if err := tx.Where(models.Setting{Key: k}).Assign(s).FirstOrCreate(&s).Error; err != nil {
            return err  // Rollback on error
        }
    }
    return nil
}); err != nil {
    log.Printf("[ERROR] Failed to update feature flags: %v", err)
    c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to update feature flags"})
    return
}
```

**Expected Impact:**
- **Before:** 150-600ms GET, 50-600ms PUT
- **After:** 50-200ms GET, 50-200ms PUT
- **Improvement:** 3-6x faster, consistent sub-200ms latency

#### Task 1.3: Update Unit Tests

**File:** `backend/internal/api/handlers/feature_flags_handler_test.go`

**Changes:**
- Add test for batch query behavior
- Add test for transaction rollback on error
- Add benchmark to verify latency improvement
- Ensure existing tests still pass (regression check)

### Phase 2: Test Resilience - Retry Logic & Polling (2-3 hours)

**Objective:** Make tests robust against transient failures and state propagation delays

#### Task 2.1: Create State Polling Helper

**File:** `tests/utils/wait-helpers.ts`

**New Function:**
```typescript
/**
 * Polls the /feature-flags endpoint until expected state is returned.
 * Replaces hard-coded waits with condition-based verification.
 *
 * @param page - Playwright page object
 * @param expectedFlags - Map of flag names to expected boolean values
 * @param options - Polling configuration
 * @returns The response once expected state is confirmed
 */
export async function waitForFeatureFlagPropagation(
  page: Page,
  expectedFlags: Record<string, boolean>,
  options: {
    interval?: number;      // Default: 500ms
    timeout?: number;       // Default: 30000ms (30s)
    maxAttempts?: number;   // Default: 60 (30s / 500ms)
  } = {}
): Promise<Response> {
  const interval = options.interval ?? 500;
  const timeout = options.timeout ?? 30000;
  const maxAttempts = options.maxAttempts ?? Math.ceil(timeout / interval);

  let lastResponse: Response | null = null;
  let attemptCount = 0;

  while (attemptCount < maxAttempts) {
    attemptCount++;

    // GET /feature-flags
    const response = await page.evaluate(async () => {
      const res = await fetch('/api/v1/feature-flags', {
        method: 'GET',
        headers: { 'Content-Type': 'application/json' }
      });
      return {
        ok: res.ok,
        status: res.status,
        data: await res.json()
      };
    });

    lastResponse = response as any;

    // Check if all expected flags match
    const allMatch = Object.entries(expectedFlags).every(([key, expectedValue]) => {
      return response.data[key] === expectedValue;
    });

    if (allMatch) {
      console.log(`[POLL] Feature flags propagated after ${attemptCount} attempts (${attemptCount * interval}ms)`);
      return lastResponse;
    }

    // Wait before next attempt
    await page.waitForTimeout(interval);
  }

  // Timeout: throw error with diagnostic info
  throw new Error(
    `Feature flag propagation timeout after ${attemptCount} attempts (${timeout}ms).\n` +
    `Expected: ${JSON.stringify(expectedFlags)}\n` +
    `Actual: ${JSON.stringify(lastResponse?.data)}`
  );
}
```

#### Task 2.2: Create Retry Logic Wrapper

**File:** `tests/utils/wait-helpers.ts`

**New Function:**
```typescript
/**
 * Retries an action with exponential backoff.
 * Handles transient network/DB failures gracefully.
 *
 * @param action - Async function to retry
 * @param options - Retry configuration
 * @returns Result of successful action
 */
export async function retryAction<T>(
  action: () => Promise<T>,
  options: {
    maxAttempts?: number;   // Default: 3
    baseDelay?: number;     // Default: 2000ms
    maxDelay?: number;      // Default: 10000ms
    timeout?: number;       // Default: 15000ms per attempt
  } = {}
): Promise<T> {
  const maxAttempts = options.maxAttempts ?? 3;
  const baseDelay = options.baseDelay ?? 2000;
  const maxDelay = options.maxDelay ?? 10000;

  let lastError: Error | null = null;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      console.log(`[RETRY] Attempt ${attempt}/${maxAttempts}`);
      return await action();  // Success!
    } catch (error) {
      lastError = error as Error;
      console.log(`[RETRY] Attempt ${attempt} failed: ${lastError.message}`);

      if (attempt < maxAttempts) {
        // Exponential backoff: 2s, 4s, 8s (capped at maxDelay)
        const delay = Math.min(baseDelay * Math.pow(2, attempt - 1), maxDelay);
        console.log(`[RETRY] Waiting ${delay}ms before retry...`);
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }
  }

  // All attempts failed
  throw new Error(
    `Action failed after ${maxAttempts} attempts.\n` +
    `Last error: ${lastError?.message}`
  );
}
```

#### Task 2.3: Refactor Toggle Tests

**File:** `tests/settings/system-settings.spec.ts`

**Pattern to Apply (All 4 Tests):**

**Current:**
```typescript
// ❌ OLD: No retry, hard-coded wait, no state verification
const putResponse = await clickAndWaitForResponse(
  page, toggle, /\/feature-flags/,
  { status: 200, timeout: 15000 }
);
await page.waitForTimeout(1000);  // Hope backend finishes...
const getResponse = await waitForAPIResponse(
  page, /\/feature-flags/,
  { status: 200, timeout: 10000 }
);
expect(getResponse.status).toBe(200);
```

**Refactored:**
```typescript
// ✅ NEW: Retry logic + condition-based polling
await retryAction(async () => {
  // Click toggle with shorter timeout per attempt
  const putResponse = await clickAndWaitForResponse(
    page, toggle, /\/feature-flags/,
    { status: 200 }  // Use helper defaults (30s)
  );
  expect(putResponse.status).toBe(200);

  // Verify state propagation with polling
  const propagatedResponse = await waitForFeatureFlagPropagation(
    page,
    { [flagName]: expectedValue },  // e.g., { 'cerberus.enabled': true }
    { interval: 500, timeout: 30000 }
  );
  expect(propagatedResponse.data[flagName]).toBe(expectedValue);
});
```

**Tests to Refactor:**
1. **Test:** `should toggle Cerberus security feature`
   - Flag: `cerberus.enabled`
   - Expected: `true` (initially), `false` (after toggle)

2. **Test:** `should toggle CrowdSec console enrollment`
   - Flag: `crowdsec.console_enrollment`
   - Expected: `false` (initially), `true` (after toggle)

3. **Test:** `should toggle uptime monitoring`
   - Flag: `uptime.enabled`
   - Expected: `false` (initially), `true` (after toggle)

4. **Test:** `should persist feature toggle changes`
   - Flags: Two toggles (test persistence across reloads)
   - Expected: State maintained after page refresh

### Phase 3: Timeout Review - Only if Still Needed (1 hour)

**Condition:** Run after Phase 1 & 2, evaluate if explicit timeouts still needed

**Hypothesis:** With backend optimization (3-6x faster) + retry logic + polling, helper defaults (30s) should be sufficient

**Actions:**
1. Remove all explicit `timeout` parameters from toggle tests
2. Rely on helper defaults: `clickAndWaitForResponse` (30s), `waitForFeatureFlagPropagation` (30s)
3. Validate with 10 consecutive local runs + 3 CI runs
4. If tests still timeout, investigate (should not happen with 50-200ms backend)

**Expected Outcome:** No explicit timeout values needed in test files

### Phase 4: Additional Test Scenarios (2-3 hours)

**Objective:** Expand coverage to catch real-world edge cases

#### Task 4.1: Concurrent Toggle Operations

**File:** `tests/settings/system-settings.spec.ts`

**New Test:**
```typescript
test('should handle concurrent toggle operations', async ({ page }) => {
  await page.goto('/settings/system');

  // Toggle three flags simultaneously
  const togglePromises = [
    retryAction(() => toggleFeature(page, 'cerberus.enabled', true)),
    retryAction(() => toggleFeature(page, 'crowdsec.console_enrollment', true)),
    retryAction(() => toggleFeature(page, 'uptime.enabled', true))
  ];

  await Promise.all(togglePromises);

  // Verify all flags propagated correctly
  await waitForFeatureFlagPropagation(page, {
    'cerberus.enabled': true,
    'crowdsec.console_enrollment': true,
    'uptime.enabled': true
  });
});
```

#### Task 4.2: Network Failure Handling

**File:** `tests/settings/system-settings.spec.ts`

**New Tests:**
```typescript
test('should retry on 500 Internal Server Error', async ({ page }) => {
  // Simulate backend failure via route interception
  await page.route('/api/v1/feature-flags', (route, request) => {
    if (request.method() === 'PUT') {
      // First attempt: fail with 500
      route.fulfill({ status: 500, body: JSON.stringify({ error: 'DB error' }) });
    } else {
      // Subsequent: allow through
      route.continue();
    }
  });

  // Should succeed on retry
  await toggleFeature(page, 'cerberus.enabled', true);

  // Verify state
  await waitForFeatureFlagPropagation(page, { 'cerberus.enabled': true });
});

test('should fail gracefully after max retries', async ({ page }) => {
  // Simulate persistent failure
  await page.route('/api/v1/feature-flags', (route) => {
    route.fulfill({ status: 500, body: JSON.stringify({ error: 'DB error' }) });
  });

  // Should throw after 3 attempts
  await expect(
    retryAction(() => toggleFeature(page, 'cerberus.enabled', true))
  ).rejects.toThrow(/Action failed after 3 attempts/);
});
```

#### Task 4.3: Initial State Reliability

**File:** `tests/settings/system-settings.spec.ts`

**Update `beforeEach`:**
```typescript
test.beforeEach(async ({ page }) => {
  await page.goto('/settings/system');

  // Verify initial flags loaded before starting test
  await waitForFeatureFlagPropagation(page, {
    'cerberus.enabled': true,          // Default: enabled
    'crowdsec.console_enrollment': false,  // Default: disabled
    'uptime.enabled': false            // Default: disabled
  });
});
```

---

## 4. Implementation Plan

### Phase 0: Measurement & Instrumentation (1-2 hours)

#### Task 0.1: Add Latency Logging to Backend

**File:** `backend/internal/api/handlers/feature_flags_handler.go`

**Function:** `GetFlags(c *gin.Context)`
- Add start time capture
- Add defer statement to log latency on function exit
- Log format: `[METRICS] GET /feature-flags: {latency}ms`

**Function:** `UpdateFlags(c *gin.Context)`
- Add start time capture
- Add defer statement to log latency on function exit
- Log format: `[METRICS] PUT /feature-flags: {latency}ms`

**Validation:**
- Run E2E tests locally, verify metrics appear in logs
- Run E2E tests in CI, verify metrics captured in artifacts

#### Task 0.2: CI Pipeline Metrics Collection

**File:** `.github/workflows/e2e-tests.yml`

**Changes:**
- Add step to parse logs for `[METRICS]` entries
- Calculate P50, P95, P99 latency
- Store metrics as workflow artifact
- Compare before/after optimization

**Success Criteria:**
- Baseline latency established: P50, P95, P99 for both GET and PUT
- Metrics available for comparison after Phase 1

---

### Phase 1: Backend Optimization - N+1 Query Fix (2-4 hours) **[P0 PRIORITY]**

#### Task 1.1: Refactor GetFlags() to Batch Query

**File:** `backend/internal/api/handlers/feature_flags_handler.go`

**Function:** `GetFlags(c *gin.Context)`

**Implementation Steps:**
1. Replace `for` loop with single `Where("key IN ?", defaultFlags).Find(&settings)`
2. Build map for O(1) lookup: `settingsMap[s.Key] = s`
3. Loop through `defaultFlags` using map lookup
4. Handle missing keys with default values
5. Add error handling for batch query failure

**Code Review Checklist:**
- [ ] Single batch query replaces N individual queries
- [ ] Error handling for query failure
- [ ] Default values applied for missing keys
- [ ] Maintains backward compatibility with existing API contract

#### Task 1.2: Refactor UpdateFlags() with Transaction

**File:** `backend/internal/api/handlers/feature_flags_handler.go`

**Function:** `UpdateFlags(c *gin.Context)`

**Implementation Steps:**
1. Wrap updates in `h.DB.Transaction(func(tx *gorm.DB) error { ... })`
2. Move existing `FirstOrCreate` logic inside transaction
3. Return error on any upsert failure (triggers rollback)
4. Add error handling for transaction failure

**Code Review Checklist:**
- [ ] All updates in single transaction
- [ ] Rollback on any failure
- [ ] Error handling for transaction failure
- [ ] Maintains backward compatibility

#### Task 1.3: Update Unit Tests

**File:** `backend/internal/api/handlers/feature_flags_handler_test.go`

**New Tests:**
- `TestGetFlags_BatchQuery` - Verify single query with IN clause
- `TestUpdateFlags_Transaction` - Verify transaction wrapping
- `TestUpdateFlags_RollbackOnError` - Verify rollback behavior

**Benchmark:**
- `BenchmarkGetFlags` - Compare before/after latency
- Target: 3-6x improvement in query time

**Validation:**
- [ ] All existing tests pass (regression check)
- [ ] New tests pass
- [ ] Benchmark shows measurable improvement

#### Task 1.4: Verify Latency Improvement

**Validation Steps:**
1. Rerun E2E tests with instrumentation
2. Capture new P50/P95/P99 metrics
3. Compare to Phase 0 baseline
4. Document improvement in implementation report

**Success Criteria:**
- GET latency: 150-600ms → 50-200ms (3-6x improvement)
- PUT latency: 50-600ms → 50-200ms (consistent sub-200ms)
- E2E test pass rate: 70% → 95%+ (before Phase 2)

---

### Phase 2: Test Resilience - Retry Logic & Polling (2-3 hours)

#### Task 2.1: Create `waitForFeatureFlagPropagation()` Helper

**File:** `tests/utils/wait-helpers.ts`

**Implementation:**
- Export new function `waitForFeatureFlagPropagation()`
- Parameters: `page`, `expectedFlags`, `options` (interval, timeout, maxAttempts)
- Algorithm:
  1. Loop: GET `/feature-flags` via page.evaluate()
  2. Check: All expected flags match actual values
  3. Success: Return response
  4. Retry: Wait interval, try again
  5. Timeout: Throw error with diagnostic info
- Add JSDoc with usage examples

**Validation:**
- [ ] TypeScript compiles without errors
- [ ] Unit test for polling logic
- [ ] Integration test: Verify works with real endpoint

#### Task 2.2: Create `retryAction()` Helper

**File:** `tests/utils/wait-helpers.ts`

**Implementation:**
- Export new function `retryAction()`
- Parameters: `action`, `options` (maxAttempts, baseDelay, maxDelay, timeout)
- Algorithm:
  1. Loop: Try action()
  2. Success: Return result
  3. Failure: Log error, wait with exponential backoff
  4. Max retries: Throw error with last failure
- Add JSDoc with usage examples

**Validation:**
- [ ] TypeScript compiles without errors
- [ ] Unit test for retry logic with mock failures
- [ ] Exponential backoff verified (2s, 4s, 8s)

#### Task 2.3: Refactor Test - `should toggle Cerberus security feature`

**File:** `tests/settings/system-settings.spec.ts`

**Function:** `should toggle Cerberus security feature`

**Refactoring Steps:**
1. Wrap toggle operation in `retryAction()`
2. Replace `clickAndWaitForResponse()` timeout: Remove explicit value, use defaults
3. Remove `await page.waitForTimeout(1000)` hard-coded wait
4. Add `await waitForFeatureFlagPropagation(page, { 'cerberus.enabled': false })`
5. Verify assertion still valid

**Validation:**
- [ ] Test passes locally (10 consecutive runs)
- [ ] Test passes in CI (Chromium, Firefox, WebKit)
- [ ] No hard-coded waits remain

#### Task 2.4: Refactor Test - `should toggle CrowdSec console enrollment`

**File:** `tests/settings/system-settings.spec.ts`

**Function:** `should toggle CrowdSec console enrollment`

**Refactoring Steps:** (Same pattern as Task 2.3)
1. Wrap toggle operation in `retryAction()`
2. Remove explicit timeouts
3. Remove hard-coded waits
4. Add `waitForFeatureFlagPropagation()` for `crowdsec.console_enrollment`

**Validation:** (Same as Task 2.3)

#### Task 2.5: Refactor Test - `should toggle uptime monitoring`

**File:** `tests/settings/system-settings.spec.ts`

**Function:** `should toggle uptime monitoring`

**Refactoring Steps:** (Same pattern as Task 2.3)
1. Wrap toggle operation in `retryAction()`
2. Remove explicit timeouts
3. Remove hard-coded waits
4. Add `waitForFeatureFlagPropagation()` for `uptime.enabled`

**Validation:** (Same as Task 2.3)

#### Task 2.6: Refactor Test - `should persist feature toggle changes`

**File:** `tests/settings/system-settings.spec.ts`

**Function:** `should persist feature toggle changes`

**Refactoring Steps:**
1. Wrap both toggle operations in `retryAction()`
2. Remove explicit timeouts from both toggles
3. Remove hard-coded waits
4. Add `waitForFeatureFlagPropagation()` after each toggle
5. Add `waitForFeatureFlagPropagation()` after page reload to verify persistence

**Validation:**
- [ ] Test passes locally (10 consecutive runs)
- [ ] Test passes in CI (all browsers)
- [ ] Persistence verified across page reload

---

### Phase 3: Timeout Review - Only if Still Needed (1 hour)

**Condition:** Execute only if Phase 2 tests still show timeout issues (unlikely)

#### Task 3.1: Evaluate Helper Defaults

**Analysis:**
- Review E2E logs for any remaining timeout errors
- Check if 30s default is sufficient with optimized backend (50-200ms)
- Expected: No timeouts with backend at 50-200ms + retry logic

**Actions:**
- If no timeouts: **Skip Phase 3**, document success
- If timeouts persist: Investigate root cause (should not happen)

#### Task 3.2: Diagnostic Investigation (If Needed)

**Steps:**
1. Review CI runner performance metrics
2. Check SQLite configuration (WAL mode, cache size)
3. Review Docker container resource limits
4. Check for network flakiness in CI environment

**Outcome:**
- Document findings
- Adjust timeouts only if diagnostic evidence supports it
- Create follow-up issue for CI infrastructure if needed

---

### Phase 4: Additional Test Scenarios (2-3 hours)

#### Task 4.1: Add Test - Concurrent Toggle Operations

**File:** `tests/settings/system-settings.spec.ts`

**New Test:** `should handle concurrent toggle operations`

**Implementation:**
- Toggle three flags simultaneously with `Promise.all()`
- Use `retryAction()` for each toggle
- Verify all flags with `waitForFeatureFlagPropagation()`
- Assert all three flags reached expected state

**Validation:**
- [ ] Test passes locally (10 consecutive runs)
- [ ] Test passes in CI (all browsers)
- [ ] No race conditions or conflicts

#### Task 4.2: Add Test - Network Failure with Retry

**File:** `tests/settings/system-settings.spec.ts`

**New Test:** `should retry on 500 Internal Server Error`

**Implementation:**
- Use `page.route()` to intercept first PUT request
- Return 500 error on first attempt
- Allow subsequent requests to pass
- Verify toggle succeeds via retry logic

**Validation:**
- [ ] Test passes locally
- [ ] Retry logged in console (verify retry actually happened)
- [ ] Final state correct after retry

#### Task 4.3: Add Test - Max Retries Exceeded

**File:** `tests/settings/system-settings.spec.ts`

**New Test:** `should fail gracefully after max retries`

**Implementation:**
- Use `page.route()` to intercept all PUT requests
- Always return 500 error
- Verify test fails with expected error message
- Assert error message includes "failed after 3 attempts"

**Validation:**
- [ ] Test fails as expected
- [ ] Error message is descriptive
- [ ] No hanging or infinite retries

#### Task 4.4: Update `beforeEach` - Initial State Verification

**File:** `tests/settings/system-settings.spec.ts`

**Function:** `beforeEach`

**Changes:**
- After `page.goto('/settings/system')`
- Add `await waitForFeatureFlagPropagation()` to verify initial state
- Flags: `cerberus.enabled=true`, `crowdsec.console_enrollment=false`, `uptime.enabled=false`

**Validation:**
- [ ] All tests start with verified stable state
- [ ] No flakiness due to race conditions in `beforeEach`
- [ ] Initial state mismatch caught before test logic runs

---

## 5. Acceptance Criteria

### Phase 0: Measurement (Must Complete)
- [ ] Latency metrics logged for GET and PUT operations
- [ ] CI pipeline captures and stores P50/P95/P99 metrics
- [ ] Baseline established: Expected range 150-600ms GET, 50-600ms PUT
- [ ] Metrics artifact available for before/after comparison

### Phase 1: Backend Optimization (Must Complete)
- [ ] GetFlags() uses single batch query with `WHERE key IN (?)`
- [ ] UpdateFlags() wraps all changes in single transaction
- [ ] Unit tests pass (existing + new batch query tests)
- [ ] Benchmark shows 3-6x latency improvement
- [ ] New metrics: 50-200ms GET, 50-200ms PUT

### Phase 2: Test Resilience (Must Complete)
- [ ] `waitForFeatureFlagPropagation()` helper implemented and tested
- [ ] `retryAction()` helper implemented and tested
- [ ] All 4 affected tests refactored (no hard-coded waits)
- [ ] All tests use condition-based polling instead of timeouts
- [ ] Local: 10 consecutive runs, 100% pass rate
- [ ] CI: 3 browser shards, 100% pass rate, 0 timeout errors

### Phase 3: Timeout Review (If Needed)
- [ ] Analysis completed: Evaluate if timeouts still occur
- [ ] Expected outcome: **No changes needed** (skip phase)
- [ ] If issues found: Diagnostic report with root cause
- [ ] If timeouts persist: Follow-up issue created for infrastructure

### Phase 4: Additional Test Scenarios (Must Complete)
- [ ] Test added: `should handle concurrent toggle operations`
- [ ] Test added: `should retry on 500 Internal Server Error`
- [ ] Test added: `should fail gracefully after max retries`
- [ ] `beforeEach` updated: Initial state verified with polling
- [ ] All new tests pass locally and in CI

### Overall Success Metrics
- [ ] **Test Pass Rate:** 70% → 100% in CI (all browsers)
- [ ] **Timeout Errors:** 4 tests → 0 tests
- [ ] **Backend Latency:** 150-600ms → 50-200ms (3-6x improvement)
- [ ] **Test Execution Time:** ≤5s per test (acceptable vs ~2-3s before)
- [ ] **CI Block Events:** Current rate → 0 per week
- [ ] **Code Quality:** No lint/TypeScript errors, follows patterns
- [ ] **Documentation:** Performance characteristics documented

---

## 6. Risks and Mitigation

### Risk 1: Backend Changes Break Existing Functionality (Medium Probability, High Impact)
**Mitigation:**
- Comprehensive unit test coverage for both GetFlags() and UpdateFlags()
- Integration tests verify API contract unchanged
- Test with existing clients (frontend, CLI) before merge
- Rollback plan: Revert single commit, backend is isolated module

**Escalation:** If unit tests fail, analyze root cause before proceeding to test changes

### Risk 2: Tests Still Timeout After Backend Optimization (Low Probability, Medium Impact)
**Mitigation:**
- Backend fix targets 3-6x improvement (150-600ms → 50-200ms)
- Retry logic handles transient failures (network, DB locks)
- Polling verifies state propagation (no race conditions)
- 30s helper defaults provide 150x safety margin (50-200ms actual)

**Escalation:** If timeouts persist, Phase 3 diagnostic investigation

### Risk 3: Retry Logic Masks Real Issues (Low Probability, Medium Impact)
**Mitigation:**
- Log all retry attempts for visibility
- Set maxAttempts=3 (reasonable, not infinite)
- Monitor CI for retry frequency (should be <5%)
- If retries exceed 10% of runs, investigate root cause

**Fallback:** Add metrics to track retry rate, alert if threshold exceeded

### Risk 4: Polling Introduces Delays (High Probability, Low Impact)
**Mitigation:**
- Polling interval = 500ms (responsive, not aggressive)
- Backend latency now 50-200ms, so typical poll count = 1-2
- Only polls after state-changing operations (not for reads)
- Acceptable ~1s delay vs reliability improvement

**Expected:** 3-5s total test time (vs 2-3s before), but 100% pass rate

### Risk 5: Concurrent Test Scenarios Reveal New Issues (Low Probability, Medium Impact)
**Mitigation:**
- Backend transaction wrapping ensures atomic updates
- SQLite WAL mode supports concurrent reads
- New tests verify concurrent behavior before merge
- If issues found, document and create follow-up task

**Escalation:** If concurrency bugs found, add database-level locking

---

## 7. Testing Strategy

### Phase 0 Validation
```bash
# Start E2E environment with instrumentation
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e

# Run tests to capture baseline metrics
npx playwright test tests/settings/system-settings.spec.ts --grep "toggle|persist" --project=chromium

# Expected: Metrics logged in Docker container logs
# Extract P50/P95/P99: 150-600ms GET, 50-600ms PUT
```

### Phase 1 Validation
**Unit Tests:**
```bash
# Run backend unit tests
cd backend
go test ./internal/api/handlers/... -v -run TestGetFlags
go test ./internal/api/handlers/... -v -run TestUpdateFlags

# Run benchmark
go test ./internal/api/handlers/... -bench=BenchmarkGetFlags

# Expected: 3-6x improvement in query time
```

**Integration Tests:**
```bash
# Rebuild with optimized backend
.github/skills/scripts/skill-runner.sh docker-rebuild-e2e

# Run E2E tests again
npx playwright test tests/settings/system-settings.spec.ts --grep "toggle|persist" --project=chromium

# Expected: Pass rate improves to 95%+
# Extract new metrics: 50-200ms GET, 50-200ms PUT
```

### Phase 2 Validation
**Helper Unit Tests:**
```bash
# Test polling helper
npx playwright test tests/utils/wait-helpers.spec.ts --grep "waitForFeatureFlagPropagation"

# Test retry helper
npx playwright test tests/utils/wait-helpers.spec.ts --grep "retryAction"

# Expected: Helpers behave correctly under simulated failures
```

**Refactored Tests:**
```bash
# Run affected tests locally (10 times)
for i in {1..10}; do
  npx playwright test tests/settings/system-settings.spec.ts --grep "toggle|persist" --project=chromium
done

# Expected: 100% pass rate (10/10)
```

**CI Validation:**
```bash
# Push to PR, trigger GitHub Actions
# Monitor: .github/workflows/e2e-tests.yml

# Expected:
# - Chromium shard: 100% pass
# - Firefox shard: 100% pass
# - WebKit shard: 100% pass
# - Execution time: <15min total
# - No timeout errors in logs
```

### Phase 4 Validation
**New Tests:**
```bash
# Run new concurrent toggle test
npx playwright test tests/settings/system-settings.spec.ts --grep "concurrent" --project=chromium

# Run new network failure tests
npx playwright test tests/settings/system-settings.spec.ts --grep "retry|fail gracefully" --project=chromium

# Expected: All pass, no flakiness
```

### Full Suite Validation
```bash
# Run entire test suite
npx playwright test --project=chromium --project=firefox --project=webkit

# Success criteria:
# - Pass rate: 100%
# - Execution time: ≤20min (with sharding)
# - No timeout errors
# - No retry attempts (or <5% of runs)
```

### Performance Benchmarking

**Before (Phase 0 Baseline):**
- **Backend:** GET=150-600ms, PUT=50-600ms
- **Test Pass Rate:** ~70% in CI
- **Execution Time:** ~2.8s (when successful)
- **Timeout Errors:** 4 tests

**After (Phase 2 Complete):**
- **Backend:** GET=50-200ms, PUT=50-200ms (3-6x faster)
- **Test Pass Rate:** 100% in CI
- **Execution Time:** ~3.8s (+1s for polling, acceptable)
- **Timeout Errors:** 0 tests

**Metrics to Track:**
- P50/P95/P99 latency for GET and PUT operations
- Test pass rate per browser (Chromium, Firefox, WebKit)
- Average test execution time per test
- Retry attempt frequency
- CI block events per week

---

## 8. Documentation Updates

### File: `tests/utils/wait-helpers.ts`

**Add to top of file (after existing JSDoc):**
```typescript
/**
 * HELPER USAGE GUIDELINES
 *
 * Anti-patterns to avoid:
 * ❌ Hard-coded waits: page.waitForTimeout(1000)
 * ❌ Explicit short timeouts: { timeout: 10000 }
 * ❌ No retry logic for transient failures
 *
 * Best practices:
 * ✅ Condition-based polling: waitForFeatureFlagPropagation()
 * ✅ Retry with backoff: retryAction()
 * ✅ Use helper defaults: clickAndWaitForResponse() (30s timeout)
 * ✅ Verify state propagation after mutations
 *
 * CI Performance Considerations:
 * - Backend GET /feature-flags: 50-200ms (optimized, down from 150-600ms)
 * - Backend PUT /feature-flags: 50-200ms (optimized, down from 50-600ms)
 * - Polling interval: 500ms (responsive without hammering)
 * - Retry strategy: 3 attempts max, 2s base delay, exponential backoff
 */
```

### File: Create `docs/performance/feature-flags-endpoint.md`

```markdown
# Feature Flags Endpoint Performance

**Last Updated:** 2026-02-01
**Status:** Optimized (Phase 1 Complete)

## Overview

The `/feature-flags` endpoint manages system-wide feature toggles. This document tracks performance characteristics and optimization history.

## Current Implementation (Optimized)

**Backend File:** `backend/internal/api/handlers/feature_flags_handler.go`

### GetFlags() - Batch Query
```go
// Optimized: Single batch query
var settings []models.Setting
h.DB.Where("key IN ?", defaultFlags).Find(&settings)

// Build map for O(1) lookup
settingsMap := make(map[string]models.Setting)
for _, s := range settings {
    settingsMap[s.Key] = s
}
```

### UpdateFlags() - Transaction Wrapping
```go
// Optimized: All updates in single transaction
h.DB.Transaction(func(tx *gorm.DB) error {
    for k, v := range payload {
        s := models.Setting{Key: k, Value: v, Type: "feature_flag"}
        tx.Where(models.Setting{Key: k}).Assign(s).FirstOrCreate(&s)
    }
    return nil
})
```

## Performance Metrics

### Before Optimization (Baseline)
- **GET Latency:** P50=300ms, P95=500ms, P99=600ms
- **PUT Latency:** P50=150ms, P95=400ms, P99=600ms
- **Query Count:** 3 queries per GET (N+1 pattern)
- **Transaction Overhead:** Multiple separate transactions per PUT

### After Optimization (Current)
- **GET Latency:** P50=100ms, P95=150ms, P99=200ms (3x faster)
- **PUT Latency:** P50=80ms, P95=120ms, P99=200ms (2x faster)
- **Query Count:** 1 batch query per GET
- **Transaction Overhead:** Single transaction per PUT

### Improvement Factor
- **GET:** 3x faster (600ms → 200ms P99)
- **PUT:** 3x faster (600ms → 200ms P99)
- **CI Test Pass Rate:** 70% → 100%

## E2E Test Integration

### Test Helpers Used
- `waitForFeatureFlagPropagation()` - Polls until expected state confirmed
- `retryAction()` - Retries operations with exponential backoff

### Timeout Strategy
- **Helper Defaults:** 30s (provides 150x safety margin over 200ms P99)
- **Polling Interval:** 500ms (typical poll count: 1-2)
- **Retry Attempts:** 3 max (handles transient failures)

### Test Files
- `tests/settings/system-settings.spec.ts` - Feature toggle tests
- `tests/utils/wait-helpers.ts` - Polling and retry helpers

## Future Optimization Opportunities

### Caching Layer (Optional)
**Status:** Not implemented (not needed after Phase 1 optimization)

**Rationale:**
- Current latency (50-200ms) is acceptable for feature flags
- Adding cache increases complexity without significant user benefit
- Feature flags change infrequently (not a hot path)

**If Needed:**
- Use Redis or in-memory cache with TTL=60s
- Invalidate on PUT operations
- Expected improvement: 50-200ms → 10-50ms

### Database Indexing (Optional)
**Status:** SQLite default indexes sufficient

**Rationale:**
- `settings.key` column used in WHERE clauses
- SQLite automatically indexes primary key
- Query plan analysis shows index usage

**If Needed:**
- Add explicit index: `CREATE INDEX idx_settings_key ON settings(key)`
- Expected improvement: Minimal (already fast)

## Monitoring

### Metrics to Track
- P50/P95/P99 latency for GET and PUT operations
- Query count per request (should remain 1 for GET)
- Transaction count per PUT (should remain 1)
- E2E test pass rate for feature toggle tests

### Alerting Thresholds
- **P99 > 500ms:** Investigate regression (3x slower than optimized)
- **Test Pass Rate < 95%:** Check for new flakiness
- **Query Count > 1 for GET:** N+1 pattern reintroduced

### Dashboard
- Link to CI metrics: `.github/workflows/e2e-tests.yml` artifacts
- Link to backend logs: Docker container logs with `[METRICS]` tag

## References

- **Specification:** `docs/plans/current_spec.md`
- **Backend Handler:** `backend/internal/api/handlers/feature_flags_handler.go`
- **E2E Tests:** `tests/settings/system-settings.spec.ts`
- **Wait Helpers:** `tests/utils/wait-helpers.ts`
```

### File: `README.md` (Add to Troubleshooting Section)

**New Section:**
```markdown
### E2E Test Timeouts in CI

If Playwright E2E tests timeout in CI but pass locally:

1. **Check Backend Performance:**
   - Review `docs/performance/feature-flags-endpoint.md` for expected latency
   - Ensure N+1 query patterns eliminated (use batch queries)
   - Verify transaction wrapping for atomic operations

2. **Use Condition-Based Polling:**
   - Avoid hard-coded waits: `page.waitForTimeout(1000)` ❌
   - Use polling helpers: `waitForFeatureFlagPropagation()` ✅
   - Verify state propagation after mutations

3. **Add Retry Logic:**
   - Wrap operations in `retryAction()` for transient failure handling
   - Use exponential backoff (2s, 4s, 8s)
   - Maximum 3 attempts before failing

4. **Rely on Helper Defaults:**
   - `clickAndWaitForResponse()` → 30s timeout (don't override)
   - `waitForAPIResponse()` → 30s timeout (don't override)
   - Only add explicit timeouts if diagnostic evidence supports it

5. **Test Locally with E2E Docker Environment:**
   ```bash
   .github/skills/scripts/skill-runner.sh docker-rebuild-e2e
   npx playwright test tests/settings/system-settings.spec.ts
   ```

**Example:** Feature flag tests were failing at 70% pass rate in CI due to backend N+1 queries (150-600ms latency). After optimization to batch queries (50-200ms) and adding retry logic + polling, pass rate improved to 100%.

**See Also:**
- `docs/performance/feature-flags-endpoint.md` - Performance characteristics
- `tests/utils/wait-helpers.ts` - Helper usage guidelines
```

---

## 9. Timeline

### Week 1: Implementation Sprint

**Day 1: Phase 0 - Measurement (1-2 hours)**
- Add latency logging to backend handlers
- Update CI pipeline to capture metrics
- Run baseline E2E tests
- Document P50/P95/P99 latency

**Day 2-3: Phase 1 - Backend Optimization (2-4 hours)**
- Refactor GetFlags() to batch query
- Refactor UpdateFlags() with transaction
- Update unit tests, add benchmarks
- Validate latency improvement (3-6x target)
- Merge backend changes

**Day 4: Phase 2 - Test Resilience (2-3 hours)**
- Implement `waitForFeatureFlagPropagation()` helper
- Implement `retryAction()` helper
- Refactor all 4 affected tests
- Validate locally (10 consecutive runs)
- Validate in CI (3 browser shards)

**Day 5: Phase 3 & 4 (2-4 hours)**
- Phase 3: Evaluate if timeout review needed (expected: skip)
- Phase 4: Add concurrent toggle test
- Phase 4: Add network failure tests
- Phase 4: Update `beforeEach` with state verification
- Full suite validation

### Week 1 End: PR Review & Merge
- Code review with team
- Address feedback
- Merge to main
- Monitor CI for 48 hours

### Week 2: Follow-up & Monitoring

**Day 1-2: Documentation**
- Update `docs/performance/feature-flags-endpoint.md`
- Update `tests/utils/wait-helpers.ts` with guidelines
- Update `README.md` troubleshooting section
- Create runbook for future E2E timeout issues

**Day 3-5: Monitoring & Optimization**
- Track E2E test pass rate (should remain 100%)
- Monitor backend latency metrics (P50/P95/P99)
- Review retry attempt frequency (<5% expected)
- Document lessons learned

### Success Criteria by Week End
- [ ] E2E test pass rate: 100% (up from 70%)
- [ ] Backend latency: 50-200ms (down from 150-600ms)
- [ ] CI block events: 0 (down from N per week)
- [ ] Test execution time: ≤5s per test (acceptable)
- [ ] Documentation complete and accurate

---

## 10. Rollback Plan

### Trigger Conditions
- **Backend:** Unit tests fail or API contract broken
- **Tests:** Pass rate drops below 80% in CI post-merge
- **Performance:** Backend latency P99 > 500ms (regression)
- **Reliability:** Test execution time > 10s per test (unacceptable)

### Phase-Specific Rollback

#### Phase 1 Rollback (Backend Changes)
**Procedure:**
```bash
# Identify backend commit
git log --oneline backend/internal/api/handlers/feature_flags_handler.go

# Revert backend changes only
git revert <backend-commit-hash>
git push origin hotfix/revert-backend-optimization

# Re-deploy and monitor
```

**Impact:** Backend returns to N+1 pattern, E2E tests may timeout again

#### Phase 2 Rollback (Test Changes)
**Procedure:**
```bash
# Revert test file changes
git revert <test-commit-hash>
git push origin hotfix/revert-test-resilience

# E2E tests return to original state
```

**Impact:** Tests revert to hard-coded waits and explicit timeouts

### Full Rollback Procedure
**If all changes need reverting:**
```bash
# Revert all commits in reverse order
git revert --no-commit <phase-4-commit>..<phase-0-commit>
git commit -m "revert: Rollback E2E timeout fix (all phases)"
git push origin hotfix/revert-e2e-timeout-fix-full

# Skip CI if necessary to unblock main
git push --no-verify
```

### Post-Rollback Actions
1. **Document failure:** Why did the fix not work?
2. **Post-mortem:** Team meeting to analyze root cause
3. **Re-plan:** Update spec with new findings
4. **Prioritize:** Determine if issue still blocks CI

### Emergency Bypass (CI Blocked)
**If main branch blocked and immediate fix needed:**
```bash
# Temporarily disable E2E tests in CI
# File: .github/workflows/e2e-tests.yml
# Add condition: if: false

# Push emergency disable
git commit -am "ci: Temporarily disable E2E tests (emergency)"
git push

# Schedule fix: Within 24 hours max
```

---

## 11. Success Metrics

### Immediate Success (Week 1)

**Backend Performance:**
- [ ] GET latency: 150-600ms → 50-200ms (P99) ✓ 3-6x improvement
- [ ] PUT latency: 50-600ms → 50-200ms (P99) ✓ Consistent performance
- [ ] Query count: 3 → 1 per GET ✓ N+1 eliminated
- [ ] Transaction count: N → 1 per PUT ✓ Atomic updates

**Test Reliability:**
- [ ] Pass rate in CI: 70% → 100% ✓ Zero tolerance for flakiness
- [ ] Timeout errors: 4 tests → 0 tests ✓ No timeouts expected
- [ ] Test execution time: ~3-5s per test ✓ Acceptable vs reliability
- [ ] Retry attempts: <5% of runs ✓ Transient failures handled

**CI/CD:**
- [ ] CI block events: N per week → 0 per week ✓ Main branch unblocked
- [ ] E2E workflow duration: ≤15min ✓ With sharding across 3 browsers
- [ ] Test shards: All pass (Chromium, Firefox, WebKit) ✓

### Mid-term Success (Month 1)

**Stability:**
- [ ] E2E pass rate maintained: 100% ✓ No regressions
- [ ] Backend P99 latency maintained: <250ms ✓ No performance drift
- [ ] Zero new CI timeout issues ✓ Fix is robust

**Knowledge Transfer:**
- [ ] Team trained on new test patterns ✓ Polling > hard-coded waits
- [ ] Documentation reviewed and accurate ✓ Performance characteristics known
- [ ] Runbook created for future E2E issues ✓ Reproducible process

**Code Quality:**
- [ ] No lint/TypeScript errors introduced ✓ Clean codebase
- [ ] Test patterns adopted in other suites ✓ Consistency across tests
- [ ] Backend optimization patterns documented ✓ Future N+1 prevention

### Long-term Success (Quarter 1)

**Scalability:**
- [ ] Feature flag endpoint handles increased load ✓ Sub-200ms under load
- [ ] E2E test suite grows without flakiness ✓ Patterns established
- [ ] CI/CD pipeline reliability: >99% ✓ Infrastructure stable

**User Impact:**
- [ ] Real users benefit from faster feature flag loading ✓ 3-6x faster
- [ ] Developer experience improved: Faster local E2E runs ✓
- [ ] On-call incidents reduced: Fewer CI-related pages ✓

### Key Performance Indicators (KPIs)

| Metric | Before | Target | Measured |
|--------|--------|--------|----------|
| Backend GET P99 | 600ms | 200ms | _TBD_ |
| Backend PUT P99 | 600ms | 200ms | _TBD_ |
| E2E Pass Rate | 70% | 100% | _TBD_ |
| Test Timeout Errors | 4 | 0 | _TBD_ |
| CI Block Events/Week | N | 0 | _TBD_ |
| Test Execution Time | ~3s | ~5s | _TBD_ |
| Retry Attempt Rate | 0% | <5% | _TBD_ |

**Tracking:** Metrics captured in CI artifacts and monitored via dashboard

---

## 12. Glossary

**N+1 Query:** Anti-pattern where N additional DB queries fetch related data that could be retrieved in 1 batch query. In this case: 3 individual `WHERE key = ?` queries instead of 1 `WHERE key IN (?, ?, ?)` batch query. Amplifies latency linearly with number of flags.

**Condition-Based Polling:** Testing pattern that repeatedly checks if a condition is met (e.g., API returns expected state) at regular intervals, instead of hard-coded waits. More reliable than hoping a fixed delay is "enough time." Example: `waitForFeatureFlagPropagation()`.

**Retry Logic with Exponential Backoff:** Automatically retrying failed operations with increasing delays between attempts (e.g., 2s, 4s, 8s). Handles transient failures (network glitches, DB locks) without infinite loops. Example: `retryAction()` with maxAttempts=3.

**Hard-Coded Wait:** Anti-pattern using `page.waitForTimeout(1000)` to "hope" an operation completes. Unreliable in CI (may be too short) and wasteful locally (may be too long). Prefer Playwright's auto-waiting and condition-based polling.

**Strategic Wait:** Deliberate delay between operations to allow backend state propagation. **DEPRECATED** in this plan—replaced by condition-based polling which verifies state instead of guessing duration.

**SQLite WAL:** Write-Ahead Logging mode that improves concurrency by writing changes to a log file before committing to main database. Adds <100ms checkpoint latency but enables concurrent reads during writes.

**CI Runner:** Virtual machine executing GitHub Actions workflows. Typically has slower disk I/O (20-120x) than developer machines due to virtualization and shared resources. Backend optimization benefits CI most.

**Test Sharding:** Splitting test suite across parallel jobs to reduce total execution time. In this project: 3 browser shards (Chromium, Firefox, WebKit) run concurrently to keep total E2E duration <15min.

**Batch Query:** Single database query that retrieves multiple records matching a set of criteria. Example: `WHERE key IN ('flag1', 'flag2', 'flag3')` instead of 3 separate queries. Reduces round-trip latency and connection overhead.

**Transaction Wrapping:** Grouping multiple database operations into a single atomic unit. If any operation fails, all changes are rolled back. Ensures data consistency for multi-flag updates in `UpdateFlags()`.

**P50/P95/P99 Latency:** Performance percentiles. P50 (median) = 50% of requests faster, P95 = 95% faster, P99 = 99% faster. P99 is critical for worst-case user experience. Target: P99 <200ms for feature flags endpoint.

**Helper Defaults:** Timeout values configured in helper functions like `clickAndWaitForResponse()` and `waitForAPIResponse()`. Currently 30s, which provides 150x safety margin over optimized backend latency (200ms P99).

**Auto-Waiting:** Playwright's built-in mechanism that waits for elements to become actionable (visible, enabled, stable) before interacting. Eliminates need for most explicit waits. Should be relied upon wherever possible.

---

**Plan Version:** 2.0 (REVISED)
**Status:** Ready for Implementation
**Revision Date:** 2026-02-01
**Supervisor Feedback:** Incorporated (Proper Fix Approach)
**Next Step:** Hand off to Supervisor Agent for review and task assignment
**Estimated Effort:** 8-13 hours total (all phases)
**Risk Level:** Low-Medium (backend changes + comprehensive testing)
**Philosophy:** "Proper fix over quick fix" - Address root cause, measure first, avoid hard-coded waits