Charon/docs/plans/ci_failure_remediation_plan.md

# CI/CD Failure Remediation Plan

> **Created:** 2025-12-13
> **Status:** Planned
> **Priority:** High

## Executive Summary

Three GitHub Actions workflows have failed. This document provides root cause analysis and specific remediation steps for each failure.

---

## 1. Quality Checks Workflow (`quality-checks.yml`)

### 1.1 Frontend Test Timeout

**File:** [frontend/src/components/**tests**/LiveLogViewer.test.tsx](../../frontend/src/components/__tests__/LiveLogViewer.test.tsx#L374)
**Test:** "displays blocked requests with special styling" under "Security Mode"
**Error:** `Test timed out in 5000ms`

#### Root Cause Analysis

The failing test at line 374 has a race condition between:

1. The `await act(async () => { mockOnSecurityMessage(blockedLog); })` call
2. The subsequent `waitFor` assertions

The test attempts to verify multiple DOM elements after sending a security log message:

```typescript
await waitFor(() => {
  // Use getAllByText since 'WAF' appears both in dropdown option and source badge
  const wafElements = screen.getAllByText('WAF');
  expect(wafElements.length).toBeGreaterThanOrEqual(2); // Option + badge
  expect(screen.getByText('10.0.0.1')).toBeTruthy();
  expect(screen.getByText(/BLOCKED: SQL injection detected/)).toBeTruthy();
  // Block reason is shown in brackets - check for the text content
  expect(screen.getByText(/\[SQL injection detected\]/)).toBeTruthy();
});
```

**Issues identified:**

1. **Multiple assertions in single waitFor:** The `waitFor` contains 4 separate assertions. If any one fails, the entire block retries, potentially causing timeout.
2. **Complex regex matching:** The regex patterns `/BLOCKED: SQL injection detected/` and `/\[SQL injection detected\]/` may be matching against DOM elements that are not yet rendered.
3. **State update timing:** The `act()` wrapper is async but the component's state update (`setLogs`) may not complete before `waitFor` starts checking.

#### Recommended Fix

**Option A: Increase test timeout (Quick fix)**

```typescript
// Line 371-374 - Add timeout option
it('displays blocked requests with special styling', async () => {
  render(<LiveLogViewer mode="security" />);

  // Wait for connection to establish
  await waitFor(() => expect(screen.getByText('Connected')).toBeTruthy());

  const blockedLog: logsApi.SecurityLogEntry = {
    timestamp: '2025-12-12T10:30:00Z',
    level: 'warn',
    logger: 'http.handlers.waf',
    client_ip: '10.0.0.1',
    method: 'POST',
    uri: '/admin',
    status: 403,
    duration: 0.001,
    size: 0,
    user_agent: 'Attack/1.0',
    host: 'example.com',
    source: 'waf',
    blocked: true,
    block_reason: 'SQL injection detected',
  };

  // Send message inside act to properly handle state updates
  await act(async () => {
    if (mockOnSecurityMessage) {
      mockOnSecurityMessage(blockedLog);
    }
  });

  // Split assertions into separate waitFor calls to isolate failures
  await waitFor(() => {
    expect(screen.getByText('10.0.0.1')).toBeTruthy();
  }, { timeout: 10000 });

  await waitFor(() => {
    expect(screen.getByText(/BLOCKED: SQL injection detected/)).toBeTruthy();
  });

  await waitFor(() => {
    expect(screen.getByText(/\[SQL injection detected\]/)).toBeTruthy();
  });

  // Verify WAF badge appears (separate from dropdown option)
  await waitFor(() => {
    const wafElements = screen.getAllByText('WAF');
    expect(wafElements.length).toBeGreaterThanOrEqual(2);
  });
}, 15000); // Increase overall test timeout
```

**Option B: Use `findBy` queries (Preferred)**

```typescript
it('displays blocked requests with special styling', async () => {
  render(<LiveLogViewer mode="security" />);

  // Wait for connection
  await screen.findByText('Connected');

  const blockedLog: logsApi.SecurityLogEntry = {
    // ... same as before
  };

  await act(async () => {
    if (mockOnSecurityMessage) {
      mockOnSecurityMessage(blockedLog);
    }
  });

  // Use findBy for async queries - cleaner and more reliable
  await screen.findByText('10.0.0.1');
  await screen.findByText(/BLOCKED: SQL injection detected/);
  await screen.findByText(/\[SQL injection detected\]/);

  // For getAllByText, wrap in waitFor since findAllBy isn't used for counts
  await waitFor(() => {
    const wafElements = screen.getAllByText('WAF');
    expect(wafElements.length).toBeGreaterThanOrEqual(2);
  });
});
```

---

### 1.2 Backend gosec G115 - Integer Overflow Errors

#### 1.2.1 backup_service.go Line 293

**File:** [backend/internal/services/backup_service.go](../../backend/internal/services/backup_service.go#L293)
**Error:** `G115: integer overflow conversion uint64 -> int64`

**Current Code:**

```go
func (s *BackupService) GetAvailableSpace() (int64, error) {
    var stat syscall.Statfs_t
    if err := syscall.Statfs(s.BackupDir, &stat); err != nil {
        return 0, fmt.Errorf("failed to get disk space: %w", err)
    }
    // Available blocks * block size = available bytes
    return int64(stat.Bavail) * int64(stat.Bsize), nil  // Line 293
}
```

**Root Cause:**

- `stat.Bavail` is `uint64` (available blocks)
- `stat.Bsize` is `int64` (block size, can be negative on some systems but practically always positive)
- Converting `uint64` to `int64` can overflow if `Bavail` exceeds `math.MaxInt64`

**Recommended Fix:**

```go
import (
    "math"
    // ... other imports
)

func (s *BackupService) GetAvailableSpace() (int64, error) {
    var stat syscall.Statfs_t
    if err := syscall.Statfs(s.BackupDir, &stat); err != nil {
        return 0, fmt.Errorf("failed to get disk space: %w", err)
    }

    // Safe conversion with overflow check
    // Bavail is uint64, Bsize is int64 (but always positive for valid filesystems)
    bavail := stat.Bavail
    bsize := stat.Bsize

    // Check for negative block size (invalid filesystem state)
    if bsize < 0 {
        return 0, fmt.Errorf("invalid block size: %d", bsize)
    }

    // Check if Bavail exceeds int64 max before conversion
    if bavail > uint64(math.MaxInt64) {
        // Cap at max int64 - this represents ~8 exabytes which is sufficient
        return math.MaxInt64, nil
    }

    // Safe to convert now
    availBlocks := int64(bavail)
    blockSize := bsize

    // Check for overflow in multiplication
    if availBlocks > 0 && blockSize > math.MaxInt64/availBlocks {
        return math.MaxInt64, nil
    }

    return availBlocks * blockSize, nil
}
```

---

#### 1.2.2 proxy_host_handler.go Lines 216 and 235

**File:** [backend/internal/api/handlers/proxy_host_handler.go](../../backend/internal/api/handlers/proxy_host_handler.go#L216)
**Error:** `G115: integer overflow conversion int -> uint`

**Current Code (Lines 213-222 for certificate_id, Lines 231-240 for access_list_id):**

```go
// Line 216
case int:
    id := uint(t)  // G115: int -> uint overflow
    host.CertificateID = &id

// Line 235
case int:
    id := uint(t)  // G115: int -> uint overflow
    host.AccessListID = &id
```

**Root Cause:**

- When JSON is unmarshaled into `map[string]interface{}`, numeric values can become `float64` or `int`
- Converting a negative `int` to `uint` causes overflow (e.g., `-1` becomes `18446744073709551615`)
- Database IDs should never be negative

**Recommended Fix:**

Create a helper function and apply to both locations:

```go
// Add at package level or in a shared util package
// safeIntToUint safely converts an int to uint, returning false if negative
func safeIntToUint(i int) (uint, bool) {
    if i < 0 {
        return 0, false
    }
    return uint(i), true
}

// safeFloat64ToUint safely converts a float64 to uint, returning false if negative or fractional
func safeFloat64ToUint(f float64) (uint, bool) {
    if f < 0 || f != float64(uint(f)) {
        return 0, false
    }
    return uint(f), true
}
```

**Updated handler code (Lines 210-245):**

```go
if v, ok := payload["certificate_id"]; ok {
    if v == nil {
        host.CertificateID = nil
    } else {
        switch t := v.(type) {
        case float64:
            if id, ok := safeFloat64ToUint(t); ok {
                host.CertificateID = &id
            }
        case int:
            if id, ok := safeIntToUint(t); ok {
                host.CertificateID = &id
            }
        case string:
            if n, err := strconv.ParseUint(t, 10, 32); err == nil {
                id := uint(n)
                host.CertificateID = &id
            }
        }
    }
}
if v, ok := payload["access_list_id"]; ok {
    if v == nil {
        host.AccessListID = nil
    } else {
        switch t := v.(type) {
        case float64:
            if id, ok := safeFloat64ToUint(t); ok {
                host.AccessListID = &id
            }
        case int:
            if id, ok := safeIntToUint(t); ok {
                host.AccessListID = &id
            }
        case string:
            if n, err := strconv.ParseUint(t, 10, 32); err == nil {
                id := uint(n)
                host.AccessListID = &id
            }
        }
    }
}
```

---

## 2. PR Checklist Validation Workflow (`pr-checklist.yml`)

**Error:** "Validate history-rewrite checklist (conditional)" failed

### Root Cause Analysis

The workflow at [.github/workflows/pr-checklist.yml](../../.github/workflows/pr-checklist.yml) validates PRs that touch history-rewrite related files. It checks for three items in the PR body:

1. `preview_removals.sh mention` - Pattern: `/preview_removals\.sh/i`
2. `data/backups mention` - Pattern: `/data\/?backups/i`
3. `explicit non-run of --force` - Pattern: `/(?:\[\s*[xX]\s*\]\s*)?(?:i will not run|will not run|do not run|don'?t run|won'?t run)\b[^\n]*--force/i`

**When this check triggers:**

The check only runs if the PR modifies files matching:

- `scripts/history-rewrite/*`
- `docs/plans/history_rewrite.md`
- Any file containing `history-rewrite` in the path

### Resolution Options

**Option A: If PR legitimately touches history-rewrite files**

Update the PR description to include all required checklist items from [.github/PULL_REQUEST_TEMPLATE/history-rewrite.md](../../.github/PULL_REQUEST_TEMPLATE/history-rewrite.md):

```markdown
## Checklist - required for history rewrite PRs

- [x] I have created a **local** backup branch: `backup/history-YYYYMMDD-HHMMSS`
- [x] I have pushed the backup branch to the remote origin
- [x] I have run a dry-run locally: `scripts/history-rewrite/preview_removals.sh --paths '...'`
- [x] I have verified the `data/backups` tarball is present
- [x] I will not run the destructive `--force` step without explicit approval
```

**Option B: If PR doesn't need history-rewrite validation**

Ensure the PR doesn't modify files in:

- `scripts/history-rewrite/`
- `docs/plans/history_rewrite.md`
- Any files with `history-rewrite` in the name

**Option C: False positive - workflow issue**

If the workflow is triggering incorrectly, check the file list detection logic at line 27-28 of the workflow.

---

## 3. Benchmark Workflow (`benchmark.yml`)

### 3.1 "Resource not accessible by integration" Error

**Root Cause:**

The `benchmark-action/github-action-benchmark@v1` action requires write permissions to push benchmark results to the repository. This fails on:

- Pull requests from forks (restricted permissions)
- PRs where `GITHUB_TOKEN` doesn't have `contents: write` permission

**Current Workflow Configuration:**

```yaml
permissions:
  contents: write
  deployments: write
```

The error occurs because:

1. On PRs, the token may not have write access
2. The `auto-push: true` setting tries to push on main branch only, but the action still needs permissions to access the benchmark data

**Recommended Fix:**

```yaml
- name: Store Benchmark Result
  uses: benchmark-action/github-action-benchmark@v1
  with:
    name: Go Benchmark
    tool: 'go'
    output-file-path: backend/output.txt
    github-token: ${{ secrets.GITHUB_TOKEN }}
    # Only auto-push on main branch pushes, not PRs
    auto-push: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
    alert-threshold: '150%'
    comment-on-alert: true
    fail-on-alert: false
    summary-always: true
    # Skip external data storage on PRs to avoid permission errors
    skip-fetch-gh-pages: ${{ github.event_name == 'pull_request' }}
```

Alternatively, for fork PRs, consider using `pull_request_target` event with caution or skip the benchmark action entirely on PRs.

---

### 3.2 Performance Regression (1.51x threshold exceeded)

**Warning:** Performance regression 1.51x (165768 ns/op vs 109674 ns/op) exceeds 1.5x threshold

**Benchmark Functions Available:**

From [backend/internal/api/handlers/benchmark_test.go](../../backend/internal/api/handlers/benchmark_test.go):

| Benchmark | Line |
|-----------|------|
| `BenchmarkSecurityHandler_GetStatus` | 47 |
| `BenchmarkSecurityHandler_GetStatus_NoSettings` | 82 |
| `BenchmarkSecurityHandler_ListDecisions` | 105 |
| `BenchmarkSecurityHandler_ListRuleSets` | 138 |
| `BenchmarkSecurityHandler_UpsertRuleSet` | 171 |
| `BenchmarkSecurityHandler_CreateDecision` | 202 |
| `BenchmarkSecurityHandler_GetConfig` | 233 |
| `BenchmarkSecurityHandler_UpdateConfig` | 266 |
| `BenchmarkSecurityHandler_GetStatus_Parallel` | 303 |
| `BenchmarkSecurityHandler_ListDecisions_Parallel` | 336 |
| `BenchmarkSecurityHandler_LargeRuleSetContent` | 383 |
| `BenchmarkSecurityHandler_ManySettingsLookups` | 420 |

**Root Cause Analysis:**

The 1.51x regression (165768 ns vs 109674 ns ≈ 56μs increase) likely comes from:

1. **Database operations:** Benchmarks use in-memory SQLite. Any additional queries or model changes would show up here.
2. **Parallel benchmark flakiness:** `BenchmarkSecurityHandler_GetStatus_Parallel` and `BenchmarkSecurityHandler_ListDecisions_Parallel` use shared memory databases which can have contention.
3. **CI environment variability:** GitHub Actions runners have variable performance.

**Investigation Steps:**

1. Run benchmarks locally to establish baseline:

   ```bash
   cd backend && go test -bench=. -benchmem -benchtime=3s ./internal/api/handlers/... -run=^$
   ```

2. Compare with previous commit:

   ```bash
   git stash
   git checkout HEAD~1
   go test -bench=. -benchmem ./internal/api/handlers/... -run=^$ > bench_old.txt
   git checkout -
   git stash pop
   go test -bench=. -benchmem ./internal/api/handlers/... -run=^$ > bench_new.txt
   benchstat bench_old.txt bench_new.txt
   ```

3. Check recent changes to:
   - `internal/api/handlers/security_handler.go`
   - `internal/models/security*.go`
   - Database query patterns

**Recommended Actions:**

**If real regression:**

- Profile the affected handler using `go test -cpuprofile`
- Review recent commits for inefficient code
- Optimize the specific slow path

**If CI flakiness:**

- Increase `alert-threshold` to `175%` or `200%`
- Add `-benchtime=3s` for more stable results
- Consider running benchmarks multiple times and averaging

**Workflow Fix for Threshold:**

```yaml
- name: Store Benchmark Result
  uses: benchmark-action/github-action-benchmark@v1
  with:
    # ... other options
    # Increase threshold to account for CI variability
    alert-threshold: '175%'
    # Don't fail on alert for benchmarks - just warn
    fail-on-alert: false
```

---

## Summary of Required Changes

| File | Change | Priority |
|------|--------|----------|
| `frontend/src/components/__tests__/LiveLogViewer.test.tsx` | Split waitFor assertions, increase timeout | High |
| `backend/internal/services/backup_service.go` | Add overflow protection for `GetAvailableSpace` | High |
| `backend/internal/api/handlers/proxy_host_handler.go` | Add safe int-to-uint conversion helpers | High |
| `.github/workflows/benchmark.yml` | Add permission guards, adjust threshold | Medium |
| PR Description | Add history-rewrite checklist items if applicable | Conditional |

---

## Implementation Order

1. **Backend gosec fixes** - Blocking CI, straightforward fixes
2. **Frontend test timeout** - Blocking CI, may need iteration
3. **Benchmark workflow** - Non-blocking (`fail-on-alert: false`), can be addressed after
4. **PR checklist** - Context-dependent, may not need code changes

---

## Testing Verification

After implementing fixes:

```bash
# Backend
cd backend && go test ./... -v
golangci-lint run  # Should pass gosec G115

# Frontend
cd frontend && npm test

# Full CI simulation
pre-commit run --all-files
```