Files
Charon/docs/plans/ci_failure_remediation_plan.md
GitHub Actions acea4307ba Enhance documentation and testing plans
- Added references to existing test files in the UI/UX testing plan.
- Updated CI failure remediation plan with improved file paths and clarity.
- Expanded CrowdSec full implementation documentation with detailed configuration steps and scripts.
- Improved CrowdSec testing plan with clearer objectives and expected results.
- Updated current specification documentation with additional context on CVE remediation.
- Enhanced docs-to-issues workflow documentation for better issue tracking.
- Corrected numbering in UI/UX bugfixes specification for clarity.
- Improved WAF testing plan with detailed curl commands and expected results.
- Updated QA reports for CrowdSec implementation and UI/UX testing with detailed results and coverage metrics.
- Fixed rate limit integration test summary with clear identification of issues and resolutions.
- Enhanced rate limit test status report with detailed root causes and next steps for follow-up.
2025-12-14 02:45:24 +00:00

16 KiB

CI/CD Failure Remediation Plan

Created: 2025-12-13 Status: Planned Priority: High

Executive Summary

Three GitHub Actions workflows have failed. This document provides root cause analysis and specific remediation steps for each failure.


1. Quality Checks Workflow (quality-checks.yml)

1.1 Frontend Test Timeout

File: frontend/src/components/tests/LiveLogViewer.test.tsx Test: "displays blocked requests with special styling" under "Security Mode" Error: Test timed out in 5000ms

Root Cause Analysis

The failing test at line 374 has a race condition between:

  1. The await act(async () => { mockOnSecurityMessage(blockedLog); }) call
  2. The subsequent waitFor assertions

The test attempts to verify multiple DOM elements after sending a security log message:

await waitFor(() => {
  // Use getAllByText since 'WAF' appears both in dropdown option and source badge
  const wafElements = screen.getAllByText('WAF');
  expect(wafElements.length).toBeGreaterThanOrEqual(2); // Option + badge
  expect(screen.getByText('10.0.0.1')).toBeTruthy();
  expect(screen.getByText(/BLOCKED: SQL injection detected/)).toBeTruthy();
  // Block reason is shown in brackets - check for the text content
  expect(screen.getByText(/\[SQL injection detected\]/)).toBeTruthy();
});

Issues identified:

  1. Multiple assertions in single waitFor: The waitFor contains 4 separate assertions. If any one fails, the entire block retries, potentially causing timeout.
  2. Complex regex matching: The regex patterns /BLOCKED: SQL injection detected/ and /\[SQL injection detected\]/ may be matching against DOM elements that are not yet rendered.
  3. State update timing: The act() wrapper is async but the component's state update (setLogs) may not complete before waitFor starts checking.

Option A: Increase test timeout (Quick fix)

// Line 371-374 - Add timeout option
it('displays blocked requests with special styling', async () => {
  render(<LiveLogViewer mode="security" />);

  // Wait for connection to establish
  await waitFor(() => expect(screen.getByText('Connected')).toBeTruthy());

  const blockedLog: logsApi.SecurityLogEntry = {
    timestamp: '2025-12-12T10:30:00Z',
    level: 'warn',
    logger: 'http.handlers.waf',
    client_ip: '10.0.0.1',
    method: 'POST',
    uri: '/admin',
    status: 403,
    duration: 0.001,
    size: 0,
    user_agent: 'Attack/1.0',
    host: 'example.com',
    source: 'waf',
    blocked: true,
    block_reason: 'SQL injection detected',
  };

  // Send message inside act to properly handle state updates
  await act(async () => {
    if (mockOnSecurityMessage) {
      mockOnSecurityMessage(blockedLog);
    }
  });

  // Split assertions into separate waitFor calls to isolate failures
  await waitFor(() => {
    expect(screen.getByText('10.0.0.1')).toBeTruthy();
  }, { timeout: 10000 });

  await waitFor(() => {
    expect(screen.getByText(/BLOCKED: SQL injection detected/)).toBeTruthy();
  });

  await waitFor(() => {
    expect(screen.getByText(/\[SQL injection detected\]/)).toBeTruthy();
  });

  // Verify WAF badge appears (separate from dropdown option)
  await waitFor(() => {
    const wafElements = screen.getAllByText('WAF');
    expect(wafElements.length).toBeGreaterThanOrEqual(2);
  });
}, 15000); // Increase overall test timeout

Option B: Use findBy queries (Preferred)

it('displays blocked requests with special styling', async () => {
  render(<LiveLogViewer mode="security" />);

  // Wait for connection
  await screen.findByText('Connected');

  const blockedLog: logsApi.SecurityLogEntry = {
    // ... same as before
  };

  await act(async () => {
    if (mockOnSecurityMessage) {
      mockOnSecurityMessage(blockedLog);
    }
  });

  // Use findBy for async queries - cleaner and more reliable
  await screen.findByText('10.0.0.1');
  await screen.findByText(/BLOCKED: SQL injection detected/);
  await screen.findByText(/\[SQL injection detected\]/);

  // For getAllByText, wrap in waitFor since findAllBy isn't used for counts
  await waitFor(() => {
    const wafElements = screen.getAllByText('WAF');
    expect(wafElements.length).toBeGreaterThanOrEqual(2);
  });
});

1.2 Backend gosec G115 - Integer Overflow Errors

1.2.1 backup_service.go Line 293

File: backend/internal/services/backup_service.go Error: G115: integer overflow conversion uint64 -> int64

Current Code:

func (s *BackupService) GetAvailableSpace() (int64, error) {
    var stat syscall.Statfs_t
    if err := syscall.Statfs(s.BackupDir, &stat); err != nil {
        return 0, fmt.Errorf("failed to get disk space: %w", err)
    }
    // Available blocks * block size = available bytes
    return int64(stat.Bavail) * int64(stat.Bsize), nil  // Line 293
}

Root Cause:

  • stat.Bavail is uint64 (available blocks)
  • stat.Bsize is int64 (block size, can be negative on some systems but practically always positive)
  • Converting uint64 to int64 can overflow if Bavail exceeds math.MaxInt64

Recommended Fix:

import (
    "math"
    // ... other imports
)

func (s *BackupService) GetAvailableSpace() (int64, error) {
    var stat syscall.Statfs_t
    if err := syscall.Statfs(s.BackupDir, &stat); err != nil {
        return 0, fmt.Errorf("failed to get disk space: %w", err)
    }

    // Safe conversion with overflow check
    // Bavail is uint64, Bsize is int64 (but always positive for valid filesystems)
    bavail := stat.Bavail
    bsize := stat.Bsize

    // Check for negative block size (invalid filesystem state)
    if bsize < 0 {
        return 0, fmt.Errorf("invalid block size: %d", bsize)
    }

    // Check if Bavail exceeds int64 max before conversion
    if bavail > uint64(math.MaxInt64) {
        // Cap at max int64 - this represents ~8 exabytes which is sufficient
        return math.MaxInt64, nil
    }

    // Safe to convert now
    availBlocks := int64(bavail)
    blockSize := bsize

    // Check for overflow in multiplication
    if availBlocks > 0 && blockSize > math.MaxInt64/availBlocks {
        return math.MaxInt64, nil
    }

    return availBlocks * blockSize, nil
}

1.2.2 proxy_host_handler.go Lines 216 and 235

File: backend/internal/api/handlers/proxy_host_handler.go Error: G115: integer overflow conversion int -> uint

Current Code (Lines 213-222 for certificate_id, Lines 231-240 for access_list_id):

// Line 216
case int:
    id := uint(t)  // G115: int -> uint overflow
    host.CertificateID = &id

// Line 235
case int:
    id := uint(t)  // G115: int -> uint overflow
    host.AccessListID = &id

Root Cause:

  • When JSON is unmarshaled into map[string]interface{}, numeric values can become float64 or int
  • Converting a negative int to uint causes overflow (e.g., -1 becomes 18446744073709551615)
  • Database IDs should never be negative

Recommended Fix:

Create a helper function and apply to both locations:

// Add at package level or in a shared util package
// safeIntToUint safely converts an int to uint, returning false if negative
func safeIntToUint(i int) (uint, bool) {
    if i < 0 {
        return 0, false
    }
    return uint(i), true
}

// safeFloat64ToUint safely converts a float64 to uint, returning false if negative or fractional
func safeFloat64ToUint(f float64) (uint, bool) {
    if f < 0 || f != float64(uint(f)) {
        return 0, false
    }
    return uint(f), true
}

Updated handler code (Lines 210-245):

if v, ok := payload["certificate_id"]; ok {
    if v == nil {
        host.CertificateID = nil
    } else {
        switch t := v.(type) {
        case float64:
            if id, ok := safeFloat64ToUint(t); ok {
                host.CertificateID = &id
            }
        case int:
            if id, ok := safeIntToUint(t); ok {
                host.CertificateID = &id
            }
        case string:
            if n, err := strconv.ParseUint(t, 10, 32); err == nil {
                id := uint(n)
                host.CertificateID = &id
            }
        }
    }
}
if v, ok := payload["access_list_id"]; ok {
    if v == nil {
        host.AccessListID = nil
    } else {
        switch t := v.(type) {
        case float64:
            if id, ok := safeFloat64ToUint(t); ok {
                host.AccessListID = &id
            }
        case int:
            if id, ok := safeIntToUint(t); ok {
                host.AccessListID = &id
            }
        case string:
            if n, err := strconv.ParseUint(t, 10, 32); err == nil {
                id := uint(n)
                host.AccessListID = &id
            }
        }
    }
}

2. PR Checklist Validation Workflow (pr-checklist.yml)

Error: "Validate history-rewrite checklist (conditional)" failed

Root Cause Analysis

The workflow at .github/workflows/pr-checklist.yml validates PRs that touch history-rewrite related files. It checks for three items in the PR body:

  1. preview_removals.sh mention - Pattern: /preview_removals\.sh/i
  2. data/backups mention - Pattern: /data\/?backups/i
  3. explicit non-run of --force - Pattern: /(?:\[\s*[xX]\s*\]\s*)?(?:i will not run|will not run|do not run|don'?t run|won'?t run)\b[^\n]*--force/i

When this check triggers:

The check only runs if the PR modifies files matching:

  • scripts/history-rewrite/*
  • docs/plans/history_rewrite.md
  • Any file containing history-rewrite in the path

Resolution Options

Option A: If PR legitimately touches history-rewrite files

Update the PR description to include all required checklist items from .github/PULL_REQUEST_TEMPLATE/history-rewrite.md:

## Checklist - required for history rewrite PRs

- [x] I have created a **local** backup branch: `backup/history-YYYYMMDD-HHMMSS`
- [x] I have pushed the backup branch to the remote origin
- [x] I have run a dry-run locally: `scripts/history-rewrite/preview_removals.sh --paths '...'`
- [x] I have verified the `data/backups` tarball is present
- [x] I will not run the destructive `--force` step without explicit approval

Option B: If PR doesn't need history-rewrite validation

Ensure the PR doesn't modify files in:

  • scripts/history-rewrite/
  • docs/plans/history_rewrite.md
  • Any files with history-rewrite in the name

Option C: False positive - workflow issue

If the workflow is triggering incorrectly, check the file list detection logic at line 27-28 of the workflow.


3. Benchmark Workflow (benchmark.yml)

3.1 "Resource not accessible by integration" Error

Root Cause:

The benchmark-action/github-action-benchmark@v1 action requires write permissions to push benchmark results to the repository. This fails on:

  • Pull requests from forks (restricted permissions)
  • PRs where GITHUB_TOKEN doesn't have contents: write permission

Current Workflow Configuration:

permissions:
  contents: write
  deployments: write

The error occurs because:

  1. On PRs, the token may not have write access
  2. The auto-push: true setting tries to push on main branch only, but the action still needs permissions to access the benchmark data

Recommended Fix:

- name: Store Benchmark Result
  uses: benchmark-action/github-action-benchmark@v1
  with:
    name: Go Benchmark
    tool: 'go'
    output-file-path: backend/output.txt
    github-token: ${{ secrets.GITHUB_TOKEN }}
    # Only auto-push on main branch pushes, not PRs
    auto-push: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
    alert-threshold: '150%'
    comment-on-alert: true
    fail-on-alert: false
    summary-always: true
    # Skip external data storage on PRs to avoid permission errors
    skip-fetch-gh-pages: ${{ github.event_name == 'pull_request' }}

Alternatively, for fork PRs, consider using pull_request_target event with caution or skip the benchmark action entirely on PRs.


3.2 Performance Regression (1.51x threshold exceeded)

Warning: Performance regression 1.51x (165768 ns/op vs 109674 ns/op) exceeds 1.5x threshold

Benchmark Functions Available:

From backend/internal/api/handlers/benchmark_test.go:

Benchmark Line
BenchmarkSecurityHandler_GetStatus 47
BenchmarkSecurityHandler_GetStatus_NoSettings 82
BenchmarkSecurityHandler_ListDecisions 105
BenchmarkSecurityHandler_ListRuleSets 138
BenchmarkSecurityHandler_UpsertRuleSet 171
BenchmarkSecurityHandler_CreateDecision 202
BenchmarkSecurityHandler_GetConfig 233
BenchmarkSecurityHandler_UpdateConfig 266
BenchmarkSecurityHandler_GetStatus_Parallel 303
BenchmarkSecurityHandler_ListDecisions_Parallel 336
BenchmarkSecurityHandler_LargeRuleSetContent 383
BenchmarkSecurityHandler_ManySettingsLookups 420

Root Cause Analysis:

The 1.51x regression (165768 ns vs 109674 ns ≈ 56μs increase) likely comes from:

  1. Database operations: Benchmarks use in-memory SQLite. Any additional queries or model changes would show up here.
  2. Parallel benchmark flakiness: BenchmarkSecurityHandler_GetStatus_Parallel and BenchmarkSecurityHandler_ListDecisions_Parallel use shared memory databases which can have contention.
  3. CI environment variability: GitHub Actions runners have variable performance.

Investigation Steps:

  1. Run benchmarks locally to establish baseline:

    cd backend && go test -bench=. -benchmem -benchtime=3s ./internal/api/handlers/... -run=^$
    
  2. Compare with previous commit:

    git stash
    git checkout HEAD~1
    go test -bench=. -benchmem ./internal/api/handlers/... -run=^$ > bench_old.txt
    git checkout -
    git stash pop
    go test -bench=. -benchmem ./internal/api/handlers/... -run=^$ > bench_new.txt
    benchstat bench_old.txt bench_new.txt
    
  3. Check recent changes to:

    • internal/api/handlers/security_handler.go
    • internal/models/security*.go
    • Database query patterns

Recommended Actions:

If real regression:

  • Profile the affected handler using go test -cpuprofile
  • Review recent commits for inefficient code
  • Optimize the specific slow path

If CI flakiness:

  • Increase alert-threshold to 175% or 200%
  • Add -benchtime=3s for more stable results
  • Consider running benchmarks multiple times and averaging

Workflow Fix for Threshold:

- name: Store Benchmark Result
  uses: benchmark-action/github-action-benchmark@v1
  with:
    # ... other options
    # Increase threshold to account for CI variability
    alert-threshold: '175%'
    # Don't fail on alert for benchmarks - just warn
    fail-on-alert: false

Summary of Required Changes

File Change Priority
frontend/src/components/__tests__/LiveLogViewer.test.tsx Split waitFor assertions, increase timeout High
backend/internal/services/backup_service.go Add overflow protection for GetAvailableSpace High
backend/internal/api/handlers/proxy_host_handler.go Add safe int-to-uint conversion helpers High
.github/workflows/benchmark.yml Add permission guards, adjust threshold Medium
PR Description Add history-rewrite checklist items if applicable Conditional

Implementation Order

  1. Backend gosec fixes - Blocking CI, straightforward fixes
  2. Frontend test timeout - Blocking CI, may need iteration
  3. Benchmark workflow - Non-blocking (fail-on-alert: false), can be addressed after
  4. PR checklist - Context-dependent, may not need code changes

Testing Verification

After implementing fixes:

# Backend
cd backend && go test ./... -v
golangci-lint run  # Should pass gosec G115

# Frontend
cd frontend && npm test

# Full CI simulation
pre-commit run --all-files