Charon/docs/plans/archive/TEST_ISOLATION_FINDINGS.md
2026-02-19 16:34:10 +00:00

Test Isolation Findings - Go 1.26.0

Date: 2026-02-16
Investigation: Test failures after Go 1.26.0 upgrade
Status: Partial fix committed, further investigation required

Summary

Root Cause Confirmed: Go 1.26.0 upgrade (commit dc40102a) changed timing/signal handling/scheduling behavior.

Key Finding: All 5 failing tests PASS individually but FAIL in full suite → Test isolation issue.

Fixes Completed

Fix #1: TestMain_DefaultStartupGracefulShutdown_Subprocess

  • File: backend/cmd/api/main_test.go:287
  • Change: Increased SIGTERM timeout from 500ms → 1000ms
  • Commit: 62740eb5
  • Status: PASSING individually
  • Reason: Go 1.26.0 signal delivery timing changes on Linux
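The pattern behind this fix can be sketched as follows. This is a minimal illustration, not the code in main_test.go: the helper names `stopWithin` and `demoStop` are hypothetical, and `sleep 30` stands in for the real server subprocess. It sends SIGTERM and then waits up to a deadline for the child to exit, which is the wait that was raised from 500ms to 1000ms.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// stopWithin sends SIGTERM to an already started command and reports whether
// the process exited before the timeout elapsed.
// Hypothetical helper for illustration; not taken from main_test.go.
func stopWithin(cmd *exec.Cmd, timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		// Wait returns a non-nil error for a signal-terminated process;
		// only the fact that it returned matters here.
		_ = cmd.Wait()
		close(done)
	}()

	if err := cmd.Process.Signal(syscall.SIGTERM); err != nil {
		_ = cmd.Process.Kill()
		<-done
		return false
	}

	select {
	case <-done:
		return true
	case <-time.After(timeout):
		// Do not leave the child running after a timeout.
		_ = cmd.Process.Kill()
		<-done
		return false
	}
}

// demoStop starts a stand-in subprocess and stops it with a 1000ms budget.
func demoStop() bool {
	cmd := exec.Command("sleep", "30") // assumption: a POSIX `sleep` is on PATH
	if err := cmd.Start(); err != nil {
		return false
	}
	return stopWithin(cmd, 1000*time.Millisecond)
}

func main() {
	fmt.Println("exited within timeout:", demoStop())
}
```

Killing the child on timeout matters for isolation: a subprocess that outlives its test can hold ports or files that later tests in the suite need.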

Tests Status Matrix

| Test | Individual | Full Suite | Priority | Notes |
| --- | --- | --- | --- | --- |
| TestMain_DefaultStartupGracefulShutdown_Subprocess | PASS | Unknown | HIGH | Fixed timeout |
| TestCredentialService_GetCredentialForDomain_WildcardMatch | PASS | FAIL | HIGH | No code changes needed |
| TestDeleteCertificate_CreatesBackup | PASS | FAIL | MEDIUM | No code changes needed |
| TestHeartbeatPoller_ConcurrentSafety | PASS | FAIL | MEDIUM | No code changes needed |
| TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite | PASS | FAIL | MEDIUM | No code changes needed |

Test Isolation Issue

Observation: Tests pass when run individually but fail in full suite execution.

Likely Causes:

  1. Global State Pollution:

    • Tests modifying shared package-level variables
    • Singleton initialization state persisting between tests
    • Environment variables not being properly cleaned up
  2. Database Connection Leaks:

    • SQLite in-memory databases not properly closed
    • GORM connection pool exhaustion
    • WAL mode journal files persisting
  3. Goroutine Leaks:

    • Background goroutines from previous tests still running
    • Channels not being closed
    • Context cancellations not propagating
  4. Test Execution Order:

    • Tests depending on specific execution order
    • Previous test failures leaving system in bad state
    • Resource cleanup in t.Cleanup() not executing due to panics
  5. Race Conditions (Go 1.26.0 Scheduler):

    • Scheduling or preemption changes in Go 1.26.0 possibly exposing previously hidden races
    • Tests making timing assumptions that no longer hold
    • Concurrent test execution causing interference
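As one illustration of the first cause, environment variables that outlive a test, the save-and-restore pattern can be sketched as below. The helper names and the variable `CHARON_EXAMPLE_MODE` are hypothetical and not taken from the codebase. In real tests, `t.Setenv` (available since Go 1.17) does this automatically, though it cannot be used in tests that call `t.Parallel()`.

```go
package main

import (
	"fmt"
	"os"
)

// setEnvForTest sets key to value and returns a function that restores the
// previous state, including the case where the variable was not set at all.
// Hypothetical helper for illustration.
func setEnvForTest(key, value string) (restore func()) {
	old, had := os.LookupEnv(key)
	_ = os.Setenv(key, value)
	return func() {
		if had {
			_ = os.Setenv(key, old)
		} else {
			_ = os.Unsetenv(key)
		}
	}
}

// envRoundTrip sets the variable, reads it back, restores the old state, and
// reports the value seen while set and whether it is still set afterwards.
func envRoundTrip(key, value string) (during string, setAfter bool) {
	restore := setEnvForTest(key, value)
	during = os.Getenv(key)
	restore()
	_, setAfter = os.LookupEnv(key)
	return during, setAfter
}

func main() {
	during, setAfter := envRoundTrip("CHARON_EXAMPLE_MODE", "test")
	fmt.Println("value during test:", during)
	fmt.Println("still set after restore:", setAfter)
}
```

In a test, the returned function would be registered with `t.Cleanup(restore)` so that it runs even when the test fails.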

Investigation Blockers

Current Blocker: The full test suite hangs or takes excessive time (>2 minutes).

Symptoms:

  • go test ./... hangs indefinitely or terminates after 120s timeout
  • Cannot get full suite results to see which tests are actually failing
  • Cannot collect coverage data from full suite run

Needed:

  • Identify which test(s) are causing the hang
  • Isolate hanging test(s) and run rest of suite
  • Check for infinite loops or deadlocks in test cleanup
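One way to see where a hang sits is to dump every goroutine's stack once a deadline passes. The sketch below assumes nothing about the real tests: `runWithDeadline` and `simulateHang` are hypothetical names, and the blocked channel receive stands in for a stuck cleanup. It uses the standard `runtime/pprof` goroutine profile.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"time"
)

// runWithDeadline runs fn in its own goroutine. If fn has not returned after d,
// it reports timedOut=true together with a dump of all goroutine stacks.
// Hypothetical helper for illustration.
func runWithDeadline(fn func(), d time.Duration) (timedOut bool, stacks string) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		fn()
	}()

	select {
	case <-done:
		return false, ""
	case <-time.After(d):
		var buf bytes.Buffer
		// debug=2 prints stacks in the same form as an unrecovered panic.
		_ = pprof.Lookup("goroutine").WriteTo(&buf, 2)
		return true, buf.String()
	}
}

// simulateHang blocks on a channel that is never closed, standing in for a
// cleanup step that waits forever.
func simulateHang() (bool, string) {
	block := make(chan struct{})
	return runWithDeadline(func() { <-block }, 200*time.Millisecond)
}

func main() {
	timedOut, stacks := simulateHang()
	fmt.Println("timed out:", timedOut)
	fmt.Println(stacks)
}
```

In the dump, the stuck goroutine appears blocked on a channel receive, with the file and line of the blocking call. `go test -timeout` produces a comparable dump on its own when the limit is hit.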

Next Steps

Option A: Sequential Investigation (4-6 hours)

  1. Run tests package-by-package to identify hanging package
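  2. Use the -timeout 30s flag to catch hanging tests quickly; when the timeout fires, go test panics and prints every goroutine's stack, which shows where the test is stuck
  3. Run the race detector: go test -race -p 1 ./... (note that -race reports data races, not leaked goroutines; goroutine leak detection needs a separate check)
  4. Review which tests call t.Parallel(), since those tests run concurrently with other parallel tests in the same package
  5. Audit t.Cleanup() calls to confirm that each test releases the resources it creates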
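Because the race detector does not report leaked goroutines, a leak check has to be done separately. A minimal standard-library sketch is shown below, assuming a simple before/after comparison of `runtime.NumGoroutine()` is good enough for a first pass. The names `leakedGoroutines` and `demoLeak` are hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// leakedGoroutines reports how many more goroutines exist now than at the
// baseline. It polls for up to wait so that goroutines which are already
// shutting down get a chance to finish before being counted as leaks.
// Hypothetical helper for illustration.
func leakedGoroutines(baseline int, wait time.Duration) int {
	deadline := time.Now().Add(wait)
	for {
		extra := runtime.NumGoroutine() - baseline
		if extra <= 0 {
			return 0
		}
		if time.Now().After(deadline) {
			return extra
		}
		time.Sleep(10 * time.Millisecond)
	}
}

// demoLeak starts a goroutine that blocks until stop is closed and measures
// the leak count while it is blocked and again after it has been released.
func demoLeak() (whileBlocked, afterStop int) {
	baseline := runtime.NumGoroutine()
	stop := make(chan struct{})
	go func() { <-stop }() // simulated background worker that was never stopped

	whileBlocked = leakedGoroutines(baseline, 50*time.Millisecond)
	close(stop)
	afterStop = leakedGoroutines(baseline, time.Second)
	return whileBlocked, afterStop
}

func main() {
	whileBlocked, afterStop := demoLeak()
	fmt.Println("extra goroutines while worker is blocked:", whileBlocked)
	fmt.Println("extra goroutines after stop:", afterStop)
}
```

In a test, the baseline would be taken at the start and the check registered with `t.Cleanup()`. A raw count cannot say which goroutine leaked, so it works best combined with a stack dump when the count is non-zero.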

Option B: Quick Workaround (30 minutes)

  1. Run tests with -p 1 (one package at a time) to reduce cross-package interference; add -parallel 1 to also serialize tests that call t.Parallel()
  2. Increase timeout: -timeout 10m
  3. Skip known flaky tests temporarily with t.Skip("Go 1.26.0 isolation issue")
  4. Create tracking issue for proper fix

Option C: Revert to Go 1.25.7

  • Loses security fixes
  • Defers the isolation problem instead of fixing it

Recommendation

Hybrid Approach:

  1. Immediate (now): Run tests with -p 1 -timeout 5m so that packages execute one at a time
  2. Short-term (today): Identify hanging tests and skip with tracking issue
  3. Long-term (this week): Fix test isolation properly with cleanup audits

Why: Unblocks CI immediately while preserving investigation path.

Commands for Investigation

# Run packages one at a time with a 5-minute timeout
go test -p 1 -timeout 5m ./...

# Find hanging test packages
for pkg in $(go list ./...); do
    echo "Testing $pkg..."
    timeout 30s go test -v "$pkg" || echo "FAILED or TIMEOUT: $pkg"
done

# Check for data races (-race does not detect goroutine leaks)
go test -race -p 1 -count=1 ./...

# Run specific packages
go test -v ./cmd/... ./internal/api/... ./internal/services/...

Action Items

  • Run packages one at a time (-p 1) to check whether cross-package parallelism is the issue
  • Identify hanging test package
  • Add timeout flags to test execution script
  • Audit all tests for proper t.Cleanup() usage
  • Add goroutine leak detection to CI
  • Create tracking issue for test isolation fixes