# Test Isolation Findings - Go 1.26.0

**Date:** 2026-02-16
**Investigation:** Test failures after Go 1.26.0 upgrade
**Status:** Partial fix committed, further investigation required

## Summary

**Root Cause Confirmed:** The Go 1.26.0 upgrade (commit dc40102a) changed timing, signal-handling, and scheduling behavior.

**Key Finding:** All 5 failing tests **PASS individually**; 4 of them still **FAIL in the full suite** and the fifth has no full-suite result yet → test isolation issue.

## Fixes Completed

### ✅ Fix #1: TestMain_DefaultStartupGracefulShutdown_Subprocess
- **File:** backend/cmd/api/main_test.go:287
- **Change:** Increased SIGTERM timeout from 500ms → 1000ms
- **Commit:** 62740eb5
- **Status:** ✅ PASSING individually
- **Reason:** Go 1.26.0 signal delivery timing changes on Linux

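The shape of a SIGTERM wait with a bounded grace period can be sketched as follows. This is a minimal illustration, not the code in `main_test.go`: `stopWithTimeout`, `demoShutdown`, and the `sleep 30` stand-in child are hypothetical, and the sketch assumes a Unix-like system where `sleep` is on the `PATH`.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// errShutdownTimeout reports that the child outlived the grace period.
var errShutdownTimeout = errors.New("child did not exit before the timeout")

// stopWithTimeout (hypothetical helper) sends SIGTERM to an already started
// command and waits up to timeout for it to exit. The child's exit status is
// deliberately ignored in this sketch.
func stopWithTimeout(cmd *exec.Cmd, timeout time.Duration) error {
	if err := cmd.Process.Signal(syscall.SIGTERM); err != nil {
		return err
	}
	done := make(chan struct{})
	go func() {
		_ = cmd.Wait() // reaps the child; the error only carries the exit status
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-time.After(timeout):
		_ = cmd.Process.Kill() // fall back to SIGKILL so the caller cannot hang
		<-done
		return errShutdownTimeout
	}
}

// demoShutdown starts a stand-in child process and stops it within 1000ms,
// the grace period chosen in Fix #1.
func demoShutdown() error {
	cmd := exec.Command("sleep", "30") // assumption: stand-in for the real server binary
	if err := cmd.Start(); err != nil {
		return err
	}
	return stopWithTimeout(cmd, 1000*time.Millisecond)
}

func main() {
	fmt.Println("shutdown error:", demoShutdown())
}
```

Falling back to `Kill` after the grace period keeps a slow shutdown from turning into a hung test, while the returned error still marks the run as a failure.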
## Tests Status Matrix

| Test | Individual | Full Suite | Priority | Notes |
|------|-----------|------------|----------|-------|
| TestMain_DefaultStartupGracefulShutdown_Subprocess | ✅ PASS | ❓ Unknown | HIGH | Fixed timeout |
| TestCredentialService_GetCredentialForDomain_WildcardMatch | ✅ PASS | ❌ FAIL | HIGH | No code changes needed |
| TestDeleteCertificate_CreatesBackup | ✅ PASS | ❌ FAIL | MEDIUM | No code changes needed |
| TestHeartbeatPoller_ConcurrentSafety | ✅ PASS | ❌ FAIL | MEDIUM | No code changes needed |
| TestSecurityService_LogAudit_ChannelFullFallsBackToSyncWrite | ✅ PASS | ❌ FAIL | MEDIUM | No code changes needed |

## Test Isolation Issue

**Observation:** Tests pass when run individually but fail in full suite execution.

**Likely Causes:**
1. **Global State Pollution:**
   - Tests modifying shared package-level variables
   - Singleton initialization state persisting between tests
   - Environment variables not being properly cleaned up

2. **Database Connection Leaks:**
   - SQLite in-memory databases not properly closed
   - GORM connection pool exhaustion
   - WAL mode journal files persisting

3. **Goroutine Leaks:**
   - Background goroutines from previous tests still running
   - Channels not being closed
   - Context cancellations not propagating

4. **Test Execution Order:**
   - Tests depending on specific execution order
   - Previous test failures leaving system in bad state
   - Resource cleanup in t.Cleanup() not executing due to panics

5. **Race Conditions (Go 1.26.0 Scheduler):**
   - Scheduling or preemption changes in Go 1.26.0 possibly exposing hidden races
   - Tests making timing assumptions that no longer hold
   - Concurrent test execution causing interference

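For the global-state cause in particular, one common pattern is to have every override return its own undo function and register that function with `t.Cleanup`. The sketch below is an assumption-laden illustration: `defaultDomain` and `swap` are hypothetical names, not identifiers from this codebase.

```go
package main

import "fmt"

// defaultDomain is a hypothetical package-level variable standing in for
// shared state that several tests might modify.
var defaultDomain = "example.com"

// swap stores v in *p and returns a function that puts the old value back.
// In a real test the returned function would be passed to t.Cleanup.
func swap[T any](p *T, v T) (restore func()) {
	prev := *p
	*p = v
	return func() { *p = prev }
}

func main() {
	restore := swap(&defaultDomain, "*.example.test")
	fmt.Println("during test:", defaultDomain)
	restore() // in a test: t.Cleanup(restore)
	fmt.Println("after test: ", defaultDomain)
}
```

For environment variables the standard library already provides this behavior: `t.Setenv` restores the previous value when the test finishes.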
## Investigation Blockers

**Current Blocker:** Full test suite hangs or takes excessive time (>2 minutes).

**Symptoms:**
- `go test ./...` hangs indefinitely or is terminated by the 120s timeout
- Cannot get full suite results to see which tests are actually failing
- Cannot collect coverage data from full suite run

**Needed:**
- Identify which test(s) are causing the hang
- Isolate hanging test(s) and run rest of suite
- Check for infinite loops or deadlocks in test cleanup

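One way to find the hanging test(s) is to save the output of `go test -json` and list every test that emitted a `run` event but no `pass`, `fail`, or `skip` before the output ended. The following is a minimal sketch under that assumption; `unfinished`, `unfinishedInString`, and the `sample` data are hypothetical and not part of the project's tooling.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"sort"
	"strings"
)

// event holds the `go test -json` fields this sketch needs.
type event struct {
	Action  string
	Package string
	Test    string
}

// unfinished returns "package.Test" names that started but never reported
// a result in the given `go test -json` stream.
func unfinished(r io.Reader) ([]string, error) {
	running := map[string]bool{}
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // tolerate long output lines
	for sc.Scan() {
		var ev event
		if err := json.Unmarshal(sc.Bytes(), &ev); err != nil || ev.Test == "" {
			continue // skip non-JSON lines and package-level events
		}
		key := ev.Package + "." + ev.Test
		switch ev.Action {
		case "run":
			running[key] = true
		case "pass", "fail", "skip":
			delete(running, key)
		}
	}
	names := make([]string, 0, len(running))
	for k := range running {
		names = append(names, k)
	}
	sort.Strings(names)
	return names, sc.Err()
}

// unfinishedInString is a convenience wrapper for in-memory output.
func unfinishedInString(s string) []string {
	names, _ := unfinished(strings.NewReader(s))
	return names
}

// sample is made-up output: TestFast finishes, TestStuck never reports.
const sample = `{"Action":"run","Package":"pkg/a","Test":"TestFast"}
{"Action":"pass","Package":"pkg/a","Test":"TestFast"}
{"Action":"run","Package":"pkg/a","Test":"TestStuck"}
`

func main() {
	for _, name := range unfinishedInString(sample) {
		fmt.Println("no result:", name)
	}
}
```

To use it on a real run, redirect `go test -json -p 1 ./...` to a file, then pass that file (opened with `os.Open`) to `unfinished` instead of the sample string.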
## Next Steps

### Option A: Sequential Investigation (4-6 hours)
1. Run tests package-by-package to identify hanging package
2. Use `-timeout 30s` flag to catch hanging tests quickly
3. Check for data races with `go test -race -p 1 ./...`; note that `-race` reports data races only, so goroutine leaks need a separate check
4. Review which tests call `t.Parallel()` to understand parallelization issues
5. Add `t.Cleanup()` verification to catch leak sources

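Because the race detector does not report leaked goroutines, a separate check is needed for them. A rough sketch of one approach is to compare `runtime.NumGoroutine()` before and after a piece of code runs. The helper `goroutineDelta` and the `clean` and `leaky` workers below are hypothetical, and the count is only a heuristic: unrelated goroutines starting or stopping in the same window would skew it.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// settle is how long goroutineDelta waits for finished goroutines to exit.
const settle = 200 * time.Millisecond

// goroutineDelta runs fn and reports how many more goroutines exist
// afterwards than before. A positive value suggests a leak.
func goroutineDelta(fn func()) int {
	before := runtime.NumGoroutine()
	fn()
	deadline := time.Now().Add(settle)
	// Poll briefly: a goroutine that has signalled completion may not have
	// fully exited yet.
	for runtime.NumGoroutine() > before && time.Now().Before(deadline) {
		time.Sleep(10 * time.Millisecond)
	}
	return runtime.NumGoroutine() - before
}

// clean starts a worker and waits for it to finish.
func clean() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { defer wg.Done() }()
	wg.Wait()
}

// leaky starts a worker that blocks forever on a channel nobody closes.
func leaky() {
	never := make(chan struct{})
	go func() { <-never }()
}

func main() {
	fmt.Println("clean delta:", goroutineDelta(clean))
	fmt.Println("leaky delta:", goroutineDelta(leaky))
}
```

In a test suite the same comparison could run from `TestMain` or a `t.Cleanup` callback, failing the run when the delta stays positive.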
### Option B: Quick Workaround (30 minutes)
1. Run tests with `-p 1` (one package at a time) to reduce cross-package interference; tests marked `t.Parallel()` inside a package still run concurrently unless `-parallel 1` is also set
2. Increase timeout: `-timeout 10m`
3. Skip known flaky tests temporarily with `t.Skip("Go 1.26.0 isolation issue")`
4. Create tracking issue for proper fix

### Option C: Rollback Go Version (NOT RECOMMENDED)
- Revert to Go 1.25.7
- Loses security fixes
- Defers the isolation problem instead of fixing it

## Recommendation

**Hybrid Approach:**
1. **Immediate (now):** Run tests with `-p 1 -timeout 5m` to run packages one at a time
2. **Short-term (today):** Identify hanging tests and skip with tracking issue
3. **Long-term (this week):** Fix test isolation properly with cleanup audits

**Why:** Unblocks CI immediately while preserving investigation path.

## Commands for Investigation

```bash
# Run packages one at a time with a timeout
go test -p 1 -timeout 5m ./...

# Find hanging test packages
for pkg in $(go list ./...); do
  echo "Testing $pkg..."
  timeout 30s go test -v "$pkg" || echo "FAILED or TIMEOUT: $pkg"
done

# Check for data races (-race does not detect goroutine leaks)
go test -race -p 1 -count=1 ./...

# Run specific packages
go test -v ./cmd/... ./internal/api/... ./internal/services/...
```

## Related Documents
- [docs/plans/GO_126_TEST_FAILURES_ANALYSIS.md](./GO_126_TEST_FAILURES_ANALYSIS.md) - Initial analysis
- [docs/plans/CI_TEST_FAILURES_DETAILED_REMEDIATION.md](./CI_TEST_FAILURES_DETAILED_REMEDIATION.md) - CI failures

## Action Items
- [ ] Run tests sequentially (`-p 1`) to check if parallelism is the issue
- [ ] Identify hanging test package
- [ ] Add timeout flags to test execution script
- [ ] Audit all tests for proper t.Cleanup() usage
- [ ] Add goroutine leak detection to CI
- [ ] Create tracking issue for test isolation fixes