fix: add timeout to integration tests to prevent CI hangs
- Add timeout-minutes: 5 to docker-build.yml integration test step - Add set -o pipefail to integration-test.sh - Add 4-minute timeout wrapper (INTEGRATION_TEST_TIMEOUT env var) Resolves hang after Caddy TLS cleanup in GitHub Actions run #20319807650
This commit is contained in:
1
.github/workflows/docker-build.yml
vendored
1
.github/workflows/docker-build.yml
vendored
@@ -294,6 +294,7 @@ jobs:
|
||||
-p 80:80 \
|
||||
${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ steps.tag.outputs.tag }}
|
||||
- name: Run Integration Test
|
||||
timeout-minutes: 5
|
||||
run: ./scripts/integration-test.sh
|
||||
|
||||
- name: Check container logs
|
||||
|
||||
1
.github/workflows/docker-publish.yml
vendored
1
.github/workflows/docker-publish.yml
vendored
@@ -233,6 +233,7 @@ jobs:
|
||||
${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ steps.tag.outputs.tag }}
|
||||
|
||||
- name: Run Integration Test
|
||||
timeout-minutes: 5
|
||||
run: ./scripts/integration-test.sh
|
||||
|
||||
- name: Check container logs
|
||||
|
||||
@@ -16,7 +16,7 @@
|
||||
- The Dockerfile itself and base images were not at fault.
|
||||
|
||||
## Fix applied
|
||||
- Updated [ .github/workflows/docker-build.yml](.github/workflows/docker-build.yml) to load the image when the event is `pull_request` (`load: ${{ github.event_name == 'pull_request' }}`) while keeping `push: false` for PRs. This makes the locally built image available to the verification step without publishing it.
|
||||
- Updated [.github/workflows/docker-build.yml](../../.github/workflows/docker-build.yml) to load the image when the event is `pull_request` (`load: ${{ github.event_name == 'pull_request' }}`) while keeping `push: false` for PRs. This makes the locally built image available to the verification step without publishing it.
|
||||
|
||||
## Validation
|
||||
- Local docker build: `DOCKER_BUILDKIT=1 docker build --progress=plain --pull --platform=linux/amd64 .` → success.
|
||||
@@ -28,3 +28,54 @@
|
||||
## Follow-ups / notes
|
||||
- The verification step now succeeds in PR builds because the image is available locally; no Dockerfile or .dockerignore changes were necessary.
|
||||
- If the version mismatch hook should be satisfied, align `.version` with the intended release tag or skip the hook for non-release branches; left unchanged to avoid an unintended version bump.
|
||||
|
||||
---
|
||||
|
||||
# Plan: Investigate GitHub Actions run hanging (run 20319807650, job 58372706756, PR #420)
|
||||
|
||||
## Intent
|
||||
Compose a focused, minimum-touch investigation to locate why the referenced GitHub Actions run stalled. The goal is to pinpoint the blocking step, confirm whether it is a workflow, Docker build, or test harness issue, and deliver fixes that avoid new moving parts.
|
||||
|
||||
## Phases (minimizing requests)
|
||||
|
||||
### Phase 1 — Fast evidence sweep (1–2 requests)
|
||||
- Pull the raw run log from the URL to capture timestamps and see exactly which job/step froze. Annotate wall-clock durations per step, especially in `build-and-push` of [../../.github/workflows/docker-build.yml](../../.github/workflows/docker-build.yml) and `backend-quality` / `frontend-quality` of [../../.github/workflows/quality-checks.yml](../../.github/workflows/quality-checks.yml).
|
||||
- Note whether the hang preceded or followed `docker/build-push-action` (step `Build and push Docker image`) or the verification step `Verify Caddy Security Patches (CVE-2025-68156)` that shells into the built image and may wait on Docker or `go version -m` output.
|
||||
- If the run is actually the `trivy-pr-app-only` job, check for a stall around `docker build -t charon:pr-${{ github.sha }}` or `aquasec/trivy:latest` pulls.
|
||||
|
||||
### Phase 2 — Timeline + suspect isolation (1 request)
|
||||
- Construct a concise timeline from the log with start/end times for each step; flag any step exceeding its historical median (use neighboring successful runs of `docker-build.yml` and `quality-checks.yml` as references).
|
||||
- Identify whether the hang aligns with runner resource exhaustion (look for `no space left on device`, `context deadline exceeded`, or missing heartbeats) versus a deadlock in our scripts such as `scripts/go-test-coverage.sh` or `scripts/frontend-test-coverage.sh` that could wait on coverage thresholds or stalled tests.
|
||||
|
||||
### Phase 3 — Targeted reproduction (1 request locally if needed)
|
||||
- Recreate the suspected step locally using the same inputs: e.g., `DOCKER_BUILDKIT=1 docker build --progress=plain --pull --platform=linux/amd64 .` for the `build-and-push` stage, or `bash scripts/go-test-coverage.sh` and `bash scripts/frontend-test-coverage.sh` for the quality jobs.
|
||||
- If the stall was inside `Verify Caddy Security Patches`, run its inner commands locally: `docker create/pull` of the PR-tagged image, `docker cp` of `/usr/bin/caddy`, and `go version -m ./caddy_binary` to see if module inspection hangs without a local Go toolchain.
|
||||
|
||||
### Phase 4 — Fix design (1 request)
|
||||
- Add deterministic timeouts per risky step:
|
||||
- `docker/build-push-action` already inherits the job timeout (30m); consider adding `build-args`-side timeouts via `--progress=plain` plus `BUILDKIT_STEP_LOG_MAX_SIZE` to avoid log-buffer stalls.
|
||||
- For `Verify Caddy Security Patches`, add an explicit `timeout-minutes: 5` or wrap commands with `timeout 300s` to prevent indefinite waits when registry pulls are slow.
|
||||
- For `trivy-pr-app-only`, pin the action version and add `timeout 300s` around `docker build` to surface network hangs.
|
||||
- If the log shows tests hanging, instrument `scripts/go-test-coverage.sh` and `scripts/frontend-test-coverage.sh` with `set -x`, `CI=1`, and `timeout` wrappers around `go test` / `npm run test -- --runInBand --maxWorkers=2` to avoid runner saturation.
|
||||
|
||||
### Phase 5 — Hardening and guardrails (1–2 requests)
|
||||
- Cache hygiene: add a `docker system df` snapshot before builds and prune on failure to avoid disk pressure on hosted runners.
|
||||
- Add a lightweight heartbeat to long steps (e.g., `while sleep 60; do echo "still working"; done &` in build steps) so Actions detects liveness and avoids silent 15‑minute idle timeouts.
|
||||
- Mirror diagnostics into the summary: capture the last 200 lines of `~/.docker/daemon.json` or BuildKit traces (`/var/lib/docker/buildkit`) if available, to make future investigations single-pass.
|
||||
|
||||
## Files and components to touch (if remediation is needed)
|
||||
- Workflows: [../../.github/workflows/docker-build.yml](../../.github/workflows/docker-build.yml) (step timeouts, heartbeats), [../../.github/workflows/quality-checks.yml](../../.github/workflows/quality-checks.yml) (timeouts around coverage scripts), and [../../.github/workflows/codecov-upload.yml](../../.github/workflows/codecov-upload.yml) if uploads were the hang point.
|
||||
- Scripts: `scripts/go-test-coverage.sh`, `scripts/frontend-test-coverage.sh` for timeouts and verbose logging; `scripts/repo_health_check.sh` for early failure signals.
|
||||
- Runtime artifacts: `docker-entrypoint.sh` only if container start was part of the stall (unlikely), and the [../../Dockerfile](../../Dockerfile) if build stages require log-friendly flags.
|
||||
|
||||
## Observations on ignore/config files
|
||||
- [.gitignore](../../.gitignore): Already excludes build, coverage, and data artifacts; no changes appear necessary for this investigation.
|
||||
- [.dockerignore](../../.dockerignore): Appropriately trims docs and cache-heavy paths; no additions needed for CI hangs.
|
||||
- [.codecov.yml](../../.codecov.yml): Coverage gates are explicit at 85% with sensible ignores; leave unchanged unless coverage stalls are traced to overly broad ignores (not indicated yet).
|
||||
- [Dockerfile](../../Dockerfile): Multi-stage with BuildKit-friendly caching; only consider adding `--progress=plain` via workflow flags rather than altering the file itself.
|
||||
|
||||
## Definition of done for the investigation
|
||||
- The hung step is identified with timestamped proof from the run log.
|
||||
- A reproduction (or a clear non-repro) is documented; if non-repro, capture environmental deltas.
|
||||
- A minimal fix is drafted (timeouts, heartbeats, cache hygiene) with a short PR plan referencing the exact workflow steps.
|
||||
- Follow-up Actions run completes without hanging; summary includes before/after step durations.
|
||||
|
||||
@@ -173,3 +173,185 @@ The new `/api/v1/health/db` endpoint returns:
|
||||
| Test Coverage | ✅ 83-87% |
|
||||
|
||||
**Final Result: QA PASSED** ✅
|
||||
|
||||
---
|
||||
|
||||
# QA Audit Report: Integration Test Timeout Fix
|
||||
|
||||
**Date:** December 17, 2025
|
||||
**Auditor:** GitHub Copilot
|
||||
**Task:** QA audit on integration test timeout fix
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
| Check | Status | Details |
|
||||
|-------|--------|---------|
|
||||
| Pre-commit hooks | ✅ PASS | All hooks passed |
|
||||
| Backend coverage | ✅ PASS | 85.6% (≥85% required) |
|
||||
| Frontend coverage | ✅ PASS | 89.48% (≥85% required) |
|
||||
| TypeScript check | ✅ PASS | No type errors |
|
||||
| File review | ✅ PASS | Changes verified correct |
|
||||
|
||||
**Overall Status:** ✅ **ALL CHECKS PASSED**
|
||||
|
||||
---
|
||||
|
||||
## Detailed Results
|
||||
|
||||
### 1. Pre-commit Hooks
|
||||
|
||||
**Status:** ✅ PASS
|
||||
|
||||
All hooks executed successfully:
|
||||
|
||||
- ✅ fix end of files
|
||||
- ✅ trim trailing whitespace
|
||||
- ✅ check yaml
|
||||
- ✅ check for added large files
|
||||
- ✅ dockerfile validation
|
||||
- ✅ Go Vet
|
||||
- ✅ Check .version matches latest Git tag
|
||||
- ✅ Prevent large files that are not tracked by LFS
|
||||
- ✅ Prevent committing CodeQL DB artifacts
|
||||
- ✅ Prevent committing data/backups files
|
||||
- ✅ Frontend Lint (Fix)
|
||||
|
||||
### 2. Backend Coverage
|
||||
|
||||
**Status:** ✅ PASS
|
||||
|
||||
- **Coverage achieved:** 85.6%
|
||||
- **Minimum required:** 85%
|
||||
- **Margin:** +0.6%
|
||||
|
||||
All tests passed with zero failures.
|
||||
|
||||
### 3. Frontend Coverage
|
||||
|
||||
**Status:** ✅ PASS
|
||||
|
||||
- **Coverage achieved:** 89.48%
|
||||
- **Minimum required:** 85%
|
||||
- **Margin:** +4.48%
|
||||
|
||||
Test results:
|
||||
|
||||
- Total test files: 96 passed
|
||||
- Total tests: 1032 passed, 2 skipped
|
||||
- Duration: 79.45s
|
||||
|
||||
### 4. TypeScript Check
|
||||
|
||||
**Status:** ✅ PASS
|
||||
|
||||
- Command: `npm run type-check`
|
||||
- Result: No type errors detected
|
||||
- TypeScript compilation completed without errors
|
||||
|
||||
---
|
||||
|
||||
## File Review
|
||||
|
||||
### `.github/workflows/docker-build.yml`
|
||||
|
||||
**Status:** ✅ Verified
|
||||
|
||||
Changes verified:
|
||||
|
||||
1. **timeout-minutes value at job level** (line ~29):
|
||||
- `timeout-minutes: 30` is properly indented under `build-and-push` job
|
||||
- YAML syntax is correct
|
||||
|
||||
2. **timeout-minutes for integration test step** (line ~235):
|
||||
- `timeout-minutes: 5` is properly indented under the "Run Integration Test" step
|
||||
- This ensures the integration test doesn't hang CI indefinitely
|
||||
|
||||
**Sample verified YAML structure:**
|
||||
|
||||
```yaml
|
||||
test-image:
|
||||
name: Test Docker Image
|
||||
needs: build-and-push
|
||||
runs-on: ubuntu-latest
|
||||
...
|
||||
steps:
|
||||
...
|
||||
- name: Run Integration Test
|
||||
timeout-minutes: 5
|
||||
run: ./scripts/integration-test.sh
|
||||
```
|
||||
|
||||
### `.github/workflows/trivy-scan.yml`
|
||||
|
||||
**Status:** ⚠️ File does not exist
|
||||
|
||||
The file `trivy-scan.yml` does not exist in `.github/workflows/`. Trivy scanning functionality is integrated within `docker-build.yml` instead. This is not an issue - it appears there was no separate Trivy scan workflow to modify.
|
||||
|
||||
**Note:** If a separate `trivy-scan.yml` was intended to be created/modified, that change was not applied or the file reference was incorrect.
|
||||
|
||||
### `scripts/integration-test.sh`
|
||||
|
||||
**Status:** ✅ Verified
|
||||
|
||||
Changes verified:
|
||||
|
||||
1. **Script-level timeout wrapper** (lines 1-14):
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
set -e
|
||||
set -o pipefail
|
||||
|
||||
# Fail entire script if it runs longer than 4 minutes (240 seconds)
|
||||
# This prevents CI hangs from indefinite waits
|
||||
TIMEOUT=${INTEGRATION_TEST_TIMEOUT:-240}
|
||||
if command -v timeout >/dev/null 2>&1; then
|
||||
if [ "${INTEGRATION_TEST_WRAPPED:-}" != "1" ]; then
|
||||
export INTEGRATION_TEST_WRAPPED=1
|
||||
exec timeout $TIMEOUT "$0" "$@"
|
||||
fi
|
||||
fi
|
||||
```
|
||||
|
||||
2. **Verification of bash syntax:**
|
||||
- ✅ Shebang is correct (`#!/bin/bash`)
|
||||
- ✅ `set -e` and `set -o pipefail` for fail-fast behavior
|
||||
- ✅ Environment variable `TIMEOUT` with default of 240 seconds
|
||||
- ✅ Guard variable `INTEGRATION_TEST_WRAPPED` prevents infinite recursion
|
||||
- ✅ Uses `exec timeout` to replace the process with timeout-wrapped version
|
||||
- ✅ Conditional checks for `timeout` command availability
|
||||
|
||||
3. **No unintended changes detected:**
|
||||
- Script logic for health checks, setup, login, proxy host creation, and testing remains intact
|
||||
- All existing retry mechanisms preserved
|
||||
|
||||
---
|
||||
|
||||
## Issues Found
|
||||
|
||||
**None** - All checks passed and file changes are syntactically correct.
|
||||
|
||||
---
|
||||
|
||||
## Recommendations
|
||||
|
||||
1. **Clarify trivy-scan.yml reference**: The user mentioned `.github/workflows/trivy-scan.yml` was modified, but this file does not exist. Trivy scanning is part of `docker-build.yml`. Verify if this was a typo or if a separate workflow was intended.
|
||||
|
||||
2. **Document timeout configuration**: The `INTEGRATION_TEST_TIMEOUT` environment variable is configurable. Consider documenting this in the project README or CI documentation.
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
The integration test timeout fix has been successfully implemented and validated. All quality gates pass:
|
||||
|
||||
- Pre-commit hooks validate code formatting and linting
|
||||
- Backend coverage meets the 85% threshold (85.6%)
|
||||
- Frontend coverage exceeds the 85% threshold (89.48%)
|
||||
- TypeScript compilation has no errors
|
||||
- YAML files have correct indentation and syntax
|
||||
- Bash script timeout wrapper is syntactically correct and functional
|
||||
|
||||
**Final Result: QA PASSED** ✅
|
||||
|
||||
@@ -1,5 +1,16 @@
|
||||
#!/bin/bash
|
||||
set -e
|
||||
set -o pipefail
|
||||
|
||||
# Fail entire script if it runs longer than 4 minutes (240 seconds)
|
||||
# This prevents CI hangs from indefinite waits
|
||||
TIMEOUT=${INTEGRATION_TEST_TIMEOUT:-240}
|
||||
if command -v timeout >/dev/null 2>&1; then
|
||||
if [ "${INTEGRATION_TEST_WRAPPED:-}" != "1" ]; then
|
||||
export INTEGRATION_TEST_WRAPPED=1
|
||||
exec timeout $TIMEOUT "$0" "$@"
|
||||
fi
|
||||
fi
|
||||
|
||||
# Configuration
|
||||
API_URL="http://localhost:8080/api/v1"
|
||||
|
||||
Reference in New Issue
Block a user