chore(docs): archive uptime monitoring regression investigation plan to address false DOWN states

GitHub Actions
2026-03-02 22:44:41 +00:00
parent 9c203914dd
commit a1a9ab2ece
2 changed files with 362 additions and 0 deletions
@@ -0,0 +1,586 @@
---
post_title: "Current Spec: Local Docker Socket Group Access Remediation"
categories:
- planning
- docker
- security
- backend
- frontend
tags:
- docker.sock
- least-privilege
- group-add
- compose
- validation
summary: "Comprehensive plan to resolve local Docker socket access failures for a non-root process (uid=1000 gid=1000) when the host socket GID is not in its supplemental groups, with phased rollout, PR slicing, and least-privilege validation."
post_date: 2026-02-25
---
## 1) Introduction
### Overview
Charon local Docker discovery currently fails in environments where:
- Socket mount exists: `/var/run/docker.sock:/var/run/docker.sock:ro`
- Charon process runs non-root (typically `uid=1000 gid=1000`)
- Host socket group (example: `gid=988`) is not present in process supplemental groups
Observed user-facing failure class (already emitted by backend details builder):
- `Local Docker socket mounted but not accessible by current process (uid=1000 gid=1000)... Process groups do not include socket gid 988; run container with matching supplemental group (e.g., --group-add 988).`
### Goals
1. Preserve non-root default execution (`USER charon`) while enabling local Docker discovery safely.
2. Standardize supplemental-group strategy across compose variants and launcher scripts.
3. Keep behavior deterministic in backend/API/frontend error surfacing when permissions are wrong.
4. Validate least-privilege posture (non-root, minimal group grant, no broad privilege escalation).
### Non-Goals
- No redesign of remote Docker support (`tcp://...`) beyond compatibility checks.
- No changes to unrelated security modules (WAF, ACL, CrowdSec workflows).
- No broad Docker daemon hardening beyond this socket-access path.
### Scope Labels (Authoritative)
- `repo-deliverable`: changes that must be included in repository PR slices under `/projects/Charon`.
- `operator-local follow-up`: optional local environment changes outside repository scope (for example `/root/docker/...`), not required for repo PR acceptance.
---
## 2) Research Findings
### 2.1 Critical Runtime Files (Confirmed)
- `backend/internal/services/docker_service.go`
- Key functions:
- `NewDockerService()`
- `(*DockerService).ListContainers(...)`
- `resolveLocalDockerHost()`
- `buildLocalDockerUnavailableDetails(...)`
- `isDockerConnectivityError(...)`
- `extractErrno(...)`
- `localSocketStatSummary(...)`
- Contains explicit supplemental-group hint text with `--group-add <gid>` when `EACCES/EPERM` occurs.
- `backend/internal/api/handlers/docker_handler.go`
- Key function: `(*DockerHandler).ListContainers(...)`
- Maps `DockerUnavailableError` to HTTP `503` with `details` string consumed by UI.
- `frontend/src/hooks/useDocker.ts`
- Hook: `useDocker(host?, serverId?)`
- Converts `503` payload details into surfaced `Error(message)`.
- `frontend/src/components/ProxyHostForm.tsx`
- Uses `useDocker`.
- Error panel title: `Docker Connection Failed`.
- Existing troubleshooting text mentions the socket mount but not the explicit supplemental-group action.
- `.docker/docker-entrypoint.sh`
- Root path auto-aligns docker socket GID with user group membership via:
- `get_group_by_gid()`
- `create_group_with_gid()`
- `add_user_to_group()`
- Non-root path logs generic `--group-add` guidance but does not include resolved host socket GID.
- `Dockerfile`
- Creates non-root user `charon` (uid/gid 1000) and final `USER charon`.
- This is correct for least privilege and should remain default.
### 2.2 Compose and Script Surface Area
Primary in-repo compose files with docker socket mount:
- `.docker/compose/docker-compose.yml` (`charon` service)
- `.docker/compose/docker-compose.local.yml` (`charon` service)
- `.docker/compose/docker-compose.dev.yml` (`app` service)
- `.docker/compose/docker-compose.playwright-local.yml` (`charon-e2e` service)
- `.docker/compose/docker-compose.playwright-ci.yml` (`charon-app`, `crowdsec` services)
Primary out-of-repo/local-ops file in active workspace:
- `/root/docker/containers/charon/docker-compose.yml` (`charon` service)
- Includes socket mount.
- `user:` is currently commented out.
- No `group_add` entry exists.
Launcher scripts discovered:
- `.github/skills/docker-start-dev-scripts/run.sh`
- Runs: `docker compose -f .docker/compose/docker-compose.dev.yml up -d`
- `/root/docker/containers/charon/docker-compose-up-charon.sh`
- Runs: `docker compose up -d`
### 2.3 Existing Tests Relevant to This Failure
Backend service tests (`backend/internal/services/docker_service_test.go`):
- `TestBuildLocalDockerUnavailableDetails_PermissionDeniedIncludesGroupHint`
- `TestBuildLocalDockerUnavailableDetails_MissingSocket`
- Connectivity classification tests across URL/syscall/network errors.
Backend handler tests (`backend/internal/api/handlers/docker_handler_test.go`):
- `TestDockerHandler_ListContainers_DockerUnavailableMappedTo503`
- Other selector and remote-host mapping tests.
Frontend hook tests (`frontend/src/hooks/__tests__/useDocker.test.tsx`):
- `it('extracts details from 503 service unavailable error', ...)`
### 2.4 Config Review Findings (`.gitignore`, `codecov.yml`, `.dockerignore`, `Dockerfile`)
- `.gitignore`: no blocker for this feature; already excludes local env/artifacts extensively.
- `.dockerignore`: no blocker for this feature; includes docs/tests and build artifacts exclusions.
- `Dockerfile`: non-root default is aligned with least-privilege intent.
- `codecov.yml`: currently excludes the two key Docker logic files:
- `backend/internal/services/docker_service.go`
- `backend/internal/api/handlers/docker_handler.go`
This exclusion undermines regression visibility for this exact problem class and should be revised.
### 2.5 Confidence
Confidence score: **97%**
Reasoning:
- Root cause and symptom path are already explicit in code.
- Required files and control points are concrete and localized.
- Existing tests already cover adjacent behavior and reduce implementation risk.
---
## 3) Requirements (EARS)
- WHEN local Docker source is selected and `/var/run/docker.sock` is mounted, THE SYSTEM SHALL return containers if the process has supplemental membership for socket GID.
- WHEN local Docker source is selected and socket permissions deny access (`EACCES`/`EPERM`), THE SYSTEM SHALL return HTTP `503` with a deterministic, actionable details message including supplemental-group guidance.
- WHEN container runs non-root and socket GID is known, THE SYSTEM SHALL provide explicit startup diagnostics indicating the required `group_add` value.
- WHEN docker-compose-based local/dev startup is used, THE SYSTEM SHALL support local-only `group_add` configuration from host socket GID without requiring root process runtime.
- WHEN remote Docker source is selected (`server_id` path), THE SYSTEM SHALL remain functionally unchanged.
- WHEN least-privilege validation is executed, THE SYSTEM SHALL demonstrate non-root process execution and only necessary supplemental group grant.
- IF resolved socket GID equals `0`, THEN THE SYSTEM SHALL require explicit operator opt-in and risk acknowledgment before any `group_add: ["0"]` path is used.
---
## 4) Technical Specifications
### 4.1 Architecture and Data Flow
User flow:
1. UI `ProxyHostForm` sets source = `Local (Docker Socket)`.
2. `useDocker(...)` calls `dockerApi.listContainers(...)`.
3. Backend `DockerHandler.ListContainers(...)` invokes `DockerService.ListContainers(...)`.
4. If socket access denied, backend emits `DockerUnavailableError` with details.
5. Handler returns `503` JSON `{ error, details }`.
6. Frontend surfaces message in `Docker Connection Failed` block.
No database schema change is required.
### 4.2 API Contract (No endpoint shape change)
Endpoint:
- `GET /api/v1/docker/containers`
- Query params:
- `host` (allowed: empty or `local` only)
- `server_id` (UUID for remote server lookup)
Responses:
- `200 OK`: `DockerContainer[]`
- `503 Service Unavailable`:
- `error: "Docker daemon unavailable"`
- `details: <actionable message>`
- `400`, `404`, `500` unchanged.
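The contract above can be probed quickly from a shell. In the sketch below, `CHARON_URL` is a placeholder for the deployment-specific base URL, and the `details` text shown is the message class already emitted by `buildLocalDockerUnavailableDetails(...)`:
```bash
# Hypothetical probe of the local-source endpoint. CHARON_URL is a placeholder
# for the deployment-specific base URL; adjust port/auth as needed.
CHARON_URL="${CHARON_URL:-http://localhost:8080}"

curl -s -w '\nHTTP %{http_code}\n' "$CHARON_URL/api/v1/docker/containers?host=local"
# Correctly configured host: HTTP 200 with a JSON array of DockerContainer objects.
# Permission failure follows the 503 shape above, e.g.:
# {"error":"Docker daemon unavailable","details":"Local Docker socket mounted but not accessible by current process (uid=1000 gid=1000)..."}
```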
### 4.3 Deterministic `group_add` Policy (Chosen)
Chosen policy: **conditional local-only profile/override while keeping CI unaffected**.
Authoritative policy statement:
1. `repo-deliverable`: repository compose paths used for local operator runs (`.docker/compose/docker-compose.local.yml`, `.docker/compose/docker-compose.dev.yml`) may include local-only `group_add` wiring using `DOCKER_SOCK_GID`.
2. `repo-deliverable`: CI compose paths (`.docker/compose/docker-compose.playwright-ci.yml`) remain unaffected by this policy and must not require `DOCKER_SOCK_GID`.
3. `repo-deliverable`: base compose (`.docker/compose/docker-compose.yml`) remains safe by default and must not force a local host-specific GID requirement in CI.
4. `operator-local follow-up`: out-of-repo operator files (for example `/root/docker/containers/charon/docker-compose.yml`) may mirror this policy but are explicitly outside mandatory repo PR scope.
CI compatibility statement:
- CI workflows remain deterministic because they do not depend on local host socket GID export for this remediation.
- No CI job should fail due to missing `DOCKER_SOCK_GID` after this plan.
Security guardrail for `gid==0` (mandatory):
- If `stat -c '%g' /var/run/docker.sock` returns `0`, local profile/override usage must fail closed by default.
- Enabling `group_add: ["0"]` requires explicit opt-in (for example `ALLOW_DOCKER_SOCK_GID_0=true`) and documented risk acknowledgment in operator guidance.
- Silent fallback to GID `0` is prohibited.
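A minimal operator-side sketch of this policy, assuming the `DOCKER_SOCK_GID` and `ALLOW_DOCKER_SOCK_GID_0` variable names defined above and the local compose file listed in section 2.2 (the `group_add` wiring itself lands in PR-1):
```bash
#!/usr/bin/env bash
# Sketch of the proposed local-only startup flow (not CI): resolve the host
# socket GID, fail closed on GID 0 unless explicitly acknowledged, then start
# the local compose profile with the supplemental group exported.
set -euo pipefail

SOCK=/var/run/docker.sock
DOCKER_SOCK_GID="$(stat -c '%g' "$SOCK")"

if [ "$DOCKER_SOCK_GID" = "0" ] && [ "${ALLOW_DOCKER_SOCK_GID_0:-false}" != "true" ]; then
  echo "Refusing to grant root group (socket gid=0). Set ALLOW_DOCKER_SOCK_GID_0=true to opt in." >&2
  exit 1
fi

export DOCKER_SOCK_GID
# Assumes the local compose file wires: group_add: ["${DOCKER_SOCK_GID}"]
docker compose -f .docker/compose/docker-compose.local.yml up -d
```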
### 4.4 Entrypoint Diagnostic Improvements
In `.docker/docker-entrypoint.sh` non-root socket branch:
- Extend current message to include resolved socket GID from `stat -c '%g' /var/run/docker.sock`.
- Emit exact recommendation format:
- `Use docker compose group_add: ["<gid>"] or run with --group-add <gid>`
- If resolved GID is `0`, emit explicit warning requiring opt-in/risk acknowledgment instead of generic recommendation.
No privilege escalation should be introduced.
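A hedged sketch of what the extended non-root branch could look like; message wording and variable names are illustrative and will be finalized in PR-1:
```bash
# Sketch for the non-root branch of .docker/docker-entrypoint.sh: report the
# resolved socket GID instead of generic guidance. Wording is illustrative.
DOCKER_SOCK=/var/run/docker.sock
if [ -S "$DOCKER_SOCK" ] && [ ! -w "$DOCKER_SOCK" ]; then
  sock_gid="$(stat -c '%g' "$DOCKER_SOCK" 2>/dev/null || echo unknown)"
  if [ "$sock_gid" = "0" ]; then
    echo "WARNING: docker socket group is root (gid=0); group_add: [\"0\"] requires explicit opt-in and risk acknowledgment." >&2
  else
    echo "Docker socket not accessible by $(id -u):$(id -g)." >&2
    echo "Use docker compose group_add: [\"$sock_gid\"] or run with --group-add $sock_gid" >&2
  fi
fi
```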
### 4.5 Frontend UX Message Precision
In `frontend/src/components/ProxyHostForm.tsx` troubleshooting text:
- Retain mount guidance.
- Add supplemental-group guidance for containerized runs.
- Keep language concise and operational.
### 4.6 Coverage and Quality Config Adjustments
`codecov.yml` review outcome:
- Proposed: remove Docker logic file ignores for:
- `backend/internal/services/docker_service.go`
- `backend/internal/api/handlers/docker_handler.go`
- Reason: this issue is rooted in these files; exclusion hides regressions.
`.gitignore` review outcome:
- No change required for core remediation.
`.dockerignore` review outcome:
- No required change for runtime fix.
- Optional follow-up: verify that no additional local-only compose/env files are copied into the image in future changes.
`Dockerfile` review outcome:
- No required behavioral change; preserve non-root default.
---
## 5) Risks, Edge Cases, Mitigations
### Risks
1. Host socket GID differs across environments (the `docker` group has no stable numeric ID).
2. CI runners may not permit or need explicit `group_add` depending on runner Docker setup.
3. Over-granting groups could violate least-privilege intent.
4. Socket GID can be `0` on some hosts and implies root-group blast radius.
### Edge Cases
- Socket path missing (`ENOENT`) remains handled with existing details path.
- Rootless host Docker sockets (`/run/user/<uid>/docker.sock`) remain selectable by `resolveLocalDockerHost()`.
- Remote server discovery path (`tcp://...`) must remain unaffected.
### Mitigations
- Use environment-substituted `DOCKER_SOCK_GID`, not hardcoded `988` in committed compose files.
- Keep `group_add` scoped only to local operator flows that require socket discovery.
- Fail closed on `DOCKER_SOCK_GID=0` unless explicit opt-in and risk acknowledgment are present.
- Verify `id` output inside the container to confirm only the necessary supplemental group is present.
---
## 6) Implementation Plan (Phased, minimal request count)
Design principle for phases: maximize delivery per request by grouping strongly-related changes into each phase and minimizing handoffs.
### Phase 1 — Baseline + Diagnostics + Compose Foundations
Scope:
1. Compose updates in local/dev paths to support local-only `group_add` via `DOCKER_SOCK_GID`.
2. Entrypoint diagnostic enhancement for non-root socket path.
`repo-deliverable` files:
- `.docker/compose/docker-compose.local.yml`
- `.docker/compose/docker-compose.dev.yml`
- `.docker/docker-entrypoint.sh`
`operator-local follow-up` files (non-blocking, out of repo PR scope):
- `/root/docker/containers/charon/docker-compose.yml`
- `/root/docker/containers/charon/docker-compose-up-charon.sh`
Deliverables:
- Deterministic startup guidance and immediate local remediation path.
### Phase 2 — API/UI Behavior Tightening + Tests
Scope:
1. Preserve and, if needed, refine backend detail text consistency in `buildLocalDockerUnavailableDetails(...)`.
2. UI troubleshooting copy update in `ProxyHostForm.tsx`.
3. Expand/refresh tests for permission-denied + supplemental-group hint rendering path.
Primary files:
- `backend/internal/services/docker_service.go`
- `backend/internal/services/docker_service_test.go`
- `backend/internal/api/handlers/docker_handler.go`
- `backend/internal/api/handlers/docker_handler_test.go`
- `frontend/src/hooks/useDocker.ts`
- `frontend/src/hooks/__tests__/useDocker.test.tsx`
- `frontend/src/components/ProxyHostForm.tsx`
- `frontend/src/components/__tests__/ProxyHostForm*.test.tsx`
Deliverables:
- User sees precise, actionable guidance when failure occurs.
- Regression tests protect failure classification and surfaced guidance.
### Phase 3 — Coverage Policy + Documentation + CI/Validation Hardening
Scope:
1. Remove Docker logic exclusions in `codecov.yml`.
2. Update docs to include `group_add` guidance where socket mount is described.
3. Validate CI/playwright compose behavior remains unaffected and verify local least-privilege checks.
Primary files:
- `codecov.yml`
- `README.md`
- `docs/getting-started.md`
- `SECURITY.md`
- `.vscode/tasks.json` (only if adding dedicated validation task labels)
Deliverables:
- Documentation and coverage policy match runtime behavior.
- Verified validation playbook for operators and CI.
---
## 7) PR Slicing Strategy
### Decision
**Split into multiple PRs (PR-1 / PR-2 / PR-3).**
### Trigger Reasons
- Cross-domain change set (compose + shell entrypoint + backend + frontend + tests + docs + coverage policy).
- Distinct rollback boundaries needed (runtime config vs behavior vs governance/reporting).
- Faster and safer review with independently verifiable increments.
### Ordered PR Slices
#### PR-1: Runtime Access Foundation (Compose + Entrypoint)
Scope:
- Add local-only `group_add` strategy to local/dev compose flows.
- Improve non-root entrypoint diagnostics to print required GID.
Files (expected):
- `.docker/compose/docker-compose.local.yml`
- `.docker/compose/docker-compose.dev.yml`
- `.docker/docker-entrypoint.sh`
Operator-local follow-up (not part of repo PR gate):
- `/root/docker/containers/charon/docker-compose.yml`
- `/root/docker/containers/charon/docker-compose-up-charon.sh`
Dependencies:
- None.
Acceptance criteria:
1. Container remains non-root (`id -u = 1000`).
2. With local-only config enabled and `DOCKER_SOCK_GID` exported, `id -G` inside container includes socket GID.
3. `GET /api/v1/docker/containers?host=local` no longer fails due to `EACCES` in correctly configured environment.
4. If resolved socket GID is `0`, setup fails by default unless explicit opt-in and risk acknowledgment are provided.
Rollback/contingency:
- Revert compose and entrypoint deltas only.
#### PR-2: Behavior + UX + Tests
Scope:
- Backend details consistency (if required).
- Frontend troubleshooting message update.
- Add/adjust tests around permission-denied + supplemental-group guidance.
Files (expected):
- `backend/internal/services/docker_service.go`
- `backend/internal/services/docker_service_test.go`
- `backend/internal/api/handlers/docker_handler.go`
- `backend/internal/api/handlers/docker_handler_test.go`
- `frontend/src/hooks/useDocker.ts`
- `frontend/src/hooks/__tests__/useDocker.test.tsx`
- `frontend/src/components/ProxyHostForm.tsx`
- `frontend/src/components/__tests__/ProxyHostForm*.test.tsx`
Dependencies:
- PR-1 recommended (runtime setup available for realistic local validation).
Acceptance criteria:
1. `503` details include actionable group guidance for permission-denied scenarios.
2. UI error panel provides mount + supplemental-group troubleshooting.
3. All touched unit/e2e tests pass for local Docker source path.
Rollback/contingency:
- Revert only behavior/UI/test deltas; keep PR-1 foundations.
#### PR-3: Coverage + Docs + Validation Playbook
Scope:
- Update `codecov.yml` exclusions for Docker logic files.
- Update user/operator docs where socket mount guidance appears.
- Optional task additions for socket-permission diagnostics.
Files (expected):
- `codecov.yml`
- `README.md`
- `docs/getting-started.md`
- `SECURITY.md`
- `.vscode/tasks.json` (optional)
Dependencies:
- PR-2 preferred to ensure policy aligns with test coverage additions.
Acceptance criteria:
1. Codecov includes Docker service/handler in coverage accounting.
2. Docs show both socket mount and supplemental-group requirement.
3. Validation command set is documented and reproducible.
Rollback/contingency:
- Revert reporting/docs/task changes only.
---
## 8) Validation Strategy (Protocol-Ordered)
### 8.1 E2E Prerequisite / Rebuild Check (Mandatory First)
Follow project protocol to decide whether E2E container rebuild is required before tests:
1. If application/runtime or Docker build inputs changed, rebuild E2E environment.
2. If only test files changed and environment is healthy, reuse current container.
3. If environment state is suspect, rebuild.
Primary task:
- VS Code task: `Docker: Rebuild E2E Environment` (or clean variant when needed).
### 8.2 E2E First (Mandatory)
Run E2E before unit tests:
- VS Code task: `Test: E2E Playwright (Targeted Suite)` for scoped regression checks.
- VS Code task: `Test: E2E Playwright (Skill)` for broader safety pass as needed.
### 8.3 Local Patch Report (Mandatory Before Unit/Coverage)
Generate patch artifacts immediately after E2E:
```bash
cd /projects/Charon
bash scripts/local-patch-report.sh
```
Required artifacts:
- `test-results/local-patch-report.md`
- `test-results/local-patch-report.json`
### 8.4 Unit + Coverage Validation
Backend and frontend unit coverage gates after patch report:
```bash
cd /projects/Charon/backend && go test ./internal/services ./internal/api/handlers
cd /projects/Charon/frontend && npm run test -- src/hooks/__tests__/useDocker.test.tsx
```
Then run coverage tasks/scripts per project protocol (minimum threshold enforcement remains unchanged).
### 8.5 Least-Privilege + `gid==0` Guardrail Checks
Pass conditions:
1. Container process remains non-root.
2. Supplemental group grant is limited to socket GID only for local operator flow.
3. No privileged mode or unrelated capability additions.
4. Socket remains read-only.
5. If socket GID resolves to `0`, local run fails closed unless explicit opt-in and risk acknowledgment are present.
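The pass conditions above can be spot-checked with commands along these lines; the `charon` service name comes from the repo compose files, and the exact compose invocation is environment-specific:
```bash
# Hedged verification of the pass conditions; assumes the local compose stack
# is running with a service named "charon".
docker compose exec charon id -u        # 1. expect 1000 (non-root)
docker compose exec charon id -G        # 2. expect 1000 plus the socket GID, nothing broader
docker compose exec charon ls -ln /var/run/docker.sock   # socket owner/group/mode as seen by the process
docker compose config | grep 'docker.sock'                # 4. mount line should end in :ro
docker compose config | grep -iE 'privileged|cap_add' || echo "no privileged/cap_add entries"   # 3.
# 5. With DOCKER_SOCK_GID=0 and ALLOW_DOCKER_SOCK_GID_0 unset, startup must fail closed (see 4.3).
```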
---
## 9) Suggested File-Level Updates Summary
### `repo-deliverable` Must Update
- `.docker/compose/docker-compose.local.yml`
- `.docker/compose/docker-compose.dev.yml`
- `.docker/docker-entrypoint.sh`
- `frontend/src/components/ProxyHostForm.tsx`
- `codecov.yml`
### `repo-deliverable` Should Update
- `README.md`
- `docs/getting-started.md`
- `SECURITY.md`
### `repo-deliverable` Optional Update
- `.vscode/tasks.json` (dedicated task to precompute/export `DOCKER_SOCK_GID` and start compose)
### `operator-local follow-up` (Out of Mandatory Repo PR Scope)
- `/root/docker/containers/charon/docker-compose.yml`
- `/root/docker/containers/charon/docker-compose-up-charon.sh`
### Reviewed, No Required Change
- `.gitignore`
- `.dockerignore`
- `Dockerfile` (keep non-root default)
---
## 10) Acceptance Criteria / DoD
1. Local Docker source works in non-root container when supplemental socket group is supplied.
2. Failure path remains explicit and actionable when supplemental group is missing.
3. Scope split is explicit and consistent: `repo-deliverable` vs `operator-local follow-up`.
4. Chosen policy is unambiguous: conditional local-only `group_add`; CI remains unaffected.
5. `gid==0` path is guarded by explicit opt-in/risk acknowledgment and never silently defaulted.
6. Validation order is protocol-aligned: E2E prerequisite/rebuild check -> E2E first -> local patch report -> unit/coverage.
7. Coverage policy no longer suppresses Docker service/handler regression visibility.
8. PR-1, PR-2, PR-3 each pass their slice acceptance criteria with independent rollback safety.
9. This file contains one active plan with one frontmatter block and no archived concatenated plan content.
---
## 11) Handoff
This plan is complete and execution-ready for Supervisor review. It includes:
- Root-cause grounded file/function map
- EARS requirements
- Specific multi-phase implementation path
- PR slicing with dependencies and rollback notes
- Validation sequence explicitly aligned to project protocol order and least-privilege guarantees
@@ -0,0 +1,362 @@
# Uptime Monitoring Regression Investigation (Scheduled vs Manual)
Date: 2026-03-01
Owner: Planning Agent
Status: Investigation Complete, Fix Plan Proposed
Severity: High (false DOWN states on automated monitoring)
## 1. Executive Summary
Two services (Wizarr and Charon) can flip to `DOWN` during scheduled cycles while manual checks immediately return `UP`. The cause is that scheduled checks use a host-level TCP gate that can short-circuit monitor-level HTTP checks.
The scheduled path is:
- `ticker -> CheckAll -> checkAllHosts -> (host status down) -> markHostMonitorsDown`
The manual path is:
- `POST /api/v1/uptime/monitors/:id/check -> CheckMonitor -> checkMonitor`
Only the scheduled path runs host precheck gating. If host precheck fails (TCP to upstream host/port), `CheckAll` skips HTTP checks and forcibly writes monitor status to `down` with heartbeat message `Host unreachable`.
This is a backend state mutation problem (not only UI rendering).
### 1.1 Monitoring Policy (Authoritative Behavior)
Charon uptime monitoring SHALL follow URL-truth semantics for HTTP/HTTPS monitors,
matching third-party external monitor behavior (Uptime Kuma style) without requiring
any additional service.
Policy:
- HTTP/HTTPS monitors are URL-truth based. The monitor result is authoritative based
on the configured URL check outcome (status code/timeout/TLS/connectivity from URL
perspective).
- Internal TCP reachability precheck (`ForwardHost:ForwardPort`) is
non-authoritative for HTTP/HTTPS monitor status.
- TCP monitors remain endpoint-socket checks and may rely on direct socket
reachability semantics.
- Host precheck may still be used for optimization, grouping telemetry, and operator
diagnostics, but SHALL NOT force HTTP/HTTPS monitors to DOWN.
## 2. Research Findings
### 2.1 Execution Path Comparison (Required)
#### Scheduled path behavior
- Entry: `backend/internal/api/routes/routes.go` (background ticker, calls `uptimeService.CheckAll()`)
- `CheckAll()` calls `checkAllHosts()` first.
- File: `backend/internal/services/uptime_service.go:354`
- `checkAllHosts()` updates each `UptimeHost.Status` via TCP checks in `checkHost()`.
- File: `backend/internal/services/uptime_service.go:395`
- `checkHost()` dials `UptimeHost.Host` + monitor port (preferring `ProxyHost.ForwardPort`, falling back to the URL port).
- File: `backend/internal/services/uptime_service.go:437`
- Back in `CheckAll()`, monitors are grouped by `UptimeHostID`.
- File: `backend/internal/services/uptime_service.go:367`
- If `UptimeHost.Status == "down"`, `markHostMonitorsDown()` is called and individual monitor checks are skipped.
- File: `backend/internal/services/uptime_service.go:381`
- File: `backend/internal/services/uptime_service.go:593`
#### Manual path behavior
- Entry: `POST /api/v1/uptime/monitors/:id/check`.
- Handler: `backend/internal/api/handlers/uptime_handler.go:107`
- Calls `service.CheckMonitor(*monitor)` asynchronously.
- File: `backend/internal/services/uptime_service.go:707`
- `checkMonitor()` performs direct HTTP/TCP monitor check and updates monitor + heartbeat.
- File: `backend/internal/services/uptime_service.go:711`
#### Key divergence
- Scheduled: host-gated (precheck can override monitor)
- Manual: direct monitor check (no host gate)
## 3. Root Cause With Evidence
### 3.1 Primary Root Cause: Host Precheck Overrides HTTP Success in Scheduled Cycles
When `UptimeHost` is marked `down`, scheduled checks do not run `checkMonitor()` for that host's monitors. Instead they call `markHostMonitorsDown()` which:
- sets each monitor `Status = "down"`
- writes `UptimeHeartbeat{Status: "down", Message: "Host unreachable"}`
- maxes failure count (`FailureCount = MaxRetries`)
Evidence:
- Short-circuit: `backend/internal/services/uptime_service.go:381`
- Forced down write: `backend/internal/services/uptime_service.go:610`
- Forced heartbeat message: `backend/internal/services/uptime_service.go:624`
This exactly matches the observed symptom pattern:
1. Manual refresh sets monitor `UP` via direct HTTP check.
2. Next scheduler cycle can force it back to `DOWN` from host precheck path.
### 3.2 Hypothesis Check: TCP precheck can fail while public URL HTTP check succeeds
Confirmed as plausible by design:
- `checkHost()` tests upstream reachability (`ForwardHost:ForwardPort`) from Charon runtime.
- `checkMonitor()` tests monitor URL (public domain URL, often via Caddy/public routing).
A service can be publicly reachable via its monitor URL while the upstream TCP precheck fails due to network namespace, routing, DNS, or hairpin differences.
This is especially likely for:
- self-referential routes (Charon monitoring Charon via public hostname)
- host/container networking asymmetry
- services reachable through proxy path but not directly on upstream socket from current runtime context
### 3.3 Recent Change Correlation (Required)
#### `SyncAndCheckForHost` (regression amplifier)
- Introduced in commit `2cd19d89` and called from proxy host create path.
- Files:
- `backend/internal/services/uptime_service.go:1195`
- `backend/internal/api/handlers/proxy_host_handler.go:418`
- Behavior: creates/syncs monitor and immediately runs `checkMonitor()`.
Impact: monitors quickly show `UP` after creation or a manual check, and the next scheduler cycle can flip them back to `DOWN` if the host precheck fails. This increased the visibility of the scheduled/manual inconsistency.
### `CleanupStaleFailureCounts`
- Introduced in `2cd19d89`, refined in `7a12ab79`.
- File: `backend/internal/services/uptime_service.go:1277`
- It runs only at startup and resets stale monitor states; it is not per-cycle override logic.
- Not root cause of recurring per-cycle flip.
#### Frontend effective status changes
- Latest commit `0241de69` refactors `effectiveStatus` handling.
- File: `frontend/src/pages/Uptime.tsx`.
- Backend evidence proves this is not a visual-only issue: the scheduler writes `down` heartbeats/messages directly to the DB.
### 3.4 Grouping Logic Analysis (`UptimeHost`/`UpstreamHost`)
Monitors are grouped by `UptimeHostID` in `CheckAll()`. `UptimeHost` is derived from `ProxyHost.ForwardHost` in sync flows.
Relevant code:
- group map by `UptimeHostID`: `backend/internal/services/uptime_service.go:367`
- host linkage in sync: `backend/internal/services/uptime_service.go:189`, `backend/internal/services/uptime_service.go:226`
- sync single-host update path: `backend/internal/services/uptime_service.go:1023`
Risk: one host precheck failure can mark all grouped monitors down without URL-level validation.
## 4. Technical Specification (Fix Plan)
### 4.1 Minimal Proper Fix (First)
Goal: eliminate false DOWN while preserving existing behavior as much as possible.
Change `CheckAll()` host-down branch to avoid hard override for HTTP/HTTPS monitors.
Mandatory hotfix rule:
- WHEN a host precheck is `down`, THE SYSTEM SHALL partition host monitors by type inside `CheckAll()`.
- `markHostMonitorsDown` MUST be invoked only for `tcp` monitors.
- `http`/`https` monitors MUST still run through `checkMonitor()` and MUST NOT be force-written `down` by the host precheck path.
- Host precheck outcomes MAY be recorded for optimization/telemetry/grouping, but MUST NOT be treated as final status for `http`/`https` monitors.
Proposed rule:
1. If host is down:
- For `http`/`https` monitors: still run `checkMonitor()` (do not force down).
- For `tcp` monitors: keep current host-down fast-path (`markHostMonitorsDown`) or direct tcp check.
2. If host is not down:
- Keep existing behavior (run `checkMonitor()` for all monitors).
Rationale:
- Aligns scheduled behavior with manual for URL-based monitors.
- Preserves reverse proxy product semantics where public URL availability is the source of truth.
- Minimal code delta in `CheckAll()` decision branch.
- Preserves optimization for true TCP-only monitors.
#### Exact file/function targets
- `backend/internal/services/uptime_service.go`
- `CheckAll()`
- add small helper (optional): `partitionMonitorsByType(...)`
### 4.2 Long-Term Robust Fix (Deferred)
Introduce host precheck as advisory signal, not authoritative override.
Design:
1. Add `HostReachability` result to run context (not persisted as forced monitor status).
2. Always execute per-monitor checks, but use host precheck to:
- tune retries/backoff
- annotate failure reason
- optimize notification batching
3. Optionally add feature flag:
- `feature.uptime.strict_host_precheck` (default `false`)
- allows legacy strict gating in environments that want it.
Benefits:
- Removes false DOWN caused by precheck mismatch.
- Keeps performance and batching controls.
- More explicit semantics for operators.
## 5. API/Schema Impact
No API contract change required for minimal fix.
No database migration required for minimal fix.
Long-term fix may add one feature flag setting only.
## 6. EARS Requirements
### Ubiquitous
- THE SYSTEM SHALL evaluate HTTP/HTTPS monitor availability using URL-level checks as the authoritative signal.
### Event-driven
- WHEN the scheduled uptime cycle runs, THE SYSTEM SHALL execute HTTP/HTTPS monitor checks regardless of internal host precheck state.
- WHEN the scheduled uptime cycle runs and host precheck is down, THE SYSTEM SHALL apply host-level forced-down logic only to TCP monitors.
### State-driven
- WHILE a monitor type is `http` or `https`, THE SYSTEM SHALL NOT force monitor status to `down` solely from internal host precheck failure.
- WHILE a monitor type is `tcp`, THE SYSTEM SHALL evaluate status using endpoint socket reachability semantics.
### Unwanted behavior
- IF internal host precheck is unreachable AND URL-level HTTP/HTTPS check returns success, THEN THE SYSTEM SHALL set monitor status to `up`.
- IF internal host precheck is reachable AND URL-level HTTP/HTTPS check fails, THEN THE SYSTEM SHALL set monitor status to `down`.
### Optional
- WHERE host precheck telemetry is enabled, THE SYSTEM SHALL record host-level reachability for diagnostics and grouping without overriding HTTP/HTTPS monitor final state.
## 7. Implementation Plan
### Phase 1: Reproduction Lock-In (Tests First)
- Add backend service test proving current regression:
- host precheck fails
- monitor URL check would succeed
- scheduled `CheckAll()` currently writes down (existing behavior)
- File: `backend/internal/services/uptime_service_test.go` (new test block)
### Phase 2: Minimal Backend Fix
- Update `CheckAll()` branch logic to run HTTP/HTTPS monitors even when host is down.
- Make monitor partitioning explicit and mandatory in `CheckAll()` host-down branch.
- Add an implementation guard before partitioning: normalize monitor type using
`strings.TrimSpace` + `strings.ToLower` to prevent `HTTP`/`HTTPS` case
regressions and whitespace-related misclassification.
- Ensure `markHostMonitorsDown` is called only for TCP monitor partitions.
- File: `backend/internal/services/uptime_service.go`
### Phase 3: Backend Validation
- Add/adjust tests:
- scheduled path no longer forces down when HTTP succeeds
- manual and scheduled reach same final state for HTTP monitors
- internal host unreachable + public URL HTTP 200 => monitor is `UP`
- internal host reachable + public URL failure => monitor is `DOWN`
- TCP monitor behavior unchanged under host-down conditions
- Files:
- `backend/internal/services/uptime_service_test.go`
- `backend/internal/services/uptime_service_race_test.go` (if needed for concurrency side-effects)
### Phase 4: Integration/E2E Coverage
- Add targeted API-level integration test for scheduler vs manual parity.
- Add Playwright scenario for:
- monitor set UP by manual check
- remains UP after scheduled cycle when URL is reachable
- Add parity scenario for:
- internal TCP precheck unreachable + URL returns 200 => `UP`
- internal TCP precheck reachable + URL failure => `DOWN`
- Files:
- `backend/internal/api/routes/routes_test.go` (or uptime handler integration suite)
- `tests/monitoring/uptime-monitoring.spec.ts` (or equivalent uptime spec file)
Scope note:
- This hotfix plan is intentionally limited to backend behavior correction and
regression tests (unit/integration/E2E).
- Dedicated documentation-phase work is deferred and out of scope for this
hotfix PR.
## 8. Test Plan (Unit / Integration / E2E)
Duplicate notification definition (hotfix acceptance/testing):
- A duplicate notification means the same `(monitor_id, status,
scheduler_tick_id)` is emitted more than once within a single scheduler run.
### Unit Tests
1. `CheckAll_HostDown_DoesNotForceDown_HTTPMonitor_WhenHTTPCheckSucceeds`
2. `CheckAll_HostDown_StillHandles_TCPMonitor_Conservatively`
3. `CheckAll_ManualAndScheduledParity_HTTPMonitor`
4. `CheckAll_InternalHostUnreachable_PublicURL200_HTTPMonitorEndsUp` (blocking)
5. `CheckAll_InternalHostReachable_PublicURLFail_HTTPMonitorEndsDown` (blocking)
### Integration Tests
1. Scheduler endpoint (`/api/v1/system/uptime/check`) parity with monitor check endpoint.
2. Verify DB heartbeat message is real HTTP result (not `Host unreachable`) for HTTP monitors where URL is reachable.
3. Verify when host precheck is down, HTTP monitor heartbeat/notification output is derived from `checkMonitor()` (not synthetic host-path `Host unreachable`).
4. Verify no duplicate notifications are emitted from host+monitor paths for the same scheduler run, where duplicate is defined as repeated `(monitor_id, status, scheduler_tick_id)`.
5. Verify internal host precheck unreachable + public URL 200 still resolves monitor `UP`.
6. Verify internal host precheck reachable + public URL failure resolves monitor `DOWN`.
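For manual spot checks of the parity cases above, a rough probe of the two endpoints named in this plan can be used; the base URL, auth, and the HTTP method of the system check endpoint are deployment-specific assumptions:
```bash
# Rough manual parity probe; CHARON_URL and MONITOR_ID are placeholders and any
# required auth headers depend on the deployment.
CHARON_URL="${CHARON_URL:-http://localhost:8080}"
MONITOR_ID="<monitor-uuid>"

# Manual path: direct monitor check (endpoint from section 2.1)
curl -s -X POST "$CHARON_URL/api/v1/uptime/monitors/$MONITOR_ID/check"

# Scheduled-equivalent path: trigger a full check cycle (method assumed to be POST)
curl -s -X POST "$CHARON_URL/api/v1/system/uptime/check"

# For an HTTP monitor with a reachable URL, both paths should leave the same final state.
```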
### E2E Tests
1. Create/sync monitor scenario where manual refresh returns `UP`.
2. Wait one scheduler interval.
3. Assert monitor remains `UP` and latest heartbeat is not forced `Host unreachable` for reachable URL.
4. Assert scenario: internal host precheck unreachable + public URL 200 => monitor remains `UP`.
5. Assert scenario: internal host precheck reachable + public URL failure => monitor is `DOWN`.
### Regression Guardrails
- Add a test explicitly asserting that host precheck must not unconditionally override HTTP monitor checks.
- Add explicit assertions that HTTP monitors under host-down precheck emit
check-derived heartbeat messages and do not produce duplicate notifications
under the `(monitor_id, status, scheduler_tick_id)` rule within a single
scheduler run.
## 9. Risks and Rollback
### Risks
1. More HTTP checks under true host outage may increase check volume.
2. Notification patterns may shift from single host-level event to monitor-level batched events.
3. Edge cases for mixed-type monitor groups (HTTP + TCP) need deterministic behavior.
### Mitigations
1. Preserve batching (`queueDownNotification`) and existing retry thresholds.
2. Keep TCP strict path unchanged in minimal fix.
3. Add explicit log fields and targeted tests for mixed groups.
### Rollback Plan
1. Revert the `CheckAll()` branch change only (single-file rollback).
2. Keep added tests; mark expected behavior as legacy if temporary rollback needed.
3. If necessary, introduce temporary feature toggle to switch between strict and tolerant host gating.
## 10. PR Slicing Strategy
Decision: Single focused PR (hotfix + tests)
Trigger reasons:
- High-severity runtime behavior fix requiring minimal blast radius
- Fast review/rollback with behavior-only delta plus regression coverage
- Avoid scope creep into optional hardening/feature-flag work
### PR-1 (Hotfix + Tests)
Scope:
- `CheckAll()` host-down branch adjustment for HTTP/HTTPS
- Unit/integration/E2E regression tests for URL-truth semantics
Files:
- `backend/internal/services/uptime_service.go`
- `backend/internal/services/uptime_service_test.go`
- `backend/internal/api/routes/routes_test.go` (or equivalent)
- `tests/monitoring/uptime-monitoring.spec.ts` (or equivalent)
Validation gates:
- backend unit tests pass
- targeted uptime integration tests pass
- targeted uptime E2E tests pass
- no behavior regression in existing `CheckAll` tests
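The gates above could be driven with targeted commands like the following; the `-run` pattern comes from the unit-test names in section 8, and the Playwright invocation and spec path are assumptions pending the actual PR layout:
```bash
# Sketch of targeted validation commands for the PR-1 gates; adjust paths and
# runners to the project's actual tasks if they differ.
cd /projects/Charon/backend && go test ./internal/services -run 'CheckAll_'
cd /projects/Charon/backend && go test ./internal/api/routes
cd /projects/Charon && npx playwright test tests/monitoring/uptime-monitoring.spec.ts
```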
Rollback:
- single revert of PR-1 commit
## 11. Acceptance Criteria (DoD)
1. Scheduled and manual checks produce consistent status for HTTP/HTTPS monitors.
2. A reachable monitor URL is not forced to `DOWN` solely by host precheck failure.
3. New regression tests fail before fix and pass after fix.
4. No break in TCP monitor behavior expectations.
5. No new critical/high security findings in touched paths.
6. Blocking parity case passes: internal host precheck unreachable + public URL 200 => scheduled result is `UP`.
7. Blocking parity case passes: internal host precheck reachable + public URL failure => scheduled result is `DOWN`.
8. Under host-down precheck, HTTP monitors produce check-derived heartbeat messages (not synthetic `Host unreachable` from host path).
9. No duplicate notifications are produced by host+monitor paths within a
single scheduler run, where duplicate is defined as repeated
`(monitor_id, status, scheduler_tick_id)`.
## 12. Implementation Risks
1. Increased scheduler workload during host-precheck failures because HTTP/HTTPS checks continue to run.
2. Notification cadence may change due to check-derived monitor outcomes replacing host-forced synthetic downs.
3. Mixed monitor groups (TCP + HTTP/HTTPS) require strict ordering/partitioning to avoid regression.
Mitigations:
- Keep change localized to `CheckAll()` host-down branch decisioning.
- Add explicit regression tests for both parity directions and mixed monitor types.
- Keep rollback path as single-commit revert.