Source: Charon/docs/plans/current_spec.md

Uptime Monitoring Regression Investigation (Scheduled vs Manual)

Date: 2026-03-01
Owner: Planning Agent
Status: Investigation Complete; Fix Plan Proposed
Severity: High (false DOWN states on automated monitoring)

1. Executive Summary

Two services (Wizarr and Charon) can flip to DOWN during scheduled cycles while manual checks immediately return UP because scheduled checks use a host-level TCP gate that can short-circuit monitor-level HTTP checks.

The scheduled path is:

  • ticker -> CheckAll -> checkAllHosts -> (host status down) -> markHostMonitorsDown

The manual path is:

  • POST /api/v1/uptime/monitors/:id/check -> CheckMonitor -> checkMonitor

Only the scheduled path runs host precheck gating. If the host precheck fails (a TCP dial to the upstream host/port), CheckAll skips HTTP checks and forcibly writes monitor status to down with the heartbeat message Host unreachable.

This is a backend state mutation problem (not only UI rendering).

1.1 Monitoring Policy (Authoritative Behavior)

Charon uptime monitoring SHALL follow URL-truth semantics for HTTP/HTTPS monitors, matching third-party external monitor behavior (Uptime Kuma style) without requiring any additional service.

Policy:

  • HTTP/HTTPS monitors are URL-truth based. The monitor result is authoritative based on the configured URL check outcome (status code/timeout/TLS/connectivity from URL perspective).
  • Internal TCP reachability precheck (ForwardHost:ForwardPort) is non-authoritative for HTTP/HTTPS monitor status.
  • TCP monitors remain endpoint-socket checks and may rely on direct socket reachability semantics.
  • Host precheck may still be used for optimization, grouping telemetry, and operator diagnostics, but SHALL NOT force HTTP/HTTPS monitors to DOWN.

2. Research Findings

2.1 Execution Path Comparison (Required)

Scheduled path behavior

  • Entry: backend/internal/api/routes/routes.go (background ticker, calls uptimeService.CheckAll())
  • CheckAll() calls checkAllHosts() first.
    • File: backend/internal/services/uptime_service.go:354
  • checkAllHosts() updates each UptimeHost.Status via TCP checks in checkHost().
    • File: backend/internal/services/uptime_service.go:395
  • checkHost() dials UptimeHost.Host plus the monitor port (preferring ProxyHost.ForwardPort, falling back to the URL port).
    • File: backend/internal/services/uptime_service.go:437
  • Back in CheckAll(), monitors are grouped by UptimeHostID.
    • File: backend/internal/services/uptime_service.go:367
  • If UptimeHost.Status == "down", markHostMonitorsDown() is called and individual monitor checks are skipped.
    • File: backend/internal/services/uptime_service.go:381
    • File: backend/internal/services/uptime_service.go:593

Manual path behavior

  • Entry: POST /api/v1/uptime/monitors/:id/check.
    • Handler: backend/internal/api/handlers/uptime_handler.go:107
  • Calls service.CheckMonitor(*monitor) asynchronously.
    • File: backend/internal/services/uptime_service.go:707
  • checkMonitor() performs direct HTTP/TCP monitor check and updates monitor + heartbeat.
    • File: backend/internal/services/uptime_service.go:711

Key divergence

  • Scheduled: host-gated (precheck can override monitor)
  • Manual: direct monitor check (no host gate)

3. Root Cause With Evidence

3.1 Primary Root Cause: Host Precheck Overrides HTTP Success in Scheduled Cycles

When UptimeHost is marked down, scheduled checks do not run checkMonitor() for that host's monitors. Instead they call markHostMonitorsDown() which:

  • sets each monitor Status = "down"
  • writes UptimeHeartbeat{Status: "down", Message: "Host unreachable"}
  • maxes failure count (FailureCount = MaxRetries)

Evidence:

  • Short-circuit: backend/internal/services/uptime_service.go:381
  • Forced down write: backend/internal/services/uptime_service.go:610
  • Forced heartbeat message: backend/internal/services/uptime_service.go:624

This exactly matches the observed symptom pattern:

  1. Manual refresh sets monitor UP via direct HTTP check.
  2. Next scheduler cycle can force it back to DOWN from host precheck path.

3.2 Hypothesis Check: TCP precheck can fail while public URL HTTP check succeeds

Confirmed as plausible by design:

  • checkHost() tests upstream reachability (ForwardHost:ForwardPort) from Charon runtime.
  • checkMonitor() tests monitor URL (public domain URL, often via Caddy/public routing).

A service can be publicly reachable by monitor URL while upstream TCP precheck fails due to network namespace/routing/DNS/hairpin differences.

This is especially likely for:

  • self-referential routes (Charon monitoring Charon via public hostname)
  • host/container networking asymmetry
  • services reachable through proxy path but not directly on upstream socket from current runtime context

3.3 Recent Change Correlation (Required)

SyncAndCheckForHost (regression amplifier)

  • Introduced in commit 2cd19d89 and called from proxy host create path.
  • Files:
    • backend/internal/services/uptime_service.go:1195
    • backend/internal/api/handlers/proxy_host_handler.go:418
  • Behavior: creates/syncs monitor and immediately runs checkMonitor().

Impact: monitors quickly show UP after creation or a manual check, and the scheduler can then flip them to DOWN if the host precheck fails. This increased the visibility of the scheduled/manual inconsistency.

CleanupStaleFailureCounts

  • Introduced in 2cd19d89, refined in 7a12ab79.
  • File: backend/internal/services/uptime_service.go:1277
  • It runs at startup and only resets stale monitor states; it is not per-cycle override logic.
  • Not root cause of recurring per-cycle flip.

Frontend effective status changes

  • Latest commit 0241de69 refactors effectiveStatus handling.
  • File: frontend/src/pages/Uptime.tsx.
  • Backend evidence proves the regression is not visual-only: the scheduler writes down heartbeats/messages directly to the DB.

3.4 Grouping Logic Analysis (UptimeHost/UpstreamHost)

Monitors are grouped by UptimeHostID in CheckAll(). UptimeHost is derived from ProxyHost.ForwardHost in sync flows.

Relevant code:

  • group map by UptimeHostID: backend/internal/services/uptime_service.go:367
  • host linkage in sync: backend/internal/services/uptime_service.go:189, backend/internal/services/uptime_service.go:226
  • sync single-host update path: backend/internal/services/uptime_service.go:1023

Risk: one host precheck failure can mark all grouped monitors down without URL-level validation.

4. Technical Specification (Fix Plan)

4.1 Minimal Proper Fix (First)

Goal: eliminate false DOWN while preserving existing behavior as much as possible.

Change CheckAll() host-down branch to avoid hard override for HTTP/HTTPS monitors.

Mandatory hotfix rule:

  • WHEN a host precheck is down, THE SYSTEM SHALL partition host monitors by type inside CheckAll().
  • markHostMonitorsDown MUST be invoked only for tcp monitors.
  • http/https monitors MUST still run through checkMonitor() and MUST NOT be force-written down by the host precheck path.
  • Host precheck outcomes MAY be recorded for optimization/telemetry/grouping, but MUST NOT be treated as final status for http/https monitors.

Proposed rule:

  1. If the host is down:
    • For http/https monitors: still run checkMonitor() (do not force down).
    • For tcp monitors: keep the current host-down fast path (markHostMonitorsDown) or run a direct tcp check.
  2. If the host is not down:
    • Keep existing behavior (run checkMonitor() for all monitors).
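A minimal sketch of the revised host-down branch, using simplified stand-in types (the real service operates on its GORM models and persists heartbeats, so this is illustrative, not the actual implementation):

```go
package main

import "fmt"

// Monitor is a simplified stand-in for the service's uptime monitor model.
type Monitor struct {
	ID   uint
	Type string // "http", "https", or "tcp"
}

// decideHostDownActions partitions a down host's monitors: tcp monitors take
// the existing forced-down fast path, http/https monitors still get a real
// per-monitor check (URL-truth semantics).
func decideHostDownActions(monitors []Monitor) (forceDown, stillCheck []Monitor) {
	for _, m := range monitors {
		switch m.Type {
		case "tcp":
			forceDown = append(forceDown, m) // legacy markHostMonitorsDown path
		default: // "http", "https": never gated by the host precheck
			stillCheck = append(stillCheck, m)
		}
	}
	return forceDown, stillCheck
}

func main() {
	monitors := []Monitor{{1, "http"}, {2, "tcp"}, {3, "https"}}
	forceDown, stillCheck := decideHostDownActions(monitors)
	fmt.Println("forceDown:", len(forceDown), "stillCheck:", len(stillCheck)) // forceDown: 1 stillCheck: 2
}
```

The caller in CheckAll() would pass the forceDown partition to markHostMonitorsDown and iterate stillCheck through checkMonitor() as in the host-up path.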

Rationale:

  • Aligns scheduled behavior with manual for URL-based monitors.
  • Preserves reverse proxy product semantics where public URL availability is the source of truth.
  • Minimal code delta in CheckAll() decision branch.
  • Preserves optimization for true TCP-only monitors.

Exact file/function targets

  • backend/internal/services/uptime_service.go
    • CheckAll()
    • add small helper (optional): partitionMonitorsByType(...)

4.2 Long-Term Robust Fix (Deferred)

Introduce host precheck as advisory signal, not authoritative override.

Design:

  1. Add HostReachability result to run context (not persisted as forced monitor status).
  2. Always execute per-monitor checks, but use host precheck to:
    • tune retries/backoff
    • annotate failure reason
    • optimize notification batching
  3. Optionally add feature flag:
    • feature.uptime.strict_host_precheck (default false)
    • allows legacy strict gating in environments that want it.

Benefits:

  • Removes false DOWN caused by precheck mismatch.
  • Keeps performance and batching controls.
  • More explicit semantics for operators.

5. API/Schema Impact

No API contract change required for minimal fix. No database migration required for minimal fix.

Long-term fix may add one feature flag setting only.

6. EARS Requirements

Ubiquitous

  • THE SYSTEM SHALL evaluate HTTP/HTTPS monitor availability using URL-level checks as the authoritative signal.

Event-driven

  • WHEN the scheduled uptime cycle runs, THE SYSTEM SHALL execute HTTP/HTTPS monitor checks regardless of internal host precheck state.
  • WHEN the scheduled uptime cycle runs and host precheck is down, THE SYSTEM SHALL apply host-level forced-down logic only to TCP monitors.

State-driven

  • WHILE a monitor type is http or https, THE SYSTEM SHALL NOT force monitor status to down solely from internal host precheck failure.
  • WHILE a monitor type is tcp, THE SYSTEM SHALL evaluate status using endpoint socket reachability semantics.

Unwanted behavior

  • IF internal host precheck is unreachable AND URL-level HTTP/HTTPS check returns success, THEN THE SYSTEM SHALL set monitor status to up.
  • IF internal host precheck is reachable AND URL-level HTTP/HTTPS check fails, THEN THE SYSTEM SHALL set monitor status to down.

Optional

  • WHERE host precheck telemetry is enabled, THE SYSTEM SHALL record host-level reachability for diagnostics and grouping without overriding HTTP/HTTPS monitor final state.

7. Implementation Plan

Phase 1: Reproduction Lock-In (Tests First)

  • Add backend service test proving current regression:
    • host precheck fails
    • monitor URL check would succeed
    • scheduled CheckAll() currently writes down (existing behavior)
  • File: backend/internal/services/uptime_service_test.go (new test block)

Phase 2: Minimal Backend Fix

  • Update CheckAll() branch logic to run HTTP/HTTPS monitors even when host is down.
  • Make monitor partitioning explicit and mandatory in CheckAll() host-down branch.
  • Add an implementation guard before partitioning: normalize monitor type using strings.TrimSpace + strings.ToLower to prevent HTTP/HTTPS case regressions and whitespace-related misclassification.
  • Ensure markHostMonitorsDown is called only for TCP monitor partitions.
  • File: backend/internal/services/uptime_service.go
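The normalization guard described above can be a tiny helper; the name below is illustrative, not an existing function in the service:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeMonitorType guards against case and whitespace variants
// (" HTTP ", "Https") being misclassified during partitioning.
func normalizeMonitorType(t string) string {
	return strings.ToLower(strings.TrimSpace(t))
}

func main() {
	for _, raw := range []string{" HTTP ", "Https", "tcp"} {
		fmt.Printf("%q -> %q\n", raw, normalizeMonitorType(raw))
	}
	// Output: " HTTP " -> "http", "Https" -> "https", "tcp" -> "tcp"
}
```

Calling this once at the top of the partitioning loop keeps the type switch exhaustive over canonical values only.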

Phase 3: Backend Validation

  • Add/adjust tests:
    • scheduled path no longer forces down when HTTP succeeds
    • manual and scheduled reach same final state for HTTP monitors
    • internal host unreachable + public URL HTTP 200 => monitor is UP
    • internal host reachable + public URL failure => monitor is DOWN
    • TCP monitor behavior unchanged under host-down conditions
  • Files:
    • backend/internal/services/uptime_service_test.go
    • backend/internal/services/uptime_service_race_test.go (if needed for concurrency side-effects)

Phase 4: Integration/E2E Coverage

  • Add targeted API-level integration test for scheduler vs manual parity.
  • Add Playwright scenario for:
    • monitor set UP by manual check
    • remains UP after scheduled cycle when URL is reachable
  • Add parity scenario for:
    • internal TCP precheck unreachable + URL returns 200 => UP
    • internal TCP precheck reachable + URL failure => DOWN
  • Files:
    • backend/internal/api/routes/routes_test.go (or uptime handler integration suite)
    • tests/monitoring/uptime-monitoring.spec.ts (or equivalent uptime spec file)

Scope note:

  • This hotfix plan is intentionally limited to backend behavior correction and regression tests (unit/integration/E2E).
  • Dedicated documentation-phase work is deferred and out of scope for this hotfix PR.

8. Test Plan (Unit / Integration / E2E)

Duplicate notification definition (hotfix acceptance/testing):

  • A duplicate notification means the same (monitor_id, status, scheduler_tick_id) is emitted more than once within a single scheduler run.
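A sketch of the dedup rule as a composite key (field names are assumptions mirroring the tuple above, not existing service types):

```go
package main

import "fmt"

// notificationKey identifies a notification for dedup purposes within one
// scheduler run: same monitor, same status, same tick => duplicate.
type notificationKey struct {
	MonitorID       uint
	Status          string
	SchedulerTickID uint64
}

// dedupe drops repeats of the same key, preserving first-occurrence order.
func dedupe(keys []notificationKey) []notificationKey {
	seen := make(map[notificationKey]bool)
	var out []notificationKey
	for _, k := range keys {
		if !seen[k] {
			seen[k] = true
			out = append(out, k)
		}
	}
	return out
}

func main() {
	keys := []notificationKey{
		{MonitorID: 7, Status: "down", SchedulerTickID: 42},
		{MonitorID: 7, Status: "down", SchedulerTickID: 42}, // duplicate within tick 42
		{MonitorID: 7, Status: "down", SchedulerTickID: 43}, // new tick: not a duplicate
	}
	fmt.Println(len(dedupe(keys))) // 2
}
```

Tests can assert the same rule directly: emissions from the host path and the monitor path within one tick must collapse to a single entry per key.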

Unit Tests

  1. CheckAll_HostDown_DoesNotForceDown_HTTPMonitor_WhenHTTPCheckSucceeds
  2. CheckAll_HostDown_StillHandles_TCPMonitor_Conservatively
  3. CheckAll_ManualAndScheduledParity_HTTPMonitor
  4. CheckAll_InternalHostUnreachable_PublicURL200_HTTPMonitorEndsUp (blocking)
  5. CheckAll_InternalHostReachable_PublicURLFail_HTTPMonitorEndsDown (blocking)
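The two blocking cases (4 and 5) reduce to one table-driven check over the URL-truth decision rule; the helper name and signature below are illustrative, not the service's actual API:

```go
package main

import "fmt"

// resolveHTTPStatus applies URL-truth semantics: the URL check result alone
// decides HTTP/HTTPS monitor status; the host precheck is advisory only.
func resolveHTTPStatus(hostPrecheckUp, urlCheckUp bool) string {
	if urlCheckUp {
		return "up"
	}
	return "down"
}

func main() {
	cases := []struct {
		name           string
		hostPrecheckUp bool
		urlCheckUp     bool
		want           string
	}{
		{"internal unreachable, public URL 200", false, true, "up"},
		{"internal reachable, public URL failure", true, false, "down"},
	}
	for _, c := range cases {
		got := resolveHTTPStatus(c.hostPrecheckUp, c.urlCheckUp)
		fmt.Printf("%s => %s (want %s)\n", c.name, got, c.want)
	}
}
```

Note that hostPrecheckUp is deliberately ignored for HTTP/HTTPS status; the parameter is kept in the sketch to make the table mirror both parity directions.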

Integration Tests

  1. Scheduler endpoint (/api/v1/system/uptime/check) parity with monitor check endpoint.
  2. Verify DB heartbeat message is real HTTP result (not Host unreachable) for HTTP monitors where URL is reachable.
  3. Verify when host precheck is down, HTTP monitor heartbeat/notification output is derived from checkMonitor() (not synthetic host-path Host unreachable).
  4. Verify no duplicate notifications are emitted from host+monitor paths for the same scheduler run, where duplicate is defined as repeated (monitor_id, status, scheduler_tick_id).
  5. Verify internal host precheck unreachable + public URL 200 still resolves monitor UP.
  6. Verify internal host precheck reachable + public URL failure resolves monitor DOWN.

E2E Tests

  1. Create/sync monitor scenario where manual refresh returns UP.
  2. Wait one scheduler interval.
  3. Assert monitor remains UP and latest heartbeat is not forced Host unreachable for reachable URL.
  4. Assert scenario: internal host precheck unreachable + public URL 200 => monitor remains UP.
  5. Assert scenario: internal host precheck reachable + public URL failure => monitor is DOWN.

Regression Guardrails

  • Add a test explicitly asserting that host precheck must not unconditionally override HTTP monitor checks.
  • Add explicit assertions that HTTP monitors under host-down precheck emit check-derived heartbeat messages and do not produce duplicate notifications under the (monitor_id, status, scheduler_tick_id) rule within a single scheduler run.

9. Risks and Rollback

Risks

  1. More HTTP checks under true host outage may increase check volume.
  2. Notification patterns may shift from single host-level event to monitor-level batched events.
  3. Edge cases for mixed-type monitor groups (HTTP + TCP) need deterministic behavior.

Mitigations

  1. Preserve batching (queueDownNotification) and existing retry thresholds.
  2. Keep TCP strict path unchanged in minimal fix.
  3. Add explicit log fields and targeted tests for mixed groups.

Rollback Plan

  1. Revert the CheckAll() branch change only (single-file rollback).
  2. Keep added tests; mark the expected behavior as legacy if a temporary rollback is needed.
  3. If necessary, introduce temporary feature toggle to switch between strict and tolerant host gating.

10. PR Slicing Strategy

Decision: Single focused PR (hotfix + tests)

Trigger reasons:

  • High-severity runtime behavior fix requiring minimal blast radius
  • Fast review/rollback with behavior-only delta plus regression coverage
  • Avoid scope creep into optional hardening/feature-flag work

PR-1 (Hotfix + Tests)

Scope:

  • CheckAll() host-down branch adjustment for HTTP/HTTPS
  • Unit/integration/E2E regression tests for URL-truth semantics

Files:

  • backend/internal/services/uptime_service.go
  • backend/internal/services/uptime_service_test.go
  • backend/internal/api/routes/routes_test.go (or equivalent)
  • tests/monitoring/uptime-monitoring.spec.ts (or equivalent)

Validation gates:

  • backend unit tests pass
  • targeted uptime integration tests pass
  • targeted uptime E2E tests pass
  • no behavior regression in existing CheckAll tests

Rollback:

  • single revert of PR-1 commit

11. Acceptance Criteria (DoD)

  1. Scheduled and manual checks produce consistent status for HTTP/HTTPS monitors.
  2. A reachable monitor URL is not forced to DOWN solely by host precheck failure.
  3. New regression tests fail before fix and pass after fix.
  4. No break in TCP monitor behavior expectations.
  5. No new critical/high security findings in touched paths.
  6. Blocking parity case passes: internal host precheck unreachable + public URL 200 => scheduled result is UP.
  7. Blocking parity case passes: internal host precheck reachable + public URL failure => scheduled result is DOWN.
  8. Under host-down precheck, HTTP monitors produce check-derived heartbeat messages (not synthetic Host unreachable from host path).
  9. No duplicate notifications are produced by host+monitor paths within a single scheduler run, where duplicate is defined as repeated (monitor_id, status, scheduler_tick_id).

12. Implementation Risks

  1. Increased scheduler workload during host-precheck failures because HTTP/HTTPS checks continue to run.
  2. Notification cadence may change due to check-derived monitor outcomes replacing host-forced synthetic downs.
  3. Mixed monitor groups (TCP + HTTP/HTTPS) require strict ordering/partitioning to avoid regression.

Mitigations:

  • Keep change localized to CheckAll() host-down branch decisioning.
  • Add explicit regression tests for both parity directions and mixed monitor types.
  • Keep rollback path as single-commit revert.