Charon/docs/plans/current_spec.md

Uptime Monitoring Bug Triage & Fix Plan

1. Introduction

Overview

Uptime Monitoring in Charon uses a two-level check system: host-level TCP pre-checks followed by per-monitor HTTP/TCP checks. Newly added proxy hosts (specifically Wizarr and Charon itself) display as "DOWN" in the UI even though the underlying services are fully accessible. Manual refresh via the health check button on the Uptime page correctly shows "UP", but the automated background checker fails to produce the same result.

Objectives

  1. Eliminate false "DOWN" status for newly added proxy hosts
  2. Ensure the background checker produces consistent results with manual health checks
  3. Improve the initial monitor lifecycle (creation → first check → display)
  4. Resolve the functional inconsistency caused by the dual UptimeService instances
  5. Evaluate whether a "custom health endpoint URL" feature is warranted

Scope

  • Backend: backend/internal/services/uptime_service.go, backend/internal/api/routes/routes.go, backend/internal/api/handlers/proxy_host_handler.go
  • Frontend: frontend/src/pages/Uptime.tsx, frontend/src/api/uptime.ts
  • Models: backend/internal/models/uptime.go, backend/internal/models/uptime_host.go
  • Tests: backend/internal/services/uptime_service_test.go (1519 LOC), uptime_service_unit_test.go (257 LOC), uptime_service_race_test.go (402 LOC), tests/monitoring/uptime-monitoring.spec.ts (E2E)

2. Research Findings

2.1 Root Cause #1: Port Mismatch in Host-Level TCP Check (FIXED)

Status: Fixed in commit 209b2fc8, refactored in bfc19ef3.

The checkHost() function extracted the port from the monitor's public URL (e.g., 443 for HTTPS) instead of using ProxyHost.ForwardPort (e.g., 5690 for Wizarr). This caused TCP checks to fail, marking the host as down, which then skipped individual HTTP monitor checks.

Fix applied: Added Preload("ProxyHost") and prioritized monitor.ProxyHost.ForwardPort over extractPort(monitor.URL).
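The prioritization logic can be sketched as follows (hypothetical signatures for illustration; the real fix lives inside checkHost() and reads the preloaded ProxyHost association):

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// checkPort returns the port the host-level TCP check should use.
// It prefers the proxy host's ForwardPort (per the fix above) and only
// falls back to the port implied by the monitor's public URL.
func checkPort(forwardPort int, monitorURL string) int {
	if forwardPort > 0 {
		return forwardPort
	}
	return extractPort(monitorURL)
}

// extractPort derives a port from a URL, defaulting by scheme.
func extractPort(rawURL string) int {
	u, err := url.Parse(rawURL)
	if err != nil {
		return 0
	}
	if p := u.Port(); p != "" {
		n, _ := strconv.Atoi(p)
		return n
	}
	switch u.Scheme {
	case "https":
		return 443
	case "http":
		return 80
	}
	return 0
}

func main() {
	fmt.Println(checkPort(5690, "https://wizarr.example.com")) // ForwardPort wins
	fmt.Println(checkPort(0, "https://wizarr.example.com"))    // falls back to the URL's scheme port
}
```

With ForwardPort set (the Wizarr case), the TCP check now targets 5690 instead of 443, which is exactly the mismatch the fix removed.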

Evidence: Archived in docs/plans/archive/uptime_monitoring_diagnosis.md and docs/implementation/uptime_monitoring_port_fix_COMPLETE.md.

Remaining risk: If this fix has not been deployed to production, this remains the primary cause. If deployed, residual elevated failure_count values in the DB may need to be reset.

2.2 Root Cause #2: Dual UptimeService Instance (OPEN — Functional Inconsistency)

File: backend/internal/api/routes/routes.go

Two separate UptimeService instances are created:

| Instance | Line | Scope |
| --- | --- | --- |
| uptimeService | 226 | Background ticker goroutine, ProxyHostHandler, /system/uptime/check endpoint |
| uptimeSvc | 414 | Uptime API handler routes (List, Create, Update, Delete, Check, Sync) |

Both share the same *gorm.DB (so data consistency via DB is maintained), but each has independent in-memory state:

  • pendingNotifications map (notification batching)
  • hostMutexes map (per-host mutex for concurrent writes)
  • batchWindow timers

Impact: This is a functional inconsistency that can cause race conditions between ProxyHostHandler operations and Uptime API operations. Specifically:

  • ProxyHostHandler.Update() uses instance #1 (uptimeService) for SyncMonitorForHost
  • Uptime API queries (List, GetHistory) use instance #2 (uptimeSvc)
  • In-memory state (host mutexes, pending notifications) is invisible between instances

This creates a functional bug path because:

  • When a user triggers a manual check via POST /api/v1/uptime/monitors/:id/check, the handler uses uptimeSvc.CheckMonitor(). If the monitor transitions to "down", the notification is queued in uptimeSvc's pendingNotifications map. Meanwhile, the background checker uses uptimeService, which has a separate pendingNotifications map.
  • Duplicate or missed notifications
  • Independent failure debouncing state
  • Mutex contention issues between the two instances

While NOT the direct cause of the "DOWN" display bug, this is a functional inconsistency — not merely a code smell — that can produce observable bugs in notification delivery and state synchronization.

2.3 Root Cause #3: No Immediate Monitor Creation on Proxy Host Create (OPEN)

Note — Create ↔ Update asymmetry: ProxyHostHandler.Update() already calls SyncMonitorForHost (established pattern). The fix for Create should follow the same pattern for consistency.

When a user creates a new proxy host:

  1. The proxy host is saved to DB
  2. No uptime monitor is created — there is no hook in ProxyHostHandler.Create() to trigger SyncMonitors() or create a monitor
  3. SyncMonitorForHost() (called on proxy host update) only updates existing monitors — it does NOT create new ones
  4. The background ticker must fire (up to 1 minute) for SyncMonitors() to create the monitor

Timeline for a new proxy host to show status:

  • T+0s: Proxy host created via API
  • T+0s to T+60s: No uptime monitor exists — Uptime page shows nothing for this host
  • T+60s: Background ticker fires, SyncMonitors() creates monitor with status: "pending"
  • T+60s: CheckAll() runs, attempts host check + individual check
  • T+62s: If checks succeed, monitor status: "up" is saved to DB
  • T+90s (worst case): Frontend polls monitors and picks up the update

This is poor user experience. Users expect to see their new host on the Uptime page immediately.

2.4 Root Cause #4: "pending" Status Displayed as DOWN (OPEN)

File: frontend/src/pages/Uptime.tsx, MonitorCard component

const isUp = latestBeat ? latestBeat.status === 'up' : monitor.status === 'up';

When a new monitor has status: "pending" and no heartbeat history:

  • latestBeat = null (no history yet)
  • Falls back to monitor.status === 'up'
  • "pending" === "up" → false
  • Displayed with red DOWN styling

The UI has no dedicated "pending" or "unknown" state. Between creation and first check, every monitor appears DOWN.

2.5 Root Cause #5: No Initial CheckAll After Server Start Sync (OPEN)

File: backend/internal/api/routes/routes.go, lines 455-490

The background goroutine flow on server start:

  1. Sleep 30 seconds
  2. Call SyncMonitors() — creates monitors for all proxy hosts
  3. Does NOT call CheckAll()
  4. Start 1-minute ticker
  5. First CheckAll() runs on first tick (~90 seconds after server start)

This means after every server restart, all monitors sit in "pending" (displayed as DOWN) for up to 90 seconds.

2.6 Concern #6: Self-Referencing Check (Charon Pinging Itself)

If Charon has a proxy host pointing to itself (e.g., charon.example.com → localhost:8080):

TCP host check: Connects to localhost:8080 → succeeds (Gin server is running locally).

HTTP monitor check: Sends GET to https://charon.example.com → requires DNS resolution from inside the Docker container. This may fail due to:

  • Docker hairpin NAT: Containers cannot reach their own published ports via the host's external IP by default
  • Split-horizon DNS: The domain may resolve to a public IP that isn't routable from within the container
  • Caddy certificate validation: The HTTP client might reject a self-signed or incorrectly configured cert

When the user clicks manual refresh, the same checkMonitor() function runs with the same options (WithAllowLocalhost(), WithMaxRedirects(0)). If manual check succeeds but background check fails, the difference is likely timing-dependent — the alternating "up"/"down" pattern observed in the archived diagnosis (heartbeat records alternating between up|HTTP 200 and down|Host unreachable) supports this hypothesis.
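If the CheckURL enhancement (Section 3.5) lands, self-referencing hosts could be detected at sync time and given a direct local URL that bypasses the DNS/Caddy round-trip. A sketch of that detection (isSelfReference and localCheckURL are hypothetical helpers, not existing Charon functions):

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strconv"
)

// isSelfReference reports whether a proxy host's forward target points
// back at the machine Charon runs on (hypothetical helper).
func isSelfReference(forwardHost string) bool {
	if forwardHost == "localhost" {
		return true
	}
	if ip := net.ParseIP(forwardHost); ip != nil {
		return ip.IsLoopback()
	}
	hostname, err := os.Hostname()
	return err == nil && forwardHost == hostname
}

// localCheckURL builds a direct URL that skips DNS resolution and the
// Caddy reverse-proxy hop entirely.
func localCheckURL(forwardHost string, forwardPort int) string {
	return fmt.Sprintf("http://%s/", net.JoinHostPort(forwardHost, strconv.Itoa(forwardPort)))
}

func main() {
	if isSelfReference("127.0.0.1") {
		fmt.Println(localCheckURL("127.0.0.1", 8080)) // prints: http://127.0.0.1:8080/
	}
}
```

This would make the background check path identical to the TCP pre-check path for self-referencing hosts, eliminating the hairpin-NAT and split-horizon DNS failure modes listed above.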

2.7 Feature Gap: No Custom Health Endpoint URL

The UptimeMonitor model has no health_endpoint or custom_url field. All monitors check the public root URL (/). This is problematic because:

  • Some services redirect root → /login → 302 → tracked inconsistently
  • Services with dedicated health endpoints (/health, /api/health) provide more reliable status
  • Self-referencing checks (Charon) could use http://localhost:8080/api/v1/health instead of routing through DNS/Caddy

2.8 Existing Test Coverage

| File | LOC | Focus |
| --- | --- | --- |
| uptime_service_test.go | 1519 | Integration tests with SQLite DB |
| uptime_service_unit_test.go | 257 | Unit tests for service methods |
| uptime_service_race_test.go | 402 | Concurrency/race condition tests |
| uptime_service_notification_test.go | | Notification batching tests |
| uptime_handler_test.go | | Handler HTTP endpoint tests |
| uptime_monitor_initial_state_test.go | | Initial state tests |
| uptime-monitoring.spec.ts | | Playwright E2E (22 scenarios) |

3. Technical Specifications

3.1 Consolidate UptimeService Singleton

Current: Two instances (uptimeService line 226, uptimeSvc line 414) in routes.go.

Target: Single instance passed to both the background goroutine AND the API handlers.

// routes.go — BEFORE (two instances)
uptimeService := services.NewUptimeService(db, notificationService)  // line 226
uptimeSvc := services.NewUptimeService(db, notificationService)      // line 414

// routes.go — AFTER (single instance)
uptimeService := services.NewUptimeService(db, notificationService)  // line 226
// line 414: reuse uptimeService for handler registration
uptimeHandler := handlers.NewUptimeHandler(uptimeService)

Impact: All in-memory state (mutexes, notification batching, pending notifications) is shared. The single instance must remain thread-safe (it already is — methods use sync.Mutex).

3.2 Trigger Monitor Creation + Immediate Check on Proxy Host Create

File: backend/internal/api/handlers/proxy_host_handler.go

After successfully creating a proxy host, call SyncMonitors() (or a targeted sync) and trigger an immediate check:

// In Create handler, after host is saved:
if h.uptimeService != nil {
    _ = h.uptimeService.SyncMonitors()
    // Trigger immediate check for the new monitor
    var monitor models.UptimeMonitor
    if err := h.uptimeService.DB.Where("proxy_host_id = ?", host.ID).First(&monitor).Error; err == nil {
        go h.uptimeService.CheckMonitor(monitor)
    }
}

Alternative (lighter-weight): Add a SyncAndCheckForHost(hostID uint) method that creates the monitor if needed and immediately checks it.
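A minimal in-memory sketch of that find-or-create-then-check lifecycle (hypothetical types; the real method would operate on the GORM-backed UptimeMonitor table, and the done channel here is only a synchronization aid for the example):

```go
package main

import (
	"fmt"
	"sync"
)

// Monitor is a stand-in for models.UptimeMonitor.
type Monitor struct {
	ID     uint
	HostID uint
	Status string
}

// UptimeService is a stripped-down stand-in for the real service.
type UptimeService struct {
	mu       sync.Mutex
	monitors map[uint]*Monitor      // keyed by proxy host ID
	nextID   uint
	check    func(*Monitor) string // injected checker, e.g. an HTTP probe
}

// SyncAndCheckForHost finds or creates the monitor for hostID, kicks off
// an immediate check in a goroutine, and returns the monitor ID so the
// caller (the Create handler) can respond without waiting.
func (s *UptimeService) SyncAndCheckForHost(hostID uint, done chan<- uint) uint {
	s.mu.Lock()
	m, ok := s.monitors[hostID]
	if !ok {
		s.nextID++
		m = &Monitor{ID: s.nextID, HostID: hostID, Status: "pending"}
		s.monitors[hostID] = m
	}
	s.mu.Unlock()

	go func() {
		status := s.check(m)
		s.mu.Lock()
		m.Status = status
		s.mu.Unlock()
		done <- m.ID
	}()
	return m.ID
}

func main() {
	done := make(chan uint, 1)
	svc := &UptimeService{
		monitors: map[uint]*Monitor{},
		check:    func(*Monitor) string { return "up" },
	}
	id := svc.SyncAndCheckForHost(42, done)
	<-done
	fmt.Println(id, svc.monitors[42].Status) // prints: 1 up
}
```

The key property is that the handler's response time is decoupled from the check: the monitor exists (status "pending") before the HTTP 201 returns, and the check result lands asynchronously.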

3.3 Add "pending" UI State

File: frontend/src/pages/Uptime.tsx

Add dedicated handling for "pending" status:

const isPending = monitor.status === 'pending' && (!history || history.length === 0);
const isUp = latestBeat ? latestBeat.status === 'up' : monitor.status === 'up';
const isPaused = monitor.enabled === false;

Visual treatment for pending state:

  • Yellow/gray pulsing indicator (distinct from DOWN red and UP green)
  • Badge text: "CHECKING..." or "PENDING"
  • Heartbeat bar: show empty placeholder bars with a spinner or pulse animation

3.4 Run CheckAll After Initial SyncMonitors

File: backend/internal/api/routes/routes.go

// AFTER initial sync
if enabled {
    if err := uptimeService.SyncMonitors(); err != nil {
        logger.Log().WithError(err).Error("Failed to sync monitors")
    }
    // Run initial check immediately
    uptimeService.CheckAll()
}

3.5 Add Optional check_url Field to UptimeMonitor (Enhancement)

Model change (backend/internal/models/uptime.go):

type UptimeMonitor struct {
    // ... existing fields
    CheckURL string `json:"check_url,omitempty" gorm:"default:null"`
}

Service behavior (uptime_service.go checkMonitor()):

  • If monitor.CheckURL is set and non-empty, use it instead of monitor.URL for the HTTP check
  • This allows users to configure /health or http://localhost:8080/api/v1/health for self-referencing

Frontend: Add an optional "Health Check URL" field in the edit monitor modal.

Auto-migration: GORM handles adding the column. Existing monitors keep CheckURL = "" (uses default URL behavior).

3.5.1 SSRF Protection for CheckURL

The CheckURL field accepts user-controlled URLs that the server will fetch. This requires layered SSRF defenses:

Write-time validation (on Create/Update API):

  • Validate CheckURL before saving to DB
  • Scheme restriction: Only http:// and https:// allowed. Block file://, ftp://, gopher://, and all other schemes
  • Max URL length: 2048 characters
  • Reject URLs that fail url.Parse() or have empty host components

Check-time validation (before each HTTP request):

  • Re-validate the URL against the deny list before every check execution (defense-in-depth — the stored URL could have been valid at write time but conditions may change)
  • Localhost handling: Allow loopback addresses (127.0.0.1, ::1, localhost) since self-referencing checks are a valid use case. Block cloud metadata IPs:
    • 169.254.169.254 (AWS/GCP/Azure instance metadata)
    • fd00::/8 (unique local addresses)
    • 100.100.100.200 (Alibaba Cloud metadata)
    • 169.254.0.0/16 (the entire link-local range, which contains the metadata IPs above)
  • DNS rebinding protection: Resolve the hostname at request time, pin the resolved IP, and validate the resolved IP against the deny list before establishing a connection. Use a custom net.Dialer or http.Transport.DialContext to enforce this
  • Redirect validation: If CheckURL follows HTTP redirects (3xx), validate each redirect target URL against the same deny list (scheme, host, resolved IP). Use a CheckRedirect function on the http.Client to intercept and validate each hop

Implementation pattern:

func validateCheckURL(rawURL string) error {
    if len(rawURL) > 2048 {
        return ErrURLTooLong
    }
    parsed, err := url.Parse(rawURL)
    if err != nil {
        return ErrInvalidURL
    }
    if parsed.Scheme != "http" && parsed.Scheme != "https" {
        return ErrDisallowedScheme
    }
    if parsed.Host == "" {
        return ErrEmptyHost
    }
    return nil
}

func validateResolvedIP(ip net.IP) error {
    // Allow loopback
    if ip.IsLoopback() {
        return nil
    }
    // Block cloud metadata and link-local
    if isCloudMetadataIP(ip) || ip.IsLinkLocalUnicast() {
        return ErrDeniedIP
    }
    return nil
}

3.6 Data Cleanup: Reset Stale Failure Counts

After deploying the port fix (if not already deployed), run a one-time DB cleanup:

-- Reset failure counts for hosts/monitors stuck from the port mismatch era
-- Only reset monitors with elevated failure counts AND no recent successful heartbeat
UPDATE uptime_hosts SET failure_count = 0, status = 'pending' WHERE status = 'down';
UPDATE uptime_monitors SET failure_count = 0, status = 'pending'
WHERE status = 'down'
  AND failure_count > 5
  AND id NOT IN (
    SELECT DISTINCT monitor_id FROM uptime_heartbeats
    WHERE status = 'up' AND created_at > datetime('now', '-24 hours')
  );

This could be automated in SyncMonitors() or done via a migration.


4. Data Flow Diagrams

Current Flow (Buggy)

[Proxy Host Created] → (no uptime action)
  → [Wait up to 60s for ticker]
  → SyncMonitors() creates monitor (status: "pending")
  → CheckAll() runs:
      → checkAllHosts() TCP to ForwardHost:ForwardPort
      → If host up → checkMonitor() HTTP to public URL
      → DB updated
  → [Wait up to 30s for frontend poll]
  → Frontend displays status

Proposed Flow (Fixed)

[Proxy Host Created]
  → SyncMonitors() or SyncAndCheckForHost() immediately
  → Monitor created (status: "pending")
  → Frontend shows "PENDING" (yellow indicator)
  → Immediate checkMonitor() in background goroutine
  → DB updated (status: "up" or "down")
  → Frontend polls in 30s → shows actual status

5. Implementation Plan

Phase 1: Playwright E2E Tests (Behavior Specification)

Define expected behavior before implementation:

| Test | Description |
| --- | --- |
| New proxy host monitor appears immediately | After creating a proxy host, navigate to Uptime page, verify the monitor card exists |
| New monitor shows pending state | Verify "PENDING" badge before first check completes |
| Monitor status updates after check | Trigger manual check, verify status changes from pending/down to up |
| Verify no false DOWN on first load | Create host, wait for background check, verify status is UP (not DOWN) |

Files: tests/monitoring/uptime-monitoring.spec.ts (extend existing suite)

Phase 2: Backend — Consolidate UptimeService Instance

  1. Remove second NewUptimeService call at routes.go line 414
  2. Pass uptimeService (line 226) to NewUptimeHandler()
  3. Verify all handler operations use the shared instance
  4. Update existing tests that may create multiple instances

Files: backend/internal/api/routes/routes.go

Phase 3: Backend — Immediate Monitor Lifecycle

  1. In ProxyHostHandler.Create(), after saving host: call SyncMonitors() or create a targeted SyncAndCheckForHost() method
  2. Add CheckAll() call after initial SyncMonitors() in the background goroutine
  3. Consider adding a SyncAndCheckForHost(hostID uint) method to UptimeService that:
    • Finds or creates the monitor for the given proxy host
    • Immediately runs checkMonitor() in a goroutine
    • Returns the monitor ID for the caller

Files: backend/internal/services/uptime_service.go, backend/internal/api/handlers/proxy_host_handler.go, backend/internal/api/routes/routes.go

Phase 4: Frontend — Pending State Display

  1. Add isPending check in MonitorCard component
  2. Add yellow/gray styling for pending state
  3. Add pulsing animation for pending badge
  4. Add i18n key uptime.pending → "CHECKING..." for all 5 supported languages (not just the default locale)
  5. Ensure heartbeat bar handles zero-length history gracefully

Files: frontend/src/pages/Uptime.tsx, frontend/src/i18n/ locale files

Phase 5: Backend — Optional check_url Field (Enhancement)

  1. Add CheckURL field to UptimeMonitor model
  2. Update checkMonitor() to use CheckURL if set
  3. Update SyncMonitors() — do NOT overwrite user-configured CheckURL
  4. Update API DTOs for create/update

Files: backend/internal/models/uptime.go, backend/internal/services/uptime_service.go, backend/internal/api/handlers/uptime_handler.go

Phase 6: Frontend — Health Check URL in Edit Modal

  1. Add optional "Health Check URL" field to EditMonitorModal and CreateMonitorModal
  2. Show placeholder text: "Leave empty to use monitor URL"
  3. Validate URL format on frontend

Files: frontend/src/pages/Uptime.tsx

Phase 7: Testing & Validation

  1. Run existing backend test suites (2178 LOC across 3 files)
  2. Add tests for:
    • Single UptimeService instance behavior
    • Immediate monitor creation on proxy host create
    • CheckURL fallback logic
    • "pending" → "up" transition
  3. Add edge case tests:
    • Rapid Create-Delete: Proxy host created and immediately deleted before SyncAndCheckForHost goroutine completes — goroutine should handle non-existent proxy host gracefully (no panic, no orphaned monitor)
    • Concurrent Creates: Multiple proxy hosts created simultaneously — verify SyncMonitors() from Create handlers doesn't conflict with background ticker's SyncMonitors() (no duplicate monitors, no data races)
    • Feature Flag Toggle: If feature.uptime.enabled is toggled to false while immediate check goroutine is running — goroutine should exit cleanly without writing stale results
    • CheckURL with redirects: CheckURL that 302-redirects to a private IP — redirect target must be validated against the deny list (SSRF redirect chain)
  4. Run Playwright E2E suite with Docker rebuild
  5. Verify coverage thresholds

Phase 8: Data Cleanup Migration

  1. Add one-time migration or startup hook to reset stale failure_count and status on hosts/monitors that were stuck from the port mismatch era
  2. Log the cleanup action

6. EARS Requirements

  1. WHEN a new proxy host is created, THE SYSTEM SHALL create a corresponding uptime monitor within 5 seconds (not waiting for the 1-minute ticker)
  2. WHEN a new uptime monitor is created, THE SYSTEM SHALL immediately trigger a health check in a background goroutine
  3. WHEN a monitor has status "pending" and no heartbeat history, THE SYSTEM SHALL display a distinct visual indicator (not DOWN red)
  4. WHEN the server starts, THE SYSTEM SHALL run CheckAll() immediately after SyncMonitors() (not wait for first tick)
  5. THE SYSTEM SHALL use a single UptimeService instance for both background checks and API handlers
  6. WHERE a monitor has a check_url configured, THE SYSTEM SHALL use it for health checks instead of the monitor URL
  7. WHEN a monitor's host-level TCP check succeeds but HTTP check fails, THE SYSTEM SHALL record the specific failure reason in the heartbeat message
  8. IF the uptime feature flag is disabled, THEN THE SYSTEM SHALL skip all monitor sync and check operations

7. Acceptance Criteria

Must Have

  • WHEN a new proxy host is created, a corresponding uptime monitor exists within 5 seconds
  • WHEN a new uptime monitor is created, an immediate health check runs
  • WHEN a monitor has status "pending", a distinct yellow/gray visual indicator is shown (not red DOWN)
  • WHEN the server starts, CheckAll() runs immediately after SyncMonitors()
  • Only one UptimeService instance exists at runtime

Should Have

  • WHEN a monitor has a check_url configured, it is used for health checks
  • WHEN a monitor's host-level TCP check succeeds but HTTP check fails, the heartbeat message contains the failure reason
  • Stale failure_count values from the port mismatch era are reset on deployment

Nice to Have

  • Dedicated UI indicator for "first check in progress" (animated pulse)
  • Automatic detection of health endpoints (try /health first, fall back to /)

8. PR Slicing Strategy

Decision: 3 PRs

Trigger reasons: Cross-domain changes (backend + frontend + model), independent concerns (UX fix vs backend architecture vs new feature), review size management.

PR-1: Backend Bug Fixes (Architecture + Lifecycle)

Scope: Phases 2, 3, and initial CheckAll (Section 3.4)

Files:

  • backend/internal/api/routes/routes.go — consolidate to single UptimeService instance, add CheckAll after initial sync
  • backend/internal/services/uptime_service.go — add SyncAndCheckForHost() method
  • backend/internal/api/handlers/proxy_host_handler.go — call SyncAndCheckForHost on Create
  • Backend test files — update for single instance, add new lifecycle tests
  • Data cleanup migration
  • ARCHITECTURE.md — update to reflect the UptimeService singleton consolidation (architecture change)

Dependencies: None (independent of frontend changes)

Validation: All backend tests pass, no duplicate UptimeService instantiation, new proxy hosts get immediate monitors, ARCHITECTURE.md reflects current design

Rollback: Revert commit; behavior returns to previous (ticker-based) lifecycle

PR-2: Frontend Pending State

Scope: Phase 4

Files:

  • frontend/src/pages/Uptime.tsx — add pending state handling
  • frontend/src/i18n/ locale files — add uptime.pending key
  • frontend/src/pages/__tests__/Uptime.spec.tsx — update tests

Dependencies: Works independently of PR-1 (pending state display improves UX regardless of backend fix timing)

Validation: Playwright E2E tests pass, pending monitors show yellow indicator

Rollback: Revert commit; pending monitors display as DOWN (existing behavior)

PR-3: Custom Health Check URL (Enhancement)

Scope: Phases 5, 6

Files:

  • backend/internal/models/uptime.go — add CheckURL field
  • backend/internal/services/uptime_service.go — use CheckURL in checkMonitor
  • backend/internal/api/handlers/uptime_handler.go — update DTOs
  • frontend/src/pages/Uptime.tsx — add form field
  • Test files — add coverage for CheckURL logic

Dependencies: PR-1 should be merged first (shared instance simplifies testing)

Validation: Create monitor with custom health URL, verify check uses it

Rollback: Revert commit; GORM auto-migration adds the column but it remains unused


9. Risk Assessment

| Risk | Severity | Likelihood | Mitigation |
| --- | --- | --- | --- |
| Consolidating UptimeService instance introduces race conditions | High | Low | Existing mutex protections are designed for shared use; run race tests with -race flag |
| Immediate SyncMonitors on proxy host create adds latency to API response | Medium | Medium | Run SyncAndCheckForHost in a goroutine; return HTTP 201 immediately |
| "pending" UI state confuses users who expect UP/DOWN binary | Low | Low | Clear tooltip/label: "Initial health check in progress..." |
| CheckURL allows SSRF if user provides malicious URL | High | Low | Layered SSRF defense (see Section 3.5.1): write-time validation (scheme, length, parse), check-time re-validation, DNS rebinding protection (pin resolved IP against deny list), redirect chain validation. Allow loopback for self-referencing checks; block cloud metadata IPs (169.254.169.254, fd00::, etc.) |
| Data cleanup migration resets legitimate DOWN status | Medium | Medium | Only reset monitors with elevated failure counts AND no recent successful heartbeat |
| Self-referencing check (Charon) still fails due to Docker DNS | Medium | High | PR-3 scope: When SyncMonitors() creates a monitor, if ForwardHost resolves to loopback (localhost, 127.0.0.1, or the container's own hostname), automatically set CheckURL to http://{ForwardHost}:{ForwardPort}/ to bypass the DNS/Caddy round-trip. Tracked as technical debt if deferred beyond PR-3 |

10. Validation Plan (Mandatory Sequence)

  1. E2E environment prerequisite

    • Determine rebuild necessity per testing policy: if application/runtime or Docker input changes are present, rebuild is required.
    • If rebuild is required or the container is unhealthy, run .github/skills/scripts/skill-runner.sh docker-rebuild-e2e.
    • Record container health outcome before executing tests.
  2. Playwright first

    • Run targeted uptime monitoring E2E scenarios.
  3. Local patch coverage preflight

    • Generate test-results/local-patch-report.md and test-results/local-patch-report.json.
  4. Unit and coverage

    • Backend coverage run (threshold >= 85%).
    • Frontend coverage run (threshold >= 85%).
  5. Race condition tests

    • Run go test -race ./backend/internal/services/... to verify single-instance thread safety.
  6. Type checks

    • Frontend TypeScript check.
  7. Pre-commit

    • pre-commit run --all-files with zero blocking failures.
  8. Security scans

    • CodeQL Go + JS (security-and-quality).
    • GORM security scan (model changes in PR-3).
    • Trivy scan.
  9. Build verification

    • Backend build + frontend build pass.

11. Architecture Reference

Two-Level Check System

Level 1: Host-Level TCP Pre-Check
├── Purpose: Quickly determine if backend host/container is reachable
├── Method: TCP connection to ForwardHost:ForwardPort
├── Runs: Once per unique UptimeHost
├── If DOWN → Skip all Level 2 checks, mark all monitors DOWN
└── If UP → Proceed to Level 2

Level 2: Service-Level HTTP/TCP Check
├── Purpose: Verify specific service is responding correctly
├── Method: HTTP GET to monitor URL (or CheckURL if set)
├── Runs: Per-monitor (in parallel goroutines)
└── Accepts: 2xx, 3xx, 401, 403 as "up"
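The Level 2 acceptance rule can be expressed as a small predicate (sketch only, mirroring the list above; the real check lives inside checkMonitor() and may differ in detail):

```go
package main

import (
	"fmt"
	"net/http"
)

// upStatus reports whether an HTTP status code counts as "up" for a
// Level 2 check: any 2xx or 3xx, plus auth-gated responses, which prove
// the service is reachable even though the probe is unauthenticated.
func upStatus(code int) bool {
	if code >= 200 && code < 400 { // 2xx and 3xx
		return true
	}
	return code == http.StatusUnauthorized || code == http.StatusForbidden
}

func main() {
	fmt.Println(upStatus(200), upStatus(302), upStatus(401), upStatus(404))
}
```

Treating 401/403 as "up" is what keeps login-walled services (a common case behind a reverse proxy) from flapping to DOWN.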

Background Ticker Flow

Server Start → Sleep 30s → SyncMonitors()
  → [PROPOSED] CheckAll()
  → Start 1-minute ticker
  → Each tick: SyncMonitors() → CheckAll()
      → checkAllHosts() [parallel, staggered]
      → Group monitors by host
      → For each host:
          If down → markHostMonitorsDown()
          If up → checkMonitor() per monitor [parallel goroutines]

Key Configuration Values

| Setting | Value | Source |
| --- | --- | --- |
| batchWindow | 30s | NewUptimeService() |
| TCPTimeout | 10s | NewUptimeService() |
| MaxRetries (host) | 2 | NewUptimeService() |
| FailureThreshold (host) | 2 | NewUptimeService() |
| CheckTimeout | 60s | NewUptimeService() |
| StaggerDelay | 100ms | NewUptimeService() |
| MaxRetries (monitor) | 3 | UptimeMonitor.MaxRetries default |
| Ticker interval | 1 min | routes.go ticker |
| Frontend poll interval | 30s | Uptime.tsx refetchInterval |
| History poll interval | 60s | MonitorCard refetchInterval |

12. Rollback and Contingency

  1. PR-1: If consolidating UptimeService causes regressions → revert commit; background checker and API revert to two separate instances (existing behavior).
  2. PR-2: If pending state display causes confusion → revert commit; monitors display DOWN for pending (existing behavior).
  3. PR-3: If CheckURL introduces SSRF or regressions → revert commit; column stays in DB but is unused.
  4. Data cleanup: If migration resets legitimate DOWN hosts → restore from SQLite backup (standard Charon backup flow).

Post-rollback smoke checks:

  • Verify background ticker creates monitors for all proxy hosts
  • Verify manual health check button produces correct status
  • Verify notification batching works correctly