Uptime Monitoring Bug Triage & Fix Plan
1. Introduction
Overview
Uptime Monitoring in Charon uses a two-level check system: host-level TCP pre-checks followed by per-monitor HTTP/TCP checks. Newly added proxy hosts (specifically Wizarr and Charon itself) display as "DOWN" in the UI even though the underlying services are fully accessible. Manual refresh via the health check button on the Uptime page correctly shows "UP", but the automated background checker fails to produce the same result.
Objectives
- Eliminate false "DOWN" status for newly added proxy hosts
- Ensure the background checker produces consistent results with manual health checks
- Improve the initial monitor lifecycle (creation → first check → display)
- Address the dual `UptimeService` instance functional inconsistency
- Evaluate whether a "custom health endpoint URL" feature is warranted
Scope
- Backend: `backend/internal/services/uptime_service.go`, `backend/internal/api/routes/routes.go`, `backend/internal/api/handlers/proxy_host_handler.go`
- Frontend: `frontend/src/pages/Uptime.tsx`, `frontend/src/api/uptime.ts`
- Models: `backend/internal/models/uptime.go`, `backend/internal/models/uptime_host.go`
- Tests: `backend/internal/services/uptime_service_test.go` (1519 LOC), `uptime_service_unit_test.go` (257 LOC), `uptime_service_race_test.go` (402 LOC), `tests/monitoring/uptime-monitoring.spec.ts` (E2E)
2. Research Findings
2.1 Root Cause #1: Port Mismatch in Host-Level TCP Check (FIXED)
Status: Fixed in commit 209b2fc8, refactored in bfc19ef3.
The checkHost() function extracted the port from the monitor's public URL (e.g., 443 for HTTPS) instead of using ProxyHost.ForwardPort (e.g., 5690 for Wizarr). This caused TCP checks to fail, marking the host as down, which then skipped individual HTTP monitor checks.
Fix applied: Added Preload("ProxyHost") and prioritized monitor.ProxyHost.ForwardPort over extractPort(monitor.URL).
Evidence: Archived in docs/plans/archive/uptime_monitoring_diagnosis.md and docs/implementation/uptime_monitoring_port_fix_COMPLETE.md.
Remaining risk: If this fix has not been deployed to production, this remains the primary cause. If deployed, residual elevated failure_count values in the DB may need to be reset.
2.2 Root Cause #2: Dual UptimeService Instance (OPEN — Functional Inconsistency)
File: backend/internal/api/routes/routes.go
Two separate UptimeService instances are created:
| Instance | Line | Scope |
|---|---|---|
| `uptimeService` | 226 | Background ticker goroutine, ProxyHostHandler, `/system/uptime/check` endpoint |
| `uptimeSvc` | 414 | Uptime API handler routes (List, Create, Update, Delete, Check, Sync) |
Both share the same *gorm.DB (so data consistency via DB is maintained), but each has independent in-memory state:
- `pendingNotifications` map (notification batching)
- `hostMutexes` map (per-host mutex for concurrent writes)
- `batchWindow` timers
Impact: This is a functional inconsistency that can cause race conditions between ProxyHostHandler operations and Uptime API operations. Specifically:
- `ProxyHostHandler.Create()` uses instance #1 (`uptimeService`) for `SyncAndCheckForHost`
- Uptime API queries (List, GetHistory) use instance #2 (`uptimeSvc`)
- In-memory state (host mutexes, pending notifications) is invisible between instances
This creates a functional bug path because:
- When a user triggers a manual check via `POST /api/v1/uptime/monitors/:id/check`, the handler uses `uptimeSvc.CheckMonitor()`. If the monitor transitions to "down", the notification is queued in `uptimeSvc`'s `pendingNotifications` map. Meanwhile, the background checker uses `uptimeService`, which has a separate `pendingNotifications` map.
- Duplicate or missed notifications
- Independent failure debouncing state
- Mutex contention issues between the two instances
While NOT the direct cause of the "DOWN" display bug, this is a functional inconsistency — not merely a code smell — that can produce observable bugs in notification delivery and state synchronization.
2.3 Root Cause #3: No Immediate Monitor Creation on Proxy Host Create (OPEN)
Note — Create ↔ Update asymmetry: `ProxyHostHandler.Update()` already calls `SyncMonitorForHost` (established pattern). The fix for `Create` should follow the same pattern for consistency.
When a user creates a new proxy host:
- The proxy host is saved to DB
- No uptime monitor is created — there is no hook in `ProxyHostHandler.Create()` to trigger `SyncMonitors()` or create a monitor
- `SyncMonitorForHost()` (called on proxy host update) only updates existing monitors — it does NOT create new ones
- The background ticker must fire (up to 1 minute) for `SyncMonitors()` to create the monitor
Timeline for a new proxy host to show status:
- T+0s: Proxy host created via API
- T+0s to T+60s: No uptime monitor exists — Uptime page shows nothing for this host
- T+60s: Background ticker fires, `SyncMonitors()` creates monitor with `status: "pending"`
- T+60s: `CheckAll()` runs, attempts host check + individual check
- T+62s: If checks succeed, monitor `status: "up"` is saved to DB
- T+90s (worst case): Frontend polls monitors and picks up the update
This is poor UX. Users expect to see their new host on the Uptime page immediately.
2.4 Root Cause #4: "pending" Status Displayed as DOWN (OPEN)
File: frontend/src/pages/Uptime.tsx, MonitorCard component
```ts
const isUp = latestBeat ? latestBeat.status === 'up' : monitor.status === 'up';
```
When a new monitor has status: "pending" and no heartbeat history:
- `latestBeat` = `null` (no history yet)
- Falls back to `monitor.status === 'up'`
- `"pending" === "up"` → `false`
- Displayed with red DOWN styling
The UI has no dedicated "pending" or "unknown" state. Between creation and first check, every monitor appears DOWN.
2.5 Root Cause #5: No Initial CheckAll After Server Start Sync (OPEN)
File: backend/internal/api/routes/routes.go, lines 455-490
The background goroutine flow on server start:
- Sleep 30 seconds
- Call `SyncMonitors()` — creates monitors for all proxy hosts
- Does NOT call `CheckAll()`
- Start 1-minute ticker
- First `CheckAll()` runs on first tick (~90 seconds after server start)
This means after every server restart, all monitors sit in "pending" (displayed as DOWN) for up to 90 seconds.
2.6 Concern #6: Self-Referencing Check (Charon Pinging Itself)
If Charon has a proxy host pointing to itself (e.g., charon.example.com → localhost:8080):
TCP host check: Connects to localhost:8080 → succeeds (Gin server is running locally).
HTTP monitor check: Sends GET to https://charon.example.com → requires DNS resolution from inside the Docker container. This may fail due to:
- Docker hairpin NAT: Containers cannot reach their own published ports via the host's external IP by default
- Split-horizon DNS: The domain may resolve to a public IP that isn't routable from within the container
- Caddy certificate validation: The HTTP client might reject a self-signed or incorrectly configured cert
When the user clicks manual refresh, the same `checkMonitor()` function runs with the same options (`WithAllowLocalhost()`, `WithMaxRedirects(0)`). If manual check succeeds but background check fails, the difference is likely timing-dependent — the alternating "up"/"down" pattern observed in the archived diagnosis (heartbeat records alternating between `up|HTTP 200` and `down|Host unreachable`) supports this hypothesis.
2.7 Feature Gap: No Custom Health Endpoint URL
The UptimeMonitor model has no health_endpoint or custom_url field. All monitors check the public root URL (/). This is problematic because:
- Some services redirect root → `/login` → 302 → tracked inconsistently
- Services with dedicated health endpoints (`/health`, `/api/health`) provide more reliable status
- Self-referencing checks (Charon) could use `http://localhost:8080/api/v1/health` instead of routing through DNS/Caddy
2.8 Existing Test Coverage
| File | LOC | Focus |
|---|---|---|
| `uptime_service_test.go` | 1519 | Integration tests with SQLite DB |
| `uptime_service_unit_test.go` | 257 | Unit tests for service methods |
| `uptime_service_race_test.go` | 402 | Concurrency/race condition tests |
| `uptime_service_notification_test.go` | — | Notification batching tests |
| `uptime_handler_test.go` | — | Handler HTTP endpoint tests |
| `uptime_monitor_initial_state_test.go` | — | Initial state tests |
| `uptime-monitoring.spec.ts` | — | Playwright E2E (22 scenarios) |
3. Technical Specifications
3.1 Consolidate UptimeService Singleton
Current: Two instances (uptimeService line 226, uptimeSvc line 414) in routes.go.
Target: Single instance passed to both the background goroutine AND the API handlers.
```go
// routes.go — BEFORE (two instances)
uptimeService := services.NewUptimeService(db, notificationService) // line 226
uptimeSvc := services.NewUptimeService(db, notificationService)     // line 414

// routes.go — AFTER (single instance)
uptimeService := services.NewUptimeService(db, notificationService) // line 226
// line 414: reuse uptimeService for handler registration
uptimeHandler := handlers.NewUptimeHandler(uptimeService)
```
Impact: All in-memory state (mutexes, notification batching, pending notifications) is shared. The single instance must remain thread-safe (it already is — methods use sync.Mutex).
3.2 Trigger Monitor Creation + Immediate Check on Proxy Host Create
File: backend/internal/api/handlers/proxy_host_handler.go
After successfully creating a proxy host, call SyncMonitors() (or a targeted sync) and trigger an immediate check:
```go
// In Create handler, after host is saved:
if h.uptimeService != nil {
	_ = h.uptimeService.SyncMonitors()
	// Trigger immediate check for the new monitor
	var monitor models.UptimeMonitor
	if err := h.uptimeService.DB.Where("proxy_host_id = ?", host.ID).First(&monitor).Error; err == nil {
		go h.uptimeService.CheckMonitor(monitor)
	}
}
```
Alternative (lighter-weight): Add a SyncAndCheckForHost(hostID uint) method that creates the monitor if needed and immediately checks it.
3.3 Add "pending" UI State
File: frontend/src/pages/Uptime.tsx
Add dedicated handling for "pending" status:
```ts
const isPending = monitor.status === 'pending' && (!history || history.length === 0);
const isUp = latestBeat ? latestBeat.status === 'up' : monitor.status === 'up';
const isPaused = monitor.enabled === false;
```
Visual treatment for pending state:
- Yellow/gray pulsing indicator (distinct from DOWN red and UP green)
- Badge text: "CHECKING..." or "PENDING"
- Heartbeat bar: show empty placeholder bars with a spinner or pulse animation
3.4 Run CheckAll After Initial SyncMonitors
File: backend/internal/api/routes/routes.go
```go
// AFTER initial sync
if enabled {
	if err := uptimeService.SyncMonitors(); err != nil {
		logger.Log().WithError(err).Error("Failed to sync monitors")
	}
	// Run initial check immediately
	uptimeService.CheckAll()
}
```
3.5 Add Optional check_url Field to UptimeMonitor (Enhancement)
Model change (backend/internal/models/uptime.go):
```go
type UptimeMonitor struct {
	// ... existing fields
	CheckURL string `json:"check_url,omitempty" gorm:"default:null"`
}
```
Service behavior (uptime_service.go checkMonitor()):
- If `monitor.CheckURL` is set and non-empty, use it instead of `monitor.URL` for the HTTP check
- This allows users to configure `/health` or `http://localhost:8080/api/v1/health` for self-referencing checks
Frontend: Add an optional "Health Check URL" field in the edit monitor modal.
Auto-migration: GORM handles adding the column. Existing monitors keep CheckURL = "" (uses default URL behavior).
3.5.1 SSRF Protection for CheckURL
The CheckURL field accepts user-controlled URLs that the server will fetch. This requires layered SSRF defenses:
Write-time validation (on Create/Update API):
- Validate `CheckURL` before saving to DB
- Scheme restriction: Only `http://` and `https://` allowed. Block `file://`, `ftp://`, `gopher://`, and all other schemes
- Max URL length: 2048 characters
- Reject URLs that fail `url.Parse()` or have empty host components
Check-time validation (before each HTTP request):
- Re-validate the URL against the deny list before every check execution (defense-in-depth — the stored URL could have been valid at write time but conditions may change)
- Localhost handling: Allow loopback addresses (`127.0.0.1`, `::1`, `localhost`) since self-referencing checks are a valid use case. Block cloud metadata IPs:
  - `169.254.169.254` (AWS/GCP/Azure instance metadata)
  - `fd00::/8` (unique local addresses)
  - `100.100.100.200` (Alibaba Cloud metadata)
  - `169.254.0.0/16` link-local range (except loopback)
- DNS rebinding protection: Resolve the hostname at request time, pin the resolved IP, and validate the resolved IP against the deny list before establishing a connection. Use a custom `net.Dialer` or `http.Transport.DialContext` to enforce this
- Redirect validation: If `CheckURL` follows HTTP redirects (3xx), validate each redirect target URL against the same deny list (scheme, host, resolved IP). Use a `CheckRedirect` function on the `http.Client` to intercept and validate each hop
Implementation pattern:
```go
import (
	"errors"
	"net"
	"net/url"
)

var (
	ErrURLTooLong       = errors.New("check URL exceeds max length")
	ErrInvalidURL       = errors.New("check URL is not a valid URL")
	ErrDisallowedScheme = errors.New("check URL scheme must be http or https")
	ErrEmptyHost        = errors.New("check URL has no host")
	ErrDeniedIP         = errors.New("resolved IP is on the deny list")
)

func validateCheckURL(rawURL string) error {
	if len(rawURL) > 2048 {
		return ErrURLTooLong
	}
	parsed, err := url.Parse(rawURL)
	if err != nil {
		return ErrInvalidURL
	}
	if parsed.Scheme != "http" && parsed.Scheme != "https" {
		return ErrDisallowedScheme
	}
	if parsed.Host == "" {
		return ErrEmptyHost
	}
	return nil
}

func validateResolvedIP(ip net.IP) error {
	// Allow loopback (self-referencing checks are a valid use case)
	if ip.IsLoopback() {
		return nil
	}
	// Block cloud metadata and link-local ranges
	if isCloudMetadataIP(ip) || ip.IsLinkLocalUnicast() {
		return ErrDeniedIP
	}
	return nil
}
```
3.6 Data Cleanup: Reset Stale Failure Counts
After deploying the port fix (if not already deployed), run a one-time DB cleanup:
```sql
-- Reset failure counts for hosts/monitors stuck from the port mismatch era
-- Only reset monitors with elevated failure counts AND no recent successful heartbeat
UPDATE uptime_hosts SET failure_count = 0, status = 'pending' WHERE status = 'down';
UPDATE uptime_monitors SET failure_count = 0, status = 'pending'
WHERE status = 'down'
  AND failure_count > 5
  AND id NOT IN (
    SELECT DISTINCT monitor_id FROM uptime_heartbeats
    WHERE status = 'up' AND created_at > datetime('now', '-24 hours')
  );
```
This could be automated in SyncMonitors() or done via a migration.
4. Data Flow Diagrams
Current Flow (Buggy)
```
[Proxy Host Created] → (no uptime action)
  → [Wait up to 60s for ticker]
  → SyncMonitors() creates monitor (status: "pending")
  → CheckAll() runs:
      → checkAllHosts() TCP to ForwardHost:ForwardPort
      → If host up → checkMonitor() HTTP to public URL
      → DB updated
  → [Wait up to 30s for frontend poll]
  → Frontend displays status
```
Proposed Flow (Fixed)
```
[Proxy Host Created]
  → SyncMonitors() or SyncAndCheckForHost() immediately
      → Monitor created (status: "pending")
      → Frontend shows "PENDING" (yellow indicator)
  → Immediate checkMonitor() in background goroutine
      → DB updated (status: "up" or "down")
  → Frontend polls in 30s → shows actual status
```
5. Implementation Plan
Phase 1: Playwright E2E Tests (Behavior Specification)
Define expected behavior before implementation:
| Test | Description |
|---|---|
| New proxy host monitor appears immediately | After creating a proxy host, navigate to Uptime page, verify the monitor card exists |
| New monitor shows pending state | Verify "PENDING" badge before first check completes |
| Monitor status updates after check | Trigger manual check, verify status changes from pending/down to up |
| Verify no false DOWN on first load | Create host, wait for background check, verify status is UP (not DOWN) |
Files: tests/monitoring/uptime-monitoring.spec.ts (extend existing suite)
Phase 2: Backend — Consolidate UptimeService Instance
- Remove second `NewUptimeService` call at `routes.go` line 414
- Pass `uptimeService` (line 226) to `NewUptimeHandler()`
- Verify all handler operations use the shared instance
- Update existing tests that may create multiple instances
Files: backend/internal/api/routes/routes.go
Phase 3: Backend — Immediate Monitor Lifecycle
- In `ProxyHostHandler.Create()`, after saving host: call `SyncMonitors()` or create a targeted `SyncAndCheckForHost()` method
- Add `CheckAll()` call after initial `SyncMonitors()` in the background goroutine
- Consider adding a `SyncAndCheckForHost(hostID uint)` method to `UptimeService` that:
  - Finds or creates the monitor for the given proxy host
  - Immediately runs `checkMonitor()` in a goroutine
  - Returns the monitor ID for the caller
Files: backend/internal/services/uptime_service.go, backend/internal/api/handlers/proxy_host_handler.go, backend/internal/api/routes/routes.go
Phase 4: Frontend — Pending State Display
- Add `isPending` check in `MonitorCard` component
- Add yellow/gray styling for pending state
- Add pulsing animation for pending badge
- Add i18n key `uptime.pending` → "CHECKING..." for all 5 supported languages (not just the default locale)
- Ensure heartbeat bar handles zero-length history gracefully
Files: frontend/src/pages/Uptime.tsx, frontend/src/i18n/ locale files
Phase 5: Backend — Optional check_url Field (Enhancement)
- Add `CheckURL` field to `UptimeMonitor` model
- Update `checkMonitor()` to use `CheckURL` if set
- Update `SyncMonitors()` — do NOT overwrite user-configured `CheckURL`
- Update API DTOs for create/update
Files: backend/internal/models/uptime.go, backend/internal/services/uptime_service.go, backend/internal/api/handlers/uptime_handler.go
Phase 6: Frontend — Health Check URL in Edit Modal
- Add optional "Health Check URL" field to `EditMonitorModal` and `CreateMonitorModal`
- Show placeholder text: "Leave empty to use monitor URL"
- Validate URL format on frontend
Files: frontend/src/pages/Uptime.tsx
Phase 7: Testing & Validation
- Run existing backend test suites (2178 LOC across 3 files)
- Add tests for:
  - Single `UptimeService` instance behavior
  - Immediate monitor creation on proxy host create
  - `CheckURL` fallback logic
  - "pending" → "up" transition
- Add edge case tests:
  - Rapid Create-Delete: Proxy host created and immediately deleted before `SyncAndCheckForHost` goroutine completes — goroutine should handle non-existent proxy host gracefully (no panic, no orphaned monitor)
  - Concurrent Creates: Multiple proxy hosts created simultaneously — verify `SyncMonitors()` from Create handlers doesn't conflict with background ticker's `SyncMonitors()` (no duplicate monitors, no data races)
  - Feature Flag Toggle: If `feature.uptime.enabled` is toggled to `false` while immediate check goroutine is running — goroutine should exit cleanly without writing stale results
  - CheckURL with redirects: `CheckURL` that 302-redirects to a private IP — redirect target must be validated against the deny list (SSRF redirect chain)
- Run Playwright E2E suite with Docker rebuild
- Verify coverage thresholds
Phase 8: Data Cleanup Migration
- Add one-time migration or startup hook to reset stale `failure_count` and `status` on hosts/monitors that were stuck from the port mismatch era
- Log the cleanup action
6. EARS Requirements
- WHEN a new proxy host is created, THE SYSTEM SHALL create a corresponding uptime monitor within 5 seconds (not waiting for the 1-minute ticker)
- WHEN a new uptime monitor is created, THE SYSTEM SHALL immediately trigger a health check in a background goroutine
- WHEN a monitor has status "pending" and no heartbeat history, THE SYSTEM SHALL display a distinct visual indicator (not DOWN red)
- WHEN the server starts, THE SYSTEM SHALL run `CheckAll()` immediately after `SyncMonitors()` (not wait for first tick)
- THE SYSTEM SHALL use a single `UptimeService` instance for both background checks and API handlers
- WHERE a monitor has a `check_url` configured, THE SYSTEM SHALL use it for health checks instead of the monitor URL
- WHEN a monitor's host-level TCP check succeeds but HTTP check fails, THE SYSTEM SHALL record the specific failure reason in the heartbeat message
- IF the uptime feature flag is disabled, THEN THE SYSTEM SHALL skip all monitor sync and check operations
7. Acceptance Criteria
Must Have
- WHEN a new proxy host is created, a corresponding uptime monitor exists within 5 seconds
- WHEN a new uptime monitor is created, an immediate health check runs
- WHEN a monitor has status "pending", a distinct yellow/gray visual indicator is shown (not red DOWN)
- WHEN the server starts, `CheckAll()` runs immediately after `SyncMonitors()`
- Only one `UptimeService` instance exists at runtime
Should Have
- WHEN a monitor has a `check_url` configured, it is used for health checks
- WHEN a monitor's host-level TCP check succeeds but HTTP check fails, the heartbeat message contains the failure reason
- Stale `failure_count` values from the port mismatch era are reset on deployment
Nice to Have
- Dedicated UI indicator for "first check in progress" (animated pulse)
- Automatic detection of health endpoints (try `/health` first, fall back to `/`)
8. PR Slicing Strategy
Decision: 3 PRs
Trigger reasons: Cross-domain changes (backend + frontend + model), independent concerns (UX fix vs backend architecture vs new feature), review size management.
PR-1: Backend Bug Fixes (Architecture + Lifecycle)
Scope: Phases 2, 3, and initial CheckAll (Section 3.4)
Files:
- `backend/internal/api/routes/routes.go` — consolidate to single UptimeService instance, add CheckAll after initial sync
- `backend/internal/services/uptime_service.go` — add `SyncAndCheckForHost()` method
- `backend/internal/api/handlers/proxy_host_handler.go` — call SyncAndCheckForHost on Create
- Backend test files — update for single instance, add new lifecycle tests
- Data cleanup migration
- `ARCHITECTURE.md` — update to reflect the UptimeService singleton consolidation (architecture change)
Dependencies: None (independent of frontend changes)
Validation: All backend tests pass, no duplicate UptimeService instantiation, new proxy hosts get immediate monitors, ARCHITECTURE.md reflects current design
Rollback: Revert commit; behavior returns to previous (ticker-based) lifecycle
PR-2: Frontend Pending State
Scope: Phase 4
Files:
- `frontend/src/pages/Uptime.tsx` — add pending state handling
- `frontend/src/i18n/` locale files — add `uptime.pending` key
- `frontend/src/pages/__tests__/Uptime.spec.tsx` — update tests
Dependencies: Works independently of PR-1 (pending state display improves UX regardless of backend fix timing)
Validation: Playwright E2E tests pass, pending monitors show yellow indicator
Rollback: Revert commit; pending monitors display as DOWN (existing behavior)
PR-3: Custom Health Check URL (Enhancement)
Scope: Phases 5, 6
Files:
- `backend/internal/models/uptime.go` — add CheckURL field
- `backend/internal/services/uptime_service.go` — use CheckURL in checkMonitor
- `backend/internal/api/handlers/uptime_handler.go` — update DTOs
- `frontend/src/pages/Uptime.tsx` — add form field
- Test files — add coverage for CheckURL logic
Dependencies: PR-1 should be merged first (shared instance simplifies testing)
Validation: Create monitor with custom health URL, verify check uses it
Rollback: Revert commit; GORM auto-migration adds the column but it remains unused
9. Risk Assessment
| Risk | Severity | Likelihood | Mitigation |
|---|---|---|---|
| Consolidating UptimeService instance introduces race conditions | High | Low | Existing mutex protections are designed for shared use; run race tests with -race flag |
| Immediate SyncMonitors on proxy host create adds latency to API response | Medium | Medium | Run SyncAndCheckForHost in a goroutine; return HTTP 201 immediately |
| "pending" UI state confuses users who expect UP/DOWN binary | Low | Low | Clear tooltip/label: "Initial health check in progress..." |
| CheckURL allows SSRF if user provides malicious URL | High | Low | Layered SSRF defense (see Section 3.5.1): write-time validation (scheme, length, parse), check-time re-validation, DNS rebinding protection (pin resolved IP against deny list), redirect chain validation. Allow loopback for self-referencing checks; block cloud metadata IPs (169.254.169.254, fd00::, etc.) |
| Data cleanup migration resets legitimate DOWN status | Medium | Medium | Only reset monitors with elevated failure counts AND no recent successful heartbeat |
| Self-referencing check (Charon) still fails due to Docker DNS | Medium | High | PR-3 scope: When SyncMonitors() creates a monitor, if ForwardHost resolves to loopback (localhost, 127.0.0.1, or the container's own hostname), automatically set CheckURL to http://{ForwardHost}:{ForwardPort}/ to bypass the DNS/Caddy round-trip. Tracked as technical debt if deferred beyond PR-3 |
10. Validation Plan (Mandatory Sequence)
1. E2E environment prerequisite
   - Determine rebuild necessity per testing policy: if application/runtime or Docker input changes are present, rebuild is required.
   - If rebuild is required or the container is unhealthy, run `.github/skills/scripts/skill-runner.sh docker-rebuild-e2e`.
   - Record container health outcome before executing tests.
2. Playwright first
   - Run targeted uptime monitoring E2E scenarios.
3. Local patch coverage preflight
   - Generate `test-results/local-patch-report.md` and `test-results/local-patch-report.json`.
4. Unit and coverage
   - Backend coverage run (threshold >= 85%).
   - Frontend coverage run (threshold >= 85%).
5. Race condition tests
   - Run `go test -race ./backend/internal/services/...` to verify single-instance thread safety.
6. Type checks
   - Frontend TypeScript check.
7. Pre-commit
   - `pre-commit run --all-files` with zero blocking failures.
8. Security scans
   - CodeQL Go + JS (security-and-quality).
   - GORM security scan (model changes in PR-3).
   - Trivy scan.
9. Build verification
   - Backend build + frontend build pass.
11. Architecture Reference
Two-Level Check System
```
Level 1: Host-Level TCP Pre-Check
├── Purpose: Quickly determine if backend host/container is reachable
├── Method: TCP connection to ForwardHost:ForwardPort
├── Runs: Once per unique UptimeHost
├── If DOWN → Skip all Level 2 checks, mark all monitors DOWN
└── If UP → Proceed to Level 2

Level 2: Service-Level HTTP/TCP Check
├── Purpose: Verify specific service is responding correctly
├── Method: HTTP GET to monitor URL (or CheckURL if set)
├── Runs: Per-monitor (in parallel goroutines)
└── Accepts: 2xx, 3xx, 401, 403 as "up"
```
Background Ticker Flow
```
Server Start → Sleep 30s → SyncMonitors()
  → [PROPOSED] CheckAll()
  → Start 1-minute ticker
  → Each tick: SyncMonitors() → CheckAll()
      → checkAllHosts() [parallel, staggered]
      → Group monitors by host
      → For each host:
          If down → markHostMonitorsDown()
          If up   → checkMonitor() per monitor [parallel goroutines]
```
Key Configuration Values
| Setting | Value | Source |
|---|---|---|
| `batchWindow` | 30s | `NewUptimeService()` |
| `TCPTimeout` | 10s | `NewUptimeService()` |
| `MaxRetries` (host) | 2 | `NewUptimeService()` |
| `FailureThreshold` (host) | 2 | `NewUptimeService()` |
| `CheckTimeout` | 60s | `NewUptimeService()` |
| `StaggerDelay` | 100ms | `NewUptimeService()` |
| `MaxRetries` (monitor) | 3 | `UptimeMonitor.MaxRetries` default |
| Ticker interval | 1 min | `routes.go` ticker |
| Frontend poll interval | 30s | `Uptime.tsx` refetchInterval |
| History poll interval | 60s | `MonitorCard` refetchInterval |
12. Rollback and Contingency
- PR-1: If consolidating UptimeService causes regressions → revert commit; background checker and API revert to two separate instances (existing behavior).
- PR-2: If pending state display causes confusion → revert commit; monitors display DOWN for pending (existing behavior).
- PR-3: If CheckURL introduces SSRF or regressions → revert commit; column stays in DB but is unused.
- Data cleanup: If migration resets legitimate DOWN hosts → restore from SQLite backup (standard Charon backup flow).
Post-rollback smoke checks:
- Verify background ticker creates monitors for all proxy hosts
- Verify manual health check button produces correct status
- Verify notification batching works correctly