# Uptime Monitoring Diagnosis: Wizarr Host False "Down" Status

## Summary

**Issue**: A newly created Wizarr proxy host shows as "down" in uptime monitoring, despite the domain working correctly when accessed by users.

**Root Cause**: A port mismatch in the host-level TCP connectivity check. The `checkHost()` function extracts the port from the public URL (443 for HTTPS) when it should use the actual backend `forward_port` (5690 for Wizarr).

**Status**: Identified; fix required

## Detailed Analysis

### 1. Code Location

**Primary Issue**: `backend/internal/services/uptime_service.go`

- **Function**: `checkHost()` (lines 359-402)
- **Logic Flow**: `checkAllHosts()` → `checkHost()` → `CheckAll()` → `checkMonitor()`

### 2. How Uptime Monitoring Works

#### Two-Level Check System

1. **Host-Level Pre-Check** (TCP connectivity)
   - Runs first via `checkAllHosts()` → `checkHost()`
   - Groups services by their backend `forward_host` (e.g., `172.20.0.11`)
   - Attempts a TCP connection to determine whether the host is reachable
   - If the host is DOWN, marks all monitors on that host as down **without** checking individual services

2. **Service-Level Check** (HTTP/HTTPS)
   - Runs only if the host-level check passes
   - Performs an actual HTTP GET to the public URL
   - Accepts 2xx, 3xx, 401, and 403 as "up"
   - Correctly handles redirects (302)

### 3. The Bug

In `checkHost()` at line 375:

```go
for _, monitor := range monitors {
    port := extractPort(monitor.URL) // gets the port from the public URL
    if port == "" {
        continue
    }

    // Tries to connect using the extracted port
    addr := net.JoinHostPort(host.Host, port) // 172.20.0.11:443
    conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
    // ... connection handling
}
```

**Problem**:

- `monitor.URL` is the **public URL**: `https://wizarr.hatfieldhosted.com`
- `extractPort()` returns `443` (the HTTPS default)
- But the Wizarr backend actually runs on `172.20.0.11:5690`
- The TCP connection to `172.20.0.11:443` **fails** (no service listening)
- The host is marked "down"
- All monitors on that host are marked "down" without individual checks

### 4. Evidence from Logs and Database

#### Heartbeat Records (Most Recent First)

```
down|Host unreachable|0|2025-12-22 21:29:05
up|HTTP 200|64|2025-12-22 21:29:04
down|Host unreachable|0|2025-12-22 21:01:26
up|HTTP 200|47|2025-12-22 21:00:19
```

**Pattern**: Alternating between successful HTTP checks and host-level failures.

#### Database State

```sql
-- uptime_monitors
name: Wizarr
url: https://wizarr.hatfieldhosted.com
status: down
failure_count: 3
max_retries: 3

-- uptime_hosts
id: 0c764438-35ff-451f-822a-7297f39f39d4
name: Wizarr
host: 172.20.0.11
status: down          -- ← this is causing the problem

-- proxy_hosts
name: Wizarr
domain_names: wizarr.hatfieldhosted.com
forward_host: 172.20.0.11
forward_port: 5690    -- ← this is the actual port!
```

#### Caddy Access Logs

The uptime check succeeds at the HTTP level:

```
172.20.0.1 → GET /       → 302 → /admin
172.20.0.1 → GET /admin  → 302 → /login
172.20.0.1 → GET /login  → 200 OK (16905 bytes)
```

### 5. Why Other Hosts Don't Have This Issue

Checking working hosts (using Radarr as an example):

```sql
-- Radarr (working)
forward_host: 100.99.23.57
forward_port: 7878
url: https://radarr.hatfieldhosted.com

-- The 302 redirect logic works correctly:
-- GET / → 302 → /login
```

**Why it works**: For services that redirect on the root path, the HTTP check succeeds with 200-399 status codes. The port mismatch issue exists for all hosts, but:

1. **If the `forward_port` happens to be a standard port** (80, 443, 8080) that `extractPort()` can return, the check may pass by coincidence
2. **If the host IP doesn't respond on that port**, the TCP check fails
3. **Wizarr uses port 5690**, a non-standard port that `extractPort()` will never return

### 6. Additional Context

The uptime monitoring feature was recently enhanced with host-level grouping to:

- Reduce check overhead for multiple services on the same host
- Provide consolidated DOWN notifications
- Avoid individual checks when a host is unreachable

This is a good architectural decision, but the port extraction logic has a bug.

## Root Cause Summary

**The `checkHost()` function extracts the port from the monitor's public URL instead of using the actual backend `forward_port` from the proxy host configuration.**

### Why This Happens

1. `UptimeMonitor` stores the public URL (e.g., `https://wizarr.hatfieldhosted.com`)
2. `UptimeHost` only stores the `forward_host` IP, not the port
3. `checkHost()` tries to extract a port from the monitor URLs
4. For HTTPS URLs, it extracts 443
5. The Wizarr backend is on `172.20.0.11:5690`, not `:443`
6. The TCP connection fails → host marked down → monitor marked down

## Proposed Fixes

### Option 1: Store Forward Port in UptimeHost

**Changes Required**:

1. Add a `Ports` field to the `UptimeHost` model:

```go
type UptimeHost struct {
    // ... existing fields
    Ports []int `json:"ports" gorm:"-"` // not stored; computed on the fly
}
```

2. Modify `checkHost()` to try all ports associated with monitors on that host:

```go
// Collect the unique ports of all monitors on this host
portSet := make(map[int]bool)
for _, monitor := range monitors {
    if monitor.ProxyHostID != nil {
        var proxyHost models.ProxyHost
        if err := s.DB.First(&proxyHost, *monitor.ProxyHostID).Error; err == nil {
            portSet[proxyHost.ForwardPort] = true
        }
    }
}

// The host counts as up if any port accepts a TCP connection
success := false
for port := range portSet {
    addr := net.JoinHostPort(host.Host, strconv.Itoa(port))
    conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
    if err == nil {
        conn.Close()
        success = true
        break
    }
}
```

**Pros**:

- Checks actual backend ports
- More accurate for non-standard ports
- Minimal schema changes

**Cons**:

- Requires database queries in the check loop
- More complex logic

### Option 2: Store ForwardPort Reference in UptimeMonitor

**Changes Required**:

1. Add a `ForwardPort` field to `UptimeMonitor`:

```go
type UptimeMonitor struct {
    // ... existing fields
    ForwardPort int `json:"forward_port"`
}
```

2. Update `SyncMonitors()` to populate it:

```go
monitor = models.UptimeMonitor{
    // ... existing fields
    ForwardPort: host.ForwardPort,
}
```

3. Update `checkHost()` to use the stored forward port:

```go
for _, monitor := range monitors {
    port := monitor.ForwardPort
    if port == 0 {
        continue
    }
    addr := net.JoinHostPort(host.Host, strconv.Itoa(port))
    // ... rest of logic
}
```

**Pros**:

- Simple; no extra DB queries
- Forward port readily available

**Cons**:

- Schema migration required
- Duplication of data (the port is stored in both ProxyHost and UptimeMonitor)

### Option 3: Skip Host-Level Check for Non-Standard Ports

**Temporary workaround**; not recommended for production.

Only perform host-level checks for monitors on standard ports (80, 443, 8080).

### Option 4: Use ProxyHost Forward Port Directly (Simplest)

**Changes Required**:

Modify `checkHost()` to query the proxy host for each monitor to get the actual forward port:

```go
// In checkHost(), replace the port extraction:
for _, monitor := range monitors {
    var port int

    if monitor.ProxyHostID != nil {
        var proxyHost models.ProxyHost
        if err := s.DB.First(&proxyHost, *monitor.ProxyHostID).Error; err == nil {
            port = proxyHost.ForwardPort
        }
    } else {
        // Fall back to URL extraction for non-proxy monitors
        portStr := extractPort(monitor.URL)
        if portStr != "" {
            port, _ = strconv.Atoi(portStr)
        }
    }

    if port == 0 {
        continue
    }

    addr := net.JoinHostPort(host.Host, strconv.Itoa(port))
    conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
    // ... rest of check
}
```

**Pros**:

- No schema changes
- Works immediately
- Handles both proxy hosts and standalone monitors

**Cons**:

- Database query in the check loop (but monitors are already cached)
- Slight performance overhead

## Recommended Solution

**Option 4** (Use ProxyHost Forward Port Directly) is recommended because:

1. No schema migration required
2. Simple fix, easy to test
3. Minimal performance impact (monitors already queried)
4. Can be deployed immediately
5. Handles edge cases (standalone monitors)

## Testing Plan

1. **Unit Test**: Add a test case for a host check on a non-standard port
2. **Integration Test**:
   - Create a proxy host with a non-standard forward port
   - Verify the host-level check uses the correct port
   - Verify the monitor status updates correctly
3. **Manual Test**:
   - Apply the fix
   - Wait for the next uptime check cycle (60 seconds)
   - Verify Wizarr shows as "up"
   - Verify no other monitors are affected

## Debugging Commands

```bash
# Check Wizarr monitor status
docker compose -f docker-compose.test.yml exec charon sh -c \
  "sqlite3 /app/data/charon.db \"SELECT name, status, failure_count, url FROM uptime_monitors WHERE name = 'Wizarr';\""

# Check Wizarr host status
docker compose -f docker-compose.test.yml exec charon sh -c \
  "sqlite3 /app/data/charon.db \"SELECT name, host, status FROM uptime_hosts WHERE name = 'Wizarr';\""

# Check recent heartbeats
docker compose -f docker-compose.test.yml exec charon sh -c \
  "sqlite3 /app/data/charon.db \"SELECT status, message, created_at FROM uptime_heartbeats WHERE monitor_id = 'eed56336-e646-4cf5-a3fc-ac4d2dd8760e' ORDER BY created_at DESC LIMIT 5;\""

# Check the Wizarr proxy host config
docker compose -f docker-compose.test.yml exec charon sh -c \
  "sqlite3 /app/data/charon.db \"SELECT name, forward_host, forward_port FROM proxy_hosts WHERE name = 'Wizarr';\""

# Monitor real-time uptime checks in the logs
docker compose -f docker-compose.test.yml logs -f charon | grep -i "wizarr\|uptime"
```

## Related Files

- `backend/internal/services/uptime_service.go` - main uptime service
- `backend/internal/models/uptime.go` - UptimeMonitor model
- `backend/internal/models/uptime_host.go` - UptimeHost model
- `backend/internal/services/uptime_service_test.go` - unit tests

## References

- Issue created: 2025-12-23
- Related feature: host-level uptime grouping
- Related PR: [Reference to ACL/permission changes if applicable]

---

**Next Steps**: Implement the Option 4 fix and add test coverage for non-standard port scenarios.