Charon/docs/plans/uptime_monitoring_diagnosis.md
2026-01-13 22:11:35 +00:00


# Uptime Monitoring Diagnosis: Wizarr Host False "Down" Status
## Summary
**Issue**: Newly created Wizarr Proxy Host shows as "down" in uptime monitoring, despite the domain working correctly when accessed by users.
**Root Cause**: Port mismatch in host-level TCP connectivity check. The `checkHost()` function extracts the port from the public URL (443 for HTTPS) but should be checking the actual backend `forward_port` (5690 for Wizarr).
**Status**: Identified - Fix Required
## Detailed Analysis
### 1. Code Location
**Primary Issue**: `backend/internal/services/uptime_service.go`
- **Function**: `checkHost()` (lines 359-402)
- **Logic Flow**: `checkAllHosts()` → `checkHost()` (host-level), then `CheckAll()` → `checkMonitor()` (service-level)
### 2. How Uptime Monitoring Works
#### Two-Level Check System
1. **Host-Level Pre-Check** (TCP connectivity)
- Runs first via `checkAllHosts()` → `checkHost()`
- Groups services by their backend `forward_host` (e.g., `172.20.0.11`)
- Attempts TCP connection to determine if host is reachable
- If host is DOWN, marks all monitors on that host as down **without** checking individual services
2. **Service-Level Check** (HTTP/HTTPS)
- Only runs if host-level check passes
- Performs actual HTTP GET to public URL
- Accepts 2xx, 3xx, 401, 403 as "up"
- Correctly handles redirects (302)
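The service-level acceptance rule above can be expressed as a small predicate. This is a sketch of the described behavior, not the literal code in `uptime_service.go`:

```go
package main

import "fmt"

// isUp reports whether an HTTP status code counts as "up" under the
// service-level rules: any 2xx or 3xx, plus 401 and 403 (auth-protected
// services are reachable even though they reject anonymous requests).
func isUp(status int) bool {
	if status >= 200 && status < 400 {
		return true
	}
	return status == 401 || status == 403
}

func main() {
	for _, code := range []int{200, 302, 401, 403, 404, 500} {
		fmt.Printf("%d -> %v\n", code, isUp(code))
	}
}
```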
### 3. The Bug
In `checkHost()` at line 375:
```go
for _, monitor := range monitors {
	port := extractPort(monitor.URL) // gets the port from the public URL
	if port == "" {
		continue
	}

	// Tries to connect using the extracted port
	addr := net.JoinHostPort(host.Host, port) // 172.20.0.11:443
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	// ...
}
```
**Problem**:
- `monitor.URL` is the **public URL**: `https://wizarr.hatfieldhosted.com`
- `extractPort()` returns `443` (HTTPS default)
- But Wizarr backend actually runs on `172.20.0.11:5690`
- TCP connection to `172.20.0.11:443` **fails** (no service listening)
- Host marked as "down"
- All monitors on that host marked "down" without individual checks
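The mismatch comes from default-port fallback. This minimal reimplementation of the *assumed* `extractPort()` behaviour (hypothetical; the real helper lives in the uptime service) shows why an HTTPS public URL always yields 443:

```go
package main

import (
	"fmt"
	"net/url"
)

// portForURL mirrors the assumed behaviour of extractPort(): use the
// explicit port if the URL carries one, otherwise fall back to the
// scheme default. This is what maps https://wizarr.hatfieldhosted.com
// to 443 even though the backend listens on 5690.
func portForURL(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return ""
	}
	if p := u.Port(); p != "" {
		return p
	}
	switch u.Scheme {
	case "https":
		return "443"
	case "http":
		return "80"
	}
	return ""
}

func main() {
	fmt.Println(portForURL("https://wizarr.hatfieldhosted.com")) // 443, not 5690
	fmt.Println(portForURL("http://example.com:8080"))           // 8080
}
```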
### 4. Evidence from Logs and Database
#### Heartbeat Records (Most Recent First)
```
down|Host unreachable|0|2025-12-22 21:29:05
up|HTTP 200|64|2025-12-22 21:29:04
down|Host unreachable|0|2025-12-22 21:01:26
up|HTTP 200|47|2025-12-22 21:00:19
```
**Pattern**: Alternating between successful HTTP checks and host-level failures.
#### Database State
```sql
-- uptime_monitors
name: Wizarr
url: https://wizarr.hatfieldhosted.com
status: down
failure_count: 3
max_retries: 3
-- uptime_hosts
id: 0c764438-35ff-451f-822a-7297f39f39d4
name: Wizarr
host: 172.20.0.11
status: down        -- ← this is causing the problem
-- proxy_hosts
name: Wizarr
domain_names: wizarr.hatfieldhosted.com
forward_host: 172.20.0.11
forward_port: 5690  -- ← the actual backend port
```
#### Caddy Access Logs
Uptime check succeeds at HTTP level:
```
172.20.0.1 → GET / → 302 → /admin
172.20.0.1 → GET /admin → 302 → /login
172.20.0.1 → GET /login → 200 OK (16905 bytes)
```
### 5. Why Other Hosts Don't Have This Issue
Checking working hosts (using Radarr as example):
```sql
-- Radarr (working)
forward_host: 100.99.23.57
forward_port: 7878
url: https://radarr.hatfieldhosted.com
-- 302 redirect logic works correctly:
GET / 302 /login
```
**Why it works**: for services that redirect on the root path, the HTTP check still succeeds because 3xx responses are accepted. The port mismatch exists for every host, but its impact depends on the backend port:
1. **If `forward_port` happens to be a standard port** (80, 443, 8080) that `extractPort()` can return, the TCP pre-check succeeds by coincidence
2. **If the host does not listen on the extracted port**, the TCP pre-check fails
3. **Wizarr uses port 5690**, a non-standard port that `extractPort()` will never derive from the public URL
### 6. Additional Context
The uptime monitoring feature was recently enhanced with host-level grouping to:
- Reduce check overhead for multiple services on same host
- Provide consolidated DOWN notifications
- Avoid individual checks when host is unreachable
This is a good architectural decision, but the port extraction logic has a bug.
## Root Cause Summary
**The `checkHost()` function extracts the port from the monitor's public URL instead of using the actual backend forward_port from the proxy host configuration.**
### Why This Happens
1. `UptimeMonitor` stores the public URL (e.g., `https://wizarr.hatfieldhosted.com`)
2. `UptimeHost` only stores the `forward_host` IP, not the port
3. `checkHost()` tries to extract port from monitor URLs
4. For HTTPS URLs, it extracts 443
5. Wizarr backend is on 172.20.0.11:5690, not :443
6. TCP connection fails → host marked down → monitor marked down
## Proposed Fixes
### Option 1: Store Forward Port in UptimeHost (Recommended)
**Changes Required**:
1. Add `Ports` field to `UptimeHost` model:
```go
type UptimeHost struct {
	// ... existing fields
	Ports []int `json:"ports" gorm:"-"` // not stored; computed on the fly
}
```
2. Modify `checkHost()` to try all ports associated with monitors on that host:
```go
// Collect unique ports from all monitors for this host
portSet := make(map[int]bool)
for _, monitor := range monitors {
	if monitor.ProxyHostID != nil {
		var proxyHost models.ProxyHost
		if err := s.DB.First(&proxyHost, *monitor.ProxyHostID).Error; err == nil {
			portSet[proxyHost.ForwardPort] = true
		}
	}
}

// Try connecting to any of the ports
success := false
for port := range portSet {
	addr := net.JoinHostPort(host.Host, strconv.Itoa(port))
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	// ... rest of logic
}
```
**Pros**:
- Checks actual backend ports
- More accurate for non-standard ports
- Minimal schema changes
**Cons**:
- Requires database queries in check loop
- More complex logic
### Option 2: Store ForwardPort Reference in UptimeMonitor
**Changes Required**:
1. Add `ForwardPort` field to `UptimeMonitor`:
```go
type UptimeMonitor struct {
	// ... existing fields
	ForwardPort int `json:"forward_port"`
}
```
2. Update `SyncMonitors()` to populate it:
```go
monitor = models.UptimeMonitor{
	// ... existing fields
	ForwardPort: host.ForwardPort,
}
```
3. Update `checkHost()` to use stored forward port:
```go
for _, monitor := range monitors {
	port := monitor.ForwardPort
	if port == 0 {
		continue
	}
	addr := net.JoinHostPort(host.Host, strconv.Itoa(port))
	// ... rest of logic
}
```
**Pros**:
- Simple, no extra DB queries
- Forward port readily available
**Cons**:
- Schema migration required
- Duplication of data (port stored in both ProxyHost and UptimeMonitor)
### Option 3: Skip Host-Level Check for Non-Standard Ports
**Temporary workaround** - not recommended for production.
Only perform host-level checks for monitors on standard ports (80, 443, 8080).
### Option 4: Use ProxyHost Forward Port Directly (Simplest)
**Changes Required**:
Modify `checkHost()` to query the proxy host for each monitor to get the actual forward port:
```go
// In checkHost(), replace the port extraction:
for _, monitor := range monitors {
	var port int
	if monitor.ProxyHostID != nil {
		var proxyHost models.ProxyHost
		if err := s.DB.First(&proxyHost, *monitor.ProxyHostID).Error; err == nil {
			port = proxyHost.ForwardPort
		}
	} else {
		// Fall back to URL extraction for non-proxy monitors
		portStr := extractPort(monitor.URL)
		if portStr != "" {
			port, _ = strconv.Atoi(portStr)
		}
	}
	if port == 0 {
		continue
	}
	addr := net.JoinHostPort(host.Host, strconv.Itoa(port))
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	// ... rest of check
}
```
**Pros**:
- No schema changes
- Works immediately
- Handles both proxy hosts and standalone monitors
**Cons**:
- Database query in check loop (but monitors are already cached)
- Slight performance overhead
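The per-monitor `First()` query can also be avoided by loading the host's proxy hosts in one `IN` query up front and deduplicating ports in memory. A self-contained sketch of the deduplication step; `ProxyHost` and `Monitor` here are simplified stand-ins for the real models:

```go
package main

import "fmt"

// ProxyHost and Monitor are simplified stand-ins for the models in
// backend/internal/models, carrying only the fields this sketch needs.
type ProxyHost struct {
	ID          string
	ForwardPort int
}

type Monitor struct {
	ProxyHostID *string
}

// uniquePorts gathers the distinct backend ports for a host's monitors,
// given a pre-loaded ID→ProxyHost map (one batched query instead of one
// query per monitor inside checkHost()).
func uniquePorts(monitors []Monitor, byID map[string]ProxyHost) []int {
	seen := make(map[int]bool)
	var ports []int
	for _, m := range monitors {
		if m.ProxyHostID == nil {
			continue
		}
		if ph, ok := byID[*m.ProxyHostID]; ok && !seen[ph.ForwardPort] {
			seen[ph.ForwardPort] = true
			ports = append(ports, ph.ForwardPort)
		}
	}
	return ports
}

func main() {
	id := "ph-1" // illustrative ID
	byID := map[string]ProxyHost{id: {ID: id, ForwardPort: 5690}}
	monitors := []Monitor{{ProxyHostID: &id}, {ProxyHostID: &id}, {}}
	fmt.Println(uniquePorts(monitors, byID)) // [5690]
}
```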
## Recommended Solution
**Option 4** (Use ProxyHost Forward Port Directly) is recommended because:
1. No schema migration required
2. Simple fix, easy to test
3. Minimal performance impact (monitors already queried)
4. Can be deployed immediately
5. Handles edge cases (standalone monitors)
## Testing Plan
1. **Unit Test**: Add test case for non-standard port host check
2. **Integration Test**:
- Create proxy host with non-standard forward port
- Verify host-level check uses correct port
- Verify monitor status updates correctly
3. **Manual Test**:
- Apply fix
- Wait for next uptime check cycle (60 seconds)
- Verify Wizarr shows as "up"
- Verify no other monitors affected
## Debugging Commands
```bash
# Check Wizarr monitor status
docker compose -f docker-compose.test.yml exec charon sh -c \
"sqlite3 /app/data/charon.db \"SELECT name, status, failure_count, url FROM uptime_monitors WHERE name = 'Wizarr';\""
# Check Wizarr host status
docker compose -f docker-compose.test.yml exec charon sh -c \
"sqlite3 /app/data/charon.db \"SELECT name, host, status FROM uptime_hosts WHERE name = 'Wizarr';\""
# Check recent heartbeats
docker compose -f docker-compose.test.yml exec charon sh -c \
"sqlite3 /app/data/charon.db \"SELECT status, message, created_at FROM uptime_heartbeats WHERE monitor_id = 'eed56336-e646-4cf5-a3fc-ac4d2dd8760e' ORDER BY created_at DESC LIMIT 5;\""
# Check Wizarr proxy host config
docker compose -f docker-compose.test.yml exec charon sh -c \
"sqlite3 /app/data/charon.db \"SELECT name, forward_host, forward_port FROM proxy_hosts WHERE name = 'Wizarr';\""
# Monitor real-time uptime checks in logs
docker compose -f docker-compose.test.yml logs -f charon | grep -i "wizarr\|uptime"
```
## Related Files
- `backend/internal/services/uptime_service.go` - Main uptime service
- `backend/internal/models/uptime.go` - UptimeMonitor model
- `backend/internal/models/uptime_host.go` - UptimeHost model
- `backend/internal/services/uptime_service_test.go` - Unit tests
## References
- Issue created: 2025-12-23
- Related feature: Host-level uptime grouping
- Related PR: [Reference to ACL/permission changes if applicable]
---
**Next Steps**: Implement Option 4 fix and add test coverage for non-standard port scenarios.