Uptime Monitoring
Charon's uptime monitoring system continuously checks the availability of your proxy hosts and alerts you when issues occur. The system is designed to minimize false positives while quickly detecting real problems.
Overview
Uptime monitoring performs automated health checks on your proxy hosts at regular intervals, tracking:
- Host availability (TCP connectivity)
- Response times (latency measurements)
- Status history (uptime/downtime tracking)
- Failure patterns (debounced detection)
How It Works
Check Cycle
- Scheduled Checks: Every 60 seconds (default), Charon checks all enabled hosts
- Port Detection: Uses the proxy host's ForwardPort for TCP checks
- Connection Test: Attempts a TCP connection with a configurable timeout
- Status Update: Records success/failure in database
- Notification Trigger: Sends alerts on status changes (if configured)
Failure Debouncing
To prevent false alarms from transient network issues, Charon uses failure debouncing:
How it works:
- A host must fail 2 consecutive checks before being marked "down"
- Single failures are logged but don't trigger status changes
- Counter resets immediately on any successful check
Why this matters:
- Network hiccups don't cause false alarms
- Container restarts don't trigger unnecessary alerts
- Transient DNS issues are ignored
- You only get notified about real problems
Example scenario:
Check 1: ✅ Success → Status: Up, Failure Count: 0
Check 2: ❌ Failed → Status: Up, Failure Count: 1 (no alert)
Check 3: ❌ Failed → Status: Down, Failure Count: 2 (alert sent!)
Check 4: ✅ Success → Status: Up, Failure Count: 0 (recovery alert)
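The debouncing rule above can be modeled as a small state machine. The following is an illustrative Python sketch, not Charon's actual Go implementation; the names `Debouncer` and `record` are hypothetical:

```python
FAILURE_THRESHOLD = 2  # consecutive failures required before a host is marked down

class Debouncer:
    """Tracks consecutive failures; flips status only at the threshold."""

    def __init__(self, threshold=FAILURE_THRESHOLD):
        self.threshold = threshold
        self.failure_count = 0
        self.status = "up"

    def record(self, success):
        """Record one check result; return an alert label on status change."""
        if success:
            self.failure_count = 0          # counter resets immediately
            if self.status == "down":
                self.status = "up"
                return "recovery alert"
            return None
        self.failure_count += 1
        if self.failure_count >= self.threshold and self.status == "up":
            self.status = "down"
            return "down alert"
        return None                          # single failure: logged, no alert

# Replaying the example scenario:
d = Debouncer()
d.record(True)    # Check 1: success  -> no alert
d.record(False)   # Check 2: failure 1 -> no alert
d.record(False)   # Check 3: failure 2 -> "down alert"
d.record(True)    # Check 4: success  -> "recovery alert"
```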
Configuration
Timeout Settings
Default TCP timeout: 10 seconds
This timeout determines how long Charon waits for a TCP connection before considering it failed.
Increase timeout if:
- You have slow networks
- Hosts are geographically distant
- Containers take time to warm up
- You see intermittent false "down" alerts
Decrease timeout if:
- You want faster failure detection
- Your hosts are on local network
- Response times are consistently fast
Note: Timeout settings are currently set in the backend configuration. A future release will make this configurable via the UI.
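A TCP availability check of this kind reduces to a timed connect attempt. A minimal Python sketch (Charon's backend is Go; the function name `tcp_check` is illustrative):

```python
import socket
import time

def tcp_check(host, port, timeout=10.0):
    """Attempt a TCP connection; return (success, elapsed_ms)."""
    start = time.monotonic()
    try:
        # create_connection resolves DNS and connects within `timeout` seconds
        with socket.create_connection((host, port), timeout=timeout):
            pass
        return True, (time.monotonic() - start) * 1000
    except OSError:
        return False, (time.monotonic() - start) * 1000
```

A slow network shows up here as `elapsed_ms` creeping toward the timeout ceiling, which is why raising the timeout eliminates marginal false "down" results.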
Retry Behavior
When a check fails, Charon automatically retries:
- Max retries: 2 attempts
- Retry delay: 2 seconds between attempts
- Timeout per attempt: 10 seconds (configurable)
Total check time calculation:
Max time = (timeout × max_retries) + (retry_delay × (max_retries - 1))
= (10s × 2) + (2s × 1)
= 22 seconds worst case
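The retry behavior above can be sketched as a wrapper around any check function. This is an illustrative Python model (parameter names are assumptions, not Charon's configuration keys); the worst-case formula falls out of the loop structure:

```python
import time

def check_with_retries(attempt, max_retries=2, retry_delay=2.0):
    """Run a zero-arg check callable up to `max_retries` times, sleeping
    `retry_delay` seconds between failed attempts (not after the last).
    Worst case = timeout * max_retries + retry_delay * (max_retries - 1)."""
    for i in range(max_retries):
        if attempt():
            return True
        if i < max_retries - 1:   # no delay after the final attempt
            time.sleep(retry_delay)
    return False
```

With the defaults above (10s timeout, 2 attempts, 2s delay) the loop bounds one full check at 22 seconds, matching the calculation.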
Check Interval
Default: 60 seconds
The interval between check cycles for all hosts.
Performance considerations:
- Shorter intervals = faster detection but higher CPU/network usage
- Longer intervals = lower overhead but slower failure detection
- Recommended: 30-120 seconds depending on criticality
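A fixed-interval scheduler of this kind typically sleeps for the remainder of the interval after each cycle, so a slow cycle does not push the schedule back. A hedged Python sketch (the `cycles` parameter is added here purely so the loop is testable; it is not a Charon setting):

```python
import time

def monitor_loop(run_cycle, interval=60.0, cycles=None):
    """Run check cycles at a fixed interval. `cycles=None` runs forever;
    a number limits iterations. Sleep time is interval minus cycle time."""
    done = 0
    while cycles is None or done < cycles:
        start = time.monotonic()
        run_cycle()
        done += 1
        elapsed = time.monotonic() - start
        if cycles is None or done < cycles:
            time.sleep(max(0.0, interval - elapsed))
```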
Enabling Uptime Monitoring
For a Single Host
- Navigate to Proxy Hosts
- Click Edit on the host
- Scroll to Uptime Monitoring section
- Toggle "Enable Uptime Monitoring" to ON
- Click Save
For Multiple Hosts (Bulk)
- Navigate to Proxy Hosts
- Select checkboxes for hosts to monitor
- Click "Bulk Apply" button
- Find "Uptime Monitoring" section
- Toggle the switch to ON
- Check "Apply to selected hosts"
- Click "Apply Changes"
Monitoring Dashboard
Host Status Display
Each monitored host shows:
- Status Badge: 🟢 Up / 🔴 Down
- Response Time: Last successful check latency
- Uptime Percentage: Success rate over time
- Last Check: Timestamp of most recent check
Status Page
View all monitored hosts at a glance:
- Navigate to Dashboard → Uptime Status
- See real-time status of all hosts
- Click any host for detailed history
- Filter by status (up/down/all)
Troubleshooting
False Positive: Host Shown as Down but Actually Up
Symptoms:
- Host shows "down" in Charon
- Service is accessible directly
- Status changes back to "up" shortly after
Common causes:
- Timeout too short for a slow network. Solution: increase the TCP timeout in configuration.
- Container warmup time exceeds the timeout. Solution: use a longer timeout or optimize container startup.
- Network congestion during the check. Solution: debouncing (already enabled) should handle this automatically.
- Firewall blocking health checks. Solution: ensure the Charon container can reach the proxy host ports.
- Multiple checks running concurrently. Solution: automatic synchronization ensures checks complete before the next cycle.
Diagnostic steps:
# Check Charon logs for timing info
docker logs charon 2>&1 | grep "Host TCP check completed"
# Look for retry attempts
docker logs charon 2>&1 | grep "Retrying TCP check"
# Check failure count patterns
docker logs charon 2>&1 | grep "failure_count"
# View host status changes
docker logs charon 2>&1 | grep "Host status changed"
False Negative: Host Shown as Up but Actually Down
Symptoms:
- Host shows "up" in Charon
- Service returns errors or is inaccessible
- No down alerts received
Common causes:
- TCP port open but service not responding. Explanation: uptime monitoring only checks TCP connectivity, not application health. Solution: consider application-level health checks (future feature).
- Service accepts connections but returns errors. Solution: monitor application logs separately; TCP checks don't validate responses.
- Partial service degradation. Solution: use multiple monitoring providers for critical services.
Current limitation: Charon performs TCP health checks only. HTTP-based health checks are planned for a future release.
Intermittent Status Flapping
Symptoms:
- Status rapidly changes between up/down
- Multiple notifications in short time
- Logs show alternating success/failure
Causes:
- Marginal network conditions. Solution: increase the failure threshold (requires a configuration change).
- Resource exhaustion on the target host. Solution: investigate target host performance; increase resources.
- Shared network congestion. Solution: consider a dedicated monitoring network or VLAN.
Mitigation:
The built-in debouncing (2 consecutive failures required) should prevent most flapping. If issues persist, check:
# Review consecutive check results
docker logs charon 2>&1 | grep -A 2 "Host TCP check completed" | grep "host_name"
# Check response time trends
docker logs charon 2>&1 | grep "elapsed_ms"
No Notifications Received
Checklist:
- ✅ Uptime monitoring is enabled for the host
- ✅ Notification provider is configured and enabled
- ✅ Provider is set to trigger on uptime events
- ✅ Status has actually changed (check logs)
- ✅ Debouncing threshold has been met (2 consecutive failures)
Debug notifications:
# Check for notification attempts
docker logs charon 2>&1 | grep "notification"
# Look for uptime-related notifications
docker logs charon 2>&1 | grep "uptime_down\|uptime_up"
# Verify notification service is working
docker logs charon 2>&1 | grep "Failed to send notification"
High CPU Usage from Monitoring
Symptoms:
- Charon container using excessive CPU
- System becomes slow during check cycles
- Logs show slow check times
Solutions:
- Reduce the number of monitored hosts: monitor only critical services; disable monitoring for non-essential hosts.
- Increase the check interval: change from 60s to 120s to reduce frequency.
- Optimize Docker resource allocation: ensure adequate CPU/memory is allocated to the Charon container.
- Check for network issues: slow DNS or network problems can cause checks to hang.
Monitor check performance:
# View check duration distribution
docker logs charon 2>&1 | grep "elapsed_ms" | tail -50
# Count concurrent checks
docker logs charon 2>&1 | grep "All host checks completed"
Advanced Topics
Port Detection
Charon automatically determines which port to check:
Priority order:
- ProxyHost.ForwardPort: Preferred, most reliable
- URL extraction: Fallback for hosts without proxy configuration
- Default ports: 80 (HTTP) or 443 (HTTPS) if port not specified
Example:
Host: example.com
Forward Port: 8080
→ Checks: example.com:8080
Host: api.example.com
URL: https://api.example.com/health
Forward Port: (not set)
→ Checks: api.example.com:443
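The priority order above can be sketched as a small resolver function. This is an illustrative Python model; the function name `resolve_check_port` and its parameters are assumptions, not Charon's actual field names:

```python
from urllib.parse import urlparse

def resolve_check_port(forward_port=None, url=None):
    """Pick the port to health-check, mirroring the documented priority."""
    if forward_port:                      # 1. ForwardPort: preferred
        return forward_port
    if url:                               # 2. Fall back to the URL, if any
        parsed = urlparse(url)
        if parsed.port:                   #    explicit port in the URL wins
            return parsed.port
        return 443 if parsed.scheme == "https" else 80  # 3. scheme default
    return 80                             # final fallback: plain HTTP
```

Replaying the examples: `resolve_check_port(forward_port=8080)` yields 8080, and `resolve_check_port(url="https://api.example.com/health")` yields 443.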
Concurrent Check Processing
All host checks run concurrently for better performance:
- Each host checked in separate goroutine
- WaitGroup ensures all checks complete before next cycle
- Prevents database race conditions
- No single slow host blocks other checks
Performance characteristics:
- Sequential checks (old): time = hosts × timeout
- Concurrent checks (current): time = max(individual_check_times)
Example: With 10 hosts and a 10s timeout, in the worst case (every check runs to its full timeout):
- Sequential: ~100 seconds
- Concurrent: ~10 seconds (bounded by the slowest single check)
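The Go backend fans out with goroutines and waits on a sync.WaitGroup; the same fan-out/wait pattern looks like this in Python with a thread pool (an illustrative sketch, not Charon's code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_check_cycle(hosts, check):
    """Check all hosts concurrently. The `with` block does not exit until
    every check has finished, playing the role of WaitGroup.Wait()."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        results = dict(zip(hosts, pool.map(check, hosts)))
    elapsed = time.monotonic() - start
    return results, elapsed

# With a simulated 0.1s check, 10 hosts finish in roughly one check's time,
# well under the ~1s a sequential loop would take:
hosts = [f"host{i}" for i in range(10)]
results, elapsed = run_check_cycle(hosts, lambda h: time.sleep(0.1) or True)
```

Collecting all results before the cycle ends is also what prevents two cycles from writing to the database at once.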
Database Storage
Uptime data is stored efficiently:
UptimeHost table:
- status: Current status ("up"/"down")
- failure_count: Consecutive failure counter
- last_check: Timestamp of last check
- response_time: Last successful response time
UptimeMonitor table:
- Links monitors to proxy hosts
- Stores check configuration
- Tracks enabled state
Heartbeat records (future):
- Detailed history of each check
- Used for uptime percentage calculations
- Queryable for historical analysis
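Once heartbeat records exist, an uptime percentage is just the success ratio over a window of them. A hedged sketch of how the stored history could be used (illustrative only; this is a planned feature, not current behavior):

```python
def uptime_percentage(heartbeats):
    """Percentage of successful checks in a window of heartbeat records,
    each represented here as a bool (True = check succeeded)."""
    if not heartbeats:
        return 100.0  # no data yet: report full availability (a policy choice)
    return 100.0 * sum(heartbeats) / len(heartbeats)

# 499 successes out of 500 checks -> 99.8, as in the API example below
uptime_percentage([True] * 499 + [False])
```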
Best Practices
1. Monitor Critical Services Only
Don't monitor every host. Focus on:
- Production services
- User-facing applications
- External dependencies
- High-availability requirements
Skip monitoring for:
- Development/test instances
- Internal tools with built-in redundancy
- Services with their own monitoring
2. Configure Appropriate Notifications
Critical services:
- Multiple notification channels (Discord + Slack)
- Immediate alerts (no batching)
- On-call team notifications
Non-critical services:
- Single notification channel
- Digest/batch notifications (future feature)
- Email to team (low priority)
3. Review False Positives
If you receive false alarms:
- Check logs to understand why
- Adjust timeout if needed
- Verify network stability
- Consider increasing failure threshold (future config option)
4. Regular Status Review
Weekly review of:
- Uptime percentages (identify problematic hosts)
- Response time trends (detect degradation)
- Notification frequency (too many alerts?)
- False positive rate (refine configuration)
5. Combine with Application Monitoring
Uptime monitoring checks availability, not functionality.
Complement with:
- Application-level health checks
- Error rate monitoring
- Performance metrics (APM tools)
- User experience monitoring
Planned Improvements
Future enhancements under consideration:
- HTTP health check support - Check specific endpoints with status code validation
- Configurable failure threshold - Adjust consecutive failure count via UI
- Custom check intervals per host - Different intervals for different criticality levels
- Response time alerts - Notify on degraded performance, not just failures
- Notification batching - Group multiple alerts to reduce noise
- Maintenance windows - Disable alerts during scheduled maintenance
- Historical graphs - Visual uptime trends over time
- Status page export - Public status page for external visibility
Monitoring the Monitors
How do you know if Charon's monitoring is working?
Check Charon's own health:
# Verify check cycle is running
docker logs charon 2>&1 | grep "All host checks completed" | tail -5
# Confirm recent checks happened
docker logs charon 2>&1 | grep "Host TCP check completed" | tail -20
# Look for any errors in monitoring system
docker logs charon 2>&1 | grep "ERROR.*uptime\|ERROR.*monitor"
Expected log pattern:
INFO[...] All host checks completed host_count=5
DEBUG[...] Host TCP check completed elapsed_ms=156 host_name=example.com success=true
Warning signs:
- No "All host checks completed" messages in recent logs
- Checks taking longer than expected (>30s with 10s timeout)
- Frequent timeout errors
- High failure_count values
API Integration
Uptime monitoring data is accessible via API:
Get uptime status:
GET /api/uptime/hosts
Authorization: Bearer <token>
Response:
{
"hosts": [
{
"id": "123",
"name": "example.com",
"status": "up",
"last_check": "2025-12-24T10:30:00Z",
"response_time": 156,
"failure_count": 0,
"uptime_percentage": 99.8
}
]
}
Programmatic monitoring:
Use this API to integrate Charon's uptime data with:
- External monitoring dashboards (Grafana, etc.)
- Incident response systems (PagerDuty, etc.)
- Custom alerting tools
- Status page generators
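Consuming the endpoint above from a script needs only the standard library. This sketch assumes the endpoint, token scheme, and response shape shown; `fetch_uptime_hosts` and `down_hosts` are hypothetical helper names:

```python
import json
import urllib.request

def fetch_uptime_hosts(base_url, token):
    """GET /api/uptime/hosts with a bearer token; return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/api/uptime/hosts",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def down_hosts(payload):
    """Names of hosts the response marks as down, for alerting/dashboards."""
    return [h["name"] for h in payload.get("hosts", []) if h["status"] == "down"]
```

For example, a cron job could call `down_hosts(fetch_uptime_hosts(url, token))` and forward any non-empty result to an incident response system.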