chore: clean .gitignore cache
@@ -1,528 +0,0 @@
# Uptime Monitoring

Charon's uptime monitoring system continuously checks the availability of your proxy hosts and alerts you when issues occur. The system is designed to minimize false positives while quickly detecting real problems.

## Overview

Uptime monitoring performs automated health checks on your proxy hosts at regular intervals, tracking:

- **Host availability** (TCP connectivity)
- **Response times** (latency measurements)
- **Status history** (uptime/downtime tracking)
- **Failure patterns** (debounced detection)

## How It Works

### Check Cycle

1. **Scheduled Checks**: Every 60 seconds (default), Charon checks all enabled hosts
2. **Port Detection**: Uses the proxy host's `ForwardPort` for TCP checks
3. **Connection Test**: Attempts TCP connection with configurable timeout
4. **Status Update**: Records success/failure in database
5. **Notification Trigger**: Sends alerts on status changes (if configured)

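The connection test in step 3 can be illustrated with a minimal Python sketch. This is not Charon's code (Charon's checks run in its own backend); the function name `tcp_check` and its return shape are assumptions made for illustration only. It shows what "TCP connectivity with a timeout" means in practice: the check passes as soon as a connection opens, and nothing is sent or read.

```python
import socket
import time


def tcp_check(host: str, port: int, timeout: float = 10.0):
    """Try to open a TCP connection; return (success, elapsed milliseconds)."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and connects within `timeout` seconds.
        with socket.create_connection((host, port), timeout=timeout):
            success = True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        success = False
    elapsed_ms = (time.monotonic() - start) * 1000
    return success, elapsed_ms
```

Calling `tcp_check("example.com", 8080)` would report whether the port accepted a connection and how long the attempt took, which roughly corresponds to the status and response time the dashboard displays.
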
### Failure Debouncing

To prevent false alarms from transient network issues, Charon uses **failure debouncing**:

**How it works:**

- A host must **fail 2 consecutive checks** before being marked "down"
- Single failures are logged but don't trigger status changes
- Counter resets immediately on any successful check

**Why this matters:**

- Network hiccups don't cause false alarms
- Container restarts don't trigger unnecessary alerts
- Transient DNS issues are ignored
- You only get notified about real problems

**Example scenario:**

```
Check 1: ✅ Success → Status: Up, Failure Count: 0
Check 2: ❌ Failed  → Status: Up, Failure Count: 1 (no alert)
Check 3: ❌ Failed  → Status: Down, Failure Count: 2 (alert sent!)
Check 4: ✅ Success → Status: Up, Failure Count: 0 (recovery alert)
```

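The debouncing rules can be modeled as a small state machine. The Python sketch below is an illustration under assumptions, not Charon's implementation: `HostState`, `record_check`, and `FAILURE_THRESHOLD` are hypothetical names, and only the behavior (threshold of 2, reset on success, alert on status change) comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional

FAILURE_THRESHOLD = 2  # consecutive failures required before "down" (documented default)


@dataclass
class HostState:
    status: str = "up"
    failure_count: int = 0


def record_check(state: HostState, success: bool) -> Optional[str]:
    """Update state after one check; return the new status if it changed, else None."""
    previous = state.status
    if success:
        # Any success resets the counter and marks the host up.
        state.failure_count = 0
        state.status = "up"
    else:
        state.failure_count += 1
        if state.failure_count >= FAILURE_THRESHOLD:
            state.status = "down"
    # A non-None return value is the point where an alert would be sent.
    return state.status if state.status != previous else None
```

Feeding the example scenario (success, failure, failure, success) through `record_check` yields no alert for the first two checks, a "down" transition on the third, and an "up" transition on the fourth.
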
## Configuration

### Timeout Settings

**Default TCP timeout:** 10 seconds

This timeout determines how long Charon waits for a TCP connection before considering it failed.

**Increase timeout if:**

- You have slow networks
- Hosts are geographically distant
- Containers take time to warm up
- You see intermittent false "down" alerts

**Decrease timeout if:**

- You want faster failure detection
- Your hosts are on the local network
- Response times are consistently fast

**Note:** Timeout settings are currently set in the backend configuration. A future release will make this configurable via the UI.

### Retry Behavior

When a check fails, Charon automatically retries:

- **Max retries:** 2 attempts in total
- **Retry delay:** 2 seconds between attempts
- **Timeout per attempt:** 10 seconds (configurable)

**Total check time calculation:**

```
Max time = (timeout × max_retries) + (retry_delay × (max_retries - 1))
         = (10s × 2) + (2s × 1)
         = 22 seconds worst case
```

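The worst-case formula is easy to evaluate for other settings. The helper below is a hypothetical Python sketch (the name `worst_case_check_seconds` and its parameters are not part of Charon); it restates the calculation above, treating the retry count as the total number of attempts.

```python
def worst_case_check_seconds(timeout: float, max_attempts: int, retry_delay: float) -> float:
    """Upper bound for one host check: every attempt times out, with a delay between attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # One timeout per attempt, plus one delay between each pair of consecutive attempts.
    return timeout * max_attempts + retry_delay * (max_attempts - 1)
```

With the documented defaults, `worst_case_check_seconds(10, 2, 2)` gives 22 seconds, which stays well inside the default 60-second check interval. If you raise the timeout, it is worth confirming that the result still fits within your interval.
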
### Check Interval

**Default:** 60 seconds

The interval between check cycles for all hosts.

**Performance considerations:**

- Shorter intervals = faster detection but higher CPU/network usage
- Longer intervals = lower overhead but slower failure detection
- Recommended: 30-120 seconds depending on criticality

## Enabling Uptime Monitoring

### For a Single Host

1. Navigate to **Proxy Hosts**
2. Click **Edit** on the host
3. Scroll to **Uptime Monitoring** section
4. Toggle **"Enable Uptime Monitoring"** to ON
5. Click **Save**

### For Multiple Hosts (Bulk)

1. Navigate to **Proxy Hosts**
2. Select checkboxes for hosts to monitor
3. Click **"Bulk Apply"** button
4. Find **"Uptime Monitoring"** section
5. Toggle the switch to **ON**
6. Check **"Apply to selected hosts"**
7. Click **"Apply Changes"**

## Monitoring Dashboard

### Host Status Display

Each monitored host shows:

- **Status Badge**: 🟢 Up / 🔴 Down
- **Response Time**: Last successful check latency
- **Uptime Percentage**: Success rate over time
- **Last Check**: Timestamp of most recent check

### Status Page

View all monitored hosts at a glance:

1. Navigate to **Dashboard** → **Uptime Status**
2. See real-time status of all hosts
3. Click any host for detailed history
4. Filter by status (up/down/all)

## Troubleshooting

### False Positive: Host Shown as Down but Actually Up

**Symptoms:**

- Host shows "down" in Charon
- Service is accessible directly
- Status changes back to "up" shortly after

**Common causes:**

1. **Timeout too short for slow network**

   **Solution:** Increase TCP timeout in configuration

2. **Container warmup time exceeds timeout**

   **Solution:** Use longer timeout or optimize container startup

3. **Network congestion during check**

   **Solution:** Debouncing (already enabled) should handle this automatically

4. **Firewall blocking health checks**

   **Solution:** Ensure Charon container can reach proxy host ports

5. **Multiple checks running concurrently**

   **Solution:** Automatic synchronization ensures checks complete before next cycle

**Diagnostic steps:**

```bash
# Check Charon logs for timing info
docker logs charon 2>&1 | grep "Host TCP check completed"

# Look for retry attempts
docker logs charon 2>&1 | grep "Retrying TCP check"

# Check failure count patterns
docker logs charon 2>&1 | grep "failure_count"

# View host status changes
docker logs charon 2>&1 | grep "Host status changed"
```

### False Negative: Host Shown as Up but Actually Down

**Symptoms:**

- Host shows "up" in Charon
- Service returns errors or is inaccessible
- No down alerts received

**Common causes:**

1. **TCP port open but service not responding**

   **Explanation:** Uptime monitoring only checks TCP connectivity, not application health

   **Solution:** Consider implementing application-level health checks (future feature)

2. **Service accepts connections but returns errors**

   **Solution:** Monitor application logs separately; TCP checks don't validate responses

3. **Partial service degradation**

   **Solution:** Use multiple monitoring providers for critical services

**Current limitation:** Charon performs TCP health checks only. HTTP-based health checks are planned for a future release.

### Intermittent Status Flapping

**Symptoms:**

- Status rapidly changes between up/down
- Multiple notifications in short time
- Logs show alternating success/failure

**Causes:**

1. **Marginal network conditions**

   **Solution:** Increase failure threshold (requires configuration change)

2. **Resource exhaustion on target host**

   **Solution:** Investigate target host performance, increase resources

3. **Shared network congestion**

   **Solution:** Consider dedicated monitoring network or VLAN

**Mitigation:**

The built-in debouncing (2 consecutive failures required) should prevent most flapping. If issues persist, check:

```bash
# Review consecutive check results
docker logs charon 2>&1 | grep -A 2 "Host TCP check completed" | grep "host_name"

# Check response time trends
docker logs charon 2>&1 | grep "elapsed_ms"
```

### No Notifications Received

**Checklist:**

1. ✅ Uptime monitoring is enabled for the host
2. ✅ Notification provider is configured and enabled
3. ✅ Provider is set to trigger on uptime events
4. ✅ Status has actually changed (check logs)
5. ✅ Debouncing threshold has been met (2 consecutive failures)

**Debug notifications:**

```bash
# Check for notification attempts
docker logs charon 2>&1 | grep "notification"

# Look for uptime-related notifications
docker logs charon 2>&1 | grep "uptime_down\|uptime_up"

# Verify notification service is working
docker logs charon 2>&1 | grep "Failed to send notification"
```

### High CPU Usage from Monitoring

**Symptoms:**

- Charon container using excessive CPU
- System becomes slow during check cycles
- Logs show slow check times

**Solutions:**

1. **Reduce number of monitored hosts**

   Monitor only critical services; disable monitoring for non-essential hosts

2. **Increase check interval**

   Change from 60s to 120s to reduce frequency

3. **Optimize Docker resource allocation**

   Ensure adequate CPU/memory allocated to Charon container

4. **Check for network issues**

   Slow DNS or network problems can cause checks to hang

**Monitor check performance:**

```bash
# View the 50 most recent check durations
docker logs charon 2>&1 | grep "elapsed_ms" | tail -50

# List completed check cycles (each line reports how many hosts were checked)
docker logs charon 2>&1 | grep "All host checks completed"
```

## Advanced Topics

### Port Detection

Charon automatically determines which port to check:

**Priority order:**

1. **ProxyHost.ForwardPort**: Preferred, most reliable
2. **URL extraction**: Fallback for hosts without proxy configuration
3. **Default ports**: 80 (HTTP) or 443 (HTTPS) if port not specified

**Example:**

```
Host: example.com
Forward Port: 8080
→ Checks: example.com:8080

Host: api.example.com
URL: https://api.example.com/health
Forward Port: (not set)
→ Checks: api.example.com:443
```

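The priority order can be expressed as a short fallback chain. The following Python sketch is illustrative only; `resolve_check_port` and `DEFAULT_PORTS` are hypothetical names, and the logic simply mirrors the three documented steps rather than Charon's actual code.

```python
from typing import Optional
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def resolve_check_port(forward_port: Optional[int], url: Optional[str]) -> Optional[int]:
    """Pick the port to probe: forward port, then explicit URL port, then scheme default."""
    if forward_port:
        return forward_port
    if url:
        parts = urlsplit(url)
        if parts.port is not None:
            return parts.port
        # Fall back to the well-known port for the URL scheme, if any.
        return DEFAULT_PORTS.get(parts.scheme)
    return None
```

For the two examples above, `resolve_check_port(8080, None)` selects 8080, and `resolve_check_port(None, "https://api.example.com/health")` falls through to the HTTPS default of 443.
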
### Concurrent Check Processing

All host checks run concurrently for better performance:

- Each host checked in separate goroutine
- WaitGroup ensures all checks complete before next cycle
- Prevents database race conditions
- No single slow host blocks other checks

**Performance characteristics:**

- **Sequential checks** (old): `time = hosts × timeout`
- **Concurrent checks** (current): `time = max(individual_check_times)`

**Example:** With 10 hosts and 10s timeout, in the worst case where every check runs to its timeout:

- Sequential: ~100 seconds
- Concurrent: ~10 seconds (when no check needs a retry)

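The timing difference can be reproduced with a small model. Charon itself uses goroutines and a WaitGroup; the Python sketch below is only an analogy built on a thread pool, and `run_cycle` and `simulated_check` are hypothetical names. It shows the same pattern: start every check at once, then wait for all of them before the cycle ends.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_check(host: str) -> bool:
    """Stand-in for a real TCP check: takes 0.2 s, fails for the host named 'bad'."""
    time.sleep(0.2)
    return host != "bad"


def run_cycle(hosts, check=simulated_check):
    """Run check(host) for every host in parallel and wait for all results."""
    if not hosts:
        return {}
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        futures = {host: pool.submit(check, host) for host in hosts}
    # Leaving the `with` block waits for every submitted check to finish,
    # which plays the role the WaitGroup plays in the description above.
    return {host: future.result() for host, future in futures.items()}
```

With five simulated hosts the whole cycle takes about as long as one check rather than five, because the slowest individual check determines the cycle time.
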
### Database Storage

Uptime data is stored efficiently:

**UptimeHost table:**

- `status`: Current status ("up"/"down")
- `failure_count`: Consecutive failure counter
- `last_check`: Timestamp of last check
- `response_time`: Last successful response time

**UptimeMonitor table:**

- Links monitors to proxy hosts
- Stores check configuration
- Tracks enabled state

**Heartbeat records** (future):

- Detailed history of each check
- Used for uptime percentage calculations
- Queryable for historical analysis

## Best Practices

### 1. Monitor Critical Services Only

Don't monitor every host. Focus on:

- Production services
- User-facing applications
- External dependencies
- Services with high-availability requirements

**Skip monitoring for:**

- Development/test instances
- Internal tools with built-in redundancy
- Services with their own monitoring

### 2. Configure Appropriate Notifications

**Critical services:**

- Multiple notification channels (Discord + Slack)
- Immediate alerts (no batching)
- On-call team notifications

**Non-critical services:**

- Single notification channel
- Digest/batch notifications (future feature)
- Email to team (low priority)

### 3. Review False Positives

If you receive false alarms:

1. Check logs to understand why
2. Adjust timeout if needed
3. Verify network stability
4. Consider increasing failure threshold (future config option)

### 4. Regular Status Review

Weekly review of:

- Uptime percentages (identify problematic hosts)
- Response time trends (detect degradation)
- Notification frequency (too many alerts?)
- False positive rate (refine configuration)

### 5. Combine with Application Monitoring

Uptime monitoring checks **availability**, not **functionality**.

Complement with:

- Application-level health checks
- Error rate monitoring
- Performance metrics (APM tools)
- User experience monitoring

## Planned Improvements

Future enhancements under consideration:

- [ ] **HTTP health check support** - Check specific endpoints with status code validation
- [ ] **Configurable failure threshold** - Adjust consecutive failure count via UI
- [ ] **Custom check intervals per host** - Different intervals for different criticality levels
- [ ] **Response time alerts** - Notify on degraded performance, not just failures
- [ ] **Notification batching** - Group multiple alerts to reduce noise
- [ ] **Maintenance windows** - Disable alerts during scheduled maintenance
- [ ] **Historical graphs** - Visual uptime trends over time
- [ ] **Status page export** - Public status page for external visibility

## Monitoring the Monitors

How do you know if Charon's monitoring is working?

**Check Charon's own health:**

```bash
# Verify check cycle is running
docker logs charon 2>&1 | grep "All host checks completed" | tail -5

# Confirm recent checks happened
docker logs charon 2>&1 | grep "Host TCP check completed" | tail -20

# Look for any errors in monitoring system
docker logs charon 2>&1 | grep "ERROR.*uptime\|ERROR.*monitor"
```

**Expected log pattern:**

```
INFO[...] All host checks completed    host_count=5
DEBUG[...] Host TCP check completed    elapsed_ms=156 host_name=example.com success=true
```

**Warning signs:**

- No "All host checks completed" messages in recent logs
- Checks taking longer than expected (>30s with 10s timeout)
- Frequent timeout errors
- High failure_count values

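For larger log volumes, the log pattern can be parsed instead of read by eye. The Python sketch below assumes log lines look like the expected pattern shown above, with unquoted `key=value` fields; real log output may differ (for example, quoted values), and `parse_check_line` and `slow_checks` are hypothetical helper names.

```python
import re

# Matches key=value pairs such as elapsed_ms=156 or host_name=example.com.
FIELD = re.compile(r"(\w+)=(\S+)")


def parse_check_line(line: str):
    """Return the fields of a "Host TCP check completed" line, or None for other lines."""
    if "Host TCP check completed" not in line:
        return None
    fields = dict(FIELD.findall(line))
    elapsed = fields.get("elapsed_ms", "")
    return {
        "host_name": fields.get("host_name"),
        "elapsed_ms": int(elapsed) if elapsed.isdigit() else None,
        "success": fields.get("success") == "true",
    }


def slow_checks(lines, threshold_ms=30_000):
    """Checks whose duration exceeds the threshold (default: the 30 s warning sign)."""
    parsed = (parse_check_line(line) for line in lines)
    return [c for c in parsed if c and c["elapsed_ms"] is not None and c["elapsed_ms"] > threshold_ms]
```

Passing the lines of `docker logs charon` to `slow_checks` would list the checks that match the "longer than expected" warning sign, along with the host each one belongs to.
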
## API Integration

Uptime monitoring data is accessible via API:

**Get uptime status:**

```http
GET /api/uptime/hosts
Authorization: Bearer <token>
```

**Response:**

```json
{
  "hosts": [
    {
      "id": "123",
      "name": "example.com",
      "status": "up",
      "last_check": "2025-12-24T10:30:00Z",
      "response_time": 156,
      "failure_count": 0,
      "uptime_percentage": 99.8
    }
  ]
}
```

**Programmatic monitoring:**

Use this API to integrate Charon's uptime data with:

- External monitoring dashboards (Grafana, etc.)
- Incident response systems (PagerDuty, etc.)
- Custom alerting tools
- Status page generators

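A custom alerting tool might consume the endpoint roughly as follows. This Python sketch relies only on the request and response shapes shown above; the `base_url` value, the function names, and the 99.0 uptime cutoff are assumptions for illustration, so adjust them to your deployment.

```python
import json
import urllib.request


def fetch_uptime(base_url: str, token: str) -> dict:
    """GET /api/uptime/hosts with a Bearer token and decode the JSON body."""
    request = urllib.request.Request(
        f"{base_url}/api/uptime/hosts",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def hosts_needing_attention(payload: dict, min_uptime: float = 99.0) -> list:
    """Names of hosts that are down or whose uptime_percentage is below min_uptime."""
    flagged = []
    for host in payload.get("hosts", []):
        if host.get("status") != "up" or host.get("uptime_percentage", 100.0) < min_uptime:
            flagged.append(host["name"])
    return flagged
```

A script could call `hosts_needing_attention(fetch_uptime(base_url, token))` on a schedule and forward any returned names to an incident system.
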
## Additional Resources

- [Notification Configuration Guide](notifications.md)
- [Proxy Host Setup](../getting-started.md)
- [Troubleshooting Guide](../troubleshooting/)
- [Security Best Practices](../security.md)

## Need Help?

- 💬 [Ask in Discussions](https://github.com/Wikid82/charon/discussions)
- 🐛 [Report Issues](https://github.com/Wikid82/charon/issues)
- 📖 [View Full Documentation](https://wikid82.github.io/charon/)