Uptime Monitoring

Charon's uptime monitoring system continuously checks the availability of your proxy hosts and alerts you when issues occur. The system is designed to minimize false positives while quickly detecting real problems.

Overview

Uptime monitoring performs automated health checks on your proxy hosts at regular intervals, tracking:

  • Host availability (TCP connectivity)
  • Response times (latency measurements)
  • Status history (uptime/downtime tracking)
  • Failure patterns (debounced detection)

How It Works

Check Cycle

  1. Scheduled Checks: Every 60 seconds (default), Charon checks all enabled hosts
  2. Port Detection: Uses the proxy host's ForwardPort for TCP checks
  3. Connection Test: Attempts TCP connection with configurable timeout
  4. Status Update: Records success/failure in database
  5. Notification Trigger: Sends alerts on status changes (if configured)

Failure Debouncing

To prevent false alarms from transient network issues, Charon uses failure debouncing:

How it works:

  • A host must fail 2 consecutive checks before being marked "down"
  • Single failures are logged but don't trigger status changes
  • Counter resets immediately on any successful check

Why this matters:

  • Network hiccups don't cause false alarms
  • Container restarts don't trigger unnecessary alerts
  • Transient DNS issues are ignored
  • You only get notified about real problems

Example scenario:

Check 1: ✅ Success → Status: Up, Failure Count: 0
Check 2: ❌ Failed  → Status: Up, Failure Count: 1  (no alert)
Check 3: ❌ Failed  → Status: Down, Failure Count: 2 (alert sent!)
Check 4: ✅ Success → Status: Up, Failure Count: 0  (recovery alert)
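
The scenario above follows a small state machine. A minimal sketch of that debounce logic, assuming a threshold of 2 consecutive failures as documented (HostState and Record are illustrative names):

```go
package main

import "fmt"

const failureThreshold = 2 // consecutive failures before a host is marked down

// HostState tracks the debounced up/down status for one host.
type HostState struct {
	Status       string // "up" or "down"
	FailureCount int
}

// Record applies one check result and reports whether the status
// changed, i.e. whether an alert should fire.
func (h *HostState) Record(success bool) (changed bool) {
	if success {
		h.FailureCount = 0 // counter resets immediately on any success
		if h.Status != "up" {
			h.Status = "up"
			return true // recovery alert
		}
		return false
	}
	h.FailureCount++
	if h.FailureCount >= failureThreshold && h.Status != "down" {
		h.Status = "down"
		return true // down alert only after 2 consecutive failures
	}
	return false // single failure: logged, no status change
}

func main() {
	h := &HostState{Status: "up"}
	for _, ok := range []bool{true, false, false, true} {
		alert := h.Record(ok)
		fmt.Printf("success=%v status=%s failures=%d alert=%v\n",
			ok, h.Status, h.FailureCount, alert)
	}
}
```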

Configuration

Timeout Settings

Default TCP timeout: 10 seconds

This timeout determines how long Charon waits for a TCP connection before considering it failed.

Increase timeout if:

  • Your network links are slow
  • Hosts are geographically distant
  • Containers take time to warm up
  • You see intermittent false "down" alerts

Decrease timeout if:

  • You want faster failure detection
  • Your hosts are on a local network
  • Response times are consistently fast

Note: Timeout settings are currently set in the backend configuration. A future release will make this configurable via the UI.

Retry Behavior

When a check fails, Charon automatically retries:

  • Max attempts: 2 (the initial check plus one retry)
  • Retry delay: 2 seconds between attempts
  • Timeout per attempt: 10 seconds (configurable)

Total check time calculation:

Max time = (timeout × max_retries) + (retry_delay × (max_retries - 1))
         = (10s × 2) + (2s × 1)
         = 22 seconds worst case

Check Interval

Default: 60 seconds

The interval between check cycles for all hosts.

Performance considerations:

  • Shorter intervals = faster detection but higher CPU/network usage
  • Longer intervals = lower overhead but slower failure detection
  • Recommended: 30-120 seconds depending on criticality

Enabling Uptime Monitoring

For a Single Host

  1. Navigate to Proxy Hosts
  2. Click Edit on the host
  3. Scroll to Uptime Monitoring section
  4. Toggle "Enable Uptime Monitoring" to ON
  5. Click Save

For Multiple Hosts (Bulk)

  1. Navigate to Proxy Hosts
  2. Select checkboxes for hosts to monitor
  3. Click "Bulk Apply" button
  4. Find "Uptime Monitoring" section
  5. Toggle the switch to ON
  6. Check "Apply to selected hosts"
  7. Click "Apply Changes"

Monitoring Dashboard

Host Status Display

Each monitored host shows:

  • Status Badge: 🟢 Up / 🔴 Down
  • Response Time: Last successful check latency
  • Uptime Percentage: Success rate over time
  • Last Check: Timestamp of most recent check

Status Page

View all monitored hosts at a glance:

  1. Navigate to Dashboard → Uptime Status
  2. See real-time status of all hosts
  3. Click any host for detailed history
  4. Filter by status (up/down/all)

Troubleshooting

False Positive: Host Shown as Down but Actually Up

Symptoms:

  • Host shows "down" in Charon
  • Service is accessible directly
  • Status changes back to "up" shortly after

Common causes:

  1. Timeout too short for slow network

    Solution: Increase TCP timeout in configuration

  2. Container warmup time exceeds timeout

    Solution: Use longer timeout or optimize container startup

  3. Network congestion during check

    Solution: Debouncing (already enabled) should handle this automatically

  4. Firewall blocking health checks

    Solution: Ensure Charon container can reach proxy host ports

  5. Multiple checks running concurrently

    Solution: Automatic synchronization ensures checks complete before next cycle

Diagnostic steps:

# Check Charon logs for timing info
docker logs charon 2>&1 | grep "Host TCP check completed"

# Look for retry attempts
docker logs charon 2>&1 | grep "Retrying TCP check"

# Check failure count patterns
docker logs charon 2>&1 | grep "failure_count"

# View host status changes
docker logs charon 2>&1 | grep "Host status changed"

False Negative: Host Shown as Up but Actually Down

Symptoms:

  • Host shows "up" in Charon
  • Service returns errors or is inaccessible
  • No down alerts received

Common causes:

  1. TCP port open but service not responding

    Explanation: Uptime monitoring only checks TCP connectivity, not application health

    Solution: Consider implementing application-level health checks (future feature)

  2. Service accepts connections but returns errors

    Solution: Monitor application logs separately; TCP checks don't validate responses

  3. Partial service degradation

    Solution: Use multiple monitoring providers for critical services

Current limitation: Charon performs TCP health checks only. HTTP-based health checks are planned for a future release.

Intermittent Status Flapping

Symptoms:

  • Status rapidly changes between up/down
  • Multiple notifications in short time
  • Logs show alternating success/failure

Causes:

  1. Marginal network conditions

    Solution: Increase failure threshold (requires configuration change)

  2. Resource exhaustion on target host

    Solution: Investigate target host performance, increase resources

  3. Shared network congestion

    Solution: Consider dedicated monitoring network or VLAN

Mitigation:

The built-in debouncing (2 consecutive failures required) should prevent most flapping. If issues persist, check:

# Review consecutive check results
docker logs charon 2>&1 | grep -A 2 "Host TCP check completed" | grep "host_name"

# Check response time trends
docker logs charon 2>&1 | grep "elapsed_ms"

No Notifications Received

Checklist:

  1. Uptime monitoring is enabled for the host
  2. Notification provider is configured and enabled
  3. Provider is set to trigger on uptime events
  4. Status has actually changed (check logs)
  5. Debouncing threshold has been met (2 consecutive failures)

Debug notifications:

# Check for notification attempts
docker logs charon 2>&1 | grep "notification"

# Look for uptime-related notifications
docker logs charon 2>&1 | grep "uptime_down\|uptime_up"

# Verify notification service is working
docker logs charon 2>&1 | grep "Failed to send notification"

High CPU Usage from Monitoring

Symptoms:

  • Charon container using excessive CPU
  • System becomes slow during check cycles
  • Logs show slow check times

Solutions:

  1. Reduce number of monitored hosts

    Monitor only critical services; disable monitoring for non-essential hosts

  2. Increase check interval

    Change from 60s to 120s to reduce frequency

  3. Optimize Docker resource allocation

    Ensure adequate CPU/memory allocated to Charon container

  4. Check for network issues

    Slow DNS or network problems can cause checks to hang

Monitor check performance:

# View check duration distribution
docker logs charon 2>&1 | grep "elapsed_ms" | tail -50

# Count concurrent checks
docker logs charon 2>&1 | grep "All host checks completed"

Advanced Topics

Port Detection

Charon automatically determines which port to check:

Priority order:

  1. ProxyHost.ForwardPort: Preferred, most reliable
  2. URL extraction: Fallback for hosts without proxy configuration
  3. Default ports: 80 (HTTP) or 443 (HTTPS) if port not specified

Example:

Host: example.com
Forward Port: 8080
→ Checks: example.com:8080

Host: api.example.com
URL: https://api.example.com/health
Forward Port: (not set)
→ Checks: api.example.com:443
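
The priority order above can be sketched as a small resolver using net/url; resolvePort is an illustrative name and the field handling is an assumption about how Charon maps its data:

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// resolvePort picks the port to health-check, following the documented
// priority: ForwardPort, then the port in the URL, then 80/443 by scheme.
func resolvePort(forwardPort int, rawURL string) int {
	if forwardPort > 0 {
		return forwardPort // 1. explicit forward port wins
	}
	if u, err := url.Parse(rawURL); err == nil {
		if p := u.Port(); p != "" {
			if port, err := strconv.Atoi(p); err == nil {
				return port // 2. port embedded in the URL
			}
		}
		if strings.EqualFold(u.Scheme, "https") {
			return 443 // 3. default by scheme
		}
	}
	return 80 // final fallback
}

func main() {
	fmt.Println(resolvePort(8080, ""))                            // 8080
	fmt.Println(resolvePort(0, "https://api.example.com/health")) // 443
	fmt.Println(resolvePort(0, "http://example.com:9000"))        // 9000
}
```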

Concurrent Check Processing

All host checks run concurrently for better performance:

  • Each host checked in separate goroutine
  • WaitGroup ensures all checks complete before next cycle
  • Prevents database race conditions
  • No single slow host blocks other checks

Performance characteristics:

  • Sequential checks (old): time = hosts × timeout
  • Concurrent checks (current): time = max(individual_check_times)

Example: With 10 hosts and 10s timeout:

  • Sequential: ~100 seconds minimum
  • Concurrent: ~10 seconds (if all succeed on first try)

Database Storage

Uptime data is stored efficiently:

UptimeHost table:

  • status: Current status ("up"/"down")
  • failure_count: Consecutive failure counter
  • last_check: Timestamp of last check
  • response_time: Last successful response time

UptimeMonitor table:

  • Links monitors to proxy hosts
  • Stores check configuration
  • Tracks enabled state

Heartbeat records (future):

  • Detailed history of each check
  • Used for uptime percentage calculations
  • Queryable for historical analysis

Best Practices

1. Monitor Critical Services Only

Don't monitor every host. Focus on:

  • Production services
  • User-facing applications
  • External dependencies
  • High-availability requirements

Skip monitoring for:

  • Development/test instances
  • Internal tools with built-in redundancy
  • Services with their own monitoring

2. Configure Appropriate Notifications

Critical services:

  • Multiple notification channels (Discord + Slack)
  • Immediate alerts (no batching)
  • On-call team notifications

Non-critical services:

  • Single notification channel
  • Digest/batch notifications (future feature)
  • Email to team (low priority)

3. Review False Positives

If you receive false alarms:

  1. Check logs to understand why
  2. Adjust timeout if needed
  3. Verify network stability
  4. Consider increasing failure threshold (future config option)

4. Regular Status Review

Weekly review of:

  • Uptime percentages (identify problematic hosts)
  • Response time trends (detect degradation)
  • Notification frequency (too many alerts?)
  • False positive rate (refine configuration)

5. Combine with Application Monitoring

Uptime monitoring checks availability, not functionality.

Complement with:

  • Application-level health checks
  • Error rate monitoring
  • Performance metrics (APM tools)
  • User experience monitoring

Planned Improvements

Future enhancements under consideration:

  • HTTP health check support - Check specific endpoints with status code validation
  • Configurable failure threshold - Adjust consecutive failure count via UI
  • Custom check intervals per host - Different intervals for different criticality levels
  • Response time alerts - Notify on degraded performance, not just failures
  • Notification batching - Group multiple alerts to reduce noise
  • Maintenance windows - Disable alerts during scheduled maintenance
  • Historical graphs - Visual uptime trends over time
  • Status page export - Public status page for external visibility

Monitoring the Monitors

How do you know if Charon's monitoring is working?

Check Charon's own health:

# Verify check cycle is running
docker logs charon 2>&1 | grep "All host checks completed" | tail -5

# Confirm recent checks happened
docker logs charon 2>&1 | grep "Host TCP check completed" | tail -20

# Look for any errors in monitoring system
docker logs charon 2>&1 | grep "ERROR.*uptime\|ERROR.*monitor"

Expected log pattern:

INFO[...] All host checks completed host_count=5
DEBUG[...] Host TCP check completed elapsed_ms=156 host_name=example.com success=true

Warning signs:

  • No "All host checks completed" messages in recent logs
  • Checks taking longer than expected (>30s with 10s timeout)
  • Frequent timeout errors
  • High failure_count values

API Integration

Uptime monitoring data is accessible via API:

Get uptime status:

GET /api/uptime/hosts
Authorization: Bearer <token>

Response:

{
  "hosts": [
    {
      "id": "123",
      "name": "example.com",
      "status": "up",
      "last_check": "2025-12-24T10:30:00Z",
      "response_time": 156,
      "failure_count": 0,
      "uptime_percentage": 99.8
    }
  ]
}
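
A client consuming this endpoint mostly needs to decode the response shown above. A minimal sketch of that decoding step in Go, with struct and function names chosen for illustration (the field set mirrors the documented response):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// UptimeHost mirrors one entry of the /api/uptime/hosts response.
type UptimeHost struct {
	ID               string  `json:"id"`
	Name             string  `json:"name"`
	Status           string  `json:"status"`
	LastCheck        string  `json:"last_check"`
	ResponseTime     int     `json:"response_time"`
	FailureCount     int     `json:"failure_count"`
	UptimePercentage float64 `json:"uptime_percentage"`
}

// UptimeResponse is the top-level envelope.
type UptimeResponse struct {
	Hosts []UptimeHost `json:"hosts"`
}

// parseUptime decodes the JSON body returned by GET /api/uptime/hosts.
// In a real client the body would come from an authenticated HTTP call.
func parseUptime(body []byte) (*UptimeResponse, error) {
	var r UptimeResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, err
	}
	return &r, nil
}

func main() {
	body := []byte(`{"hosts":[{"id":"123","name":"example.com","status":"up","response_time":156,"uptime_percentage":99.8}]}`)
	r, err := parseUptime(body)
	if err != nil {
		panic(err)
	}
	h := r.Hosts[0]
	fmt.Printf("%s is %s (%.1f%% uptime)\n", h.Name, h.Status, h.UptimePercentage)
}
```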

Programmatic monitoring:

Use this API to integrate Charon's uptime data with:

  • External monitoring dashboards (Grafana, etc.)
  • Incident response systems (PagerDuty, etc.)
  • Custom alerting tools
  • Status page generators

Additional Resources

Need Help?