Uptime Monitoring
Charon's uptime monitoring system continuously checks the availability of your proxy hosts and alerts you when issues occur. The system is designed to minimize false positives while quickly detecting real problems.
Overview
Uptime monitoring performs automated health checks on your proxy hosts at regular intervals, tracking:
- Host availability (TCP connectivity)
- Response times (latency measurements)
- Status history (uptime/downtime tracking)
- Failure patterns (debounced detection)
How It Works
Check Cycle
- Scheduled Checks: Every 60 seconds (default), Charon checks all enabled hosts
- Port Detection: Uses the proxy host's ForwardPort for TCP checks
- Connection Test: Attempts a TCP connection with a configurable timeout
- Status Update: Records success/failure in database
- Notification Trigger: Sends alerts on status changes (if configured)
Failure Debouncing
To prevent false alarms from transient network issues, Charon uses failure debouncing:
How it works:
- A host must fail 2 consecutive checks before being marked "down"
- Single failures are logged but don't trigger status changes
- Counter resets immediately on any successful check
Why this matters:
- Network hiccups don't cause false alarms
- Container restarts don't trigger unnecessary alerts
- Transient DNS issues are ignored
- You only get notified about real problems
Example scenario:
Check 1: ✅ Success → Status: Up, Failure Count: 0
Check 2: ❌ Failed → Status: Up, Failure Count: 1 (no alert)
Check 3: ❌ Failed → Status: Down, Failure Count: 2 (alert sent!)
Check 4: ✅ Success → Status: Up, Failure Count: 0 (recovery alert)
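The debouncing rule above can be modeled as a small state machine. The following is an illustrative Python sketch, not Charon's actual Go implementation; the names `Debouncer` and `record` are hypothetical:

```python
FAILURE_THRESHOLD = 2  # consecutive failures required before a host is marked down

class Debouncer:
    """Tracks consecutive failures; flips status only at the threshold."""

    def __init__(self, threshold=FAILURE_THRESHOLD):
        self.threshold = threshold
        self.failure_count = 0
        self.status = "up"

    def record(self, success):
        """Record one check result; return an alert label on status change."""
        if success:
            self.failure_count = 0          # counter resets immediately
            if self.status == "down":
                self.status = "up"
                return "recovery alert"
            return None
        self.failure_count += 1
        if self.failure_count >= self.threshold and self.status == "up":
            self.status = "down"
            return "down alert"
        return None                          # single failure: logged, no alert

# Replaying the example scenario:
d = Debouncer()
d.record(True)    # Check 1: success  -> no alert
d.record(False)   # Check 2: failure 1 -> no alert
d.record(False)   # Check 3: failure 2 -> "down alert"
d.record(True)    # Check 4: success  -> "recovery alert"
```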
Configuration
Timeout Settings
Default TCP timeout: 10 seconds
This timeout determines how long Charon waits for a TCP connection before considering it failed.
Increase timeout if:
- You have slow networks
- Hosts are geographically distant
- Containers take time to warm up
- You see intermittent false "down" alerts
Decrease timeout if:
- You want faster failure detection
- Your hosts are on local network
- Response times are consistently fast
Note: Timeout settings are currently set in the backend configuration. A future release will make this configurable via the UI.
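A TCP availability check of this kind reduces to a timed connect attempt. A minimal Python sketch (Charon's backend is Go; the function name `tcp_check` is illustrative):

```python
import socket
import time

def tcp_check(host, port, timeout=10.0):
    """Attempt a TCP connection; return (success, elapsed_ms)."""
    start = time.monotonic()
    try:
        # create_connection resolves DNS and connects within `timeout` seconds
        with socket.create_connection((host, port), timeout=timeout):
            pass
        return True, (time.monotonic() - start) * 1000
    except OSError:
        return False, (time.monotonic() - start) * 1000
```

A slow network shows up here as `elapsed_ms` creeping toward the timeout ceiling, which is why raising the timeout eliminates marginal false "down" results.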
Retry Behavior
When a check fails, Charon automatically retries:
- Max retries: 2 attempts
- Retry delay: 2 seconds between attempts
- Timeout per attempt: 10 seconds (configurable)
Total check time calculation:
Max time = (timeout × max_retries) + (retry_delay × (max_retries - 1))
= (10s × 2) + (2s × 1)
= 22 seconds worst case
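The retry behavior above can be sketched as a wrapper around any check function. This is an illustrative Python model (parameter names are assumptions, not Charon's configuration keys); the worst-case formula falls out of the loop structure:

```python
import time

def check_with_retries(attempt, max_retries=2, retry_delay=2.0):
    """Run a zero-arg check callable up to `max_retries` times, sleeping
    `retry_delay` seconds between failed attempts (not after the last).
    Worst case = timeout * max_retries + retry_delay * (max_retries - 1)."""
    for i in range(max_retries):
        if attempt():
            return True
        if i < max_retries - 1:   # no delay after the final attempt
            time.sleep(retry_delay)
    return False
```

With the defaults above (10s timeout, 2 attempts, 2s delay) the loop bounds one full check at 22 seconds, matching the calculation.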
Check Interval
Default: 60 seconds
The interval between check cycles for all hosts.
Performance considerations:
- Shorter intervals = faster detection but higher CPU/network usage
- Longer intervals = lower overhead but slower failure detection
- Recommended: 30-120 seconds depending on criticality
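A fixed-interval scheduler of this kind typically sleeps for the remainder of the interval after each cycle, so a slow cycle does not push the schedule back. A hedged Python sketch (the `cycles` parameter is added here purely so the loop is testable; it is not a Charon setting):

```python
import time

def monitor_loop(run_cycle, interval=60.0, cycles=None):
    """Run check cycles at a fixed interval. `cycles=None` runs forever;
    a number limits iterations. Sleep time is interval minus cycle time."""
    done = 0
    while cycles is None or done < cycles:
        start = time.monotonic()
        run_cycle()
        done += 1
        elapsed = time.monotonic() - start
        if cycles is None or done < cycles:
            time.sleep(max(0.0, interval - elapsed))
```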
Enabling Uptime Monitoring
For a Single Host
- Navigate to Proxy Hosts
- Click Edit on the host
- Scroll to Uptime Monitoring section
- Toggle "Enable Uptime Monitoring" to ON
- Click Save
For Multiple Hosts (Bulk)
- Navigate to Proxy Hosts
- Select checkboxes for hosts to monitor
- Click "Bulk Apply" button
- Find "Uptime Monitoring" section
- Toggle the switch to ON
- Check "Apply to selected hosts"
- Click "Apply Changes"
Monitoring Dashboard
Host Status Display
Each monitored host shows:
- Status Badge: 🟢 Up / 🔴 Down
- Response Time: Last successful check latency
- Uptime Percentage: Success rate over time
- Last Check: Timestamp of most recent check
Status Page
View all monitored hosts at a glance:
- Navigate to Dashboard → Uptime Status
- See real-time status of all hosts
- Click any host for detailed history
- Filter by status (up/down/all)
Troubleshooting
False Positive: Host Shown as Down but Actually Up
Symptoms:
- Host shows "down" in Charon
- Service is accessible directly
- Status changes back to "up" shortly after
Common causes:
- Timeout too short for a slow network. Solution: increase the TCP timeout in configuration.
- Container warmup time exceeds the timeout. Solution: use a longer timeout or optimize container startup.
- Network congestion during the check. Solution: debouncing (already enabled) should handle this automatically.
- Firewall blocking health checks. Solution: ensure the Charon container can reach the proxy host ports.
- Multiple checks running concurrently. Solution: automatic synchronization ensures checks complete before the next cycle.
Diagnostic steps:
# Check Charon logs for timing info
docker logs charon 2>&1 | grep "Host TCP check completed"
# Look for retry attempts
docker logs charon 2>&1 | grep "Retrying TCP check"
# Check failure count patterns
docker logs charon 2>&1 | grep "failure_count"
# View host status changes
docker logs charon 2>&1 | grep "Host status changed"
False Negative: Host Shown as Up but Actually Down
Symptoms:
- Host shows "up" in Charon
- Service returns errors or is inaccessible
- No down alerts received
Common causes:
- TCP port open but service not responding. Explanation: uptime monitoring only checks TCP connectivity, not application health. Solution: consider application-level health checks (future feature).
- Service accepts connections but returns errors. Solution: monitor application logs separately; TCP checks don't validate responses.
- Partial service degradation. Solution: use multiple monitoring providers for critical services.
Current limitation: Charon performs TCP health checks only. HTTP-based health checks are planned for a future release.
Intermittent Status Flapping
Symptoms:
- Status rapidly changes between up/down
- Multiple notifications in short time
- Logs show alternating success/failure
Causes:
- Marginal network conditions. Solution: increase the failure threshold (requires a configuration change).
- Resource exhaustion on the target host. Solution: investigate target host performance; increase resources.
- Shared network congestion. Solution: consider a dedicated monitoring network or VLAN.
Mitigation:
The built-in debouncing (2 consecutive failures required) should prevent most flapping. If issues persist, check:
# Review consecutive check results
docker logs charon 2>&1 | grep -A 2 "Host TCP check completed" | grep "host_name"
# Check response time trends
docker logs charon 2>&1 | grep "elapsed_ms"
No Notifications Received
Checklist:
- ✅ Uptime monitoring is enabled for the host
- ✅ Notification provider is configured and enabled
- ✅ Provider is set to trigger on uptime events
- ✅ Status has actually changed (check logs)
- ✅ Debouncing threshold has been met (2 consecutive failures)
Debug notifications:
# Check for notification attempts
docker logs charon 2>&1 | grep "notification"
# Look for uptime-related notifications
docker logs charon 2>&1 | grep "uptime_down\|uptime_up"
# Verify notification service is working
docker logs charon 2>&1 | grep "Failed to send notification"
High CPU Usage from Monitoring
Symptoms:
- Charon container using excessive CPU
- System becomes slow during check cycles
- Logs show slow check times
Solutions:
- Reduce the number of monitored hosts: monitor only critical services; disable monitoring for non-essential hosts.
- Increase the check interval: change from 60s to 120s to reduce frequency.
- Optimize Docker resource allocation: ensure adequate CPU/memory is allocated to the Charon container.
- Check for network issues: slow DNS or network problems can cause checks to hang.
Monitor check performance:
# View check duration distribution
docker logs charon 2>&1 | grep "elapsed_ms" | tail -50
# Count concurrent checks
docker logs charon 2>&1 | grep "All host checks completed"
Advanced Topics
Port Detection
Charon automatically determines which port to check:
Priority order:
- ProxyHost.ForwardPort: Preferred, most reliable
- URL extraction: Fallback for hosts without proxy configuration
- Default ports: 80 (HTTP) or 443 (HTTPS) if port not specified
Example:
Host: example.com
Forward Port: 8080
→ Checks: example.com:8080
Host: api.example.com
URL: https://api.example.com/health
Forward Port: (not set)
→ Checks: api.example.com:443
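The priority order above can be sketched as a small resolver function. This is an illustrative Python model; the function name `resolve_check_port` and its parameters are assumptions, not Charon's actual field names:

```python
from urllib.parse import urlparse

def resolve_check_port(forward_port=None, url=None):
    """Pick the port to health-check, mirroring the documented priority."""
    if forward_port:                      # 1. ForwardPort: preferred
        return forward_port
    if url:                               # 2. Fall back to the URL, if any
        parsed = urlparse(url)
        if parsed.port:                   #    explicit port in the URL wins
            return parsed.port
        return 443 if parsed.scheme == "https" else 80  # 3. scheme default
    return 80                             # final fallback: plain HTTP
```

Replaying the examples: `resolve_check_port(forward_port=8080)` yields 8080, and `resolve_check_port(url="https://api.example.com/health")` yields 443.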
Concurrent Check Processing
All host checks run concurrently for better performance:
- Each host checked in separate goroutine
- WaitGroup ensures all checks complete before next cycle
- Prevents database race conditions
- No single slow host blocks other checks
Performance characteristics:
- Sequential checks (old): time = hosts × timeout
- Concurrent checks (current): time = max(individual_check_times)
Example: With 10 hosts and a 10s timeout, in the worst case (every check runs to its full timeout):
- Sequential: ~100 seconds
- Concurrent: ~10 seconds (bounded by the slowest single check)
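The Go backend fans out with goroutines and waits on a sync.WaitGroup; the same fan-out/wait pattern looks like this in Python with a thread pool (an illustrative sketch, not Charon's code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_check_cycle(hosts, check):
    """Check all hosts concurrently. The `with` block does not exit until
    every check has finished, playing the role of WaitGroup.Wait()."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        results = dict(zip(hosts, pool.map(check, hosts)))
    elapsed = time.monotonic() - start
    return results, elapsed

# With a simulated 0.1s check, 10 hosts finish in roughly one check's time,
# well under the ~1s a sequential loop would take:
hosts = [f"host{i}" for i in range(10)]
results, elapsed = run_check_cycle(hosts, lambda h: time.sleep(0.1) or True)
```

Collecting all results before the cycle ends is also what prevents two cycles from writing to the database at once.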
Database Storage
Uptime data is stored efficiently:
UptimeHost table:
- status: Current status ("up"/"down")
- failure_count: Consecutive failure counter
- last_check: Timestamp of last check
- response_time: Last successful response time
UptimeMonitor table:
- Links monitors to proxy hosts
- Stores check configuration
- Tracks enabled state
Heartbeat records (future):
- Detailed history of each check
- Used for uptime percentage calculations
- Queryable for historical analysis
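Once heartbeat records exist, an uptime percentage is just the success ratio over a window of them. A hedged sketch of how the stored history could be used (illustrative only; this is a planned feature, not current behavior):

```python
def uptime_percentage(heartbeats):
    """Percentage of successful checks in a window of heartbeat records,
    each represented here as a bool (True = check succeeded)."""
    if not heartbeats:
        return 100.0  # no data yet: report full availability (a policy choice)
    return 100.0 * sum(heartbeats) / len(heartbeats)

# 499 successes out of 500 checks -> 99.8, as in the API example below
uptime_percentage([True] * 499 + [False])
```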
Best Practices
1. Monitor Critical Services Only
Don't monitor every host. Focus on:
- Production services
- User-facing applications
- External dependencies
- High-availability requirements
Skip monitoring for:
- Development/test instances
- Internal tools with built-in redundancy
- Services with their own monitoring
2. Configure Appropriate Notifications
Critical services:
- Multiple notification channels (Discord + Slack)
- Immediate alerts (no batching)
- On-call team notifications
Non-critical services:
- Single notification channel
- Digest/batch notifications (future feature)
- Email to team (low priority)
3. Review False Positives
If you receive false alarms:
- Check logs to understand why
- Adjust timeout if needed
- Verify network stability
- Consider increasing failure threshold (future config option)
4. Regular Status Review
Weekly review of:
- Uptime percentages (identify problematic hosts)
- Response time trends (detect degradation)
- Notification frequency (too many alerts?)
- False positive rate (refine configuration)
5. Combine with Application Monitoring
Uptime monitoring checks availability, not functionality.
Complement with:
- Application-level health checks
- Error rate monitoring
- Performance metrics (APM tools)
- User experience monitoring
Planned Improvements
Future enhancements under consideration:
- HTTP health check support - Check specific endpoints with status code validation
- Configurable failure threshold - Adjust consecutive failure count via UI
- Custom check intervals per host - Different intervals for different criticality levels
- Response time alerts - Notify on degraded performance, not just failures
- Notification batching - Group multiple alerts to reduce noise
- Maintenance windows - Disable alerts during scheduled maintenance
- Historical graphs - Visual uptime trends over time
- Status page export - Public status page for external visibility
Monitoring the Monitors
How do you know if Charon's monitoring is working?
Check Charon's own health:
# Verify check cycle is running
docker logs charon 2>&1 | grep "All host checks completed" | tail -5
# Confirm recent checks happened
docker logs charon 2>&1 | grep "Host TCP check completed" | tail -20
# Look for any errors in monitoring system
docker logs charon 2>&1 | grep "ERROR.*uptime\|ERROR.*monitor"
Expected log pattern:
INFO[...] All host checks completed host_count=5
DEBUG[...] Host TCP check completed elapsed_ms=156 host_name=example.com success=true
Warning signs:
- No "All host checks completed" messages in recent logs
- Checks taking longer than expected (>30s with 10s timeout)
- Frequent timeout errors
- High failure_count values
API Integration
Uptime monitoring data is accessible via API:
Get uptime status:
GET /api/uptime/hosts
Authorization: Bearer <token>
Response:
{
"hosts": [
{
"id": "123",
"name": "example.com",
"status": "up",
"last_check": "2025-12-24T10:30:00Z",
"response_time": 156,
"failure_count": 0,
"uptime_percentage": 99.8
}
]
}
Programmatic monitoring:
Use this API to integrate Charon's uptime data with:
- External monitoring dashboards (Grafana, etc.)
- Incident response systems (PagerDuty, etc.)
- Custom alerting tools
- Status page generators
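Consuming the endpoint above from a script needs only the standard library. This sketch assumes the endpoint, token scheme, and response shape shown; `fetch_uptime_hosts` and `down_hosts` are hypothetical helper names:

```python
import json
import urllib.request

def fetch_uptime_hosts(base_url, token):
    """GET /api/uptime/hosts with a bearer token; return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/api/uptime/hosts",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def down_hosts(payload):
    """Names of hosts the response marks as down, for alerting/dashboards."""
    return [h["name"] for h in payload.get("hosts", []) if h["status"] == "down"]
```

For example, a cron job could call `down_hosts(fetch_uptime_hosts(url, token))` and forward any non-empty result to an incident response system.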