feat: add JSON template support for all services and fix uptime monitoring reliability

BREAKING CHANGE: None - fully backward compatible

Changes:
- feat(notifications): extend JSON templates to Discord, Slack, Gotify, and generic
- fix(uptime): resolve race conditions and false positives with failure debouncing
- chore(tests): add comprehensive test coverage (86.2% backend, 87.61% frontend)
- docs: add feature guides and manual test plan

Technical Details:
- Added supportsJSONTemplates() helper for service capability detection
- Renamed sendCustomWebhook → sendJSONPayload for clarity
- Added FailureCount field requiring 2 consecutive failures before marking down
- Implemented WaitGroup synchronization and host-specific mutexes
- Increased TCP timeout to 10s with 2 retry attempts
- Added template security: 5s timeout, 10KB size limit
- All security scans pass (CodeQL, Trivy)
@@ -0,0 +1,544 @@
# Notification System

Charon's notification system keeps you informed about important events in your infrastructure through multiple channels, including Discord, Slack, Gotify, Telegram, and custom webhooks.

## Overview

Notifications can be triggered by various events:

- **SSL Certificate Events**: Issued, renewed, or failed
- **Uptime Monitoring**: Host status changes (up/down)
- **Security Events**: WAF blocks, CrowdSec alerts, ACL violations
- **System Events**: Configuration changes, backup completions

## Supported Services

| Service | JSON Templates | Native API | Rich Formatting |
|---------|----------------|------------|-----------------|
| **Discord** | ✅ Yes | ✅ Webhooks | ✅ Embeds |
| **Slack** | ✅ Yes | ✅ Incoming Webhooks | ✅ Block Kit |
| **Gotify** | ✅ Yes | ✅ REST API | ✅ Extras |
| **Generic Webhook** | ✅ Yes | ✅ HTTP POST | ✅ Custom |
| **Telegram** | ❌ No | ✅ Bot API | ⚠️ Markdown |

### Why JSON Templates?

JSON templates give you complete control over notification formatting, allowing you to:

- **Customize appearance**: Use rich embeds, colors, and formatting
- **Add metadata**: Include custom fields, timestamps, and links
- **Optimize visibility**: Structure messages for better readability
- **Integrate seamlessly**: Match your team's existing notification styles

## Configuration

### Basic Setup

1. Navigate to **Settings** → **Notifications**
2. Click **"Add Provider"**
3. Select your service type
4. Enter the webhook URL
5. Configure notification triggers
6. Save your provider

### JSON Template Support

For services supporting JSON (Discord, Slack, Gotify, Generic, Webhook), you can choose from three template options:

#### 1. Minimal Template (Default)

Simple, clean notifications with essential information:

```json
{
  "content": "{{.Title}}: {{.Message}}"
}
```

**Use when:**

- You want low-noise notifications
- Space is limited (mobile notifications)
- Only essential info is needed

#### 2. Detailed Template

Comprehensive notifications with all available context:

```json
{
  "embeds": [{
    "title": "{{.Title}}",
    "description": "{{.Message}}",
    "color": {{.Color}},
    "timestamp": "{{.Timestamp}}",
    "fields": [
      {"name": "Event Type", "value": "{{.EventType}}", "inline": true},
      {"name": "Host", "value": "{{.HostName}}", "inline": true}
    ]
  }]
}
```

**Use when:**

- You need full event context
- Multiple team members review notifications
- Historical tracking is important

#### 3. Custom Template

Create your own template with complete control over structure and formatting.

**Use when:**

- Standard templates don't meet your needs
- You have specific formatting requirements
- Integrating with custom systems

## Service-Specific Examples

### Discord Webhooks

Discord supports rich embeds with colors, fields, and timestamps.

#### Basic Embed

```json
{
  "embeds": [{
    "title": "{{.Title}}",
    "description": "{{.Message}}",
    "color": {{.Color}},
    "timestamp": "{{.Timestamp}}"
  }]
}
```

#### Advanced Embed with Fields

```json
{
  "username": "Charon Alerts",
  "avatar_url": "https://example.com/charon-icon.png",
  "embeds": [{
    "title": "🚨 {{.Title}}",
    "description": "{{.Message}}",
    "color": {{.Color}},
    "timestamp": "{{.Timestamp}}",
    "fields": [
      {
        "name": "Event Type",
        "value": "{{.EventType}}",
        "inline": true
      },
      {
        "name": "Severity",
        "value": "{{.Severity}}",
        "inline": true
      },
      {
        "name": "Host",
        "value": "{{.HostName}}",
        "inline": false
      }
    ],
    "footer": {
      "text": "Charon Notification System"
    }
  }]
}
```

**Available Discord Colors:**

- `2326507` - Blue (info)
- `15158332` - Red (error)
- `16776960` - Yellow (warning)
- `3066993` - Green (success)
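
Discord expects embed colors as decimal integers, which are simply RGB hex values written in base 10. The following Go sketch shows one way to map a severity to the colors listed above and to print the equivalent hex code. The `severityColor` and `hexColor` helpers are hypothetical names introduced for illustration, and `"success"` is assumed as a severity value; neither is confirmed as part of Charon's code.

```go
package main

import "fmt"

// Decimal color values taken from the list above.
const (
	colorBlue   = 2326507  // info
	colorRed    = 15158332 // error
	colorYellow = 16776960 // warning
	colorGreen  = 3066993  // success
)

// severityColor is a hypothetical helper: it maps a severity string to
// one of the decimal colors above and falls back to blue.
func severityColor(severity string) int {
	switch severity {
	case "error":
		return colorRed
	case "warning":
		return colorYellow
	case "success": // assumed value, not listed among the documented severities
		return colorGreen
	default:
		return colorBlue
	}
}

// hexColor converts a decimal color to the familiar #RRGGBB notation.
func hexColor(c int) string {
	return fmt.Sprintf("#%06X", c)
}

func main() {
	for _, s := range []string{"info", "warning", "error", "success"} {
		c := severityColor(s)
		fmt.Printf("%-8s %8d %s\n", s, c, hexColor(c))
	}
}
```

Converting to hex makes it easy to preview a color in any color picker before putting the decimal value into a template.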

### Slack Webhooks

Slack uses Block Kit for rich message formatting.

#### Basic Block

```json
{
  "text": "{{.Title}}",
  "blocks": [
    {
      "type": "header",
      "text": {
        "type": "plain_text",
        "text": "{{.Title}}"
      }
    },
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "{{.Message}}"
      }
    }
  ]
}
```

#### Advanced Block with Context

```json
{
  "text": "{{.Title}}",
  "blocks": [
    {
      "type": "header",
      "text": {
        "type": "plain_text",
        "text": "🔔 {{.Title}}",
        "emoji": true
      }
    },
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*Event:* {{.EventType}}\n*Message:* {{.Message}}"
      }
    },
    {
      "type": "section",
      "fields": [
        {
          "type": "mrkdwn",
          "text": "*Host:*\n{{.HostName}}"
        },
        {
          "type": "mrkdwn",
          "text": "*Time:*\n{{.Timestamp}}"
        }
      ]
    },
    {
      "type": "context",
      "elements": [
        {
          "type": "mrkdwn",
          "text": "Notification from Charon"
        }
      ]
    }
  ]
}
```

**Slack Markdown Tips:**

- `*bold*` for emphasis
- `_italic_` for subtle text
- `~strike~` for deprecated info
- `` `code` `` for technical details
- Use `\n` for line breaks

### Gotify Webhooks

Gotify supports JSON payloads with priority levels and extras.

#### Basic Message

```json
{
  "title": "{{.Title}}",
  "message": "{{.Message}}",
  "priority": 5
}
```

#### Advanced Message with Extras

```json
{
  "title": "{{.Title}}",
  "message": "{{.Message}}",
  "priority": {{.Priority}},
  "extras": {
    "client::display": {
      "contentType": "text/markdown"
    },
    "client::notification": {
      "click": {
        "url": "https://your-charon-instance.com"
      }
    },
    "charon": {
      "event_type": "{{.EventType}}",
      "host_name": "{{.HostName}}",
      "timestamp": "{{.Timestamp}}"
    }
  }
}
```

**Gotify Priority Levels:**

- `0` - Very low
- `2` - Low
- `5` - Normal (default)
- `8` - High
- `10` - Very high (emergency)

### Generic Webhooks

For custom integrations, use any JSON structure:

```json
{
  "notification": {
    "type": "{{.EventType}}",
    "level": "{{.Severity}}",
    "title": "{{.Title}}",
    "body": "{{.Message}}",
    "metadata": {
      "host": "{{.HostName}}",
      "timestamp": "{{.Timestamp}}",
      "source": "charon"
    }
  }
}
```

## Template Variables

All services support these variables in JSON templates:

| Variable | Description | Example |
|----------|-------------|---------|
| `{{.Title}}` | Event title | "SSL Certificate Renewed" |
| `{{.Message}}` | Event message/details | "Certificate for example.com renewed" |
| `{{.EventType}}` | Type of event | "ssl_renewal", "uptime_down" |
| `{{.Severity}}` | Event severity level | "info", "warning", "error" |
| `{{.HostName}}` | Affected proxy host | "example.com" |
| `{{.Timestamp}}` | ISO 8601 timestamp | "2025-12-24T10:30:00Z" |
| `{{.Color}}` | Color code (integer) | 2326507 (blue) |
| `{{.Priority}}` | Numeric priority (1-10) | 5 |

### Event-Specific Variables

Some events include additional variables:

**SSL Certificate Events:**

- `{{.Domain}}` - Certificate domain
- `{{.ExpiryDate}}` - Expiration date
- `{{.DaysRemaining}}` - Days until expiry

**Uptime Events:**

- `{{.StatusChange}}` - "up_to_down" or "down_to_up"
- `{{.ResponseTime}}` - Last response time in ms
- `{{.Downtime}}` - Duration of downtime

**Security Events:**

- `{{.AttackerIP}}` - Source IP address
- `{{.RuleID}}` - Triggered rule identifier
- `{{.Action}}` - Action taken (block/log)

## Migration Guide

### Upgrading from Basic Webhooks

If you've been using webhook providers without JSON templates:

**Before (Basic webhook):**

```
Type: webhook
URL: https://discord.com/api/webhooks/...
Template: (not available)
```

**After (JSON template):**

```
Type: discord
URL: https://discord.com/api/webhooks/...
Template: detailed (or custom)
```

**Steps:**

1. Edit your existing provider
2. Change type from `webhook` to the specific service (e.g., `discord`)
3. Select a template (minimal, detailed, or custom)
4. Test the notification
5. Save changes

### Testing Your Template

Before saving, always test your template:

1. Click **"Send Test Notification"** in the provider form
2. Check your notification channel (Discord/Slack/etc.)
3. Verify formatting, colors, and all fields appear correctly
4. Adjust template if needed
5. Test again until satisfied

## Troubleshooting

### Template Validation Errors

**Error:** `Invalid JSON template`

**Solution:** Validate your JSON using a tool like [jsonlint.com](https://jsonlint.com). Common issues:

- Missing closing braces `}`
- Trailing commas
- Unescaped quotes in strings

**Error:** `Template variable not found: {{.CustomVar}}`

**Solution:** Only use supported template variables listed above.
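
Both errors can be reproduced locally before a template ever reaches Charon. The `{{.Title}}` placeholders follow Go's `text/template` syntax, so the sketch below renders a template with the standard library and then checks that the result is valid JSON. The `Event` struct and `renderJSON` function are hypothetical stand-ins written for this example; they are not Charon's actual types, and Charon's own validation may behave differently.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"text/template"
)

// Event is a hypothetical stand-in for the data passed to a template;
// only a few of the documented variables are modelled here.
type Event struct {
	Title   string
	Message string
	Color   int
}

// renderJSON executes tmpl against ev. It returns an error if the template
// fails to parse, references an unknown field, or renders invalid JSON.
func renderJSON(tmpl string, ev Event) (string, error) {
	t, err := template.New("payload").Parse(tmpl)
	if err != nil {
		return "", fmt.Errorf("parse: %w", err)
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, ev); err != nil {
		return "", fmt.Errorf("execute: %w", err)
	}
	if !json.Valid(buf.Bytes()) {
		return "", fmt.Errorf("rendered output is not valid JSON")
	}
	return buf.String(), nil
}

func main() {
	ev := Event{
		Title:   "SSL Certificate Renewed",
		Message: "Certificate for example.com renewed",
		Color:   2326507,
	}
	cases := []string{
		`{"content": "{{.Title}}: {{.Message}}"}`, // valid
		`{"content": "{{.Title}}",}`,              // trailing comma
		`{"content": "{{.CustomVar}}"}`,           // unknown variable
	}
	for _, c := range cases {
		out, err := renderJSON(c, ev)
		fmt.Printf("template=%q\n  output=%q\n  err=%v\n", c, out, err)
	}
}
```

Note that `text/template` inserts values verbatim, so in this sketch a message containing a double quote would also fail the JSON check. Whether Charon escapes values before substitution is not stated here, so test with realistic messages.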

### Notification Not Received

**Checklist:**

1. ✅ Provider is enabled
2. ✅ Event type is configured for notifications
3. ✅ Webhook URL is correct
4. ✅ Service (Discord/Slack/etc.) is online
5. ✅ Test notification succeeds
6. ✅ Check Charon logs for errors: `docker logs charon 2>&1 | grep notification`

### Discord Embed Not Showing

**Cause:** Embeds require specific structure.

**Solution:** Ensure your template includes the `embeds` array:

```json
{
  "embeds": [
    {
      "title": "{{.Title}}",
      "description": "{{.Message}}"
    }
  ]
}
```

### Slack Message Appears Plain

**Cause:** Block Kit requires specific formatting.

**Solution:** Use `blocks` array with proper types:

```json
{
  "blocks": [
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "{{.Message}}"
      }
    }
  ]
}
```

## Best Practices

### 1. Start Simple

Begin with the **minimal** template and only customize if you need more information.

### 2. Test Thoroughly

Always test notifications before relying on them for critical alerts.

### 3. Use Color Coding

Consistent colors help quickly identify severity:

- 🔴 Red: Errors, outages
- 🟡 Yellow: Warnings
- 🟢 Green: Success, recovery
- 🔵 Blue: Informational

### 4. Group Related Events

Configure multiple providers for different event types:

- Critical alerts → Discord (with mentions)
- Info notifications → Slack (general channel)
- All events → Gotify (personal alerts)

### 5. Rate Limit Awareness

Be mindful of service limits:

- **Discord**: 5 requests per 2 seconds per webhook
- **Slack**: 1 request per second per workspace
- **Gotify**: No strict limits (self-hosted)

### 6. Keep Templates Maintainable

- Document custom templates
- Version control your templates
- Test after service updates

## Advanced Use Cases

### Multi-Channel Routing

Create separate providers for different severity levels:

```
Provider: Discord Critical
Events: uptime_down, ssl_failure
Template: Custom with @everyone mention

Provider: Slack Info
Events: ssl_renewal, backup_success
Template: Minimal

Provider: Gotify All
Events: * (all)
Template: Detailed
```

### Conditional Formatting

Use template logic (if supported by your service):

```json
{
  "embeds": [{
    "title": "{{.Title}}",
    "description": "{{.Message}}",
    "color": {{if eq .Severity "error"}}15158332{{else}}2326507{{end}}
  }]
}
```

### Integration with Automation

Forward notifications to automation tools:

```json
{
  "webhook_type": "charon_notification",
  "trigger_workflow": true,
  "data": {
    "event": "{{.EventType}}",
    "host": "{{.HostName}}",
    "action_required": {{if eq .Severity "error"}}true{{else}}false{{end}}
  }
}
```

## Additional Resources

- [Discord Webhook Documentation](https://discord.com/developers/docs/resources/webhook)
- [Slack Block Kit Builder](https://api.slack.com/block-kit)
- [Gotify API Documentation](https://gotify.net/docs/)
- [Charon Security Guide](../security.md)

## Need Help?

- 💬 [Ask in Discussions](https://github.com/Wikid82/charon/discussions)
- 🐛 [Report Issues](https://github.com/Wikid82/charon/issues)
- 📖 [View Full Documentation](https://wikid82.github.io/charon/)
@@ -0,0 +1,526 @@
# Uptime Monitoring

Charon's uptime monitoring system continuously checks the availability of your proxy hosts and alerts you when issues occur. The system is designed to minimize false positives while quickly detecting real problems.

## Overview

Uptime monitoring performs automated health checks on your proxy hosts at regular intervals, tracking:

- **Host availability** (TCP connectivity)
- **Response times** (latency measurements)
- **Status history** (uptime/downtime tracking)
- **Failure patterns** (debounced detection)

## How It Works

### Check Cycle

1. **Scheduled Checks**: Every 60 seconds (default), Charon checks all enabled hosts
2. **Port Detection**: Uses the proxy host's `ForwardPort` for TCP checks
3. **Connection Test**: Attempts TCP connection with configurable timeout
4. **Status Update**: Records success/failure in database
5. **Notification Trigger**: Sends alerts on status changes (if configured)

### Failure Debouncing

To prevent false alarms from transient network issues, Charon uses **failure debouncing**:

**How it works:**

- A host must **fail 2 consecutive checks** before being marked "down"
- Single failures are logged but don't trigger status changes
- Counter resets immediately on any successful check

**Why this matters:**

- Network hiccups don't cause false alarms
- Container restarts don't trigger unnecessary alerts
- Transient DNS issues are ignored
- You only get notified about real problems

**Example scenario:**

```
Check 1: ✅ Success → Status: Up, Failure Count: 0
Check 2: ❌ Failed → Status: Up, Failure Count: 1 (no alert)
Check 3: ❌ Failed → Status: Down, Failure Count: 2 (alert sent!)
Check 4: ✅ Success → Status: Up, Failure Count: 0 (recovery alert)
```

## Configuration

### Timeout Settings

**Default TCP timeout:** 10 seconds

This timeout determines how long Charon waits for a TCP connection before considering it failed.

**Increase timeout if:**

- You have slow networks
- Hosts are geographically distant
- Containers take time to warm up
- You see intermittent false "down" alerts

**Decrease timeout if:**

- You want faster failure detection
- Your hosts are on local network
- Response times are consistently fast

**Note:** Timeout settings are currently set in the backend configuration. A future release will make this configurable via the UI.

### Retry Behavior

When a check fails, Charon automatically retries:

- **Max retries:** 2 attempts
- **Retry delay:** 2 seconds between attempts
- **Timeout per attempt:** 10 seconds (configurable)

**Total check time calculation:**

```
Max time = (timeout × max_retries) + (retry_delay × (max_retries - 1))
         = (10s × 2) + (2s × 1)
         = 22 seconds worst case
```

### Check Interval

**Default:** 60 seconds

The interval between check cycles for all hosts.

**Performance considerations:**

- Shorter intervals = faster detection but higher CPU/network usage
- Longer intervals = lower overhead but slower failure detection
- Recommended: 30-120 seconds depending on criticality

## Enabling Uptime Monitoring

### For a Single Host

1. Navigate to **Proxy Hosts**
2. Click **Edit** on the host
3. Scroll to **Uptime Monitoring** section
4. Toggle **"Enable Uptime Monitoring"** to ON
5. Click **Save**

### For Multiple Hosts (Bulk)

1. Navigate to **Proxy Hosts**
2. Select checkboxes for hosts to monitor
3. Click **"Bulk Apply"** button
4. Find **"Uptime Monitoring"** section
5. Toggle the switch to **ON**
6. Check **"Apply to selected hosts"**
7. Click **"Apply Changes"**

## Monitoring Dashboard

### Host Status Display

Each monitored host shows:

- **Status Badge**: 🟢 Up / 🔴 Down
- **Response Time**: Last successful check latency
- **Uptime Percentage**: Success rate over time
- **Last Check**: Timestamp of most recent check

### Status Page

View all monitored hosts at a glance:

1. Navigate to **Dashboard** → **Uptime Status**
2. See real-time status of all hosts
3. Click any host for detailed history
4. Filter by status (up/down/all)

## Troubleshooting

### False Positive: Host Shown as Down but Actually Up

**Symptoms:**

- Host shows "down" in Charon
- Service is accessible directly
- Status changes back to "up" shortly after

**Common causes:**

1. **Timeout too short for slow network**

   **Solution:** Increase TCP timeout in configuration

2. **Container warmup time exceeds timeout**

   **Solution:** Use longer timeout or optimize container startup

3. **Network congestion during check**

   **Solution:** Debouncing (already enabled) should handle this automatically

4. **Firewall blocking health checks**

   **Solution:** Ensure Charon container can reach proxy host ports

5. **Multiple checks running concurrently**

   **Solution:** Automatic synchronization ensures checks complete before next cycle

**Diagnostic steps:**

```bash
# Check Charon logs for timing info
docker logs charon 2>&1 | grep "Host TCP check completed"

# Look for retry attempts
docker logs charon 2>&1 | grep "Retrying TCP check"

# Check failure count patterns
docker logs charon 2>&1 | grep "failure_count"

# View host status changes
docker logs charon 2>&1 | grep "Host status changed"
```

### False Negative: Host Shown as Up but Actually Down

**Symptoms:**

- Host shows "up" in Charon
- Service returns errors or is inaccessible
- No down alerts received

**Common causes:**

1. **TCP port open but service not responding**

   **Explanation:** Uptime monitoring only checks TCP connectivity, not application health

   **Solution:** Consider implementing application-level health checks (future feature)

2. **Service accepts connections but returns errors**

   **Solution:** Monitor application logs separately; TCP checks don't validate responses

3. **Partial service degradation**

   **Solution:** Use multiple monitoring providers for critical services

**Current limitation:** Charon performs TCP health checks only. HTTP-based health checks are planned for a future release.

### Intermittent Status Flapping

**Symptoms:**

- Status rapidly changes between up/down
- Multiple notifications in short time
- Logs show alternating success/failure

**Causes:**

1. **Marginal network conditions**

   **Solution:** Increase failure threshold (requires configuration change)

2. **Resource exhaustion on target host**

   **Solution:** Investigate target host performance, increase resources

3. **Shared network congestion**

   **Solution:** Consider dedicated monitoring network or VLAN

**Mitigation:**

The built-in debouncing (2 consecutive failures required) should prevent most flapping. If issues persist, check:

```bash
# Review consecutive check results
docker logs charon 2>&1 | grep -A 2 "Host TCP check completed" | grep "host_name"

# Check response time trends
docker logs charon 2>&1 | grep "elapsed_ms"
```

### No Notifications Received

**Checklist:**

1. ✅ Uptime monitoring is enabled for the host
2. ✅ Notification provider is configured and enabled
3. ✅ Provider is set to trigger on uptime events
4. ✅ Status has actually changed (check logs)
5. ✅ Debouncing threshold has been met (2 consecutive failures)

**Debug notifications:**

```bash
# Check for notification attempts
docker logs charon 2>&1 | grep "notification"

# Look for uptime-related notifications
docker logs charon 2>&1 | grep "uptime_down\|uptime_up"

# Verify notification service is working
docker logs charon 2>&1 | grep "Failed to send notification"
```

### High CPU Usage from Monitoring

**Symptoms:**

- Charon container using excessive CPU
- System becomes slow during check cycles
- Logs show slow check times

**Solutions:**

1. **Reduce number of monitored hosts**

   Monitor only critical services; disable monitoring for non-essential hosts

2. **Increase check interval**

   Change from 60s to 120s to reduce frequency

3. **Optimize Docker resource allocation**

   Ensure adequate CPU/memory allocated to Charon container

4. **Check for network issues**

   Slow DNS or network problems can cause checks to hang

**Monitor check performance:**

```bash
# View check duration distribution
docker logs charon 2>&1 | grep "elapsed_ms" | tail -50

# Count concurrent checks
docker logs charon 2>&1 | grep "All host checks completed"
```

## Advanced Topics

### Port Detection

Charon automatically determines which port to check:

**Priority order:**

1. **ProxyHost.ForwardPort**: Preferred, most reliable
2. **URL extraction**: Fallback for hosts without proxy configuration
3. **Default ports**: 80 (HTTP) or 443 (HTTPS) if port not specified

**Example:**

```
Host: example.com
Forward Port: 8080
→ Checks: example.com:8080

Host: api.example.com
URL: https://api.example.com/health
Forward Port: (not set)
→ Checks: api.example.com:443
```
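
The priority order can be expressed as a short function. The Go sketch below is one possible reading of the rules above, not Charon's actual logic: `resolvePort` is a hypothetical helper, and a forward port of `0` is assumed to mean "not set".

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// resolvePort follows the priority order described above: an explicit
// forward port first, then a port embedded in the URL, then the default
// for the URL scheme (443 for https, 80 otherwise).
func resolvePort(forwardPort int, rawURL string) (int, error) {
	if forwardPort > 0 {
		return forwardPort, nil
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return 0, err
	}
	if p := u.Port(); p != "" {
		return strconv.Atoi(p)
	}
	if u.Scheme == "https" {
		return 443, nil
	}
	return 80, nil
}

func main() {
	p1, _ := resolvePort(8080, "")
	p2, _ := resolvePort(0, "https://api.example.com/health")
	fmt.Println("example.com ->", p1)
	fmt.Println("api.example.com ->", p2)
}
```

For the two hosts in the example, this yields ports 8080 and 443 respectively.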

### Concurrent Check Processing

All host checks run concurrently for better performance:

- Each host checked in separate goroutine
- WaitGroup ensures all checks complete before next cycle
- Prevents database race conditions
- No single slow host blocks other checks

**Performance characteristics (worst case, ignoring retries):**

- **Sequential checks** (old): `time = hosts × timeout`
- **Concurrent checks** (current): `time = max(individual_check_times)`

**Example:** With 10 hosts and 10s timeout:

- Sequential: up to ~100 seconds if every check runs into the timeout
- Concurrent: up to ~10 seconds under the same conditions
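
The goroutine-per-host pattern with a `sync.WaitGroup` can be sketched as follows. This is a generic illustration of the technique rather than Charon's code: `checkAll` is a hypothetical function, the check itself is simulated with a sleep, and a single mutex guards the result map where the commit message mentions host-specific mutexes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// checkAll runs check for every host in its own goroutine and returns only
// after all of them have finished.
func checkAll(hosts []string, check func(string) bool) map[string]bool {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex // guards results
		results = make(map[string]bool, len(hosts))
	)
	for _, h := range hosts {
		wg.Add(1)
		go func(host string) {
			defer wg.Done()
			ok := check(host)
			mu.Lock()
			results[host] = ok
			mu.Unlock()
		}(h)
	}
	wg.Wait() // the next cycle would start only after this returns
	return results
}

func main() {
	hosts := []string{"a.example.com", "b.example.com", "c.example.com"}
	start := time.Now()
	res := checkAll(hosts, func(string) bool {
		time.Sleep(100 * time.Millisecond) // simulated check latency
		return true
	})
	fmt.Printf("%d hosts checked in about %v\n",
		len(res), time.Since(start).Round(10*time.Millisecond))
}
```

Because the simulated checks overlap, the total elapsed time is close to one check's latency rather than the sum of all three.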

### Database Storage

Uptime data is stored efficiently:

**UptimeHost table:**

- `status`: Current status ("up"/"down")
- `failure_count`: Consecutive failure counter
- `last_check`: Timestamp of last check
- `response_time`: Last successful response time

**UptimeMonitor table:**

- Links monitors to proxy hosts
- Stores check configuration
- Tracks enabled state

**Heartbeat records** (future):

- Detailed history of each check
- Used for uptime percentage calculations
- Queryable for historical analysis

## Best Practices

### 1. Monitor Critical Services Only

Don't monitor every host. Focus on:

- Production services
- User-facing applications
- External dependencies
- High-availability requirements

**Skip monitoring for:**

- Development/test instances
- Internal tools with built-in redundancy
- Services with their own monitoring

### 2. Configure Appropriate Notifications

**Critical services:**

- Multiple notification channels (Discord + Slack)
- Immediate alerts (no batching)
- On-call team notifications

**Non-critical services:**

- Single notification channel
- Digest/batch notifications (future feature)
- Email to team (low priority)

### 3. Review False Positives

If you receive false alarms:

1. Check logs to understand why
2. Adjust timeout if needed
3. Verify network stability
4. Consider increasing failure threshold (future config option)

### 4. Regular Status Review

Weekly review of:

- Uptime percentages (identify problematic hosts)
- Response time trends (detect degradation)
- Notification frequency (too many alerts?)
- False positive rate (refine configuration)

### 5. Combine with Application Monitoring

Uptime monitoring checks **availability**, not **functionality**.

Complement with:

- Application-level health checks
- Error rate monitoring
- Performance metrics (APM tools)
- User experience monitoring

## Planned Improvements

Future enhancements under consideration:

- [ ] **HTTP health check support** - Check specific endpoints with status code validation
- [ ] **Configurable failure threshold** - Adjust consecutive failure count via UI
- [ ] **Custom check intervals per host** - Different intervals for different criticality levels
- [ ] **Response time alerts** - Notify on degraded performance, not just failures
- [ ] **Notification batching** - Group multiple alerts to reduce noise
- [ ] **Maintenance windows** - Disable alerts during scheduled maintenance
- [ ] **Historical graphs** - Visual uptime trends over time
- [ ] **Status page export** - Public status page for external visibility

## Monitoring the Monitors

How do you know if Charon's monitoring is working?

**Check Charon's own health:**

```bash
# Verify check cycle is running
docker logs charon 2>&1 | grep "All host checks completed" | tail -5

# Confirm recent checks happened
docker logs charon 2>&1 | grep "Host TCP check completed" | tail -20

# Look for any errors in monitoring system
docker logs charon 2>&1 | grep "ERROR.*uptime\|ERROR.*monitor"
```

**Expected log pattern:**

```
INFO[...] All host checks completed host_count=5
DEBUG[...] Host TCP check completed elapsed_ms=156 host_name=example.com success=true
```

**Warning signs:**

- No "All host checks completed" messages in recent logs
- Checks taking longer than expected (>30s with 10s timeout)
- Frequent timeout errors
- High failure_count values

## API Integration

Uptime monitoring data is accessible via API:

**Get uptime status:**

```http
GET /api/uptime/hosts
Authorization: Bearer <token>
```

**Response:**

```json
{
  "hosts": [
    {
      "id": "123",
      "name": "example.com",
      "status": "up",
      "last_check": "2025-12-24T10:30:00Z",
      "response_time": 156,
      "failure_count": 0,
      "uptime_percentage": 99.8
    }
  ]
}
```
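
A client can decode this response with a pair of small structs. The Go sketch below mirrors the field names in the sample response; the `UptimeHost`, `hostsResponse`, and `downHosts` names are hypothetical, and the struct covers only the fields shown above, so treat it as a starting point rather than a complete client.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// UptimeHost mirrors one entry of the "hosts" array in the sample response.
type UptimeHost struct {
	ID               string    `json:"id"`
	Name             string    `json:"name"`
	Status           string    `json:"status"`
	LastCheck        time.Time `json:"last_check"`
	ResponseTime     int       `json:"response_time"` // milliseconds, as in the sample
	FailureCount     int       `json:"failure_count"`
	UptimePercentage float64   `json:"uptime_percentage"`
}

// hostsResponse is the top-level envelope of the response.
type hostsResponse struct {
	Hosts []UptimeHost `json:"hosts"`
}

// downHosts decodes a response body and returns the names of all hosts
// whose status is "down".
func downHosts(body []byte) ([]string, error) {
	var resp hostsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	var names []string
	for _, h := range resp.Hosts {
		if h.Status == "down" {
			names = append(names, h.Name)
		}
	}
	return names, nil
}

func main() {
	body := []byte(`{"hosts":[{"id":"123","name":"example.com","status":"up",
		"last_check":"2025-12-24T10:30:00Z","response_time":156,
		"failure_count":0,"uptime_percentage":99.8}]}`)
	down, err := downHosts(body)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("%d host(s) down: %v\n", len(down), down)
}
```

Fetching the body over HTTP with the bearer token is left out so the example runs offline; the decoding step is the same either way.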

**Programmatic monitoring:**

Use this API to integrate Charon's uptime data with:

- External monitoring dashboards (Grafana, etc.)
- Incident response systems (PagerDuty, etc.)
- Custom alerting tools
- Status page generators

## Additional Resources

- [Notification Configuration Guide](notifications.md)
- [Proxy Host Setup](../getting-started.md)
- [Troubleshooting Guide](../troubleshooting/)
- [Security Best Practices](../security.md)

## Need Help?

- 💬 [Ask in Discussions](https://github.com/Wikid82/charon/discussions)
- 🐛 [Report Issues](https://github.com/Wikid82/charon/issues)
- 📖 [View Full Documentation](https://wikid82.github.io/charon/)