feat: add JSON template support for all services and fix uptime monitoring reliability
BREAKING CHANGE: None - fully backward compatible

Changes:
- feat(notifications): extend JSON templates to Discord, Slack, Gotify, and generic
- fix(uptime): resolve race conditions and false positives with failure debouncing
- chore(tests): add comprehensive test coverage (86.2% backend, 87.61% frontend)
- docs: add feature guides and manual test plan

Technical Details:
- Added supportsJSONTemplates() helper for service capability detection
- Renamed sendCustomWebhook → sendJSONPayload for clarity
- Added FailureCount field requiring 2 consecutive failures before marking down
- Implemented WaitGroup synchronization and host-specific mutexes
- Increased TCP timeout to 10s with 2 retry attempts
- Added template security: 5s timeout, 10KB size limit
- All security scans pass (CodeQL, Trivy)
+162
-30
@@ -749,30 +749,58 @@ The animations tell you what's happening so you don't think it's broken.
## 📊 Uptime Monitoring

**What it does:** Automatically checks if your websites are responding every minute.
**What it does:** Continuously monitors your proxy hosts for availability with intelligent failure detection to minimize false positives.

**Why you care:** Get visibility into uptime history and response times for all your proxy hosts.
**Why you care:** Get accurate visibility into uptime history, response times, and real outages without noise from transient network issues.

**What you do:** View the "Uptime" page in the sidebar. Uptime checks run automatically in the background.
**What you do:** Enable uptime monitoring per proxy host or use bulk operations. View status on the "Uptime" page in the sidebar.

**Optional:** You can disable this feature in System Settings → Optional Features if you don't need it.
Your uptime history will be preserved.

### Key Features

**Failure Debouncing**: Requires **2 consecutive failures** before marking a host as "down"
- Prevents false alarms from transient network hiccups
- Container restarts don't trigger unnecessary alerts
- Single TCP timeouts are logged but don't change status

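The debouncing rule above can be sketched as a small state machine. This is only an illustration under assumptions: the `FailureCount` field name comes from the commit message, while the `HostState` type, the `Record` method, and the `failureThreshold` constant are hypothetical names chosen for the example and are not taken from Charon's code.

```go
package main

import "fmt"

// failureThreshold mirrors the documented "2 consecutive failures" rule.
const failureThreshold = 2

// HostState is a hypothetical per-host record; only the FailureCount
// field name is taken from the commit message.
type HostState struct {
	Status       string // "up" or "down"
	FailureCount int
}

// Record applies one check result and reports whether the status changed.
func (h *HostState) Record(ok bool) bool {
	prev := h.Status
	if ok {
		// Any success resets the counter and marks the host up.
		h.FailureCount = 0
		h.Status = "up"
	} else {
		// A failure only flips the status once the threshold is reached.
		h.FailureCount++
		if h.FailureCount >= failureThreshold {
			h.Status = "down"
		}
	}
	return h.Status != prev
}

func main() {
	h := &HostState{Status: "up"}
	// One transient failure, a recovery, then two failures in a row.
	for _, ok := range []bool{false, true, false, false} {
		changed := h.Record(ok)
		fmt.Printf("ok=%v status=%s failures=%d changed=%v\n",
			ok, h.Status, h.FailureCount, changed)
	}
}
```

Returning whether the status changed is one way to send an alert only on a real transition rather than on every failed check.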
**Automatic Retries**: Up to 2 retry attempts per check with a 2-second delay
- Handles slow networks and warm-up periods
- 10-second timeout per attempt (increased from 5s)
- Total check time: up to 22 seconds for marginal hosts (two 10-second attempts plus one 2-second delay)

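A minimal sketch of the retry behavior and the 22-second bound, assuming the figures above describe two attempts in total. The function names (`checkWithRetry`, `worstCase`, `tcpDial`, `failNTimes`) and the `dialFunc` type are hypothetical; only `net.DialTimeout` is a real standard-library call. The dialer is injectable so the example runs without touching the network.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// Values taken from the documentation above.
const (
	attemptTimeout = 10 * time.Second
	retryDelay     = 2 * time.Second
	maxAttempts    = 2
)

// dialFunc lets callers swap the real dialer for a fake one.
type dialFunc func(addr string, timeout time.Duration) error

// tcpDial is a real dialer built on net.DialTimeout.
func tcpDial(addr string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return err
	}
	return conn.Close()
}

// failNTimes returns a fake dialer that fails n times, then succeeds.
func failNTimes(n int) dialFunc {
	calls := 0
	return func(string, time.Duration) error {
		calls++
		if calls <= n {
			return errors.New("connection refused")
		}
		return nil
	}
}

// checkWithRetry tries up to attempts times, sleeping delay between tries.
// It returns the number of attempts used and the last error (nil on success).
func checkWithRetry(dial dialFunc, addr string, attempts int, timeout, delay time.Duration) (int, error) {
	var err error
	for i := 1; i <= attempts; i++ {
		if err = dial(addr, timeout); err == nil {
			return i, nil
		}
		if i < attempts {
			time.Sleep(delay)
		}
	}
	return attempts, err
}

// worstCase is the upper bound when every attempt runs into its timeout.
func worstCase(attempts int, timeout, delay time.Duration) time.Duration {
	return time.Duration(attempts)*timeout + time.Duration(attempts-1)*delay
}

func main() {
	fmt.Println("worst case:", worstCase(maxAttempts, attemptTimeout, retryDelay))
	// Fake dialer: first attempt fails, second succeeds. A short delay keeps the demo fast.
	used, err := checkWithRetry(failNTimes(1), "172.20.0.11:5690", maxAttempts, attemptTimeout, 10*time.Millisecond)
	fmt.Println("attempts used:", used, "err:", err)
}
```

Passing `tcpDial` instead of the fake dialer would perform real TCP connections with the same retry logic.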
**Concurrent Processing**: All host checks run in parallel
- Fast overall check times even with many hosts
- No single slow host blocks others
- Synchronized completion prevents race conditions

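The parallel-with-synchronized-completion pattern can be sketched with `sync.WaitGroup`, which the commit message names. Everything else here is an assumption for illustration: `checkAll` is a hypothetical function, and a single mutex guarding the result map stands in for the host-specific mutexes the commit mentions.

```go
package main

import (
	"fmt"
	"sync"
)

// checkAll runs check for every host in parallel and returns only after all
// checks have finished, so callers never observe a partially written result.
func checkAll(hosts []string, check func(host string) bool) map[string]bool {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex // guards results
		results = make(map[string]bool, len(hosts))
	)
	for _, h := range hosts {
		wg.Add(1)
		go func(host string) {
			defer wg.Done()
			up := check(host) // a slow host only delays its own goroutine
			mu.Lock()
			results[host] = up
			mu.Unlock()
		}(h)
	}
	wg.Wait() // synchronized completion: block until every check is done
	return results
}

func main() {
	hosts := []string{"172.20.0.11", "172.20.0.12", "172.20.0.13"}
	// Fake check that pretends one host is down; no network access.
	res := checkAll(hosts, func(h string) bool { return h != "172.20.0.12" })
	for _, h := range hosts {
		fmt.Println(h, "up:", res[h])
	}
}
```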
**Status Consistency**: Checks complete before UI reads database
- Eliminates stale status during page refreshes
- No race conditions between checks and API calls
- Reliable status display across rapid refreshes

### How Uptime Checks Work

Charon uses a **two-level check system** for efficient monitoring:
Charon uses a **two-level check system** with enhanced reliability:

#### Level 1: Host-Level Pre-Check (TCP)
#### Level 1: Host-Level Pre-Check (TCP with Retries)

**What it does:** Quickly tests if the backend host/container is reachable via TCP connection.
**What it does:** Tests if the backend host/container is reachable via TCP connection with automatic retry on failure.

**How it works:**
- Groups monitors by their backend IP address (e.g., `172.20.0.11`)
- Attempts TCP connection to the actual backend port (e.g., port `5690` for Wizarr)
- If successful → Proceeds to Level 2 checks
- **First failure**: Increments failure counter, status unchanged, waits 2s and retries
- **Retry success**: Resets failure counter to 0, marks host as "up"
- **Second consecutive failure**: Marks host as "down" after reaching threshold
- If failed → Marks all monitors on that host as "down" (skips Level 2)
- If successful → Proceeds to Level 2 checks

**Why it matters:** Avoids redundant HTTP checks when an entire backend container is stopped or unreachable.
**Why it matters:**
- Avoids redundant HTTP checks when an entire backend container is stopped or unreachable
- Prevents false "down" alerts from single network hiccups
- Handles slow container startups gracefully

**Technical detail:** Uses the `forward_port` from your proxy host configuration, not the public URL port.
This ensures correct connectivity checks for services on non-standard ports.
@@ -795,19 +823,63 @@ This ensures correct connectivity checks for services on non-standard ports.
### When Things Go Wrong

**Scenario 1: Backend container stopped**
- Level 1: TCP connection fails ❌
- Level 1: TCP connection fails (attempt 1) ❌
- Level 1: TCP connection fails (attempt 2) ❌
- Failure count: 2 → Host marked "down"
- Level 2: Skipped
- Status: "down" with message "Host unreachable"

**Scenario 2: Service crashed but container running**
**Scenario 2: Transient network issue**
- Level 1: TCP connection fails (attempt 1) ❌
- Failure count: 1 (threshold not met)
- Status: Remains "up"
- Next check: Success ✅ → Failure count reset to 0

**Scenario 3: Service crashed but container running**
- Level 1: TCP connection succeeds ✅
- Level 2: HTTP request fails or returns 500 ❌
- Status: "down" with specific HTTP error

**Scenario 3: Everything working**
**Scenario 4: Everything working**
- Level 1: TCP connection succeeds ✅
- Level 2: HTTP request succeeds ✅
- Status: "up" with latency measurement
- Failure count: 0

### Troubleshooting False Positives

**Issue**: Host shows "down" but service is accessible

**Common causes**:
1. **Timeout too short**: Increase from 10s if network is slow
2. **Container warmup**: Service takes >10s to respond during startup
3. **Firewall blocking**: Ensure Charon container can reach proxy host ports

**Check logs**:
```bash
docker logs charon 2>&1 | grep "Host TCP check completed"
docker logs charon 2>&1 | grep "Retrying TCP check"
docker logs charon 2>&1 | grep "failure_count"
```

**Solution**: The improved debouncing should handle most transient issues automatically. If problems persist, see [Uptime Monitoring Troubleshooting Guide](features/uptime-monitoring.md#troubleshooting).

### Configuration

**Per-Host**: Edit any proxy host and toggle "Enable Uptime Monitoring"

**Bulk Operations**:
1. Select multiple hosts (checkboxes)
2. Click "Bulk Apply"
3. Toggle "Uptime Monitoring" section
4. Apply changes

**Default check interval**: 60 seconds
**Default timeout per attempt**: 10 seconds
**Default max retries**: 2 attempts
**Failure threshold**: 2 consecutive failures

**For complete troubleshooting guide and advanced topics, see [Uptime Monitoring Guide](features/uptime-monitoring.md).**

---

@@ -938,43 +1010,103 @@ Uses WebSocket technology to stream logs with zero delay.
### Notification System

**What it does:** Sends alerts when security events match your configured criteria.
**What it does:** Sends alerts when security events, uptime changes, or SSL certificate events occur through multiple channels with rich formatting support.

**Where to configure:** Cerberus Dashboard → "Notification Settings" button (top-right)
**Where to configure:** Settings → Notifications

**Supported Services:**

| Service | JSON Templates | Rich Formatting | Notes |
|---------|----------------|-----------------|-------|
| Discord | ✅ Yes | Embeds, colors, fields | Webhook-based, rich embeds |
| Slack | ✅ Yes | Block Kit, markdown | Incoming webhooks |
| Gotify | ✅ Yes | Priority, extras | Self-hosted push notifications |
| Generic | ✅ Yes | Custom JSON | Any webhook-compatible service |
| Telegram | ❌ No | Markdown only | Bot API, URL parameters |

**Settings:**

- **Enable/Disable** — Master toggle for all notifications
- **Minimum Log Level** — Only notify for warnings and errors (ignore info/debug)
- **Provider Type** — Choose your notification service
- **Template Style** — Minimal, Detailed, or Custom JSON
- **Event Types:**
  - SSL certificate events (issued, renewed, failed)
  - Uptime monitoring (host down, host recovered)
  - WAF blocks (when the firewall stops an attack)
  - ACL denials (when access control rules block a request)
  - Rate limit hits (when traffic thresholds are exceeded)
- **Webhook URL** — Send alerts to Discord, Slack, or custom integrations
- **Email Recipients** — Comma-separated list of email addresses
- **Webhook URL** — Service-specific webhook endpoint
- **Custom JSON** — Full control over notification format

**Template Styles:**

**Minimal Template** — Clean, simple text notifications:
```json
{
  "content": "{{.Title}}: {{.Message}}"
}
```

**Detailed Template** — Rich formatting with all event details:
```json
{
  "embeds": [{
    "title": "{{.Title}}",
    "description": "{{.Message}}",
    "color": {{.Color}},
    "timestamp": "{{.Timestamp}}",
    "fields": [
      {"name": "Event Type", "value": "{{.EventType}}", "inline": true},
      {"name": "Host", "value": "{{.HostName}}", "inline": true}
    ]
  }]
}
```

**Custom Template** — Design your own structure with template variables:
- `{{.Title}}` — Event title (e.g., "SSL Certificate Renewed")
- `{{.Message}}` — Event details
- `{{.EventType}}` — Event classification (ssl_renewal, uptime_down, waf_block)
- `{{.Severity}}` — Alert level (info, warning, error)
- `{{.HostName}}` — Affected proxy host
- `{{.Timestamp}}` — ISO 8601 formatted timestamp
- `{{.Color}}` — Color code for Discord embeds
- `{{.Priority}}` — Numeric priority for Gotify (1-10)

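To show how the template variables above might be filled in, here is a minimal sketch using Go's standard `text/template` package, which the documentation names as the template syntax. The `Event` struct and the `render` function are hypothetical stand-ins, not Charon's actual types. Because `text/template` performs no JSON escaping, the sketch checks the rendered output with `json.Valid`; whether Charon validates in this way is an assumption.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"text/template"
)

// Event carries the documented template variables. The struct itself is
// hypothetical; only the variable names come from the list above.
type Event struct {
	Title, Message, EventType, Severity, HostName, Timestamp string
	Color, Priority                                           int
}

// render executes a JSON template and verifies the output parses as JSON.
func render(tmpl string, ev Event) (string, error) {
	t, err := template.New("payload").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, ev); err != nil {
		return "", err
	}
	// text/template does not escape quotes, so a value containing `"`
	// can produce invalid JSON; reject such payloads here.
	if !json.Valid(buf.Bytes()) {
		return "", fmt.Errorf("rendered payload is not valid JSON: %s", buf.String())
	}
	return buf.String(), nil
}

func main() {
	ev := Event{Title: "SSL Certificate Renewed", Message: "example.com renewed", Color: 3066993}
	out, err := render(`{"content": "{{.Title}}: {{.Message}}"}`, ev)
	fmt.Println(out, err)
}
```

Note that `{{.Color}}` and `{{.Priority}}` are rendered as bare numbers, which is why the Detailed Template leaves `{{.Color}}` unquoted.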
**Example use cases:**

- Get a Slack message when your site is under attack
- Email yourself when ACL rules block legitimate traffic (false positive alert)
- Send all WAF blocks to your SIEM system for analysis
- Get a Discord notification with rich embed when SSL certificates renew
- Receive Slack Block Kit messages when monitored hosts go down
- Send all WAF blocks to your SIEM system with custom JSON format
- Get high-priority Gotify alerts for critical security events
- Email yourself when ACL rules block legitimate traffic (future feature)

**What you do:**

1. Go to Cerberus Dashboard
2. Click "Notification Settings"
3. Enable notifications
4. Set minimum level to "warn" or "error"
5. Choose which event types to monitor
6. Add your webhook URL or email addresses
7. Save
1. Go to **Settings → Notifications**
2. Click **"Add Provider"**
3. Select service type (Discord, Slack, Gotify, etc.)
4. Enter webhook URL
5. Choose template style or create custom JSON
6. Select event types to monitor
7. Click **"Send Test"** to verify
8. Save configuration

**Technical details:**

- Notifications respect the minimum log level (e.g., only send errors)
- Webhook payloads include full event context (IP, request details, rule matched)
- Email delivery requires SMTP configuration (future feature)
- Templates support Go text/template syntax for advanced formatting
- SSRF protection validates all webhook URLs before saving and sending
- Webhook retries with exponential backoff on failure
- Failed notifications are logged for troubleshooting
- Custom templates are validated before saving

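As a rough illustration of what SSRF validation of a webhook URL can involve, the sketch below rejects non-HTTP schemes and literal IP addresses in loopback, private, link-local, or unspecified ranges. The rules Charon actually applies are not stated in this document, so the function name `validateWebhookURL` and the exact set of blocked ranges are assumptions. Hostnames are not resolved here; a fuller check would also resolve them and test every returned address.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
)

// validateWebhookURL is a hypothetical validator, not Charon's implementation.
// It only inspects literal IP hosts; hostnames pass through unresolved.
func validateWebhookURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return errors.New("scheme must be http or https")
	}
	host := u.Hostname() // strips any port
	if host == "" {
		return errors.New("missing host")
	}
	if ip := net.ParseIP(host); ip != nil {
		// Assumed block list: addresses that usually point at internal services.
		if ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() || ip.IsUnspecified() {
			return fmt.Errorf("address %s is not allowed", ip)
		}
	}
	return nil
}

func main() {
	for _, raw := range []string{
		"https://discord.com/api/webhooks/example",
		"http://127.0.0.1/hook",
		"http://169.254.169.254/latest",
	} {
		fmt.Println(raw, "->", validateWebhookURL(raw))
	}
}
```

A self-hosted target such as Gotify on a private address would be rejected by this particular block list, so a real deployment would likely need an allow-list or a configurable policy.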
**For complete examples and service-specific guides, see [Notification Configuration Guide](features/notifications.md).**

**Minimum Log Level** (Legacy Setting):

For backward compatibility, you can still configure minimum log level for security event notifications:
- Only notify for warnings and errors (ignore info/debug)
- Applies to Cerberus security events only
- Accessible via Cerberus Dashboard → "Notification Settings"

---