feat: add JSON template support for all services and fix uptime monitoring reliability

BREAKING CHANGE: None - fully backward compatible

Changes:
- feat(notifications): extend JSON templates to Discord, Slack, Gotify, and generic
- fix(uptime): resolve race conditions and false positives with failure debouncing
- chore(tests): add comprehensive test coverage (86.2% backend, 87.61% frontend)
- docs: add feature guides and manual test plan

Technical Details:
- Added supportsJSONTemplates() helper for service capability detection
- Renamed sendCustomWebhook → sendJSONPayload for clarity
- Added FailureCount field requiring 2 consecutive failures before marking down
- Implemented WaitGroup synchronization and host-specific mutexes
- Increased TCP timeout to 10s with 2 retry attempts
- Added template security: 5s timeout, 10KB size limit
- All security scans pass (CodeQL, Trivy)
GitHub Actions
2025-12-24 20:34:38 +00:00
parent 0133d64866
commit b5c066d25d
21 changed files with 4933 additions and 1656 deletions
+162 -30
@@ -749,30 +749,58 @@ The animations tell you what's happening so you don't think it's broken.
## 📊 Uptime Monitoring
**What it does:** Automatically checks if your websites are responding every minute.
**What it does:** Continuously monitors your proxy hosts for availability with intelligent failure detection to minimize false positives.
**Why you care:** Get visibility into uptime history and response times for all your proxy hosts.
**Why you care:** Get accurate visibility into uptime history, response times, and real outages without noise from transient network issues.
**What you do:** View the "Uptime" page in the sidebar. Uptime checks run automatically in the background.
**What you do:** Enable uptime monitoring per proxy host or use bulk operations. View status on the "Uptime" page in the sidebar.
**Optional:** You can disable this feature in System Settings → Optional Features if you don't need it.
Your uptime history will be preserved.
### Key Features
**Failure Debouncing**: Requires **2 consecutive failures** before marking a host as "down"
- Prevents false alarms from transient network hiccups
- Container restarts don't trigger unnecessary alerts
- Single TCP timeouts are logged but don't change status
**Automatic Retries**: Up to 2 retry attempts per check with 2-second delay
- Handles slow networks and warm-up periods
- 10-second timeout per attempt (increased from 5s)
- Total check time: up to 22 seconds for marginal hosts (two 10-second attempts plus one 2-second delay)
**Concurrent Processing**: All host checks run in parallel
- Fast overall check times even with many hosts
- No single slow host blocks others
- Synchronized completion prevents race conditions
**Status Consistency**: Checks complete before UI reads database
- Eliminates stale status during page refreshes
- No race conditions between checks and API calls
- Reliable status display across rapid refreshes
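The concurrent processing and synchronized completion described above can be sketched as follows. This is a minimal illustration, not Charon's actual code: `hostLocks`, `checkAll`, and the `check` callback are hypothetical names, and the only details taken from the commit message are the use of a `sync.WaitGroup` and host-specific mutexes.

```go
package main

import (
	"fmt"
	"sync"
)

// hostLocks hands out one mutex per host so that two checks for the
// same host never run at the same time (hypothetical helper).
type hostLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (h *hostLocks) forHost(host string) *sync.Mutex {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.locks == nil {
		h.locks = make(map[string]*sync.Mutex)
	}
	l, ok := h.locks[host]
	if !ok {
		l = &sync.Mutex{}
		h.locks[host] = l
	}
	return l
}

// checkAll runs check for every host in parallel and returns only after
// all checks have finished, so callers never read partial results.
func checkAll(hosts []string, locks *hostLocks, check func(host string) bool) map[string]bool {
	results := make(map[string]bool, len(hosts))
	var resultsMu sync.Mutex
	var wg sync.WaitGroup

	for _, host := range hosts {
		wg.Add(1)
		go func(host string) {
			defer wg.Done()

			l := locks.forHost(host)
			l.Lock()
			up := check(host)
			l.Unlock()

			resultsMu.Lock()
			results[host] = up
			resultsMu.Unlock()
		}(host)
	}

	wg.Wait() // synchronized completion: block until every goroutine is done
	return results
}

func main() {
	hosts := []string{"172.20.0.11:5690", "172.20.0.12:8080"}
	results := checkAll(hosts, &hostLocks{}, func(host string) bool {
		// Stand-in for a real TCP check.
		return host == "172.20.0.11:5690"
	})
	fmt.Println(len(results), "hosts checked")
}
```

Because `wg.Wait()` returns only after every goroutine has stored its result, a slow host delays the overall completion but never blocks the other checks from starting.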
### How Uptime Checks Work
Charon uses a **two-level check system** for efficient monitoring:
Charon uses a **two-level check system** with enhanced reliability:
#### Level 1: Host-Level Pre-Check (TCP)
#### Level 1: Host-Level Pre-Check (TCP with Retries)
**What it does:** Quickly tests if the backend host/container is reachable via TCP connection.
**What it does:** Tests if the backend host/container is reachable via TCP connection with automatic retry on failure.
**How it works:**
- Groups monitors by their backend IP address (e.g., `172.20.0.11`)
- Attempts TCP connection to the actual backend port (e.g., port `5690` for Wizarr)
- If successful → Proceeds to Level 2 checks
- **First failure**: Increments failure counter, status unchanged, waits 2s and retries
- **Retry success**: Resets failure counter to 0, marks host as "up"
- **Second consecutive failure**: Marks host as "down" after reaching threshold
- If failed → Marks all monitors on that host as "down" (skips Level 2)
- If successful → Proceeds to Level 2 checks
**Why it matters:** Avoids redundant HTTP checks when an entire backend container is stopped or unreachable.
**Why it matters:**
- Avoids redundant HTTP checks when an entire backend container is stopped or unreachable
- Prevents false "down" alerts from single network hiccups
- Handles slow container startups gracefully
**Technical detail:** Uses the `forward_port` from your proxy host configuration, not the public URL port.
This ensures correct connectivity checks for services on non-standard ports.
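The Level 1 retry behaviour can be sketched roughly as below. The function names `checkWithRetry` and `tcpProbe` are assumptions made for illustration; the 10-second timeout, 2-second delay, and two attempts come from the text above. The probe is passed in as a callback so the retry loop can be exercised without a network.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// checkWithRetry calls probe up to maxAttempts times, sleeping for delay
// between attempts. It returns whether any attempt succeeded and how many
// attempts were used (hypothetical helper, not Charon's API).
func checkWithRetry(maxAttempts int, delay time.Duration, probe func() bool) (bool, int) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if probe() {
			return true, attempt
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
		}
	}
	return false, maxAttempts
}

// tcpProbe returns a probe that succeeds when a TCP connection to addr
// can be opened within timeout.
func tcpProbe(addr string, timeout time.Duration) func() bool {
	return func() bool {
		conn, err := net.DialTimeout("tcp", addr, timeout)
		if err != nil {
			return false
		}
		conn.Close()
		return true
	}
}

func main() {
	// A local listener stands in for a backend such as 172.20.0.11:5690.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		fmt.Println("cannot open local listener:", err)
		return
	}
	defer ln.Close()

	up, attempts := checkWithRetry(2, 2*time.Second, tcpProbe(ln.Addr().String(), 10*time.Second))
	fmt.Println("reachable:", up, "attempts:", attempts)
}
```

With two attempts, a 10-second timeout, and a 2-second delay, the slowest possible run of this sketch takes 10 + 2 + 10 = 22 seconds.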
@@ -795,19 +823,63 @@ This ensures correct connectivity checks for services on non-standard ports.
### When Things Go Wrong
**Scenario 1: Backend container stopped**
- Level 1: TCP connection fails ❌
- Level 1: TCP connection fails (attempt 1)
- Level 1: TCP connection fails (attempt 2) ❌
- Failure count: 2 → Host marked "down"
- Level 2: Skipped
- Status: "down" with message "Host unreachable"
**Scenario 2: Service crashed but container running**
**Scenario 2: Transient network issue**
- Level 1: TCP connection fails (attempt 1) ❌
- Failure count: 1 (threshold not met)
- Status: Remains "up"
- Next check: Success ✅ → Failure count reset to 0
**Scenario 3: Service crashed but container running**
- Level 1: TCP connection succeeds ✅
- Level 2: HTTP request fails or returns 500 ❌
- Status: "down" with specific HTTP error
**Scenario 3: Everything working**
**Scenario 4: Everything working**
- Level 1: TCP connection succeeds ✅
- Level 2: HTTP request succeeds ✅
- Status: "up" with latency measurement
- Failure count: 0
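The scenarios above follow from a small debouncing rule, sketched here under assumptions: `hostState` and `record` are hypothetical names, while the `FailureCount` field and the threshold of 2 come from the commit message. Each failed attempt is recorded individually, which is one reading of the Level 1 description.

```go
package main

import "fmt"

// failureThreshold is the number of consecutive failures required
// before a host is marked "down".
const failureThreshold = 2

// hostState is a hypothetical stand-in for the stored host record.
type hostState struct {
	Status       string
	FailureCount int
}

// record applies one check result: a success resets the counter and marks
// the host "up"; a failure changes the status only at the threshold.
func (h *hostState) record(success bool) {
	if success {
		h.FailureCount = 0
		h.Status = "up"
		return
	}
	h.FailureCount++
	if h.FailureCount >= failureThreshold {
		h.Status = "down"
	}
}

func main() {
	h := &hostState{Status: "up"}

	h.record(false)                       // transient failure (Scenario 2)
	fmt.Println(h.Status, h.FailureCount) // prints "up 1"

	h.record(false)                       // second consecutive failure (Scenario 1)
	fmt.Println(h.Status, h.FailureCount) // prints "down 2"

	h.record(true)                        // recovery
	fmt.Println(h.Status, h.FailureCount) // prints "up 0"
}
```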
### Troubleshooting False Positives
**Issue**: Host shows "down" but service is accessible
**Common causes**:
1. **Timeout too short**: Increase from 10s if network is slow
2. **Container warmup**: Service takes >10s to respond during startup
3. **Firewall blocking**: Ensure Charon container can reach proxy host ports
**Check logs**:
```bash
docker logs charon 2>&1 | grep "Host TCP check completed"
docker logs charon 2>&1 | grep "Retrying TCP check"
docker logs charon 2>&1 | grep "failure_count"
```
**Solution**: The improved debouncing should handle most transient issues automatically. If problems persist, see [Uptime Monitoring Troubleshooting Guide](features/uptime-monitoring.md#troubleshooting).
### Configuration
**Per-Host**: Edit any proxy host and toggle "Enable Uptime Monitoring"
**Bulk Operations**:
1. Select multiple hosts (checkboxes)
2. Click "Bulk Apply"
3. Toggle "Uptime Monitoring" section
4. Apply changes
**Default check interval**: 60 seconds
**Default timeout per attempt**: 10 seconds
**Default max retries**: 2 attempts
**Failure threshold**: 2 consecutive failures
**For complete troubleshooting guide and advanced topics, see [Uptime Monitoring Guide](features/uptime-monitoring.md).**
---
@@ -938,43 +1010,103 @@ Uses WebSocket technology to stream logs with zero delay.
### Notification System
**What it does:** Sends alerts when security events match your configured criteria.
**What it does:** Sends alerts when security events, uptime changes, or SSL certificate events occur through multiple channels with rich formatting support.
**Where to configure:** Cerberus Dashboard → "Notification Settings" button (top-right)
**Where to configure:** Settings → Notifications
**Supported Services:**
| Service | JSON Templates | Rich Formatting | Notes |
|---------|----------------|-----------------|-------|
| Discord | ✅ Yes | Embeds, colors, fields | Webhook-based, rich embeds |
| Slack | ✅ Yes | Block Kit, markdown | Incoming webhooks |
| Gotify | ✅ Yes | Priority, extras | Self-hosted push notifications |
| Generic | ✅ Yes | Custom JSON | Any webhook-compatible service |
| Telegram | ❌ No | Markdown only | Bot API, URL parameters |
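The capability table above maps naturally onto the `supportsJSONTemplates()` helper named in the commit message. The sketch below is a guess at its shape: the string parameter and the lowercase service identifiers are assumptions, not the project's confirmed signature.

```go
package main

import (
	"fmt"
	"strings"
)

// supportsJSONTemplates reports whether a notification service accepts a
// JSON template, following the table above. The service identifiers used
// here are assumed for illustration.
func supportsJSONTemplates(service string) bool {
	switch strings.ToLower(service) {
	case "discord", "slack", "gotify", "generic":
		return true
	default:
		// Telegram and unknown services fall back to plain formatting.
		return false
	}
}

func main() {
	for _, s := range []string{"discord", "slack", "gotify", "generic", "telegram"} {
		fmt.Printf("%-8s JSON templates: %v\n", s, supportsJSONTemplates(s))
	}
}
```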
**Settings:**
- **Enable/Disable** — Master toggle for all notifications
- **Minimum Log Level** — Only notify for warnings and errors (ignore info/debug)
- **Provider Type** — Choose your notification service
- **Template Style** — Minimal, Detailed, or Custom JSON
- **Event Types:**
- SSL certificate events (issued, renewed, failed)
- Uptime monitoring (host down, host recovered)
- WAF blocks (when the firewall stops an attack)
- ACL denials (when access control rules block a request)
- Rate limit hits (when traffic thresholds are exceeded)
- **Webhook URL** — Send alerts to Discord, Slack, or custom integrations
- **Email Recipients** — Comma-separated list of email addresses
- **Webhook URL** — Service-specific webhook endpoint
- **Custom JSON** — Full control over notification format
**Template Styles:**
**Minimal Template** — Clean, simple text notifications:
```json
{
"content": "{{.Title}}: {{.Message}}"
}
```
**Detailed Template** — Rich formatting with all event details:
```json
{
"embeds": [{
"title": "{{.Title}}",
"description": "{{.Message}}",
"color": {{.Color}},
"timestamp": "{{.Timestamp}}",
"fields": [
{"name": "Event Type", "value": "{{.EventType}}", "inline": true},
{"name": "Host", "value": "{{.HostName}}", "inline": true}
]
}]
}
```
**Custom Template** — Design your own structure with template variables:
- `{{.Title}}` — Event title (e.g., "SSL Certificate Renewed")
- `{{.Message}}` — Event details
- `{{.EventType}}` — Event classification (ssl_renewal, uptime_down, waf_block)
- `{{.Severity}}` — Alert level (info, warning, error)
- `{{.HostName}}` — Affected proxy host
- `{{.Timestamp}}` — ISO 8601 formatted timestamp
- `{{.Color}}` — Color code for Discord embeds
- `{{.Priority}}` — Numeric priority for Gotify (1-10)
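To show how these variables are substituted, here is a minimal rendering sketch using Go's `text/template` package. The `notification` struct and `render` function are hypothetical. The 10KB cap mirrors the size limit mentioned in the commit message, but applying it to the rendered output is an assumption, and the 5-second timeout is not modelled.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"text/template"
)

// maxRenderedSize mirrors the 10KB limit from the commit message; applying
// it to the rendered payload is an assumption of this sketch.
const maxRenderedSize = 10 * 1024

// notification is a hypothetical data type exposing the template variables.
type notification struct {
	Title, Message, EventType, Severity, HostName, Timestamp string
	Color, Priority                                           int
}

// render executes a template against n and checks that the result is
// small enough and is valid JSON.
func render(tmplText string, n notification) (string, error) {
	tmpl, err := template.New("notification").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, n); err != nil {
		return "", err
	}
	if buf.Len() > maxRenderedSize {
		return "", errors.New("rendered payload exceeds 10KB")
	}
	if !json.Valid(buf.Bytes()) {
		return "", errors.New("rendered payload is not valid JSON")
	}
	return buf.String(), nil
}

func main() {
	out, err := render(`{"content": "{{.Title}}: {{.Message}}"}`, notification{
		Title:   "SSL Certificate Renewed",
		Message: "example.com renewed",
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(out)
}
```

Note that `text/template` does not escape values for JSON, so a message containing a double quote produces invalid JSON in this sketch. The `json.Valid` check catches that case rather than sending a malformed payload.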
**Example use cases:**
- Get a Slack message when your site is under attack
- Email yourself when ACL rules block legitimate traffic (false positive alert)
- Send all WAF blocks to your SIEM system for analysis
- Get a Discord notification with rich embed when SSL certificates renew
- Receive Slack Block Kit messages when monitored hosts go down
- Send all WAF blocks to your SIEM system with custom JSON format
- Get high-priority Gotify alerts for critical security events
- Email yourself when ACL rules block legitimate traffic (future feature)
**What you do:**
1. Go to Cerberus Dashboard
2. Click "Notification Settings"
3. Enable notifications
4. Set minimum level to "warn" or "error"
5. Choose which event types to monitor
6. Add your webhook URL or email addresses
7. Save
1. Go to **Settings → Notifications**
2. Click **"Add Provider"**
3. Select service type (Discord, Slack, Gotify, etc.)
4. Enter webhook URL
5. Choose template style or create custom JSON
6. Select event types to monitor
7. Click **"Send Test"** to verify
8. Save configuration
**Technical details:**
- Notifications respect the minimum log level (e.g., only send errors)
- Webhook payloads include full event context (IP, request details, rule matched)
- Email delivery requires SMTP configuration (future feature)
- Templates support Go text/template syntax for advanced formatting
- SSRF protection validates all webhook URLs before saving and sending
- Webhook retries with exponential backoff on failure
- Failed notifications are logged for troubleshooting
- Custom templates are validated before saving
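The exponential backoff mentioned above can be sketched as a delay schedule. The base delay of 1 second and the cap of 30 seconds are illustrative assumptions, as is the `backoffDelay` name; the source does not state the actual values.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values only; the real base delay and cap are not documented here.
const (
	baseDelay = 1 * time.Second
	maxDelay  = 30 * time.Second
)

// backoffDelay returns the wait before retry number attempt (0-based),
// doubling each time and never exceeding maxDelay.
func backoffDelay(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Printf("retry %d after %v\n", attempt, backoffDelay(attempt))
	}
}
```

Capping the delay inside the loop keeps the doubling from overflowing for large attempt numbers.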
**For complete examples and service-specific guides, see [Notification Configuration Guide](features/notifications.md).**
**Minimum Log Level** (Legacy Setting):
For backward compatibility, you can still configure minimum log level for security event notifications:
- Only notify for warnings and errors (ignore info/debug)
- Applies to Cerberus security events only
- Accessible via Cerberus Dashboard → "Notification Settings"
---