feat: add JSON template support for all services and fix uptime monitoring reliability

BREAKING CHANGE: None - fully backward compatible

Changes:
- feat(notifications): extend JSON templates to Discord, Slack, Gotify, and generic
- fix(uptime): resolve race conditions and false positives with failure debouncing
- chore(tests): add comprehensive test coverage (86.2% backend, 87.61% frontend)
- docs: add feature guides and manual test plan

Technical Details:
- Added supportsJSONTemplates() helper for service capability detection
- Renamed sendCustomWebhook → sendJSONPayload for clarity
- Added FailureCount field requiring 2 consecutive failures before marking down
- Implemented WaitGroup synchronization and host-specific mutexes
- Increased TCP timeout to 10s with 2 retry attempts
- Added template security: 5s timeout, 10KB size limit
- All security scans pass (CodeQL, Trivy)
2025-12-24 20:34:38 +00:00
parent 0133d64866
commit b5c066d25d
21 changed files with 4933 additions and 1656 deletions
@@ -749,30 +749,58 @@ The animations tell you what's happening so you don't think it's broken.

 ## 📊 Uptime Monitoring

-**What it does:** Automatically checks if your websites are responding every minute.
+**What it does:** Continuously monitors your proxy hosts for availability with intelligent failure detection to minimize false positives.

-**Why you care:** Get visibility into uptime history and response times for all your proxy hosts.
+**Why you care:** Get accurate visibility into uptime history, response times, and real outages without noise from transient network issues.

-**What you do:** View the "Uptime" page in the sidebar. Uptime checks run automatically in the background.
+**What you do:** Enable uptime monitoring per proxy host or use bulk operations. View status on the "Uptime" page in the sidebar.

 **Optional:** You can disable this feature in System Settings → Optional Features if you don't need it.
 Your uptime history will be preserved.

+### Key Features
+
+**Failure Debouncing**: Requires **2 consecutive failures** before marking a host as "down"
+- Prevents false alarms from transient network hiccups
+- Container restarts don't trigger unnecessary alerts
+- Single TCP timeouts are logged but don't change status
+
+**Automatic Retries**: Up to 2 retry attempts per check with 2-second delay
+- Handles slow networks and warm-up periods
+- 10-second timeout per attempt (increased from 5s)
+- Total check time: up to 22 seconds when both attempts time out (10s + 2s delay + 10s)
+
+**Concurrent Processing**: All host checks run in parallel
+- Fast overall check times even with many hosts
+- No single slow host blocks others
+- Synchronized completion prevents race conditions
+
+**Status Consistency**: Checks complete before UI reads database
+- Eliminates stale status during page refreshes
+- No race conditions between checks and API calls
+- Reliable status display across rapid refreshes
+
 ### How Uptime Checks Work

-Charon uses a **two-level check system** for efficient monitoring:
+Charon uses a **two-level check system** with enhanced reliability:

-#### Level 1: Host-Level Pre-Check (TCP)
+#### Level 1: Host-Level Pre-Check (TCP with Retries)

-**What it does:** Quickly tests if the backend host/container is reachable via TCP connection.
+**What it does:** Tests if the backend host/container is reachable via TCP connection with automatic retry on failure.

 **How it works:**
 - Groups monitors by their backend IP address (e.g., `172.20.0.11`)
 - Attempts TCP connection to the actual backend port (e.g., port `5690` for Wizarr)
- If successful → Proceeds to Level 2 checks
+- **First failure**: Increments failure counter, status unchanged, waits 2s and retries
+- **Retry success**: Resets failure counter to 0, marks host as "up"
+- **Second consecutive failure**: Marks host as "down" after reaching threshold
 - If failed → Marks all monitors on that host as "down" (skips Level 2)
+- If successful → Proceeds to Level 2 checks

-**Why it matters:** Avoids redundant HTTP checks when an entire backend container is stopped or unreachable.
+**Why it matters:**
+- Avoids redundant HTTP checks when an entire backend container is stopped or unreachable
+- Prevents false "down" alerts from single network hiccups
+- Handles slow container startups gracefully

 **Technical detail:** Uses the `forward_port` from your proxy host configuration, not the public URL port.
 This ensures correct connectivity checks for services on non-standard ports.
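
The debouncing rule described in this hunk (first failure only increments a counter, a success resets it, the second consecutive failure marks the host "down") can be sketched in Go roughly as follows. This is a minimal illustration, not the code from this commit: only the `FailureCount` field name comes from the commit message, while `HostState`, `Record`, and `failureThreshold` are hypothetical names chosen for the example.

```go
package main

import "fmt"

// failureThreshold mirrors the documented "2 consecutive failures" rule.
// The constant name is an assumption made for this sketch.
const failureThreshold = 2

// HostState is a hypothetical stand-in for the per-host record.
// Only the FailureCount field is named in the commit message.
type HostState struct {
	Status       string // "up" or "down"
	FailureCount int
}

// Record applies one check result to the host state.
// A success resets the counter and marks the host "up"; a failure only
// flips the status to "down" once the threshold is reached.
func (h *HostState) Record(ok bool) {
	if ok {
		h.FailureCount = 0
		h.Status = "up"
		return
	}
	h.FailureCount++
	if h.FailureCount >= failureThreshold {
		h.Status = "down"
	}
}

func main() {
	h := &HostState{Status: "up"}
	// One transient failure, a recovery, then two failures in a row.
	for _, ok := range []bool{false, true, false, false} {
		h.Record(ok)
		fmt.Printf("ok=%v status=%s failures=%d\n", ok, h.Status, h.FailureCount)
	}
}
```

Only the final result in the loop changes the status, which matches the "Transient network issue" behavior the docs describe.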
@@ -795,19 +823,63 @@ This ensures correct connectivity checks for services on non-standard ports.
 ### When Things Go Wrong

 **Scenario 1: Backend container stopped**
- Level 1: TCP connection fails ❌
+- Level 1: TCP connection fails (attempt 1) ❌
+- Level 1: TCP connection fails (attempt 2) ❌
+- Failure count: 2 → Host marked "down"
 - Level 2: Skipped
 - Status: "down" with message "Host unreachable"

-**Scenario 2: Service crashed but container running**
+**Scenario 2: Transient network issue**
+- Level 1: TCP connection fails (attempt 1) ❌
+- Failure count: 1 (threshold not met)
+- Status: Remains "up"
+- Next check: Success ✅ → Failure count reset to 0
+
+**Scenario 3: Service crashed but container running**
 - Level 1: TCP connection succeeds ✅
 - Level 2: HTTP request fails or returns 500 ❌
 - Status: "down" with specific HTTP error

-**Scenario 3: Everything working**
+**Scenario 4: Everything working**
 - Level 1: TCP connection succeeds ✅
 - Level 2: HTTP request succeeds ✅
 - Status: "up" with latency measurement
+- Failure count: 0
+
+### Troubleshooting False Positives
+
+**Issue**: Host shows "down" but service is accessible
+
+**Common causes**:
+1. **Timeout too short**: Increase from 10s if network is slow
+2. **Container warmup**: Service takes >10s to respond during startup
+3. **Firewall blocking**: Ensure Charon container can reach proxy host ports
+
+**Check logs**:
+```bash
+docker logs charon 2>&1 | grep "Host TCP check completed"
+docker logs charon 2>&1 | grep "Retrying TCP check"
+docker logs charon 2>&1 | grep "failure_count"
+```
+
+**Solution**: The improved debouncing should handle most transient issues automatically. If problems persist, see [Uptime Monitoring Troubleshooting Guide](features/uptime-monitoring.md#troubleshooting).
+
+### Configuration
+
+**Per-Host**: Edit any proxy host and toggle "Enable Uptime Monitoring"
+
+**Bulk Operations**:
+1. Select multiple hosts (checkboxes)
+2. Click "Bulk Apply"
+3. Toggle "Uptime Monitoring" section
+4. Apply changes
+
+- **Default check interval**: 60 seconds
+- **Default timeout per attempt**: 10 seconds
+- **Default max retries**: 2 attempts
+- **Failure threshold**: 2 consecutive failures
+
+**For complete troubleshooting guide and advanced topics, see [Uptime Monitoring Guide](features/uptime-monitoring.md).**

 ---

@@ -938,43 +1010,103 @@ Uses WebSocket technology to stream logs with zero delay.

 ### Notification System

-**What it does:** Sends alerts when security events match your configured criteria.
+**What it does:** Sends alerts when security events, uptime changes, or SSL certificate events occur through multiple channels with rich formatting support.

-**Where to configure:** Cerberus Dashboard → "Notification Settings" button (top-right)
+**Where to configure:** Settings → Notifications
+
+**Supported Services:**
+
+| Service | JSON Templates | Rich Formatting | Notes |
+|---------|----------------|-----------------|-------|
+| Discord | ✅ Yes | Embeds, colors, fields | Webhook-based, rich embeds |
+| Slack | ✅ Yes | Block Kit, markdown | Incoming webhooks |
+| Gotify | ✅ Yes | Priority, extras | Self-hosted push notifications |
+| Generic | ✅ Yes | Custom JSON | Any webhook-compatible service |
+| Telegram | ❌ No | Markdown only | Bot API, URL parameters |

 **Settings:**

- **Enable/Disable** — Master toggle for all notifications
- **Minimum Log Level** — Only notify for warnings and errors (ignore info/debug)
+- **Provider Type** — Choose your notification service
+- **Template Style** — Minimal, Detailed, or Custom JSON
 - **Event Types:**
+  - SSL certificate events (issued, renewed, failed)
+  - Uptime monitoring (host down, host recovered)
  - WAF blocks (when the firewall stops an attack)
  - ACL denials (when access control rules block a request)
  - Rate limit hits (when traffic thresholds are exceeded)
- **Webhook URL** — Send alerts to Discord, Slack, or custom integrations
- **Email Recipients** — Comma-separated list of email addresses
+- **Webhook URL** — Service-specific webhook endpoint
+- **Custom JSON** — Full control over notification format
+
+**Template Styles:**
+
+**Minimal Template** — Clean, simple text notifications:
+```json
+{
+  "content": "{{.Title}}: {{.Message}}"
+}
+```
+
+**Detailed Template** — Rich formatting with all event details:
+```json
+{
+  "embeds": [{
+    "title": "{{.Title}}",
+    "description": "{{.Message}}",
+    "color": {{.Color}},
+    "timestamp": "{{.Timestamp}}",
+    "fields": [
+      {"name": "Event Type", "value": "{{.EventType}}", "inline": true},
+      {"name": "Host", "value": "{{.HostName}}", "inline": true}
+    ]
+  }]
+}
+```
+
+**Custom Template** — Design your own structure with template variables:
+- `{{.Title}}` — Event title (e.g., "SSL Certificate Renewed")
+- `{{.Message}}` — Event details
+- `{{.EventType}}` — Event classification (ssl_renewal, uptime_down, waf_block)
+- `{{.Severity}}` — Alert level (info, warning, error)
+- `{{.HostName}}` — Affected proxy host
+- `{{.Timestamp}}` — ISO 8601 formatted timestamp
+- `{{.Color}}` — Color code for Discord embeds
+- `{{.Priority}}` — Numeric priority for Gotify (1-10)

 **Example use cases:**

- Get a Slack message when your site is under attack
- Email yourself when ACL rules block legitimate traffic (false positive alert)
- Send all WAF blocks to your SIEM system for analysis
+- Get a Discord notification with rich embed when SSL certificates renew
+- Receive Slack Block Kit messages when monitored hosts go down
+- Send all WAF blocks to your SIEM system with custom JSON format
+- Get high-priority Gotify alerts for critical security events
+- Email yourself when ACL rules block legitimate traffic (future feature)

 **What you do:**

-1. Go to Cerberus Dashboard
-2. Click "Notification Settings"
-3. Enable notifications
-4. Set minimum level to "warn" or "error"
-5. Choose which event types to monitor
-6. Add your webhook URL or email addresses
-7. Save
+1. Go to **Settings → Notifications**
+2. Click **"Add Provider"**
+3. Select service type (Discord, Slack, Gotify, etc.)
+4. Enter webhook URL
+5. Choose template style or create custom JSON
+6. Select event types to monitor
+7. Click **"Send Test"** to verify
+8. Save configuration

 **Technical details:**

- Notifications respect the minimum log level (e.g., only send errors)
- Webhook payloads include full event context (IP, request details, rule matched)
- Email delivery requires SMTP configuration (future feature)
+- Templates support Go text/template syntax for advanced formatting
+- SSRF protection validates all webhook URLs before saving and sending
 - Webhook retries with exponential backoff on failure
+- Failed notifications are logged for troubleshooting
+- Custom templates are validated before saving
+
+**For complete examples and service-specific guides, see [Notification Configuration Guide](features/notifications.md).**
+
+**Minimum Log Level** (Legacy Setting):
+
+For backward compatibility, you can still configure minimum log level for security event notifications:
+- Only notify for warnings and errors (ignore info/debug)
+- Applies to Cerberus security events only
+- Accessible via Cerberus Dashboard → "Notification Settings"

 ---
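
As a rough illustration of how the JSON templates in this hunk could be rendered, the Go sketch below runs a template through `text/template`, enforces the documented 10KB size limit, and rejects output that is not valid JSON. The name `supportsJSONTemplates` comes from the commit message, but its body here is a guess based on the "Supported Services" table. The lowercase service identifiers, the `Event` struct, and `renderJSON` are assumptions made for the example, and the documented 5-second timeout is left out for brevity.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"text/template"
)

// maxTemplateBytes reflects the documented 10KB template size limit.
const maxTemplateBytes = 10 * 1024

// Event carries the documented template variables. The struct itself is
// hypothetical; only the variable names come from the docs.
type Event struct {
	Title, Message, EventType, Severity, HostName, Timestamp string
	Color, Priority                                          int
}

// supportsJSONTemplates is named in the commit message; this body is an
// assumption derived from the "Supported Services" table.
func supportsJSONTemplates(service string) bool {
	switch service {
	case "discord", "slack", "gotify", "generic":
		return true
	}
	return false
}

// renderJSON executes tmplText against ev and returns the payload only
// if it is within the size limit and parses as JSON.
func renderJSON(tmplText string, ev Event) ([]byte, error) {
	if len(tmplText) > maxTemplateBytes {
		return nil, errors.New("template exceeds 10KB limit")
	}
	tmpl, err := template.New("payload").Parse(tmplText)
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, ev); err != nil {
		return nil, err
	}
	if !json.Valid(buf.Bytes()) {
		return nil, errors.New("rendered template is not valid JSON")
	}
	return buf.Bytes(), nil
}

func main() {
	ev := Event{Title: "Host Down", Message: "wizarr is unreachable", HostName: "wizarr"}
	if !supportsJSONTemplates("discord") {
		fmt.Println("service does not accept JSON templates")
		return
	}
	out, err := renderJSON(`{"content": "{{.Title}}: {{.Message}}"}`, ev)
	fmt.Println(string(out), err)
}
```

One design point worth noting: `text/template` does not escape values for JSON, so a message containing a double quote would produce a broken payload. The `json.Valid` check in this sketch catches that case instead of sending malformed JSON to the webhook.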