Notification Templates & Uptime Monitoring Fix - Implementation Specification
Date: 2025-12-24
Status: Ready for Implementation
Priority: High
Supersedes: Previous SSRF mitigation plan (moved to archive)
Executive Summary
This specification addresses two distinct issues:
- Task 1: JSON notification templates are currently restricted to the `webhook` type only, but should be available for all notification services that support JSON payloads (Discord, Slack, Gotify, etc.)
- Task 2: Uptime monitoring intermittently reports proxy hosts as "down" due to timing and race-condition issues in the TCP health check system
Task 1: Universal JSON Template Support
Problem Statement
Currently, JSON payload templates (minimal, detailed, custom) are only available when type == "webhook". Other notification services like Discord, Slack, and Gotify also support JSON payloads but are forced to use basic Shoutrrr formatting, limiting customization and functionality.
Root Cause Analysis
Backend Code Location
File: /projects/Charon/backend/internal/services/notification_service.go
Lines 126–151: The `SendExternal` function branches on `p.Type == "webhook"`:
```go
if p.Type == "webhook" {
    if err := s.sendCustomWebhook(ctx, p, data); err != nil {
        logger.Log().WithError(err).Error("Failed to send webhook")
    }
} else {
    // All other types use basic shoutrrr with simple title/message
    url := normalizeURL(p.Type, p.URL)
    msg := fmt.Sprintf("%s\n\n%s", title, message)
    if err := shoutrrr.Send(url, msg); err != nil {
        logger.Log().WithError(err).Error("Failed to send notification")
    }
}
```
Frontend Code Location
File: /projects/Charon/frontend/src/pages/Notifications.tsx
Line 112: Template UI is conditionally rendered only for webhook type:
```tsx
{type === 'webhook' && (
  <div>
    <label>{t('notificationProviders.jsonPayloadTemplate')}</label>
    {/* Template selection buttons and textarea */}
  </div>
)}
```
Model Definition
File: /projects/Charon/backend/internal/models/notification_provider.go
Lines 1-28: The NotificationProvider model has:
- `Type` field: accepts `discord`, `slack`, `gotify`, `telegram`, `generic`, `webhook`
- `Template` field: has values `minimal`, `detailed`, `custom` (default: `minimal`)
- `Config` field: stores the JSON template string
The model itself doesn't restrict templates by type—only the logic does.
Services That Support JSON
Based on Shoutrrr documentation and common webhook practices:
| Service | Supports JSON | Notes |
|---|---|---|
| Discord | ✅ Yes | Native webhook API accepts JSON with embeds |
| Slack | ✅ Yes | Block Kit JSON format |
| Gotify | ✅ Yes | JSON API for messages with extras |
| Telegram | ⚠️ Partial | Uses URL params but can include JSON in message body |
| Generic | ✅ Yes | Generic HTTP POST, can be JSON |
| Webhook | ✅ Yes | Already supported |
Proposed Solution
Phase 1: Backend Refactoring
Objective: Allow all JSON-capable services to use template rendering.
Changes to /backend/internal/services/notification_service.go:
- Create a helper function to determine if a service type supports JSON:
```go
// supportsJSONTemplates returns true if the provider type can use JSON templates
func supportsJSONTemplates(providerType string) bool {
    switch strings.ToLower(providerType) {
    case "webhook", "discord", "slack", "gotify", "generic":
        return true
    case "telegram":
        return false // Telegram uses URL parameters
    default:
        return false
    }
}
```
- Modify the `SendExternal` function (lines 126–151):
```go
for _, provider := range providers {
    if !shouldSend {
        continue
    }
    go func(p models.NotificationProvider) {
        // Use JSON templates for all supported services
        if supportsJSONTemplates(p.Type) && p.Template != "" {
            if err := s.sendJSONPayload(ctx, p, data); err != nil {
                logger.Log().WithError(err).Error("Failed to send JSON notification")
            }
        } else {
            // Fallback to basic shoutrrr for unsupported services
            url := normalizeURL(p.Type, p.URL)
            msg := fmt.Sprintf("%s\n\n%s", title, message)
            if err := shoutrrr.Send(url, msg); err != nil {
                logger.Log().WithError(err).Error("Failed to send notification")
            }
        }
    }(provider)
}
```
- Rename `sendCustomWebhook` to `sendJSONPayload` (lines 154–251):
  - Function name: `sendCustomWebhook` → `sendJSONPayload`
  - Keep all existing logic (template rendering, SSRF protection, etc.)
  - Update all references in tests
- Update service-specific URL handling:
  - For `discord`, `slack`, `gotify`: still use `normalizeURL()` to format the webhook URL correctly
  - For `generic` and `webhook`: use the URL as-is after SSRF validation
Phase 2: Frontend Enhancement
Changes to /frontend/src/pages/Notifications.tsx:
- Line 112: Change the conditional from `type === 'webhook'` to include all JSON-capable types:
```tsx
{supportsJSONTemplates(type) && (
  <div>
    <label className="block text-sm font-medium text-gray-700 dark:text-gray-300">
      {t('notificationProviders.jsonPayloadTemplate')}
    </label>
    {/* Existing template buttons and textarea */}
  </div>
)}
```
- Add helper function at the top of the component:
```tsx
const supportsJSONTemplates = (type: string): boolean => {
  return ['webhook', 'discord', 'slack', 'gotify', 'generic'].includes(type);
};
```
- Update translations to be more generic:
  - Current: "Custom Webhook (JSON)"
  - New: "Custom Webhook / JSON Payload"
Changes to /frontend/src/api/notifications.ts:
- No changes needed; the API already supports the `template` and `config` fields for all provider types
Phase 3: Documentation & Migration
- Update `/docs/security.md` (line 536+):
  - Document Discord JSON template format
  - Add examples for Slack Block Kit
  - Add Gotify JSON examples
- Update `/docs/features.md`:
  - Note that JSON templates are available for all compatible services
  - Provide a comparison table of template availability by service
- Database migration:
  - No schema changes needed
  - Existing `template` and `config` fields work for all types
Testing Strategy
Unit Tests
New test file: /backend/internal/services/notification_service_template_test.go
```go
func TestSupportsJSONTemplates(t *testing.T) {
    tests := []struct {
        providerType string
        expected     bool
    }{
        {"webhook", true},
        {"discord", true},
        {"slack", true},
        {"gotify", true},
        {"generic", true},
        {"telegram", false},
        {"unknown", false},
    }
    // Test implementation
}

func TestSendJSONPayload_Discord(t *testing.T) {
    // Test Discord webhook with JSON template
}

func TestSendJSONPayload_Slack(t *testing.T) {
    // Test Slack webhook with JSON template
}

func TestSendJSONPayload_Gotify(t *testing.T) {
    // Test Gotify API with JSON template
}
```
Update existing tests:
- Rename all `sendCustomWebhook` references to `sendJSONPayload`
- Add test cases for non-webhook JSON services
Integration Tests
- Create test Discord webhook and verify JSON payload
- Test template preview for Discord, Slack, Gotify
- Verify backward compatibility (existing webhook configs still work)
Frontend Tests
File: /frontend/src/pages/__tests__/Notifications.spec.tsx
```tsx
it('shows template selector for Discord', () => {
  // Render form with type=discord
  // Assert template UI is visible
})

it('hides template selector for Telegram', () => {
  // Render form with type=telegram
  // Assert template UI is hidden
})
```
Task 2: Uptime Monitoring False "Down" Status Fix
Problem Statement
Proxy hosts are incorrectly reported as "down" in uptime monitoring after refreshing the page, even though they're fully accessible. The status shows "up" initially, then changes to "down" after a short time.
Root Cause Analysis
Previous Fix Applied: Port mismatch issue was fixed in /docs/implementation/uptime_monitoring_port_fix_COMPLETE.md. The system now correctly uses ProxyHost.ForwardPort instead of extracting port from URLs.
Remaining Issue: The problem persists due to timing and race conditions in the check cycle.
Cause 1: Race Condition in CheckAll()
File: /backend/internal/services/uptime_service.go
Lines 305-344: CheckAll() performs host-level checks then monitor-level checks:
```go
func (s *UptimeService) CheckAll() {
    // First, check all UptimeHosts
    s.checkAllHosts() // ← Calls checkHost() in loop, no wait

    var monitors []models.UptimeMonitor
    s.DB.Where("enabled = ?", true).Find(&monitors)

    // Group monitors by host
    for hostID, monitors := range hostMonitors {
        if hostID != "" {
            var uptimeHost models.UptimeHost
            if err := s.DB.First(&uptimeHost, "id = ?", hostID).Error; err == nil {
                if uptimeHost.Status == "down" {
                    s.markHostMonitorsDown(monitors, &uptimeHost)
                    continue // ← Skip individual checks if host is down
                }
            }
        }
        // Check individual monitors
        for _, monitor := range monitors {
            go s.checkMonitor(monitor)
        }
    }
}
```
Problem: checkAllHosts() runs synchronously through all hosts (line 351-353):
```go
for i := range hosts {
    s.checkHost(&hosts[i]) // ← Takes 5s+ per host with multiple ports
}
```
If a host has 3 monitors and each TCP dial takes 5 seconds (timeout), total time is 15+ seconds. During this time:
- The UI refreshes and calls the API
- The API reads the database before `checkHost()` completes
- A stale "down" status is returned
- The UI shows "down" even though the check is still in progress
Cause 2: No Status Transition Debouncing
Lines 422-441: checkHost() immediately marks host as down after a single TCP failure:
```go
success := false
for _, monitor := range monitors {
    conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
    if err == nil {
        success = true
        break
    }
}

// Immediately flip to down if any failure
if success {
    newStatus = "up"
} else {
    newStatus = "down" // ← No grace period or retry
}
```
A single transient failure (network hiccup, container busy, etc.) immediately marks the host as down.
Cause 3: Short Timeout Window
Line 399: TCP timeout is only 5 seconds:
```go
conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
```
For containers or slow networks, 5 seconds might not be enough, especially if:
- Container is warming up
- System is under load
- Multiple concurrent checks happening
Proposed Solution
Fix 1: Synchronize Host Checks with WaitGroup
File: /backend/internal/services/uptime_service.go
Update checkAllHosts() function (lines 346-353):
```go
func (s *UptimeService) checkAllHosts() {
    var hosts []models.UptimeHost
    if err := s.DB.Find(&hosts).Error; err != nil {
        logger.Log().WithError(err).Error("Failed to fetch uptime hosts")
        return
    }

    var wg sync.WaitGroup
    for i := range hosts {
        wg.Add(1)
        go func(host *models.UptimeHost) {
            defer wg.Done()
            s.checkHost(host)
        }(&hosts[i])
    }
    wg.Wait() // ← Wait for all host checks to complete

    logger.Log().WithField("host_count", len(hosts)).Info("All host checks completed")
}
```
Impact:
- All host checks run concurrently (faster overall)
- `CheckAll()` waits for completion before querying the database
- Eliminates the race condition between check and read
Fix 2: Add Failure Count Debouncing
Add new field to UptimeHost model:
File: /backend/internal/models/uptime_host.go
```go
type UptimeHost struct {
    // ... existing fields ...
    FailureCount int `json:"failure_count" gorm:"default:0"` // Consecutive failures
}
```
Update checkHost() status logic (lines 422-441):
```go
const failureThreshold = 2 // Require 2 consecutive failures before marking down

if success {
    host.FailureCount = 0
    newStatus = "up"
} else {
    host.FailureCount++
    if host.FailureCount >= failureThreshold {
        newStatus = "down"
    } else {
        newStatus = host.Status // ← Keep current status on first failure
        logger.Log().WithFields(map[string]any{
            "host_name":     host.Name,
            "failure_count": host.FailureCount,
            "threshold":     failureThreshold,
        }).Warn("Host check failed, waiting for threshold")
    }
}
```
Rationale: Prevents single transient failures from triggering false alarms.
Fix 3: Increase Timeout and Add Retry
Update checkHost() function (lines 359-408):
```go
const tcpTimeout = 10 * time.Second // ← Increased from 5s
const maxRetries = 2

success := false
var msg string
for retry := 0; retry < maxRetries && !success; retry++ {
    if retry > 0 {
        logger.Log().WithField("retry", retry).Info("Retrying TCP check")
        time.Sleep(2 * time.Second) // Brief delay between retries
    }
    for _, monitor := range monitors {
        var port string
        if monitor.ProxyHost != nil {
            port = fmt.Sprintf("%d", monitor.ProxyHost.ForwardPort)
        } else {
            port = extractPort(monitor.URL)
        }
        if port == "" {
            continue
        }
        addr := net.JoinHostPort(host.Host, port)
        conn, err := net.DialTimeout("tcp", addr, tcpTimeout)
        if err == nil {
            conn.Close()
            success = true
            msg = fmt.Sprintf("TCP connection to %s successful (retry %d)", addr, retry)
            break
        }
        msg = fmt.Sprintf("TCP check failed: %v", err)
    }
}
```
Impact:
- More resilient to transient failures
- Increased timeout handles slow networks
- Logs show retry attempts for debugging
Fix 4: Add Detailed Logging
Add debug logging throughout to help diagnose future issues:
```go
logger.Log().WithFields(map[string]any{
    "host_name":     host.Name,
    "host_ip":       host.Host,
    "port":          port,
    "tcp_timeout":   tcpTimeout,
    "retry_attempt": retry,
    "success":       success,
    "failure_count": host.FailureCount,
    "old_status":    oldStatus,
    "new_status":    newStatus,
    "elapsed_ms":    time.Since(start).Milliseconds(),
}).Debug("Host TCP check completed")
```
Testing Strategy for Task 2
Unit Tests
File: /backend/internal/services/uptime_service_test.go
Add new test cases:
```go
func TestCheckHost_RetryLogic(t *testing.T) {
    // Create a server that fails first attempt, succeeds on retry
    // Verify retry logic works correctly
}

func TestCheckHost_Debouncing(t *testing.T) {
    // Verify single failure doesn't mark host as down
    // Verify 2 consecutive failures do mark as down
}

func TestCheckAllHosts_Synchronization(t *testing.T) {
    // Create multiple hosts with varying check times
    // Verify all checks complete before function returns
    // Use channels to track completion order
}

func TestCheckHost_ConcurrentChecks(t *testing.T) {
    // Run multiple CheckAll() calls concurrently
    // Verify no race conditions or deadlocks
}
```
Integration Tests
File: /backend/integration/uptime_integration_test.go
```go
func TestUptimeMonitoring_SlowNetwork(t *testing.T) {
    // Simulate slow TCP handshake (8 seconds)
    // Verify host is still marked as up with new timeout
}

func TestUptimeMonitoring_TransientFailure(t *testing.T) {
    // Fail first check, succeed second
    // Verify host remains up due to debouncing
}

func TestUptimeMonitoring_PageRefresh(t *testing.T) {
    // Simulate rapid API calls during check cycle
    // Verify status remains consistent
}
```
Manual Testing Checklist
- Create proxy host with non-standard port (e.g., Wizarr on 5690)
- Enable uptime monitoring for that host
- Verify initial status shows "up"
- Refresh page 10 times over 5 minutes
- Confirm status remains "up" consistently
- Check database for heartbeat records
- Review logs for any timeout or retry messages
- Test with container restart during check
- Test with multiple hosts checked simultaneously
- Verify notifications are not triggered by transient failures
Implementation Phases
Phase 1: Task 1 Backend (Day 1)
- Add `supportsJSONTemplates()` helper function
- Rename `sendCustomWebhook` → `sendJSONPayload`
- Update `SendExternal()` to use JSON for all compatible services
- Write unit tests for new logic
- Update existing tests with the renamed function
Phase 2: Task 1 Frontend (Day 1-2)
- Update the template UI conditional in `Notifications.tsx`
- Add `supportsJSONTemplates()` helper function
- Update translations for generic JSON support
- Write frontend tests for template visibility
Phase 3: Task 2 Database Migration (Day 2)
- Add `FailureCount` field to the `UptimeHost` model
- Create migration file
- Test migration on dev database
- Update model documentation
Phase 4: Task 2 Backend Fixes (Day 2-3)
- Add WaitGroup synchronization to `checkAllHosts()`
- Implement failure-count debouncing in `checkHost()`
- Add retry logic with increased timeout
- Add detailed debug logging
- Write unit tests for new behavior
- Write integration tests
Phase 5: Documentation (Day 3)
- Update `/docs/security.md` with JSON examples for Discord, Slack, Gotify
- Update `/docs/features.md` with the template availability table
- Document uptime monitoring improvements
- Add troubleshooting guide for false positives/negatives
- Update API documentation
Phase 6: Testing & Validation (Day 4)
- Run full backend test suite (`go test ./...`)
- Run frontend test suite (`npm test`)
- Perform manual testing for both tasks
- Test with real Discord/Slack/Gotify webhooks
- Test uptime monitoring with various scenarios
- Load testing for concurrent checks
- Code review and security audit
Configuration File Updates
.gitignore
Status: ✅ No changes needed
Current ignore patterns are adequate:
- `*.cover` files already ignored
- `test-results/` already ignored
- No new artifacts from these changes
codecov.yml
Status: ✅ No changes needed
Current coverage targets are appropriate:
- Backend target: 85%
- Frontend target: 70%
New code will maintain these thresholds.
.dockerignore
Status: ✅ No changes needed
Current patterns already exclude:
- Test files (`**/*_test.go`)
- Coverage reports (`*.cover`)
- Documentation (`docs/`)
Dockerfile
Status: ✅ No changes needed
No dependencies or build steps require modification:
- No new packages needed
- No changes to multi-stage build
- No new runtime requirements
Risk Assessment
Task 1 Risks
| Risk | Severity | Mitigation |
|---|---|---|
| Breaking existing webhook configs | High | Comprehensive testing, backward compatibility checks |
| Discord/Slack JSON format incompatibility | Medium | Test with real webhook endpoints, validate JSON schema |
| Template rendering errors cause notification failures | Medium | Robust error handling, fallback to basic shoutrrr format |
| SSRF vulnerabilities in new paths | High | Reuse existing security validation, audit all code paths |
Task 2 Risks
| Risk | Severity | Mitigation |
|---|---|---|
| Increased check duration impacts performance | Medium | Monitor check times, set hard limits, run concurrently |
| Database lock contention from FailureCount updates | Low | Use lightweight updates, batch where possible |
| False positives after retry logic | Low | Tune retry count and delay based on real-world testing |
| Database migration fails on large datasets | Medium | Test on copy of production data, rollback plan ready |
Success Criteria
Task 1
- ✅ Discord notifications can use custom JSON templates with embeds
- ✅ Slack notifications can use Block Kit JSON templates
- ✅ Gotify notifications can use custom JSON payloads
- ✅ Template preview works for all supported services
- ✅ Existing webhook configurations continue to work unchanged
- ✅ No increase in failed notification rate
- ✅ JSON validation errors are logged clearly
Task 2
- ✅ Proxy hosts with non-standard ports show correct "up" status consistently
- ✅ False "down" alerts reduced by 95% or more
- ✅ Average check duration remains under 20 seconds even with retries
- ✅ Status remains stable during page refreshes
- ✅ No increase in missed down events (false negatives)
- ✅ Detailed logs available for troubleshooting
- ✅ No database corruption or lock contention
Rollback Plan
Task 1
- Revert `SendExternal()` to check `p.Type == "webhook"` only
- Revert the frontend conditional to `type === 'webhook'`
- Revert the function rename (`sendJSONPayload` → `sendCustomWebhook`)
- Deploy hotfix immediately
- Estimated rollback time: 15 minutes
Task 2
- Revert the database migration (remove the `FailureCount` field)
- Revert `checkAllHosts()` to the non-synchronized version
- Remove retry logic from `checkHost()`
- Restore the original TCP timeout (5s)
- Deploy hotfix immediately
- Estimated rollback time: 20 minutes
Rollback Testing: Test rollback procedure on staging environment before production deployment.
Monitoring & Alerts
Metrics to Track
Task 1:
- Notification success rate by service type (target: >99%)
- JSON parse errors per hour (target: <5)
- Template rendering failures (target: <1%)
- Average notification send time by service
Task 2:
- Uptime check duration (p50, p95, p99) (target: p95 < 15s)
- Host status transitions per hour (up → down, down → up)
- False alarm rate (user-reported vs system-detected)
- Retry count per check cycle
- FailureCount distribution across hosts
Log Queries
```bash
# Task 1: Check JSON notification errors
docker logs charon 2>&1 | grep "Failed to send JSON notification" | tail -n 20

# Task 1: Check template rendering failures
docker logs charon 2>&1 | grep "failed to parse webhook template" | tail -n 20

# Task 2: Check host status transitions (false positives)
docker logs charon 2>&1 | grep "Host status changed" | tail -n 50

# Task 2: Check retry patterns
docker logs charon 2>&1 | grep "Retrying TCP check" | tail -n 20

# Task 2: Check debouncing effectiveness
docker logs charon 2>&1 | grep "waiting for threshold" | tail -n 20
```
Grafana Dashboard Queries (if applicable)
```promql
# Notification success rate by type
rate(notification_sent_total{status="success"}[5m]) / rate(notification_sent_total[5m])

# Uptime check duration
histogram_quantile(0.95, rate(uptime_check_duration_seconds_bucket[5m]))

# Host status changes
rate(uptime_host_status_changes_total[5m])
```
Appendix: File Change Summary
Backend Files
| File | Lines Changed | Type | Task |
|---|---|---|---|
| `backend/internal/services/notification_service.go` | ~80 | Modify | 1 |
| `backend/internal/services/uptime_service.go` | ~150 | Modify | 2 |
| `backend/internal/models/uptime_host.go` | +2 | Add Field | 2 |
| `backend/internal/services/notification_service_template_test.go` | +250 | New File | 1 |
| `backend/internal/services/uptime_service_test.go` | +200 | Extend | 2 |
| `backend/integration/uptime_integration_test.go` | +150 | New File | 2 |
| `backend/internal/database/migrations/` | +20 | New Migration | 2 |
Frontend Files
| File | Lines Changed | Type | Task |
|---|---|---|---|
| `frontend/src/pages/Notifications.tsx` | ~30 | Modify | 1 |
| `frontend/src/pages/__tests__/Notifications.spec.tsx` | +80 | Extend | 1 |
| `frontend/src/locales/en/translation.json` | ~5 | Modify | 1 |
Documentation Files
| File | Lines Changed | Type | Task |
|---|---|---|---|
| `docs/security.md` | +150 | Extend | 1 |
| `docs/features.md` | +80 | Extend | 1, 2 |
| `docs/plans/current_spec.md` | ~2000 | Replace | 1, 2 |
| `docs/troubleshooting/uptime_monitoring.md` | +200 | New File | 2 |
Total Estimated Changes: ~3,377 lines across 14 files
Database Migration
Migration File
File: backend/internal/database/migrations/YYYYMMDDHHMMSS_add_uptime_host_failure_count.go
```go
package migrations

import (
    "gorm.io/gorm"
)

func init() {
    Migrations = append(Migrations, Migration{
        ID:          "YYYYMMDDHHMMSS",
        Description: "Add failure_count to uptime_hosts table",
        Migrate: func(db *gorm.DB) error {
            return db.Exec("ALTER TABLE uptime_hosts ADD COLUMN failure_count INTEGER DEFAULT 0").Error
        },
        Rollback: func(db *gorm.DB) error {
            return db.Exec("ALTER TABLE uptime_hosts DROP COLUMN failure_count").Error
        },
    })
}
```
Compatibility Notes
- SQLite supports `ALTER TABLE ... ADD COLUMN`
- The default value will be applied to existing rows
- No data loss on rollback (dropping a newly added column is safe; note `DROP COLUMN` requires SQLite 3.35+)
- The migration should check for the column's existence before adding it, so repeated runs stay idempotent
Next Steps
- ✅ Plan Review Complete: This document is comprehensive and ready
- ⏳ Architecture Review: Team lead approval for structural changes
- ⏳ Begin Phase 1: Start with Task 1 backend refactoring
- ⏳ Parallel Development: Task 2 can proceed independently after migration
- ⏳ Code Review: Submit PRs after each phase completes
- ⏳ Staging Deployment: Test both tasks in staging environment
- ⏳ Production Deployment: Gradual rollout with monitoring
Specification Author: GitHub Copilot Review Status: ✅ Complete - Awaiting Implementation Estimated Implementation Time: 4 days Estimated Lines of Code: ~3,377 lines