
Notification Templates & Uptime Monitoring Fix - Implementation Specification

Date: 2025-12-24
Status: Ready for Implementation
Priority: High
Supersedes: Previous SSRF mitigation plan (moved to archive)


Executive Summary

This specification addresses two distinct issues:

  1. Task 1: JSON notification templates are currently restricted to webhook type only, but should be available for all notification services that support JSON payloads (Discord, Slack, Gotify, etc.)
  2. Task 2: Uptime monitoring is incorrectly reporting proxy hosts as "down" intermittently due to timing and race condition issues in the TCP health check system

Task 1: Universal JSON Template Support

Problem Statement

Currently, JSON payload templates (minimal, detailed, custom) are only available when type == "webhook". Other notification services like Discord, Slack, and Gotify also support JSON payloads but are forced to use basic Shoutrrr formatting, limiting customization and functionality.

Root Cause Analysis

Backend Code Location

File: /projects/Charon/backend/internal/services/notification_service.go

Lines 126-151: The SendExternal function branches on p.Type == "webhook":

if p.Type == "webhook" {
    if err := s.sendCustomWebhook(ctx, p, data); err != nil {
        logger.Log().WithError(err).Error("Failed to send webhook")
    }
} else {
    // All other types use basic shoutrrr with simple title/message
    url := normalizeURL(p.Type, p.URL)
    msg := fmt.Sprintf("%s\n\n%s", title, message)
    if err := shoutrrr.Send(url, msg); err != nil {
        logger.Log().WithError(err).Error("Failed to send notification")
    }
}

Frontend Code Location

File: /projects/Charon/frontend/src/pages/Notifications.tsx

Line 112: Template UI is conditionally rendered only for webhook type:

{type === 'webhook' && (
    <div>
        <label>{t('notificationProviders.jsonPayloadTemplate')}</label>
        {/* Template selection buttons and textarea */}
    </div>
)}

Model Definition

File: /projects/Charon/backend/internal/models/notification_provider.go

Lines 1-28: The NotificationProvider model has:

  • Type field: Accepts discord, slack, gotify, telegram, generic, webhook
  • Template field: Has values minimal, detailed, custom (default: minimal)
  • Config field: Stores the JSON template string

The model itself doesn't restrict templates by type—only the logic does.

Services That Support JSON

Based on Shoutrrr documentation and common webhook practices:

| Service  | Supports JSON | Notes                                                   |
|----------|---------------|---------------------------------------------------------|
| Discord  | Yes           | Native webhook API accepts JSON with embeds             |
| Slack    | Yes           | Block Kit JSON format                                   |
| Gotify   | Yes           | JSON API for messages with extras                       |
| Telegram | ⚠️ Partial    | Uses URL params but can include JSON in message body    |
| Generic  | Yes           | Generic HTTP POST, can be JSON                          |
| Webhook  | Yes           | Already supported                                       |

Proposed Solution

Phase 1: Backend Refactoring

Objective: Allow all JSON-capable services to use template rendering.

Changes to /backend/internal/services/notification_service.go:

  1. Create a helper function to determine if a service type supports JSON:
// supportsJSONTemplates returns true if the provider type can use JSON templates
func supportsJSONTemplates(providerType string) bool {
    switch strings.ToLower(providerType) {
    case "webhook", "discord", "slack", "gotify", "generic":
        return true
    case "telegram":
        return false // Telegram uses URL parameters
    default:
        return false
    }
}
  2. Modify SendExternal function (lines 126-151):
for _, provider := range providers {
    if !shouldSend {
        continue
    }

    go func(p models.NotificationProvider) {
        // Use JSON templates for all supported services
        if supportsJSONTemplates(p.Type) && p.Template != "" {
            if err := s.sendJSONPayload(ctx, p, data); err != nil {
                logger.Log().WithError(err).Error("Failed to send JSON notification")
            }
        } else {
            // Fallback to basic shoutrrr for unsupported services
            url := normalizeURL(p.Type, p.URL)
            msg := fmt.Sprintf("%s\n\n%s", title, message)
            if err := shoutrrr.Send(url, msg); err != nil {
                logger.Log().WithError(err).Error("Failed to send notification")
            }
        }
    }(provider)
}
  3. Rename sendCustomWebhook to sendJSONPayload (lines 154-251):

    • Function name: sendCustomWebhook → sendJSONPayload
    • Keep all existing logic (template rendering, SSRF protection, etc.)
    • Update all references in tests
  4. Update service-specific URL handling:

    • For discord, slack, gotify: Still use normalizeURL() to format the webhook URL correctly
    • For generic and webhook: Use URL as-is after SSRF validation

Phase 2: Frontend Enhancement

Changes to /frontend/src/pages/Notifications.tsx:

  1. Line 112: Change conditional from type === 'webhook' to include all JSON-capable types:
{supportsJSONTemplates(type) && (
    <div>
        <label className="block text-sm font-medium text-gray-700 dark:text-gray-300">
            {t('notificationProviders.jsonPayloadTemplate')}
        </label>
        {/* Existing template buttons and textarea */}
    </div>
)}
  2. Add helper function at the top of the component:
const supportsJSONTemplates = (type: string): boolean => {
    return ['webhook', 'discord', 'slack', 'gotify', 'generic'].includes(type);
};
  3. Update translations to be more generic:
    • Current: "Custom Webhook (JSON)"
    • New: "Custom Webhook / JSON Payload"

Changes to /frontend/src/api/notifications.ts:

  • No changes needed; the API already supports template and config fields for all provider types

Phase 3: Documentation & Migration

  1. Update /docs/security.md (line 536+):

    • Document Discord JSON template format
    • Add examples for Slack Block Kit
    • Add Gotify JSON examples
  2. Update /docs/features.md:

    • Note that JSON templates are available for all compatible services
    • Provide comparison table of template availability by service
  3. Database Migration:

    • No schema changes needed
    • Existing template and config fields work for all types

Testing Strategy

Unit Tests

New test file: /backend/internal/services/notification_service_template_test.go

func TestSupportsJSONTemplates(t *testing.T) {
    tests := []struct {
        providerType string
        expected     bool
    }{
        {"webhook", true},
        {"discord", true},
        {"slack", true},
        {"gotify", true},
        {"generic", true},
        {"telegram", false},
        {"unknown", false},
    }
    for _, tt := range tests {
        if got := supportsJSONTemplates(tt.providerType); got != tt.expected {
            t.Errorf("supportsJSONTemplates(%q) = %v, want %v", tt.providerType, got, tt.expected)
        }
    }
}

func TestSendJSONPayload_Discord(t *testing.T) {
    // Test Discord webhook with JSON template
}

func TestSendJSONPayload_Slack(t *testing.T) {
    // Test Slack webhook with JSON template
}

func TestSendJSONPayload_Gotify(t *testing.T) {
    // Test Gotify API with JSON template
}

Update existing tests:

  • Rename all sendCustomWebhook references to sendJSONPayload
  • Add test cases for non-webhook JSON services

Integration Tests

  1. Create test Discord webhook and verify JSON payload
  2. Test template preview for Discord, Slack, Gotify
  3. Verify backward compatibility (existing webhook configs still work)

Frontend Tests

File: /frontend/src/pages/__tests__/Notifications.spec.tsx

it('shows template selector for Discord', () => {
    // Render form with type=discord
    // Assert template UI is visible
})

it('hides template selector for Telegram', () => {
    // Render form with type=telegram
    // Assert template UI is hidden
})

Task 2: Uptime Monitoring False "Down" Status Fix

Problem Statement

Proxy hosts are incorrectly reported as "down" in uptime monitoring after refreshing the page, even though they're fully accessible. The status shows "up" initially, then changes to "down" after a short time.

Root Cause Analysis

Previous Fix Applied: Port mismatch issue was fixed in /docs/implementation/uptime_monitoring_port_fix_COMPLETE.md. The system now correctly uses ProxyHost.ForwardPort instead of extracting port from URLs.

Remaining Issue: The problem persists due to timing and race conditions in the check cycle.

Cause 1: Race Condition in CheckAll()

File: /backend/internal/services/uptime_service.go

Lines 305-344: CheckAll() performs host-level checks then monitor-level checks:

func (s *UptimeService) CheckAll() {
    // First, check all UptimeHosts
    s.checkAllHosts()  // ← Calls checkHost() in loop, no wait

    var monitors []models.UptimeMonitor
    s.DB.Where("enabled = ?", true).Find(&monitors)

    // Group monitors by host
    for hostID, monitors := range hostMonitors {
        if hostID != "" {
            var uptimeHost models.UptimeHost
            if err := s.DB.First(&uptimeHost, "id = ?", hostID).Error; err == nil {
                if uptimeHost.Status == "down" {
                    s.markHostMonitorsDown(monitors, &uptimeHost)
                    continue  // ← Skip individual checks if host is down
                }
            }
        }
        // Check individual monitors
        for _, monitor := range monitors {
            go s.checkMonitor(monitor)
        }
    }
}

Problem: checkAllHosts() runs synchronously through all hosts (lines 351-353):

for i := range hosts {
    s.checkHost(&hosts[i])  // ← Takes 5s+ per host with multiple ports
}

If a host has 3 monitors and each TCP dial takes 5 seconds (timeout), total time is 15+ seconds. During this time:

  1. The UI refreshes and calls the API
  2. API reads database before checkHost() completes
  3. Stale "down" status is returned
  4. UI shows "down" even though check is still in progress

Cause 2: No Status Transition Debouncing

Lines 422-441: checkHost() immediately marks host as down after a single TCP failure:

success := false
for _, monitor := range monitors {
    conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
    if err == nil {
        success = true
        break
    }
}

// Immediately flip to down if any failure
if success {
    newStatus = "up"
} else {
    newStatus = "down"  // ← No grace period or retry
}

A single transient failure (network hiccup, container busy, etc.) immediately marks the host as down.

Cause 3: Short Timeout Window

Line 399: TCP timeout is only 5 seconds:

conn, err := net.DialTimeout("tcp", addr, 5*time.Second)

For containers or slow networks, 5 seconds might not be enough, especially if:

  • Container is warming up
  • System is under load
  • Multiple concurrent checks happening

Proposed Solution

Fix 1: Synchronize Host Checks with WaitGroup

File: /backend/internal/services/uptime_service.go

Update checkAllHosts() function (lines 346-353):

func (s *UptimeService) checkAllHosts() {
    var hosts []models.UptimeHost
    if err := s.DB.Find(&hosts).Error; err != nil {
        logger.Log().WithError(err).Error("Failed to fetch uptime hosts")
        return
    }

    var wg sync.WaitGroup
    for i := range hosts {
        wg.Add(1)
        go func(host *models.UptimeHost) {
            defer wg.Done()
            s.checkHost(host)
        }(&hosts[i])
    }
    wg.Wait() // ← Wait for all host checks to complete

    logger.Log().WithField("host_count", len(hosts)).Info("All host checks completed")
}

Impact:

  • All host checks run concurrently (faster overall)
  • CheckAll() waits for completion before querying database
  • Eliminates race condition between check and read
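If the host list grows large, the one-goroutine-per-host pattern above dials every host at once. The same WaitGroup pattern can be combined with a counting semaphore to cap in-flight checks, which also supports the "set hard limits, run concurrently" mitigation in the risk assessment. This is an illustrative sketch, not project code; the int host IDs, the counter, and maxConcurrent are assumptions:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// checkAllHostsBounded runs one check per host, but never more than
// maxConcurrent at a time. It returns the number of checks performed so
// callers can confirm every host was visited before it returned.
func checkAllHostsBounded(hostIDs []int, maxConcurrent int) int {
	var checked int64
	sem := make(chan struct{}, maxConcurrent) // counting semaphore
	var wg sync.WaitGroup
	for _, id := range hostIDs {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks once maxConcurrent are in flight
		go func(id int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			_ = id                   // a real implementation would run s.checkHost for this host
			atomic.AddInt64(&checked, 1)
		}(id)
	}
	wg.Wait() // as in Fix 1: return only after every check completes
	return int(checked)
}

func main() {
	fmt.Println(checkAllHostsBounded([]int{1, 2, 3, 4, 5}, 2)) // → 5
}
```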

Fix 2: Add Failure Count Debouncing

Add new field to UptimeHost model:

File: /backend/internal/models/uptime_host.go

type UptimeHost struct {
    // ... existing fields ...
    FailureCount int `json:"failure_count" gorm:"default:0"` // Consecutive failures
}

Update checkHost() status logic (lines 422-441):

const failureThreshold = 2  // Require 2 consecutive failures before marking down

if success {
    host.FailureCount = 0
    newStatus = "up"
} else {
    host.FailureCount++
    if host.FailureCount >= failureThreshold {
        newStatus = "down"
    } else {
        newStatus = host.Status  // ← Keep current status on first failure
        logger.Log().WithFields(map[string]any{
            "host_name":     host.Name,
            "failure_count": host.FailureCount,
            "threshold":     failureThreshold,
        }).Warn("Host check failed, waiting for threshold")
    }
}

Rationale: Prevents single transient failures from triggering false alarms.
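The debouncing rule above can be expressed as a pure state-transition function, which makes it easy to test in isolation. A minimal sketch (nextStatus is an illustrative name, not the project's code; the threshold of 2 matches the constant proposed above):

```go
package main

import "fmt"

// failureThreshold mirrors the constant proposed above.
const failureThreshold = 2

// nextStatus applies the Fix 2 rule: success resets the counter and
// reports "up"; a failure only flips the status to "down" once
// failureThreshold consecutive failures have accumulated.
func nextStatus(current string, failures int, success bool) (string, int) {
	if success {
		return "up", 0
	}
	failures++
	if failures >= failureThreshold {
		return "down", failures
	}
	return current, failures // below threshold: keep the current status
}

func main() {
	status, failures := "up", 0
	for _, ok := range []bool{true, false, false, true} {
		status, failures = nextStatus(status, failures, ok)
		fmt.Println(status, failures)
	}
	// → up 0, up 1, down 2, up 0
}
```

A single transient failure leaves the host "up"; only the second consecutive failure flips it to "down", and one success recovers it.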

Fix 3: Increase Timeout and Add Retry

Update checkHost() function (lines 359-408):

const tcpTimeout = 10 * time.Second  // ← Increased from 5s
const maxRetries = 2                 // ← Total dial attempts (initial try + 1 retry)

success := false
var msg string

for retry := 0; retry < maxRetries && !success; retry++ {
    if retry > 0 {
        logger.Log().WithField("retry", retry).Info("Retrying TCP check")
        time.Sleep(2 * time.Second)  // Brief delay between retries
    }

    for _, monitor := range monitors {
        var port string
        if monitor.ProxyHost != nil {
            port = fmt.Sprintf("%d", monitor.ProxyHost.ForwardPort)
        } else {
            port = extractPort(monitor.URL)
        }

        if port == "" {
            continue
        }

        addr := net.JoinHostPort(host.Host, port)
        conn, err := net.DialTimeout("tcp", addr, tcpTimeout)
        if err == nil {
            conn.Close()
            success = true
            msg = fmt.Sprintf("TCP connection to %s successful (retry %d)", addr, retry)
            break
        }
        msg = fmt.Sprintf("TCP check failed: %v", err)
    }
}

Impact:

  • More resilient to transient failures
  • Increased timeout handles slow networks
  • Logs show retry attempts for debugging

Fix 4: Add Detailed Logging

Add debug logging throughout to help diagnose future issues:

logger.Log().WithFields(map[string]any{
    "host_name":      host.Name,
    "host_ip":        host.Host,
    "port":           port,
    "tcp_timeout":    tcpTimeout,
    "retry_attempt":  retry,
    "success":        success,
    "failure_count":  host.FailureCount,
    "old_status":     oldStatus,
    "new_status":     newStatus,
    "elapsed_ms":     time.Since(start).Milliseconds(),
}).Debug("Host TCP check completed")

Testing Strategy for Task 2

Unit Tests

File: /backend/internal/services/uptime_service_test.go

Add new test cases:

func TestCheckHost_RetryLogic(t *testing.T) {
    // Create a server that fails first attempt, succeeds on retry
    // Verify retry logic works correctly
}

func TestCheckHost_Debouncing(t *testing.T) {
    // Verify single failure doesn't mark host as down
    // Verify 2 consecutive failures do mark as down
}

func TestCheckAllHosts_Synchronization(t *testing.T) {
    // Create multiple hosts with varying check times
    // Verify all checks complete before function returns
    // Use channels to track completion order
}

func TestCheckHost_ConcurrentChecks(t *testing.T) {
    // Run multiple CheckAll() calls concurrently
    // Verify no race conditions or deadlocks
}

Integration Tests

File: /backend/integration/uptime_integration_test.go

func TestUptimeMonitoring_SlowNetwork(t *testing.T) {
    // Simulate slow TCP handshake (8 seconds)
    // Verify host is still marked as up with new timeout
}

func TestUptimeMonitoring_TransientFailure(t *testing.T) {
    // Fail first check, succeed second
    // Verify host remains up due to debouncing
}

func TestUptimeMonitoring_PageRefresh(t *testing.T) {
    // Simulate rapid API calls during check cycle
    // Verify status remains consistent
}

Manual Testing Checklist

  • Create proxy host with non-standard port (e.g., Wizarr on 5690)
  • Enable uptime monitoring for that host
  • Verify initial status shows "up"
  • Refresh page 10 times over 5 minutes
  • Confirm status remains "up" consistently
  • Check database for heartbeat records
  • Review logs for any timeout or retry messages
  • Test with container restart during check
  • Test with multiple hosts checked simultaneously
  • Verify notifications are not triggered by transient failures

Implementation Phases

Phase 1: Task 1 Backend (Day 1)

  • Add supportsJSONTemplates() helper function
  • Rename sendCustomWebhook → sendJSONPayload
  • Update SendExternal() to use JSON for all compatible services
  • Write unit tests for new logic
  • Update existing tests with renamed function

Phase 2: Task 1 Frontend (Day 1-2)

  • Update template UI conditional in Notifications.tsx
  • Add supportsJSONTemplates() helper function
  • Update translations for generic JSON support
  • Write frontend tests for template visibility

Phase 3: Task 2 Database Migration (Day 2)

  • Add FailureCount field to UptimeHost model
  • Create migration file
  • Test migration on dev database
  • Update model documentation

Phase 4: Task 2 Backend Fixes (Day 2-3)

  • Add WaitGroup synchronization to checkAllHosts()
  • Implement failure count debouncing in checkHost()
  • Add retry logic with increased timeout
  • Add detailed debug logging
  • Write unit tests for new behavior
  • Write integration tests

Phase 5: Documentation (Day 3)

  • Update /docs/security.md with JSON examples for Discord, Slack, Gotify
  • Update /docs/features.md with template availability table
  • Document uptime monitoring improvements
  • Add troubleshooting guide for false positives/negatives
  • Update API documentation

Phase 6: Testing & Validation (Day 4)

  • Run full backend test suite (go test ./...)
  • Run frontend test suite (npm test)
  • Perform manual testing for both tasks
  • Test with real Discord/Slack/Gotify webhooks
  • Test uptime monitoring with various scenarios
  • Load testing for concurrent checks
  • Code review and security audit

Configuration File Updates

.gitignore

Status: No changes needed

Current ignore patterns are adequate:

  • *.cover files already ignored
  • test-results/ already ignored
  • No new artifacts from these changes

codecov.yml

Status: No changes needed

Current coverage targets are appropriate:

  • Backend target: 85%
  • Frontend target: 70%

New code will maintain these thresholds.

.dockerignore

Status: No changes needed

Current patterns already exclude:

  • Test files (**/*_test.go)
  • Coverage reports (*.cover)
  • Documentation (docs/)

Dockerfile

Status: No changes needed

No dependencies or build steps require modification:

  • No new packages needed
  • No changes to multi-stage build
  • No new runtime requirements

Risk Assessment

Task 1 Risks

| Risk                                                   | Severity | Mitigation                                              |
|--------------------------------------------------------|----------|---------------------------------------------------------|
| Breaking existing webhook configs                      | High     | Comprehensive testing, backward compatibility checks    |
| Discord/Slack JSON format incompatibility              | Medium   | Test with real webhook endpoints, validate JSON schema  |
| Template rendering errors cause notification failures  | Medium   | Robust error handling, fallback to basic shoutrrr format |
| SSRF vulnerabilities in new paths                      | High     | Reuse existing security validation, audit all code paths |

Task 2 Risks

| Risk                                                 | Severity | Mitigation                                              |
|------------------------------------------------------|----------|---------------------------------------------------------|
| Increased check duration impacts performance         | Medium   | Monitor check times, set hard limits, run concurrently  |
| Database lock contention from FailureCount updates   | Low      | Use lightweight updates, batch where possible           |
| False positives after retry logic                    | Low      | Tune retry count and delay based on real-world testing  |
| Database migration fails on large datasets           | Medium   | Test on copy of production data, rollback plan ready    |

Success Criteria

Task 1

  • Discord notifications can use custom JSON templates with embeds
  • Slack notifications can use Block Kit JSON templates
  • Gotify notifications can use custom JSON payloads
  • Template preview works for all supported services
  • Existing webhook configurations continue to work unchanged
  • No increase in failed notification rate
  • JSON validation errors are logged clearly

Task 2

  • Proxy hosts with non-standard ports show correct "up" status consistently
  • False "down" alerts reduced by 95% or more
  • Average check duration remains under 20 seconds even with retries
  • Status remains stable during page refreshes
  • No increase in missed down events (false negatives)
  • Detailed logs available for troubleshooting
  • No database corruption or lock contention

Rollback Plan

Task 1

  1. Revert SendExternal() to check p.Type == "webhook" only
  2. Revert frontend conditional to type === 'webhook'
  3. Revert function rename (sendJSONPayload → sendCustomWebhook)
  4. Deploy hotfix immediately
  5. Estimated rollback time: 15 minutes

Task 2

  1. Revert database migration (remove FailureCount field)
  2. Revert checkAllHosts() to non-synchronized version
  3. Remove retry logic from checkHost()
  4. Restore original TCP timeout (5s)
  5. Deploy hotfix immediately
  6. Estimated rollback time: 20 minutes

Rollback Testing: Test rollback procedure on staging environment before production deployment.


Monitoring & Alerts

Metrics to Track

Task 1:

  • Notification success rate by service type (target: >99%)
  • JSON parse errors per hour (target: <5)
  • Template rendering failures (target: <1%)
  • Average notification send time by service

Task 2:

  • Uptime check duration (p50, p95, p99) (target: p95 < 15s)
  • Host status transitions per hour (up → down, down → up)
  • False alarm rate (user-reported vs system-detected)
  • Retry count per check cycle
  • FailureCount distribution across hosts

Log Queries

# Task 1: Check JSON notification errors
docker logs charon 2>&1 | grep "Failed to send JSON notification" | tail -n 20

# Task 1: Check template rendering failures
docker logs charon 2>&1 | grep "failed to parse webhook template" | tail -n 20

# Task 2: Check uptime false negatives
docker logs charon 2>&1 | grep "Host status changed" | tail -n 50

# Task 2: Check retry patterns
docker logs charon 2>&1 | grep "Retrying TCP check" | tail -n 20

# Task 2: Check debouncing effectiveness
docker logs charon 2>&1 | grep "waiting for threshold" | tail -n 20

Grafana Dashboard Queries (if applicable)

# Notification success rate by type
rate(notification_sent_total{status="success"}[5m]) / rate(notification_sent_total[5m])

# Uptime check duration
histogram_quantile(0.95, rate(uptime_check_duration_seconds_bucket[5m]))

# Host status changes
rate(uptime_host_status_changes_total[5m])

Appendix: File Change Summary

Backend Files

| File                                                              | Lines Changed | Type          | Task |
|-------------------------------------------------------------------|---------------|---------------|------|
| backend/internal/services/notification_service.go                 | ~80           | Modify        | 1    |
| backend/internal/services/uptime_service.go                       | ~150          | Modify        | 2    |
| backend/internal/models/uptime_host.go                            | +2            | Add Field     | 2    |
| backend/internal/services/notification_service_template_test.go   | +250          | New File      | 1    |
| backend/internal/services/uptime_service_test.go                  | +200          | Extend        | 2    |
| backend/integration/uptime_integration_test.go                    | +150          | New File      | 2    |
| backend/internal/database/migrations/                             | +20           | New Migration | 2    |

Frontend Files

| File                                                  | Lines Changed | Type   | Task |
|-------------------------------------------------------|---------------|--------|------|
| frontend/src/pages/Notifications.tsx                  | ~30           | Modify | 1    |
| frontend/src/pages/__tests__/Notifications.spec.tsx   | +80           | Extend | 1    |
| frontend/src/locales/en/translation.json              | ~5            | Modify | 1    |

Documentation Files

| File                                          | Lines Changed | Type     | Task |
|-----------------------------------------------|---------------|----------|------|
| docs/security.md                              | +150          | Extend   | 1    |
| docs/features.md                              | +80           | Extend   | 1, 2 |
| docs/plans/current_spec.md                    | ~2000         | Replace  | 1, 2 |
| docs/troubleshooting/uptime_monitoring.md     | +200          | New File | 2    |

Total Estimated Changes: ~3,377 lines across 14 files


Database Migration

Migration File

File: backend/internal/database/migrations/YYYYMMDDHHMMSS_add_uptime_host_failure_count.go

package migrations

import (
    "gorm.io/gorm"
)

func init() {
    Migrations = append(Migrations, Migration{
        ID: "YYYYMMDDHHMMSS",
        Description: "Add failure_count to uptime_hosts table",
        Migrate: func(db *gorm.DB) error {
            // Idempotency guard: skip if the column already exists
            // (SQLite-specific check via pragma_table_info)
            var count int
            if err := db.Raw("SELECT COUNT(*) FROM pragma_table_info('uptime_hosts') WHERE name = 'failure_count'").Scan(&count).Error; err == nil && count > 0 {
                return nil
            }
            return db.Exec("ALTER TABLE uptime_hosts ADD COLUMN failure_count INTEGER DEFAULT 0").Error
        },
        Rollback: func(db *gorm.DB) error {
            return db.Exec("ALTER TABLE uptime_hosts DROP COLUMN failure_count").Error
        },
    })
}

Compatibility Notes

  • SQLite supports ALTER TABLE ADD COLUMN
  • Default value will be applied to existing rows
  • No data loss on rollback (column drop is safe for new field)
  • Migration is idempotent (check for column existence before adding)

Next Steps

  1. Plan Review Complete: This document is comprehensive and ready
  2. Architecture Review: Team lead approval for structural changes
  3. Begin Phase 1: Start with Task 1 backend refactoring
  4. Parallel Development: Task 2 can proceed independently after migration
  5. Code Review: Submit PRs after each phase completes
  6. Staging Deployment: Test both tasks in staging environment
  7. Production Deployment: Gradual rollout with monitoring

Specification Author: GitHub Copilot
Review Status: Complete - Awaiting Implementation
Estimated Implementation Time: 4 days
Estimated Lines of Code: ~3,377 lines