Current Plan: Aggregated Host Statuses
This feature adds a backend endpoint that returns aggregated health information for upstream hosts and a frontend Dashboard widget to display the aggregated view. The goal is to provide host-level health at-a-glance to help identify server-wide outages and quickly navigate to affected services.
Summary
- Endpoint: GET /api/v1/uptime/hosts/aggregated (authenticated)
- Backend: Service method + handler + route + GORM query, small in-memory cache, server-side filters
- Frontend: API client, custom React Query hook, HostStatusesWidget in Dashboard, demo/test pages
- Acceptance: Auth respects accessible hosts, accurate counts, performance (fast aggregate queries)
HandOff JSON contract (Truth)
Request: GET /api/v1/uptime/hosts/aggregated
- Query Params (optional):
  - status (string): filter results by host status: up|down|pending|maintenance
  - q (string): search text (host or name)
  - sort_by (string): monitor_count|down_count|avg_latency|last_check (default: down_count)
  - order (string): asc|desc (default: desc)
  - page (int): pagination page (default 1)
  - per_page (int): items per page (default 50)
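As a rough illustration of how these parameters and defaults might be applied on the server, the sketch below parses a raw query string into a normalized filter using only the Go standard library. The AggregationFilter field names, the ParseFilter function, and the maxPerPage cap of 200 are assumptions. So is the choice to fall back to defaults rather than reject invalid values; the plan only lists the parameters and their defaults.

```go
package main

import (
	"net/url"
	"strconv"
	"strings"
)

// AggregationFilter mirrors the optional query params of the endpoint.
// Field names are assumptions for illustration.
type AggregationFilter struct {
	Status  string // "" means no status filter
	Query   string
	SortBy  string
	Order   string
	Page    int
	PerPage int
}

var (
	validStatuses = map[string]bool{"up": true, "down": true, "pending": true, "maintenance": true}
	validSorts    = map[string]bool{"monitor_count": true, "down_count": true, "avg_latency": true, "last_check": true}
)

// maxPerPage is an assumed upper bound; the plan does not specify one.
const maxPerPage = 200

// positiveInt parses s as an integer >= 1 and returns def on any failure.
func positiveInt(s string, def int) int {
	n, err := strconv.Atoi(s)
	if err != nil || n < 1 {
		return def
	}
	return n
}

// ParseFilter turns a raw query string (e.g. r.URL.RawQuery) into a filter
// with the documented defaults applied and unknown values dropped.
func ParseFilter(rawQuery string) (AggregationFilter, error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return AggregationFilter{}, err
	}
	f := AggregationFilter{
		Status:  strings.ToLower(q.Get("status")),
		Query:   strings.TrimSpace(q.Get("q")),
		SortBy:  strings.ToLower(q.Get("sort_by")),
		Order:   strings.ToLower(q.Get("order")),
		Page:    positiveInt(q.Get("page"), 1),
		PerPage: positiveInt(q.Get("per_page"), 50),
	}
	if !validStatuses[f.Status] {
		f.Status = ""
	}
	if !validSorts[f.SortBy] {
		f.SortBy = "down_count"
	}
	if f.Order != "asc" {
		f.Order = "desc"
	}
	if f.PerPage > maxPerPage {
		f.PerPage = maxPerPage
	}
	return f, nil
}
```

A handler could call ParseFilter(c.Request.URL.RawQuery) before invoking the service. If strict validation is preferred, the fallback branches would instead return an error that maps to a 400 response.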
Response: 200 JSON
{
  "aggregated_hosts": [
    {
      "id": "uuid",
      "host": "10.0.0.12",
      "name": "web-01",
      "status": "down",
      "monitor_count": 3,
      "counts": { "up": 1, "down": 2, "pending": 0, "maintenance": 0 },
      "avg_latency_ms": 257,
      "last_check": "2025-12-05T09:54:54Z",
      "last_status_change": "2025-12-05T09:53:44Z",
      "affected_monitors": [
        { "id": "mon-1", "name": "example-api", "status": "down", "last_check": "2025-12-05T09:54:54Z" },
        { "id": "mon-2", "name": "webapp", "status": "down", "last_check": "2025-12-05T09:52:14Z" }
      ],
      "uptime_24h": 99.3
    }
  ],
  "meta": { "page": 1, "per_page": 50, "total": 1 }
}
Notes:
- All timestamps are ISO 8601 UTC.
- Field names use snake_case (server -> frontend contract per project guidelines).
- Only accessible hosts are returned to the authenticated caller (utilize existing auth handlers).
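To make the contract concrete on the Go side, the following sketch shows one possible set of DTO structs whose JSON tags match the field names above. Only the JSON keys come from the contract. The Go type names, the PageMeta and AggregatedHostsResponse wrappers, the helper functions, and the choice of int for avg_latency_ms are assumptions.

```go
package main

import (
	"encoding/json"
	"time"
)

// AggregatedHostCounts holds per-status monitor counts for one host.
type AggregatedHostCounts struct {
	Up          int `json:"up"`
	Down        int `json:"down"`
	Pending     int `json:"pending"`
	Maintenance int `json:"maintenance"`
}

// AffectedMonitor is a trimmed monitor entry listed under a host.
type AffectedMonitor struct {
	ID        string    `json:"id"`
	Name      string    `json:"name"`
	Status    string    `json:"status"`
	LastCheck time.Time `json:"last_check"`
}

// AggregatedHost is one element of "aggregated_hosts".
type AggregatedHost struct {
	ID               string               `json:"id"`
	Host             string               `json:"host"`
	Name             string               `json:"name"`
	Status           string               `json:"status"`
	MonitorCount     int                  `json:"monitor_count"`
	Counts           AggregatedHostCounts `json:"counts"`
	AvgLatencyMs     int                  `json:"avg_latency_ms"`
	LastCheck        time.Time            `json:"last_check"`
	LastStatusChange time.Time            `json:"last_status_change"`
	AffectedMonitors []AffectedMonitor    `json:"affected_monitors"`
	Uptime24h        float64              `json:"uptime_24h"`
}

// PageMeta carries the pagination block.
type PageMeta struct {
	Page    int `json:"page"`
	PerPage int `json:"per_page"`
	Total   int `json:"total"`
}

// AggregatedHostsResponse is the top-level response envelope.
type AggregatedHostsResponse struct {
	AggregatedHosts []AggregatedHost `json:"aggregated_hosts"`
	Meta            PageMeta         `json:"meta"`
}

// DecodeResponse parses a response body into the envelope type.
func DecodeResponse(body []byte) (AggregatedHostsResponse, error) {
	var resp AggregatedHostsResponse
	err := json.Unmarshal(body, &resp)
	return resp, err
}

// EncodeResponse marshals the envelope. time.Time encodes as RFC 3339, so
// values should be in UTC to produce the trailing "Z".
func EncodeResponse(resp AggregatedHostsResponse) ([]byte, error) {
	if resp.AggregatedHosts == nil {
		resp.AggregatedHosts = []AggregatedHost{} // emit [] instead of null
	}
	return json.Marshal(resp)
}
```

Timestamps read from the database may carry a local zone, so calling UTC() on them before encoding keeps the output in line with the ISO 8601 UTC note.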
Backend Requirements
- Database
  - Ensure indexes on uptime_monitors(uptime_host_id), uptime_monitors(status), and uptime_monitors(last_check).
  - No model changes required for UptimeHost or UptimeMonitor unless we want a cached avg_latency column (optional).
- Service (in internal/services/uptime_service.go)
  - Add method: GetAggregatedHostStatuses(filters AggregationFilter) ([]AggregatedHost, error).
  - Implementation detail:
    - Query should join uptime_hosts and uptime_monitors and run a GROUP BY uptime_host_id.
    - Use a SELECT that computes: monitor_count, up_count, down_count, pending_count, maintenance_count, avg_latency, last_check (MAX), last_status_change (MAX).
    - Provide a parameter to include a limited list of affected monitors (e.g. top N by last_check) and an optional uptime_24h calculation where a heartbeat history exists.
    - Return GORM structs matching the AggregatedHost DTO.
- Handler (in internal/api/handlers/uptime_handler.go)
  - Add func (h *UptimeHandler) AggregatedHosts(c *gin.Context) that:
    - Binds query params; validates and normalizes them.
    - Calls service.GetAggregatedHostStatuses(filters).
    - Filters the results using authMiddleware (maintain accessible hosts list or authHandler.GetAccessibleHosts logic).
    - Caches the result for CHARON_UPTIME_AGGREGATION_TTL (default 30s). Cache strategy: package global in services with simple sync.Map + TTL.
    - Produces a 200 JSON with the contract above.
  - Add unit tests and integration tests verifying results and auth scoping.
- Routes
  - Register under protected group in internal/api/routes/routes.go:
    - protected.GET("/uptime/hosts/aggregated", uptimeHandler.AggregatedHosts)
- Observability
  - Add a Prometheus counter/metric: charon_uptime_aggregated_requests_total (labels: status, cache_hit true/false).
  - Add logs for aggregation errors.
- Security
  - Ensure only authenticated users can access the aggregated endpoint.
  - Respect authHandler.GetAccessibleHosts (or similar) to filter hosts the user should see.
- Tests
  - Unit tests for service logic calculating aggregates (mock DB / in-memory DB fixtures).
  - Handler integration tests using the testdb and router that verify JSON response structure, pagination, filters, and auth filtering.
  - Perf tests: basic benchmark to ensure aggregation query completes within acceptable time for 10k monitors (e.g. < 200ms unless run on dev env; document specifics).
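The per-host rollup that the GROUP BY query is expected to produce can be mirrored in plain Go, which may be useful as a reference when writing unit-test fixtures for the service. This is a sketch under assumptions: MonitorRow, HostAggregate, and AggregateByHost are hypothetical names, average latency is a truncating integer mean over all of a host's monitors, and ties in down count are broken by host ID.

```go
package main

import (
	"sort"
	"time"
)

// MonitorRow is a simplified stand-in for one uptime_monitors row.
type MonitorRow struct {
	HostID    string
	Status    string // up | down | pending | maintenance
	LatencyMs int
	LastCheck time.Time
}

// HostAggregate holds the values the aggregate SELECT would compute per host.
type HostAggregate struct {
	HostID                         string
	MonitorCount                   int
	Up, Down, Pending, Maintenance int
	AvgLatencyMs                   int
	LastCheck                      time.Time // MAX(last_check) across monitors
}

// AggregateByHost groups rows by host and returns the aggregates sorted by
// down count (descending), then host ID (ascending).
func AggregateByHost(rows []MonitorRow) []HostAggregate {
	byHost := make(map[string]*HostAggregate)
	latencySum := make(map[string]int)
	for _, r := range rows {
		a, ok := byHost[r.HostID]
		if !ok {
			a = &HostAggregate{HostID: r.HostID}
			byHost[r.HostID] = a
		}
		a.MonitorCount++
		switch r.Status {
		case "up":
			a.Up++
		case "down":
			a.Down++
		case "pending":
			a.Pending++
		case "maintenance":
			a.Maintenance++
		}
		latencySum[r.HostID] += r.LatencyMs
		if r.LastCheck.After(a.LastCheck) {
			a.LastCheck = r.LastCheck
		}
	}
	out := make([]HostAggregate, 0, len(byHost))
	for id, a := range byHost {
		a.AvgLatencyMs = latencySum[id] / a.MonitorCount // integer mean, truncated
		out = append(out, *a)
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Down != out[j].Down {
			return out[i].Down > out[j].Down
		}
		return out[i].HostID < out[j].HostID
	})
	return out
}
```

Hosts with no monitors never appear in this rollup because grouping starts from monitor rows. The SQL version would need a LEFT JOIN from uptime_hosts to cover the zero-monitor case.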
Frontend Requirements
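- API client changes (frontend/src/api/uptime.ts)
  - Add export const getAggregatedHosts = async (params?: AggregationQueryParams) => client.get<AggregatedHostsResponse>('/uptime/hosts/aggregated', { params }).then(r => r.data)
  - Add new TypeScript types for AggregatedHost, AggregatedHostCounts, AffectedMonitor, and a response wrapper (called AggregatedHostsResponse here as a placeholder name) holding aggregated_hosts and meta, because the contract returns an object rather than a bare AggregatedHost[].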
- React Query Hook (frontend/src/hooks/useAggregatedHosts.ts)
  - useAggregatedHosts should accept params mirroring the query params (filters) and an enabled flag.
  - Use TanStack Query with refetchInterval: 30_000 and staleTime: 30_000 to match backend TTL.
- Dashboard Widget (frontend/src/components/Dashboard/HostStatusesWidget.tsx)
  - Shows high-level summary: total hosts, down_count, up_count, pending.
  - Clickable host rows navigate to the uptime or host detail page.
  - Visuals: small status badge, host name, counts, avg latency, last check time.
  - Accessible: all interactive elements keyboard and screen-reader navigable.
  - Fallback: if the aggregated endpoint returns 404 or 403, display a short explanatory message with a link to the uptime page.
- Dashboard Page Update (frontend/src/pages/Dashboard.tsx)
  - Add HostStatusesWidget to the Dashboard layout (prefer 2nd column near UptimeWidget).
- Tests
  - Unit tests for HostStatusesWidget rendering different states.
  - Mock API responses for useAggregatedHosts using the existing test utilities.
  - Add Storybook story if used in repo (optional).
- Styling
  - Keep styling consistent with UptimeWidget (dark-card, status badges, mini bars).
Acceptance Criteria
- API
  - GET /api/v1/uptime/hosts/aggregated returns aggregated host objects in the correct format.
  - Query params status, q, sort_by, order, page, per_page work as expected.
  - The endpoint respects user-specific host access permissions.
  - The endpoint honors the cache TTL; cached entries are invalidated when the TTL expires or when an underlying monitor status change triggers invalidation.
- Backend Tests
  - Unit tests cover all aggregation branches and logic (e.g. zero-monitor host, mixed statuses, all-down host).
  - Integration tests validate auth-scoped responses.
- Frontend UI
  - Widget displays host-level counts and shows a list of top N hosts with status badges.
  - Clicking a host navigates to the uptime or host detail page.
  - Widget refreshes according to TTL and reacts to manual refreshes.
  - UI has automated tests covering rendering with typical API responses, filtering and pagination UI behavior.
- Performance
  - Aggregation query responds within acceptable time for typical deployments (document target; e.g. < 200ms for 5k monitors), or we add a follow-up plan to add precomputation.
Example API Contract (Sample Request + Response)
Request:
GET /api/v1/uptime/hosts/aggregated?sort_by=down_count&order=desc&page=1&per_page=20
Authorization: Bearer <token>
Response:
{
  "aggregated_hosts": [
    {
      "id": "39b6f7c2-2a5c-47d7-9c9d-1d7f1977dabc",
      "host": "10.0.10.12",
      "name": "production-web-1",
      "status": "down",
      "monitor_count": 3,
      "counts": {"up": 1, "down": 2, "pending": 0, "maintenance": 0},
      "avg_latency_ms": 257,
      "last_check": "2025-12-05T09:54:54Z",
      "last_status_change": "2025-12-05T09:53:44Z",
      "affected_monitors": [
        {"id":"m-01","name":"api.example","status":"down","last_check":"2025-12-05T09:54:54Z","latency":105},
        {"id":"m-02","name":"www.example","status":"down","last_check":"2025-12-05T09:52:14Z","latency":401}
      ],
      "uptime_24h": 98.77
    }
  ],
  "meta": {"page":1,"per_page":20,"total":1}
}
Error cases
- 401 Unauthorized — Invalid or missing token.
- 403 Forbidden — Caller lacks host access.
- 500 Internal Server Error — DB / aggregation error.
Observability & Operational Notes
- Metrics: charon_uptime_aggregated_requests_total, charon_uptime_aggregated_cache_hits_total.
- Cache TTL: default 30s via CHARON_UPTIME_AGGREGATION_TTL env var.
- Logging: Rate-limited errors and aggregation durations logged to the general logger.
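The sync.Map + TTL cache and the CHARON_UPTIME_AGGREGATION_TTL setting could look roughly like the sketch below. It is a minimal illustration, not the project's implementation: TTLCache, its methods, and parseTTL are hypothetical names, and reading the variable as a Go duration string such as "45s" is an assumption, since the plan only states the 30s default.

```go
package main

import (
	"os"
	"sync"
	"time"
)

const defaultAggregationTTL = 30 * time.Second

// AggregationTTL reads CHARON_UPTIME_AGGREGATION_TTL from the environment.
func AggregationTTL() time.Duration {
	return parseTTL(os.Getenv("CHARON_UPTIME_AGGREGATION_TTL"))
}

// parseTTL accepts Go duration strings (assumed format) and falls back to
// the default when the value is empty, malformed, or not positive.
func parseTTL(raw string) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultAggregationTTL
	}
	return d
}

type cacheEntry struct {
	value     any
	expiresAt time.Time
}

// TTLCache is a small sync.Map-backed cache with per-entry expiry.
type TTLCache struct {
	ttl     time.Duration
	entries sync.Map // string -> cacheEntry
}

// NewTTLCache returns a cache whose entries expire ttl after being set.
func NewTTLCache(ttl time.Duration) *TTLCache {
	return &TTLCache{ttl: ttl}
}

// Get returns the cached value, or false when the key is missing or expired.
func (c *TTLCache) Get(key string) (any, bool) {
	v, ok := c.entries.Load(key)
	if !ok {
		return nil, false
	}
	e := v.(cacheEntry)
	if !time.Now().Before(e.expiresAt) {
		c.entries.Delete(key) // lazily evict the stale entry
		return nil, false
	}
	return e.value, true
}

// Set stores value under key with the cache's TTL.
func (c *TTLCache) Set(key string, value any) {
	c.entries.Store(key, cacheEntry{value: value, expiresAt: time.Now().Add(c.ttl)})
}

// Invalidate drops every entry, e.g. after a monitor status change.
func (c *TTLCache) Invalidate() {
	c.entries.Range(func(k, _ any) bool {
		c.entries.Delete(k)
		return true
	})
}
```

If results are cached after auth filtering, the key would need to include the caller's scope as well as the normalized filters, so that one user's view is never served to another. Caching the unfiltered aggregate and filtering per request is an alternative that avoids this.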
Follow-ups & Optional Enhancements
- Add an endpoint-level since parameter that returns delta/trend information (e.g. change in down_count in last 24 hours).
- Background precompute task (materialized aggregated table) for very large installations.
- Add a configuration to show affected_monitors collapsed/expanded per host for faster page loads.
Short List of Files To Change
- Backend:
  - backend/internal/services/uptime_service.go (add aggregation method)
  - backend/internal/api/handlers/uptime_handler.go (add handler method)
  - backend/internal/api/routes/routes.go (register new route)
  - backend/internal/services/uptime_service_test.go (add tests)
  - backend/internal/api/handlers/uptime_handler_test.go (add handler tests)
  - backend/internal/models/uptime.go / uptime_host.go (index recommendations or small schema updates if needed)
- Frontend:
  - frontend/src/api/uptime.ts (add getAggregatedHosts)
  - frontend/src/hooks/useAggregatedHosts.ts (new hook)
  - frontend/src/components/Dashboard/HostStatusesWidget.tsx (new widget)
  - frontend/src/pages/Dashboard.tsx (add widget)
  - frontend/src/components/tests/HostStatusesWidget.test.tsx (new tests)