
GitHub Actions f094123123 fix: add SQLite database recovery and WAL mode for corruption resilience

- Add scripts/db-recovery.sh for database integrity check and recovery
- Enable WAL mode verification with logging on startup
- Add structured error logging to uptime handlers with monitor context
- Add comprehensive database maintenance documentation

Fixes heartbeat history showing "No History Available" due to database
corruption affecting 6 out of 14 monitors.

2025-12-17 14:51:20 +00:00


Uptime Feature Trace Analysis - Bug Investigation

Issue: 6 out of 14 proxy hosts show "No History Available" in uptime heartbeat graphs
Date: December 17, 2025
Status: 🔴 ROOT CAUSE IDENTIFIED - SQLite Database Corruption

Executive Summary

This is NOT a logic bug. The root cause is SQLite database corruption affecting specific records in the `uptime_heartbeats` table. Querying heartbeat history consistently returns the error `database disk image is malformed` for exactly 6 specific monitor IDs.

1. Evidence from Container Logs

Error Pattern Observed

2025/12/17 07:44:04 /app/backend/internal/services/uptime_service.go:877 database disk image is malformed
[8.185ms] [rows:0] SELECT * FROM `uptime_heartbeats` WHERE monitor_id = "2b8cea58-b8f9-43fc-abe0-f6a0baba2351" ORDER BY created_at desc LIMIT 60

Affected Monitor IDs (6 total)

Monitor UUID	Status Code	Error
`2b8cea58-b8f9-43fc-abe0-f6a0baba2351`	500	database disk image is malformed
`5523d6b3-e2bf-4727-a071-6546f58e8839`	500	database disk image is malformed
`264fb47b-9814-479a-bb40-0397f21026fe`	500	database disk image is malformed
`97ecc308-ca86-41f9-ba59-5444409dee8e`	500	database disk image is malformed
`cad93a3d-6ad4-4cba-a95c-5bb9b46168cd`	500	database disk image is malformed
`cdc4d769-8703-4881-8202-4b2493bccf58`	500	database disk image is malformed
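A list like the one above can be pulled out of the container logs mechanically instead of by eye. The following is a minimal sketch, assuming the two-line log shape shown in the error pattern (an error line followed by the logged query); `AffectedMonitors` and `monitorIDPattern` are hypothetical names, not part of the codebase.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// monitorIDPattern matches the monitor_id literal in a logged heartbeat query.
var monitorIDPattern = regexp.MustCompile(`monitor_id = "([0-9a-f-]{36})"`)

// corruptionMessage is the SQLite error text seen in the container logs.
const corruptionMessage = "database disk image is malformed"

// AffectedMonitors scans log text and returns the sorted, de-duplicated
// monitor IDs whose query line directly follows a corruption error line.
func AffectedMonitors(logText string) []string {
	seen := map[string]bool{}
	sawError := false
	scanner := bufio.NewScanner(strings.NewReader(logText))
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, corruptionMessage) {
			sawError = true
			continue
		}
		if sawError {
			if m := monitorIDPattern.FindStringSubmatch(line); m != nil {
				seen[m[1]] = true
			}
		}
		sawError = false
	}
	ids := make([]string, 0, len(seen))
	for id := range seen {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	sample := "2025/12/17 07:44:04 /app/backend/internal/services/uptime_service.go:877 database disk image is malformed\n" +
		"[8.185ms] [rows:0] SELECT * FROM `uptime_heartbeats` WHERE monitor_id = \"2b8cea58-b8f9-43fc-abe0-f6a0baba2351\" ORDER BY created_at desc LIMIT 60\n"
	for _, id := range AffectedMonitors(sample) {
		fmt.Println(id)
	}
}
```

Queries that are not preceded by the error line are ignored, so monitors that return HTTP 200 do not appear in the result.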

Working Monitor IDs (8 total - return HTTP 200)

fdbc17bd-a00a-4bde-b2f9-e6db69a55c0a
869aee1a-37f0-437c-b151-72074629af3e
dc254e9c-28b5-4b59-ae9a-3c0378420a5a
33371a73-09a2-4c50-b327-69fab5324728
412f9c0b-8498-4045-97c9-021d6fc2ed7e
bef3866b-dbde-4159-9c40-1fb002ed0396
84329e2b-7f7e-4c8b-a1a6-ca52d3b7e565
edd36d10-0e5b-496c-acea-4e4cf7103369
0b426c10-82b8-4cc4-af0e-2dd5f1082fb2

2. Complete File Map - Uptime Feature

Frontend Layer (`frontend/src/`)

File	Purpose
pages/Uptime.tsx	Main Uptime page component, displays MonitorCard grid
api/uptime.ts	API client functions: `getMonitors()`, `getMonitorHistory()`, `updateMonitor()`, `deleteMonitor()`, `checkMonitor()`
components/UptimeWidget.tsx	Dashboard widget showing uptime summary
No dedicated hook	Uses inline `useQuery` in components

Backend Layer (`backend/internal/`)

File	Purpose
api/routes/routes.go	Route registration for `/uptime/*` endpoints
api/handlers/uptime_handler.go	HTTP handlers: `List()`, `GetHistory()`, `Update()`, `Delete()`, `Sync()`, `CheckMonitor()`
services/uptime_service.go	Business logic: monitor checking, notification batching, history retrieval
models/uptime.go	GORM models: `UptimeMonitor`, `UptimeHeartbeat`
models/uptime_host.go	GORM models: `UptimeHost`, `UptimeNotificationEvent`

3. Data Flow Analysis

Request Flow: UI → API → DB → Response

┌─────────────────────────────────────────────────────────────────────────┐
│ FRONTEND                                                                │
├─────────────────────────────────────────────────────────────────────────┤
│ 1. Uptime.tsx loads → useQuery(['monitors'], getMonitors)               │
│ 2. For each monitor, MonitorCard renders                                │
│ 3. MonitorCard calls useQuery(['uptimeHistory', monitor.id],            │
│    () => getMonitorHistory(monitor.id, 60))                             │
└───────────────────────────────┬─────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ API CLIENT (frontend/src/api/uptime.ts)                                 │
├─────────────────────────────────────────────────────────────────────────┤
│ getMonitorHistory(id: string, limit: number = 50):                      │
│   client.get<UptimeHeartbeat[]>                                         │
│     (`/uptime/monitors/${id}/history?limit=${limit}`)                   │
└───────────────────────────────┬─────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ BACKEND ROUTES (backend/internal/api/routes/routes.go)                  │
├─────────────────────────────────────────────────────────────────────────┤
│ protected.GET("/uptime/monitors/:id/history", uptimeHandler.GetHistory) │
└───────────────────────────────┬─────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ HANDLER (backend/internal/api/handlers/uptime_handler.go)               │
├─────────────────────────────────────────────────────────────────────────┤
│ func (h *UptimeHandler) GetHistory(c *gin.Context) {                    │
│     id := c.Param("id")                                                 │
│     limit, _ := strconv.Atoi(c.DefaultQuery("limit", "50"))             │
│     history, err := h.service.GetMonitorHistory(id, limit)              │
│     if err != nil {                                                     │
│         c.JSON(500, gin.H{"error": "Failed to get history"}) ◄─ ERROR   │
│         return                                                          │
│     }                                                                   │
│     c.JSON(200, history)                                                │
│ }                                                                       │
└───────────────────────────────┬─────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ SERVICE (backend/internal/services/uptime_service.go:875-879)           │
├─────────────────────────────────────────────────────────────────────────┤
│ func (s *UptimeService) GetMonitorHistory(id string, limit int)         │
│     ([]models.UptimeHeartbeat, error) {                                 │
│     var heartbeats []models.UptimeHeartbeat                             │
│     result := s.DB.Where("monitor_id = ?", id)                          │
│                   .Order("created_at desc")                             │
│                   .Limit(limit)                                         │
│                   .Find(&heartbeats)     ◄─ GORM QUERY                  │
│     return heartbeats, result.Error      ◄─ ERROR RETURNED HERE         │
│ }                                                                       │
└───────────────────────────────┬─────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ DATABASE (SQLite via GORM)                                              │
├─────────────────────────────────────────────────────────────────────────┤
│ SELECT * FROM uptime_heartbeats                                         │
│ WHERE monitor_id = "..."                                                │
│ ORDER BY created_at desc                                                │
│ LIMIT 60                                                                │
│                                                                         │
│ ERROR: "database disk image is malformed"                               │
└─────────────────────────────────────────────────────────────────────────┘

4. Database Schema

UptimeMonitor Table

type UptimeMonitor struct {
    ID             string    `gorm:"primaryKey" json:"id"`      // UUID
    ProxyHostID    *uint     `json:"proxy_host_id"`             // Optional FK
    RemoteServerID *uint     `json:"remote_server_id"`          // Optional FK
    UptimeHostID   *string   `json:"uptime_host_id"`            // FK to UptimeHost
    Name           string    `json:"name"`
    Type           string    `json:"type"`                      // http, tcp, ping
    URL            string    `json:"url"`
    UpstreamHost   string    `json:"upstream_host"`
    Interval       int       `json:"interval"`                  // seconds
    Enabled        bool      `json:"enabled"`
    Status         string    `json:"status"`                    // up, down, pending
    LastCheck      time.Time `json:"last_check"`
    Latency        int64     `json:"latency"`                   // ms
    FailureCount   int       `json:"failure_count"`
    MaxRetries     int       `json:"max_retries"`
    // ... timestamps
}

UptimeHeartbeat Table (where corruption exists)

type UptimeHeartbeat struct {
    ID        uint      `gorm:"primaryKey" json:"id"`          // Auto-increment
    MonitorID string    `json:"monitor_id" gorm:"index"`       // UUID FK
    Status    string    `json:"status"`                        // up, down
    Latency   int64     `json:"latency"`
    Message   string    `json:"message"`
    CreatedAt time.Time `json:"created_at" gorm:"index"`
}

5. Root Cause Identification

Primary Issue: SQLite Database Corruption

The error `database disk image is malformed` is specific to SQLite and indicates corruption in the database file's B-tree structure.

Possible causes:
1. Disk I/O errors during write operations
2. Unexpected container shutdown mid-transaction
3. File system issues in the Docker volume
4. Database file written by multiple processes without working file locking (for example, on a shared or network-backed volume)
5. Full disk causing incomplete writes

Why Only Some Monitors Are Affected

The corruption appears to be localized to specific B-tree pages that contain the heartbeat records for those 6 monitors. SQLite raises the error when:

- The query touches corrupted pages
- The index on monitor_id or created_at has corruption
- The data pages for those specific rows are damaged
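The locality argument can be illustrated with a toy model. This is only a sketch: `toyDB`, its fields, and the monitor names are invented for illustration, and real SQLite B-trees (table and index pages, interior and leaf nodes) are considerably more involved. It shows only that a query fails when, and only when, it has to read a damaged page.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMalformed mirrors the SQLite error text from the logs.
var ErrMalformed = errors.New("database disk image is malformed")

// toyDB is a deliberately simplified model: each monitor's query must read
// a set of numbered pages, and some pages are marked as corrupted.
type toyDB struct {
	pagesByMonitor map[string][]int
	corrupted      map[int]bool
}

// History fails if any page it must read is corrupted; otherwise it
// returns the number of pages read.
func (db toyDB) History(monitorID string) (int, error) {
	pages := db.pagesByMonitor[monitorID]
	for _, p := range pages {
		if db.corrupted[p] {
			return 0, fmt.Errorf("page %d: %w", p, ErrMalformed)
		}
	}
	return len(pages), nil
}

func main() {
	db := toyDB{
		pagesByMonitor: map[string][]int{
			"monitor-a": {1, 2}, // only healthy pages
			"monitor-b": {3, 4}, // page 4 is damaged
		},
		corrupted: map[int]bool{4: true},
	}
	for _, id := range []string{"monitor-a", "monitor-b"} {
		_, err := db.History(id)
		fmt.Printf("%s: err=%v\n", id, err)
	}
}
```

In this model `monitor-a` keeps working indefinitely while `monitor-b` fails on every request, which matches the stable split between healthy and failing monitor IDs.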

Evidence Supporting This Conclusion

Consistent 500 errors for the same 6 monitor IDs
Other queries succeed (listing monitors returns 200)
Error occurs at the GORM layer (uptime_service.go:877)
Query itself is correct (same pattern works for 8 other monitors)
No ID mismatch - UUIDs are correctly passed from frontend to backend

6. Recommended Actions

Immediate Actions

Stop the container gracefully to prevent further corruption:
```
docker stop charon
```

Backup the current database before any repair:

docker cp charon:/app/data/charon.db ./charon.db.backup.$(date +%Y%m%d)

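Check database integrity. `docker exec` does not work on a stopped container, so run the check against the backup copy on the host (requires the `sqlite3` CLI):

sqlite3 ./charon.db.backup.$(date +%Y%m%d) "PRAGMA integrity_check;"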

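Attempt database recovery. Run these commands wherever the database file is reachable while the container is stopped; the `/app/data` paths below are the in-container paths:

# Export all data that can be read; on a corrupted file the dump may end in ROLLBACK
sqlite3 /app/data/charon.db ".dump" > dump.sql
# If the dump is incomplete, .recover (sqlite3 CLI 3.29+) usually salvages more:
#   sqlite3 /app/data/charon.db ".recover" > dump.sql
# Create new database
sqlite3 /app/data/charon_new.db < dump.sql
# Verify the rebuilt file before replacing the original
sqlite3 /app/data/charon_new.db "PRAGMA integrity_check;"
# Replace original
mv /app/data/charon_new.db /app/data/charon.db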
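The integrity check step above could also be run from Go, for example at startup. The sketch below uses only `database/sql` and assumes the caller has already opened the handle with a SQLite driver of their choice; `IntegrityReport`, `interpretIntegrityRows`, and `CheckIntegrity` are hypothetical names. It relies on SQLite's convention that a healthy database returns a single row containing `ok`.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// IntegrityReport summarizes the rows returned by PRAGMA integrity_check.
type IntegrityReport struct {
	OK       bool
	Problems []string
}

// interpretIntegrityRows treats exactly one row containing "ok" as healthy;
// any other result is reported as a list of problems.
func interpretIntegrityRows(rows []string) IntegrityReport {
	if len(rows) == 1 && rows[0] == "ok" {
		return IntegrityReport{OK: true}
	}
	return IntegrityReport{OK: false, Problems: rows}
}

// CheckIntegrity runs the pragma on an already-open handle.
// Assumption: db was opened elsewhere with a SQLite driver.
func CheckIntegrity(ctx context.Context, db *sql.DB) (IntegrityReport, error) {
	rows, err := db.QueryContext(ctx, "PRAGMA integrity_check;")
	if err != nil {
		return IntegrityReport{}, fmt.Errorf("integrity_check: %w", err)
	}
	defer rows.Close()

	var lines []string
	for rows.Next() {
		var line string
		if err := rows.Scan(&line); err != nil {
			return IntegrityReport{}, fmt.Errorf("scan integrity_check row: %w", err)
		}
		lines = append(lines, line)
	}
	if err := rows.Err(); err != nil {
		return IntegrityReport{}, err
	}
	return interpretIntegrityRows(lines), nil
}

func main() {
	// Demonstrates only the interpretation step; no database is opened here.
	fmt.Println(interpretIntegrityRows([]string{"ok"}).OK)
	fmt.Println(interpretIntegrityRows([]string{"(problem description)"}).OK)
}
```

Keeping the row interpretation separate from the query makes the decision logic testable without a database file.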

If Recovery Fails

Delete corrupted heartbeat records (lossy but restores functionality):

DELETE FROM uptime_heartbeats WHERE monitor_id IN (
    '2b8cea58-b8f9-43fc-abe0-f6a0baba2351',
    '5523d6b3-e2bf-4727-a071-6546f58e8839',
    '264fb47b-9814-479a-bb40-0397f21026fe',
    '97ecc308-ca86-41f9-ba59-5444409dee8e',
    'cad93a3d-6ad4-4cba-a95c-5bb9b46168cd',
    'cdc4d769-8703-4881-8202-4b2493bccf58'
);
VACUUM;

Long-Term Prevention

Enable WAL mode for better crash resilience (in DB initialization):
```
db.Exec("PRAGMA journal_mode=WAL;")
```
Add periodic VACUUM to compact database and rebuild indexes
Consider heartbeat table rotation - archive old heartbeats to prevent unbounded growth
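Enabling WAL is more useful when the result is verified, because `PRAGMA journal_mode` returns the mode actually in effect and SQLite can decline the change (an in-memory database, for instance, reports `memory`). A minimal sketch using `database/sql` rather than GORM follows; `EnableWAL` and `isWAL` are hypothetical names, and the handle is assumed to be opened elsewhere with a SQLite driver.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"strings"
)

// isWAL reports whether the mode string returned by SQLite denotes WAL.
func isWAL(mode string) bool {
	return strings.EqualFold(strings.TrimSpace(mode), "wal")
}

// EnableWAL requests WAL journaling and checks the mode SQLite reports back.
// The pragma returns the resulting journal mode as a single row, so a
// declined change is detected instead of being silently ignored.
func EnableWAL(ctx context.Context, db *sql.DB) error {
	var mode string
	if err := db.QueryRowContext(ctx, "PRAGMA journal_mode=WAL;").Scan(&mode); err != nil {
		return fmt.Errorf("set journal_mode: %w", err)
	}
	if !isWAL(mode) {
		return fmt.Errorf("journal_mode is %q, expected wal", mode)
	}
	return nil
}

func main() {
	// Demonstrates only the mode comparison; no database is opened here.
	fmt.Println(isWAL("wal"), isWAL("delete"))
}
```

A caller would typically run `EnableWAL` once during database initialization and log the returned error, if any, before serving requests.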

7. Code Quality Notes

No Logic Bugs Found

After tracing the complete data flow:

✅ Frontend correctly passes monitor UUID
✅ API route correctly extracts :id param
✅ Handler correctly calls service with UUID
✅ Service correctly queries by monitor_id
✅ GORM model has correct field types and indexes

Potential Improvement: Error Handling

The handler currently returns the same generic "Failed to get history" message for every error, without recording the cause or the monitor involved:

// Current (hides root cause)
if err != nil {
    c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to get history"})
    return
}

// Better (exposes root cause in logs, generic to user)
if err != nil {
    logger.Log().WithError(err).WithField("monitor_id", id).Error("GetHistory failed")
    c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to get history"})
    return
}
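Going one step further, the log entry could flag corruption explicitly so it stands out from ordinary query failures. The sketch below is an assumption-laden illustration: it uses the standard library's `log/slog` (Go 1.21+) as a stand-in for the project's `logger` package, matches on the error text because no particular SQLite driver is assumed, and all names (`IsCorruption`, `logHistoryError`, `demoError`, `newDemoLogger`) are hypothetical.

```go
package main

import (
	"fmt"
	"log/slog"
	"os"
	"strings"
)

// malformedText is the SQLite error text seen in the container logs.
const malformedText = "database disk image is malformed"

// demoError is a stand-in error type used only for this example.
type demoError string

func (e demoError) Error() string { return string(e) }

// IsCorruption reports whether err carries SQLite's corruption message.
// Text matching is a driver-agnostic assumption; a specific driver may
// expose a typed error code that is more reliable.
func IsCorruption(err error) bool {
	return err != nil && strings.Contains(err.Error(), malformedText)
}

// newDemoLogger returns a text logger writing to stderr.
func newDemoLogger() *slog.Logger {
	return slog.New(slog.NewTextHandler(os.Stderr, nil))
}

// logHistoryError records the real cause with monitor context and returns
// the generic message that is safe to send to the client.
func logHistoryError(logger *slog.Logger, monitorID string, err error) string {
	logger.Error("GetHistory failed",
		"monitor_id", monitorID,
		"db_corruption", IsCorruption(err),
		"error", err.Error(),
	)
	return "Failed to get history"
}

func main() {
	// Wrapping with %w keeps the original text in the error string.
	err := fmt.Errorf("query heartbeats: %w", demoError(malformedText))
	msg := logHistoryError(newDemoLogger(), "2b8cea58-b8f9-43fc-abe0-f6a0baba2351", err)
	fmt.Println(msg)
}
```

The client still sees only the generic message, while the log line carries the monitor ID and a `db_corruption` field that can drive alerting.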

8. Summary

Question	Answer
Is this a frontend bug?	❌ No
Is this a backend logic bug?	❌ No
Is this an ID mismatch?	❌ No (UUIDs are consistent)
Is this a timing issue?	❌ No
Is this database corruption?	✅ YES
Affected component	SQLite `uptime_heartbeats` table
Root cause	Disk image malformed (B-tree corruption)
Immediate fix	Database recovery/rebuild
Permanent fix	Enable WAL mode, graceful shutdowns

Investigation completed: December 17, 2025
Investigator: GitHub Copilot
