diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index a964e8da..55d2aa54 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -306,11 +306,13 @@ graph TB **Key Modules:** #### API Layer (`internal/api/`) + - **Handlers:** Process HTTP requests, validate input, return responses - **Middleware:** CORS, GZIP, authentication, logging, metrics, panic recovery - **Routes:** Route registration and grouping (public vs authenticated) **Example Endpoints:** + - `GET /api/v1/proxy-hosts` - List all proxy hosts - `POST /api/v1/proxy-hosts` - Create new proxy host - `PUT /api/v1/proxy-hosts/:id` - Update proxy host @@ -318,6 +320,7 @@ graph TB - `WS /api/v1/logs` - WebSocket for real-time logs #### Service Layer (`internal/services/`) + - **ProxyService:** CRUD operations for proxy hosts, validation logic - **CertificateService:** ACME certificate provisioning and renewal - **DockerService:** Container discovery and monitoring @@ -327,12 +330,14 @@ graph TB **Design Pattern:** Services contain business logic and call multiple repositories/managers #### Caddy Manager (`internal/caddy/`) + - **Manager:** Orchestrates Caddy configuration updates - **Config Builder:** Generates Caddy JSON from database models - **Reload Logic:** Atomic config application with rollback on failure - **Security Integration:** Injects Cerberus middleware into Caddy pipelines **Responsibilities:** + 1. Generate Caddy JSON configuration from database state 2. Validate configuration before applying 3. Trigger Caddy reload via JSON API @@ -340,22 +345,26 @@ graph TB 5. 
Integrate security layers (WAF, ACL, Rate Limiting) #### Security Suite (`internal/cerberus/`) + - **ACL (Access Control Lists):** IP-based allow/deny rules, GeoIP blocking - **WAF (Web Application Firewall):** Coraza engine with OWASP CRS - **CrowdSec:** Behavior-based threat detection with global intelligence - **Rate Limiter:** Per-IP request throttling **Integration Points:** + - Middleware injection into Caddy request pipeline - Database-driven rule configuration - Metrics collection for security events #### Database Layer (`internal/database/`) + - **Migrations:** Automatic schema versioning with GORM AutoMigrate - **Seeding:** Default settings and admin user creation - **Connection Management:** SQLite with WAL mode and connection pooling **Schema Overview:** + - **ProxyHost:** Domain, upstream target, SSL config - **RemoteServer:** Upstream server definitions - **CaddyConfig:** Generated Caddy configuration (audit trail) @@ -372,6 +381,7 @@ graph TB **Component Architecture:** #### Pages (`src/pages/`) + - **Dashboard:** System overview, recent activity, quick actions - **ProxyHosts:** List, create, edit, delete proxy configurations - **Certificates:** Manage SSL/TLS certificates, view expiry @@ -380,17 +390,20 @@ graph TB - **Users:** User management (admin only) #### Components (`src/components/`) + - **Forms:** Reusable form inputs with validation - **Modals:** Dialog components for CRUD operations - **Tables:** Data tables with sorting, filtering, pagination - **Layout:** Header, sidebar, navigation #### API Client (`src/api/`) + - Centralized API calls with error handling - Request/response type definitions - Authentication token management **Example:** + ```typescript export const getProxyHosts = async (): Promise => { const response = await fetch('/api/v1/proxy-hosts', { @@ -402,11 +415,13 @@ export const getProxyHosts = async (): Promise => { ``` #### State Management + - **React Context:** Global state for auth, theme, language - **Local State:** 
Component-specific state with `useState` - **Custom Hooks:** Encapsulate API calls and side effects **Example Hook:** + ```typescript export const useProxyHosts = () => { const [hosts, setHosts] = useState([]); @@ -425,11 +440,13 @@ export const useProxyHosts = () => { **Purpose:** High-performance reverse proxy with automatic HTTPS **Integration:** + - Embedded as a library in the Go backend - Configured via JSON API (not Caddyfile) - Listens on ports 80 (HTTP) and 443 (HTTPS) **Features Used:** + - Dynamic configuration updates without restarts - Automatic HTTPS with Let's Encrypt and ZeroSSL - DNS challenge support for wildcard certificates @@ -437,6 +454,7 @@ export const useProxyHosts = () => { - Request logging and metrics **Configuration Flow:** + 1. User creates proxy host via frontend 2. Backend validates and saves to database 3. Caddy Manager generates JSON configuration @@ -461,12 +479,14 @@ For each proxy host, Charon generates **two routes** with the same domain: - Handlers: Full Cerberus security suite This pattern is **intentional and valid**: + - Emergency route provides break-glass access to security controls - Main route protects application with enterprise security features - Caddy processes routes in order (emergency matches first) - Validator allows duplicate hosts when one has paths and one doesn't **Example:** + ```json // Emergency Route (evaluated first) { @@ -488,6 +508,7 @@ This pattern is **intentional and valid**: **Purpose:** Persistent data storage **Why SQLite:** + - Embedded (no external database server) - Serverless (perfect for single-user/small team) - ACID compliant with WAL mode @@ -495,16 +516,19 @@ This pattern is **intentional and valid**: - Backup-friendly (single file) **Configuration:** + - **WAL Mode:** Allows concurrent reads during writes - **Foreign Keys:** Enforced referential integrity - **Pragma Settings:** Performance optimizations **Backup Strategy:** + - Automated daily backups to `data/backups/` - Retention: 7 
daily, 4 weekly, 12 monthly backups - Backup during low-traffic periods **Migrations:** + - GORM AutoMigrate for schema changes - Manual migrations for complex data transformations - Rollback support via backup restoration @@ -537,6 +561,7 @@ graph LR **Purpose:** Prevent brute-force attacks and API abuse **Implementation:** + - Per-IP request counters with sliding window - Configurable thresholds (e.g., 100 req/min, 1000 req/hour) - HTTP 429 response when limit exceeded @@ -547,12 +572,14 @@ graph LR **Purpose:** Behavior-based threat detection **Features:** + - Local log analysis (brute-force, port scans, exploits) - Global threat intelligence (crowd-sourced IP reputation) - Automatic IP banning with configurable duration - Decision management API (view, create, delete bans) **Modes:** + - **Local Only:** No external API calls - **API Mode:** Sync with CrowdSec cloud for global intelligence @@ -561,12 +588,14 @@ graph LR **Purpose:** IP-based access control **Features:** + - Per-proxy-host allow/deny rules - CIDR range support (e.g., `192.168.1.0/24`) - Geographic blocking via GeoIP2 (MaxMind) - Admin whitelist (emergency access) **Evaluation Order:** + 1. Check admin whitelist (always allow) 2. Check deny list (explicit block) 3. Check allow list (explicit allow) @@ -579,6 +608,7 @@ graph LR **Engine:** Coraza with OWASP Core Rule Set (CRS) **Detection Categories:** + - SQL Injection (SQLi) - Cross-Site Scripting (XSS) - Remote Code Execution (RCE) @@ -587,12 +617,14 @@ graph LR - Command Injection **Modes:** + - **Monitor:** Log but don't block (testing) - **Block:** Return HTTP 403 for violations ### Layer 5: Application Security **Additional Protections:** + - **SSRF Prevention:** Block requests to private IP ranges in webhooks/URL validation - **HTTP Security Headers:** CSP, HSTS, X-Frame-Options, X-Content-Type-Options - **Input Validation:** Server-side validation for all user inputs @@ -610,6 +642,7 @@ graph LR 3. 
**Direct Database Access:** Manual SQLite update as last resort **Emergency Token:** + - 64-character hex token set via `CHARON_EMERGENCY_TOKEN` - Grants temporary admin access - Rotated after each use @@ -635,6 +668,7 @@ Charon operates with **two distinct traffic flows** on separate ports, each with - **Testing:** Playwright E2E tests verify UI/UX functionality on this port **Why No Middleware?** + - Management interface must remain accessible even when security modules are misconfigured - Emergency endpoints (`/api/v1/emergency/*`) require unrestricted access for system recovery - Separation of concerns: admin access control is handled by JWT, not proxy-level security @@ -797,6 +831,7 @@ sequenceDiagram **Rationale:** Simplicity over scalability - target audience is home users and small teams **Container Contents:** + - Frontend static files (Vite build output) - Go backend binary - Embedded Caddy server @@ -911,11 +946,13 @@ services: ### High Availability Considerations **Current Limitations:** + - SQLite does not support clustering - Single point of failure (one container) - Not designed for horizontal scaling **Future Options:** + - PostgreSQL backend for HA deployments - Read replicas for load balancing - Container orchestration (Kubernetes, Docker Swarm) @@ -927,6 +964,7 @@ services: ### Local Development Setup 1. **Prerequisites:** + ```bash - Go 1.26+ (backend development) - Node.js 23+ and npm (frontend development) @@ -935,12 +973,14 @@ services: ``` 2. **Clone Repository:** + ```bash git clone https://github.com/Wikid82/Charon.git cd Charon ``` 3. **Backend Development:** + ```bash cd backend go mod download @@ -949,6 +989,7 @@ services: ``` 4. **Frontend Development:** + ```bash cd frontend npm install @@ -957,6 +998,7 @@ services: ``` 5. 
**Full-Stack Development (Docker):** + ```bash docker-compose -f .docker/compose/docker-compose.dev.yml up # Frontend + Backend + Caddy in one container @@ -965,12 +1007,14 @@ services: ### Git Workflow **Branch Strategy:** + - `main`: Stable production branch - `feature/*`: New feature development - `fix/*`: Bug fixes - `chore/*`: Maintenance tasks **Commit Convention:** + - `feat:` New user-facing feature - `fix:` Bug fix in application code - `chore:` Infrastructure, CI/CD, dependencies @@ -979,6 +1023,7 @@ services: - `test:` Adding or updating tests **Example:** + ``` feat: add DNS-01 challenge support for Cloudflare @@ -1031,6 +1076,7 @@ Closes #123 **Purpose:** Validate critical user flows in a real browser **Scope:** + - User authentication - Proxy host CRUD operations - Certificate provisioning @@ -1038,6 +1084,7 @@ Closes #123 - Real-time log streaming **Execution:** + ```bash # Run against Docker container npx playwright test --project=chromium @@ -1050,10 +1097,12 @@ npx playwright test --debug ``` **Coverage Modes:** + - **Docker Mode:** Integration testing, no coverage (0% reported) - **Vite Dev Mode:** Coverage collection with V8 inspector **Why Two Modes?** + - Playwright coverage requires source maps and raw source files - Docker serves pre-built production files (no source maps) - Vite dev server exposes source files for coverage instrumentation @@ -1067,6 +1116,7 @@ npx playwright test --debug **Coverage Target:** 85% minimum **Execution:** + ```bash # Run all tests go test ./... @@ -1079,11 +1129,13 @@ go test -cover ./... 
``` **Test Organization:** + - `*_test.go` files alongside source code - Table-driven tests for comprehensive coverage - Mocks for external dependencies (database, HTTP clients) **Example:** + ```go func TestCreateProxyHost(t *testing.T) { tests := []struct { @@ -1123,6 +1175,7 @@ func TestCreateProxyHost(t *testing.T) { **Coverage Target:** 85% minimum **Execution:** + ```bash # Run all tests npm test @@ -1135,6 +1188,7 @@ npm run test:coverage ``` **Test Organization:** + - `*.test.tsx` files alongside components - Mock API calls with MSW (Mock Service Worker) - Snapshot tests for UI consistency @@ -1146,12 +1200,14 @@ npm run test:coverage **Location:** `backend/integration/` **Scope:** + - API endpoint end-to-end flows - Database migrations - Caddy manager integration - CrowdSec API calls **Execution:** + ```bash go test ./integration/... ``` @@ -1161,6 +1217,7 @@ go test ./integration/... **Automated Hooks (via `.pre-commit-config.yaml`):** **Fast Stage (< 5 seconds):** + - Trailing whitespace removal - EOF fixer - YAML syntax check @@ -1168,11 +1225,13 @@ go test ./integration/... - Markdown link validation **Manual Stage (run explicitly):** + - Backend coverage tests (60-90s) - Frontend coverage tests (30-60s) - TypeScript type checking (10-20s) **Why Manual?** + - Coverage tests are slow and would block commits - Developers run them on-demand before pushing - CI enforces coverage on pull requests @@ -1180,10 +1239,12 @@ go test ./integration/... ### Continuous Integration (GitHub Actions) **Workflow Triggers:** + - `push` to `main`, `feature/*`, `fix/*` - `pull_request` to `main` **CI Jobs:** + 1. **Lint:** golangci-lint, ESLint, markdownlint, hadolint 2. **Test:** Go tests, Vitest, Playwright 3. **Security:** Trivy, CodeQL, Grype, Govulncheck @@ -1205,6 +1266,7 @@ go test ./integration/... - **PRERELEASE:** `-beta.1`, `-rc.1`, etc. 
**Examples:** + - `1.0.0` - Stable release - `1.1.0` - New feature (DNS provider support) - `1.1.1` - Bug fix (GORM query fix) @@ -1215,12 +1277,14 @@ go test ./integration/... ### Build Pipeline (Multi-Platform) **Platforms Supported:** + - `linux/amd64` - `linux/arm64` **Build Process:** 1. **Frontend Build:** + ```bash cd frontend npm ci --only=production @@ -1229,6 +1293,7 @@ go test ./integration/... ``` 2. **Backend Build:** + ```bash cd backend go build -o charon cmd/api/main.go @@ -1236,6 +1301,7 @@ go test ./integration/... ``` 3. **Docker Image Build:** + ```bash docker buildx build \ --platform linux/amd64,linux/arm64 \ @@ -1292,6 +1358,7 @@ go test ./integration/... - Level: SLSA Build L3 (hermetic builds) **Verification Example:** + ```bash # Verify image signature cosign verify \ @@ -1309,6 +1376,7 @@ grype ghcr.io/wikid82/charon@sha256: ### Rollback Strategy **Container Rollback:** + ```bash # List available versions docker images wikid82/charon @@ -1319,6 +1387,7 @@ docker-compose up -d --pull always wikid82/charon:1.1.1 ``` **Database Rollback:** + ```bash # Restore from backup docker exec charon /app/scripts/restore-backup.sh \ @@ -1355,11 +1424,13 @@ docker exec charon /app/scripts/restore-backup.sh \ ### API Extensibility **REST API Design:** + - Version prefix: `/api/v1/` - Future versions: `/api/v2/` (backward-compatible) - Deprecation policy: 2 major versions supported **WebHooks (Future):** + - Event notifications for external systems - Triggers: Proxy host created, certificate renewed, security event - Payload: JSON with event type and data @@ -1369,6 +1440,7 @@ docker exec charon /app/scripts/restore-backup.sh \ **Current:** Cerberus security middleware injected into Caddy pipeline **Future:** + - User-defined middleware (rate limiting rules, custom headers) - JavaScript/Lua scripting for request transformation - Plugin marketplace for community contributions @@ -1452,6 +1524,7 @@ docker exec charon /app/scripts/restore-backup.sh \ 
**GitHub Copilot Instructions:** All agents (`Planning`, `Backend_Dev`, `Frontend_Dev`, `DevOps`) must reference `ARCHITECTURE.md` when: + - Creating new components - Modifying core systems - Changing integration points diff --git a/CHANGELOG.md b/CHANGELOG.md index a78f0d11..780670df 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -30,16 +30,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - See [Notification Guide](docs/features/notifications.md) for setup instructions ### CI/CD + - **Supply Chain**: Optimized verification workflow to prevent redundant builds - Change: Removed direct Push/PR triggers; now waits for 'Docker Build' via `workflow_run` ### Security + - **Supply Chain**: Enhanced PR verification workflow stability and accuracy - **Vulnerability Reporting**: Eliminated false negatives ("0 vulnerabilities") by enforcing strict failure conditions - **Tooling**: Switched to manual Grype installation ensuring usage of latest stable binary - **Observability**: Improved debugging visibility for vulnerability scans and SARIF generation ### Performance + - **E2E Tests**: Reduced feature flag API calls by 90% through conditional polling optimization (Phase 2) - Conditional skip: Exits immediately if flags already in expected state (~50% of cases) - Request coalescing: Shares in-flight API requests between parallel test workers @@ -51,6 +54,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - Prevents timeout errors in Firefox/WebKit caused by strict label matching ### Fixed + - **TCP Monitor Creation**: Fixed misleading form UX that caused silent HTTP 500 errors when creating TCP monitors - Corrected URL placeholder to show `host:port` format instead of the incorrect `tcp://host:port` prefix - Added dynamic per-type placeholder and helper text (HTTP monitors show a full URL example; TCP monitors show `host:port`) @@ -72,6 +76,7 @@ and this project adheres to [Semantic 
Versioning](https://semver.org/spec/v2.0.0 - **Test Performance**: Reduced system settings test execution time by 31% (from 23 minutes to 16 minutes) ### Changed + - **Testing Infrastructure**: Enhanced E2E test helpers with better synchronization and error handling - **CI**: Optimized E2E workflow shards [Reduced from 4 to 3] diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 963bd4d2..422b8534 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -45,8 +45,6 @@ brew install lefthook go install github.com/evilmartians/lefthook@latest ``` - - ```bash # Option 1: Homebrew (macOS/Linux) brew install golangci-lint @@ -84,17 +82,20 @@ For local development, install go 1.26.0+ from [go.dev/dl](https://go.dev/dl/). When the project's Go version is updated (usually by Renovate): 1. **Pull the latest changes** + ```bash git pull ``` 2. **Update your local Go installation** + ```bash # Run the Go update skill (downloads and installs the new version) .github/skills/scripts/skill-runner.sh utility-update-go-version ``` 3. **Rebuild your development tools** + ```bash # This fixes lefthook hook errors and IDE issues ./scripts/rebuild-go-tools.sh diff --git a/README.md b/README.md index 64f23ed8..776b95a6 100644 --- a/README.md +++ b/README.md @@ -94,6 +94,7 @@ services: retries: 3 start_period: 40s ``` + > **Docker Socket Access:** Charon runs as a non-root user. If you mount the Docker socket for container discovery, the container needs permission to read it. Find your socket's group ID and add it to the compose file: > > ```bash @@ -107,26 +108,34 @@ services: > - "998" > ``` -### 2️⃣ Generate encryption key: +### 2️⃣ Generate encryption key + ```bash openssl rand -base64 32 ``` -### 3️⃣ Start Charon: + +### 3️⃣ Start Charon + ```bash docker-compose up -d ``` -### 4️⃣ Access the dashboard: + +### 4️⃣ Access the dashboard + Open your browser and navigate to `http://localhost:8080` to access the dashboard and create your admin account. 
+ ```code http://localhost:8080 ``` -### Getting Started: -Full setup instructions and documentation are available at [https://wikid82.github.io/Charon/docs/getting-started.html](https://wikid82.github.io/Charon/docs/getting-started.html). +### Getting Started + +Full setup instructions and documentation are available at [https://wikid82.github.io/Charon/docs/getting-started.html](https://wikid82.github.io/Charon/docs/getting-started.html). --- ## ✨ Top 10 Features ### 🎯 **Point & Click Management** + No config files. No terminal commands. Just click, type your domain name, and you're live. If you can use a website, you can run Charon. ### 🔐 **Automatic HTTPS Certificates** @@ -160,6 +169,7 @@ See exactly what's happening with live request logs, uptime monitoring, and inst ### 📥 **Migration Made Easy** Already invested in another reverse proxy? Bring your work with you by importing your existing configurations with one click: + - **Caddyfile** — Migrate from other Caddy setups - **Nginx** — Import from Nginx based configurations (Coming Soon) - **Traefik** - Import from Traefik based configurations (Coming Soon) diff --git a/SECURITY.md b/SECURITY.md index 51679df7..96c6c2bc 100644 --- a/SECURITY.md +++ b/SECURITY.md @@ -41,16 +41,19 @@ container image. The binaries were compiled against Go 1.25.6, which contains th Charon's own application code, compiled with Go 1.26.1, is unaffected. 
**Who** + - Discovered by: Automated scan (Grype) - Reported: 2026-03-20 - Affects: CrowdSec Agent component within the container; not directly exposed through Charon's primary application interface **Where** + - Component: CrowdSec Agent (bundled `cscli` and `crowdsec` binaries) - Versions affected: Charon container images with CrowdSec binaries compiled against Go < 1.25.7 **When** + - Discovered: 2026-03-20 - Disclosed (if public): Not yet publicly disclosed - Target fix: When `golang:1.26.2-alpine` is published on Docker Hub @@ -82,16 +85,19 @@ configuration includes the `DEFAULT` keyword, potentially allowing downgrade to suites. Affects Alpine 3.23.3 packages `libcrypto3` and `libssl3` at version 3.5.5-r0. **Who** + - Discovered by: Automated scan (Grype) - Reported: 2026-03-20 - Affects: Container runtime environment; Caddy reverse proxy TLS negotiation could be affected if default key group configuration is used **Where** + - Component: Alpine 3.23.3 base image (`libcrypto3` 3.5.5-r0, `libssl3` 3.5.5-r0) - Versions affected: Alpine 3.23.3 prior to a patched `openssl` APK release **When** + - Discovered: 2026-03-20 - Disclosed (if public): 2026-03-13 (OpenSSL advisory) - Target fix: When Alpine Security publishes a patched `openssl` APK @@ -103,7 +109,7 @@ does not use the `DEFAULT` keyword, which limits practical exploitability. The p present in the base image regardless of Caddy's configuration. **Planned Remediation** -Monitor https://security.alpinelinux.org/vuln/CVE-2026-2673 for a patched Alpine APK. Once +Monitor for a patched Alpine APK. Once available, update the pinned `ALPINE_IMAGE` digest in the Dockerfile, or add an explicit `RUN apk upgrade --no-cache libcrypto3 libssl3` to the runtime stage. @@ -126,16 +132,19 @@ tracked separately above). All issues resolve when CrowdSec is rebuilt against G Charon's own application code is unaffected. 
**Who** + - Discovered by: Automated scan (Trivy, Grype) - Reported: 2025-12-01 (original cluster); expanded 2026-03-20 - Affects: CrowdSec Agent component within the container; not directly exposed through Charon's primary application interface **Where** + - Component: CrowdSec Agent (bundled `cscli` and `crowdsec` binaries) - Versions affected: All Charon versions shipping CrowdSec binaries compiled against Go < 1.26.2 **When** + - Discovered: 2025-12-01 - Disclosed (if public): Not yet publicly disclosed - Target fix: When `golang:1.26.2-alpine` is published on Docker Hub @@ -168,16 +177,19 @@ loop with no termination condition when given a specially crafted input, causing (CWE-1284). **Who** + - Discovered by: 7aSecurity audit (commissioned by OSTIF) - Reported: 2026-02-17 - Affects: Any component in the container that calls `crc32_combine`-family functions with attacker-controlled input; not directly exposed through Charon's application interface **Where** + - Component: Alpine 3.23.3 base image (`zlib` package, version 1.3.1-r2) - Versions affected: zlib < 1.3.2; all current Charon images using Alpine 3.23.3 **When** + - Discovered: 2026-02-17 (NVD published 2026-02-17) - Disclosed (if public): 2026-02-17 - Target fix: When Alpine 3.23 publishes a patched `zlib` APK (requires zlib 1.3.2) @@ -188,7 +200,7 @@ to the `crc32_combine`-family functions. This code path is not invoked by Charon or backend API. The vulnerability is non-blocking under the project's CI severity policy. **Planned Remediation** -Monitor https://security.alpinelinux.org/vuln/CVE-2026-27171 for a patched Alpine APK. Once +Monitor for a patched Alpine APK. Once available, update the pinned `ALPINE_IMAGE` digest in the Dockerfile, or add an explicit `RUN apk upgrade --no-cache zlib` to the runtime stage. Remove the `.trivyignore` entry at that time. @@ -211,14 +223,17 @@ Seven HIGH-severity CVEs in Debian Trixie base image system libraries (`glibc`, available from the Debian Security Team. 
**Who** + - Discovered by: Automated scan (Trivy) - Reported: 2026-02-04 **Where** + - Component: Debian Trixie base image (`libc6`, `libc-bin`, `libtasn1-6`, `libtiff`) - Versions affected: Charon container images built on Debian Trixie base (prior to Alpine migration) **When** + - Discovered: 2026-02-04 - Patched: 2026-03-20 - Time to patch: 44 days @@ -256,14 +271,17 @@ by CrowdSec for expression evaluation. Malicious regular expressions in CrowdSec parsers could cause CPU exhaustion and service degradation through exponential backtracking. **Who** + - Discovered by: Automated scan (Trivy) - Reported: 2026-01-11 **Where** + - Component: CrowdSec (via `expr-lang/expr` dependency) - Versions affected: CrowdSec versions using `expr-lang/expr` < v1.17.7 **When** + - Discovered: 2026-01-11 - Patched: 2026-01-11 - Time to patch: 0 days diff --git a/VERSION.md b/VERSION.md index 90129050..311c0601 100644 --- a/VERSION.md +++ b/VERSION.md @@ -24,8 +24,10 @@ Example: `0.1.0-alpha`, `1.0.0-beta.1`, `2.0.0-rc.2` 1. **Create and push a release tag**: ```bash + git tag -a v1.0.0 -m "Release v1.0.0" git push origin v1.0.0 + ``` 2. **GitHub Actions automatically**: @@ -51,10 +53,12 @@ Use it only when you need local/version-file parity checks: echo "1.0.0" > .version ``` -2. **Validate `.version` matches the latest tag**: +1. 
**Validate `.version` matches the latest tag**: ```bash + bash scripts/check-version-match-tag.sh + ``` ### Deterministic Rollout Verification Gates (Mandatory) diff --git a/docs/SECURITY_PRACTICES.md b/docs/SECURITY_PRACTICES.md index 69b44bb9..fc7961cc 100644 --- a/docs/SECURITY_PRACTICES.md +++ b/docs/SECURITY_PRACTICES.md @@ -53,6 +53,7 @@ logger.Infof("API Key: %s", apiKey) ``` Charon's masking rules: + - Empty: `[empty]` - Short (< 16 chars): `[REDACTED]` - Normal (≥ 16 chars): `abcd...xyz9` (first 4 + last 4) @@ -68,6 +69,7 @@ if !validateAPIKeyFormat(apiKey) { ``` Requirements: + - Length: 16-128 characters - Charset: Alphanumeric + underscore + hyphen - No spaces or special characters @@ -99,6 +101,7 @@ Rotate secrets regularly: ### What to Log ✅ **Safe to log**: + - Timestamps - User IDs (not usernames if PII) - IP addresses (consider GDPR implications) @@ -108,6 +111,7 @@ Rotate secrets regularly: - Performance metrics ❌ **Never log**: + - Passwords or password hashes - API keys or tokens (use masking) - Session IDs (full values) @@ -139,6 +143,7 @@ logger.Infof("Login attempt: username=%s password=%s", username, password) ### Log Aggregation If using external log services (CloudWatch, Splunk, Datadog): + - Ensure logs are encrypted in transit (TLS) - Ensure logs are encrypted at rest - Redact sensitive data before shipping @@ -333,6 +338,7 @@ limiter := rate.NewLimiter(rate.Every(36*time.Second), 100) ``` **Critical endpoints** (require stricter limits): + - Login: 5 attempts per 15 minutes - Password reset: 3 attempts per hour - API key generation: 5 per day @@ -369,6 +375,7 @@ return c.JSON(401, gin.H{"error": "invalid API key: abc123"}) **Applicable if**: Processing data of EU residents **Requirements**: + 1. **Data minimization**: Collect only necessary data 2. **Purpose limitation**: Use data only for stated purposes 3. 
**Storage limitation**: Delete data when no longer needed @@ -376,6 +383,7 @@ return c.JSON(401, gin.H{"error": "invalid API key: abc123"}) 5. **Breach notification**: Report breaches within 72 hours **Implementation**: + - ✅ Charon masks API keys in logs (prevents exposure of personal data) - ✅ Secure file permissions (0600) protect sensitive data - ✅ Log retention policies prevent indefinite storage @@ -390,12 +398,14 @@ return c.JSON(401, gin.H{"error": "invalid API key: abc123"}) **Applicable if**: Processing, storing, or transmitting credit card data **Requirements**: + 1. **Requirement 3.4**: Render PAN unreadable (encryption, masking) 2. **Requirement 8.2**: Strong authentication 3. **Requirement 10.2**: Audit trails 4. **Requirement 10.7**: Retain audit logs for 1 year **Implementation**: + - ✅ Charon uses masking for sensitive credentials (same principle for PAN) - ✅ Secure file permissions align with access control requirements - ⚠️ Charon doesn't handle payment cards directly (delegated to payment processors) @@ -409,12 +419,14 @@ return c.JSON(401, gin.H{"error": "invalid API key: abc123"}) **Applicable if**: SaaS providers, cloud services **Trust Service Criteria**: + 1. **CC6.1**: Logical access controls (authentication, authorization) 2. **CC6.6**: Encryption of data in transit 3. **CC6.7**: Encryption of data at rest 4. **CC7.2**: Monitoring and detection (logging, alerting) **Implementation**: + - ✅ API key validation ensures strong credentials (CC6.1) - ✅ File permissions (0600) protect data at rest (CC6.7) - ✅ Masked logging enables monitoring without exposing secrets (CC7.2) @@ -429,12 +441,14 @@ return c.JSON(401, gin.H{"error": "invalid API key: abc123"}) **Applicable to**: Any organization implementing ISMS **Key Controls**: + 1. **A.9.4.3**: Password management systems 2. **A.10.1.1**: Cryptographic controls 3. **A.12.4.1**: Event logging 4. 
**A.18.1.5**: Protection of personal data **Implementation**: + - ✅ API key format validation (minimum 16 chars, charset restrictions) - ✅ Key rotation procedures documented - ✅ Secure storage with file permissions (0600) @@ -491,6 +505,7 @@ grep -i "api[_-]key\|token\|password" playwright-report/index.html **Recommended schedule**: Annual or after major releases **Focus areas**: + 1. Authentication bypass 2. Authorization vulnerabilities 3. SQL injection diff --git a/docs/actions/trivy-scan-failure.md b/docs/actions/trivy-scan-failure.md index ec608bde..8a77fdea 100644 --- a/docs/actions/trivy-scan-failure.md +++ b/docs/actions/trivy-scan-failure.md @@ -1,6 +1,6 @@ **Status**: ✅ RESOLVED (January 30, 2026) -https://github.com/Wikid82/Charon/actions/runs/21503634925/job/61955008214 + Run # Normalize image name for reference 🔍 Extracting binary from: ghcr.io/wikid82/charon:feature/beta-release @@ -27,6 +27,7 @@ Add a check to ensure steps.pr-info.outputs.pr_number is set before constructing Suggested code improvement for the “Extract charon binary from container” step: YAML + - name: Extract charon binary from container if: steps.check-artifact.outputs.artifact_exists == 'true' id: extract @@ -44,6 +45,7 @@ YAML echo "🔍 Extracting binary from: ${IMAGE_REF}" ... This ensures the workflow does not attempt to use an invalid image tag when the PR number is missing. Adjust similar logic throughout the workflow to handle missing variables gracefully. + ## Resolution Fixed by adding proper validation for PR number before constructing Docker image reference, ensuring IMAGE_REF is never constructed with empty/missing variables. Branch name sanitization also implemented to handle slashes in feature branch names. 
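The validation and sanitization described in the resolution above can be sketched as a small shell helper (function and variable names here are illustrative, not the workflow's actual step code):

```shell
#!/bin/sh
# Sketch of the fix described above: never construct IMAGE_REF from an
# empty PR number, and sanitize branch names because Docker tags may not
# contain '/' (e.g. feature/beta-release -> feature-beta-release).
build_image_ref() {
  pr_number="$1"
  branch="$2"
  if [ -n "$pr_number" ]; then
    printf 'ghcr.io/wikid82/charon:pr-%s\n' "$pr_number"
  elif [ -n "$branch" ]; then
    # Replace '/' with '-' so the branch name is a valid Docker tag
    printf 'ghcr.io/wikid82/charon:%s\n' "$(printf '%s' "$branch" | tr '/' '-')"
  else
    echo "ERROR: neither PR number nor branch name is set" >&2
    return 1
  fi
}
```

Guarding like this before the extract step keeps the workflow from ever attempting to resolve a reference with an empty or malformed tag.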
diff --git a/docs/analysis/crowdsec_integration_failure_analysis.md b/docs/analysis/crowdsec_integration_failure_analysis.md index db28150c..ea054851 100644 --- a/docs/analysis/crowdsec_integration_failure_analysis.md +++ b/docs/analysis/crowdsec_integration_failure_analysis.md @@ -2,7 +2,7 @@ **Date:** 2026-01-28 **PR:** #550 - Alpine to Debian Trixie Migration -**CI Run:** https://github.com/Wikid82/Charon/actions/runs/21456678628/job/61799104804 +**CI Run:** **Branch:** feature/beta-release --- @@ -18,16 +18,19 @@ The CrowdSec integration tests are failing after migrating the Dockerfile from A ### 1. **CrowdSec Builder Stage Compatibility** **Alpine vs Debian Differences:** + - **Alpine** uses `musl libc`, **Debian** uses `glibc` - Different package managers: `apk` (Alpine) vs `apt` (Debian) - Different package names and availability **Current Dockerfile (lines 218-270):** + ```dockerfile FROM --platform=$BUILDPLATFORM golang:1.25.7-trixie AS crowdsec-builder ``` **Dependencies Installed:** + ```dockerfile RUN apt-get update && apt-get install -y --no-install-recommends \ git clang lld \ @@ -36,6 +39,7 @@ RUN xx-apt install -y gcc libc6-dev ``` **Possible Issues:** + - **Missing build dependencies**: CrowdSec might require additional packages on Debian that were implicitly available on Alpine - **Git clone failures**: Network issues or GitHub rate limiting - **Dependency resolution**: `go mod tidy` might behave differently @@ -44,6 +48,7 @@ RUN xx-apt install -y gcc libc6-dev ### 2. **CrowdSec Binary Path Issues** **Runtime Image (lines 359-365):** + ```dockerfile # Copy CrowdSec binaries from the crowdsec-builder stage (built with Go 1.25.5+) COPY --from=crowdsec-builder /crowdsec-out/crowdsec /usr/local/bin/crowdsec @@ -52,17 +57,20 @@ COPY --from=crowdsec-builder /crowdsec-out/config /etc/crowdsec.dist ``` **Possible Issues:** + - If the builder stage fails, these COPY commands will fail - If fallback stage is used (for non-amd64), paths might be wrong ### 3. 
**CrowdSec Configuration Issues** **Entrypoint Script CrowdSec Init (docker-entrypoint.sh):** + - Symlink creation from `/etc/crowdsec` to `/app/data/crowdsec/config` - Configuration file generation and substitution - Hub index updates **Possible Issues:** + - Symlink already exists as directory instead of symlink - Permission issues with non-root user - Configuration templates missing or incompatible @@ -70,12 +78,14 @@ COPY --from=crowdsec-builder /crowdsec-out/config /etc/crowdsec.dist ### 4. **Test Script Environment Issues** **Integration Test (crowdsec_integration.sh):** + - Builds the image with `docker build -t charon:local .` - Starts container and waits for API - Tests CrowdSec Hub connectivity - Tests preset pull/apply functionality **Possible Issues:** + - Build step timing out or failing silently - Container failing to start properly - CrowdSec processes not starting @@ -88,6 +98,7 @@ COPY --from=crowdsec-builder /crowdsec-out/config /etc/crowdsec.dist ### Step 1: Check Build Logs Review the CI build logs for the CrowdSec builder stage: + - Look for `git clone` errors - Check for `go get` or `go mod tidy` failures - Verify `xx-go build` completes successfully @@ -96,6 +107,7 @@ Review the CI build logs for the CrowdSec builder stage: ### Step 2: Verify CrowdSec Binaries Check if CrowdSec binaries are actually present: + ```bash docker run --rm charon:local which crowdsec docker run --rm charon:local which cscli @@ -105,6 +117,7 @@ docker run --rm charon:local cscli version ### Step 3: Check CrowdSec Configuration Verify configuration is properly initialized: + ```bash docker run --rm charon:local ls -la /etc/crowdsec docker run --rm charon:local ls -la /app/data/crowdsec @@ -114,6 +127,7 @@ docker run --rm charon:local cat /etc/crowdsec/config.yaml ### Step 4: Test CrowdSec Locally Run the integration test locally: + ```bash # Build image docker build --no-cache -t charon:local . @@ -129,6 +143,7 @@ docker build --no-cache -t charon:local . 
### Fix 1: Add Missing Build Dependencies If the build is failing due to missing dependencies, add them to the CrowdSec builder: + ```dockerfile RUN apt-get update && apt-get install -y --no-install-recommends \ git clang lld \ @@ -139,6 +154,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \ ### Fix 2: Add Build Stage Debugging Add debugging output to identify where the build fails: + ```dockerfile # After git clone RUN echo "CrowdSec source cloned successfully" && ls -la @@ -153,6 +169,7 @@ RUN echo "Build complete" && ls -la /crowdsec-out/ ### Fix 3: Use CrowdSec Fallback If the build continues to fail, ensure the fallback stage is working: + ```dockerfile # In final stage, use conditional COPY COPY --from=crowdsec-fallback /crowdsec-out/bin/crowdsec /usr/local/bin/crowdsec || \ @@ -162,6 +179,7 @@ COPY --from=crowdsec-builder /crowdsec-out/crowdsec /usr/local/bin/crowdsec ### Fix 4: Verify cscli Before Test Add a verification step in the entrypoint: + ```bash if ! command -v cscli >/dev/null; then echo "ERROR: CrowdSec not installed properly" diff --git a/docs/decisions/sprint1-timeout-remediation-findings.md b/docs/decisions/sprint1-timeout-remediation-findings.md index faebbe9f..6eeb84b8 100644 --- a/docs/decisions/sprint1-timeout-remediation-findings.md +++ b/docs/decisions/sprint1-timeout-remediation-findings.md @@ -11,11 +11,13 @@ **File**: `tests/settings/system-settings.spec.ts` **Changes Made**: + 1. **Removed** `waitForFeatureFlagPropagation()` call from `beforeEach` hook (lines 35-46) - This was causing 10s × 31 tests = 310s of polling overhead per shard - Commented out with clear explanation linking to remediation plan 2. 
**Added** `test.afterEach()` hook with direct API state restoration: + ```typescript test.afterEach(async ({ page }) => { await test.step('Restore default feature flag state', async () => { @@ -34,12 +36,14 @@ ``` **Rationale**: + - Tests already verify feature flag state individually after toggle actions - Initial state verification in beforeEach was redundant - Explicit cleanup in afterEach ensures test isolation without polling overhead - Direct API mutation for state restoration is faster than polling **Expected Impact**: + - 310s saved per shard (10s × 31 tests) - Elimination of inter-test dependencies - No state leakage between tests @@ -51,12 +55,14 @@ **Changes Made**: 1. **Added module-level cache** for in-flight requests: + ```typescript // Cache for in-flight requests (per-worker isolation) const inflightRequests = new Map<string, Promise<Record<string, boolean>>>(); ``` 2. **Implemented cache key generation** with sorted keys and worker isolation: + ```typescript function generateCacheKey( expectedFlags: Record<string, boolean>, @@ -81,6 +87,7 @@ - Removes promise from cache after completion (success or failure) 4.
**Added cleanup function**: + ```typescript export function clearFeatureFlagCache(): void { inflightRequests.clear(); @@ -89,16 +96,19 @@ ``` **Why Sorted Keys?** + - `{a:true, b:false}` vs `{b:false, a:true}` are semantically identical - Without sorting, they generate different cache keys → cache misses - Sorting ensures consistent key regardless of property order **Why Worker Isolation?** + - Playwright workers run in parallel across different browser contexts - Each worker needs its own cache to avoid state conflicts - Worker index provides unique namespace per parallel process **Expected Impact**: + - 30-40% reduction in duplicate API calls (revised from original 70-80% estimate) - Cache hit rate should be >30% based on similar flag state checks - Reduced API server load during parallel test execution @@ -108,21 +118,26 @@ **Status**: Partially Investigated **Issue**: + - Test: `tests/dns-provider-types.spec.ts` (line 260) - Symptom: Label locator `/script.*path/i` passes in Chromium, fails in Firefox/WebKit - Test code: + ```typescript const scriptField = page.getByLabel(/script.*path/i); await expect(scriptField).toBeVisible({ timeout: 10000 }); ``` **Investigation Steps Completed**: + 1. ✅ Confirmed E2E environment is running and healthy 2. ✅ Attempted to run DNS provider type tests in Chromium 3. ⏸️ Further investigation deferred due to test execution issues **Investigation Steps Remaining** (per spec): + 1. Run with Playwright Inspector to compare accessibility trees: + ```bash npx playwright test tests/dns-provider-types.spec.ts --project=chromium --headed --debug npx playwright test tests/dns-provider-types.spec.ts --project=firefox --headed --debug @@ -137,6 +152,7 @@ 5. 
If not fixable: Use the helper function approach from Phase 2 **Recommendation**: + - Complete investigation in separate session with headed browser mode - DO NOT add `.or()` chains unless investigation proves it's necessary - Create formal Decision Record once root cause is identified @@ -144,31 +160,37 @@ ## Validation Checkpoints ### Checkpoint 1: Execution Time + **Status**: ⏸️ In Progress **Target**: <15 minutes (900s) for full test suite **Command**: + ```bash time npx playwright test tests/settings/system-settings.spec.ts --project=chromium ``` **Results**: + - Test execution interrupted during validation - Observed: Tests were picking up multiple spec files from security/ folder - Need to investigate test file patterns or run with more specific filtering **Action Required**: + - Re-run with corrected test file path or filtering - Ensure only system-settings tests are executed - Measure execution time and compare to baseline ### Checkpoint 2: Test Isolation + **Status**: ⏳ Pending **Target**: All tests pass with `--repeat-each=5 --workers=4` **Command**: + ```bash npx playwright test tests/settings/system-settings.spec.ts --project=chromium --repeat-each=5 --workers=4 ``` @@ -176,11 +198,13 @@ npx playwright test tests/settings/system-settings.spec.ts --project=chromium -- **Status**: Not executed yet ### Checkpoint 3: Cross-browser + **Status**: ⏳ Pending **Target**: Firefox/WebKit pass rate >85% **Command**: + ```bash npx playwright test tests/settings/system-settings.spec.ts --project=firefox --project=webkit ``` @@ -188,11 +212,13 @@ npx playwright test tests/settings/system-settings.spec.ts --project=firefox --p **Status**: Not executed yet ### Checkpoint 4: DNS provider tests (secondary issue) + **Status**: ⏳ Pending **Target**: Firefox tests pass or investigation complete **Command**: + ```bash npx playwright test tests/dns-provider-types.spec.ts --project=firefox ``` @@ -204,11 +230,13 @@ npx playwright test tests/dns-provider-types.spec.ts 
--project=firefox ### Decision: Use Direct API Mutation for State Restoration **Context**: + - Tests need to restore default feature flag state after modifications - Original approach used polling-based verification in beforeEach - Alternative approaches: polling in afterEach vs direct API mutation **Options Evaluated**: + 1. **Polling in afterEach** - Verify state propagated after mutation - Pros: Confirms state is actually restored - Cons: Adds 500ms-2s per test (polling overhead) @@ -219,12 +247,14 @@ npx playwright test tests/dns-provider-types.spec.ts --project=firefox - Why chosen: Feature flag updates are synchronous in backend **Rationale**: + - Feature flag updates via PUT /api/v1/feature-flags are processed synchronously - Database write is immediate (SQLite WAL mode) - No async propagation delay in single-process test environment - Subsequent tests will verify state on first read, catching any issues **Impact**: + - Test runtime reduced by 15-60s per test file (31 tests × 500ms-2s polling) - Risk: If state restoration fails, next test will fail loudly (detectable) - Acceptable trade-off for 10-20% execution time improvement @@ -234,15 +264,18 @@ npx playwright test tests/dns-provider-types.spec.ts --project=firefox ### Decision: Cache Key Sorting for Semantic Equality **Context**: + - Multiple tests may check the same feature flag state but with different property order - Without normalization, `{a:true, b:false}` and `{b:false, a:true}` generate different keys **Rationale**: + - JavaScript objects have insertion order, but semantically these are identical states - Sorting keys ensures cache hits for semantically identical flag states - Minimal performance cost (~1ms for sorting 3-5 keys) **Impact**: + - Estimated 10-15% cache hit rate improvement - No downside - pure optimization diff --git a/docs/development/go_version_upgrades.md b/docs/development/go_version_upgrades.md index d3444c21..58a1da52 100644 --- a/docs/development/go_version_upgrades.md +++ 
b/docs/development/go_version_upgrades.md @@ -78,6 +78,7 @@ git pull origin development ``` This script: + - Detects the required Go version from `go.work` - Downloads it from golang.org - Installs it to `~/sdk/go{version}/` @@ -103,6 +104,7 @@ Even if you used Option A (which rebuilds automatically), you can always manuall ``` This rebuilds: + - **golangci-lint** — Pre-commit linter (critical) - **gopls** — IDE language server (critical) - **govulncheck** — Security scanner @@ -132,11 +134,13 @@ Current Go version: go version go1.26.0 linux/amd64 Your IDE caches the old Go language server (gopls). Reload to use the new one: **VS Code:** + - Press `Cmd/Ctrl+Shift+P` - Type "Developer: Reload Window" - Press Enter **GoLand or IntelliJ IDEA:** + - File → Invalidate Caches → Restart - Wait for indexing to complete @@ -243,6 +247,7 @@ go install golang.org/x/tools/gopls@latest ### How often do Go versions change? Go releases **two major versions per year**: + - February (e.g., Go 1.26.0) - August (e.g., Go 1.27.0) @@ -255,6 +260,7 @@ Plus occasional patch releases (e.g., Go 1.26.1) for security fixes. **Usually no**, but it doesn't hurt. Patch releases (like 1.26.0 → 1.26.1) rarely break tool compatibility. **Rebuild if:** + - Pre-commit hooks start failing - IDE shows unexpected errors - Tools report version mismatches @@ -262,6 +268,7 @@ Plus occasional patch releases (e.g., Go 1.26.1) for security fixes. ### Why don't CI builds have this problem? CI environments are **ephemeral** (temporary). Every workflow run: + 1. Starts with a fresh container 2. Installs Go from scratch 3. Installs tools from scratch @@ -295,12 +302,14 @@ But for Charon development, you only need **one version** (whatever's in `go.wor **Short answer:** Your local tools will be out of sync, but CI will still work. 
**What breaks:** + - Pre-commit hooks fail (but will auto-rebuild) - IDE shows phantom errors - Manual `go test` might fail locally - CI is unaffected (it always uses the correct version) **When to catch up:** + - Before opening a PR (CI checks will fail if your code uses old Go features) - When local development becomes annoying @@ -326,6 +335,7 @@ But they only take ~400MB each, so cleanup is optional. Renovate updates **Dockerfile** and **go.work**, but it can't update tools on *your* machine. **Think of it like this:** + - Renovate: "Hey team, we're now using Go 1.26.0" - Your machine: "Cool, but my tools are still Go 1.25.6. Let me rebuild them." @@ -334,18 +344,22 @@ The rebuild script bridges that gap. ### What's the difference between `go.work`, `go.mod`, and my system Go? **`go.work`** — Workspace file (multi-module projects like Charon) + - Specifies minimum Go version for the entire project - Used by Renovate to track upgrades **`go.mod`** — Module file (individual Go modules) + - Each module (backend, tools) has its own `go.mod` - Inherits Go version from `go.work` **System Go** (`go version`) — What's installed on your machine + - Must be >= the version in `go.work` - Tools are compiled with whatever version this is **Example:** + ``` go.work says: "Use Go 1.26.0 or newer" go.mod says: "I'm part of the workspace, use its Go version" @@ -364,12 +378,14 @@ Charon's pre-commit hook automatically detects and fixes tool version mismatches **How it works:** 1. **Check versions:** + ```bash golangci-lint version → "built with go1.25.6" go version → "go version go1.26.0" ``` 2. **Detect mismatch:** + ``` ⚠️ golangci-lint Go version mismatch: golangci-lint: 1.25.6 @@ -377,6 +393,7 @@ Charon's pre-commit hook automatically detects and fixes tool version mismatches ``` 3. **Auto-rebuild:** + ``` 🔧 Rebuilding golangci-lint with current Go version... 
✅ golangci-lint rebuilt successfully @@ -406,11 +423,13 @@ If you want manual control, edit `scripts/pre-commit-hooks/golangci-lint-fast.sh ## Need Help? **Open a [Discussion](https://github.com/Wikid82/charon/discussions)** if: + - These instructions didn't work for you - You're seeing errors not covered in troubleshooting - You have suggestions for improving this guide **Open an [Issue](https://github.com/Wikid82/charon/issues)** if: + - The rebuild script crashes - Pre-commit auto-rebuild isn't working - CI is failing for Go version reasons diff --git a/docs/development/running-e2e.md b/docs/development/running-e2e.md index d599f546..a1d831a2 100644 --- a/docs/development/running-e2e.md +++ b/docs/development/running-e2e.md @@ -3,16 +3,20 @@ This document explains how to run Playwright tests using a real browser (headed) on Linux machines and in the project's Docker E2E environment. ## Key points + - Playwright's interactive Test UI (--ui) requires an X server (a display). On headless CI or servers, use Xvfb. - Prefer the project's E2E Docker image for integration-like runs; use the local `--ui` flow for manual debugging. ## Quick commands (local Linux) + - Headless (recommended for CI / fast runs): + ```bash npm run e2e ``` - Headed UI on a headless machine (auto-starts Xvfb): + ```bash npm run e2e:ui:headless-server # or, if you prefer manual control: @@ -20,37 +24,46 @@ This document explains how to run Playwright tests using a real browser (headed) ``` - Headed UI on a workstation with an X server already running: + ```bash npx playwright test --ui ``` - Open the running Docker E2E app in your system browser (one-step via VS Code task): - Run the VS Code task: **Open: App in System Browser (Docker E2E)** - - This will rebuild the E2E container (if needed), wait for <http://localhost:8080> to respond, and open your system browser automatically.
- Open the running Docker E2E app in VS Code Simple Browser: - Run the VS Code task: **Open: App in Simple Browser (Docker E2E)** - Then use the command palette: `Simple Browser: Open URL` → paste `http://localhost:8080` ## Using the project's E2E Docker image (recommended for parity with CI) + 1. Rebuild/start the E2E container (this sets up the full test environment): + ```bash .github/skills/scripts/skill-runner.sh docker-rebuild-e2e ``` + If you need a clean rebuild after integration alignment changes: + ```bash .github/skills/scripts/skill-runner.sh docker-rebuild-e2e --clean --no-cache ``` -2. Run the UI against the container (you still need an X server on your host): + +1. Run the UI against the container (you still need an X server on your host): + ```bash PLAYWRIGHT_BASE_URL=http://localhost:8080 npm run e2e:ui:headless-server ``` ## CI guidance + - Do not run Playwright `--ui` in CI. Use headless runs or the E2E Docker image and collect traces/videos for failures. - For coverage, use the provided skill: `.github/skills/scripts/skill-runner.sh test-e2e-playwright-coverage` ## Troubleshooting + - Playwright error: "Looks like you launched a headed browser without having a XServer running." → run `npm run e2e:ui:headless-server` or install Xvfb. - If `npm run e2e:ui:headless-server` fails with an exit code like `148`: - Inspect Xvfb logs: `tail -n 200 /tmp/xvfb.playwright.log` @@ -59,11 +72,13 @@ This document explains how to run Playwright tests using a real browser (headed) - If running inside Docker, prefer the skill-runner which provisions the required services; the UI still needs host X (or use VNC). ## Developer notes (what we changed) + - Added `scripts/run-e2e-ui.sh` — wrapper that auto-starts Xvfb when DISPLAY is unset. - Added `npm run e2e:ui:headless-server` to run the Playwright UI on headless machines. - Playwright config now auto-starts Xvfb when `--ui` is requested locally and prints an actionable error if Xvfb is not available. 
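The auto-Xvfb wrapper's core logic can be sketched roughly as follows. This is a minimal illustration only — the real `scripts/run-e2e-ui.sh` may differ in details such as the display number, screen geometry, and log path:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the auto-Xvfb wrapper logic; not the actual script.

# True (exit 0) when no X server is available, i.e. DISPLAY is unset or empty.
need_xvfb() {
  [ -z "${DISPLAY:-}" ]
}

run_e2e_ui() {
  if need_xvfb; then
    # Start a virtual framebuffer and point Playwright at it; keep a log for debugging.
    Xvfb :99 -screen 0 1280x800x24 >/tmp/xvfb.playwright.log 2>&1 &
    local xvfb_pid=$!
    export DISPLAY=:99
    # Stop Xvfb when the wrapper exits.
    trap 'kill "$xvfb_pid" 2>/dev/null || true' EXIT
  fi
  npx playwright test --ui "$@"
}
```

On a workstation where `DISPLAY` is already set by an X session, the wrapper skips Xvfb entirely and behaves like a plain `npx playwright test --ui`.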
## Security & hygiene + - Playwright auth artifacts are ignored by git (`playwright/.auth/`). Do not commit credentials. --- diff --git a/docs/features/api.md b/docs/features/api.md index 089ab019..95a0c68c 100644 --- a/docs/features/api.md +++ b/docs/features/api.md @@ -23,6 +23,7 @@ Authorization: Bearer your-api-token-here ``` Tokens support granular permissions: + - **Read-only**: View configurations without modification - **Full access**: Complete CRUD operations - **Scoped**: Limit to specific resource types diff --git a/docs/features/caddyfile-import.md b/docs/features/caddyfile-import.md index 1d27562f..7e5cec26 100644 --- a/docs/features/caddyfile-import.md +++ b/docs/features/caddyfile-import.md @@ -52,6 +52,7 @@ Caddyfile import parses your existing Caddy configuration files and converts the Choose one of three methods: **Paste Content:** + ``` example.com { reverse_proxy localhost:3000 @@ -63,10 +64,12 @@ api.example.com { ``` **Upload File:** + - Click **Choose File** - Select your Caddyfile **Fetch from URL:** + - Enter URL to raw Caddyfile content - Useful for version-controlled configurations diff --git a/docs/features/dns-challenge.md b/docs/features/dns-challenge.md index bd696891..ba3bba18 100644 --- a/docs/features/dns-challenge.md +++ b/docs/features/dns-challenge.md @@ -447,6 +447,7 @@ Charon displays instructions to remove the TXT record after certificate issuance **Symptom**: Certificate request stuck at "Waiting for Propagation" or validation fails. **Causes**: + - DNS TTL is high (cached old records) - DNS provider has slow propagation - Regional DNS inconsistency @@ -497,6 +498,7 @@ Charon displays instructions to remove the TXT record after certificate issuance **Symptom**: Connection test passes, but record creation fails. 
**Causes**: + - API token has read-only permissions - Zone/domain not accessible with current credentials - Rate limiting or account restrictions @@ -513,6 +515,7 @@ Charon displays instructions to remove the TXT record after certificate issuance **Symptom**: "Record already exists" error during certificate request. **Causes**: + - Previous challenge attempt left orphaned record - Manual DNS record with same name exists - Another ACME client managing the same domain @@ -551,6 +554,7 @@ Charon displays instructions to remove the TXT record after certificate issuance **Symptom**: "Too many requests" or "Rate limit exceeded" errors. **Causes**: + - Too many certificate requests in short period - DNS provider API rate limits - Let's Encrypt rate limits diff --git a/docs/features/docker-integration.md b/docs/features/docker-integration.md index a0f892af..d5b9e343 100644 --- a/docs/features/docker-integration.md +++ b/docs/features/docker-integration.md @@ -47,6 +47,7 @@ Docker auto-discovery eliminates manual IP address hunting and port memorization For Charon to discover containers, it needs Docker API access. 
**Docker Compose:** + ```yaml services: charon: @@ -56,6 +57,7 @@ services: ``` **Docker Run:** + ```bash docker run -v /var/run/docker.sock:/var/run/docker.sock:ro charon ``` diff --git a/docs/features/plugin-security.md b/docs/features/plugin-security.md index 067e1907..a3b7b723 100644 --- a/docs/features/plugin-security.md +++ b/docs/features/plugin-security.md @@ -35,18 +35,21 @@ CHARON_PLUGIN_SIGNATURES='{"pluginname": "sha256:..."}' ### Examples **Permissive mode (default)**: + ```bash # Unset — all plugins load without verification unset CHARON_PLUGIN_SIGNATURES ``` **Strict block-all**: + ```bash # Empty object — no external plugins will load export CHARON_PLUGIN_SIGNATURES='{}' ``` **Allowlist specific plugins**: + ```bash # Only powerdns and custom-provider plugins are allowed export CHARON_PLUGIN_SIGNATURES='{"powerdns": "sha256:a1b2c3d4...", "custom-provider": "sha256:e5f6g7h8..."}' @@ -63,6 +66,7 @@ sha256sum myplugin.so | awk '{print "sha256:" $1}' ``` **Example output**: + ``` sha256:a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6q7r8s9t0u1v2w3x4y5z6a7b8c9d0e1f2 ``` @@ -96,6 +100,7 @@ services: ``` This prevents runtime modification of plugin files, mitigating: + - Time-of-check to time-of-use (TOCTOU) attacks - Malicious plugin replacement after signature verification @@ -113,6 +118,7 @@ services: ``` Or in Dockerfile: + ```dockerfile FROM charon:latest USER charon @@ -128,6 +134,7 @@ Plugin directories must **not** be world-writable. 
Charon enforces this at start | `0777` (world-writable) | ❌ Rejected — plugin loading disabled | **Set secure permissions**: + ```bash chmod 755 /path/to/plugins chmod 644 /path/to/plugins/*.so # Or 755 for executable @@ -192,22 +199,26 @@ After updating plugins, always update your `CHARON_PLUGIN_SIGNATURES` with the n ### Checking if a Plugin Loaded **Check startup logs**: + ```bash docker compose logs charon | grep -i plugin ``` **Expected success output**: + ``` INFO Loaded DNS provider plugin type=powerdns name="PowerDNS" version="1.0.0" INFO Loaded 1 external DNS provider plugins (0 failed) ``` **If using allowlist**: + ``` INFO Plugin signature allowlist enabled with 2 entries ``` **Via API**: + ```bash curl http://localhost:8080/api/admin/plugins \ -H "Authorization: Bearer YOUR-TOKEN" @@ -220,6 +231,7 @@ curl http://localhost:8080/api/admin/plugins \ **Cause**: The plugin filename (without `.so`) is not in `CHARON_PLUGIN_SIGNATURES`. **Solution**: Add the plugin to your allowlist: + ```bash # Get the signature sha256sum powerdns.so | awk '{print "sha256:" $1}' @@ -233,6 +245,7 @@ export CHARON_PLUGIN_SIGNATURES='{"powerdns": "sha256:YOUR_HASH_HERE"}' **Cause**: The plugin file's SHA-256 hash doesn't match the allowlist. **Solution**: + 1. Verify you have the correct plugin file 2. Re-compute the signature: `sha256sum plugin.so` 3. Update `CHARON_PLUGIN_SIGNATURES` with the correct hash @@ -242,6 +255,7 @@ export CHARON_PLUGIN_SIGNATURES='{"powerdns": "sha256:YOUR_HASH_HERE"}' **Cause**: The plugin directory is world-writable (mode `0777` or similar). **Solution**: + ```bash chmod 755 /path/to/plugins chmod 644 /path/to/plugins/*.so @@ -252,11 +266,13 @@ chmod 644 /path/to/plugins/*.so **Cause**: Malformed JSON in the environment variable. **Solution**: Validate your JSON: + ```bash echo '{"powerdns": "sha256:abc123"}' | jq . 
``` Common issues: + - Missing quotes around keys or values - Trailing commas - Single quotes instead of double quotes @@ -266,6 +282,7 @@ Common issues: **Cause**: File permissions too restrictive or ownership mismatch. **Solution**: + ```bash # Check current permissions ls -la /path/to/plugins/ @@ -278,27 +295,32 @@ chown charon:charon /path/to/plugins/*.so ### Debugging Checklist 1. **Is the plugin directory configured?** + ```bash echo $CHARON_PLUGINS_DIR ``` 2. **Does the plugin file exist?** + ```bash ls -la $CHARON_PLUGINS_DIR/*.so ``` 3. **Are directory permissions secure?** + ```bash stat -c "%a %n" $CHARON_PLUGINS_DIR # Should be 755 or stricter ``` 4. **Is the signature correct?** + ```bash sha256sum $CHARON_PLUGINS_DIR/myplugin.so ``` 5. **Is the JSON valid?** + ```bash echo "$CHARON_PLUGIN_SIGNATURES" | jq . ``` diff --git a/docs/features/proxy-headers.md b/docs/features/proxy-headers.md index a6c514cc..d77730fe 100644 --- a/docs/features/proxy-headers.md +++ b/docs/features/proxy-headers.md @@ -69,22 +69,26 @@ X-Forwarded-Host preserves the original domain: Your backend must trust proxy headers from Charon. 
Common configurations: **Node.js/Express:** + ```javascript app.set('trust proxy', true); ``` **Django:** + ```python SECURE_PROXY_SSL_HEADER = ('HTTP_X_FORWARDED_PROTO', 'https') USE_X_FORWARDED_HOST = True ``` **Rails:** + ```ruby config.action_dispatch.trusted_proxies = [IPAddr.new('10.0.0.0/8')] ``` **PHP/Laravel:** + ```php // In TrustProxies middleware protected $proxies = '*'; diff --git a/docs/getting-started.md b/docs/getting-started.md index baf71292..88e69be2 100644 --- a/docs/getting-started.md +++ b/docs/getting-started.md @@ -229,16 +229,19 @@ The emergency token is a security feature that allows bypassing all security mod Choose your platform: **Linux/macOS (recommended):** + ```bash openssl rand -hex 32 ``` **Windows PowerShell:** + ```powershell [Convert]::ToBase64String([System.Security.Cryptography.RandomNumberGenerator]::GetBytes(32)) ``` **Node.js (all platforms):** + ```bash node -e "console.log(require('crypto').randomBytes(32).toString('hex'))" ``` @@ -252,11 +255,13 @@ CHARON_EMERGENCY_TOKEN= ``` **Example:** + ```bash CHARON_EMERGENCY_TOKEN=7b3b8a36a6fad839f1b3122131ed4b1f05453118a91b53346482415796e740e2 ``` **Verify:** + ```bash # Token should be exactly 64 characters echo -n "$(grep CHARON_EMERGENCY_TOKEN .env | cut -d= -f2)" | wc -c @@ -287,20 +292,23 @@ For continuous integration, store the token in GitHub Secrets: ### Security Best Practices ✅ **DO:** + - Generate tokens using cryptographically secure methods - Store in `.env` (gitignored) or secrets management - Rotate quarterly or after security events - Use minimum 64 characters ❌ **DON'T:** + - Commit tokens to repository (even in examples) - Share tokens via email or chat - Use weak or predictable values - Reuse tokens across environments --- -2. **Settings table** for `security.crowdsec.enabled = "true"` -3. **Starts CrowdSec** if either condition is true + +1. **Settings table** for `security.crowdsec.enabled = "true"` +2. 
**Starts CrowdSec** if either condition is true **How it works:** @@ -582,7 +590,7 @@ Click "Watch" → "Custom" → Select "Security advisories" on the [Charon repos **2. Notifications and Automatic Updates with Dockhand** - - Dockhand is a free service that monitors Docker images for updates and can send notifications or trigger auto-updates. https://github.com/Finsys/dockhand +- Dockhand is a free service that monitors Docker images for updates and can send notifications or trigger auto-updates: <https://github.com/Finsys/dockhand> **Best Practices:** diff --git a/docs/github-setup.md b/docs/github-setup.md index 9f211530..09265e0c 100644 --- a/docs/github-setup.md +++ b/docs/github-setup.md @@ -68,6 +68,7 @@ E2E tests require an emergency token to be configured in GitHub Secrets. This to ### Why This Is Needed The emergency token is used by E2E tests to: + - Disable security modules (ACL, WAF, CrowdSec) after testing them - Prevent cascading test failures due to leftover security state - Ensure tests can always access the API regardless of security configuration @@ -77,16 +78,19 @@ The emergency token is used by E2E tests to: 1.
**Generate emergency token:** **Linux/macOS:** + ```bash openssl rand -hex 32 ``` **Windows PowerShell:** + ```powershell [Convert]::ToBase64String([System.Security.Cryptography.RandomNumberGenerator]::GetBytes(32)) ``` **Node.js (all platforms):** + ```bash node -e "console.log(require('crypto').randomBytes(32).toString('hex'))" ``` @@ -141,11 +145,13 @@ If the secret is missing or invalid, the workflow will fail with a clear error m ### Security Best Practices ✅ **DO:** + - Use cryptographically secure generation methods - Rotate quarterly or after security events - Store separately for local dev (`.env`) and CI/CD (GitHub Secrets) ❌ **DON'T:** + - Share tokens via email or chat - Commit tokens to repository (even in example files) - Reuse tokens across different environments @@ -154,11 +160,13 @@ If the secret is missing or invalid, the workflow will fail with a clear error m ### Troubleshooting **Error: "CHARON_EMERGENCY_TOKEN not set"** + - Check secret name is exactly `CHARON_EMERGENCY_TOKEN` (case-sensitive) - Verify secret is repository-level, not environment-level - Re-run workflow after adding secret **Error: "Token too short"** + - Hex method must generate exactly 64 characters - Verify you copied the entire token value - Regenerate if needed diff --git a/docs/guides/crowdsec-setup.md b/docs/guides/crowdsec-setup.md index c93b1b84..c6c889e8 100644 --- a/docs/guides/crowdsec-setup.md +++ b/docs/guides/crowdsec-setup.md @@ -88,6 +88,7 @@ In CrowdSec terms: > **✅ Good News: Charon Handles This For You!** > > When you enable CrowdSec for the first time, Charon automatically: +> > 1. Starts the CrowdSec engine > 2. Registers a bouncer and generates a valid API key > 3. Saves the key so it survives container restarts @@ -317,11 +318,13 @@ Replace `YOUR_ENROLLMENT_KEY` with the key from your Console. **Solution:** 1. Check if you're manually setting an API key: + ```bash grep -i "crowdsec_api_key" docker-compose.yml ``` 2. 
If you find one, **remove it**: + ```yaml # REMOVE this line: - CHARON_SECURITY_CROWDSEC_API_KEY=anything @@ -330,6 +333,7 @@ Replace `YOUR_ENROLLMENT_KEY` with the key from your Console. 3. Follow the [Manual Bouncer Registration](#manual-bouncer-registration) steps above 4. Restart the container: + ```bash docker restart charon ``` @@ -347,6 +351,7 @@ Replace `YOUR_ENROLLMENT_KEY` with the key from your Console. 1. Wait 60 seconds after container start 2. Check if CrowdSec is running: + ```bash docker exec charon cscli lapi status ``` @@ -354,6 +359,7 @@ Replace `YOUR_ENROLLMENT_KEY` with the key from your Console. 3. If you see "connection refused," try toggling CrowdSec OFF then ON in the GUI 4. Check the logs: + ```bash docker logs charon | grep -i crowdsec ``` @@ -431,6 +437,7 @@ If you already run CrowdSec separately (not inside Charon), you can connect to i **Steps:** 1. Register a bouncer on your external CrowdSec: + ```bash cscli bouncers add charon-bouncer ``` @@ -438,6 +445,7 @@ If you already run CrowdSec separately (not inside Charon), you can connect to i 2. Save the API key that's generated (you won't see it again!) 3. In your docker-compose.yml: + ```yaml environment: - CHARON_SECURITY_CROWDSEC_API_URL=http://your-crowdsec-server:8080 @@ -445,6 +453,7 @@ If you already run CrowdSec separately (not inside Charon), you can connect to i ``` 4. 
Restart Charon: + ```bash docker restart charon ``` diff --git a/docs/maintenance/README.md b/docs/maintenance/README.md index 5ca7e03f..4a5e166b 100644 --- a/docs/maintenance/README.md +++ b/docs/maintenance/README.md @@ -9,6 +9,7 @@ This directory contains operational maintenance guides for keeping Charon runnin **When to use:** Docker build fails with GeoLite2-Country.mmdb checksum mismatch **Topics covered:** + - Automated weekly checksum verification workflow - Manual checksum update procedures (5 minutes) - Verification script for checking upstream changes @@ -16,6 +17,7 @@ This directory contains operational maintenance guides for keeping Charon runnin - Alternative sources if upstream mirrors are unavailable **Quick fix:** + ```bash # Download and update checksum automatically NEW_CHECKSUM=$(curl -fsSL "https://github.com/P3TERX/GeoLite.mmdb/raw/download/GeoLite2-Country.mmdb" | sha256sum | cut -d' ' -f1) @@ -34,6 +36,7 @@ Found a maintenance issue not covered here? Please: 3. **Update this index** with a link to your guide **Format:** + ```markdown ### [Guide Title](filename.md) diff --git a/docs/maintenance/geolite2-checksum-update.md b/docs/maintenance/geolite2-checksum-update.md index d319f171..b6758e9b 100644 --- a/docs/maintenance/geolite2-checksum-update.md +++ b/docs/maintenance/geolite2-checksum-update.md @@ -15,6 +15,7 @@ Charon uses the [MaxMind GeoLite2-Country database](https://dev.maxmind.com/geoi Update the checksum when: 1. 
**Docker build fails** with the following error: + ``` sha256sum: /app/data/geoip/GeoLite2-Country.mmdb: FAILED sha256sum: WARNING: 1 computed checksum did NOT match @@ -29,6 +30,7 @@ Update the checksum when: ## Automated Workflow (Recommended) Charon includes a GitHub Actions workflow that automatically: + - Checks for upstream GeoLite2 database changes weekly - Calculates the new checksum - Creates a pull request with the update @@ -39,6 +41,7 @@ Charon includes a GitHub Actions workflow that automatically: **Schedule:** Mondays at 2 AM UTC (weekly) **Manual Trigger:** + ```bash gh workflow run update-geolite2.yml ``` @@ -75,16 +78,19 @@ sha256sum /tmp/geolite2-test.mmdb **File:** [`Dockerfile`](../../Dockerfile) (line ~352) **Find this line:** + ```dockerfile ARG GEOLITE2_COUNTRY_SHA256= ``` **Replace with the new checksum:** + ```dockerfile ARG GEOLITE2_COUNTRY_SHA256=436135ee98a521da715a6d483951f3dbbd62557637f2d50d1987fc048874bd5d ``` **Using sed (automated):** + ```bash NEW_CHECKSUM=$(curl -fsSL "https://github.com/P3TERX/GeoLite.mmdb/raw/download/GeoLite2-Country.mmdb" | sha256sum | cut -d' ' -f1) @@ -119,6 +125,7 @@ docker run --rm charon:test-checksum /app/charon --version ``` **Expected output:** + ``` ✅ GeoLite2-Country.mmdb: OK ✅ Successfully tagged charon:test-checksum @@ -171,11 +178,13 @@ fi ``` **Make executable:** + ```bash chmod +x scripts/verify-geolite2-checksum.sh ``` **Run verification:** + ```bash ./scripts/verify-geolite2-checksum.sh ``` @@ -187,22 +196,26 @@ chmod +x scripts/verify-geolite2-checksum.sh ### Issue: Build Still Fails After Update **Symptoms:** + - Checksum verification fails - "FAILED" error persists **Solutions:** 1. **Clear Docker build cache:** + ```bash docker builder prune -af ``` 2. **Verify the checksum was committed:** + ```bash git show HEAD:Dockerfile | grep "GEOLITE2_COUNTRY_SHA256" ``` 3. 
**Re-download and verify upstream file:**
+
```bash
curl -fsSL "https://github.com/P3TERX/GeoLite.mmdb/raw/download/GeoLite2-Country.mmdb" -o /tmp/test.mmdb
sha256sum /tmp/test.mmdb
@@ -212,28 +225,31 @@ chmod +x scripts/verify-geolite2-checksum.sh
### Issue: Upstream File Unavailable (404)
**Symptoms:**
+
- `curl` returns 404 Not Found
- Automated workflow fails with `download_failed` error
**Investigation Steps:**
1. **Check upstream repository:**
-   - Visit: https://github.com/P3TERX/GeoLite.mmdb
+   - Visit: <https://github.com/P3TERX/GeoLite.mmdb>
   - Verify the file still exists at the raw URL
   - Check for repository status or announcements
2. **Check MaxMind status:**
-   - Visit: https://status.maxmind.com/
+   - Visit: <https://status.maxmind.com/>
   - Check for service outages or maintenance
**Temporary Solutions:**
1. **Use cached Docker layer** (if available):
+
```bash
docker build --cache-from ghcr.io/wikid82/charon:latest -t charon:latest .
```
2. **Use local copy** (temporary):
+
```bash
# Download from a working container
docker run --rm ghcr.io/wikid82/charon:latest cat /app/data/geoip/GeoLite2-Country.mmdb > /tmp/GeoLite2-Country.mmdb
@@ -249,12 +265,14 @@ chmod +x scripts/verify-geolite2-checksum.sh
### Issue: Checksum Mismatch on Re-download
**Symptoms:**
+
- Checksum calculated locally differs from what's in the Dockerfile
- Checksum changes between downloads
**Investigation Steps:**
1. **Verify file integrity:**
+
```bash
# Download multiple times and compare
for i in {1..3}; do
@@ -267,12 +285,14 @@ chmod +x scripts/verify-geolite2-checksum.sh
- Try from different network locations
3.
**Verify no MITM proxy:**
+
```bash
# Download via HTTPS and verify certificate
curl -v -fsSL "https://github.com/P3TERX/GeoLite.mmdb/raw/download/GeoLite2-Country.mmdb" -o /tmp/test.mmdb 2>&1 | grep "CN="
```
**If confirmed as supply chain attack:**
+
- **STOP** and do not proceed
- Report to security team
- See [Security Incident Response](../security-incident-response.md)
@@ -280,6 +300,7 @@ chmod +x scripts/verify-geolite2-checksum.sh
### Issue: Multi-Platform Build Fails (arm64)
**Symptoms:**
+
- `linux/amd64` build succeeds
- `linux/arm64` build fails with checksum error
@@ -290,12 +311,14 @@ chmod +x scripts/verify-geolite2-checksum.sh
- Should be identical across all platforms
2. **Check buildx platform emulation:**
+
```bash
docker buildx ls
docker buildx inspect
```
3. **Test arm64 build explicitly:**
+
```bash
docker buildx build --platform linux/arm64 --load -t test-arm64 .
```
@@ -308,8 +331,8 @@ chmod +x scripts/verify-geolite2-checksum.sh
- **Implementation Plan:** [`docs/plans/current_spec.md`](../plans/current_spec.md)
- **QA Report:** [`docs/reports/qa_report.md`](../reports/qa_report.md)
- **Dockerfile:** [`Dockerfile`](../../Dockerfile) (line ~352)
-- **MaxMind GeoLite2:** https://dev.maxmind.com/geoip/geolite2-free-geolocation-data
-- **P3TERX Mirror:** https://github.com/P3TERX/GeoLite.mmdb
+- **MaxMind GeoLite2:** <https://dev.maxmind.com/geoip/geolite2-free-geolocation-data>
+- **P3TERX Mirror:** <https://github.com/P3TERX/GeoLite.mmdb>
---
@@ -321,9 +344,10 @@ chmod +x scripts/verify-geolite2-checksum.sh
**Solution:** Updated one line in `Dockerfile` (line 352) with the correct checksum and implemented an automated workflow to prevent future occurrences.
-**Build Failure URL:** https://github.com/Wikid82/Charon/actions/runs/21584236523/job/62188372617
+**Build Failure URL:** <https://github.com/Wikid82/Charon/actions/runs/21584236523/job/62188372617>
**Related PRs:**
+
- Fix implementation: (link to PR)
- Automated workflow addition: (link to PR)
diff --git a/docs/patches/e2e_workflow_timeout_fix.patch.md b/docs/patches/e2e_workflow_timeout_fix.patch.md
index 1998f991..dda0bc2a 100644
--- a/docs/patches/e2e_workflow_timeout_fix.patch.md
+++ b/docs/patches/e2e_workflow_timeout_fix.patch.md
@@ -6,8 +6,9 @@ index efbcccda..64fcc121 100644
if: |
((inputs.browser || 'all') == 'chromium' || (inputs.browser || 'all') == 'all') &&
((inputs.test_category || 'all') == 'security' || (inputs.test_category || 'all') == 'all')
-- timeout-minutes: 40
-+ timeout-minutes: 60
+
+- timeout-minutes: 40
++ timeout-minutes: 60
env:
CHARON_EMERGENCY_TOKEN: ${{ secrets.CHARON_EMERGENCY_TOKEN }}
CHARON_EMERGENCY_SERVER_ENABLED: "true"
@@ -15,42 +16,45 @@ index efbcccda..64fcc121 100644
npx playwright test \
--project=chromium \
-+ --output=playwright-output/security-chromium \
++ --output=playwright-output/security-chromium \
tests/security-enforcement/ \
tests/security/ \
tests/integration/multi-feature-workflows.spec.ts || STATUS=$?
+
@@ -370,6 +371,25 @@ jobs:
path: test-results/**/*.zip
retention-days: 7
-+ - name: Collect diagnostics
-+ if: always()
-+ run: |
-+ mkdir -p diagnostics
-+ uptime > diagnostics/uptime.txt
-+ free -m > diagnostics/free-m.txt
-+ df -h > diagnostics/df-h.txt
-+ ps aux > diagnostics/ps-aux.txt
-+ docker ps -a > diagnostics/docker-ps.txt || true
-+ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
-+
-+ - name: Upload diagnostics
-+ if: always()
-+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
-+ with:
-+ name: e2e-diagnostics-chromium-security
-+ path: diagnostics/
-+ retention-days: 7
-+
++ - name: Collect diagnostics
++ if: always()
++ run: |
++ mkdir -p diagnostics
++ uptime > diagnostics/uptime.txt
++ free -m > diagnostics/free-m.txt
++ df -h > diagnostics/df-h.txt
++ ps aux > diagnostics/ps-aux.txt
++ docker ps -a > diagnostics/docker-ps.txt || true
++ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
++
++ - name: Upload diagnostics
++ if: always()
++ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
++ with:
++ name: e2e-diagnostics-chromium-security
++ path: diagnostics/
++ retention-days: 7
++
- name: Collect Docker logs on failure
if: failure()
run: |
+
@@ -394,7 +414,7 @@ jobs:
if: |
((inputs.browser || 'all') == 'firefox' || (inputs.browser || 'all') == 'all') &&
((inputs.test_category || 'all') == 'security' || (inputs.test_category || 'all') == 'all')
-- timeout-minutes: 40
-+ timeout-minutes: 60
+
+- timeout-minutes: 40
++ timeout-minutes: 60
env:
CHARON_EMERGENCY_TOKEN: ${{ secrets.CHARON_EMERGENCY_TOKEN }}
CHARON_EMERGENCY_SERVER_ENABLED: "true"
@@ -58,42 +62,45 @@ index efbcccda..64fcc121 100644
npx playwright test \
--project=firefox \
-+ --output=playwright-output/security-firefox \
++ --output=playwright-output/security-firefox \
tests/security-enforcement/ \
tests/security/ \
tests/integration/multi-feature-workflows.spec.ts || STATUS=$?
+
@@ -559,6 +580,25 @@ jobs:
path: test-results/**/*.zip
retention-days: 7
-+ - name: Collect diagnostics
-+ if: always()
-+ run: |
-+ mkdir -p diagnostics
-+ uptime > diagnostics/uptime.txt
-+ free -m > diagnostics/free-m.txt
-+ df -h > diagnostics/df-h.txt
-+ ps aux > diagnostics/ps-aux.txt
-+ docker ps -a > diagnostics/docker-ps.txt || true
-+ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
-+
-+ - name: Upload diagnostics
-+ if: always()
-+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
-+ with:
-+ name: e2e-diagnostics-firefox-security
-+ path: diagnostics/
-+ retention-days: 7
-+
++ - name: Collect diagnostics
++ if: always()
++ run: |
++ mkdir -p diagnostics
++ uptime > diagnostics/uptime.txt
++ free -m > diagnostics/free-m.txt
++ df -h > diagnostics/df-h.txt
++ ps aux > diagnostics/ps-aux.txt
++ docker ps -a > diagnostics/docker-ps.txt || true
++ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
++
++ - name: Upload diagnostics
++ if: always()
++ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
++ with:
++ name: e2e-diagnostics-firefox-security
++ path: diagnostics/
++ retention-days: 7
++
- name: Collect Docker logs on failure
if: failure()
run: |
+
@@ -583,7 +623,7 @@ jobs:
if: |
((inputs.browser || 'all') == 'webkit' || (inputs.browser || 'all') == 'all') &&
((inputs.test_category || 'all') == 'security' || (inputs.test_category || 'all') == 'all')
-- timeout-minutes: 40
-+ timeout-minutes: 60
+
+- timeout-minutes: 40
++ timeout-minutes: 60
env:
CHARON_EMERGENCY_TOKEN: ${{ secrets.CHARON_EMERGENCY_TOKEN }}
CHARON_EMERGENCY_SERVER_ENABLED: "true"
@@ -101,42 +108,45 @@ index efbcccda..64fcc121 100644
npx playwright test \
--project=webkit \
-+ --output=playwright-output/security-webkit \
++ --output=playwright-output/security-webkit \
tests/security-enforcement/ \
tests/security/ \
tests/integration/multi-feature-workflows.spec.ts || STATUS=$?
+
@@ -748,6 +789,25 @@ jobs:
path: test-results/**/*.zip
retention-days: 7
-+ - name: Collect diagnostics
-+ if: always()
-+ run: |
-+ mkdir -p diagnostics
-+ uptime > diagnostics/uptime.txt
-+ free -m > diagnostics/free-m.txt
-+ df -h > diagnostics/df-h.txt
-+ ps aux > diagnostics/ps-aux.txt
-+ docker ps -a > diagnostics/docker-ps.txt || true
-+ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
-+
-+ - name: Upload diagnostics
-+ if: always()
-+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
-+ with:
-+ name: e2e-diagnostics-webkit-security
-+ path: diagnostics/
-+ retention-days: 7
-+
++ - name: Collect diagnostics
++ if: always()
++ run: |
++ mkdir -p diagnostics
++ uptime > diagnostics/uptime.txt
++ free -m > diagnostics/free-m.txt
++ df -h > diagnostics/df-h.txt
++ ps aux > diagnostics/ps-aux.txt
++ docker ps -a > diagnostics/docker-ps.txt || true
++ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
++
++ - name: Upload diagnostics
++ if: always()
++ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
++ with:
++ name: e2e-diagnostics-webkit-security
++ path: diagnostics/
++ retention-days: 7
++
- name: Collect Docker logs on failure
if: failure()
run: |
+
@@ -779,7 +839,7 @@ jobs:
if: |
((inputs.browser || 'all') == 'chromium' || (inputs.browser || 'all') == 'all') &&
((inputs.test_category || 'all') == 'non-security' || (inputs.test_category || 'all') == 'all')
-- timeout-minutes: 30
-+ timeout-minutes: 60
+
+- timeout-minutes: 30
++ timeout-minutes: 60
env:
CHARON_EMERGENCY_TOKEN: ${{ secrets.CHARON_EMERGENCY_TOKEN }}
CHARON_EMERGENCY_SERVER_ENABLED: "true"
@@ -144,57 +154,61 @@ index efbcccda..64fcc121 100644
npx playwright test \
--project=chromium \
--shard=${{ matrix.shard }}/${{ matrix.total-shards }} \
-+ --output=playwright-output/chromium-shard-${{ matrix.shard }} \
++ --output=playwright-output/chromium-shard-${{ matrix.shard }} \
tests/core \
tests/dns-provider-crud.spec.ts \
tests/dns-provider-types.spec.ts \
+
@@ -915,6 +976,14 @@ jobs:
path: playwright-report/
retention-days: 14
-+ - name: Upload Playwright output (Chromium shard ${{ matrix.shard }})
-+ if: always()
-+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
-+ with:
-+ name: playwright-output-chromium-shard-${{ matrix.shard }}
-+ path: playwright-output/chromium-shard-${{ matrix.shard }}/
-+ retention-days: 7
-+
++ - name: Upload Playwright output (Chromium shard ${{ matrix.shard }})
++ if: always()
++ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
++ with:
++ name: playwright-output-chromium-shard-${{ matrix.shard }}
++ path: playwright-output/chromium-shard-${{ matrix.shard }}/
++ retention-days: 7
++
- name: Upload Chromium coverage (if enabled)
if: always() && (inputs.playwright_coverage == 'true' || vars.PLAYWRIGHT_COVERAGE == '1')
uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
+
@@ -931,6 +1000,25 @@ jobs:
path: test-results/**/*.zip
retention-days: 7
-+ - name: Collect diagnostics
-+ if: always()
-+ run: |
-+ mkdir -p diagnostics
-+ uptime > diagnostics/uptime.txt
-+ free -m > diagnostics/free-m.txt
-+ df -h > diagnostics/df-h.txt
-+ ps aux > diagnostics/ps-aux.txt
-+ docker ps -a > diagnostics/docker-ps.txt || true
-+ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
-+
-+ - name: Upload diagnostics
-+ if: always()
-+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
-+ with:
-+ name: e2e-diagnostics-chromium-shard-${{ matrix.shard }}
-+ path: diagnostics/
-+ retention-days: 7
-+
++ - name: Collect diagnostics
++ if: always()
++ run: |
++ mkdir -p diagnostics
++ uptime > diagnostics/uptime.txt
++ free -m > diagnostics/free-m.txt
++ df -h > diagnostics/df-h.txt
++ ps aux > diagnostics/ps-aux.txt
++ docker ps -a > diagnostics/docker-ps.txt || true
++ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
++
++ - name: Upload diagnostics
++ if: always()
++ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
++ with:
++ name: e2e-diagnostics-chromium-shard-${{ matrix.shard }}
++ path: diagnostics/
++ retention-days: 7
++
- name: Collect Docker logs on failure
if: failure()
run: |
+
@@ -955,7 +1043,7 @@ jobs:
if: |
((inputs.browser || 'all') == 'firefox' || (inputs.browser || 'all') == 'all') &&
((inputs.test_category || 'all') == 'non-security' || (inputs.test_category || 'all') == 'all')
-- timeout-minutes: 30
-+ timeout-minutes: 60
+
+- timeout-minutes: 30
++ timeout-minutes: 60
env:
CHARON_EMERGENCY_TOKEN: ${{ secrets.CHARON_EMERGENCY_TOKEN }}
CHARON_EMERGENCY_SERVER_ENABLED: "true"
@@ -202,57 +216,61 @@ index efbcccda..64fcc121 100644
npx playwright test \
--project=firefox \
--shard=${{ matrix.shard }}/${{ matrix.total-shards }} \
-+ --output=playwright-output/firefox-shard-${{ matrix.shard }} \
++ --output=playwright-output/firefox-shard-${{ matrix.shard }} \
tests/core \
tests/dns-provider-crud.spec.ts \
tests/dns-provider-types.spec.ts \
+
@@ -1099,6 +1188,14 @@ jobs:
path: playwright-report/
retention-days: 14
-+ - name: Upload Playwright output (Firefox shard ${{ matrix.shard }})
-+ if: always()
-+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
-+ with:
-+ name: playwright-output-firefox-shard-${{ matrix.shard }}
-+ path: playwright-output/firefox-shard-${{ matrix.shard }}/
-+ retention-days: 7
-+
++ - name: Upload Playwright output (Firefox shard ${{ matrix.shard }})
++ if: always()
++ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
++ with:
++ name: playwright-output-firefox-shard-${{ matrix.shard }}
++ path: playwright-output/firefox-shard-${{ matrix.shard }}/
++ retention-days: 7
++
- name: Upload Firefox coverage (if enabled)
if: always() && (inputs.playwright_coverage == 'true' || vars.PLAYWRIGHT_COVERAGE == '1')
uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
+
@@ -1115,6 +1212,25 @@ jobs:
path: test-results/**/*.zip
retention-days: 7
-+ - name: Collect diagnostics
-+ if: always()
-+ run: |
-+ mkdir -p diagnostics
-+ uptime > diagnostics/uptime.txt
-+ free -m > diagnostics/free-m.txt
-+ df -h > diagnostics/df-h.txt
-+ ps aux > diagnostics/ps-aux.txt
-+ docker ps -a > diagnostics/docker-ps.txt || true
-+ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
-+
-+ - name: Upload diagnostics
-+ if: always()
-+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
-+ with:
-+ name: e2e-diagnostics-firefox-shard-${{ matrix.shard }}
-+ path: diagnostics/
-+ retention-days: 7
-+
++ - name: Collect diagnostics
++ if: always()
++ run: |
++ mkdir -p diagnostics
++ uptime > diagnostics/uptime.txt
++ free -m > diagnostics/free-m.txt
++ df -h > diagnostics/df-h.txt
++ ps aux > diagnostics/ps-aux.txt
++ docker ps -a > diagnostics/docker-ps.txt || true
++ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
++
++ - name: Upload diagnostics
++ if: always()
++ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
++ with:
++ name: e2e-diagnostics-firefox-shard-${{ matrix.shard }}
++ path: diagnostics/
++ retention-days: 7
++
- name: Collect Docker logs on failure
if: failure()
run: |
+
@@ -1139,7 +1255,7 @@ jobs:
if: |
((inputs.browser || 'all') == 'webkit' || (inputs.browser || 'all') == 'all') &&
((inputs.test_category || 'all') == 'non-security' || (inputs.test_category || 'all') == 'all')
-- timeout-minutes: 30
-+ timeout-minutes: 60
+
+- timeout-minutes: 30
++ timeout-minutes: 60
env:
CHARON_EMERGENCY_TOKEN: ${{ secrets.CHARON_EMERGENCY_TOKEN }}
CHARON_EMERGENCY_SERVER_ENABLED: "true"
@@ -260,48 +278,50 @@ index efbcccda..64fcc121 100644
npx playwright test \
--project=webkit \
--shard=${{ matrix.shard }}/${{ matrix.total-shards }} \
-+ --output=playwright-output/webkit-shard-${{ matrix.shard }} \
++ --output=playwright-output/webkit-shard-${{ matrix.shard }} \
tests/core \
tests/dns-provider-crud.spec.ts \
tests/dns-provider-types.spec.ts \
+
@@ -1283,6 +1400,14 @@ jobs:
path: playwright-report/
retention-days: 14
-+ - name: Upload Playwright output (WebKit shard ${{ matrix.shard }})
-+ if: always()
-+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6
-+ with:
-+ name: playwright-output-webkit-shard-${{ matrix.shard }}
-+ path: playwright-output/webkit-shard-${{ matrix.shard }}/
-+ retention-days: 7
-+
++ - name: Upload Playwright output (WebKit shard ${{ matrix.shard }})
++ if: always()
++ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6
++ with:
++ name: playwright-output-webkit-shard-${{ matrix.shard }}
++ path: playwright-output/webkit-shard-${{ matrix.shard }}/
++ retention-days: 7
++
- name: Upload WebKit coverage (if enabled)
if: always() && (inputs.playwright_coverage == 'true' || vars.PLAYWRIGHT_COVERAGE == '1')
uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6
+
@@ -1299,6 +1424,25 @@ jobs:
path: test-results/**/*.zip
retention-days: 7
-+ - name: Collect diagnostics
-+ if: always()
-+ run: |
-+ mkdir -p diagnostics
-+ uptime > diagnostics/uptime.txt
-+ free -m > diagnostics/free-m.txt
-+ df -h > diagnostics/df-h.txt
-+ ps aux > diagnostics/ps-aux.txt
-+ docker ps -a > diagnostics/docker-ps.txt || true
-+ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
-+
-+ - name: Upload diagnostics
-+ if: always()
-+ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6
-+ with:
-+ name: e2e-diagnostics-webkit-shard-${{ matrix.shard }}
-+ path: diagnostics/
-+ retention-days: 7
-+
++ - name: Collect diagnostics
++ if: always()
++ run: |
++ mkdir -p diagnostics
++ uptime > diagnostics/uptime.txt
++ free -m > diagnostics/free-m.txt
++ df -h > diagnostics/df-h.txt
++ ps aux > diagnostics/ps-aux.txt
++ docker ps -a > diagnostics/docker-ps.txt || true
++ docker logs --tail 500 charon-e2e > diagnostics/docker-charon-e2e.log 2>&1 || true
++
++ - name: Upload diagnostics
++ if: always()
++ uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6
++ with:
++ name: e2e-diagnostics-webkit-shard-${{ matrix.shard }}
++ path: diagnostics/
++ retention-days: 7
++
- name: Collect Docker logs on failure
if: failure()
run: |
diff --git a/docs/performance/feature-flags-endpoint.md b/docs/performance/feature-flags-endpoint.md
index f63a31ff..c61ef10f 100644
--- a/docs/performance/feature-flags-endpoint.md
+++ b/docs/performance/feature-flags-endpoint.md
@@ -31,6 +31,7 @@ for _, s := range settings {
```
**Key Improvements:**
+
- **Single Query:** `WHERE key IN (?, ?, ?)` fetches all flags in one database round-trip
- **O(1) Lookups:** Map-based access eliminates linear search overhead
- **Error Handling:** Explicit error logging and HTTP 500 response on failure
@@ -56,6 +57,7 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error {
```
**Key Improvements:**
+
- **Atomic Updates:** All flag changes commit or rollback together
- **Error Recovery:** Transaction rollback prevents partial state
- **Improved Logging:** Explicit error messages for debugging
@@ -65,10 +67,12 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error {
### Before Optimization (Baseline - N+1 Pattern)
**Architecture:**
+
- GetFlags(): 3 sequential `WHERE key = ?` queries (one per flag)
- UpdateFlags(): Multiple separate transactions
**Measured Latency (Expected):**
+
- **GET P50:** 300ms (CI environment)
- **GET P95:** 500ms
- **GET P99:** 600ms
@@ -77,20 +81,24 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error {
- **PUT P50:** 300ms
- **PUT P95:** 500ms
- **PUT P99:** 600ms
**Query Count:**
+
- GET: 3 queries (N+1 pattern, N=3 flags)
- PUT: 1-3
queries depending on flag count **CI Impact:** + - Test flakiness: ~30% failure rate due to timeouts - E2E test pass rate: ~70% ### After Optimization (Current - Batch Query + Transaction) **Architecture:** + - GetFlags(): 1 batch query `WHERE key IN (?, ?, ?)` - UpdateFlags(): 1 transaction wrapping all updates **Measured Latency (Target):** + - **GET P50:** 100ms (3x faster) - **GET P95:** 150ms (3.3x faster) - **GET P99:** 200ms (3x faster) @@ -99,10 +107,12 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error { - **PUT P99:** 200ms (3x faster) **Query Count:** + - GET: 1 batch query (N+1 eliminated) - PUT: 1 transaction (atomic) **CI Impact (Expected):** + - Test flakiness: 0% (with retry logic + polling) - E2E test pass rate: 100% @@ -125,11 +135,13 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error { **Status:** Complete **Changes:** + - Added `defer` timing to GetFlags() and UpdateFlags() - Log format: `[METRICS] GET/PUT /feature-flags: {duration}ms` - CI pipeline captures P50/P95/P99 metrics **Files Modified:** + - `backend/internal/api/handlers/feature_flags_handler.go` ### Phase 1: Backend Optimization - N+1 Query Fix @@ -139,16 +151,19 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error { **Priority:** P0 - Critical CI Blocker **Changes:** + - **GetFlags():** Replaced N+1 loop with batch query `WHERE key IN (?)` - **UpdateFlags():** Wrapped updates in single transaction - **Tests:** Added batch query and transaction rollback tests - **Benchmarks:** Added BenchmarkGetFlags and BenchmarkUpdateFlags **Files Modified:** + - `backend/internal/api/handlers/feature_flags_handler.go` - `backend/internal/api/handlers/feature_flags_handler_test.go` **Expected Impact:** + - 3-6x latency reduction (600ms → 200ms P99) - Elimination of N+1 query anti-pattern - Atomic updates with rollback on error @@ -159,32 +174,38 @@ if err := h.DB.Transaction(func(tx *gorm.DB) error { ### Test Helpers Used **Polling Helper:** `waitForFeatureFlagPropagation()` + - Polls 
`/api/v1/feature-flags` until expected state confirmed - Default interval: 500ms - Default timeout: 30s (150x safety margin over 200ms P99) **Retry Helper:** `retryAction()` + - 3 max attempts with exponential backoff (2s, 4s, 8s) - Handles transient network/DB failures ### Timeout Strategy **Helper Defaults:** + - `clickAndWaitForResponse()`: 30s timeout - `waitForAPIResponse()`: 30s timeout - No explicit timeouts in test files (rely on helper defaults) **Typical Poll Count:** + - Local: 1-2 polls (50-200ms response + 500ms interval) - CI: 1-3 polls (50-200ms response + 500ms interval) ### Test Files **E2E Tests:** + - `tests/settings/system-settings.spec.ts` - Feature toggle tests - `tests/utils/wait-helpers.ts` - Polling and retry helpers **Backend Tests:** + - `backend/internal/api/handlers/feature_flags_handler_test.go` - `backend/internal/api/handlers/feature_flags_handler_coverage_test.go` @@ -205,11 +226,13 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$ ### Benchmark Analysis **GetFlags Benchmark:** + - Measures single batch query performance - Tests with 3 flags in database - Includes JSON serialization overhead **UpdateFlags Benchmark:** + - Measures transaction wrapping performance - Tests atomic update of 3 flags - Includes JSON deserialization and validation @@ -219,14 +242,17 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$ ### Why Batch Query Over Individual Queries? 
**Problem:** N+1 pattern causes linear latency scaling + - 3 flags = 3 queries × 200ms = 600ms total - 10 flags = 10 queries × 200ms = 2000ms total **Solution:** Single batch query with IN clause + - N flags = 1 query × 200ms = 200ms total - Constant time regardless of flag count **Trade-offs:** + - ✅ 3-6x latency reduction - ✅ Scales to more flags without performance degradation - ⚠️ Slightly more complex code (map-based lookup) @@ -234,14 +260,17 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$ ### Why Transaction Wrapping? **Problem:** Multiple separate writes risk partial state + - Flag 1 succeeds, Flag 2 fails → inconsistent state - No rollback mechanism for failed updates **Solution:** Single transaction for all updates + - All succeed together or all rollback - ACID guarantees for multi-flag updates **Trade-offs:** + - ✅ Atomic updates with rollback on error - ✅ Prevents partial state corruption - ⚠️ Slightly longer locks (mitigated by fast SQLite) @@ -253,11 +282,13 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$ **Status:** Not implemented (not needed after Phase 1 optimization) **Rationale:** + - Current latency (50-200ms) is acceptable for feature flags - Feature flags change infrequently (not a hot path) - Adding cache increases complexity without significant benefit **If Needed:** + - Use Redis or in-memory cache with TTL=60s - Invalidate on PUT operations - Expected improvement: 50-200ms → 10-50ms @@ -267,11 +298,13 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$ **Status:** SQLite default indexes sufficient **Rationale:** + - `settings.key` column used in WHERE clauses - SQLite automatically indexes primary key - Query plan analysis shows index usage **If Needed:** + - Add explicit index: `CREATE INDEX idx_settings_key ON settings(key)` - Expected improvement: Minimal (already fast) @@ -280,11 +313,13 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags 
-benchmem -run=^$ **Status:** GORM default pooling sufficient **Rationale:** + - GORM uses `database/sql` pool by default - Current concurrency limits adequate - No connection exhaustion observed **If Needed:** + - Tune `SetMaxOpenConns()` and `SetMaxIdleConns()` - Expected improvement: 10-20% under high load @@ -293,12 +328,14 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$ ### Metrics to Track **Backend Metrics:** + - P50/P95/P99 latency for GET and PUT operations - Query count per request (should remain 1 for GET) - Transaction count per PUT (should remain 1) - Error rate (target: <0.1%) **E2E Metrics:** + - Test pass rate for feature toggle tests - Retry attempt frequency (target: <5%) - Polling iteration count (typical: 1-3) @@ -307,11 +344,13 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$ ### Alerting Thresholds **Backend Alerts:** + - P99 > 500ms → Investigate regression (2.5x slower than optimized) - Error rate > 1% → Check database health - Query count > 1 for GET → N+1 pattern reintroduced **E2E Alerts:** + - Test pass rate < 95% → Check for new flakiness - Timeout errors > 0 → Investigate CI environment - Retry rate > 10% → Investigate transient failure source @@ -319,10 +358,12 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$ ### Dashboard **CI Metrics:** + - Link: `.github/workflows/e2e-tests.yml` artifacts - Extracts `[METRICS]` logs for P50/P95/P99 analysis **Backend Logs:** + - Docker container logs with `[METRICS]` tag - Example: `[METRICS] GET /feature-flags: 120ms` @@ -331,15 +372,18 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$ ### High Latency (P99 > 500ms) **Symptoms:** + - E2E tests timing out - Backend logs show latency spikes **Diagnosis:** + 1. Check query count: `grep "SELECT" backend/logs/query.log` 2. Verify batch query: Should see `WHERE key IN (...)` 3. Check transaction wrapping: Should see single `BEGIN ... 
COMMIT` **Remediation:** + - If N+1 pattern detected: Verify batch query implementation - If transaction missing: Verify transaction wrapping - If database locks: Check concurrent access patterns @@ -347,15 +391,18 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$ ### Transaction Rollback Errors **Symptoms:** + - PUT requests return 500 errors - Backend logs show transaction failure **Diagnosis:** + 1. Check error message: `grep "Failed to update feature flags" backend/logs/app.log` 2. Verify database constraints: Unique key constraints, foreign keys 3. Check database connectivity: Connection pool exhaustion **Remediation:** + - If constraint violation: Fix invalid flag key or value - If connection issue: Tune connection pool settings - If deadlock: Analyze concurrent access patterns @@ -363,15 +410,18 @@ go test ./internal/api/handlers/ -bench=Benchmark.*Flags -benchmem -run=^$ ### E2E Test Flakiness **Symptoms:** + - Tests pass locally, fail in CI - Timeout errors in Playwright logs **Diagnosis:** + 1. Check backend latency: `grep "[METRICS]" ci-logs.txt` 2. Verify retry logic: Should see retry attempts in logs 3. Check polling behavior: Should see multiple GET requests **Remediation:** + - If backend slow: Investigate CI environment (disk I/O, CPU) - If no retries: Verify `retryAction()` wrapper in test - If no polling: Verify `waitForFeatureFlagPropagation()` usage diff --git a/docs/plans/rate_limit_ci_fix_spec.md b/docs/plans/rate_limit_ci_fix_spec.md index 13ba1215..4a740a85 100644 --- a/docs/plans/rate_limit_ci_fix_spec.md +++ b/docs/plans/rate_limit_ci_fix_spec.md @@ -11,6 +11,7 @@ ### Issue 1: `rate_limit` handler never appears in running Caddy config **Observed symptom** (from CI log): + ``` Attempt 10/10: rate_limit handler not found, waiting... 
✗ rate_limit handler verification failed after 10 attempts @@ -22,6 +23,7 @@ Rate limit enforcement test FAILED #### Code path trace The `verify_rate_limit_config` function in `scripts/rate_limit_integration.sh` (lines ~35–58) executes: + ```bash caddy_config=$(curl -s http://localhost:2119/config 2>/dev/null || echo "") if echo "$caddy_config" | grep -q '"handler":"rate_limit"'; then @@ -48,6 +50,7 @@ The handler is absent from Caddy's running config because `ApplyConfig` in `back **Root cause A — silent failure of the security config POST step** (contributing): The security config POST step in the script discards stdout only; curl exits 0 for HTTP 4xx without -f flag, so auth failures are invisible: + ```bash # scripts/rate_limit_integration.sh, ~line 248 curl -s -X POST -H "Content-Type: application/json" \ @@ -55,9 +58,11 @@ curl -s -X POST -H "Content-Type: application/json" \ -b ${TMP_COOKIE} \ http://localhost:8280/api/v1/security/config >/dev/null ``` + No HTTP status check is performed. If this returns 4xx (e.g., `403 Forbidden` because the requesting user lacks the admin role, or `401 Unauthorized` because the cookie was not accepted), the config is never saved to DB, `ApplyConfig` is never called with the rate_limit values, and the handler is never injected. The route is protected by `middleware.RequireRole(models.RoleAdmin)` (routes.go:572–573): + ```go securityAdmin := management.Group("/security") securityAdmin.Use(middleware.RequireRole(models.RoleAdmin)) @@ -69,6 +74,7 @@ A non-admin authenticated user, or an unauthenticated request, returns `403` sil **Root cause B — warn-and-proceed instead of fail-hard** (amplifier): `verify_rate_limit_config` returns `1` on failure, but the calling site in the script treats the failure as non-fatal: + ```bash # scripts/rate_limit_integration.sh, ~line 269 if ! verify_rate_limit_config; then @@ -76,11 +82,13 @@ if ! verify_rate_limit_config; then echo "Proceeding with test anyway..." 
fi ``` + The enforcement test that follows is guaranteed to fail when the handler is absent (all requests pass through with HTTP 200, never hitting 429), yet the test proceeds unconditionally. The verification failure should be a hard exit. **Root cause C — no response code check for proxy host creation** (contributing): The proxy host creation at step 5 checks the status code (`201` vs other), but allows non-201 with a soft log message: + ```bash if [ "$CREATE_STATUS" = "201" ]; then echo "✓ Proxy host created successfully" @@ -88,11 +96,13 @@ else echo " Proxy host may already exist (status: $CREATE_STATUS)" fi ``` + If this returns `401` (auth failure), no proxy host is registered. Requests to `http://localhost:8180/get` with `Host: ratelimit.local` then hit Caddy's catch-all route returning HTTP 200 (the Charon frontend), not the backend. No 429 will ever appear regardless of rate limit configuration. **Root cause D — `ApplyConfig` failure is swallowed; Caddy not yet ready when config is posted** (primary): In `UpdateConfig` (`security_handler.go:289–292`): + ```go if h.caddyManager != nil { if err := h.caddyManager.ApplyConfig(c.Request.Context()); err != nil { @@ -101,6 +111,7 @@ if h.caddyManager != nil { } c.JSON(http.StatusOK, gin.H{"config": payload}) ``` + If `ApplyConfig` fails (Caddy not yet fully initialized, config validation error), the error is logged as a warning but the HTTP response is still `200 OK`. The test script sees 200, assumes success, and proceeds. --- @@ -110,11 +121,13 @@ If `ApplyConfig` fails (Caddy not yet fully initialized, config validation error **Observed symptom**: During non-CI Docker builds, the GeoIP download step prints `⚠️ Checksum failed` and creates a `.placeholder` file, but the downloaded `.mmdb` is left on disk alongside the placeholder. 
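The `sha256sum -c -` idiom the Dockerfile relies on can be exercised in isolation to see both branches. A standalone sketch (the temporary file and the all-zero hash are illustrative stand-ins, not the real GeoLite2 artifact):

```shell
# Recreate the Dockerfile's check: pipe "<hash>  <path>" into sha256sum -c -.
# A matching hash exits 0; a mismatched hash exits non-zero (the ⚠️ branch).
TMP="$(mktemp -d)"
printf 'not-a-real-mmdb' > "$TMP/GeoLite2-Country.mmdb"
GOOD_SHA="$(sha256sum "$TMP/GeoLite2-Country.mmdb" | awk '{print $1}')"
BAD_SHA="0000000000000000000000000000000000000000000000000000000000000000"

if echo "$GOOD_SHA  $TMP/GeoLite2-Country.mmdb" | sha256sum -c - >/dev/null 2>&1; then
  echo "good hash: verified"
fi
if ! echo "$BAD_SHA  $TMP/GeoLite2-Country.mmdb" | sha256sum -c - >/dev/null 2>&1; then
  echo "bad hash: checksum failed (placeholder branch)"
fi
rm -rf "$TMP"
```

Note that nothing in the failure branch deletes the file that just failed verification — which is exactly why the unverified `.mmdb` survives on disk next to the `.placeholder`.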
**Code location**: `Dockerfile`, lines that contain: + ```dockerfile ARG GEOLITE2_COUNTRY_SHA256=aa154fc6bcd712644de232a4abcdd07dac1f801308c0b6f93dbc2b375443da7b ``` **Non-CI verification block** (Dockerfile, local build path): + ```dockerfile if [ -s /app/data/geoip/GeoLite2-Country.mmdb ] && \ echo "${GEOLITE2_COUNTRY_SHA256} /app/data/geoip/GeoLite2-Country.mmdb" | sha256sum -c -; then @@ -146,6 +159,7 @@ fi; **Required change**: Capture the HTTP status code from the login response. Fail fast if login returns non-200. Exact change — replace: + ```bash curl -s -X POST -H "Content-Type: application/json" \ -d '{"email":"ratelimit@example.local","password":"password123"}' \ @@ -156,6 +170,7 @@ echo "✓ Authentication complete" ``` With: + ```bash LOGIN_STATUS=$(curl -s -w "\n%{http_code}" -X POST -H "Content-Type: application/json" \ -d '{"email":"ratelimit@example.local","password":"password123"}' \ @@ -174,6 +189,7 @@ echo "✓ Authentication complete (HTTP $LOGIN_STATUS)" **Current behavior**: Non-201 responses are treated as "may already exist" and execution continues — including `401`/`403` auth failures. Required change — replace: + ```bash if [ "$CREATE_STATUS" = "201" ]; then echo "✓ Proxy host created successfully" @@ -183,6 +199,7 @@ fi ``` With: + ```bash if [ "$CREATE_STATUS" = "201" ]; then echo "✓ Proxy host created successfully" @@ -201,6 +218,7 @@ fi **Rationale**: Root Cause D is the primary driver of handler-not-found failures. If Caddy's admin API is not yet fully initialized when the security config is POSTed, `ApplyConfig` fails silently (logged as a warning only), the rate_limit handler is never injected into Caddy's running config, and the verification loop times out. The readiness gate ensures Caddy is accepting admin API requests before any config change is attempted. **Required change** — insert before the security config POST: + ```bash echo "Waiting for Caddy admin API to be ready..." 
for i in {1..20}; do @@ -224,6 +242,7 @@ done **Current behavior**: Response is discarded with `>/dev/null`. No status check. Required change — replace: + ```bash curl -s -X POST -H "Content-Type: application/json" \ -d "${SEC_CFG_PAYLOAD}" \ @@ -234,6 +253,7 @@ echo "✓ Rate limiting configured" ``` With: + ```bash SEC_CONFIG_RESP=$(curl -s -w "\n%{http_code}" -X POST -H "Content-Type: application/json" \ -d "${SEC_CFG_PAYLOAD}" \ @@ -258,6 +278,7 @@ echo "✓ Rate limiting configured (HTTP $SEC_CONFIG_STATUS)" **Current behavior**: Failed verification logs a warning and continues. Required change — replace: + ```bash echo "Waiting for Caddy to apply configuration..." sleep 5 @@ -270,6 +291,7 @@ fi ``` With: + ```bash echo "Waiting for Caddy to apply configuration..." sleep 8 @@ -307,6 +329,7 @@ local wait=5 # was: 3 #### Change 7 — Use trailing slash on Caddy admin API URL in `verify_rate_limit_config` **Location**: `verify_rate_limit_config`, line ~42: + ```bash caddy_config=$(curl -s http://localhost:2119/config 2>/dev/null || echo "") ``` @@ -314,11 +337,13 @@ caddy_config=$(curl -s http://localhost:2119/config 2>/dev/null || echo "") Caddy's admin API specification defines `GET /config/` (with trailing slash) as the canonical endpoint for the full running config. Omitting the slash works in practice because Caddy does not redirect, but using the canonical form is more correct and avoids any future behavioral change: Replace: + ```bash caddy_config=$(curl -s http://localhost:2119/config 2>/dev/null || echo "") ``` With: + ```bash caddy_config=$(curl -s http://localhost:2119/config/ 2>/dev/null || echo "") ``` @@ -377,6 +402,7 @@ fi **Important**: Do NOT remove the `ARG GEOLITE2_COUNTRY_SHA256` declaration from the Dockerfile. The `update-geolite2.yml` workflow uses `sed` to update that ARG. 
If the ARG disappears, the workflow's `sed` command will silently no-op and fail to update the Dockerfile on next run, leaving the stale hash in source while the workflow reports success. Keeping the ARG (even unused) preserves Renovate/workflow compatibility. Keep: + ```dockerfile ARG GEOLITE2_COUNTRY_SHA256=aa154fc6bcd712644de232a4abcdd07dac1f801308c0b6f93dbc2b375443da7b ``` @@ -402,6 +428,7 @@ This ARG is now only referenced by the `update-geolite2.yml` workflow (to know i ### Validating Issue 1 fix **Step 1 — Build and run the integration test locally:** + ```bash # From /projects/Charon chmod +x scripts/rate_limit_integration.sh @@ -409,6 +436,7 @@ scripts/rate_limit_integration.sh 2>&1 | tee /tmp/ratelimit-test.log ``` **Expected output sequence (key lines)**: + ``` ✓ Charon API is ready ✓ Authentication complete (HTTP 200) @@ -428,16 +456,20 @@ Sending request 3+1 (should return 429 Too Many Requests)... **Step 2 — Deliberately break auth to verify the new guard fires:** Temporarily change `password123` in the login curl to a wrong password. The test should now print: + ``` ✗ Login failed (HTTP 401) — aborting ``` + and exit with code 1, rather than proceeding to a confusing 429-enforcement failure. **Step 3 — Verify Caddy config contains the handler before enforcement:** + ```bash # After security config step and sleep 8: curl -s http://localhost:2119/config/ | python3 -m json.tool | grep -A2 '"handler": "rate_limit"' ``` + Expected: handler block with `"rate_limits"` sub-key containing `"static"` zone. **Step 4 — CI validation:** Push to a PR and observe the `Rate Limiting Integration` workflow. The workflow now exits at the first unmissable error rather than proceeding to a deceptive "enforcement test FAILED" message. @@ -445,21 +477,27 @@ Expected: handler block with `"rate_limits"` sub-key containing `"static"` zone. ### Validating Issue 2 fix **Step 1 — Local build without CI flag:** + ```bash docker build -t charon:geolip-test --build-arg CI=false . 
2>&1 | grep -E "GeoIP|GeoLite|checksum|✅|⚠️" ``` + Expected: `✅ GeoIP downloaded` (no mention of checksum failure). **Step 2 — Verify file is present and readable:** + ```bash docker run --rm charon:geolip-test stat /app/data/geoip/GeoLite2-Country.mmdb ``` + Expected: file exists with non-zero size, no `.placeholder` alongside. **Step 3 — Confirm ARG still exists for workflow compatibility:** + ```bash grep "GEOLITE2_COUNTRY_SHA256" Dockerfile ``` + Expected: `ARG GEOLITE2_COUNTRY_SHA256=` line is present. --- diff --git a/docs/plans/telegram_implementation_spec.md b/docs/plans/telegram_implementation_spec.md index 57c12d69..e1cef77f 100644 --- a/docs/plans/telegram_implementation_spec.md +++ b/docs/plans/telegram_implementation_spec.md @@ -37,6 +37,7 @@ Content-Type: application/json ``` **Key design decisions:** + - **Token storage:** The bot token is stored in `NotificationProvider.Token` (`json:"-"`, encrypted at rest) — never in the URL field. This mirrors the Gotify pattern where secrets are separated from endpoints. - **URL field:** Stores only the `chat_id` (e.g., `987654321`). At dispatch time, the full API URL is constructed dynamically: `https://api.telegram.org/bot` + decryptedToken + `/sendMessage`. The `chat_id` is passed in the POST body alongside the message text. This prevents token leakage via API responses since URL is `json:"url"`. - **SSRF mitigation:** Before dispatching, validate that the constructed URL hostname is exactly `api.telegram.org`. This prevents SSRF if stored data is tampered with. @@ -475,6 +476,7 @@ Request/response schemas are unchanged. The `type` field now accepts `"telegram" Modeled after `tests/settings/email-notification-provider.spec.ts`. Test scenarios: + 1. Create a Telegram provider (name, chat_id in URL field, bot token in token field, enable events) 2. Verify provider appears in the list 3. 
Edit the Telegram provider (change name, verify token preservation) @@ -611,6 +613,7 @@ Add telegram to the payload matrix test scenarios. **Scope:** Feature flags, service layer, handler layer, all Go unit tests **Files changed:** + - `backend/internal/notifications/feature_flags.go` - `backend/internal/api/handlers/feature_flags_handler.go` - `backend/internal/notifications/router.go` @@ -624,6 +627,7 @@ Add telegram to the payload matrix test scenarios. **Dependencies:** None (self-contained backend change) **Validation gates:** + - `go test ./...` passes - `make lint-fast` passes - Coverage ≥ 85% @@ -636,6 +640,7 @@ Add telegram to the payload matrix test scenarios. **Scope:** Frontend API client, Notifications page, i18n strings, frontend unit tests, Playwright E2E tests **Files changed:** + - `frontend/src/api/notifications.ts` - `frontend/src/pages/Notifications.tsx` - `frontend/src/locales/en/translation.json` @@ -648,6 +653,7 @@ Add telegram to the payload matrix test scenarios. **Dependencies:** PR-1 must be merged first (backend must accept `type: "telegram"`) **Validation gates:** + - `npm test` passes - `npm run type-check` passes - `npx playwright test --project=firefox` passes diff --git a/docs/plans/telegram_remediation_spec.md b/docs/plans/telegram_remediation_spec.md index 12f1e701..7b4eeaf9 100644 --- a/docs/plans/telegram_remediation_spec.md +++ b/docs/plans/telegram_remediation_spec.md @@ -55,6 +55,7 @@ disabled={testMutation.isPending || (isNew && !isEmail)} **Why it was added:** The backend `Test` handler at `notification_provider_handler.go` (L333-336) requires a saved provider ID for all non-email types. For Gotify/Telegram, the server needs the stored token. For Discord/Webhook, the server still fetches the provider from DB. Without a saved provider, the backend returns `MISSING_PROVIDER_ID`. **Why it breaks tests:** Many existing E2E and unit tests click the test button from a **new (unsaved) provider form** using mocked endpoints. 
With the new guard: + 1. The `