

---
post_title: "Current Spec: Stabilize Flaky SMTP STARTTLS+AUTH Unit Test"
categories:
- actions
- testing
- backend
- reliability
tags:
- go
- smtp
- starttls
- unit-tests
- ci-stability
summary: "Implementation plan to remove CI flakiness in MailService STARTTLS+AUTH tests by hardening mock SMTP server lifecycle and connection handling in test helpers only."
post_date: 2026-02-22
---
## Active Plan: Stabilize Flaky SMTP STARTTLS+AUTH Unit Test
Date: 2026-02-22
Status: Active and authoritative
Scope Type: Test reliability hardening (backend unit tests only)
## 1) Introduction
This plan addresses flakiness in the backend unit test
`TestMailService_TestConnection_StartTLSSuccessWithAuth` by hardening the mock
SMTP test helpers in `backend/internal/services/mail_service_test.go`.

Root-cause hypothesis to validate and fix:
- Existing mock SMTP server helpers accept only one connection, then exit.
- STARTTLS + AUTH flows in `net/smtp` are negotiation-heavy (EHLO, STARTTLS, TLS
  handshake, a second EHLO, then AUTH) and can involve additional
  command/connection behavior under CI timing variance.
- Single-accept test server behavior creates race-prone shutdown windows.

Goal:
- Make test helper servers robust and deterministic by accepting connections in
  a loop until cleanup, handling each connection in its own goroutine, and
  enforcing deterministic shutdown that waits for accept-loop exit and active
  handler goroutines (explicit per-connection synchronization via waitgroup).

Non-goal:
- No production mail service behavior changes.
## 2) Research Findings
### 2.1 Current flaky target and helper topology
Primary target test:
- `backend/internal/services/mail_service_test.go`
  - `TestMailService_TestConnection_StartTLSSuccessWithAuth`

Helper functions in the same file (current behavior):
- `startMockSMTPServer(...)`
  - Single `Accept()` call, single connection handler, then goroutine exits.
- `startMockSSLSMTPServer(...)`
  - Single `Accept()` call, single connection handler, then goroutine exits.
- `handleSMTPConn(...)`
  - Handles SMTP protocol conversation for each accepted connection.
### 2.2 Flakiness vector summary
- Current single-accept model is fragile for tests where client behavior may
include additional negotiation/timing paths.
- Cleanup currently closes listener and waits on a done signal, but done is
tied to a single accept goroutine rather than a full server lifecycle.
- Under CI contention, this can surface nondeterministic failures in tests
expecting reliable STARTTLS + AUTH handshake behavior.
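
To make the single-accept fragility concrete, the sketch below models a server that calls `Accept()` once and then lets its goroutine exit. It is an illustration only: `startSingleAcceptServer`, `readGreeting`, and `probe` are hypothetical names, not the repository's helpers, and the example does not show that `net/smtp` opens a second connection. It only shows what happens to any connection that arrives after the first.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"time"
)

// startSingleAcceptServer (hypothetical) models the single-accept shape:
// one Accept, one greeting, then the server goroutine exits.
func startSingleAcceptServer() (net.Listener, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		fmt.Fprint(conn, "220 mock ready\r\n")
	}()
	return ln, nil
}

// readGreeting dials addr and waits up to wait for one line from the server.
func readGreeting(addr string, wait time.Duration) (string, error) {
	conn, err := net.DialTimeout("tcp", addr, time.Second)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	_ = conn.SetReadDeadline(time.Now().Add(wait))
	return bufio.NewReader(conn).ReadString('\n')
}

// probe reports whether the first and the second client each got a greeting.
func probe() (firstOK, secondOK bool) {
	ln, err := startSingleAcceptServer()
	if err != nil {
		return false, false
	}
	defer ln.Close()
	addr := ln.Addr().String()
	_, err1 := readGreeting(addr, time.Second)
	// Nobody calls Accept again, so this read can only end at its deadline.
	_, err2 := readGreeting(addr, 200*time.Millisecond)
	return err1 == nil, err2 == nil
}

func main() {
	first, second := probe()
	fmt.Println("first served:", first, "second served:", second)
}
```

The first client is greeted; the second is never handed to a handler, so its read ends with an error instead of a greeting. An accept loop removes this dependence on exactly one connection arriving.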
### 2.3 Repo config review requested by user
Reviewed files:
- `.gitignore`
- `codecov.yml`
- `.dockerignore`
- `Dockerfile`

Conclusion for this scope:
- No changes required for this test-only helper stabilization.
- Existing ignore/coverage/docker config does not block this plan.
## 3) Requirements (EARS)
- R1: WHEN mock SMTP test helpers are started, THE SYSTEM SHALL accept
connections in a loop until explicit cleanup closes the listener.
- R2: WHEN a connection is accepted, THE SYSTEM SHALL handle it in its own
goroutine so one slow session cannot block new accepts.
- R3: WHEN cleanup is invoked, THE SYSTEM SHALL close the listener and await
deterministic server-loop termination plus completion of active connection
handlers (done channel + waitgroup synchronization).
- R4: IF cleanup wait exceeds timeout, THEN THE SYSTEM SHALL report a failure
signal and return without hanging test execution.
- R5: WHEN this fix is implemented, THE SYSTEM SHALL keep production code
untouched and restrict changes to test helper scope.
- R6: WHEN targeted backend tests run, THE SYSTEM SHALL pass reliably in local
and CI-like conditions.
## 4) Technical Specification
### 4.1 Exact target files and functions (minimal diff scope)
Primary file to edit:
- `backend/internal/services/mail_service_test.go`

Functions in scope:
- `startMockSMTPServer(t *testing.T, tlsConf *tls.Config, supportStartTLS bool, requireAuth bool) (string, func())`
- `startMockSSLSMTPServer(t *testing.T, tlsConf *tls.Config, requireAuth bool) (string, func())`

Related function (read-only unless strictly necessary):
- `handleSMTPConn(conn net.Conn, tlsConf *tls.Config, supportStartTLS bool, requireAuth bool)`

Out of scope:
- `backend/internal/services/mail_service.go`
- Any non-mail-service test files.
### 4.2 Planned helper behavior changes
For both helper server starters:

1. Replace the single `Accept()` with an accept loop:
   - Continue accepting until `Accept()` returns an error because the listener was closed.
   - Treat listener-close accept errors as the normal shutdown path.
2. Spawn a per-connection handler goroutine:
   - Each accepted connection is handled independently.
   - The connection is closed within the goroutine after handling.
3. Deterministic lifecycle signaling:
   - Keep `done` channel signaling tied to server accept-loop exit.
   - Ensure exactly one close of `done`, from the server goroutine.
   - Track active connection-handler goroutines with explicit waitgroup sync.
4. Cleanup contract:
   - `cleanup()` closes the listener first.
   - Wait for accept-loop exit and active handler completion with a bounded timeout (`2s` currently acceptable).
   - Never block indefinitely.
   - The timeout path is a test failure signal (non-hanging), not a silent pass.
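
The four steps above can be sketched as one helper. This is a minimal illustration under stated assumptions, not the repository's implementation: `startLoopServer`, `reporter`, `errCounter`, and `runDemo` are hypothetical names, the handler only writes a greeting instead of calling `handleSMTPConn`, and TLS is omitted. The `reporter` interface stands in for `*testing.T` so the sketch can run outside `go test`.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"net"
	"sync"
	"sync/atomic"
	"time"
)

// reporter is the subset of *testing.T used here, so the sketch also runs
// outside "go test". In a real test helper, pass t directly.
type reporter interface {
	Errorf(format string, args ...any)
}

// errCounter is a stand-in reporter that counts reported failures.
type errCounter struct{ n atomic.Int32 }

func (e *errCounter) Errorf(format string, args ...any) {
	e.n.Add(1)
	fmt.Printf("FAIL: "+format+"\n", args...)
}

// startLoopServer (hypothetical name) accepts connections until cleanup
// closes the listener and runs handle for each one in its own goroutine.
func startLoopServer(t reporter, handle func(net.Conn)) (string, func()) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		t.Errorf("listen: %v", err)
		return "", func() {}
	}
	var wg sync.WaitGroup
	done := make(chan struct{})

	go func() {
		defer close(done) // closed exactly once, only by the accept loop
		for {
			conn, err := ln.Accept()
			if err != nil {
				if !errors.Is(err, net.ErrClosed) {
					t.Errorf("unexpected accept error: %v", err)
				}
				return
			}
			wg.Add(1) // increment before the handler goroutine starts
			go func() {
				defer wg.Done()
				defer conn.Close()
				handle(conn)
			}()
		}
	}()

	cleanup := func() {
		_ = ln.Close() // close first: this unblocks Accept
		finished := make(chan struct{})
		go func() {
			<-done    // accept loop exited, so no more wg.Add calls
			wg.Wait() // every active handler has returned
			close(finished)
		}()
		select {
		case <-finished:
		case <-time.After(2 * time.Second):
			t.Errorf("mock server cleanup timed out")
		}
	}
	return ln.Addr().String(), cleanup
}

// runDemo serves n sequential clients and returns how many received a
// greeting and how many failures the helper reported.
func runDemo(n int) (greeted int, failures int) {
	rep := &errCounter{}
	addr, cleanup := startLoopServer(rep, func(c net.Conn) {
		fmt.Fprint(c, "220 mock ready\r\n")
	})
	for i := 0; i < n; i++ {
		conn, err := net.DialTimeout("tcp", addr, time.Second)
		if err != nil {
			continue
		}
		_ = conn.SetReadDeadline(time.Now().Add(time.Second))
		if _, err := bufio.NewReader(conn).ReadString('\n'); err == nil {
			greeted++
		}
		conn.Close()
	}
	cleanup()
	return greeted, int(rep.n.Load())
}

func main() {
	greeted, failures := runDemo(3)
	fmt.Println("greeted:", greeted, "failures:", failures)
}
```

Waiting on `done` before calling `wg.Wait()` matters: once the accept loop has exited, no further `wg.Add` can race with the wait. In a real test the returned `cleanup` would typically be registered with `t.Cleanup` or deferred by the caller.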
### 4.3 Concurrency and race safety notes
- Accept-loop ownership:
  - One goroutine owns the listener accept loop and is the only goroutine that
    may close `done`.
- Connection handler isolation:
  - One goroutine per accepted connection; no shared mutable state required for
    protocol behavior in this helper.
- Per-connection synchronization:
  - A waitgroup must increment before each handler goroutine starts and must
    decrement on handler exit so cleanup can deterministically wait for active
    handlers to finish.
- Listener close semantics:
  - `listener.Close()` from cleanup is expected to break `Accept()`.
  - Exit condition should avoid noisy test failures on intentional close.
- Cleanup timeout behavior:
  - Timeout remains defensive to prevent suite hangs under pathological CI
    resource starvation.
  - Timeout branch must report failure (e.g., `t.Errorf`/`t.Fatalf` policy) and
    return; no panic and no silent success.
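
The timeout branch described above can be isolated into a small bounded-wait function. The following is a sketch with hypothetical names (`waitOrTimeout`, `simulate`, `cleanExit`, `stuckHandler`); a `time.Sleep` stands in for a connection handler, and the durations are arbitrary values chosen for the demonstration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// waitOrTimeout (hypothetical) reports whether the accept loop (done) and all
// handlers (wg) finished within timeout. It never blocks longer than timeout.
func waitOrTimeout(done <-chan struct{}, wg *sync.WaitGroup, timeout time.Duration) bool {
	finished := make(chan struct{})
	go func() {
		<-done
		wg.Wait()
		close(finished)
	}()
	select {
	case <-finished:
		return true
	case <-time.After(timeout):
		// The caller should report a failure (for example t.Errorf) and return.
		return false
	}
}

// simulate models one handler that runs for handlerTime and returns the
// result of a bounded wait with the given timeout.
func simulate(handlerTime, timeout time.Duration) bool {
	var wg sync.WaitGroup
	done := make(chan struct{})
	wg.Add(1) // incremented before the handler goroutine starts
	go func() {
		defer wg.Done()
		time.Sleep(handlerTime)
	}()
	close(done) // stands in for the accept loop exiting
	return waitOrTimeout(done, &wg, timeout)
}

// cleanExit: the handler finishes well inside the timeout.
func cleanExit() bool { return simulate(10*time.Millisecond, time.Second) }

// stuckHandler: the handler outlives the timeout, so the wait gives up.
func stuckHandler() bool { return simulate(time.Second, 50*time.Millisecond) }

func main() {
	fmt.Println("clean shutdown:", cleanExit())
	fmt.Println("stuck handler:", stuckHandler())
}
```

On the timeout path the inner waiter goroutine stays alive until the handlers eventually return. That is acceptable for a test helper whose job is to fail loudly instead of hanging the suite.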
### 4.4 Error handling policy in helpers
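- Treat only expected shutdown accept errors (listener close path) as normal.
- Surface unexpected `Accept()` failures as test failure signals. From the
  accept goroutine this means `t.Errorf`, because `t.Fatalf` must only be called
  from the goroutine running the test.
- Keep helper logic simple and deterministic; avoid over-engineered retry logic.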
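
One way to separate the expected shutdown error from everything else is `errors.Is(err, net.ErrClosed)`, available since Go 1.16. The sketch below assumes that approach; the function names are hypothetical and the plan itself does not prescribe a specific check.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// isExpectedShutdown (hypothetical) reports whether an Accept error only
// means that the listener was closed on purpose.
func isExpectedShutdown(err error) bool {
	return errors.Is(err, net.ErrClosed)
}

// acceptErrAfterClose closes a fresh listener and returns the error that
// Accept produces afterwards.
func acceptErrAfterClose() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	_ = ln.Close()
	_, err = ln.Accept()
	return err
}

// closedListenerIsExpected checks the intentional-close path.
func closedListenerIsExpected() bool {
	return isExpectedShutdown(acceptErrAfterClose())
}

// otherErrorIsExpected checks an unrelated error, which must not be ignored.
func otherErrorIsExpected() bool {
	return isExpectedShutdown(errors.New("accept: too many open files"))
}

func main() {
	fmt.Println("closed listener treated as normal:", closedListenerIsExpected())
	fmt.Println("other error treated as normal:", otherErrorIsExpected())
}
```

Matching on the sentinel error avoids comparing error strings, which keeps the helper free of retry logic and string parsing.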
## 5) Implementation Plan
### Phase 1: Testing protocol sequencing note
- Policy remains E2E-first globally.
- Explicit exception rationale for this change: scope is backend test-helper-only
  (`mail_service_test.go`) with no production/frontend/runtime behavior delta,
  so targeted backend test-first verification is authorized for this plan.
- Mandatory preflight before unit/coverage steps:
  - `bash scripts/local-patch-report.sh`
  - Artifacts: `test-results/local-patch-report.md` and `test-results/local-patch-report.json`
### Phase 2: Baseline confirmation and failure reproduction context
- Capture current helper behavior and flaky test target:
  - `go test ./backend/internal/services -run TestMailService_TestConnection_StartTLSSuccessWithAuth -count=1 -v`
### Phase 3: Helper lifecycle hardening
- Update `startMockSMTPServer` and `startMockSSLSMTPServer` to loop accept.
- Add per-connection goroutine handling.
- Preserve/strengthen deterministic cleanup using listener close + done wait +
  per-connection waitgroup completion.
### Phase 4: Targeted validation
- Re-run target test repeatedly to validate stability:
  - `go test ./backend/internal/services -run TestMailService_TestConnection_StartTLSSuccessWithAuth -count=20`
- Run mail service test subset:
  - `go test ./backend/internal/services -run TestMailService_ -count=1`
- Run race-focused targeted validation:
  - `go test -race ./backend/internal/services -run 'TestMailService_(TestConnection|Send)' -count=1`
### Phase 5: Backend coverage validation
- Run backend coverage task/script required by repo workflow:
  - Preferred script: `scripts/go-test-coverage.sh`
  - VS Code equivalent: backend coverage task if available.
## 6) Validation Matrix
| Validation Item | Command / Task | Scope | Pass Criteria |
|---|---|---|---|
| Targeted flaky test | `go test ./backend/internal/services -run TestMailService_TestConnection_StartTLSSuccessWithAuth -count=20` | Direct flaky test | No failures across repeated runs |
| Mail service subset | `go test ./backend/internal/services -run TestMailService_ -count=1` | Nearby regression safety | All selected tests pass |
| Race-focused targeted tests | `go test -race ./backend/internal/services -run 'TestMailService_(TestConnection\|Send)' -count=1` | Concurrency/race safety | No race reports; tests pass |
| Package sanity | `go test ./backend/internal/services -count=1` | Service package confidence | Package tests pass |
| Backend coverage gate | `scripts/go-test-coverage.sh` | Repo-required backend coverage check | Meets configured minimum threshold (85% default) |

Notes:
- E2E-first protocol remains project-wide policy. This plan uses the explicit
  backend-only targeted-test exception because scope is confined to test helper
  internals with no production/UI behavior changes.
- Local patch report preflight is required before unit/coverage gates.
## 7) Risk Assessment
Risk level: Low.
- Change type: test-only.
- Production code: untouched.
- Runtime behavior: unchanged for shipped binary.
- Primary risk: helper lifecycle bug causing test hangs.
- Mitigation: bounded cleanup timeout, accept-loop exit on listener close,
focused repeated-run validation.
## 8) PR Slicing Strategy
Decision: Single PR.

Rationale:
- Extremely small scope (one test file, two helper functions).
- No cross-domain dependencies.
- Easier review and rollback.
### PR-1 (single slice)
- Scope:
  - `backend/internal/services/mail_service_test.go` helper lifecycle updates.
- Dependencies:
  - None.
- Validation gates:
  - Validation matrix in Section 6.
- Rollback contingency:
  - Revert single PR if instability increases.
## 9) Config File Review Outcome
Reviewed for this request:
- `.gitignore`
- `codecov.yml`
- `.dockerignore`
- `Dockerfile`

Suggested updates:
- None required for this scope.
- Revisit only if implementation introduces new generated artifacts or test
  output paths not currently handled (not expected here).
## 10) Acceptance Criteria
- AC1: `startMockSMTPServer` accepts in loop until cleanup and no longer exits
after a single connection.
- AC2: `startMockSSLSMTPServer` accepts in loop until cleanup and no longer
exits after a single connection.
- AC3: Each accepted connection is handled in its own goroutine.
- AC4: Cleanup closes listener and uses done-channel plus per-connection
waitgroup synchronization with bounded timeout.
- AC5: Unexpected `Accept()` failures are surfaced as test failure signals;
expected listener-close shutdown errors are treated as normal.
- AC6: `TestMailService_TestConnection_StartTLSSuccessWithAuth` passes reliably
under repeated runs.
- AC7: Targeted race validation for mail-service tests passes with `go test -race`.
- AC8: Cleanup timeout path reports failure and returns (non-hanging), never
silent pass.
- AC9: Backend coverage script/task completes successfully at configured
threshold.
- AC10: No production file changes are included in the implementation PR.
## 11) Definition of Done
- Planned helper changes are implemented only in the scoped functions.
- Cleanup deterministically waits for accept-loop exit and active handlers via
done + waitgroup synchronization.
- Only expected listener-close shutdown accept errors are non-fatal; unexpected
accept errors fail tests.
- Cleanup timeout is reported as failure signal and cannot pass silently.
- Validation matrix passes.
- Diff is limited to test helper scope in `mail_service_test.go`.
- No updates to `.gitignore`, `codecov.yml`, `.dockerignore`, or `Dockerfile`
are required or included.