| post_title | categories | tags | summary | post_date |
|---|---|---|---|---|
| Current Spec: Stabilize Flaky SMTP STARTTLS+AUTH Unit Test | | | Implementation plan to remove CI flakiness in MailService STARTTLS+AUTH tests by hardening mock SMTP server lifecycle and connection handling in test helpers only. | 2026-02-22 |
Active Plan: Stabilize Flaky SMTP STARTTLS+AUTH Unit Test
Date: 2026-02-22

Status: Active and authoritative

Scope Type: Test reliability hardening (backend unit tests only)
1) Introduction
This plan addresses flakiness in the backend unit test
TestMailService_TestConnection_StartTLSSuccessWithAuth by improving the mock SMTP
test helpers used by backend/internal/services/mail_service_test.go.
Root-cause hypothesis to validate and fix:
- Existing mock SMTP server helpers accept only one connection, then exit.
- STARTTLS + AUTH flows in net/smtp are negotiation-heavy and can involve additional command/connection behavior under CI timing variance.
- Single-accept test server behavior creates race-prone shutdown windows.
Goal:
- Make test helper servers robust and deterministic by accepting connections in a loop until cleanup, handling each connection in its own goroutine, and enforcing deterministic shutdown that waits for accept-loop exit and active handler goroutines (explicit per-connection synchronization via waitgroup).
Non-goal:
- No production mail service behavior changes.
2) Research Findings
2.1 Current flaky target and helper topology
Primary target test:
- File: backend/internal/services/mail_service_test.go
- Test: TestMailService_TestConnection_StartTLSSuccessWithAuth
Helper functions in same file (current behavior):
- startMockSMTPServer(...)
  - Single Accept() call, single connection handler, then goroutine exits.
- startMockSSLSMTPServer(...)
  - Single Accept() call, single connection handler, then goroutine exits.
- handleSMTPConn(...)
  - Handles SMTP protocol conversation for each accepted connection.
2.2 Flakiness vector summary
- Current single-accept model is fragile for tests where client behavior may include additional negotiation/timing paths.
- Cleanup currently closes listener and waits on a done signal, but done is tied to a single accept goroutine rather than a full server lifecycle.
- Under CI contention, this can surface nondeterministic failures in tests expecting reliable STARTTLS + AUTH handshake behavior.
2.3 Repo config review requested by user
Reviewed files:
- .gitignore
- codecov.yml
- .dockerignore
- Dockerfile
Conclusion for this scope:
- No changes required for this test-only helper stabilization.
- Existing ignore/coverage/docker config does not block this plan.
3) Requirements (EARS)
- R1: WHEN mock SMTP test helpers are started, THE SYSTEM SHALL accept connections in a loop until explicit cleanup closes the listener.
- R2: WHEN a connection is accepted, THE SYSTEM SHALL handle it in its own goroutine so one slow session cannot block new accepts.
- R3: WHEN cleanup is invoked, THE SYSTEM SHALL close the listener and await deterministic server-loop termination plus completion of active connection handlers (done channel + waitgroup synchronization).
- R4: IF cleanup wait exceeds timeout, THEN THE SYSTEM SHALL report a failure signal and return without hanging test execution.
- R5: WHEN this fix is implemented, THE SYSTEM SHALL keep production code untouched and restrict changes to test helper scope.
- R6: WHEN targeted backend tests run, THE SYSTEM SHALL pass reliably in local and CI-like conditions.
4) Technical Specification
4.1 Exact target files and functions (minimal diff scope)
Primary file to edit:
backend/internal/services/mail_service_test.go
Functions in scope:
- startMockSMTPServer(t *testing.T, tlsConf *tls.Config, supportStartTLS bool, requireAuth bool) (string, func())
- startMockSSLSMTPServer(t *testing.T, tlsConf *tls.Config, requireAuth bool) (string, func())
Related function (read-only unless strictly necessary):
handleSMTPConn(conn net.Conn, tlsConf *tls.Config, supportStartTLS bool, requireAuth bool)
Out of scope:
- backend/internal/services/mail_service.go
- Any non-mail-service test files.
4.2 Planned helper behavior changes
For both helper server starters:
- Replace single Accept() with accept loop:
  - Continue accepting until listener close returns an error.
  - Treat listener-close accept errors as normal shutdown path.
- Spawn per-connection handler goroutine:
  - Each accepted connection handled independently.
  - Connection closed within goroutine after handling.
- Deterministic lifecycle signaling:
  - Keep done channel signaling tied to server accept-loop exit.
  - Ensure exactly one close of done from server goroutine.
  - Track active connection-handler goroutines with explicit waitgroup sync.
- Cleanup contract:
  - cleanup() closes listener first.
  - Wait for accept-loop exit and active handler completion with bounded timeout (2s currently acceptable).
  - Never block indefinitely.
  - Timeout path is a test failure signal (non-hanging), not a silent pass.
4.3 Concurrency and race safety notes
- Accept-loop ownership:
  - One goroutine owns listener accept loop and is the only goroutine that may close done.
- Connection handler isolation:
  - One goroutine per accepted connection; no shared mutable state required for protocol behavior in this helper.
- Per-connection synchronization:
  - A waitgroup must increment before each handler goroutine starts and must decrement on handler exit so cleanup can deterministically wait for active handlers to finish.
- Listener close semantics:
  - listener.Close() from cleanup is expected to break Accept().
  - Exit condition should avoid noisy test failures on intentional close.
- Cleanup timeout behavior:
  - Timeout remains defensive to prevent suite hangs under pathological CI resource starvation.
  - Timeout branch must report failure (e.g., t.Errorf/t.Fatalf policy) and return; no panic and no silent success.
4.4 Error handling policy in helpers
- Treat only expected shutdown accept errors (listener close path) as normal.
- Surface unexpected Accept() failures as test failure signals.
- Keep helper logic simple and deterministic; avoid over-engineered retry logic.
5) Implementation Plan
Phase 1: Testing protocol sequencing note
- Policy remains E2E-first globally.
- Explicit exception rationale for this change: scope is backend test-helper-only (mail_service_test.go) with no production/frontend/runtime behavior delta, so targeted backend test-first verification is authorized for this plan.
- Mandatory preflight before unit/coverage steps:
  - bash scripts/local-patch-report.sh
  - Artifacts: test-results/local-patch-report.md and test-results/local-patch-report.json
Phase 2: Baseline confirmation and failure reproduction context
- Capture current helper behavior and flaky test target:
go test ./backend/internal/services -run TestMailService_TestConnection_StartTLSSuccessWithAuth -count=1 -v
Phase 3: Helper lifecycle hardening
- Update startMockSMTPServer and startMockSSLSMTPServer to loop accept.
- Add per-connection goroutine handling.
- Preserve/strengthen deterministic cleanup using listener close + done wait + per-connection waitgroup completion.
Phase 4: Targeted validation
- Re-run target test repeatedly to validate stability:
go test ./backend/internal/services -run TestMailService_TestConnection_StartTLSSuccessWithAuth -count=20
- Run mail service test subset:
go test ./backend/internal/services -run TestMailService_ -count=1
- Run race-focused targeted validation:
go test -race ./backend/internal/services -run 'TestMailService_(TestConnection|Send)' -count=1
Phase 5: Backend coverage validation
- Run backend coverage task/script required by repo workflow:
  - Preferred script: scripts/go-test-coverage.sh
  - VS Code equivalent: backend coverage task if available.
6) Validation Matrix
| Validation Item | Command / Task | Scope | Pass Criteria |
|---|---|---|---|
| Targeted flaky test | `go test ./backend/internal/services -run TestMailService_TestConnection_StartTLSSuccessWithAuth -count=20` | Direct flaky test | No failures across repeated runs |
| Mail service subset | `go test ./backend/internal/services -run TestMailService_ -count=1` | Nearby regression safety | All selected tests pass |
| Race-focused targeted tests | `go test -race ./backend/internal/services -run 'TestMailService_(TestConnection\|Send)' -count=1` | Concurrency/race safety | Passes under `go test -race` (per AC7) |
| Package sanity | `go test ./backend/internal/services -count=1` | Service package confidence | Package tests pass |
| Backend coverage gate | `scripts/go-test-coverage.sh` | Repo-required backend coverage check | Meets configured minimum threshold (85% default) |
Notes:
- E2E-first protocol remains project-wide policy. This plan uses the explicit backend-only targeted-test exception because scope is confined to test helper internals with no production/UI behavior changes.
- Local patch report preflight is required before unit/coverage gates.
7) Risk Assessment
Risk level: Low.
- Change type: test-only.
- Production code: untouched.
- Runtime behavior: unchanged for shipped binary.
- Primary risk: helper lifecycle bug causing test hangs.
- Mitigation: bounded cleanup timeout, accept-loop exit on listener close, focused repeated-run validation.
8) PR Slicing Strategy
Decision: Single PR.
Rationale:
- Extremely small scope (one test file, two helper functions).
- No cross-domain dependencies.
- Easier review and rollback.
PR-1 (single slice)
- Scope:
  - backend/internal/services/mail_service_test.go helper lifecycle updates.
- Dependencies:
  - None.
- Validation gates:
  - Validation matrix in Section 6.
- Rollback contingency:
  - Revert single PR if instability increases.
9) Config File Review Outcome
Reviewed for this request:
- .gitignore
- codecov.yml
- .dockerignore
- Dockerfile
Suggested updates:
- None required for this scope.
- Revisit only if implementation introduces new generated artifacts or test output paths not currently handled (not expected here).
10) Acceptance Criteria
- AC1: startMockSMTPServer accepts in loop until cleanup and no longer exits after a single connection.
- AC2: startMockSSLSMTPServer accepts in loop until cleanup and no longer exits after a single connection.
- AC3: Each accepted connection is handled in its own goroutine.
- AC4: Cleanup closes listener and uses done-channel plus per-connection waitgroup synchronization with bounded timeout.
- AC5: Unexpected Accept() failures are surfaced as test failure signals; expected listener-close shutdown errors are treated as normal.
- AC6: TestMailService_TestConnection_StartTLSSuccessWithAuth passes reliably under repeated runs.
- AC7: Targeted race validation for mail-service tests passes with go test -race.
- AC8: Cleanup timeout path reports failure and returns (non-hanging), never silent pass.
- AC9: Backend coverage script/task completes successfully at configured threshold.
- AC10: No production file changes are included in the implementation PR.
11) Definition of Done
- Planned helper changes are implemented exactly in scoped functions.
- Cleanup deterministically waits for accept-loop exit and active handlers via done + waitgroup synchronization.
- Only expected listener-close shutdown accept errors are non-fatal; unexpected accept errors fail tests.
- Cleanup timeout is reported as failure signal and cannot pass silently.
- Validation matrix passes.
- Diff is limited to test helper scope in mail_service_test.go.
- No updates to .gitignore, codecov.yml, .dockerignore, or Dockerfile are required or included.