
post_title, categories, tags, summary, post_date
Current Spec: Stabilize Flaky SMTP STARTTLS+AUTH Unit Test
actions
testing
backend
reliability
go
smtp
starttls
unit-tests
ci-stability
Implementation plan to remove CI flakiness in MailService STARTTLS+AUTH tests by hardening mock SMTP server lifecycle and connection handling in test helpers only.
2026-02-22

Active Plan: Stabilize Flaky SMTP STARTTLS+AUTH Unit Test

Date: 2026-02-22
Status: Active and authoritative
Scope Type: Test reliability hardening (backend unit tests only)

1) Introduction

This plan addresses flakiness in the backend unit test TestMailService_TestConnection_StartTLSSuccessWithAuth by hardening the mock SMTP test helpers defined in backend/internal/services/mail_service_test.go.

Root-cause hypothesis to validate and fix:

  • Existing mock SMTP server helpers accept only one connection, then exit.
  • STARTTLS + AUTH flows in net/smtp are negotiation-heavy (EHLO, STARTTLS, TLS handshake, a second EHLO, then AUTH) and may involve additional command/connection behavior under CI timing variance.
  • Single-accept test server behavior creates race-prone shutdown windows.

Goal:

  • Make test helper servers robust and deterministic by:
    • accepting connections in a loop until cleanup;
    • handling each connection in its own goroutine;
    • enforcing deterministic shutdown that waits for accept-loop exit and for active handler goroutines (explicit per-connection synchronization via a waitgroup).

Non-goal:

  • No production mail service behavior changes.

2) Research Findings

2.1 Current flaky target and helper topology

Primary target test:

  • backend/internal/services/mail_service_test.go
    • TestMailService_TestConnection_StartTLSSuccessWithAuth

Helper functions in same file (current behavior):

  • startMockSMTPServer(...)
    • Single Accept() call, single connection handler, then goroutine exits.
  • startMockSSLSMTPServer(...)
    • Single Accept() call, single connection handler, then goroutine exits.
  • handleSMTPConn(...)
    • Handles SMTP protocol conversation for each accepted connection.

2.2 Flakiness vector summary

  • Current single-accept model is fragile for tests where client behavior may include additional negotiation/timing paths.
  • Cleanup currently closes listener and waits on a done signal, but done is tied to a single accept goroutine rather than a full server lifecycle.
  • Under CI contention, this can surface nondeterministic failures in tests expecting reliable STARTTLS + AUTH handshake behavior.
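
The single-accept shape described in 2.1 and 2.2 can be illustrated with a minimal sketch. This is not the code in mail_service_test.go; startSingleAcceptServer, greeted, and singleAcceptDemo are hypothetical names, and the server only sends a greeting rather than running a full SMTP conversation. It shows that done closes after the first session while the listener is still open, and that a second client is never greeted.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"time"
)

// startSingleAcceptServer is a hypothetical reduction of the helper shape
// described above: one Accept, one handled connection, then the goroutine exits.
func startSingleAcceptServer() (net.Listener, <-chan struct{}, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, nil, err
	}
	done := make(chan struct{})
	go func() {
		defer close(done) // fires after the first session, not at shutdown
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		fmt.Fprint(conn, "220 mock ready\r\n")
	}()
	return ln, done, nil
}

// greeted dials addr and reports whether a greeting line arrives within wait.
func greeted(addr string, wait time.Duration) bool {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return false
	}
	defer conn.Close()
	_ = conn.SetReadDeadline(time.Now().Add(wait))
	_, err = bufio.NewReader(conn).ReadString('\n')
	return err == nil
}

// singleAcceptDemo reports whether the first and second clients were greeted.
func singleAcceptDemo() (first, second bool, err error) {
	ln, done, err := startSingleAcceptServer()
	if err != nil {
		return false, false, err
	}
	defer ln.Close()
	addr := ln.Addr().String()

	first = greeted(addr, time.Second)
	if !first {
		return false, false, nil
	}
	<-done // already closed although the listener is still open
	second = greeted(addr, 200*time.Millisecond)
	return first, second, nil
}
```

In this sketch the second dial still succeeds because the listener remains open, but nothing accepts or answers it, so the read hits its deadline. Whether the real flaky test opens a second connection is part of the hypothesis to validate, not something this sketch proves.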

2.3 Repo config review requested by user

Reviewed files:

  • .gitignore
  • codecov.yml
  • .dockerignore
  • Dockerfile

Conclusion for this scope:

  • No changes required for this test-only helper stabilization.
  • Existing ignore/coverage/docker config does not block this plan.

3) Requirements (EARS)

  • R1: WHEN mock SMTP test helpers are started, THE SYSTEM SHALL accept connections in a loop until explicit cleanup closes the listener.
  • R2: WHEN a connection is accepted, THE SYSTEM SHALL handle it in its own goroutine so one slow session cannot block new accepts.
  • R3: WHEN cleanup is invoked, THE SYSTEM SHALL close the listener and await deterministic server-loop termination plus completion of active connection handlers (done channel + waitgroup synchronization).
  • R4: IF cleanup wait exceeds timeout, THEN THE SYSTEM SHALL report a failure signal and return without hanging test execution.
  • R5: WHEN this fix is implemented, THE SYSTEM SHALL keep production code untouched and restrict changes to test helper scope.
  • R6: WHEN targeted backend tests run, THE SYSTEM SHALL pass reliably in local and CI-like conditions.
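
R2 can be illustrated with a small comparison, assuming a simplified session that only sends a greeting and then waits for the client. The names session, serve, readGreeting, and secondClientServed are hypothetical, and shutdown synchronization (R3, R4) is deliberately left out to keep the focus on accept behavior.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"time"
)

// session greets the client, then stays open until the client sends a line
// or disconnects. It stands in for a slow SMTP conversation.
func session(conn net.Conn) {
	defer conn.Close()
	fmt.Fprint(conn, "220 mock ready\r\n")
	_, _ = bufio.NewReader(conn).ReadString('\n')
}

// serve accepts until the listener is closed. With concurrent=false the
// accept loop is blocked for as long as the current session lasts.
func serve(ln net.Listener, concurrent bool) {
	for {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		if concurrent {
			go session(conn)
		} else {
			session(conn)
		}
	}
}

// readGreeting waits up to wait for one greeting line on conn.
func readGreeting(conn net.Conn, wait time.Duration) error {
	_ = conn.SetReadDeadline(time.Now().Add(wait))
	_, err := bufio.NewReader(conn).ReadString('\n')
	return err
}

// secondClientServed holds a first session open and reports whether a
// second client is greeted in the meantime.
func secondClientServed(concurrent bool) (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	defer ln.Close()
	go serve(ln, concurrent)

	first, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return false, err
	}
	defer first.Close() // the first session stays open until we return
	if err := readGreeting(first, time.Second); err != nil {
		return false, err
	}

	second, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return false, err
	}
	defer second.Close()
	return readGreeting(second, 300*time.Millisecond) == nil, nil
}
```

Calling secondClientServed(true) models the per-connection goroutine required by R2; secondClientServed(false) models inline handling, where the second client waits behind the first session.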

4) Technical Specification

4.1 Exact target files and functions (minimal diff scope)

Primary file to edit:

  • backend/internal/services/mail_service_test.go

Functions in scope:

  • startMockSMTPServer(t *testing.T, tlsConf *tls.Config, supportStartTLS bool, requireAuth bool) (string, func())
  • startMockSSLSMTPServer(t *testing.T, tlsConf *tls.Config, requireAuth bool) (string, func())

Related function (read-only unless strictly necessary):

  • handleSMTPConn(conn net.Conn, tlsConf *tls.Config, supportStartTLS bool, requireAuth bool)

Out of scope:

  • backend/internal/services/mail_service.go
  • Any non-mail-service test files.

4.2 Planned helper behavior changes

For both helper server starters:

  1. Replace single Accept() with accept loop:
    • Continue accepting until listener close returns an error.
    • Treat listener-close accept errors as normal shutdown path.
  2. Spawn per-connection handler goroutine:
    • Each accepted connection handled independently.
    • Connection closed within goroutine after handling.
  3. Deterministic lifecycle signaling:
    • Keep done channel signaling tied to server accept-loop exit.
    • Ensure exactly one close of done from server goroutine.
    • Track active connection-handler goroutines with explicit waitgroup sync.
  4. Cleanup contract:
    • cleanup() closes listener first.
    • Wait for accept-loop exit and active handler completion with bounded timeout (2s currently acceptable).
    • Never block indefinitely.
    • Timeout path is a test failure signal (non-hanging), not a silent pass.
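
The four steps above could be combined roughly as follows. This is a sketch under stated assumptions, not the planned diff: startLoopServer, greet, and readGreeting are hypothetical names, the handler only sends a greeting instead of calling handleSMTPConn, and cleanup returns an error where the real helpers would report through *testing.T.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"net"
	"sync"
	"time"
)

// startLoopServer (hypothetical) accepts connections until cleanup closes the
// listener and runs handle for each connection in its own goroutine.
func startLoopServer(handle func(net.Conn)) (addr string, cleanup func() error, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}

	done := make(chan struct{}) // closed exactly once, by the accept loop
	var wg sync.WaitGroup       // tracks active connection handlers

	go func() {
		defer close(done)
		for {
			conn, err := ln.Accept()
			if err != nil {
				return // sketch: every accept error ends the loop
			}
			wg.Add(1) // increment before the handler goroutine starts
			go func() {
				defer wg.Done()
				defer conn.Close()
				handle(conn)
			}()
		}
	}()

	cleanup = func() error {
		_ = ln.Close() // step 4: close the listener first to unblock Accept
		finished := make(chan struct{})
		go func() {
			<-done    // accept loop exited, so no further wg.Add calls
			wg.Wait() // all in-flight handlers have returned
			close(finished)
		}()
		select {
		case <-finished:
			return nil
		case <-time.After(2 * time.Second):
			return errors.New("mock server cleanup timed out")
		}
	}
	return ln.Addr().String(), cleanup, nil
}

// greet is a placeholder handler that only sends an SMTP-style greeting.
func greet(conn net.Conn) {
	fmt.Fprint(conn, "220 mock ESMTP ready\r\n")
}

// readGreeting dials addr and returns the first line sent by the server.
func readGreeting(addr string) (string, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	_ = conn.SetReadDeadline(time.Now().Add(2 * time.Second))
	return bufio.NewReader(conn).ReadString('\n')
}
```

Waiting on done before wg.Wait matters in this design: once the accept loop has exited, no new handler can be registered, so the waitgroup count can only fall.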

4.3 Concurrency and race safety notes

  • Accept-loop ownership:
    • One goroutine owns listener accept loop and is the only goroutine that may close done.
  • Connection handler isolation:
    • One goroutine per accepted connection; no shared mutable state required for protocol behavior in this helper.
  • Per-connection synchronization:
    • A waitgroup must increment before each handler goroutine starts and must decrement on handler exit so cleanup can deterministically wait for active handlers to finish.
  • Listener close semantics:
    • listener.Close() from cleanup is expected to break Accept().
    • Exit condition should avoid noisy test failures on intentional close.
  • Cleanup timeout behavior:
    • Timeout remains defensive to prevent suite hangs under pathological CI resource starvation.
    • Timeout branch must report failure (e.g., via t.Errorf or t.Fatalf, per test policy) and return; no panic and no silent success.
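
One way to express the timeout note is a bounded wait that reports through a small interface. In this sketch failureReporter, awaitShutdown, recorder, and failuresAfterShutdown are hypothetical names; *testing.T satisfies failureReporter through its Errorf method, and recorder stands in for it so the sketch can run outside a test. The 100ms timeout is chosen only to keep the demonstration fast.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// failureReporter is the subset of *testing.T this sketch needs.
type failureReporter interface {
	Errorf(format string, args ...any)
}

// awaitShutdown waits for the accept loop (done) and all handlers (wg).
// On timeout it reports a failure and returns false instead of hanging.
func awaitShutdown(t failureReporter, done <-chan struct{}, wg *sync.WaitGroup, timeout time.Duration) bool {
	finished := make(chan struct{})
	go func() {
		<-done
		wg.Wait()
		close(finished)
	}()
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-finished:
		return true
	case <-timer.C:
		t.Errorf("mock SMTP server did not shut down within %s", timeout)
		return false
	}
}

// recorder collects failure messages in place of *testing.T.
type recorder struct{ failures []string }

func (r *recorder) Errorf(format string, args ...any) {
	r.failures = append(r.failures, fmt.Sprintf(format, args...))
}

// failuresAfterShutdown simulates one handler, optionally stuck past the
// timeout, and returns how many failures were reported.
func failuresAfterShutdown(stuck bool) int {
	rec := &recorder{}
	done := make(chan struct{})
	close(done) // pretend the accept loop has already exited

	var wg sync.WaitGroup
	release := make(chan struct{})
	wg.Add(1)
	go func() {
		defer wg.Done()
		if stuck {
			<-release // simulated handler that outlives the timeout
		}
	}()

	awaitShutdown(rec, done, &wg, 100*time.Millisecond)
	close(release) // let the simulated handler finish
	return len(rec.failures)
}
```

On the timeout path the internal waiter goroutine keeps running until the handlers eventually finish; that is an accepted cost of never blocking the test itself.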

4.4 Error handling policy in helpers

  • Treat only expected shutdown accept errors (listener close path) as normal.
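  • Surface unexpected Accept() failures as test failure signals. From the accept goroutine this means t.Errorf rather than t.Fatalf, because t.Fatalf must be called from the goroutine running the test.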
  • Keep helper logic simple and deterministic; avoid over-engineered retry logic.
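
A minimal sketch of this policy, assuming the standard-library sentinel net.ErrClosed (available since Go 1.16) is used to recognize the listener-close path. The names isExpectedShutdown, acceptLoop, failingListener, closedListener, and reportedErrors are hypothetical, and handler tracking via waitgroup is omitted here for brevity.

```go
package main

import (
	"errors"
	"net"
)

// errSimulated stands in for an unexpected Accept failure.
var errSimulated = errors.New("simulated unexpected accept failure")

// isExpectedShutdown reports whether err is the error Accept returns
// after the listener has been closed.
func isExpectedShutdown(err error) bool {
	return errors.Is(err, net.ErrClosed)
}

// acceptLoop accepts until Accept fails. A closed listener is the normal
// exit path; any other error is passed to report before returning.
func acceptLoop(ln net.Listener, handle func(net.Conn), report func(error)) {
	for {
		conn, err := ln.Accept()
		if err != nil {
			if !isExpectedShutdown(err) {
				report(err)
			}
			return
		}
		go handle(conn)
	}
}

// failingListener is a test double whose Accept always fails unexpectedly.
type failingListener struct{ net.Listener }

func (failingListener) Accept() (net.Conn, error) { return nil, errSimulated }

// closedListener returns a real TCP listener that has already been closed.
func closedListener() (net.Listener, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	return ln, ln.Close()
}

// reportedErrors runs acceptLoop to completion and counts reported errors.
func reportedErrors(ln net.Listener) int {
	count := 0
	acceptLoop(ln, func(c net.Conn) { c.Close() }, func(error) { count++ })
	return count
}
```

In the real helpers the report callback would typically be a closure over t.Errorf, which keeps the accept goroutine free of t.Fatalf calls.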

5) Implementation Plan

Phase 1: Testing protocol sequencing note

  • Policy remains E2E-first globally.
  • Explicit exception rationale for this change: scope is backend test-helper-only (mail_service_test.go) with no production/frontend/runtime behavior delta, so targeted backend test-first verification is authorized for this plan.
  • Mandatory preflight before unit/coverage steps:
    • bash scripts/local-patch-report.sh
    • Artifacts: test-results/local-patch-report.md and test-results/local-patch-report.json

Phase 2: Baseline confirmation and failure reproduction context

  • Capture current helper behavior and flaky test target:
    • go test ./backend/internal/services -run TestMailService_TestConnection_StartTLSSuccessWithAuth -count=1 -v

Phase 3: Helper lifecycle hardening

  • Update startMockSMTPServer and startMockSSLSMTPServer to loop accept.
  • Add per-connection goroutine handling.
  • Preserve/strengthen deterministic cleanup using listener close + done wait + per-connection waitgroup completion.

Phase 4: Targeted validation

  • Re-run target test repeatedly to validate stability:
    • go test ./backend/internal/services -run TestMailService_TestConnection_StartTLSSuccessWithAuth -count=20
  • Run mail service test subset:
    • go test ./backend/internal/services -run TestMailService_ -count=1
  • Run race-focused targeted validation:
    • go test -race ./backend/internal/services -run 'TestMailService_(TestConnection|Send)' -count=1

Phase 5: Backend coverage validation

  • Run backend coverage task/script required by repo workflow:
    • Preferred script: scripts/go-test-coverage.sh
    • VS Code equivalent: backend coverage task if available.

6) Validation Matrix

Each validation item lists its command or task, scope, and pass criteria:

  • Targeted flaky test
    • Command / Task: go test ./backend/internal/services -run TestMailService_TestConnection_StartTLSSuccessWithAuth -count=20
    • Scope: Direct flaky test
    • Pass Criteria: No failures across repeated runs
  • Mail service subset
    • Command / Task: go test ./backend/internal/services -run TestMailService_ -count=1
    • Scope: Nearby regression safety
    • Pass Criteria: All selected tests pass
  • Race-focused targeted tests
    • Command / Task: go test -race ./backend/internal/services -run 'TestMailService_(TestConnection|Send)' -count=1
    • Scope: Concurrency/race safety
    • Pass Criteria: Selected tests pass under go test -race (per AC7)
  • Package sanity
    • Command / Task: go test ./backend/internal/services -count=1
    • Scope: Service package confidence
    • Pass Criteria: Package tests pass
  • Backend coverage gate
    • Command / Task: scripts/go-test-coverage.sh
    • Scope: Repo-required backend coverage check
    • Pass Criteria: Meets configured minimum threshold (85% default)

Notes:

  • E2E-first protocol remains project-wide policy. This plan uses the explicit backend-only targeted-test exception because scope is confined to test helper internals with no production/UI behavior changes.
  • Local patch report preflight is required before unit/coverage gates.

7) Risk Assessment

Risk level: Low.

  • Change type: test-only.
  • Production code: untouched.
  • Runtime behavior: unchanged for shipped binary.
  • Primary risk: helper lifecycle bug causing test hangs.
  • Mitigation: bounded cleanup timeout, accept-loop exit on listener close, focused repeated-run validation.

8) PR Slicing Strategy

Decision: Single PR.

Rationale:

  • Extremely small scope (one test file, two helper functions).
  • No cross-domain dependencies.
  • Easier review and rollback.

PR-1 (single slice)

  • Scope:
    • backend/internal/services/mail_service_test.go helper lifecycle updates.
  • Dependencies:
    • None.
  • Validation gates:
    • Validation matrix in Section 6.
  • Rollback contingency:
    • Revert single PR if instability increases.

9) Config File Review Outcome

Reviewed for this request:

  • .gitignore
  • codecov.yml
  • .dockerignore
  • Dockerfile

Suggested updates:

  • None required for this scope.
  • Revisit only if implementation introduces new generated artifacts or test output paths not currently handled (not expected here).

10) Acceptance Criteria

  • AC1: startMockSMTPServer accepts in loop until cleanup and no longer exits after a single connection.
  • AC2: startMockSSLSMTPServer accepts in loop until cleanup and no longer exits after a single connection.
  • AC3: Each accepted connection is handled in its own goroutine.
  • AC4: Cleanup closes listener and uses done-channel plus per-connection waitgroup synchronization with bounded timeout.
  • AC5: Unexpected Accept() failures are surfaced as test failure signals; expected listener-close shutdown errors are treated as normal.
  • AC6: TestMailService_TestConnection_StartTLSSuccessWithAuth passes reliably under repeated runs.
  • AC7: Targeted race validation for mail-service tests passes with go test -race.
  • AC8: Cleanup timeout path reports failure and returns (non-hanging), never silent pass.
  • AC9: Backend coverage script/task completes successfully at configured threshold.
  • AC10: No production file changes are included in the implementation PR.

11) Definition of Done

  • Planned helper changes are implemented only in the scoped functions listed in Section 4.1.
  • Cleanup deterministically waits for accept-loop exit and active handlers via done + waitgroup synchronization.
  • Only expected listener-close shutdown accept errors are non-fatal; unexpected accept errors fail tests.
  • Cleanup timeout is reported as failure signal and cannot pass silently.
  • Validation matrix passes.
  • Diff is limited to test helper scope in mail_service_test.go.
  • No updates to .gitignore, codecov.yml, .dockerignore, or Dockerfile are required or included.