Charon/docs/plans/archive/shard1_investigation_summary.md
2026-03-04 18:34:49 +00:00

Shard 1 Investigation Summary

Date: 2026-02-03
Status: Root Cause Identified - Fix Ready
CI Run: https://github.com/Wikid82/Charon/actions/runs/21613888904


Problem Statement

After completing Phases 1-3 of timeout remediation (semantic wait helpers, coverage improvements):

  • Shard 1 failed on ALL 3 browsers (Chromium, Firefox, WebKit)
  • Shards 2 & 3 passed
  • Shard 4: Cancelled (never ran)
  • Overall success rate: 50% (6/12 jobs passing; the 3 cancelled Shard 4 jobs are counted as not passing)

Investigation Findings

1. Shard Distribution Analysis

50 total test files → 4 shards = ~12.5 files per shard

Shard 1 (Files 1-13):

✅ tests/core/access-lists-crud.spec.ts      (32 timeout replacements)
✅ tests/core/authentication.spec.ts         (1 timeout replacement)
✅ tests/core/certificates.spec.ts           (20 timeout replacements)
   tests/core/dashboard.spec.ts
   tests/core/navigation.spec.ts
✅ tests/core/proxy-hosts.spec.ts            (38 timeout replacements)
   tests/dns-provider-crud.spec.ts
   tests/dns-provider-types.spec.ts
   tests/emergency-server/emergency-server.spec.ts
   tests/emergency-server/tier2-validation.spec.ts
   tests/integration/backup-restore-e2e.spec.ts
   tests/integration/import-to-production.spec.ts
   tests/integration/multi-feature-workflows.spec.ts

Critical Pattern: 4 out of 13 files (31%) were refactored in Phase 2 to use wait-helpers.ts

Total Impact: 91 timeout replacements in Shard 1 using new wait helpers
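
The arithmetic above can be checked with a short sketch. It assumes files are split into contiguous, near-equal groups, which is a simplification: how Playwright actually distributes work may depend on configuration, so the real shard boundaries could differ. `shardSizes` is a hypothetical helper, not part of the Charon test suite.

```typescript
// Hypothetical helper: split `total` items into `shards` contiguous groups,
// giving the earlier groups one extra item when the division is uneven.
function shardSizes(total: number, shards: number): number[] {
  const base = Math.floor(total / shards);
  const extra = total % shards;
  return Array.from({ length: shards }, (_, i) => base + (i < extra ? 1 : 0));
}

// 50 test files across 4 shards, as stated in this summary.
const sizes = shardSizes(50, 4);

// Timeout replacements per refactored Shard 1 file, taken from the list above.
const replacements = [32, 1, 20, 38];
const totalReplacements = replacements.reduce((sum, n) => sum + n, 0);

// Share of Shard 1 files that were refactored in Phase 2, as a whole percent.
const refactoredShare = Math.round((replacements.length / sizes[0]) * 100);

console.log(sizes, totalReplacements, `${refactoredShare}%`);
```

Under these assumptions the first shard receives 13 files, which matches the "Files 1-13" listing.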

2. Local vs CI Differences

| Factor | Local | CI | Impact |
|--------|-------|----|--------|
| Workers | Default (CPU/2) | 1 | CI serializes execution |
| Retries | 0 | 2 | CI masks intermittent issues |
| Module Cache | Warm (parallel) | Cold (sequential) | CI slower module resolution |
| Test Result | Pass | Fail | Environment-specific issue |
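
A minimal sketch of how the worker and retry values in the table are commonly derived from a CI flag. The helper name `runnerSettings` is hypothetical, and the project's actual Playwright configuration may be written differently; `undefined` for `workers` is assumed to mean "let the test runner pick its default".

```typescript
interface RunnerSettings {
  workers: number | undefined; // undefined leaves the choice to the test runner
  retries: number;
}

// Hypothetical helper mirroring the Local and CI columns of the table.
function runnerSettings(isCI: boolean): RunnerSettings {
  return {
    workers: isCI ? 1 : undefined, // CI: one worker, so tests run serially
    retries: isCI ? 2 : 0,         // CI: retry twice; locally: fail fast
  };
}

const localSettings = runnerSettings(false);
const ciSettings = runnerSettings(true);

console.log({ localSettings, ciSettings });
```

Running the suite locally with CI=true is what makes the local run pick up the CI column of the table.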

3. Code Analysis

Dynamic Imports in wait-helpers.ts (2 locations):

Location 1: Lines 69-70 in clickAndWaitForResponse()

const { clickSwitch } = await import('./ui-helpers');

Location 2: Lines 108-109 in clickSwitchAndWaitForResponse()

const { clickSwitch } = await import('./ui-helpers');

Why This Is Problematic:

  1. Dynamic imports are asynchronous, so each call adds an extra await and runtime overhead
  2. Module resolution happens at call time, not at module load time
  3. CI's single worker executes Shard 1 first with a cold module cache
  4. Shard 1 has 4 refactored files that call these helpers extensively
  5. Subsequent shards benefit from a warm cache and avoid the issue
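
The cold-versus-warm distinction in the list above can be illustrated with a toy model. This is a simplified stand-in for a module loader, not Node's real implementation: `lazyImport`, `coldLoads`, and the fake `clickSwitch` body are all assumptions made for the example. It only shows that a call-time import pays the resolution cost on first use, inside the helper's hot path, and hits a cache afterwards.

```typescript
// Toy model of call-time module loading; NOT Node's actual loader.
type UiHelpersModule = { clickSwitch: (target: string) => string };

const moduleCache = new Map<string, UiHelpersModule>();
let coldLoads = 0; // counts how often the slow "resolve and evaluate" path runs

async function lazyImport(specifier: string): Promise<UiHelpersModule> {
  const cached = moduleCache.get(specifier);
  if (cached) return cached; // warm path: still async, but no resolution work

  coldLoads += 1; // cold path: would resolve, read, and evaluate the file
  const loaded: UiHelpersModule = {
    clickSwitch: (target) => `clicked ${target}`,
  };
  moduleCache.set(specifier, loaded);
  return loaded;
}

// Shape of a helper that imports at call time, as described above.
async function clickSwitchAndWait(target: string): Promise<string> {
  const { clickSwitch } = await lazyImport("./ui-helpers");
  return clickSwitch(target);
}
```

In this model, two consecutive calls to `clickSwitchAndWait` trigger exactly one cold load; with a static import, that cost would instead be paid once when the helper module itself is loaded.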

4. Dependency Verification

Circular Dependency Check:

grep -n "wait-helpers" tests/utils/ui-helpers.ts
# Result: No matches ✅

Conclusion: Safe to convert dynamic imports to static imports
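
The grep above only checks for a direct back-reference from ui-helpers.ts. For a longer import chain, the same question could be answered with a small depth-first search. The sketch below uses a hand-written dependency map; both the map and `hasCycle` are illustrative assumptions rather than project code.

```typescript
// Maps each module to the modules it imports (hand-written for illustration).
type ImportGraph = Record<string, string[]>;

// Returns true if any chain of imports leads back to a module already on the path.
function hasCycle(graph: ImportGraph): boolean {
  const visiting = new Set<string>(); // modules on the current DFS path
  const done = new Set<string>();     // modules fully explored, known cycle-free

  const visit = (node: string): boolean => {
    if (done.has(node)) return false;
    if (visiting.has(node)) return true; // back-edge: cycle found
    visiting.add(node);
    for (const dep of graph[node] ?? []) {
      if (visit(dep)) return true;
    }
    visiting.delete(node);
    done.add(node);
    return false;
  };

  return Object.keys(graph).some((node) => visit(node));
}

// Situation reported by the grep: wait-helpers imports ui-helpers, not vice versa.
const currentGraph: ImportGraph = {
  "wait-helpers": ["ui-helpers"],
  "ui-helpers": [],
};

// Hypothetical counter-example in which ui-helpers imported wait-helpers back.
const circularGraph: ImportGraph = {
  "wait-helpers": ["ui-helpers"],
  "ui-helpers": ["wait-helpers"],
};

console.log(hasCycle(currentGraph), hasCycle(circularGraph));
```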

Expect Import Analysis:

grep -n "await expect(" tests/utils/wait-helpers.ts
# Result: 20+ usages ✅

Conclusion: expect import from @bgotink/playwright-coverage is correct and necessary


Root Cause

PRIMARY CAUSE: Dynamic Import Resolution in CI

Confidence Level: 85%

Mechanism:

  1. wait-helpers.ts uses dynamic imports in hot paths
  2. CI environment (Docker + single worker) has slower module resolution
  3. Shard 1 runs first with cold module cache
  4. Async import overhead causes subtle timing issues
  5. Shards 2-3 benefit from warmed cache

Why It Passes Locally:

  • Multiple workers pre-warm module cache in parallel
  • Native filesystem has faster module resolution
  • Parallel execution masks timing issues

Why It Fails in CI:

  • Single worker (workers: 1) serializes execution
  • Docker filesystem might be slower
  • Cold module cache on first shard
  • Timing issues exposed by sequential execution

Solution

Replace Dynamic Imports with Static Imports

File to Modify: tests/utils/wait-helpers.ts

Change 1: Add Static Import (Line 5)

// BEFORE:
import type { Page, Locator, Response } from '@playwright/test';

// AFTER:
import type { Page, Locator, Response } from '@playwright/test';
import { clickSwitch } from './ui-helpers'; // ✅ Static import

Change 2: Remove Dynamic Import (Lines 69-70)

// BEFORE:
const { clickSwitch } = await import('./ui-helpers');

// AFTER:
// Use imported clickSwitch directly (already imported at top)

Change 3: Remove Dynamic Import (Lines 108-109)

// BEFORE:
const { clickSwitch } = await import('./ui-helpers');

// AFTER:
// Use imported clickSwitch directly
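
For illustration, a helper might look roughly like this once the static import is in place. The real body and signature of clickSwitchAndWaitForResponse() are not shown in this summary, so everything below is assumed: `PageLike`, `LocatorLike`, and `ResponseLike` are minimal stand-ins for Playwright's types, and the local `clickSwitch` stands in for the function that would be imported from ./ui-helpers.

```typescript
// Minimal structural stand-ins for the Playwright types (assumptions).
interface ResponseLike { status(): number; }
interface LocatorLike { click(): Promise<void>; }
interface PageLike {
  waitForResponse(urlPattern: string | RegExp): Promise<ResponseLike>;
}

// Stand-in for the static import: import { clickSwitch } from './ui-helpers';
async function clickSwitch(locator: LocatorLike): Promise<void> {
  await locator.click();
}

// Hypothetical post-fix shape: no `await import(...)` inside the helper.
async function clickSwitchAndWaitForResponse(
  page: PageLike,
  locator: LocatorLike,
  urlPattern: string | RegExp,
): Promise<ResponseLike> {
  // Start listening before the click so the response cannot be missed.
  const [response] = await Promise.all([
    page.waitForResponse(urlPattern),
    clickSwitch(locator),
  ]);
  return response;
}
```

The only behavioral difference from the dynamic-import version is when ui-helpers is loaded: once, when wait-helpers is first loaded, rather than inside each call.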

Expected Impact

Before Fix

| Shard | Browser | Status | Note |
|-------|---------|--------|------|
| 1 | Chromium | Failed | Dynamic imports |
| 1 | Firefox | Failed | Dynamic imports |
| 1 | WebKit | Failed | Dynamic imports |
| 2 | Chromium | Passed | Warm cache |
| 2 | Firefox | Passed | Warm cache |
| 2 | WebKit | Passed | Warm cache |
| 3 | Chromium | Passed | Warm cache |
| 3 | Firefox | Passed | Warm cache |
| 3 | WebKit | Passed | Warm cache |
| 4 | All | ⚠️ Cancelled | Workflow stopped |

Success Rate: 50% (6/12 jobs passing)

After Fix

| Shard | Browser | Status | Note |
|-------|---------|--------|------|
| 1 | Chromium | Pass | Static imports |
| 1 | Firefox | Pass | Static imports |
| 1 | WebKit | Pass | Static imports |
| 2 | Chromium | Pass | No change |
| 2 | Firefox | Pass | No change |
| 2 | WebKit | Pass | No change |
| 3 | Chromium | Pass | No change |
| 3 | Firefox | Pass | No change |
| 3 | WebKit | Pass | No change |
| 4 | Chromium | Pass | Will run |
| 4 | Firefox | Pass | Will run |
| 4 | WebKit | Pass | Will run |

Success Rate: 100% (12/12 jobs passing)


Implementation Timeline

| Step | Task | Duration |
|------|------|----------|
| 1 | Remove dynamic imports from wait-helpers.ts | 5 min |
| 2 | Test locally with CI=true | 5 min |
| 3 | Commit and push | 2 min |
| 4 | Monitor CI pipeline | 15 min |
| Total | | 27 min |

With buffer: ~1 hour


Validation Checklist

Pre-Implementation

  • Shard 1 test files identified
  • Dynamic import locations found
  • No circular dependencies confirmed
  • expect usage verified

Implementation

  • Static import added to wait-helpers.ts
  • Dynamic imports removed (2 locations)
  • Local test passes: CI=true npx playwright test --shard=1/4 --project=chromium

Post-Implementation

  • Fix pushed to repository
  • CI pipeline triggered
  • Shard 1 Chromium passes
  • Shard 1 Firefox passes
  • Shard 1 WebKit passes
  • Shards 2-3 still pass
  • Shard 4 runs and passes
  • GitHub issue updated

Risk Assessment

Implementation Risk: LOW

Why:

  • Static imports are standard practice
  • No architectural changes required
  • No circular dependencies exist
  • Change is localized to 3 lines in 1 file

Regression Risk: VERY LOW

Why:

  • Only changes module load timing
  • Shards 2-3 already passing (won't affect them)
  • Local tests already passing
  • Fix makes code simpler and more maintainable

Alternatives Considered

Option 1: Increase Timeouts

Pros: Quick fix
Cons: Hides root cause, makes tests slower
Verdict: Not recommended

Option 2: Disable Shard 1 Tests

Pros: Unblocks CI immediately
Cons: Reduces coverage by 25%, hides problem
Verdict: Not recommended

Option 3: Split wait-helpers.ts

Pros: Separates concerns
Cons: More complex, requires refactoring all imports
Verdict: Overkill for this issue


Lessons Learned

1. Dynamic Imports in Test Utilities

Problem: Async module resolution adds overhead in CI
Solution: Use static imports unless truly necessary

2. CI-Specific Behavior

Problem: Single worker serialization exposes issues masked locally
Learning: Always test with CI=true locally before pushing

3. Module Cache Effects

Problem: Warm cache in later shards masks cold cache issues in Shard 1
Learning: Pay special attention to first shard in CI

4. Shard Distribution

Problem: Alphabetical ordering concentrated refactored files in Shard 1
Learning: Consider test file naming to balance shard load


Investigation Complete: 2026-02-03
Next Action: Implement fix per shard1_fix_plan.md