Skip to content

fix(compactor): de-flake sharded-ownership compactor tests (reduce mock workload, align poll budget)#7617

Open
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix/flaky-partition-compactor-ownership-poll
Open

fix(compactor): de-flake sharded-ownership compactor tests (reduce mock workload, align poll budget)#7617
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix/flaky-partition-compactor-ownership-poll

Conversation

@sandy2008

Copy link
Copy Markdown
Contributor

What this PR does:

De-flakes TestCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning and its partition twin TestPartitionCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning (issue #7607). Test-only: 6 lines across the two tests, plus a CHANGELOG entry.

  1. numUsers := 10020 in both tests — the load-bearing fix. It cuts the quadratic testify-mock matching work ~25× (registered expectations on the shared bucket.ClientMock: 3,707 → ~745 in the partition test, 3,208 → ~648 in the sibling). The per-user ownership assertions are tenant-count independent.
  2. Partition test's run-completion poll 60s120s (compactor_paritioning_test.go:1217) — a seatbelt, not the fix: it removes the unexplained asymmetry with its non-partition sibling, which already polled with 120s.
  3. cfg.ShardingRing.WaitActiveInstanceTimeout = 30s in both tests — preventive hardening on the exact Fix flaky TestCompactor_RingLifecyclerShouldAutoForgetUnhealthyInstances #7503 pattern. prepareConfig/prepareConfigForPartitioning pin this to 5s, and the starvation envelope observed in the failing run (the partition test's non-poll overhead was 21.65s / 17.67s in the two attempts vs ~3s locally, i.e. ~6–7×) puts that 5s ACTIVE ceiling within reach; had it tripped, these same tests would have failed at StartAndAwaitRunning with a different signature.

Root cause (correcting the issue's original framing)

The issue framed this as ring convergence racing a too-tight 60s timeout. The stored CI logs and two independent CPU profiles show otherwise:

  • The ring is ACTIVE and stable before the poll even starts. services.StartAndAwaitRunning returns only after the compactor's starting() has waited for the instance to become ACTIVE in the ring and then for ring stability (pkg/compactor/compactor.go:719-741). The failing poll measures time-to-first-completed-compaction-run, not ring convergence.
  • What is actually slow is the first compaction cycle itself. Both compactors and both blocks-cleaners funnel every bucket operation through one shared bucket.ClientMock; each call holds testify's single mock mutex through a linear O(expectations) reflective scan (findExpectedCall/Arguments.Diff). At numUsers=100 that is ~3,700/~3,200 registered expectations; two independent CPU profiles put 37.7% / 38.2% of samples in that matching code, and a GOMAXPROCS=2 run is as fast as an unrestricted one — the work is mutex-serialized, so more cores don't help and a loaded CI runner stretches it linearly. The first cycle alone costs ~19–20s under -race on an idle machine.
  • CI evidence — run 26632776611, job test (amd64): the run has two attempts, both failed both twins (final run conclusion is failure):
    • Attempt 1 (stored log: gh api repos/cortexproject/cortex/actions/jobs/78485485760/logs): the partition twin failed its 60s poll at 81.65s, and the non-partition sibling — which already had a 120s budget — burned all of it and failed too at 124.41s (compactor_test.go:1164).
    • Attempt 2 (the numbers the issue quotes): 77.67s / 122.49s.
    • Reviewer note on the numbers: gh run view --job 78485485760 resolves a job ID to the run's latest attempt, which is why the issue shows 77.67s while the attempt-1 stored log shows 81.65s. Both attempts are real samples — four failures, two of them with the 120s budget already in place.
  • Worst-case math at the observed ~6–7× starvation: N=100 first cycle ≈ 19.5s idle × 7 ≈ 137s > 120s (matching the observed 124.41s flat burn); N=20 ≈ 3s × 7 ≈ 20s ≪ 120s (≥6× margin even at the worst observed starvation).

Why a bigger timeout cannot fix this on its own

Every per-tenant meta sync runs under a deadline equal to -compactor.compaction-interval: compactUser passes c.compactorCfg.CompactionInterval (5s in these tests) to compact.NewMetaSyncerWithMetrics as syncMetasTimeout (pkg/compactor/compactor.go:1085), and Thanos' Syncer.SyncMetas wraps each call in context.WithTimeout(ctx, syncMetasTimeout) (vendor/github.com/thanos-io/thanos/pkg/compact/compact.go:152-154). With 100 tenants, both compactors and both blocks-cleaners share one bucket.ClientMock, and every bucket op holds the single testify mutex while linearly scanning ~3,700 registered expectations (~38% of CPU in two independent profiles). Under CI starvation a single tenant's sync can exceed 5s; that tenant errors, compactUsers finishes with compactionErrorCount > 0, and the run increments CompactionRunsFailed instead of CompactionRunsCompleted (pkg/compactor/compactor.go:832-840). Runs then keep firing every ~5s, each completing-as-failed, and the polled counter can stay at 0 for as long as starvation persists — which is how the non-partition sibling burned a flat 120 seconds in the same CI run (124.41s total = ~4.4s setup + a full 120s poll with zero counter movement, compactor_test.go:1164, job 78485485760). Shrinking the fixture to 20 tenants cuts the quadratic mock-matching work ~25×, restoring a 10–50× margin under the 5s sync deadline and landing the first clean run in seconds even at the observed 6–7× starvation; the 60s→120s change merely removes the unexplained asymmetry with the sibling (#6510 set both in one PR) as a seatbelt.

Why not other approaches

  • Timeout-only: falsified four times over by the same run — the sibling already had 120s and burned it flat in both attempts, and the sync-deadline mechanism above means no poll budget is provably sufficient at N=100.
  • Replace ClientMock with a seeded objstore.NewInMemBucket() (fix(compactor): Fix flaky TestBlocksCleaner_ShouldRemoveBlocksOutside… #7486 precedent): the right way to retire this O(N²) pattern for good, but the wrong vehicle for a flake fix — a stateful bucket activates second-cycle read-back paths (visit markers, partitioned-group files, bucket indexes) that the frozen mock never returned, so it needs its own soak. Listed as a follow-up below.
  • Restructuring startup (StartAsync / concurrent starts): doesn't touch the serialized-mock bottleneck, breaks parity between the twins, and buys nothing at N=20.
  • A pre-poll ring wait: redundant — starting() already guarantees ACTIVE + stability before the poll begins (see root cause).

Scope

  • Test-only; no production, vendored, or shared-helper (prepareConfig*, Poll) changes.
  • The shuffle-sharding twins are deliberately untouched. Their live failure in the same CI run is a different class: arm64 job 78485485722 (attempt 1) failed TestCompactor_ShouldCompactOnlyShardsOwnedByTheInstanceOnShardingEnabledWithShuffleShardingAndMultipleInstancesRunning in 20.86s at compactor_test.go:1318 with group with group_hash=4233903734 owned by multiple compactors — an ownership-exclusivity violation (its run-completion poll passed), possibly a real visit-marker race under starvation, which deserves its own root-causing rather than a poll band-aid. Touching those tests here would muddy attribution when that signature recurs; a follow-up issue is suggested instead.
  • Zero overlap with Flaky test: TestCompactor_DeleteLocalSyncFiles (arm64) recurs after #7567 — 30s CompactionRunsCompleted>=2 poll times out #7608: the DeleteLocalSyncFiles twins and everything that PR touches are untouched here.
  • Degenerate-split note: at N=20, a skewed user split between the two instances cannot fail these tests. The poll is split-independent (a compactor owning zero users still completes a run), and findCompactorByUserID derives each user's owner from the ring; that owner's own completed run guarantees the asserted log line. P(one instance owns 0 of 20) ≈ 2e-6, and even that case passes — no Flaky test: TestCompactor_DeleteLocalSyncFiles (arm64) recurs after #7567 — 30s CompactionRunsCompleted>=2 poll times out #7608-style trap.
  • Suggested follow-ups (deliberately not bundled here):
    1. Root-cause the shuffle-twin ownership-exclusivity failure above (and only then revisit that family's fragile exact-match == N polls).
    2. syncMetasTimeout is silently coupled to -compactor.compaction-interval (pkg/compactor/compactor.go:1085), so short intervals impose equally short per-tenant meta-sync deadlines and can turn slow-but-healthy syncs into entirely failed runs (cortex_compactor_runs_failed_total with no progress). Consider an explicit syncer timeout (or a floor) independent of cycle cadence.
    3. Migrate fixture-heavy compactor tests from bucket.ClientMock to a seeded in-memory bucket to remove the O(expectations²) testify-matching pattern suite-wide.

Which issue(s) this PR fixes:
Fixes #7607

Checklist

  • Tests updated (test-only change to the two flaky tests)
  • Documentation added — N/A (no flags or config changed; make doc not required)
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
  • docs/configuration/v1-guarantees.md updated if this PR introduces experimental flags — N/A

Test plan

All runs on Apple Silicon with -tags "netgo slicelabels" -race (the CI test job configuration; -race matters — no-race runs understate the mock cost ~5×).

TestPartitionCompactor_...UsersOwned... TestCompactor_...UsersOwned...
Before (base 74185ef, -count=1) 21.00s 24.01s
After (same command) 3.57s 3.55s
  • go test -tags "netgo slicelabels" -race -count=5 -run 'Compactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning$' ./pkg/compactor/10/10 PASS (3.47–8.37s per test).
  • GOMAXPROCS=2 go test -tags "netgo slicelabels" -race -count=3 -run '...same pattern...' ./pkg/compactor/6/6 PASS (3.22–3.56s per test; post-fix the tests are no longer mutex-bound, so constraining cores no longer hurts).
  • go vet -tags "netgo slicelabels" ./pkg/compactor/ — clean.
  • gofmt -l / goimports -local github.com/cortexproject/cortex -l on the touched files — clean.
  • Single commit, DCO signed-off.

AI assistance disclosure

Per GENAI_POLICY.md: this contribution (root-cause analysis from the stored CI logs and CPU profiles, the patch, and this PR description) was produced with substantial AI assistance (Claude Code). The submitter reviewed the change, and every timing and verification result above comes from actually running the listed commands.

🤖 Generated with Claude Code

TestCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning
and its partition twin seed one shared bucket.ClientMock with 100
tenants (~3,208/~3,707 registered testify expectations). Every bucket
operation from both compactors and both blocks cleaners holds the
mock's single mutex through an O(expectations) reflective scan (~38%
of CPU in two independent profiles), so the first compaction cycle
alone takes ~20s under -race on an idle machine and 6-7x longer on
starved CI runners. Each per-tenant meta sync additionally runs under
a deadline equal to -compactor.compaction-interval (5s in these tests,
pkg/compactor/compactor.go:1085), so starved syncs turn whole runs
into failures and CompactionRunsCompleted can stay pinned at 0 longer
than ANY poll budget: in CI run 26632776611 the partition test failed
its 60s poll AND the non-partition sibling burned its full 120s poll,
in both attempts. The ring itself was ACTIVE and stable before the
poll started (starting() waits for both), correcting the issue's
original ring-convergence framing.

Fix (test-only):
- Cut numUsers from 100 to 20 in both tests (~25x less quadratic
  mock-matching work; the per-user ownership assertions are
  tenant-count independent).
- Align the partition test's run-completion poll budget with its
  non-partition sibling (60s -> 120s) as a seatbelt.
- Set WaitActiveInstanceTimeout=30s in both tests (same hardening as
  cortexproject#7503): prepareConfig's 5s ACTIVE ceiling is within reach of the
  observed CI starvation envelope.

The shuffle-sharding twins are deliberately untouched: their live
arm64 failure in the same run is an ownership-exclusivity violation
(compactor_test.go:1318), a different failure class that needs its own
root-causing.

Before/after on Apple Silicon, -race: 21.00s/24.01s -> 3.57s/3.55s.

Fixes cortexproject#7607

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@sandy2008 sandy2008 force-pushed the fix/flaky-partition-compactor-ownership-poll branch from e31fe74 to 9d77047 Compare June 11, 2026 06:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

1 participant