fix(compactor): de-flake sharded-ownership compactor tests (reduce mock workload, align poll budget)#7617
Open
sandy2008 wants to merge 1 commit into
Conversation
TestCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning and its partition twin seed one shared bucket.ClientMock with 100 tenants (~3,208/~3,707 registered testify expectations). Every bucket operation from both compactors and both blocks cleaners holds the mock's single mutex through an O(expectations) reflective scan (~38% of CPU in two independent profiles), so the first compaction cycle alone takes ~20s under -race on an idle machine and 6-7x longer on starved CI runners. Each per-tenant meta sync additionally runs under a deadline equal to -compactor.compaction-interval (5s in these tests, pkg/compactor/compactor.go:1085), so starved syncs turn whole runs into failures and CompactionRunsCompleted can stay pinned at 0 longer than ANY poll budget: in CI run 26632776611 the partition test failed its 60s poll AND the non-partition sibling burned its full 120s poll, in both attempts. The ring itself was ACTIVE and stable before the poll started (starting() waits for both), correcting the issue's original ring-convergence framing. Fix (test-only): - Cut numUsers from 100 to 20 in both tests (~25x less quadratic mock-matching work; the per-user ownership assertions are tenant-count independent). - Align the partition test's run-completion poll budget with its non-partition sibling (60s -> 120s) as a seatbelt. - Set WaitActiveInstanceTimeout=30s in both tests (same hardening as cortexproject#7503): prepareConfig's 5s ACTIVE ceiling is within reach of the observed CI starvation envelope. The shuffle-sharding twins are deliberately untouched: their live arm64 failure in the same run is an ownership-exclusivity violation (compactor_test.go:1318), a different failure class that needs its own root-causing. Before/after on Apple Silicon, -race: 21.00s/24.01s -> 3.57s/3.55s. Fixes cortexproject#7607 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
e31fe74 to
9d77047
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does:
De-flakes
TestCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunningand its partition twinTestPartitionCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning(issue #7607). Test-only: 6 lines across the two tests, plus a CHANGELOG entry.numUsers := 100→20in both tests — the load-bearing fix. It cuts the quadratic testify-mock matching work ~25× (registered expectations on the sharedbucket.ClientMock: 3,707 → ~745 in the partition test, 3,208 → ~648 in the sibling). The per-user ownership assertions are tenant-count independent.60s→120s(compactor_paritioning_test.go:1217) — a seatbelt, not the fix: it removes the unexplained asymmetry with its non-partition sibling, which already polled with 120s.cfg.ShardingRing.WaitActiveInstanceTimeout = 30sin both tests — preventive hardening on the exact Fix flaky TestCompactor_RingLifecyclerShouldAutoForgetUnhealthyInstances #7503 pattern.prepareConfig/prepareConfigForPartitioningpin this to 5s, and the starvation envelope observed in the failing run (the partition test's non-poll overhead was 21.65s / 17.67s in the two attempts vs ~3s locally, i.e. ~6–7×) puts that 5s ACTIVE ceiling within reach; had it tripped, these same tests would have failed atStartAndAwaitRunningwith a different signature.Root cause (correcting the issue's original framing)
The issue framed this as ring convergence racing a too-tight 60s timeout. The stored CI logs and two independent CPU profiles show otherwise:
services.StartAndAwaitRunningreturns only after the compactor'sstarting()has waited for the instance to become ACTIVE in the ring and then for ring stability (pkg/compactor/compactor.go:719-741). The failing poll measures time-to-first-completed-compaction-run, not ring convergence.bucket.ClientMock; each call holds testify's single mock mutex through a linear O(expectations) reflective scan (findExpectedCall/Arguments.Diff). AtnumUsers=100that is ~3,700/~3,200 registered expectations; two independent CPU profiles put 37.7% / 38.2% of samples in that matching code, and aGOMAXPROCS=2run is as fast as an unrestricted one — the work is mutex-serialized, so more cores don't help and a loaded CI runner stretches it linearly. The first cycle alone costs ~19–20s under-raceon an idle machine.test (amd64): the run has two attempts, both failed both twins (final run conclusion isfailure):gh api repos/cortexproject/cortex/actions/jobs/78485485760/logs): the partition twin failed its 60s poll at 81.65s, and the non-partition sibling — which already had a 120s budget — burned all of it and failed too at 124.41s (compactor_test.go:1164).gh run view --job 78485485760resolves a job ID to the run's latest attempt, which is why the issue shows 77.67s while the attempt-1 stored log shows 81.65s. Both attempts are real samples — four failures, two of them with the 120s budget already in place.Why a bigger timeout cannot fix this on its own
Every per-tenant meta sync runs under a deadline equal to
-compactor.compaction-interval:compactUserpassesc.compactorCfg.CompactionInterval(5s in these tests) tocompact.NewMetaSyncerWithMetricsassyncMetasTimeout(pkg/compactor/compactor.go:1085), and Thanos'Syncer.SyncMetaswraps each call incontext.WithTimeout(ctx, syncMetasTimeout)(vendor/github.com/thanos-io/thanos/pkg/compact/compact.go:152-154). With 100 tenants, both compactors and both blocks-cleaners share onebucket.ClientMock, and every bucket op holds the single testify mutex while linearly scanning ~3,700 registered expectations (~38% of CPU in two independent profiles). Under CI starvation a single tenant's sync can exceed 5s; that tenant errors,compactUsersfinishes withcompactionErrorCount > 0, and the run incrementsCompactionRunsFailedinstead ofCompactionRunsCompleted(pkg/compactor/compactor.go:832-840). Runs then keep firing every ~5s, each completing-as-failed, and the polled counter can stay at 0 for as long as starvation persists — which is how the non-partition sibling burned a flat 120 seconds in the same CI run (124.41s total = ~4.4s setup + a full 120s poll with zero counter movement,compactor_test.go:1164, job 78485485760). Shrinking the fixture to 20 tenants cuts the quadratic mock-matching work ~25×, restoring a 10–50× margin under the 5s sync deadline and landing the first clean run in seconds even at the observed 6–7× starvation; the 60s→120s change merely removes the unexplained asymmetry with the sibling (#6510 set both in one PR) as a seatbelt.Why not other approaches
ClientMockwith a seededobjstore.NewInMemBucket()(fix(compactor): Fix flaky TestBlocksCleaner_ShouldRemoveBlocksOutside… #7486 precedent): the right way to retire this O(N²) pattern for good, but the wrong vehicle for a flake fix — a stateful bucket activates second-cycle read-back paths (visit markers, partitioned-group files, bucket indexes) that the frozen mock never returned, so it needs its own soak. Listed as a follow-up below.starting()already guarantees ACTIVE + stability before the poll begins (see root cause).Scope
prepareConfig*,Poll) changes.TestCompactor_ShouldCompactOnlyShardsOwnedByTheInstanceOnShardingEnabledWithShuffleShardingAndMultipleInstancesRunningin 20.86s atcompactor_test.go:1318withgroup with group_hash=4233903734 owned by multiple compactors— an ownership-exclusivity violation (its run-completion poll passed), possibly a real visit-marker race under starvation, which deserves its own root-causing rather than a poll band-aid. Touching those tests here would muddy attribution when that signature recurs; a follow-up issue is suggested instead.DeleteLocalSyncFilestwins and everything that PR touches are untouched here.findCompactorByUserIDderives each user's owner from the ring; that owner's own completed run guarantees the asserted log line. P(one instance owns 0 of 20) ≈ 2e-6, and even that case passes — no Flaky test: TestCompactor_DeleteLocalSyncFiles (arm64) recurs after #7567 — 30s CompactionRunsCompleted>=2 poll times out #7608-style trap.== Npolls).syncMetasTimeoutis silently coupled to-compactor.compaction-interval(pkg/compactor/compactor.go:1085), so short intervals impose equally short per-tenant meta-sync deadlines and can turn slow-but-healthy syncs into entirely failed runs (cortex_compactor_runs_failed_totalwith no progress). Consider an explicit syncer timeout (or a floor) independent of cycle cadence.bucket.ClientMockto a seeded in-memory bucket to remove the O(expectations²) testify-matching pattern suite-wide.Which issue(s) this PR fixes:
Fixes #7607
Checklist
make docnot required)CHANGELOG.mdupdated - the order of entries should be[CHANGE],[FEATURE],[ENHANCEMENT],[BUGFIX]docs/configuration/v1-guarantees.mdupdated if this PR introduces experimental flags — N/ATest plan
All runs on Apple Silicon with
-tags "netgo slicelabels" -race(the CItestjob configuration;-racematters — no-race runs understate the mock cost ~5×).TestPartitionCompactor_...UsersOwned...TestCompactor_...UsersOwned...-count=1)go test -tags "netgo slicelabels" -race -count=5 -run 'Compactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning$' ./pkg/compactor/— 10/10 PASS (3.47–8.37s per test).GOMAXPROCS=2 go test -tags "netgo slicelabels" -race -count=3 -run '...same pattern...' ./pkg/compactor/— 6/6 PASS (3.22–3.56s per test; post-fix the tests are no longer mutex-bound, so constraining cores no longer hurts).go vet -tags "netgo slicelabels" ./pkg/compactor/— clean.gofmt -l/goimports -local github.com/cortexproject/cortex -lon the touched files — clean.AI assistance disclosure
Per GENAI_POLICY.md: this contribution (root-cause analysis from the stored CI logs and CPU profiles, the patch, and this PR description) was produced with substantial AI assistance (Claude Code). The submitter reviewed the change, and every timing and verification result above comes from actually running the listed commands.
🤖 Generated with Claude Code