fix(compactor): make DeleteLocalSyncFiles tests deterministic (pin ring tokens, drive compaction manually)#7619
Open
sandy2008 wants to merge 1 commit into
Conversation
76bfdf7 to
a0d54c6
Compare
…ng tokens, drive compaction manually) TestCompactor_DeleteLocalSyncFiles and its partition twin relied on timer-driven compaction cycles and on the random per-instance ring tokens drawn at startup. With 2 compactors x 512 random tokens and only 10 fixed users there is a measured ~0.103% (~1-in-960 per run) chance that the second compactor owns zero of the users. In such runs every c2 cycle completes without ever creating a meta-sync directory, so the condition polled since cortexproject#7567 (CompactionRunsCompleted >= 2 && len(dirs) > 0) is permanently false and the test burns the entire 30s budget: the exact 32.8s arm64 CI failure of cortexproject#7608, and the true root cause of cortexproject#7565 (cortexproject#7567's "transient ring-view skew" was a misdiagnosis - no timeout can fix a permanently false condition). Restore the manual-drive structure the test had when introduced in cortexproject#3851 (removed by cortexproject#6510) and pin the ownership split, identically in both twins: - CompactionInterval = 10m: timers are out of the picture; the test drives compaction cycles itself. - Pin per-instance ring tokens via ShardingRing.TokensFilePath (new pinnedTokens helper): a guard token fnv32a(user)+1 for alternating users makes compactor-1 own user-1,3,5,7,9 and compactor-2 own user-2,4,6,8,10 deterministically, since the integer interval (h, h+1) is empty and ring lookup picks the first token strictly greater than the user hash. - Replace the timed CompactionRunsCompleted polls with direct compactUsers() calls on both compactors, gated by one poll waiting until BOTH compactors' ring views see two healthy ACTIVE instances. - Assert exact ownership counts (numUsers/2 each) instead of NotZero/NotEqual, so any silent un-pinning fails loudly. - Tolerate (only) context.Canceled from StopAndAwaitTerminated in the test cleanup: with a 10m interval the compactor is usually still in its initial jittered wait when stopped, and running() returns ctx.Err() there since cortexproject#6510. Supersedes cortexproject#7567's poll-based approach. Fixes cortexproject#7608 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
a0d54c6 to
661d418
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Makes
TestCompactor_DeleteLocalSyncFilesandTestPartitionCompactor_DeleteLocalSyncFilesdeterministic, fixing the second-generation arm64 flake of issue #7608 (32.80s ... compactor_test.go:1855: expected true, got false). Test-only change.Lineage of this flake:
CompactionInterval = 10 * time.Minute // We will only call compaction manually.— compaction cycles were driven synchronously by the test.prepareConfig()'s 5s timer with pre-run jitter, and assertions started sampling timer-driven cycles.Should not be zero, but was 0#7565 — first observed failure (Should not be zero, but was 0).CompactionRunsCompleted >= 2 && len(listTenantsWithMetaSyncDirectories()) > 0. That fixed a real-but-secondary mid-cycle sampling race while leaving the actual root cause in place.Root cause
The compactor ring assigns each instance 512 random tokens (
NumTokensis hardcoded inRingConfig.ToLifecyclerConfig), and the test has only 10 fixed users (fnv32a(userID)→ring.Get). With 2 instances × 512 random tokens, there is a measured ≈0.103% (~1-in-960 per twin run) probability that the second compactor owns zero of the 10 users (pooled over ≈11M Monte-Carlo trials across five independent simulations using the real fnv32a user hashes and the ring's strictly-greater token search).In such a draw, every c2 compaction cycle completes successfully but skips all users, so no meta-sync directory is ever created and #7567's
len(dirs) > 0clause is permanently false — the poll burns its full 30s regardless of timeout. The same draw explains #7565 (c2Userswas genuinely 0). No timeout bump can fix a permanently-false condition.Deterministic reproduction (fail-without-fix evidence): pinning an adversarial token layout into the current test via
ShardingRing.TokensFilePath(first compactor gets guard tokens for all 10 users, second only low filler tokens) reproduces the CI failure exactly and on every run: 32.81s locally vs CI's 32.80s, with the identicalcompactor_test.go:1855: expected true, got falsemessage and c2 logging six clean cycles of "skipping user ... not owned". The adversarial variant is cited here as evidence and intentionally not committed. Under the new test design the same degenerate layout is unconstructible.How the fix resolves it
Restores #3851's manual-drive structure and pins the ownership split, identically in both twins:
CompactionInterval = 10 * time.Minutewith the original// We will only call compaction manually.comment — timer-driven cycles are out of the picture.pinnedTokenshelper pins per-instance ring tokens viaShardingRing.TokensFilePath: exactly 512 tokens per instance (shorter files are silently topped up with random tokens by the lifecycler — guarded byrequire.Len), where a guard tokenfnv32a(user)+1for every 2nd user makes compactor-1 ownuser-1,3,5,7,9and compactor-2 ownuser-2,4,6,8,10deterministically: the integer interval(h, h+1)is empty, so the guard is always the first token strictly greater than the user's hash, regardless of the low filler tokens. The split was validated by simulation against the ring'ssearchTokenstrictly-greater semantics (exact 5/5, no token collisions; the minimum pairwise hash gap between the test users is the FNV prime 16777619, so guards cannot shadow each other). The helper's comment warns against evenly-spaced token schemes:fnv32a("user-N")hashes step by the FNV prime and resonate with regular spacings, which can degenerate back to a 0/10 split.CompactionRunsCompletedpolls are replaced by direct, synchronouscompactUsers()calls on both compactors (the test already did this for c1's final cleanup cycle). This is a mandatory pairing with the 10m interval: c1's old startup poll would otherwise out-wait the now-jittered initial auto-run most of the time.GetAllHealthy(RingOp)— the same operationownUserqueries. c2's own view is already barriered bystarting()(it waits until c2 is ACTIVE in its own view, and every KV snapshot containing c2 also contains the earlier-registered c1), but c1's ring watcher ingests c2's registration asynchronously, and the finalc1.compactUsers()cleanup depends on c1's view.numUsers/2for c2,numUsers - numUsers/2for c1, plus the historical sum invariant) instead ofNotZero/NotEqual. This is self-protecting: if pinning were ever silently bypassed, a random draw lands exactly 5/5 only ~25% of the time, so a regression fails loudly and immediately rather than once in a thousand runs.running()returnsctx.Err()when the service is stopped during the initial jittered wait (up to 60s at a 10m interval), whichStopAndAwaitTerminatedreports as a service failure. Other long-interval tests in this suite ignore the stop error entirely (//nolint:errcheck); these tests now use a stricter form that tolerates onlycontext.Canceled, keeping the cleanup's protective value for real failures.Why not other approaches
c2owns zero users → exact-count orNotZeroassertions fail) — i.e. generation-3 of the same flake, just with a clearer message.user-Nstep by the FNV prime 16777619 and alias regular spacings; a plausible-looking evenly-spaced scheme degenerates to a 0/10 split at 512 tokens. Hence guard tokens, which are provably correct in one sentence.Should not be zero, but was 0#7565 sampling race) on top of the ownership draw.Scope
pkg/compactor/compactor_test.go,pkg/compactor/compactor_paritioning_test.go(both twins get byte-identical treatment), plus oneCHANGELOG.mdline. No production changes, noprepareConfig()changes.TestPartitionCompactor_ShouldCompactOnlyUsersOwnedByTheInstance...), which touches different functions in the partition test file.Which issue(s) this PR fixes:
Fixes #7608
References #7565, #7567 (supersedes #7567's poll-based approach).
Checklist
make docnot required)CHANGELOG.mdupdated - the order of entries should be[CHANGE],[FEATURE],[ENHANCEMENT],[BUGFIX]docs/configuration/v1-guarantees.mdupdated if this PR introduces experimental flags — N/ATest plan (all on the branch, Apple Silicon arm64)
go test -tags "netgo slicelabels" -count=20 -run 'TestCompactor_DeleteLocalSyncFiles$|TestPartitionCompactor_DeleteLocalSyncFiles$' ./pkg/compactor/— ok, 99.1s total for 40 test executions ≈ 2.5s per test (previously ~8.5s healthy, 32.8s when the flake hit)-race -count=5— ok, 27.1sGOMAXPROCS=2 go test -tags "netgo slicelabels" -count=1 -run '...' ./pkg/compactor/— ok, 5.5sgo test -tags "netgo slicelabels" ./pkg/compactor/— ok, 92.5sgo vet -tags "netgo slicelabels" ./pkg/compactor/— clean;gofmt/goimports -local github.com/cortexproject/cortex— cleansearchTokensemantics: c1 =user-1,3,5,7,9, c2 =user-2,4,6,8,10(exact 5/5); all guards far belowMaxUint32; max filler token 1014 vs minimum user hash 1986520588.Per the project's GenAI policy: this contribution was developed with substantial AI assistance (Claude Code) — root-cause analysis, Monte-Carlo simulations, code, and verification — and was reviewed and submitted under the DCO by the contributor.
🤖 Generated with Claude Code