Skip to content

fix(querier): de-flake TestQuerierWithBlocksStorageOnMissingBlocksFromStorage by waiting for ACTIVE store-gateway in querier ring view#7615

Open
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix/flaky-querier-missing-blocks-sg-active-wait
Open

fix(querier): de-flake TestQuerierWithBlocksStorageOnMissingBlocksFromStorage by waiting for ACTIVE store-gateway in querier ring view#7615
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix/flaky-querier-missing-blocks-sg-active-wait

Conversation

@sandy2008

Copy link
Copy Markdown
Contributor

What this PR does

De-flakes TestQuerierWithBlocksStorageOnMissingBlocksFromStorage (integration_querier, observed on arm64): the first happy-path query — before any blocks are deleted — intermittently failed with server_error: server error: 500 at querier_test.go:367. This is a test-only change: one additional readiness wait plus a CHANGELOG.md line. No production code is touched, and the test's deliberate post-deletion 500 assertions are unchanged.

Root cause (decoded from the CI logs)

The querier logs the 500 response body gzipped inside the Response: field of its request log line. Unescaping the Go-quoted bytes and gunzipping them yields the same body in both failing runs (modulo block ULID):

{"status":"error","errorType":"server_error","error":"expanding series: failed to get store-gateway replication set owning the block 01KT4J55J7SX01JD4N5RD8F3WP: at least 1 healthy replica required, could only find 0 - unhealthy instances: 172.18.0.8:9095"}
  • Run 26832378915 (2026-06-02): 500 at 16:20:11.513, server latency 5.19ms, block 01KT4J55J7SX01JD4N5RD8F3WP.
  • Run 26632776611 (2026-05-30): 500 at 01:15:25.864, server latency 4.13ms, block 01KSV76AWZPKY9H51G4JCGTRYX.

The 4–5ms latency means the error is querier-local: the query never reached the store-gateway. The mechanism:

  1. The store-gateway registers in the ring with all 512 tokens in JOINING state (pkg/storegateway/gateway.go:461, OnRingInstanceRegister), runs InitialSync (gateway.go:325 — this is what drives cortex_bucket_store_blocks_loaded to 1), and only then CASes itself to ACTIVE (gateway.go:333).
  2. All three of the test's readiness waits are satisfiable while the querier still sees the store-gateway in JOINING: ring tokens are registered (and counted by cortex_ring_tokens_total) already at JOINING, and the other two waits are store-gateway-local metrics.
  3. The querier's BlocksRead ring operation admits ACTIVE instances only (pkg/storegateway/gateway_ring.go:49-55); with IgnoreUnhealthyInstancesReplicationStrategy (pkg/querier/blocks_store_queryable.go:236) and the single store-gateway this test runs, a JOINING instance yields exactly at least 1 healthy replica required, could only find 0 - unhealthy instances: ... (pkg/ring/replication_strategy.go:93), wrapped at pkg/querier/blocks_store_replicated_set.go:127 into the decoded message above → HTTP 500.
  4. The race window is structural, not just "slow CI": the querier's consul WatchKey poll is rate-limited to 1 req/s by default (pkg/ring/kv/consul/client.go:79), so after consuming the JOINING registration update the querier learns about the ACTIVE flip ≥ ~1s later (longer under arm64 runner contention), while the test's three metric waits can all pass on their first poll (300–600ms backoff).

This is not a production bug: JOINING exists precisely to keep BlocksRead away from store-gateways that have not completed their initial blocks sync. The gap is purely in the test's readiness conditions — they assert producer-side (store-gateway) progress but never the consumer-side state (the querier's ring view) that the query path actually consults.

Correcting the original triage in #7605

The ResourceExhausted ... exceeded series limit ... limit 1 violated (got 2) store-gateway log lines quoted in the issue were misattributed: they appear at 16:20:23 — after this test had already failed (16:20:11) — and belong to the next test, TestQuerierWithBlocksStorageLimits, which deliberately sets -querier.max-fetched-series-per-query=1 (and passed). This test sets no series limit. Two more negatives ruled out while decoding: the "reuse grpc buffer in querier's store-gateway stream" change is unrelated (the failing query never reaches the store-gateway stream), and cortex_bucket_store_blocks_loaded == 1 is the correct expected count (cortex_ingester_shipper_uploads_total == 1 is asserted earlier in the test).

The fix

One wait inserted after the existing ring/blocks waits and before the first query, so the test waits until the querier's own view of the store-gateway ring shows 1 ACTIVE member:

	// Wait until the querier observes the store-gateway as ACTIVE in its view of the store-gateway
	// ring: the store-gateway registers as JOINING and switches to ACTIVE only after the initial
	// blocks sync, so the waits above can all pass while queries would still fail with
	// "at least 1 healthy replica required, could only find 0" (500). Keep after the tokens wait.
	require.NoError(t, querier.WaitSumMetricsWithOptions(e2e.Equals(1), []string{"cortex_ring_members"}, e2e.WithLabelMatchers(
		labels.MustNewMatcher(labels.MatchEqual, "name", "store-gateway-client"),
		labels.MustNewMatcher(labels.MatchEqual, "state", "ACTIVE"),
	)))

Why this deterministically closes the window rather than just shrinking it:

  • The awaited cortex_ring_members{name="store-gateway-client",state="ACTIVE"} gauge is computed by updateRingMetrics from the same mutex-guarded ringDesc that GetClientsFor → ring.Get → Filter consults (pkg/ring/ring.go), so the wait condition is the exact negation of the failure condition.
  • ACTIVE is monotonic within this test (no store-gateway restarts/leaves), and the heartbeat period (15s) is far below the 1m heartbeat timeout, so the instance cannot become unhealthy between the wait and the query.
  • The wait sits after the cortex_ring_tokens_total == 512*2 wait, which guarantees the querier has already processed a non-empty store-gateway ring snapshot; updateRingMetrics zero-fills the per-state series (including ACTIVE) on that first snapshot (pkg/ring/ring.go:670-681), so the bare matcher form (no e2e.WaitMissingMetrics) cannot hit "metric not found" here — and if the waits are ever reordered, it fails loudly on the first poll instead of reintroducing a flake (hence the ordering note in the comment).

Precedent: commit 32bd46f (#5975) fixed this same store-gateway-not-yet-ACTIVE race with the identical inline wait in integration/backward_compatibility_test.go.

Why not other approaches

  • Shared helper in integration/util.go: this race is being fixed in two independent PRs (see Scope below); a helper introduced by either would couple their merge order. Three one-line call sites don't yet earn the abstraction, and the in-tree precedent (Fix flaky TestNewDistributorsCanPushToOldIngestersWithReplication test due SG not ready #5975) is inline. A helper can be promoted in a later sweep.
  • e2e.WaitMissingMetrics: provably unnecessary after the tokens wait (the ACTIVE series exists, zero-filled, from the first snapshot onward), and it would trade an immediate loud failure on misplacement for silent busy-polling.
  • Query retry / timeout bumps in the e2e client: would mask genuine first-query 500s — the rejected shape of Fix flaky integration tests: increase e2e timeouts and retry limits #7306.
  • Production changes (admit JOINING in BlocksRead, querier retry on 0 healthy replicas, register tokens only at ACTIVE): the observed behavior is the designed cold-start guarantee — changing it for a test race would weaken the consistency model during rollouts.

Scope

Which issue(s) this PR fixes

Fixes #7605

Checklist

  • Tests updated (the change is itself a test fix)
  • Documentation added — N/A (no flags or config changed; make doc not required)
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
  • docs/configuration/v1-guarantees.md updated if this PR introduces experimental flags — N/A

Test plan

  • gofmt -l / goimports -local github.com/cortexproject/cortex -l integration/querier_test.go — clean.
  • go vet -tags "slicelabels,integration,integration_querier" ./integration/ — clean.
  • go test -c -tags "slicelabels,integration,integration_querier" -o /dev/null ./integration/ — compiles.
  • The Docker-based run is exercised by CI (integration_querier job). Since the flake is a race, a flake-count run can't prove absence; the determinism argument above (the waited gauge and the query path read the same lock-guarded ring state) is the stability proof. Reviewer repro recipe (not run here; requires Docker): temporarily add -consul.watch-rate-limit=0.2 to the querier's flags to widen the JOINING-view window to ~5s — the unpatched test should then fail near-deterministically while the patched test passes. The knob is intentionally not part of this PR.

AI assistance disclosure (per GENAI_POLICY.md): the bulk of this contribution — CI log decoding, root-cause analysis, the test change, and this description — was produced with AI assistance and reviewed by the contributor.

🤖 Generated with Claude Code

…estQuerierWithBlocksStorageOnMissingBlocksFromStorage

The first happy-path query in this integration test intermittently
failed with a 500 on arm64 CI (issue cortexproject#7605). Decoding the gzipped 500
response body logged by the querier in both failing runs shows the
failure is querier-local and the query never reached the store-gateway:

  expanding series: failed to get store-gateway replication set owning
  the block <ULID>: at least 1 healthy replica required, could only
  find 0 - unhealthy instances: 172.18.0.8:9095

The store-gateway registers in the ring with all its tokens in JOINING
state (pkg/storegateway/gateway.go:461), runs the initial blocks sync
(which is what drives cortex_bucket_store_blocks_loaded to 1), and only
then switches to ACTIVE (gateway.go:333). The test's existing waits
(querier ring tokens 512*2, store-gateway ring tokens 512,
blocks_loaded 1) are therefore all satisfiable while the querier's view
of the store-gateway ring still holds the instance in JOINING, and the
BlocksRead ring operation admits ACTIVE instances only
(pkg/storegateway/gateway_ring.go:49-55), so the first query fails with
the error above (pkg/ring/replication_strategy.go:93, wrapped at
pkg/querier/blocks_store_replicated_set.go:127). The window is
structural: the querier's consul watch is rate-limited to 1 req/s
(pkg/ring/kv/consul/client.go:79), so the JOINING->ACTIVE flip can
reach the querier's ring view over a second after the store-gateway
CASed it, while all three metric waits can pass on their first poll.

Close the race by additionally waiting until the querier exposes
cortex_ring_members{name="store-gateway-client",state="ACTIVE"} == 1
before the first query - the same inline readiness wait used to fix the
identical race in backward_compatibility_test.go (cortexproject#5975, 32bd46f).
The waited gauge is computed from the same mutex-guarded ring
descriptor that the querier's GetClientsFor consults, so the wait
condition is the exact negation of the failure condition. The
deliberate post-deletion 500 assertions are unchanged.

Fixes cortexproject#7605

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@sandy2008 sandy2008 force-pushed the fix/flaky-querier-missing-blocks-sg-active-wait branch from 73bc245 to 9101b6f Compare June 11, 2026 06:55
Comment thread CHANGELOG.md
* [BUGFIX] Query Frontend: Fix native histogram responses not being handled correctly in `minTime()` sort ordering for split_by_interval merge. #7555
* [BUGFIX] Distributor: Release the push worker pool goroutines on shutdown by stopping the async executor during the stopping phase when `-distributor.num-push-workers` is set. #7602
* [BUGFIX] Querier: Fix unbounded resource leak in the bucket-scan blocks finder (used when the bucket index is disabled). Per-tenant metadata fetchers, their Prometheus registries, and on-disk meta caches are now evicted once a tenant is no longer active, instead of being retained for the lifetime of the process. #7573
* [BUGFIX] Querier: Fix flake in integration test TestQuerierWithBlocksStorageOnMissingBlocksFromStorage by waiting for the querier to see the store-gateway ACTIVE in the ring before the first query. #7615

@SungJin1212 SungJin1212 Jun 12, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you delete the changelog since this PR only fixes tests with no user-facing behavior change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: TestQuerierWithBlocksStorageOnMissingBlocksFromStorage (integration_querier, arm64) — transient 500 on the pre-deletion query

2 participants