fix(querier): de-flake TestQuerierWithBlocksStorageOnMissingBlocksFromStorage by waiting for ACTIVE store-gateway in querier ring view by sandy2008 · Pull Request #7615 · cortexproject/cortex

sandy2008 · 2026-06-11T00:03:15Z

What this PR does

De-flakes TestQuerierWithBlocksStorageOnMissingBlocksFromStorage (integration_querier, observed on arm64): the first happy-path query — before any blocks are deleted — intermittently failed with server_error: server error: 500 at querier_test.go:367. This is a test-only change: one additional readiness wait plus a CHANGELOG.md line. No production code is touched, and the test's deliberate post-deletion 500 assertions are unchanged.

Root cause (decoded from the CI logs)

The querier logs the 500 response body gzipped inside the Response: field of its request log line. Unescaping the Go-quoted bytes and gunzipping them yields the same body in both failing runs (modulo block ULID):

{"status":"error","errorType":"server_error","error":"expanding series: failed to get store-gateway replication set owning the block 01KT4J55J7SX01JD4N5RD8F3WP: at least 1 healthy replica required, could only find 0 - unhealthy instances: 172.18.0.8:9095"}

Run 26832378915 (2026-06-02): 500 at 16:20:11.513, server latency 5.19ms, block 01KT4J55J7SX01JD4N5RD8F3WP.
Run 26632776611 (2026-05-30): 500 at 01:15:25.864, server latency 4.13ms, block 01KSV76AWZPKY9H51G4JCGTRYX.

The 4–5ms latency means the error is querier-local: the query never reached the store-gateway. The mechanism:

The store-gateway registers in the ring with all 512 tokens in JOINING state (pkg/storegateway/gateway.go:461, OnRingInstanceRegister), runs InitialSync (gateway.go:325 — this is what drives cortex_bucket_store_blocks_loaded to 1), and only then CASes itself to ACTIVE (gateway.go:333).
All three of the test's readiness waits are satisfiable while the querier still sees the store-gateway in JOINING: ring tokens are registered (and counted by cortex_ring_tokens_total) already at JOINING, and the other two waits are store-gateway-local metrics.
The querier's BlocksRead ring operation admits ACTIVE instances only (pkg/storegateway/gateway_ring.go:49-55); with IgnoreUnhealthyInstancesReplicationStrategy (pkg/querier/blocks_store_queryable.go:236) and the single store-gateway this test runs, a JOINING instance yields exactly at least 1 healthy replica required, could only find 0 - unhealthy instances: ... (pkg/ring/replication_strategy.go:93), wrapped at pkg/querier/blocks_store_replicated_set.go:127 into the decoded message above → HTTP 500.
The race window is structural, not just "slow CI": the querier's consul WatchKey poll is rate-limited to 1 req/s by default (pkg/ring/kv/consul/client.go:79), so after consuming the JOINING registration update the querier learns about the ACTIVE flip ≥ ~1s later (longer under arm64 runner contention), while the test's three metric waits can all pass on their first poll (300–600ms backoff).

This is not a production bug: JOINING exists precisely to keep BlocksRead away from store-gateways that have not completed their initial blocks sync. The gap is purely in the test's readiness conditions — they assert producer-side (store-gateway) progress but never the consumer-side state (the querier's ring view) that the query path actually consults.

Correcting the original triage in #7605

The ResourceExhausted ... exceeded series limit ... limit 1 violated (got 2) store-gateway log lines quoted in the issue were misattributed: they appear at 16:20:23 — after this test had already failed (16:20:11) — and belong to the next test, TestQuerierWithBlocksStorageLimits, which deliberately sets -querier.max-fetched-series-per-query=1 (and passed). This test sets no series limit. Two more negatives ruled out while decoding: the "reuse grpc buffer in querier's store-gateway stream" change is unrelated (the failing query never reaches the store-gateway stream), and cortex_bucket_store_blocks_loaded == 1 is the correct expected count (cortex_ingester_shipper_uploads_total == 1 is asserted earlier in the test).

The fix

One wait inserted after the existing ring/blocks waits and before the first query, so the test waits until the querier's own view of the store-gateway ring shows 1 ACTIVE member:

	// Wait until the querier observes the store-gateway as ACTIVE in its view of the store-gateway
	// ring: the store-gateway registers as JOINING and switches to ACTIVE only after the initial
	// blocks sync, so the waits above can all pass while queries would still fail with
	// "at least 1 healthy replica required, could only find 0" (500). Keep after the tokens wait.
	require.NoError(t, querier.WaitSumMetricsWithOptions(e2e.Equals(1), []string{"cortex_ring_members"}, e2e.WithLabelMatchers(
		labels.MustNewMatcher(labels.MatchEqual, "name", "store-gateway-client"),
		labels.MustNewMatcher(labels.MatchEqual, "state", "ACTIVE"),
	)))

Why this deterministically closes the window rather than just shrinking it:

The awaited cortex_ring_members{name="store-gateway-client",state="ACTIVE"} gauge is computed by updateRingMetrics from the same mutex-guarded ringDesc that GetClientsFor → ring.Get → Filter consults (pkg/ring/ring.go), so the wait condition is the exact negation of the failure condition.
ACTIVE is monotonic within this test (no store-gateway restarts/leaves), and the heartbeat period (15s) is far below the 1m heartbeat timeout, so the instance cannot become unhealthy between the wait and the query.
The wait sits after the cortex_ring_tokens_total == 512*2 wait, which guarantees the querier has already processed a non-empty store-gateway ring snapshot; updateRingMetrics zero-fills the per-state series (including ACTIVE) on that first snapshot (pkg/ring/ring.go:670-681), so the bare matcher form (no e2e.WaitMissingMetrics) cannot hit "metric not found" here — and if the waits are ever reordered, it fails loudly on the first poll instead of reintroducing a flake (hence the ordering note in the comment).

Precedent: commit 32bd46f (#5975) fixed this same store-gateway-not-yet-ACTIVE race with the identical inline wait in integration/backward_compatibility_test.go.

Why not other approaches

Shared helper in integration/util.go: this race is being fixed in two independent PRs (see Scope below); a helper introduced by either would couple their merge order. Three one-line call sites don't yet earn the abstraction, and the in-tree precedent (Fix flaky TestNewDistributorsCanPushToOldIngestersWithReplication test due SG not ready #5975) is inline. A helper can be promoted in a later sweep.
e2e.WaitMissingMetrics: provably unnecessary after the tokens wait (the ACTIVE series exists, zero-filled, from the first snapshot onward), and it would trade an immediate loud failure on misplacement for silent busy-polling.
Query retry / timeout bumps in the e2e client: would mask genuine first-query 500s — the rejected shape of Fix flaky integration tests: increase e2e timeouts and retry limits #7306.
Production changes (admit JOINING in BlocksRead, querier retry on 0 healthy replicas, register tokens only at ACTIVE): the observed behavior is the designed cold-start guarantee — changing it for a test race would weaken the consistency model during rollouts.

Scope

One hunk in integration/querier_test.go (this test only) + one CHANGELOG.md line.
Flaky test: TestQuerierWithStoreGatewayDataBytesLimits (integration_querier, arm64) — got 500 instead of expected 422 #7606 is the same root cause, decoded from the same CI job (identical message at 16:20:34.929, block 01KT4J5W1MDTKP13EGBC8GEH3B, in TestQuerierWithStoreGatewayDataBytesLimits). It is being fixed in a separate PR that adds the same wait to TestQuerierWithStoreGatewayDataBytesLimits and TestQuerierWithBlocksStorageLimits. The two PRs have zero overlapping hunks and are merge-order independent.
Known same-shape sites intentionally left untouched (never flaked; currently shielded by an incidental 3s sleep): integration/query_frontend_test.go:533-536 (TestQueryFrontendNoRetryChunkPool) and integration/native_histogram_test.go:184-186. They are candidates for a follow-up readiness sweep (where a shared helper becomes justified by call-site count), coordinated with the flaky-test umbrellas Flaky TestQuerierWithBlocksStorageRunningInSingleBinaryMode #2856 / TestQuerierWithBlocksStorageRunningInMicroservicesMode is flaky #5306 — without claiming those issues share this root cause.

Which issue(s) this PR fixes

Fixes #7605

Checklist

Tests updated (the change is itself a test fix)
Documentation added — N/A (no flags or config changed; make doc not required)
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
docs/configuration/v1-guarantees.md updated if this PR introduces experimental flags — N/A

Test plan

gofmt -l / goimports -local github.com/cortexproject/cortex -l integration/querier_test.go — clean.
go vet -tags "slicelabels,integration,integration_querier" ./integration/ — clean.
go test -c -tags "slicelabels,integration,integration_querier" -o /dev/null ./integration/ — compiles.
The Docker-based run is exercised by CI (integration_querier job). Since the flake is a race, a flake-count run can't prove absence; the determinism argument above (the waited gauge and the query path read the same lock-guarded ring state) is the stability proof. Reviewer repro recipe (not run here; requires Docker): temporarily add -consul.watch-rate-limit=0.2 to the querier's flags to widen the JOINING-view window to ~5s — the unpatched test should then fail near-deterministically while the patched test passes. The knob is intentionally not part of this PR.

AI assistance disclosure (per GENAI_POLICY.md): the bulk of this contribution — CI log decoding, root-cause analysis, the test change, and this description — was produced with AI assistance and reviewed by the contributor.

🤖 Generated with Claude Code

…estQuerierWithBlocksStorageOnMissingBlocksFromStorage The first happy-path query in this integration test intermittently failed with a 500 on arm64 CI (issue cortexproject#7605). Decoding the gzipped 500 response body logged by the querier in both failing runs shows the failure is querier-local and the query never reached the store-gateway: expanding series: failed to get store-gateway replication set owning the block <ULID>: at least 1 healthy replica required, could only find 0 - unhealthy instances: 172.18.0.8:9095 The store-gateway registers in the ring with all its tokens in JOINING state (pkg/storegateway/gateway.go:461), runs the initial blocks sync (which is what drives cortex_bucket_store_blocks_loaded to 1), and only then switches to ACTIVE (gateway.go:333). The test's existing waits (querier ring tokens 512*2, store-gateway ring tokens 512, blocks_loaded 1) are therefore all satisfiable while the querier's view of the store-gateway ring still holds the instance in JOINING, and the BlocksRead ring operation admits ACTIVE instances only (pkg/storegateway/gateway_ring.go:49-55), so the first query fails with the error above (pkg/ring/replication_strategy.go:93, wrapped at pkg/querier/blocks_store_replicated_set.go:127). The window is structural: the querier's consul watch is rate-limited to 1 req/s (pkg/ring/kv/consul/client.go:79), so the JOINING->ACTIVE flip can reach the querier's ring view over a second after the store-gateway CASed it, while all three metric waits can pass on their first poll. Close the race by additionally waiting until the querier exposes cortex_ring_members{name="store-gateway-client",state="ACTIVE"} == 1 before the first query - the same inline readiness wait used to fix the identical race in backward_compatibility_test.go (cortexproject#5975, 32bd46f). The waited gauge is computed from the same mutex-guarded ring descriptor that the querier's GetClientsFor consults, so the wait condition is the exact negation of the failure condition. The deliberate post-deletion 500 assertions are unchanged. Fixes cortexproject#7605 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>

SungJin1212 · 2026-06-12T02:18:07Z

 * [BUGFIX] Query Frontend: Fix native histogram responses not being handled correctly in `minTime()` sort ordering for split_by_interval merge. #7555
 * [BUGFIX] Distributor: Release the push worker pool goroutines on shutdown by stopping the async executor during the stopping phase when `-distributor.num-push-workers` is set. #7602
 * [BUGFIX] Querier: Fix unbounded resource leak in the bucket-scan blocks finder (used when the bucket index is disabled). Per-tenant metadata fetchers, their Prometheus registries, and on-disk meta caches are now evicted once a tenant is no longer active, instead of being retained for the lifetime of the process. #7573
+* [BUGFIX] Querier: Fix flake in integration test TestQuerierWithBlocksStorageOnMissingBlocksFromStorage by waiting for the querier to see the store-gateway ACTIVE in the ring before the first query. #7615


Can you delete the changelog since this PR only fixes tests with no user-facing behavior change?

pull-request-size Bot added the size/S label Jun 11, 2026

dosubot Bot added component/querier type/flaky-test labels Jun 11, 2026

sandy2008 mentioned this pull request Jun 11, 2026

Flaky test: TestQuerierWithBlocksStorageOnMissingBlocksFromStorage (integration_querier, arm64) — transient 500 on the pre-deletion query #7605

Open

sandy2008 force-pushed the fix/flaky-querier-missing-blocks-sg-active-wait branch from 73bc245 to 9101b6f Compare June 11, 2026 06:55

SungJin1212 reviewed Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(querier): de-flake TestQuerierWithBlocksStorageOnMissingBlocksFromStorage by waiting for ACTIVE store-gateway in querier ring view#7615

fix(querier): de-flake TestQuerierWithBlocksStorageOnMissingBlocksFromStorage by waiting for ACTIVE store-gateway in querier ring view#7615
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix/flaky-querier-missing-blocks-sg-active-wait

sandy2008 commented Jun 11, 2026

Uh oh!

SungJin1212 Jun 12, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sandy2008 commented Jun 11, 2026

What this PR does

Root cause (decoded from the CI logs)

Correcting the original triage in #7605

The fix

Why not other approaches

Scope

Which issue(s) this PR fixes

Checklist

Test plan

Uh oh!

SungJin1212 Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

SungJin1212 Jun 12, 2026 •

edited

Loading