Skip to content

perf(cubestore): early compaction split#11079

Open
waralexrom wants to merge 4 commits into
cubestore-compaction-improvementfrom
cubestore-early-split
Open

perf(cubestore): early compaction split#11079
waralexrom wants to merge 4 commits into
cubestore-compaction-improvementfrom
cubestore-early-split

Conversation

@waralexrom

Copy link
Copy Markdown
Member

Check List

  • Tests have been run in packages where changes have been made if available
  • Linter has been run for changed code
  • Tests for the changes have been added if not covered yet
  • Docs have been added / updated if required

…multiplier, size-aware split

- Import: cap a CSV data frame by estimated Arrow buffer size in addition to
  row count (CUBESTORE_WAL_SPLIT_SIZE_THRESHOLD, default 100MB) so wide rows
  bound peak ingest memory instead of accumulating 524K-row frames.
- Compaction: CUBESTORE_COMPACTION_CHUNKS_THRESHOLD_MULTIPLIER (f64, default 3.0)
  scales the persisted-chunk count and total-size thresholds, letting more
  chunks accumulate and merge in one streaming k-way pass.
- Compaction split now sizes children by the bytes actually written (existing
  main table + pending chunk bytes) so a partition splits by file size in a
  single pass instead of under-splitting and re-splitting.

Ignores 4 tests whose partition-count assertions assume the old lagged
split-by-file-size semantics (FIXME(early-split)).
@waralexrom waralexrom requested review from a team and keydunov as code owners June 13, 2026 02:45
@github-actions github-actions Bot added cube store Issues relating to Cube Store rust Pull requests that update Rust code javascript Pull requests that update Javascript code labels Jun 13, 2026
@claude

claude Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Claude finished @waralexrom's task in 1m 7s —— View job


Review: perf(cubestore): early compaction split

1 high, 2 medium, 2 low

Full review (click to expand)
  • Read PR diff
  • Review Rust changes (cubestore)
  • Review JS env config changes
  • Check for existing review threads
  • Post review

Summary

The core change has three parts:

  1. compaction.rs — sizing the partition split by partition_file_size + chunks_total_file_size so a partition whose chunks dwarf the main table splits in one pass instead of under-splitting and re-splitting.
  2. config/mod.rs — a new compaction_chunks_threshold_multiplier (default 3.0) applied to the count + total-size compaction thresholds, deferring compaction so it runs on larger inputs that can be sized correctly.
  3. import/mod.rs — a byte cap (CUBESTORE_WAL_SPLIT_SIZE_THRESHOLD, default 100 MiB) on WAL frames during ingestion, in addition to the existing row count threshold.

The shape is reasonable and the eager-split-by-bytes logic in compaction.rs:586-598 is a sensible fix. A few concerns below.

High — three tests left #[ignore] with "FIXME needs updating"

rust/cubestore/cubestore/src/sql/mod.rs:3313 (over_10k_join), :4047 (inactive_partitions_cleanup), and rust/cubestore/cubestore/src/store/compaction.rs:2402 (partition_compaction_decimal96), :2460 (partition_split_by_file_size) are all disabled with identical comments stating the assertions are tuned to the old lagged semantics and "need updating."

These are exactly the tests that validate the behavior this PR changes — splitting and partition counts. Shipping a perf change to the split path while silencing the tests that exercise it leaves no executed coverage of the new semantics. Please update the assertions to the new expected counts in this PR rather than deferring. If the test expectations are non-trivial to recompute, that itself is signal that the new behavior deserves a fresh, explicit test of the split-by-pending-bytes path (e.g. small/empty main table + large pending chunks → split count expected).

Medium — unrelated autoRunMode env var

packages/cubejs-backend-shared/src/env.ts:713-720 adds CUBEJS_AUTO_RUN_MODE ("Whether queries are executed automatically… Only used in Cube Cloud"). This has no apparent relation to "early compaction split" and isn't read anywhere in this diff. Either it's leftover from another branch and should be dropped, or it belongs in its own PR — bundling it makes the change history harder to bisect and review.

Medium — default multiplier of 3.0 is a behavior change for existing deployments

rust/cubestore/cubestore/src/config/mod.rs:1345-1348 defaults compaction_chunks_threshold_multiplier to 3.0, which triples both compaction_chunks_total_size_threshold (8 MiB → 24 MiB) and compaction_chunks_count_threshold (4 → 12) for every existing installation on upgrade. Compaction will fire later and on larger inputs, which is the intent, but this is a meaningful operational change: more memory pressure during compaction, longer chunk-list scans, more pending data at crash time.

At minimum this should be called out in the PR description and changelog. Consider whether 1.0 is a safer default and the 3.0 value should be opt-in per deployment until validated under production load.

Low — multiplier can floor count threshold to zero

config/mod.rs:715-719: (compaction_chunks_count_threshold as f64 * multiplier).floor() as u64 can produce 0 for small multipliers (e.g. 0.1 * 4 = 0.4 → 0). Not a real risk at the default, but a .max(1) would make the getter robust to any operator-supplied value.

Low — Float32 path: as f64 cast precision

query_executor.rs:2181-2190: (a.value(i) as f64).into() is fine; just noting that f32 → f64 is exact so there's no precision loss to worry about. The added test at :2574 covers it. Nice addition, including the Float16/Float32/Float64 → ColumnType::Float test at :2570-2576.

Nits

  • import/mod.rs:90-101 — the estimate_arrow_value_size accounting (4 bytes per offset for varlen, 16 for Int96/Decimal*, 1 for bool) looks correct and the "exactness is unnecessary for a threshold" comment is the right framing.
  • compaction.rs:586-598 — the unwrap_or(0) + chunks_total_file_size form is clean; the comment explains why well.
· [Branch](https://github.com/cube-js/cube/tree/cubestore-early-split)

@waralexrom waralexrom force-pushed the cubestore-early-split branch from 281cc13 to 11eb290 Compare June 13, 2026 02:45
@claude

claude Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Claude finished @waralexrom's task in 1m 32s —— View job


Claude Code Review

Verdict: Approach is sound (size-aware import split + eager compaction split by total bytes), but landing it disables four behavioral tests and silently rescales two existing env vars. Both should be addressed before merge. 1 high, 1 medium, 2 low.

Details

Tasks

  • Read PR diff
  • Review Rust changes (cubestore)
  • Review JS env config changes (none — only Rust env parsing)
  • Check for existing review threads (none)
  • Post review

High

  • sql/mod.rs:3314, compaction.rs:2403/2460 — Four tests #[ignore]d with FIXME ("assertions tuned to the old lagged semantics and need updating"). These are precisely the tests that cover splitting/compaction behavior this PR changes. Update them, or at minimum file a tracked follow-up and reference it in the ignore reason.

Medium

  • config/mod.rs:713–723compaction_chunks_threshold_multiplier (default 3.0) is folded into the accessors for both compaction_chunks_total_size_threshold() and compaction_chunks_count_threshold(). Existing deployments that tuned CUBESTORE_CHUNKS_TOTAL_SIZE_THRESHOLD / CUBESTORE_CHUNKS_COUNT_THRESHOLD will see their configured values silently 3×'d on upgrade. Either default the multiplier to 1.0, apply it only at the call sites that need the widening, or document the rescale.

Low

  • import/mod.rs:794–807estimated_bytes accumulates inside the visit_line closure regardless of is_data_row, while num_rows only counts data rows. Cosmetic given the 100MB threshold, but inconsistent.
  • compaction.rs:590–599chunks_total_file_size is capped by the in-memory size break at line 537, so on partitions with very large pending backlogs the split still reflects only the bytes in this pass, not the eventual total. The comment reads like the full pending set is counted; worth a one-line clarification.

Nits / observations

  • estimate_arrow_value_size: + 4 for variable-length offset is fine as an approximation; null strings still contribute 4 bytes/row, which is harmless at scale. Boolean = 1 is conservative (Arrow packs to bits). Decimal96 = 16 is slightly over (i96 = 12) — fine for a threshold.
  • wal_split_size_threshold_bytes uses env_parse_size (accepts 100M etc.) — good UX, matches sibling size knobs.
  • Comment at compaction.rs:586–589 clearly explains the under-splitting failure mode this fixes; nice.
· branch: `cubestore-early-split`

Comment on lines 713 to +723
}

fn compaction_chunks_total_size_threshold(&self) -> u64 {
self.compaction_chunks_total_size_threshold
(self.compaction_chunks_total_size_threshold as f64
* self.compaction_chunks_threshold_multiplier)
.floor() as u64
}

fn compaction_chunks_count_threshold(&self) -> u64 {
self.compaction_chunks_count_threshold
(self.compaction_chunks_count_threshold as f64
* self.compaction_chunks_threshold_multiplier)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiplier silently rescales two pre-existing env vars.

The multiplier (default 3.0) is applied inside the accessors for both compaction_chunks_total_size_threshold() and compaction_chunks_count_threshold(). That means every existing deployment that has tuned CUBESTORE_CHUNKS_TOTAL_SIZE_THRESHOLD or CUBESTORE_CHUNKS_COUNT_THRESHOLD will now see those values multiplied by 3 with no opt-in — a 4-chunk count threshold becomes 12, an 8MB size threshold becomes 24MB. That's a meaningful behavioral change for an upgrade.

A few options worth considering:

  • Apply the multiplier only at the call sites that need the early-split widening, leaving the trait getters returning the configured value verbatim.
  • Default the multiplier to 1.0 and let operators opt in.
  • At minimum, document this in the changelog / config docs so the rescale isn't a surprise.

Also worth a comment near the field explaining what the multiplier is widening and why — the name reads like a simple knob, but it changes the contract of two other knobs.

Comment thread rust/cubestore/cubestore/src/sql/mod.rs Outdated
Comment on lines 800 to 807
let line = line?;
let is_data_row = parser.visit_line(line.as_str(), |insert_pos, column, value| {
let builder = builders[insert_pos].as_mut();
estimated_bytes +=
ImportFormat::estimate_arrow_value_size(column.get_column_type(), value);
match value {
None => {
append_value(builder, column.get_column_type(), &TableValue::Null);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: estimated_bytes accumulates on non-data rows.

The visit_line closure increments estimated_bytes unconditionally, but num_rows += 1 is gated on is_data_row. If headers / non-data lines ever invoke the per-column callback, their bytes silently leak into the running total and never get reset on continue. In practice the leak is tiny relative to a 100MB threshold so this is cosmetic, but it would be cleaner to either accumulate into a per-line scratch and commit it only when is_data_row, or reset both counters in the same place.

Comment on lines +590 to +599
let new_partitions_count_by_file_size = {
let total_file_size =
partition.get_row().file_size().unwrap_or(0) + chunks_total_file_size;
if total_file_size > 0 {
let threshold = self.config.partition_size_split_threshold_bytes();
(div_ceil(partition_file_size, threshold) as usize).min(16)
(div_ceil(total_file_size, threshold) as usize).min(16)
} else {
1
};
}
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that chunks_total_file_size here is capped by the in-memory size threshold break above (line 537) — if pending chunks exceed that threshold, only the chunks taken into this compaction pass are counted, not the full pending set. That's likely the intended behavior (we split by what we're actually writing), and matches the comment, but it does mean a partition with a huge backlog of pending chunks will still under-split on any single pass relative to the eventual total. Worth a one-line note acknowledging this, since the comment currently implies the split reflects "the bytes actually being written" without flagging the cap.

…t budget wait

Greppable by "csv-import-timing": logs the moment a CREATE TABLE request is
received, and the CSV import job start both before and after the rate-limiter
budget wait (with wait_ms), so the budget-wait gap is readable without trace.
@claude

claude Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Claude finished @waralexrom's task in 50s —— View job


Review: latest push (info-level timing logs)

Verdict: New commit is a small diagnostic-logging-only change. No new substantive findings. The four prior unresolved review threads still apply to the current diff.

Details

What's new since the last review

Commit 7c17003 only adds two info! log sites tagged [csv-import-timing]:

  • sql/mod.rs:920 — logs every CREATE TABLE request received.
  • cluster/ingestion/job_runner.rs:318,322 — logs before and after the rate-limiter wait_for_allow call on CSV import jobs, with wait_ms.

The substantive logic of the PR (size-aware split, threshold multiplier, WAL byte cap) is unchanged.

Observations on the logging

  • Info-level on the CREATE TABLE hot path. [csv-import-timing] fires once per CREATE TABLE and twice per CSV import job at info-level. Greppability is fine, but on busy clusters this becomes persistent log volume for what looks like a temporary diagnostic. Consider debug! or gating behind a config flag if these are meant to stay; if they are temporary, a TODO/issue link in the message would make their lifecycle obvious.
  • Mixed macro styles. sql/mod.rs:920 uses log::info! while job_runner.rs:316 uses bare info! — minor inconsistency, both work but the file already imports one or the other; pick whichever matches the surrounding module.

Prior threads still applicable (not re-posting)

All four open Claude threads remain unresolved and the underlying code is unchanged:

  • config/mod.rs:723 — multiplier silently 3×'s CUBESTORE_CHUNKS_TOTAL_SIZE_THRESHOLD / CUBESTORE_CHUNKS_COUNT_THRESHOLD for existing deployments.
  • sql/mod.rs:3318 — four #[ignore]d tests covering the behavior this PR changes.
  • import/mod.rs:807estimated_bytes accumulates on non-data rows.
  • compaction.rs:599chunks_total_file_size is capped by the in-memory size break upstream; comment should flag this.

Re-affirming the four prior threads rather than posting duplicates.

· · Branch

… logs

Inline DefaultHasher loc_hash on every csv-import-timing line (create_table
loc_hashes, import job before/after budget wait) plus job completed/failed
lines, so a location ties create_table to its import job lifecycle.
@claude

claude Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Claude finished @waralexrom's task in 43s —— View job


Review: latest push (bafd592 — csv-import-timing log enrichment)

Verdict: Diagnostic-logging-only change. No new substantive findings. Four prior unresolved review threads still apply to the current diff.

Details
  • Diff prior commit (7c17003..HEAD) reviewed
  • Existing review threads checked for duplicates
  • Logging-only change confirmed — substantive PR logic unchanged

What's new since the last review

bafd592 adds two enrichments to the existing [csv-import-timing] log lines, no logic change:

  • sql/mod.rs:920–931create_table log now emits loc_hashes=[…] (DefaultHasher-formatted hex of each location string) so the same load can be correlated across the create_table site and the per-location job sites.
  • cluster/ingestion/job_runner.rs:312–317 — computes a loc_hash once per job spawn from the location string.
  • job_runner.rs:325, 333 — start + budget-acquired logs now include loc_hash=… alongside table_id.
  • job_runner.rs:351–354, 359–362 — new import job completed (with data_loaded) and import job failed (with err) info-level lines, also tagged with loc_hash.

Observations on the new logging

  • Stable correlation key — good. DefaultHasher is non-cryptographic but deterministic within a single process run, which is all that's needed to join a create_table line to its later per-job lines. Note that hashes are NOT comparable across processes (Rust's RandomState reseeds), so cross-cluster log joins won't work — fine for in-process diagnostics, just be aware.
  • Failure path now logged at info-level. The new import job failed line at :359 is helpful — previously a job error returned from this branch with no [csv-import-timing] line, which made failures invisible to anyone grepping the timing tag. Good addition.
  • Same caveats as the prior logging commit still apply. These are info-level on the CSV import hot path and tagged as transient diagnostic markers. If they're meant to stay, consider debug! or a config gate; if temporary, a TODO/issue link in the message would make their lifecycle obvious. (Re-flagging rather than re-posting an inline thread.)
  • Hasher import inside the closure is fine but cluttered. use std::hash::{Hash, Hasher}; inside the closure body works; if these logs become permanent, a small location_hash(&str) -> String helper near the top of the module would dedupe the seven-line idiom that now appears in two files.

Prior threads still applicable (not re-posting)

All four open Claude threads remain unresolved and the underlying code is unchanged in this push:

  • config/mod.rs:723 — multiplier silently 3×'s CUBESTORE_CHUNKS_TOTAL_SIZE_THRESHOLD / CUBESTORE_CHUNKS_COUNT_THRESHOLD for existing deployments.
  • sql/mod.rs:3327 — four #[ignore]d tests covering the behavior this PR changes.
  • import/mod.rs:807estimated_bytes accumulates on non-data rows.
  • compaction.rs:599chunks_total_file_size is capped by the in-memory size break upstream; comment should flag this.

Re-affirming the four prior threads rather than posting duplicates.

· Branch ·

@codecov

codecov Bot commented Jun 13, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.50%. Comparing base (f6b1f71) to head (97e756a).

Additional details and impacted files
@@                        Coverage Diff                        @@
##           cubestore-compaction-improvement   #11079   +/-   ##
=================================================================
  Coverage                             58.50%   58.50%           
=================================================================
  Files                                   216      216           
  Lines                                 17269    17269           
  Branches                               3524     3524           
=================================================================
  Hits                                  10103    10103           
  Misses                                 6651     6651           
  Partials                                515      515           
Flag Coverage Δ
cube-backend 58.50% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…it tests

Removes the temporary [csv-import-timing] info logs and re-enables the four
tests ignored after the size-aware split change:
- partition_compaction_decimal96, inactive_partitions_cleanup: pin
  partition_size_split_threshold_bytes high so the split stays row-based
  (these cover decimal handling / inactive-file GC, not size split).
- over_10k_join: raise max_joined_partitions above the now-larger partition
  count (the test asserts join correctness, not the cap).
- partition_split_by_file_size: assert the first compaction already splits
  (eager split-by-file-size), keeping the per-partition size invariant.
@claude

claude Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Claude finished @waralexrom's task in 51s —— View job


Review: latest push (97e756a — drop timing logs + un-ignore tests)

Verdict: Strong push. Resolves the prior High finding (four #[ignore]d tests now re-enabled with updated assertions/config) and drops all [csv-import-timing] info-level diagnostic logs. Three prior threads still apply (multiplier rescale, non-data-row byte accounting, compaction comment). No new findings.

Details

What's new since the last review

97e756a does two things, no substantive logic change to the perf path:

1. Un-ignores the four behavioral tests that earlier commits had disabled, and adjusts them to the new eager-split semantics rather than masking them:

  • sql/mod.rs:3313 over_10k_join — lifts max_joined_partitions from 10 → 50 with an inline comment that the test asserts join correctness, not the partition cap. Reasonable: the right table now fragments further because of size-aware split.
  • sql/mod.rs:4047 inactive_partitions_cleanup — sets partition_size_split_threshold_bytes = 1 MiB so a single active partition remains; comment says it covers inactive-file GC, not size split. Reasonable scoping.
  • compaction.rs:2403 partition_compaction_decimal96 — same partition_size_split_threshold_bytes = 1 MiB knob; keeps split row-based to isolate decimal-handling coverage.
  • compaction.rs:2460 partition_split_by_file_size — assertion flipped from partitions.len() == 1 to partitions.len() > 1, which is exactly what the new "eager split on first compaction when pending chunks exceed the size threshold" behavior should produce. Comment explains why.

This addresses the prior High thread directly. The previously-outdated Four tests disabled, not updated. thread is now resolved.

2. Drops the [csv-import-timing] diagnostic logs added in 7c17003/bafd592:

  • sql/mod.rs:917-930create_table log + loc_hash collection removed.
  • cluster/ingestion/job_runner.rs:312-360 — start/budget-acquired/completed/failed log lines and the DefaultHasher block removed.

This addresses the "info-level on the hot path" caveat I raised on the prior two pushes — good outcome.

Prior threads still applicable (not re-posting)

The substantive perf code (config/mod.rs multiplier, import/mod.rs byte cap, compaction.rs size-aware split) is unchanged in this push, so:

  • config/mod.rs:723 — multiplier (default 3.0) silently 3×'s CUBESTORE_CHUNKS_TOTAL_SIZE_THRESHOLD / CUBESTORE_CHUNKS_COUNT_THRESHOLD for existing deployments.
  • import/mod.rs:807estimated_bytes accumulates on non-data rows.
  • compaction.rs:599chunks_total_file_size is capped by the in-memory size break upstream; the comment doesn't flag this.

Re-affirming these three rather than posting duplicates.

· Branch ·

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cube store Issues relating to Cube Store javascript Pull requests that update Javascript code rust Pull requests that update Rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant