[opt](csv reader) optimize nullable string deserialization in CSV/text load hot path#64476
Conversation
…t load hot path Eliminate per-row per-column overhead when loading CSV/hive-text data: 1. Use assert_cast<..., TypeCheckOnRelease::DISABLE> in _deserialize_nullable_string so the release build performs a plain static_cast instead of a typeid comparison per cell. Debug builds still verify the cast. 2. Write the string/null_map directly instead of going through DataTypeStringSerDe::deserialize_one_cell_from_csv/hive_text, which removes the SerDe call layer and its internal per-cell assert_cast. The SerDe methods never fail, so the old fill-null-on-failure branch was dead code. 3. Add _reserve_nullable_string_columns, called once per batch: it performs checked assert_casts (backing the unchecked per-row casts with a real type validation per batch) and reserves offsets/null_map capacity to avoid incremental PODArray growth in the row loop. The virtual _deserialize_nullable_string dispatch is kept, so TextReader's hive-text semantics (different escape handling and null detection) remain intact.
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
/review |
There was a problem hiding this comment.
Pull request overview
Optimizes the CSV/Hive-text load hot path for nullable string columns by removing per-cell SerDe overhead and amortizing type validation/capacity reservations to once per batch, improving load throughput for wide tables.
Changes:
- Inline nullable-string deserialization in
CsvReader/TextReaderto avoid SerDe calls and repeated per-cellassert_castchecks in release builds. - Add
_reserve_nullable_string_columns(...)to validate concrete column types once per batch and pre-reserve offsets/null-map capacity. - Switch hot-path casts to
assert_cast<..., TypeCheckOnRelease::DISABLE>after per-batch validation.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| be/src/format/text/text_reader.cpp | Inlines hive-text nullable string parsing and uses unchecked release casts for the per-cell hot path. |
| be/src/format/csv/csv_reader.h | Declares _reserve_nullable_string_columns(...) helper for per-batch validation/reserve. |
| be/src/format/csv/csv_reader.cpp | Calls the new reserve helper per batch; inlines nullable string CSV parsing and disables release cast checks in the hot loop. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Review result: no blocking issues found.
Critical checkpoint conclusions:
- Goal/test: the PR optimizes nullable string deserialization in CSV/text load/query row readers. The changed code preserves the existing null-format, empty-field-as-null, quote-trimming, and escape behavior by mirroring the previous
DataTypeStringSerDecalls. No new automated test was added; I did not run BE unit/regression tests. - Scope: focused three-file BE change with no unrelated refactor in the PR diff.
- Concurrency: no new shared state or thread handoff. The modified columns are batch-local through the existing scanner/block ownership path.
- Lifecycle/static initialization: no new global/static lifecycle; the previous function-local static SerDe is removed.
- Config/compatibility: no new config, protocol, storage-format, or FE-BE compatibility surface.
- Parallel paths: both CSV and hive-text nullable string fast paths were updated; non-string and non-nullable paths continue through the existing SerDe path.
- Conditions/invariants: the release-disabled casts are backed by checked casts once per batch in
_reserve_nullable_string_columns, and the same concrete column assumptions already existed in the prior per-cell path. - Test results/style: no result files were changed.
git diff --checkpassed for the three PR files.build-support/check-format.shcould not run in this runner because the availableclang-formatis not version 16. - Observability/transactions/data writes: no transaction, persistence, storage visibility, or observability changes are involved.
- Performance: the optimization removes per-cell virtual/type-check overhead and reserves offsets/null maps without preallocating unpredictable string payload bytes.
User focus: no additional user-provided review focus was present, and I found no extra issue in that area.
TPC-H: Total hot run time: 28660 ms |
TPC-DS: Total hot run time: 167669 ms |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #60920 (previous attempt, superseded by this stateless implementation)
Problem Summary:
When loading CSV data, every column is read as a nullable string, so
_deserialize_nullable_stringis the per-row per-column hot path (ClickBench: 105 columns x 100M rows = ~10.5 billion cells). Flame graph shows two major per-cell overheads:assert_cast<ColumnNullable&>performs a typeid comparison per cell in release builds.DataTypeStringSerDe::deserialize_one_cell_from_csvadds a call layer with another per-cellassert_cast<ColumnString&>inside, plus Status plumbing. Its fill-null-on-failure branch is dead code since the method never fails.Changes
assert_cast<..., TypeCheckOnRelease::DISABLE>inCsvReader::_deserialize_nullable_stringandTextReader::_deserialize_nullable_string, which compiles to a plainstatic_castin release builds. Debug builds still verify the cast.ColumnNullable::insert_data/DataTypeStringSerDeimplementations). The virtual_deserialize_nullable_stringdispatch is kept, so TextReader's hive-text semantics (different escape handling and null detection) remain intact._reserve_nullable_string_columns, called once per batch: it performs checkedassert_casts (backing the unchecked per-row casts with a real type validation per batch, throwing instead of UB on mismatch) and reserves offsets/null_map capacity to avoid incremental PODArray growth in the row loop.The implementation is stateless: no cached column pointers, no per-batch member state to initialize/clear.
Performance
A/B test on full ClickBench dataset (73GB / 100M rows / 105 columns), identical deployment and config, only the BE binary differs:
All 10 splits (10M rows each) improved consistently by 14-18% with small variance. Loaded row counts are identical between the two runs (99,997,497 rows).
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)