
[opt](csv reader) optimize nullable string deserialization in CSV/text load hot path #64476

Merged: liaoxin01 merged 1 commit into apache:master from liaoxin01:opt-csv-reader-nullable-string-v2 on Jun 13, 2026

@liaoxin01

Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #60920 (previous attempt, superseded by this stateless implementation)

Problem Summary:

When loading CSV data, every column is read as a nullable string, so _deserialize_nullable_string is the per-row per-column hot path (ClickBench: 105 columns x 100M rows = ~10.5 billion cells). Flame graph shows two major per-cell overheads:

  1. assert_cast<ColumnNullable&> performs a typeid comparison per cell in release builds.
  2. DataTypeStringSerDe::deserialize_one_cell_from_csv adds a call layer with another per-cell assert_cast<ColumnString&> inside, plus Status plumbing. Its fill-null-on-failure branch is dead code since the method never fails.
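
To make the first overhead concrete, here is a minimal sketch of a cast helper with a switchable runtime check. The names `sketch_cast`, `TypeCheck`, `IColumn`, `ColumnString`, and `OtherColumn` are hypothetical stand-ins for illustration; this is not Doris's `assert_cast` implementation, only an approximation of the behavior the description attributes to it.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <typeinfo>

// Hypothetical stand-ins for column classes; not the real Doris types.
struct IColumn {
    virtual ~IColumn() = default;
};
struct ColumnString : IColumn {
    std::string chars;
};
struct OtherColumn : IColumn {
    int value = 0;
};

enum class TypeCheck { ENABLE, DISABLE };

// ENABLE: compares typeid on every call and throws on mismatch.
// DISABLE: only a debug assertion, so with NDEBUG it is a plain static_cast.
template <typename To, TypeCheck check = TypeCheck::ENABLE>
To& sketch_cast(IColumn& from) {
    if constexpr (check == TypeCheck::ENABLE) {
        // Runs on every call, in every build type.
        if (typeid(from) != typeid(To)) {
            throw std::runtime_error("sketch_cast: unexpected column type");
        }
    } else {
        // Compiled out when NDEBUG is defined, as in typical release builds.
        assert(typeid(from) == typeid(To));
    }
    return static_cast<To&>(from);
}
```

With roughly 10.5 billion cells, even a cheap per-call comparison in the `ENABLE` branch is paid billions of times, which is why the unchecked variant matters on this path.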

Changes

  1. Use assert_cast<..., TypeCheckOnRelease::DISABLE> in CsvReader::_deserialize_nullable_string and TextReader::_deserialize_nullable_string, which compiles to a plain static_cast in release builds. Debug builds still verify the cast.
  2. Write the string column and null map directly instead of going through the SerDe layer (semantically identical, verified against ColumnNullable::insert_data / DataTypeStringSerDe implementations). The virtual _deserialize_nullable_string dispatch is kept, so TextReader's hive-text semantics (different escape handling and null detection) remain intact.
  3. Add _reserve_nullable_string_columns, called once per batch: it performs checked assert_casts (backing the unchecked per-row casts with a real type validation per batch, throwing instead of UB on mismatch) and reserves offsets/null_map capacity to avoid incremental PODArray growth in the row loop.

The implementation is stateless: no cached column pointers, no per-batch member state to initialize/clear.
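
The split between per-batch validation and per-cell direct writes can be sketched as follows. This is a simplified model under stated assumptions: `StringColumn`, `NullableColumn`, `reserve_nullable_string_column`, and `append_cell` are hypothetical names, `std::vector` stands in for PODArray, and `dynamic_cast` stands in for the checked `assert_cast`. It is not the code from this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string_view>
#include <vector>

struct IColumn {
    virtual ~IColumn() = default;
};

// Simplified string column: concatenated bytes plus one end offset per row.
struct StringColumn : IColumn {
    std::vector<char> chars;
    std::vector<std::uint32_t> offsets;
};

// Simplified nullable wrapper: nested column plus a 0/1 null flag per row.
struct NullableColumn : IColumn {
    std::unique_ptr<IColumn> nested = std::make_unique<StringColumn>();
    std::vector<std::uint8_t> null_map;
};

// Once per batch: checked casts (throw on mismatch) and capacity reservation.
void reserve_nullable_string_column(IColumn& col, std::size_t rows) {
    auto* nullable = dynamic_cast<NullableColumn*>(&col);
    if (nullable == nullptr) throw std::runtime_error("expected nullable column");
    auto* str = dynamic_cast<StringColumn*>(nullable->nested.get());
    if (str == nullptr) throw std::runtime_error("expected string column");
    nullable->null_map.reserve(nullable->null_map.size() + rows);
    str->offsets.reserve(str->offsets.size() + rows);
    // The payload bytes are not reserved: their total size is unknown up front.
}

// Once per cell: unchecked casts, then write bytes, offset, and null flag.
// Only safe because the batch-level function above validated the types.
void append_cell(IColumn& col, std::string_view field, bool is_null) {
    auto& nullable = static_cast<NullableColumn&>(col);
    auto& str = static_cast<StringColumn&>(*nullable.nested);
    if (!is_null) {
        str.chars.insert(str.chars.end(), field.begin(), field.end());
    }
    // A NULL cell still gets an offset, so the nested column stays row-aligned.
    str.offsets.push_back(static_cast<std::uint32_t>(str.chars.size()));
    nullable.null_map.push_back(is_null ? 1 : 0);
}
```

Because neither function keeps a pointer or any other state between calls, the sketch is stateless in the same sense as the description: the caller passes the column each time.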

Performance

A/B test on full ClickBench dataset (73GB / 100M rows / 105 columns), identical deployment and config, only the BE binary differs:

| Metric | Before | After | Improvement |
|---|---|---|---|
| Total load time (BE LoadTime) | 636.6s | 530.9s | -16.6% (1.20x) |
| CSV parse (ReadDataTime) | 590.6s | 484.5s | -18.0% |
| Avg throughput | 115 MB/s | 138 MB/s | +20% |

All 10 splits (10M rows each) improved consistently by 14-18% with small variance. Loaded row counts are identical between the two runs (99,997,497 rows).

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
      • Full ClickBench load A/B test, see Performance section above. Behavioral equivalence is covered by existing CSV/text load regression cases.
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

…t load hot path

Eliminate per-row per-column overhead when loading CSV/hive-text data:

1. Use assert_cast<..., TypeCheckOnRelease::DISABLE> in
   _deserialize_nullable_string so the release build performs a plain
   static_cast instead of a typeid comparison per cell. Debug builds
   still verify the cast.
2. Write the string/null_map directly instead of going through
   DataTypeStringSerDe::deserialize_one_cell_from_csv/hive_text, which
   removes the SerDe call layer and its internal per-cell assert_cast.
   The SerDe methods never fail, so the old fill-null-on-failure branch
   was dead code.
3. Add _reserve_nullable_string_columns, called once per batch: it
   performs checked assert_casts (backing the unchecked per-row casts
   with a real type validation per batch) and reserves offsets/null_map
   capacity to avoid incremental PODArray growth in the row loop.

The virtual _deserialize_nullable_string dispatch is kept, so
TextReader's hive-text semantics (different escape handling and null
detection) remain intact.
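
As a rough illustration of why the virtual dispatch is kept, the sketch below shows two readers that share a call site but apply different per-cell rules. Everything here is an assumption made for illustration: `RowReader`, `HiveTextLikeReader`, `Cell`, the `\N` marker, and the escape rule are placeholders and do not reproduce the actual CSV or hive-text semantics in Doris.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical result of parsing one cell.
struct Cell {
    bool is_null;
    std::string value;
};

class RowReader {
public:
    virtual ~RowReader() = default;
    // One virtual call per cell; each reader applies its own rules.
    virtual Cell deserialize_nullable_string(std::string_view field) const {
        // Placeholder rule: a configured marker means NULL, bytes are kept as-is.
        if (field == null_marker_) return {true, {}};
        return {false, std::string(field)};
    }

protected:
    std::string null_marker_ = "\\N";
};

class HiveTextLikeReader : public RowReader {
public:
    Cell deserialize_nullable_string(std::string_view field) const override {
        if (field == null_marker_) return {true, {}};
        // Placeholder escape rule: drop a backslash and keep the next byte.
        std::string out;
        for (std::size_t i = 0; i < field.size(); ++i) {
            if (field[i] == '\\' && i + 1 < field.size()) ++i;
            out.push_back(field[i]);
        }
        return {false, out};
    }
};
```

The row loop can call `deserialize_nullable_string` through a `RowReader` reference without knowing which format it is reading, so inlining the column writes inside each override does not merge the two formats' behavior.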
Copilot AI review requested due to automatic review settings June 12, 2026 16:13
@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@liaoxin01

Contributor Author

run buildall

@liaoxin01

Contributor Author

/review

Copilot AI left a comment

Contributor


Pull request overview

Optimizes the CSV/Hive-text load hot path for nullable string columns by removing per-cell SerDe overhead and amortizing type validation/capacity reservations to once per batch, improving load throughput for wide tables.

Changes:

  • Inline nullable-string deserialization in CsvReader/TextReader to avoid SerDe calls and repeated per-cell assert_cast checks in release builds.
  • Add _reserve_nullable_string_columns(...) to validate concrete column types once per batch and pre-reserve offsets/null-map capacity.
  • Switch hot-path casts to assert_cast<..., TypeCheckOnRelease::DISABLE> after per-batch validation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| be/src/format/text/text_reader.cpp | Inlines hive-text nullable string parsing and uses unchecked release casts for the per-cell hot path. |
| be/src/format/csv/csv_reader.h | Declares _reserve_nullable_string_columns(...) helper for per-batch validation/reserve. |
| be/src/format/csv/csv_reader.cpp | Calls the new reserve helper per batch; inlines nullable string CSV parsing and disables release cast checks in the hot loop. |


Comment thread be/src/format/csv/csv_reader.cpp
Comment thread be/src/format/text/text_reader.cpp

@github-actions github-actions Bot left a comment

Contributor


Review result: no blocking issues found.

Critical checkpoint conclusions:

  • Goal/test: the PR optimizes nullable string deserialization in CSV/text load/query row readers. The changed code preserves the existing null-format, empty-field-as-null, quote-trimming, and escape behavior by mirroring the previous DataTypeStringSerDe calls. No new automated test was added; I did not run BE unit/regression tests.
  • Scope: focused three-file BE change with no unrelated refactor in the PR diff.
  • Concurrency: no new shared state or thread handoff. The modified columns are batch-local through the existing scanner/block ownership path.
  • Lifecycle/static initialization: no new global/static lifecycle; the previous function-local static SerDe is removed.
  • Config/compatibility: no new config, protocol, storage-format, or FE-BE compatibility surface.
  • Parallel paths: both CSV and hive-text nullable string fast paths were updated; non-string and non-nullable paths continue through the existing SerDe path.
  • Conditions/invariants: the release-disabled casts are backed by checked casts once per batch in _reserve_nullable_string_columns, and the same concrete column assumptions already existed in the prior per-cell path.
  • Test results/style: no result files were changed. git diff --check passed for the three PR files. build-support/check-format.sh could not run in this runner because the available clang-format is not version 16.
  • Observability/transactions/data writes: no transaction, persistence, storage visibility, or observability changes are involved.
  • Performance: the optimization removes per-cell virtual/type-check overhead and reserves offsets/null maps without preallocating unpredictable string payload bytes.

User focus: no additional user-provided review focus was present, and I found no extra issue in that area.

@hello-stephen

Contributor
TPC-H: Total hot run time: 28660 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit d2d3c703b10cf7a5d27ba083b191e11cc445b087, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17825	3966	3980	3966
q2	q3	10780	1394	805	805
q4	4719	467	336	336
q5	7513	852	562	562
q6	177	169	137	137
q7	775	849	621	621
q8	9385	1651	1540	1540
q9	6333	4460	4519	4460
q10	6810	1794	1512	1512
q11	437	270	249	249
q12	632	423	289	289
q13	18148	3656	2747	2747
q14	263	266	238	238
q15	q16	816	777	706	706
q17	924	899	989	899
q18	7130	5724	5510	5510
q19	1279	1263	1060	1060
q20	522	403	255	255
q21	6158	2680	2459	2459
q22	452	364	309	309
Total cold run time: 101078 ms
Total hot run time: 28660 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4833	4782	4770	4770
q2	q3	5071	5153	4575	4575
q4	2146	2203	1389	1389
q5	4829	4903	4692	4692
q6	220	171	127	127
q7	1891	1745	1545	1545
q8	2383	1909	1896	1896
q9	7360	7375	7387	7375
q10	4751	4687	4218	4218
q11	545	377	357	357
q12	723	740	533	533
q13	3005	3357	2794	2794
q14	286	275	250	250
q15	q16	664	695	602	602
q17	1277	1244	1231	1231
q18	7429	6935	6837	6837
q19	1100	1077	1086	1077
q20	2200	2194	1942	1942
q21	5234	4567	4402	4402
q22	526	468	416	416
Total cold run time: 56473 ms
Total hot run time: 51028 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 167669 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit d2d3c703b10cf7a5d27ba083b191e11cc445b087, data reload: false

query5	4324	610	474	474
query6	450	184	169	169
query7	4826	542	301	301
query8	436	213	196	196
query9	8750	4061	3974	3974
query10	458	319	275	275
query11	5976	2374	2169	2169
query12	159	101	94	94
query13	1290	609	433	433
query14	6355	5315	5008	5008
query14_1	4329	4365	4353	4353
query15	205	193	168	168
query16	1013	434	419	419
query17	1084	678	563	563
query18	2438	458	330	330
query19	206	185	147	147
query20	113	113	108	108
query21	216	144	118	118
query22	13690	13615	13439	13439
query23	17163	16467	16044	16044
query23_1	16262	16433	16570	16433
query24	7611	1787	1297	1297
query24_1	1345	1314	1330	1314
query25	552	456	390	390
query26	1296	307	170	170
query27	2710	561	338	338
query28	4471	2049	2031	2031
query29	1120	633	493	493
query30	318	244	200	200
query31	1150	1084	958	958
query32	107	63	59	59
query33	545	320	261	261
query34	1227	1094	634	634
query35	757	802	692	692
query36	1358	1381	1274	1274
query37	158	104	94	94
query38	3231	3146	3075	3075
query39	931	923	897	897
query39_1	879	887	868	868
query40	232	123	103	103
query41	70	69	66	66
query42	95	95	98	95
query43	314	321	276	276
query44	
query45	198	184	180	180
query46	1060	1185	723	723
query47	2364	2374	2297	2297
query48	430	404	308	308
query49	639	481	360	360
query50	986	351	252	252
query51	4343	4435	4293	4293
query52	88	89	78	78
query53	242	269	192	192
query54	301	227	211	211
query55	80	78	70	70
query56	250	246	240	240
query57	1462	1421	1301	1301
query58	247	220	222	220
query59	1568	1680	1373	1373
query60	297	285	220	220
query61	150	146	140	140
query62	694	639	588	588
query63	226	181	183	181
query64	2591	752	595	595
query65	
query66	1788	460	345	345
query67	29000	29704	28841	28841
query68	
query69	425	289	256	256
query70	986	981	916	916
query71	309	219	209	209
query72	3001	2589	2252	2252
query73	840	781	432	432
query74	5126	4993	4749	4749
query75	2652	2558	2222	2222
query76	2354	1194	757	757
query77	351	369	276	276
query78	12342	12345	11789	11789
query79	1448	1070	782	782
query80	568	455	391	391
query81	453	279	242	242
query82	690	158	116	116
query83	350	267	248	248
query84	
query85	862	494	406	406
query86	376	301	253	253
query87	3367	3257	3162	3162
query88	3622	2719	2719	2719
query89	489	375	326	326
query90	1980	168	172	168
query91	168	160	131	131
query92	61	61	55	55
query93	1477	1579	841	841
query94	549	348	304	304
query95	673	461	337	337
query96	1097	739	353	353
query97	2710	2715	2559	2559
query98	209	204	207	204
query99	1158	1175	1045	1045
Total cold run time: 250250 ms
Total hot run time: 167669 ms

@github-actions github-actions Bot added the "approved" label (Indicates a PR has been approved by one committer.) Jun 12, 2026
@github-actions

Contributor

PR approved by at least one committer and no changes requested.

@github-actions

Contributor

PR approved by anyone and no changes requested.

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (34/34) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
|---|---|
| Function Coverage | 73.96% (28315/38282) |
| Line Coverage | 57.91% (308400/532513) |
| Region Coverage | 54.68% (257947/471744) |
| Branch Coverage | 56.12% (112132/199810) |

@liaoxin01 liaoxin01 merged commit 4dda859 into apache:master Jun 13, 2026
35 of 36 checks passed
@liaoxin01 liaoxin01 deleted the opt-csv-reader-nullable-string-v2 branch June 13, 2026 03:01