[opt](csv reader) optimize nullable string deserialization in CSV/text load hot path by liaoxin01 · Pull Request #64476 · apache/doris

liaoxin01 · 2026-06-12T16:13:27Z

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #60920 (previous attempt, superseded by this stateless implementation)

Problem Summary:

When loading CSV data, every column is read as a nullable string, so _deserialize_nullable_string is the per-row, per-column hot path (ClickBench: 105 columns x 100M rows = ~10.5 billion cells). A flame graph shows two major per-cell overheads:

1. assert_cast<ColumnNullable&> performs a typeid comparison per cell, even in release builds.
2. DataTypeStringSerDe::deserialize_one_cell_from_csv adds a call layer with another per-cell assert_cast<ColumnString&> inside, plus Status plumbing. Its fill-null-on-failure branch is dead code because the method never fails.
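
To illustrate the first overhead, here is a minimal sketch of a cast helper with a compile-time switch between a checked and an unchecked mode. It is not the Doris assert_cast implementation: sketch_cast, TypeCheck, and the column structs are hypothetical names used only for illustration. In the disabled mode the type check survives only as an assert, so it disappears when NDEBUG is defined, which mirrors the "plain static_cast in release, verified in debug" behavior described in this PR.

```cpp
#include <cassert>
#include <typeinfo>

// Hypothetical stand-ins for illustration; not the Doris types.
enum class TypeCheck { Enable, Disable };

struct IColumn {
    virtual ~IColumn() = default;
};
struct ColumnString : IColumn {};
struct ColumnInt : IColumn {};

// Enable:  compares typeid on every call and throws std::bad_cast on mismatch.
// Disable: checks only through assert(), which is compiled out under NDEBUG,
//          leaving a plain static_cast in release builds.
template <typename To, TypeCheck check = TypeCheck::Enable>
To& sketch_cast(IColumn& from) {
    if (check == TypeCheck::Enable) {
        if (typeid(from) != typeid(To)) {
            throw std::bad_cast();
        }
    } else {
        assert(typeid(from) == typeid(To));
    }
    return static_cast<To&>(from);
}
```

With roughly 10.5 billion cells, the Enable path would run the typeid comparison once per cell, which is why moving the check out of the row loop matters.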

Changes

1. Use assert_cast<..., TypeCheckOnRelease::DISABLE> in CsvReader::_deserialize_nullable_string and TextReader::_deserialize_nullable_string, which compiles to a plain static_cast in release builds. Debug builds still verify the cast.
2. Write the string column and null map directly instead of going through the SerDe layer (semantically identical, verified against the ColumnNullable::insert_data and DataTypeStringSerDe implementations). The virtual _deserialize_nullable_string dispatch is kept, so TextReader's hive-text semantics (different escape handling and null detection) remain intact.
3. Add _reserve_nullable_string_columns, called once per batch. It performs checked assert_casts, so the unchecked per-row casts are backed by a real type validation per batch that throws on a mismatch instead of invoking undefined behavior. It also reserves offsets/null_map capacity to avoid incremental PODArray growth in the row loop.
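
The direct-write and per-batch-reserve ideas above can be sketched as follows. This is a simplified model under stated assumptions, not the code in this PR: NullableStringColumn, reserve_rows, and append_cell are hypothetical names, std::vector stands in for PODArray, and null detection is reduced to an exact match against a null-format string (the real readers also handle quoting, escapes, and empty-field-as-null).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified model of a nullable string column.
struct StringColumn {
    std::vector<char> chars;            // concatenated payload bytes of all rows
    std::vector<std::uint32_t> offsets; // end offset of each row within chars
};

struct NullableStringColumn {
    StringColumn nested;
    std::vector<std::uint8_t> null_map; // 1 = NULL, 0 = value present
};

// Once per batch: reserve the per-row arrays so the row loop does not grow
// them incrementally. Payload bytes are not reserved because their total
// size is not known in advance.
void reserve_rows(NullableStringColumn& col, std::size_t batch_rows) {
    col.nested.offsets.reserve(col.nested.offsets.size() + batch_rows);
    col.null_map.reserve(col.null_map.size() + batch_rows);
}

// Once per cell: append to payload, offsets, and null map directly,
// with no intermediate serialization layer.
void append_cell(NullableStringColumn& col, const std::string& field,
                 const std::string& null_format) {
    const bool is_null = (field == null_format);
    if (!is_null) {
        col.nested.chars.insert(col.nested.chars.end(), field.begin(), field.end());
    }
    // A NULL row still gets an (empty) slot in the nested column.
    col.nested.offsets.push_back(static_cast<std::uint32_t>(col.nested.chars.size()));
    col.null_map.push_back(is_null ? 1 : 0);
}
```

A caller would invoke reserve_rows once before the row loop and append_cell once per field inside it.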

The implementation is stateless: no cached column pointers, no per-batch member state to initialize/clear.

Performance

A/B test on full ClickBench dataset (73GB / 100M rows / 105 columns), identical deployment and config, only the BE binary differs:

Metric	Before	After	Improvement
Total load time (BE LoadTime)	636.6s	530.9s	-16.6% (1.20x)
CSV parse (ReadDataTime)	590.6s	484.5s	-18.0%
Avg throughput	115 MB/s	138 MB/s	+20%

All 10 splits (10M rows each) improved consistently by 14-18% with small variance. Loaded row counts are identical between the two runs (99,997,497 rows).

Release note

None

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
  - Full ClickBench load A/B test, see Performance section above. Behavioral equivalence is covered by existing CSV/text load regression cases.
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

…t load hot path

Eliminate per-row per-column overhead when loading CSV/hive-text data:

1. Use assert_cast<..., TypeCheckOnRelease::DISABLE> in _deserialize_nullable_string so the release build performs a plain static_cast instead of a typeid comparison per cell. Debug builds still verify the cast.
2. Write the string/null_map directly instead of going through DataTypeStringSerDe::deserialize_one_cell_from_csv/hive_text, which removes the SerDe call layer and its internal per-cell assert_cast. The SerDe methods never fail, so the old fill-null-on-failure branch was dead code.
3. Add _reserve_nullable_string_columns, called once per batch: it performs checked assert_casts (backing the unchecked per-row casts with a real type validation per batch) and reserves offsets/null_map capacity to avoid incremental PODArray growth in the row loop.

The virtual _deserialize_nullable_string dispatch is kept, so TextReader's hive-text semantics (different escape handling and null detection) remain intact.

hello-stephen · 2026-06-12T16:13:33Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

liaoxin01 · 2026-06-12T16:13:37Z

run buildall

liaoxin01 · 2026-06-12T16:15:37Z

/review

Copilot

Pull request overview

Optimizes the CSV/Hive-text load hot path for nullable string columns by removing per-cell SerDe overhead and amortizing type validation/capacity reservations to once per batch, improving load throughput for wide tables.

Changes:

Inline nullable-string deserialization in CsvReader/TextReader to avoid SerDe calls and repeated per-cell assert_cast checks in release builds.
Add _reserve_nullable_string_columns(...) to validate concrete column types once per batch and pre-reserve offsets/null-map capacity.
Switch hot-path casts to assert_cast<..., TypeCheckOnRelease::DISABLE> after per-batch validation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
be/src/format/text/text_reader.cpp	Inlines hive-text nullable string parsing and uses unchecked release casts for the per-cell hot path.
be/src/format/csv/csv_reader.h	Declares `_reserve_nullable_string_columns(...)` helper for per-batch validation/reserve.
be/src/format/csv/csv_reader.cpp	Calls the new reserve helper per batch; inlines nullable string CSV parsing and disables release cast checks in the hot loop.

github-actions

Review result: no blocking issues found.

Critical checkpoint conclusions:

Goal/test: the PR optimizes nullable string deserialization in CSV/text load/query row readers. The changed code preserves the existing null-format, empty-field-as-null, quote-trimming, and escape behavior by mirroring the previous DataTypeStringSerDe calls. No new automated test was added; I did not run BE unit/regression tests.
Scope: focused three-file BE change with no unrelated refactor in the PR diff.
Concurrency: no new shared state or thread handoff. The modified columns are batch-local through the existing scanner/block ownership path.
Lifecycle/static initialization: no new global/static lifecycle; the previous function-local static SerDe is removed.
Config/compatibility: no new config, protocol, storage-format, or FE-BE compatibility surface.
Parallel paths: both CSV and hive-text nullable string fast paths were updated; non-string and non-nullable paths continue through the existing SerDe path.
Conditions/invariants: the release-disabled casts are backed by checked casts once per batch in _reserve_nullable_string_columns, and the same concrete column assumptions already existed in the prior per-cell path.
Test results/style: no result files were changed. git diff --check passed for the three PR files. build-support/check-format.sh could not run in this runner because the available clang-format is not version 16.
Observability/transactions/data writes: no transaction, persistence, storage visibility, or observability changes are involved.
Performance: the optimization removes per-cell virtual/type-check overhead and reserves offsets/null maps without preallocating unpredictable string payload bytes.

User focus: no additional user-provided review focus was present, and I found no extra issue in that area.

hello-stephen · 2026-06-12T16:43:29Z

TPC-H: Total hot run time: 28660 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit d2d3c703b10cf7a5d27ba083b191e11cc445b087, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17825	3966	3980	3966
q2	q3	10780	1394	805	805
q4	4719	467	336	336
q5	7513	852	562	562
q6	177	169	137	137
q7	775	849	621	621
q8	9385	1651	1540	1540
q9	6333	4460	4519	4460
q10	6810	1794	1512	1512
q11	437	270	249	249
q12	632	423	289	289
q13	18148	3656	2747	2747
q14	263	266	238	238
q15	q16	816	777	706	706
q17	924	899	989	899
q18	7130	5724	5510	5510
q19	1279	1263	1060	1060
q20	522	403	255	255
q21	6158	2680	2459	2459
q22	452	364	309	309
Total cold run time: 101078 ms
Total hot run time: 28660 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4833	4782	4770	4770
q2	q3	5071	5153	4575	4575
q4	2146	2203	1389	1389
q5	4829	4903	4692	4692
q6	220	171	127	127
q7	1891	1745	1545	1545
q8	2383	1909	1896	1896
q9	7360	7375	7387	7375
q10	4751	4687	4218	4218
q11	545	377	357	357
q12	723	740	533	533
q13	3005	3357	2794	2794
q14	286	275	250	250
q15	q16	664	695	602	602
q17	1277	1244	1231	1231
q18	7429	6935	6837	6837
q19	1100	1077	1086	1077
q20	2200	2194	1942	1942
q21	5234	4567	4402	4402
q22	526	468	416	416
Total cold run time: 56473 ms
Total hot run time: 51028 ms

hello-stephen · 2026-06-12T16:54:27Z

TPC-DS: Total hot run time: 167669 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit d2d3c703b10cf7a5d27ba083b191e11cc445b087, data reload: false

query5	4324	610	474	474
query6	450	184	169	169
query7	4826	542	301	301
query8	436	213	196	196
query9	8750	4061	3974	3974
query10	458	319	275	275
query11	5976	2374	2169	2169
query12	159	101	94	94
query13	1290	609	433	433
query14	6355	5315	5008	5008
query14_1	4329	4365	4353	4353
query15	205	193	168	168
query16	1013	434	419	419
query17	1084	678	563	563
query18	2438	458	330	330
query19	206	185	147	147
query20	113	113	108	108
query21	216	144	118	118
query22	13690	13615	13439	13439
query23	17163	16467	16044	16044
query23_1	16262	16433	16570	16433
query24	7611	1787	1297	1297
query24_1	1345	1314	1330	1314
query25	552	456	390	390
query26	1296	307	170	170
query27	2710	561	338	338
query28	4471	2049	2031	2031
query29	1120	633	493	493
query30	318	244	200	200
query31	1150	1084	958	958
query32	107	63	59	59
query33	545	320	261	261
query34	1227	1094	634	634
query35	757	802	692	692
query36	1358	1381	1274	1274
query37	158	104	94	94
query38	3231	3146	3075	3075
query39	931	923	897	897
query39_1	879	887	868	868
query40	232	123	103	103
query41	70	69	66	66
query42	95	95	98	95
query43	314	321	276	276
query44	
query45	198	184	180	180
query46	1060	1185	723	723
query47	2364	2374	2297	2297
query48	430	404	308	308
query49	639	481	360	360
query50	986	351	252	252
query51	4343	4435	4293	4293
query52	88	89	78	78
query53	242	269	192	192
query54	301	227	211	211
query55	80	78	70	70
query56	250	246	240	240
query57	1462	1421	1301	1301
query58	247	220	222	220
query59	1568	1680	1373	1373
query60	297	285	220	220
query61	150	146	140	140
query62	694	639	588	588
query63	226	181	183	181
query64	2591	752	595	595
query65	
query66	1788	460	345	345
query67	29000	29704	28841	28841
query68	
query69	425	289	256	256
query70	986	981	916	916
query71	309	219	209	209
query72	3001	2589	2252	2252
query73	840	781	432	432
query74	5126	4993	4749	4749
query75	2652	2558	2222	2222
query76	2354	1194	757	757
query77	351	369	276	276
query78	12342	12345	11789	11789
query79	1448	1070	782	782
query80	568	455	391	391
query81	453	279	242	242
query82	690	158	116	116
query83	350	267	248	248
query84	
query85	862	494	406	406
query86	376	301	253	253
query87	3367	3257	3162	3162
query88	3622	2719	2719	2719
query89	489	375	326	326
query90	1980	168	172	168
query91	168	160	131	131
query92	61	61	55	55
query93	1477	1579	841	841
query94	549	348	304	304
query95	673	461	337	337
query96	1097	739	353	353
query97	2710	2715	2559	2559
query98	209	204	207	204
query99	1158	1175	1045	1045
Total cold run time: 250250 ms
Total hot run time: 167669 ms

github-actions · 2026-06-12T17:40:00Z

PR approved by at least one committer and no changes requested.

github-actions · 2026-06-12T17:40:03Z

PR approved by anyone and no changes requested.

hello-stephen · 2026-06-12T19:24:03Z

BE Regression && UT Coverage Report

Increment line coverage 100.00% (34/34) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	73.96% (28315/38282)
Line Coverage	57.91% (308400/532513)
Region Coverage	54.68% (257947/471744)
Branch Coverage	56.12% (112132/199810)

Copilot AI review requested due to automatic review settings June 12, 2026 16:13

Copilot started reviewing on behalf of liaoxin01 June 12, 2026 16:13

liaoxin01 added dev/3.1.x dev/4.0.x dev/4.1.x labels Jun 12, 2026

Copilot AI reviewed Jun 12, 2026

Comment thread be/src/format/csv/csv_reader.cpp

Comment thread be/src/format/text/text_reader.cpp

github-actions Bot reviewed Jun 12, 2026

gavinchou approved these changes Jun 12, 2026

github-actions Bot added the approved Indicates a PR has been approved by one committer. label Jun 12, 2026

github-actions Bot added the reviewed label Jun 12, 2026

liaoxin01 merged commit 4dda859 into apache:master Jun 13, 2026
35 of 36 checks passed

liaoxin01 deleted the opt-csv-reader-nullable-string-v2 branch June 13, 2026 03:01

github-actions Bot added dev/4.0.x-conflict dev/4.1.x-conflict labels Jun 13, 2026
