Conversation


@echo567 echo567 commented Nov 16, 2025

Why are the changes needed?

Control the amount of data per batch, to prevent memory overflow and to speed up the initial response.

When kyuubi.operation.result.format=arrow, spark.connect.grpc.arrow.maxBatchSize does not work as expected.

Reproduction:
You can debug KyuubiArrowConverters, or add the following log statement at line 300 of KyuubiArrowConverters:

logInfo(s"Total limit: ${limit}, rowCount: ${rowCount}, " +
  s"rowCountInLastBatch: ${rowCountInLastBatch}, " +
  s"estimatedBatchSize: ${estimatedBatchSize}, " +
  s"maxEstimatedBatchSize: ${maxEstimatedBatchSize}, " +
  s"maxRecordsPerBatch: ${maxRecordsPerBatch}")

Test data: 1.6 million rows, 30 columns per row. Command executed:

bin/beeline \
  -u 'jdbc:hive2://10.168.X.X:XX/default;thrift.client.max.message.size=2000000000' \
  --hiveconf kyuubi.operation.result.format=arrow \
  -n test -p 'testpass' \
  --outputformat=csv2 -e "select * from db.table" > /tmp/test.csv

Log output

25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount:200000, estimatedBatchSize: 145600000 maxEstimatedBatchSize: 4,maxRecordsPerBatch:10000
25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount:200000, estimatedBatchSize: 145600000

Original Code

while (rowIter.hasNext && (
    rowCountInLastBatch == 0 && maxEstimatedBatchSize > 0 ||
    estimatedBatchSize <= 0 ||
    estimatedBatchSize < maxEstimatedBatchSize ||
    maxRecordsPerBatch <= 0 ||
    rowCountInLastBatch < maxRecordsPerBatch ||
    rowCount < limit ||
    limit < 0))

When no limit is set (i.e., limit = -1), all rows are fetched into a single batch. If the row count is large enough, three problems occur:
(1) Driver/executor OOM.
(2) Array OOM, because the required array length exceeds what can be allocated.
(3) Slow data transfer, since nothing is returned until the whole batch is built.
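The failure mode follows directly from the guard above: its sub-conditions are OR-joined, so with limit = -1 the final clause `limit < 0` alone keeps the predicate true no matter how far the batch has already exceeded the size and row-count caps. A minimal Python simulation of the predicate (the names mirror the Scala variables; the numbers are illustrative, taken from the reproduction log above):

```python
# Simulation of the OR-joined loop guard from the original Scala code.
# Variable names mirror the Scala fields; values are illustrative.

def or_joined_guard(estimated_batch_size, max_estimated_batch_size,
                    row_count_in_last_batch, max_records_per_batch,
                    row_count, limit):
    """The original (buggy) predicate: any single True clause keeps looping."""
    return (
        (row_count_in_last_batch == 0 and max_estimated_batch_size > 0)
        or estimated_batch_size <= 0
        or estimated_batch_size < max_estimated_batch_size
        or max_records_per_batch <= 0
        or row_count_in_last_batch < max_records_per_batch
        or row_count < limit
        or limit < 0
    )

# The batch already exceeds every configured cap, yet with limit = -1 the
# guard is still True, so the loop keeps appending rows to one huge batch.
print(or_joined_guard(
    estimated_batch_size=145_600_000,    # far above the size cap
    max_estimated_batch_size=4_194_304,  # 4 MiB
    row_count_in_last_batch=200_000,     # far above maxRecordsPerBatch
    max_records_per_batch=10_000,
    row_count=200_000,
    limit=-1,                            # "no limit", as in the reproduction
))  # -> True
```

Every cap clause evaluates to False here, yet the guard stays True, which matches the 200,000-row single batch seen in the log.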

After updating the code, the log output is as follows:

25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 5762, rowCountInLastBatch:5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch:10000
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 11524, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000 
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 17286, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000

The estimatedBatchSize now only slightly exceeds maxEstimatedBatchSize, and data is written in batches as expected.
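The batched behavior above corresponds to AND-joining the caps. A hedged sketch of such a guard, in the spirit of Spark's SPARK-44657 change mentioned later in this thread; this is illustrative Python with hypothetical helper names, not the exact Scala merged in this PR:

```python
# A possible AND-joined guard (illustrative, not the exact code in this PR):
# the loop continues only while EVERY cap still has headroom, and the first
# row of a batch is always admitted so a single oversized row still flushes.

def and_joined_guard(estimated_batch_size, max_estimated_batch_size,
                     row_count_in_last_batch, max_records_per_batch,
                     row_count, limit):
    # A non-positive cap means "unlimited" for that dimension.
    within_size = (max_estimated_batch_size <= 0
                   or estimated_batch_size < max_estimated_batch_size)
    within_rows = (max_records_per_batch <= 0
                   or row_count_in_last_batch < max_records_per_batch)
    within_limit = limit < 0 or row_count < limit
    # Always admit the first row of a batch, then require all caps to hold.
    return (row_count_in_last_batch == 0
            or (within_size and within_rows)) and within_limit

# Same oversized-batch state as in the reproduction: the guard is now False,
# so the converter flushes the current batch before appending more rows.
print(and_joined_guard(
    estimated_batch_size=145_600_000,
    max_estimated_batch_size=4_194_304,
    row_count_in_last_batch=200_000,
    max_records_per_batch=10_000,
    row_count=200_000,
    limit=-1,
))  # -> False
```

The `row_count_in_last_batch == 0` escape explains why estimatedBatchSize may slightly exceed maxEstimatedBatchSize in the fixed logs: the row that crosses the cap is still written before the batch is flushed.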

Fix #7245.

How was this patch tested?

Test data: 1.6 million rows, 30 columns per row.

25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 5762, rowCountInLastBatch:5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch:10000
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 11524, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000 
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 17286, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000

Was this patch authored or co-authored using generative AI tooling?

No

@pan3793 pan3793 changed the title [KYUUBI apache#7245] fix arrow batch converter error [KYUUBI #7245] Fix arrow batch converter error Nov 17, 2025
@pan3793
Member

pan3793 commented Nov 17, 2025

@echo567, please keep the PR template and fill it in carefully, especially "Was this patch authored or co-authored using generative AI tooling?", which matters for legal purposes.

@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (3b205a3) to head (479d7e4).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
...rk/sql/execution/arrow/KyuubiArrowConverters.scala 0.00% 5 Missing ⚠️
...g/apache/spark/sql/kyuubi/SparkDatasetHelper.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##           master   #7246   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         696     696           
  Lines       43530   43527    -3     
  Branches     5883    5879    -4     
======================================
+ Misses      43530   43527    -3     


@echo567
Author

echo567 commented Nov 19, 2025

@echo567, please keep the PR template and fill it in carefully, especially "Was this patch authored or co-authored using generative AI tooling?", which matters for legal purposes.

Sorry, the changes have been made.

@pan3793 pan3793 requested a review from cfmcgrady November 19, 2025 03:26
@pan3793
Member

pan3793 commented Nov 19, 2025

The code was copied from Spark, and it seems it was changed in SPARK-44657. Can we just follow that?


Successfully merging this pull request may close these issues: [Bug] arrow batch converter error