Conversation


@echo567 echo567 commented Nov 16, 2025

Why are the changes needed?

Control the amount of data per batch, to prevent memory overflow and to speed up the initial response.

When kyuubi.operation.result.format=arrow, spark.connect.grpc.arrow.maxBatchSize does not work as expected.

Reproduction:
You can debug KyuubiArrowConverters, or add the following log statement at line 300 of KyuubiArrowConverters:

logInfo(s"Total limit: ${limit}, rowCount: ${rowCount}, " +
  s"rowCountInLastBatch: ${rowCountInLastBatch}, " +
  s"estimatedBatchSize: ${estimatedBatchSize}, " +
  s"maxEstimatedBatchSize: ${maxEstimatedBatchSize}, " +
  s"maxRecordsPerBatch: ${maxRecordsPerBatch}")

Test data: 1.6 million rows, 30 columns per row. Command executed:

bin/beeline \
  -u 'jdbc:hive2://10.168.X.X:XX/default;thrift.client.max.message.size=2000000000' \
  --hiveconf kyuubi.operation.result.format=arrow \
  -n test -p 'testpass' \
  --outputformat=csv2 -e "select * from db.table" > /tmp/test.csv

Log output

25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount:200000, estimatedBatchSize: 145600000 maxEstimatedBatchSize: 4,maxRecordsPerBatch:10000
25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount:200000, estimatedBatchSize: 145600000

Original Code

while (rowIter.hasNext && (
    rowCountInLastBatch == 0 && maxEstimatedBatchSize > 0 ||
    estimatedBatchSize <= 0 ||
    estimatedBatchSize < maxEstimatedBatchSize ||
    maxRecordsPerBatch <= 0 ||
    rowCountInLastBatch < maxRecordsPerBatch ||
    rowCount < limit ||
    limit < 0))

When no limit is set (i.e., limit = -1), all rows are fetched into a single batch. If the row count is large enough, three problems occur:
(1) Driver/executor OOM.
(2) Array OOM, because the required array length exceeds what can be allocated.
(3) Slow data transfer, since nothing is returned until the whole batch is built.
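The failure mode follows directly from the guard above: its sub-conditions are OR-joined, so with limit = -1 the final clause `limit < 0` alone keeps the predicate true no matter how far the batch has already exceeded the size and row-count caps. A minimal Python simulation of the predicate (the names mirror the Scala variables; the numbers are illustrative, taken from the reproduction log above):

```python
# Simulation of the OR-joined loop guard from the original Scala code.
# Variable names mirror the Scala fields; values are illustrative.

def or_joined_guard(estimated_batch_size, max_estimated_batch_size,
                    row_count_in_last_batch, max_records_per_batch,
                    row_count, limit):
    """The original (buggy) predicate: any single True clause keeps looping."""
    return (
        (row_count_in_last_batch == 0 and max_estimated_batch_size > 0)
        or estimated_batch_size <= 0
        or estimated_batch_size < max_estimated_batch_size
        or max_records_per_batch <= 0
        or row_count_in_last_batch < max_records_per_batch
        or row_count < limit
        or limit < 0
    )

# The batch already exceeds every configured cap, yet with limit = -1 the
# guard is still True, so the loop keeps appending rows to one huge batch.
print(or_joined_guard(
    estimated_batch_size=145_600_000,    # far above the size cap
    max_estimated_batch_size=4_194_304,  # 4 MiB
    row_count_in_last_batch=200_000,     # far above maxRecordsPerBatch
    max_records_per_batch=10_000,
    row_count=200_000,
    limit=-1,                            # "no limit", as in the reproduction
))  # -> True
```

Every cap clause evaluates to False here, yet the guard stays True, which matches the 200,000-row single batch seen in the log.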

After updating the code, the log output is as follows:

25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 5762, rowCountInLastBatch:5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch:10000
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 11524, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000 
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 17286, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000

The estimatedBatchSize now only slightly exceeds maxEstimatedBatchSize, and data is written in batches as expected.
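The batched behavior above corresponds to AND-joining the caps. A hedged sketch of such a guard, in the spirit of Spark's SPARK-44657 change mentioned later in this thread; this is illustrative Python with hypothetical helper names, not the exact Scala merged in this PR:

```python
# A possible AND-joined guard (illustrative, not the exact code in this PR):
# the loop continues only while EVERY cap still has headroom, and the first
# row of a batch is always admitted so a single oversized row still flushes.

def and_joined_guard(estimated_batch_size, max_estimated_batch_size,
                     row_count_in_last_batch, max_records_per_batch,
                     row_count, limit):
    # A non-positive cap means "unlimited" for that dimension.
    within_size = (max_estimated_batch_size <= 0
                   or estimated_batch_size < max_estimated_batch_size)
    within_rows = (max_records_per_batch <= 0
                   or row_count_in_last_batch < max_records_per_batch)
    within_limit = limit < 0 or row_count < limit
    # Always admit the first row of a batch, then require all caps to hold.
    return (row_count_in_last_batch == 0
            or (within_size and within_rows)) and within_limit

# Same oversized-batch state as in the reproduction: the guard is now False,
# so the converter flushes the current batch before appending more rows.
print(and_joined_guard(
    estimated_batch_size=145_600_000,
    max_estimated_batch_size=4_194_304,
    row_count_in_last_batch=200_000,
    max_records_per_batch=10_000,
    row_count=200_000,
    limit=-1,
))  # -> False
```

The `row_count_in_last_batch == 0` escape explains why estimatedBatchSize may slightly exceed maxEstimatedBatchSize in the fixed logs: the row that crosses the cap is still written before the batch is flushed.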

Fix #7245.

How was this patch tested?

Test data: 1.6 million rows, 30 columns per row.

25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 5762, rowCountInLastBatch:5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch:10000
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 11524, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000 
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 17286, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000

Was this patch authored or co-authored using generative AI tooling?

No

@pan3793 pan3793 changed the title [KYUUBI apache#7245] fix arrow batch converter error [KYUUBI #7245] Fix arrow batch converter error Nov 17, 2025
@pan3793
Member

pan3793 commented Nov 17, 2025

@echo567, please keep the PR template and fill it in carefully, especially "Was this patch authored or co-authored using generative AI tooling?", which matters for legal purposes.

@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (3b205a3) to head (479d7e4).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
...rk/sql/execution/arrow/KyuubiArrowConverters.scala 0.00% 5 Missing ⚠️
...g/apache/spark/sql/kyuubi/SparkDatasetHelper.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##           master   #7246   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         696     696           
  Lines       43530   43527    -3     
  Branches     5883    5879    -4     
======================================
+ Misses      43530   43527    -3     


@echo567
Author

echo567 commented Nov 19, 2025

@echo567, please keep the PR template and fill it in carefully, especially "Was this patch authored or co-authored using generative AI tooling?", which matters for legal purposes.

Sorry, the changes have been made.

@pan3793 pan3793 requested a review from cfmcgrady November 19, 2025 03:26
@pan3793
Member

pan3793 commented Nov 19, 2025

The code was copied from Spark, and it seems it was changed in SPARK-44657. Can we just follow that?


Successfully merging this pull request may close these issues: [Bug] arrow batch converter error