Add benchmark script and documentation for maximizing CPU usage in DataFusion Python #1216

kosiew · 2025-08-29T12:48:39Z

Which issue does this PR close?

Closes part of export to arrow generate OOM #1206

I was doing some testing and noticed that DataFusion doesn't seem to be using all cores in my notebook runtime.

Rationale for this change

This change provides users with practical guidance and examples for tuning DataFusion’s parallelism to maximize CPU utilization. By documenting configuration options and including a benchmark script, users can better understand how to configure partitions and repartitioning to improve query performance.

What changes are included in this PR?

- Added a new benchmark script, benchmarks/max_cpu_usage.py, showing how to configure DataFusion for optimal parallelism and measure the performance impact.
- Updated README.md with a reference to the new documentation section.
- Expanded the user guide (docs/source/user-guide/configuration.rst) with a new section, Maximizing CPU Usage, including:
  - Examples of tuning SessionConfig for higher partition counts.
  - Enabling automatic repartitioning for joins, aggregations, and window functions.
  - Manual repartitioning examples.
  - Benchmark usage instructions and performance comparison examples.
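For readers who want a feel for the settings listed above, here is a minimal sketch and not the content of the added documentation or script. The helper names `choose_target_partitions` and `build_context` are hypothetical and introduced only for illustration; the `SessionConfig` methods are the ones DataFusion Python exposes for partition count and automatic repartitioning, and the choice of one partition per core is an assumption rather than a recommendation from this PR.

```python
import os


def choose_target_partitions(cpu_count=None, multiplier=1):
    """Pick a partition count from the number of available cores.

    Hypothetical helper: cpu_count defaults to os.cpu_count(), and a
    multiplier above 1 deliberately oversubscribes the cores.
    """
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    # Never return fewer than one partition.
    return max(1, cores * multiplier)


def build_context(target_partitions):
    """Create a SessionContext with automatic repartitioning enabled."""
    # Imported lazily so the helper above can be used (and tested)
    # on machines where DataFusion is not installed.
    from datafusion import SessionConfig, SessionContext

    config = (
        SessionConfig()
        .with_target_partitions(target_partitions)
        .with_repartition_joins(True)
        .with_repartition_aggregations(True)
        .with_repartition_windows(True)
    )
    return SessionContext(config)
```

Calling `build_context(choose_target_partitions())` would give a context sized to the local machine. Manual repartitioning is a separate step applied to an individual DataFrame, for example with `DataFrame.repartition`, and whether either approach helps depends on the workload.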

Are these changes tested?

The new benchmarks/max_cpu_usage.py script serves as a functional test and demonstration of configuration options. It generates synthetic data and measures query performance, showcasing partitioning impacts. While not a formal unit test, it validates correct behavior of partitioning and parallelism features.

Are there any user-facing changes?

Yes:

- New documentation in the configuration guide explaining CPU usage optimization.
- A new benchmark script available under benchmarks/ for users to run and test parallelism configuration.

No breaking API changes are introduced.

timsaucer

This is an excellent addition!

I think it could benefit from a little extra text in the online documentation or the script itself to tell users that this benchmark is an example of one type of operation. The actual performance they see can be affected by a variety of factors, including the types of table providers they are using, the I/O their setup requires, and the operations they are performing. It is recommended that users build a similar benchmark themselves to evaluate performance on their own hardware and workloads.
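As a starting point for such a do-it-yourself benchmark, the following is a rough sketch using only the Python standard library. The names `time_call` and `compare` are hypothetical, and the approach (one warm-up run, then best and median wall-clock time over a few repeats) is an assumption about what a reasonable harness looks like, not what benchmarks/max_cpu_usage.py does.

```python
import statistics
import time


def time_call(fn, repeats=5, warmup=1):
    """Run fn() repeatedly and return (best, median) wall-clock seconds."""
    # Warm-up runs are executed but not timed.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return min(samples), statistics.median(samples)


def compare(workloads, repeats=5):
    """Time each named workload and return {name: (best, median)}."""
    return {
        name: time_call(fn, repeats=repeats)
        for name, fn in workloads.items()
    }
```

Each workload would be a zero-argument callable that runs one of your own queries against your own table providers to completion, once per configuration you want to compare. Results from such a harness only describe the machine and data they were measured on.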

kosiew · 2025-08-31T07:58:46Z

@timsaucer, I implemented your excellent suggestion.

kosiew added 2 commits August 29, 2025 20:36

docs: add configuration tips for maximizing CPU usage and new benchmark script

a9ad2a9

docs: enhance benchmark example for maximizing CPU usage in DataFusion

ba11206

timsaucer reviewed Aug 30, 2025

docs: enhance benchmark script and configuration guide for maximizing CPU usage

1d0228b

kosiew force-pushed the cpu-1206 branch from e8048c7 to 1d0228b on August 31, 2025 07:56

kosiew requested a review from timsaucer August 31, 2025 07:58

timsaucer approved these changes Aug 31, 2025

timsaucer merged commit 61f981b into apache:main Aug 31, 2025
17 checks passed
