Skip to content

Conversation

kosiew
Copy link
Contributor

@kosiew kosiew commented Aug 29, 2025

Which issue does this PR close?

was doing some testing and notice that datafusion don't seems to be using all cores in my notebook runtime

Rationale for this change

This change provides users with practical guidance and examples for tuning DataFusion’s parallelism to maximize CPU utilization. By documenting configuration options and including a benchmark script, users can better understand how to configure partitions and repartitioning to improve query performance.

What changes are included in this PR?

  • Added a new benchmark script benchmarks/max_cpu_usage.py showing how to configure DataFusion for optimal parallelism and measure performance impact.

  • Updated README.md with a reference to the new documentation section.

  • Expanded user guide (docs/source/user-guide/configuration.rst) with a new section Maximizing CPU Usage, including:

    • Examples of tuning SessionConfig for higher partition counts.
    • Enabling automatic repartitioning for joins, aggregations, and window functions.
    • Manual repartitioning examples.
    • Benchmark usage instructions and performance comparison examples.

Are these changes tested?

The new benchmarks/max_cpu_usage.py script serves as a functional test and demonstration of configuration options. It generates synthetic data and measures query performance, showcasing partitioning impacts. While not a formal unit test, it validates correct behavior of partitioning and parallelism features.

Are there any user-facing changes?

Yes:

  • New documentation in the configuration guide explaining CPU usage optimization.
  • A new benchmark script available under benchmarks/ for users to run and test parallelism configuration.

No breaking API changes are introduced.

Copy link
Contributor

@timsaucer timsaucer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an excellent addition!

I think it could benefit from a little extra text in the online documentation or the script itself to tell the users that this benchmark is an example of one type of operation. The actual performance they see can be impacted by a variety of factors, including the types of table providers they are using, what IO that must happen for their setup, and what operations they are performing. It is recommended that the user build a similar benchmark for themself to evaluate using their own hardware and work loads.

@kosiew
Copy link
Contributor Author

kosiew commented Aug 31, 2025

@timsaucer ,
I implemented your excellent suggestion.

@kosiew kosiew requested a review from timsaucer August 31, 2025 07:58
@timsaucer timsaucer merged commit 61f981b into apache:main Aug 31, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants