Add benchmark script and documentation for maximizing CPU usage in DataFusion Python #1216
+247
−1
Which issue does this PR close?
Rationale for this change
This change provides users with practical guidance and examples for tuning DataFusion’s parallelism to maximize CPU utilization. By documenting configuration options and including a benchmark script, users can better understand how to configure partitions and repartitioning to improve query performance.
What changes are included in this PR?
Added a new benchmark script, benchmarks/max_cpu_usage.py, showing how to configure DataFusion for optimal parallelism and measure the performance impact.
Updated README.md with a reference to the new documentation section.
Expanded the user guide (docs/source/user-guide/configuration.rst) with a new section, Maximizing CPU Usage, including:
SessionConfig for higher partition counts.
Are these changes tested?
The new benchmarks/max_cpu_usage.py script serves as a functional test and demonstration of the configuration options. It generates synthetic data and measures query performance, showcasing the impact of partitioning. While not a formal unit test, it validates correct behavior of the partitioning and parallelism features.
Are there any user-facing changes?
Yes:
benchmarks/ for users to run and test parallelism configuration.
No breaking API changes are introduced.
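For readers who want a feel for the kind of tuning described above, here is a minimal sketch. It is not the contents of benchmarks/max_cpu_usage.py. The helper names (parallelism_settings, build_context, time_query) are hypothetical, and the choice of one target partition per CPU core is an assumption for illustration. The configuration keys are standard DataFusion options.

```python
import os
import time


def parallelism_settings(cores=None):
    """Return DataFusion config options intended to keep all cores busy.

    One target partition per core is an illustrative assumption,
    not a value taken from this PR.
    """
    n = cores if cores is not None else (os.cpu_count() or 1)
    return {
        "datafusion.execution.target_partitions": str(n),
        "datafusion.optimizer.repartition_joins": "true",
        "datafusion.optimizer.repartition_aggregations": "true",
        "datafusion.optimizer.repartition_windows": "true",
    }


def build_context(settings):
    """Create a SessionContext with the given options applied."""
    # Imported lazily so parallelism_settings() works without datafusion installed.
    from datafusion import SessionConfig, SessionContext

    config = SessionConfig()
    for key, value in settings.items():
        config = config.set(key, value)
    return SessionContext(config)


def time_query(ctx, sql):
    """Run a SQL query to completion and return the elapsed seconds."""
    start = time.perf_counter()
    ctx.sql(sql).collect()
    return time.perf_counter() - start
```

A comparison might build one context with parallelism_settings(1) and another with parallelism_settings(), register the same data in both, and compare the time_query results for an aggregation-heavy query.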