Commit ba11206

docs: enhance benchmark example for maximizing CPU usage in DataFusion
1 parent a9ad2a9 commit ba11206

File tree

1 file changed

+48
-1
lines changed

docs/source/user-guide/configuration.rst

Lines changed: 48 additions & 1 deletion
@@ -95,5 +95,52 @@ control:
9595
result = df.collect()
9696
9797
98-
You can read more about available :py:class:`~datafusion.context.SessionConfig` options in the `rust DataFusion Configuration guide <https://arrow.apache.org/datafusion/user-guide/configs.html>`_,
98+
Benchmark Example
99+
^^^^^^^^^^^^^^^^^
100+
101+
The repository includes a benchmark script that demonstrates how to maximize CPU usage
102+
with DataFusion. The :code:`benchmarks/max_cpu_usage.py` script shows a practical example
103+
of configuring DataFusion for optimal parallelism.
104+
105+
You can run the benchmark script to see the impact of different configuration settings:
106+
107+
.. code-block:: bash
108+
109+
# Run with default settings (uses all CPU cores)
110+
python benchmarks/max_cpu_usage.py
111+
112+
# Run with specific number of rows and partitions
113+
python benchmarks/max_cpu_usage.py --rows 5000000 --partitions 16
114+
115+
# See all available options
116+
python benchmarks/max_cpu_usage.py --help
117+
118+
Here's an example showing the performance difference between single and multiple partitions:
119+
120+
.. code-block:: bash
121+
122+
# Single partition - slower processing
123+
$ python benchmarks/max_cpu_usage.py --rows=10000000 --partitions 1
124+
Processed 10000000 rows using 1 partitions in 0.107s
125+
126+
# Multiple partitions - faster processing
127+
$ python benchmarks/max_cpu_usage.py --rows=10000000 --partitions 10
128+
Processed 10000000 rows using 10 partitions in 0.038s
129+
130+
This example demonstrates a roughly 2.8x speedup (0.107s vs 0.038s) when using
131+
10 partitions instead of 1, showcasing how proper partitioning can significantly improve
132+
CPU utilization and query performance.
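
The speedup can be checked directly from the two timings quoted above. The sketch below is only arithmetic on those numbers. The "parallel efficiency" figure (speedup divided by partition count) is an extra metric introduced here for illustration and is not reported by the benchmark script.

```python
# Timings copied from the benchmark output quoted above.
single_partition_s = 0.107
ten_partitions_s = 0.038
partitions = 10

# Speedup: how many times faster the 10-partition run finished.
speedup = single_partition_s / ten_partitions_s

# Parallel efficiency (illustrative metric, not part of the script):
# the fraction of an ideal 10x speedup that was actually achieved.
efficiency = speedup / partitions

print(f"speedup: {speedup:.2f}x")
print(f"parallel efficiency: {efficiency:.0%}")
```

With a run this short (tens of milliseconds), fixed overheads likely account for a large share of the elapsed time, so larger row counts may give a different ratio.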
133+
134+
The script demonstrates several key optimization techniques:
135+
136+
1. **Higher target partition count**: Uses :code:`with_target_partitions()` to set the number of concurrent partitions
137+
2. **Automatic repartitioning**: Enables repartitioning for joins, aggregations, and window functions
138+
3. **Manual repartitioning**: Uses :code:`repartition()` to ensure all partitions are utilized
139+
4. **CPU-intensive operations**: Performs aggregations that can benefit from parallelization
140+
141+
The benchmark creates synthetic data and measures the time taken to perform a sum aggregation
142+
across the specified number of partitions. This helps you understand how partition configuration
143+
affects performance on your specific hardware.
144+
145+
For more information about available :py:class:`~datafusion.context.SessionConfig` options, see the `rust DataFusion Configuration guide <https://arrow.apache.org/datafusion/user-guide/configs.html>`_,
99146
and about :code:`RuntimeEnvBuilder` options in the rust `online API documentation <https://docs.rs/datafusion/latest/datafusion/execution/runtime_env/struct.RuntimeEnvBuilder.html>`_.