
Commit 1d0228b

docs: enhance benchmark script and configuration guide for maximizing CPU usage
1 parent ba11206 commit 1d0228b

File tree

2 files changed: +74, -3 lines changed


benchmarks/max_cpu_usage.py

Lines changed: 34 additions & 3 deletions

@@ -14,7 +14,31 @@
 # KIND, either express or implied. See the License for the
 # specific language governing permissions and limitations
 # under the License.
-"""Benchmark script showing how to maximize CPU usage."""
+"""Benchmark script showing how to maximize CPU usage.
+
+This script demonstrates one example of tuning DataFusion for improved parallelism
+and CPU utilization. It uses synthetic in-memory data and performs simple aggregation
+operations to showcase the impact of partitioning configuration.
+
+IMPORTANT: This is a simplified example designed to illustrate partitioning concepts.
+Actual performance in your applications may vary significantly based on many factors:
+
+- Type of table providers (Parquet files, CSV, databases, etc.)
+- I/O operations and storage characteristics (local disk, network, cloud storage)
+- Query complexity and operation types (joins, window functions, complex expressions)
+- Data distribution and size characteristics
+- Memory available and hardware specifications
+- Network latency for distributed data sources
+
+It is strongly recommended that you create similar benchmarks tailored to your specific:
+- Hardware configuration
+- Data sources and formats
+- Typical query patterns and workloads
+- Performance requirements
+
+This will give you more accurate insights into how DataFusion configuration options
+will affect your particular use case.
+"""

 from __future__ import annotations

@@ -28,8 +52,15 @@


 def main(num_rows: int, partitions: int) -> None:
-    """Run a simple aggregation after repartitioning."""
-    # Create some example data
+    """Run a simple aggregation after repartitioning.
+
+    This function demonstrates basic partitioning concepts using synthetic data.
+    Real-world performance will depend on your specific data sources, query types,
+    and system configuration.
+    """
+    # Create some example data (synthetic in-memory data for demonstration)
+    # Note: Real applications typically work with files, databases, or other
+    # data sources that have different I/O and distribution characteristics
     array = pa.array(range(num_rows))
     batch = pa.record_batch([array], names=["a"])
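
Taken together, the updated docstring and the body of main describe a pattern: configure a session with an explicit partition count, repartition the synthetic data, and time a simple aggregation. The following is a minimal standalone sketch of that pattern, not the committed script itself; the helper name run_benchmark and the row/partition counts are illustrative, and exact datafusion-python method signatures may differ between releases.

import time

import pyarrow as pa
from datafusion import SessionConfig, SessionContext, col
from datafusion import functions as f


def run_benchmark(num_rows: int, partitions: int) -> float:
    """Time a sum aggregation over synthetic data split across `partitions`."""
    # Ask DataFusion to plan for the requested degree of parallelism.
    config = SessionConfig().with_target_partitions(partitions)
    ctx = SessionContext(config)

    # Synthetic in-memory data, mirroring the committed benchmark.
    batch = pa.record_batch([pa.array(range(num_rows))], names=["a"])
    df = ctx.create_dataframe([[batch]])

    start = time.perf_counter()
    # Repartition, then run a simple aggregation so every partition does work.
    df.repartition(partitions).aggregate([], [f.sum(col("a"))]).collect()
    return time.perf_counter() - start


if __name__ == "__main__":
    for p in (1, 4, 8):  # illustrative partition counts, not from the commit
        print(f"{p:>2} partitions: {run_benchmark(10_000_000, p):.3f}s")

In line with the caveats added to the docstring, any speed-up observed with such a harness is specific to in-memory, CPU-bound aggregation and may not carry over to I/O-bound workloads.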

docs/source/user-guide/configuration.rst

Lines changed: 40 additions & 0 deletions

@@ -142,5 +142,45 @@ The benchmark creates synthetic data and measures the time taken to perform a sum aggregation
 across the specified number of partitions. This helps you understand how partition configuration
 affects performance on your specific hardware.

+Important Considerations
+""""""""""""""""""""""""
+
+The provided benchmark script demonstrates partitioning concepts using synthetic in-memory data
+and simple aggregation operations. While useful for understanding basic configuration principles,
+actual performance in production environments may vary significantly based on numerous factors:
+
+**Data Sources and I/O Characteristics:**
+
+- **Table providers**: Performance differs greatly between Parquet files, CSV files, databases, and cloud storage
+- **Storage type**: Local SSD, network-attached storage, and cloud storage have vastly different characteristics
+- **Network latency**: Remote data sources introduce additional latency considerations
+- **File sizes and distribution**: Large files may benefit differently from partitioning than many small files
+
+**Query and Workload Characteristics:**
+
+- **Operation complexity**: Simple aggregations versus complex joins, window functions, or nested queries
+- **Data distribution**: Skewed data may not partition evenly, affecting parallel efficiency
+- **Memory usage**: Large datasets may require different memory management strategies
+- **Concurrent workloads**: Multiple queries running simultaneously affect resource allocation
+
+**Hardware and Environment Factors:**
+
+- **CPU architecture**: Different processors have varying parallel processing capabilities
+- **Available memory**: Limited RAM may require different optimization strategies
+- **System load**: Other applications competing for resources affect DataFusion performance
+
+**Recommendations for Production Use:**
+
+To optimize DataFusion for your specific use case, it is strongly recommended to:
+
+1. **Create custom benchmarks** using your actual data sources, formats, and query patterns
+2. **Test with representative data volumes** that match your production workloads
+3. **Measure end-to-end performance** including data loading, processing, and result handling
+4. **Evaluate different configuration combinations** for your specific hardware and workload
+5. **Monitor resource utilization** (CPU, memory, I/O) to identify bottlenecks in your environment
+
+This approach will provide more accurate insights into how DataFusion configuration options
+will impact your particular applications and infrastructure.
+
 For more information about available :py:class:`~datafusion.context.SessionConfig` options, see the `rust DataFusion Configuration guide <https://arrow.apache.org/datafusion/user-guide/configs.html>`_,
 and about :code:`RuntimeEnvBuilder` options in the rust `online API documentation <https://docs.rs/datafusion/latest/datafusion/execution/runtime_env/struct.RuntimeEnvBuilder.html>`_.
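
The recommendations added to the guide (custom benchmarks, representative data, end-to-end timing across configurations) could be exercised with a small harness along the following lines. This is a hedged sketch rather than part of the commit: the Parquet path, column names, and query are placeholders to be replaced with your own data sources and typical workloads, and exact datafusion-python signatures may vary by release.

import time

from datafusion import SessionConfig, SessionContext, col
from datafusion import functions as f


def time_end_to_end(parquet_path: str, target_partitions: int) -> float:
    """Measure read + aggregate + collect against a real data source."""
    config = SessionConfig().with_target_partitions(target_partitions)
    ctx = SessionContext(config)

    start = time.perf_counter()
    # Unlike the synthetic benchmark, this timing includes real file I/O.
    df = ctx.read_parquet(parquet_path)
    df.aggregate([col("group_key")], [f.count(col("value"))]).collect()
    return time.perf_counter() - start


if __name__ == "__main__":
    # "data/events.parquet", "group_key", and "value" are placeholders;
    # substitute your own files, columns, and representative queries.
    for p in (2, 4, 8, 16):
        print(f"target_partitions={p}: {time_end_to_end('data/events.parquet', p):.2f}s")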
