
Commit 1d0228b

docs: enhance benchmark script and configuration guide for maximizing CPU usage
1 parent ba11206 commit 1d0228b

File tree

2 files changed: +74, -3 lines changed


benchmarks/max_cpu_usage.py

Lines changed: 34 additions & 3 deletions

@@ -14,7 +14,31 @@
 # KIND, either express or implied. See the License for the
 # specific language governing permissions and limitations
 # under the License.
-"""Benchmark script showing how to maximize CPU usage."""
+"""Benchmark script showing how to maximize CPU usage.
+
+This script demonstrates one example of tuning DataFusion for improved parallelism
+and CPU utilization. It uses synthetic in-memory data and performs simple aggregation
+operations to showcase the impact of partitioning configuration.
+
+IMPORTANT: This is a simplified example designed to illustrate partitioning concepts.
+Actual performance in your applications may vary significantly based on many factors:
+
+- Type of table providers (Parquet files, CSV, databases, etc.)
+- I/O operations and storage characteristics (local disk, network, cloud storage)
+- Query complexity and operation types (joins, window functions, complex expressions)
+- Data distribution and size characteristics
+- Memory available and hardware specifications
+- Network latency for distributed data sources
+
+It is strongly recommended that you create similar benchmarks tailored to your specific:
+- Hardware configuration
+- Data sources and formats
+- Typical query patterns and workloads
+- Performance requirements
+
+This will give you more accurate insights into how DataFusion configuration options
+will affect your particular use case.
+"""

 from __future__ import annotations

@@ -28,8 +52,15 @@


 def main(num_rows: int, partitions: int) -> None:
-    """Run a simple aggregation after repartitioning."""
-    # Create some example data
+    """Run a simple aggregation after repartitioning.
+
+    This function demonstrates basic partitioning concepts using synthetic data.
+    Real-world performance will depend on your specific data sources, query types,
+    and system configuration.
+    """
+    # Create some example data (synthetic in-memory data for demonstration)
+    # Note: Real applications typically work with files, databases, or other
+    # data sources that have different I/O and distribution characteristics
     array = pa.array(range(num_rows))
     batch = pa.record_batch([array], names=["a"])
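
Taken together, the updated docstring and the body of main describe a pattern: configure a session with an explicit partition count, repartition the synthetic data, and time a simple aggregation. The following is a minimal standalone sketch of that pattern, not the committed script itself; the helper name run_benchmark and the row/partition counts are illustrative, and exact datafusion-python method signatures may differ between releases.

import time

import pyarrow as pa
from datafusion import SessionConfig, SessionContext, col
from datafusion import functions as f


def run_benchmark(num_rows: int, partitions: int) -> float:
    """Time a sum aggregation over synthetic data split across `partitions`."""
    # Ask DataFusion to plan for the requested degree of parallelism.
    config = SessionConfig().with_target_partitions(partitions)
    ctx = SessionContext(config)

    # Synthetic in-memory data, mirroring the committed benchmark.
    batch = pa.record_batch([pa.array(range(num_rows))], names=["a"])
    df = ctx.create_dataframe([[batch]])

    start = time.perf_counter()
    # Repartition, then run a simple aggregation so every partition does work.
    df.repartition(partitions).aggregate([], [f.sum(col("a"))]).collect()
    return time.perf_counter() - start


if __name__ == "__main__":
    for p in (1, 4, 8):  # illustrative partition counts, not from the commit
        print(f"{p:>2} partitions: {run_benchmark(10_000_000, p):.3f}s")

In line with the caveats added to the docstring, any speed-up observed with such a harness is specific to in-memory, CPU-bound aggregation and may not carry over to I/O-bound workloads.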

docs/source/user-guide/configuration.rst

Lines changed: 40 additions & 0 deletions

@@ -142,5 +142,45 @@ The benchmark creates synthetic data and measures the time taken to perform a sum aggregation
 across the specified number of partitions. This helps you understand how partition configuration
 affects performance on your specific hardware.

+Important Considerations
+""""""""""""""""""""""""
+
+The provided benchmark script demonstrates partitioning concepts using synthetic in-memory data
+and simple aggregation operations. While useful for understanding basic configuration principles,
+actual performance in production environments may vary significantly based on numerous factors:
+
+**Data Sources and I/O Characteristics:**
+
+- **Table providers**: Performance differs greatly between Parquet files, CSV files, databases, and cloud storage
+- **Storage type**: Local SSD, network-attached storage, and cloud storage have vastly different characteristics
+- **Network latency**: Remote data sources introduce additional latency considerations
+- **File sizes and distribution**: Large files may benefit differently from partitioning than many small files
+
+**Query and Workload Characteristics:**
+
+- **Operation complexity**: Simple aggregations versus complex joins, window functions, or nested queries
+- **Data distribution**: Skewed data may not partition evenly, affecting parallel efficiency
+- **Memory usage**: Large datasets may require different memory management strategies
+- **Concurrent workloads**: Multiple queries running simultaneously affect resource allocation
+
+**Hardware and Environment Factors:**
+
+- **CPU architecture**: Different processors have varying parallel processing capabilities
+- **Available memory**: Limited RAM may require different optimization strategies
+- **System load**: Other applications competing for resources affect DataFusion performance
+
+**Recommendations for Production Use:**
+
+To optimize DataFusion for your specific use case, it is strongly recommended to:
+
+1. **Create custom benchmarks** using your actual data sources, formats, and query patterns
+2. **Test with representative data volumes** that match your production workloads
+3. **Measure end-to-end performance** including data loading, processing, and result handling
+4. **Evaluate different configuration combinations** for your specific hardware and workload
+5. **Monitor resource utilization** (CPU, memory, I/O) to identify bottlenecks in your environment
+
+This approach will provide more accurate insights into how DataFusion configuration options
+will impact your particular applications and infrastructure.
+
 For more information about available :py:class:`~datafusion.context.SessionConfig` options, see the `rust DataFusion Configuration guide <https://arrow.apache.org/datafusion/user-guide/configs.html>`_,
 and about :code:`RuntimeEnvBuilder` options in the rust `online API documentation <https://docs.rs/datafusion/latest/datafusion/execution/runtime_env/struct.RuntimeEnvBuilder.html>`_.
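
The recommendations added to the guide (custom benchmarks, representative data, end-to-end timing across configurations) could be exercised with a small harness along the following lines. This is a hedged sketch rather than part of the commit: the Parquet path, column names, and query are placeholders to be replaced with your own data sources and typical workloads, and exact datafusion-python signatures may vary by release.

import time

from datafusion import SessionConfig, SessionContext, col
from datafusion import functions as f


def time_end_to_end(parquet_path: str, target_partitions: int) -> float:
    """Measure read + aggregate + collect against a real data source."""
    config = SessionConfig().with_target_partitions(target_partitions)
    ctx = SessionContext(config)

    start = time.perf_counter()
    # Unlike the synthetic benchmark, this timing includes real file I/O.
    df = ctx.read_parquet(parquet_path)
    df.aggregate([col("group_key")], [f.count(col("value"))]).collect()
    return time.perf_counter() - start


if __name__ == "__main__":
    # "data/events.parquet", "group_key", and "value" are placeholders;
    # substitute your own files, columns, and representative queries.
    for p in (2, 4, 8, 16):
        print(f"target_partitions={p}: {time_end_to_end('data/events.parquet', p):.2f}s")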
