From b4efdb736caabbcc2612b6b4450ab87b133a1d8e Mon Sep 17 00:00:00 2001
From: Chao Sun
Date: Thu, 20 Aug 2020 16:19:21 -0700
Subject: [PATCH 1/3] [SPARK-32646][DOC] Add suggestion for parallel directory listing in tuning doc

---
 docs/tuning.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/tuning.md b/docs/tuning.md
index 8e29e5d2e9e7..0b9366912bf4 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -264,6 +264,13 @@ parent RDD's number of partitions. You can pass the level of parallelism as a se
 or set the config property `spark.default.parallelism` to change the default.
 In general, we recommend 2-3 tasks per CPU core in your cluster.
 
+Sometimes you may also need to increase directory listing parallelism when the job input has a large number of directories;
+otherwise the process could take a very long time, especially when running against object stores like S3.
+If your job works on RDDs with Hadoop input formats (e.g., via `SparkContext#sequenceFile`), the parallelism is
+controlled via `spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads` (default is 1). For other
+cases such as Spark SQL, you can tune `spark.sql.sources.parallelPartitionDiscovery.threshold` to improve the listing
+performance.
+
 ## Memory Usage of Reduce Tasks
 
 Sometimes, you will get an OutOfMemoryError not because your RDDs don't fit in memory, but because the

From 2dfe36f9da0fdeafc1c5f32c06e44fd282f3e2c9 Mon Sep 17 00:00:00 2001
From: Chao Sun
Date: Thu, 20 Aug 2020 21:48:21 -0700
Subject: [PATCH 2/3] Address comments

---
 docs/sql-performance-tuning.md | 22 ++++++++++++++++++++++
 docs/tuning.md                 | 12 ++++++++----
 2 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index 5e6f049a51e9..9cb642dbddc4 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -114,6 +114,28 @@ that these options will be deprecated in future release as more optimizations ar
     </td>
     <td>1.1.0</td>
   </tr>
+  <tr>
+    <td><code>spark.sql.sources.parallelPartitionDiscovery.threshold</code></td>
+    <td>32</td>
+    <td>
+      Configures the threshold to enable parallel listing for job input paths. If the number of
+      input paths is larger than this threshold, Spark will use parallel listing on the driver side.
+      Otherwise, it will fall back to sequential listing. This configuration is only effective when
+      using file-based data sources such as Parquet, ORC and JSON.
+    </td>
+    <td>1.5.0</td>
+  </tr>
+  <tr>
+    <td><code>spark.sql.sources.parallelPartitionDiscovery.parallelism</code></td>
+    <td>10000</td>
+    <td>
+      Configures the maximum listing parallelism for job input paths. If the number of input
+      paths is larger than this value, it will be throttled down to this value. As above,
+      this configuration is only effective when using file-based data sources such as Parquet, ORC
+      and JSON.
+    </td>
+    <td>2.1.1</td>
+  </tr>
 </table>
 
 ## Join Strategy Hints for SQL Queries

diff --git a/docs/tuning.md b/docs/tuning.md
index 0b9366912bf4..c2f3b5f62b73 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -264,12 +264,16 @@ parent RDD's number of partitions. You can pass the level of parallelism as a se
 or set the config property `spark.default.parallelism` to change the default.
 In general, we recommend 2-3 tasks per CPU core in your cluster.
 
+# Parallel Listing on Input Paths
+
 Sometimes you may also need to increase directory listing parallelism when the job input has a large number of directories;
 otherwise the process could take a very long time, especially when running against object stores like S3.
-If your job works on RDDs with Hadoop input formats (e.g., via `SparkContext#sequenceFile`), the parallelism is
-controlled via `spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads` (default is 1). For other
-cases such as Spark SQL, you can tune `spark.sql.sources.parallelPartitionDiscovery.threshold` to improve the listing
-performance.
+If your job works on RDDs with Hadoop input formats (e.g., via `SparkContext.sequenceFile`), the parallelism is
+controlled via [`spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads`](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) (the default is currently 1).
+
+For Spark SQL with file-based data sources, you can tune `spark.sql.sources.parallelPartitionDiscovery.threshold` and
+`spark.sql.sources.parallelPartitionDiscovery.parallelism` to improve listing parallelism. Please
+refer to the [Spark SQL performance tuning guide](sql-performance-tuning.html) for more details.
 
 ## Memory Usage of Reduce Tasks
 
 Sometimes, you will get an OutOfMemoryError not because your RDDs don't fit in memory, but because the

From c4622716b9c7b6e15f6d4f4724e57323d147dd97 Mon Sep 17 00:00:00 2001
From: Chao Sun
Date: Thu, 20 Aug 2020 23:57:51 -0700
Subject: [PATCH 3/3] Address comments

---
 docs/sql-performance-tuning.md | 2 +-
 docs/tuning.md                 | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index 9cb642dbddc4..5d8c3b698c70 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -119,7 +119,7 @@ that these options will be deprecated in future release as more optimizations ar
     <td>32</td>
     <td>
       Configures the threshold to enable parallel listing for job input paths. If the number of
-      input paths is larger than this threshold, Spark will use parallel listing on the driver side.
+      input paths is larger than this threshold, Spark will list the files by using a distributed Spark job.
       Otherwise, it will fall back to sequential listing. This configuration is only effective when
       using file-based data sources such as Parquet, ORC and JSON.
     </td>

diff --git a/docs/tuning.md b/docs/tuning.md
index c2f3b5f62b73..18d4a6205f4f 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -264,7 +264,7 @@ parent RDD's number of partitions. You can pass the level of parallelism as a se
 or set the config property `spark.default.parallelism` to change the default.
 In general, we recommend 2-3 tasks per CPU core in your cluster.
 
-# Parallel Listing on Input Paths
+## Parallel Listing on Input Paths
 
 Sometimes you may also need to increase directory listing parallelism when the job input has a large number of directories;
 otherwise the process could take a very long time, especially when running against object stores like S3.
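Taken together, the three patches document two independent listing knobs. As a usage illustration of the RDD-side knob, here is a minimal Scala sketch, assuming a hypothetical application name, an illustrative thread count of 32, and a made-up `s3a://` input path; none of these values come from the patches themselves.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Raise Hadoop's FileInputFormat listing parallelism from its default of 1
// before the SparkContext is created. The value 32 is illustrative only.
val conf = new SparkConf()
  .setAppName("parallel-listing-example") // hypothetical name
  .set("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "32")

val sc = new SparkContext(conf)

// Planning this RDD now lists the directories matched by the (hypothetical)
// glob with 32 threads instead of a single one.
val events = sc.sequenceFile[String, String]("s3a://my-bucket/events/*")
println(s"records: ${events.count()}")
```

Because every `spark.hadoop.*` property is copied into the job's Hadoop `Configuration`, the same setting can equally be supplied at submit time with `--conf spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads=32`.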
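For the Spark SQL side, a similar sketch, again with hypothetical values: the threshold of 8 and the parallelism cap of 1000 are placeholders chosen only to show where the two properties are set, and the Parquet path is made up.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallel-partition-discovery-example") // hypothetical name
  // List input paths with a distributed job once more than 8 paths are
  // involved (the shipped default threshold is 32).
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "8")
  // Cap the parallelism of that listing job (the shipped default is 10000).
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "1000")
  .getOrCreate()

// Scanning a (hypothetical) heavily partitioned Parquet dataset now takes the
// parallel discovery path during query planning.
val df = spark.read.parquet("s3a://my-bucket/warehouse/events")
println(s"rows: ${df.count()}")
```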