@@ -8,7 +8,7 @@ title: Spark Configuration
88Spark provides three locations to configure the system:
99
1010* [Spark properties](#spark-properties) control most application parameters and can be set by using
11- a [SparkConf](api/core/index.html#org.apache.spark.SparkConf) object, or through Java
11+ a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object, or through Java
1212 system properties.
1313* [Environment variables](#environment-variables) can be used to set per-machine settings, such as
1414 the IP address, through the `conf/spark-env.sh` script on each node.
@@ -23,8 +23,8 @@ application. These properties can be set directly on a
2323(e.g. master URL and application name), as well as arbitrary key-value pairs through the
2424`set()` method. For example, we could initialize an application with two threads as follows:
2525
26- Note that we run with local[2], meaning two threads - which represents "minimal" parallelism,
27- which can help detect bugs that only exist when we run in a distributed context.
26+ Note that we run with local[2], meaning two threads - which represents "minimal" parallelism,
27+ which can help detect bugs that only exist when we run in a distributed context.
2828
2929{% highlight scala %}
3030val conf = new SparkConf()
@@ -35,7 +35,7 @@ val sc = new SparkContext(conf)
3535{% endhighlight %}
3636
3737Note that we can have more than 1 thread in local mode, and in cases like Spark Streaming, we may actually
38- require one to prevent any sort of starvation issues.
38+ require more than one thread to prevent starvation issues.
3939
4040## Dynamically Loading Spark Properties
4141In some cases, you may want to avoid hard-coding certain configurations in a `SparkConf`. For
@@ -48,8 +48,8 @@ val sc = new SparkContext(new SparkConf())
4848
4949Then, you can supply configuration values at runtime:
5050{% highlight bash %}
51- ./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false
52- --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
51+ ./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false \
52+   --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
5353{% endhighlight %}
5454
5555The Spark shell and [`spark-submit`](submitting-applications.html)
@@ -123,7 +123,7 @@ of the most common options to set are:
123123 <td >
124124 Limit of total size of serialized results of all partitions for each Spark action (e.g. collect).
125125 Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size
126- is above this limit.
126+ is above this limit.
127127 Having a high limit may cause out-of-memory errors in driver (depends on spark.driver.memory
128128 and memory overhead of objects in JVM). Setting a proper limit can protect the driver from
129129 out-of-memory errors.
@@ -217,6 +217,45 @@ Apart from these, the following properties are also available, and may be useful
217217 Set a special library path to use when launching executor JVM's.
218218 </td >
219219</tr >
220+ <tr >
221+ <td ><code >spark.executor.logs.rolling.strategy</code ></td >
222+ <td >(none)</td >
223+ <td >
224+ Set the strategy for rolling executor logs. Rolling is disabled by default. It can
225+ be set to "time" (time-based rolling) or "size" (size-based rolling). For "time",
226+ use <code>spark.executor.logs.rolling.time.interval</code> to set the rolling interval.
227+ For "size", use <code>spark.executor.logs.rolling.size.maxBytes</code> to set
228+ the maximum file size for rolling.
229+ </td >
230+ </tr >
231+ <tr >
232+ <td ><code >spark.executor.logs.rolling.time.interval</code ></td >
233+ <td >daily</td >
234+ <td >
235+ Set the time interval at which the executor logs are rolled over.
236+ Rolling is disabled by default. Valid values are <code>daily</code>, <code>hourly</code>, <code>minutely</code> or
237+ any interval in seconds. See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
238+ for automatic cleaning of old logs.
239+ </td >
240+ </tr >
241+ <tr >
242+ <td ><code >spark.executor.logs.rolling.size.maxBytes</code ></td >
243+ <td >(none)</td >
244+ <td >
245+ Set the maximum file size, in bytes, at which the executor logs are rolled over.
246+ Rolling is disabled by default.
247+ See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
248+ for automatic cleaning of old logs.
249+ </td >
250+ </tr >
251+ <tr >
252+ <td ><code >spark.executor.logs.rolling.maxRetainedFiles</code ></td >
253+ <td >(none)</td >
254+ <td >
255+ Set the number of most recent rolled log files that the system retains.
256+ Older log files are deleted. Disabled by default.
257+ </td >
258+ </tr >
220259<tr >
221260 <td ><code >spark.files.userClassPathFirst</code ></td >
222261 <td >false</td >
@@ -250,10 +289,11 @@ Apart from these, the following properties are also available, and may be useful
250289 <td ><code >spark.python.profile.dump</code ></td >
251290 <td >(none)</td >
252291 <td >
253- The directory which is used to dump the profile result before driver exiting.
292+ The directory used to dump the profile results before the driver exits.
254293 The results will be dumped as a separate file for each RDD. They can be loaded
255294 by pstats.Stats(). If this is specified, the profile result will not be displayed
256295 automatically.
296+ </td >
257297</tr >
258298<tr >
259299 <td ><code >spark.python.worker.reuse</code ></td >
@@ -269,8 +309,8 @@ Apart from these, the following properties are also available, and may be useful
269309 <td ><code >spark.executorEnv.[EnvironmentVariableName]</code ></td >
270310 <td >(none)</td >
271311 <td >
272- Add the environment variable specified by <code>EnvironmentVariableName</code> to the Executor
273- process. The user can specify multiple of these and to set multiple environment variables.
312+ Add the environment variable specified by <code>EnvironmentVariableName</code> to the Executor
313+ process. The user can specify multiple of these to set multiple environment variables.
274314 </td >
275315</tr >
276316<tr >
@@ -475,9 +515,9 @@ Apart from these, the following properties are also available, and may be useful
475515 <td >
476516 The codec used to compress internal data such as RDD partitions, broadcast variables and
477517 shuffle outputs. By default, Spark provides three codecs: <code>lz4</code>, <code>lzf</code>,
478- and <code>snappy</code>. You can also use fully qualified class names to specify the codec,
479- e.g.
480- <code>org.apache.spark.io.LZ4CompressionCodec</code>,
518+ and <code>snappy</code>. You can also use fully qualified class names to specify the codec,
519+ e.g.
520+ <code>org.apache.spark.io.LZ4CompressionCodec</code>,
481521 <code>org.apache.spark.io.LZFCompressionCodec</code>,
482522 and <code>org.apache.spark.io.SnappyCompressionCodec</code>.
483523 </td >
@@ -945,7 +985,7 @@ Apart from these, the following properties are also available, and may be useful
945985 (resources are executors in yarn mode, CPU cores in standalone mode)
946986 to wait for before scheduling begins. Specified as a double between 0.0 and 1.0.
947987 Regardless of whether the minimum ratio of resources has been reached,
948- the maximum amount of time it will wait before scheduling begins is controlled by config
988+ the maximum amount of time it will wait before scheduling begins is controlled by config
949989 <code>spark.scheduler.maxRegisteredResourcesWaitingTime</code>.
950990 </td >
951991</tr >
@@ -954,7 +994,7 @@ Apart from these, the following properties are also available, and may be useful
954994 <td >30000</td >
955995 <td >
956996 Maximum amount of time to wait for resources to register before scheduling begins
957- (in milliseconds).
997+ (in milliseconds).
958998 </td >
959999</tr >
9601000<tr >
@@ -1023,7 +1063,7 @@ Apart from these, the following properties are also available, and may be useful
10231063 <td >false</td >
10241064 <td >
10251065 Whether Spark ACLs are enabled. If enabled, this checks to see if the user has
1026- access permissions to view or modify the job. Note this requires the user to be known,
1066+ access permissions to view or modify the job. Note this requires the user to be known,
10271067 so if the user comes across as null no checks are done. Filters can be used with the UI
10281068 to authenticate and set the user.
10291069 </td >
@@ -1062,17 +1102,31 @@ Apart from these, the following properties are also available, and may be useful
10621102 <td ><code >spark.streaming.blockInterval</code ></td >
10631103 <td >200</td >
10641104 <td >
1065- Interval (milliseconds) at which data received by Spark Streaming receivers is coalesced
1066- into blocks of data before storing them in Spark.
1105+ Interval (milliseconds) at which data received by Spark Streaming receivers is chunked
1106+ into blocks of data before storing them in Spark. The minimum recommended value is 50 ms. See the
1107+ <a href="streaming-programming-guide.html#level-of-parallelism-in-data-receiving">performance
1108+ tuning</a> section in the Spark Streaming programming guide for more details.
10671109 </td >
10681110</tr >
10691111<tr >
10701112 <td ><code >spark.streaming.receiver.maxRate</code ></td >
10711113 <td >infinite</td >
10721114 <td >
1073- Maximum rate ( per second) at which each receiver will push data into blocks. Effectively,
1074- each stream will consume at most this number of records per second.
1115+ Maximum number of records per second at which each receiver will receive data.
1116+ Effectively, each stream will consume at most this number of records per second.
10751117 Setting this configuration to 0 or a negative number will put no limit on the rate.
1118+ See the <a href="streaming-programming-guide.html#deploying-applications">deployment guide</a>
1119+ in the Spark Streaming programming guide for more details.
1120+ </td >
1121+ </tr >
1122+ <tr >
1123+ <td ><code >spark.streaming.receiver.writeAheadLogs.enable</code ></td >
1124+ <td >false</td >
1125+ <td >
1126+ Enable write ahead logs for receivers. All the input data received through receivers
1127+ will be saved to write ahead logs, which allows it to be recovered after driver failures.
1128+ See the <a href="streaming-programming-guide.html#deploying-applications">deployment guide</a>
1129+ in the Spark Streaming programming guide for more details.
10761130 </td >
10771131</tr >
10781132<tr >
@@ -1086,45 +1140,6 @@ Apart from these, the following properties are also available, and may be useful
10861140 higher memory usage in Spark.
10871141 </td >
10881142</tr >
1089- <tr >
1090- <td ><code >spark.executor.logs.rolling.strategy</code ></td >
1091- <td >(none)</td >
1092- <td >
1093- Set the strategy of rolling of executor logs. By default it is disabled. It can
1094- be set to "time" (time-based rolling) or "size" (size-based rolling). For "time",
1095- use <code>spark.executor.logs.rolling.time.interval</code> to set the rolling interval.
1096- For "size", use <code>spark.executor.logs.rolling.size.maxBytes</code> to set
1097- the maximum file size for rolling.
1098- </td >
1099- </tr >
1100- <tr >
1101- <td ><code >spark.executor.logs.rolling.time.interval</code ></td >
1102- <td >daily</td >
1103- <td >
1104- Set the time interval by which the executor logs will be rolled over.
1105- Rolling is disabled by default. Valid values are `daily`, `hourly`, `minutely` or
1106- any interval in seconds. See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
1107- for automatic cleaning of old logs.
1108- </td >
1109- </tr >
1110- <tr >
1111- <td ><code >spark.executor.logs.rolling.size.maxBytes</code ></td >
1112- <td >(none)</td >
1113- <td >
1114- Set the max size of the file by which the executor logs will be rolled over.
1115- Rolling is disabled by default. Value is set in terms of bytes.
1116- See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
1117- for automatic cleaning of old logs.
1118- </td >
1119- </tr >
1120- <tr >
1121- <td ><code >spark.executor.logs.rolling.maxRetainedFiles</code ></td >
1122- <td >(none)</td >
1123- <td >
1124- Sets the number of latest rolling log files that are going to be retained by the system.
1125- Older log files will be deleted. Disabled by default.
1126- </td >
1127- </tr >
11281143</table >
11291144
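As a rough sketch of how the executor log rolling properties in the table above fit together, the fragment below assumes they are set in `conf/spark-defaults.conf`. The values (hourly rolling, 24 retained files) are illustrative assumptions, not recommendations from this guide.

```properties
# Illustrative values only; adjust for your deployment.
# Roll executor logs on a time basis rather than by size.
spark.executor.logs.rolling.strategy          time
# Start a new log file every hour.
spark.executor.logs.rolling.time.interval     hourly
# Keep the 24 most recent rolled files; older ones are deleted.
spark.executor.logs.rolling.maxRetainedFiles  24
```

The same keys could instead be passed at submit time as `--conf key=value` arguments to `spark-submit`.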
11301145#### Cluster Managers