diff --git a/docs/configuration.md b/docs/configuration.md
index dc5553f3da770..1f4f89b14361d 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -353,16 +353,6 @@ Apart from these, the following properties are also available, and may be useful
     Port for the driver to listen on.
   </td>
 </tr>
-<tr>
-  <td>spark.cleaner.ttl</td>
-  <td>(infinite)</td>
-  <td>
-    Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks generated, etc.).
-    Periodic cleanups will ensure that metadata older than this duration will be forgetten. This is
-    useful for running Spark for many hours / days (for example, running 24/7 in case of Spark Streaming
-    applications). Note that any RDD that persists in memory for more than this duration will be cleared as well.
-  </td>
-</tr>
 <tr>
   <td>spark.streaming.blockInterval</td>
   <td>200</td>
@@ -487,6 +477,88 @@ Apart from these, the following properties are also available, and may be useful
 </table>
+
+The following properties can be used to schedule metadata cleanup jobs at different levels.
+These tuning parameters should be set with care and only where required:
+scheduling metadata cleanup in the middle of a job can cause a lot of unnecessary re-computation.
+A usage sketch follows the table below.
+
+<table class="table">
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+<tr>
+  <td>spark.cleaner.ttl</td>
+  <td>(infinite)</td>
+  <td>
+    Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks generated, etc.).
+    Periodic cleanups will ensure that metadata older than this duration will be forgotten. This is
+    useful for running Spark for many hours / days (for example, running 24/7 in case of Spark Streaming
+    applications). Note that any RDD that persists in memory for more than this duration will be cleared as well.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.MAP_OUTPUT_TRACKER</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Cleans up the map that stores, for each shuffle Id, the corresponding mapper information
+    (the input block manager Id and the output result size).
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.SHUFFLE_MAP_TASK</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Clears the cache used for shuffle map tasks (tasks in the earlier stages of a job) -
+    a map from stageId to the serialized byte array of the task.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.RESULT_TASK</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Clears the cache used for result tasks (tasks in the last stage of a job) -
+    a map from stageId to the serialized byte array of the task.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.SPARK_CONTEXT</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Cleans up old persistent (cached) RDDs.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.HTTP_BROADCAST</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Cleans up broadcast files whose timestamps are older than the assigned cleanup value.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.DAG_SCHEDULER</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Clears entries older than the assigned cleanup value from the maps kept inside the
+    DAG scheduler, such as stageIdToStage, pendingTasks, and stageIdToJobIds.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.BLOCK_MANAGER</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Clears old non-broadcast blocks from memory.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.BROADCAST_VARS</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Clears old broadcast blocks from memory.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.SHUFFLE_BLOCK_MANAGER</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Deletes old physical shuffle files stored on disk, created as a result of shuffle
+    transformations/actions such as a reduce job.
+  </td>
+</tr>
+</table>
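+
+As a minimal sketch, the snippet below shows one way these properties could be set on a
+long-running streaming application. The application name, master URL, and TTL values are
+illustrative assumptions, not recommendations:
+
+{% highlight scala %}
+import org.apache.spark.SparkConf
+import org.apache.spark.streaming.{Seconds, StreamingContext}
+
+// Hypothetical values: forget metadata older than one hour overall,
+// but clear old non-broadcast blocks after 30 minutes.
+val conf = new SparkConf()
+  .setMaster("local[2]")
+  .setAppName("LongRunningStreamingApp")
+  .set("spark.cleaner.ttl", "3600")
+  .set("spark.cleaner.ttl.BLOCK_MANAGER", "1800")
+
+// A 24/7 streaming context is the typical case where periodic cleanup matters.
+val ssc = new StreamingContext(conf, Seconds(1))
+{% endhighlight %}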
 
 ## Viewing Spark Properties
 
 The application web UI at `http://<driver>:4040` lists Spark properties in the "Environment" tab.