From 8af169fdbb5b9b4c7de170068699a9209d48f416 Mon Sep 17 00:00:00 2001
From: Purav Aggarwal
Date: Thu, 6 Mar 2014 09:13:39 +0530
Subject: [PATCH 1/3] MetadataCleaner - fine control cleanup documentation

---
 docs/configuration.md | 99 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 82 insertions(+), 17 deletions(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index dc5553f3da770..5e06033606ea7 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -201,13 +201,6 @@ Apart from these, the following properties are also available, and may be useful
     multi-user services.
   </td>
 </tr>
-<tr>
-  <td>spark.scheduler.revive.interval</td>
-  <td>1000</td>
-  <td>
-    The interval length for the scheduler to revive the worker resource offers to run tasks. (in milliseconds)
-  </td>
-</tr>
 <tr>
   <td>spark.reducer.maxMbInFlight</td>
   <td>48</td>
@@ -353,16 +346,6 @@ Apart from these, the following properties are also available, and may be useful
     Port for the driver to listen on.
   </td>
 </tr>
-<tr>
-  <td>spark.cleaner.ttl</td>
-  <td>(infinite)</td>
-  <td>
-    Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks generated, etc.).
-    Periodic cleanups will ensure that metadata older than this duration will be forgotten. This is
-    useful for running Spark for many hours / days (for example, running 24/7 in case of Spark Streaming
-    applications). Note that any RDD that persists in memory for more than this duration will be cleared as well.
-  </td>
-</tr>
 <tr>
   <td>spark.streaming.blockInterval</td>
   <td>200</td>
@@ -487,6 +470,88 @@ Apart from these, the following properties are also available, and may be useful
 </tr>
 </table>
 
+The following properties can be used to schedule metadata cleanup at different levels.
+These tuning parameters should be set with careful consideration and only where required:
+scheduling a metadata cleanup in the middle of a job can result in a lot of unnecessary
+re-computation.
+
+<table class="table">
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+<tr>
+  <td>spark.cleaner.ttl</td>
+  <td>(infinite)</td>
+  <td>
+    Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks generated, etc.).
+    Periodic cleanups will ensure that metadata older than this duration will be forgotten. This is
+    useful for running Spark for many hours / days (for example, running 24/7 in case of Spark Streaming
+    applications). Note that any RDD that persists in memory for more than this duration will be cleared as well.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.MAP_OUTPUT_TRACKER</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Cleans up the map that stores, for each shuffle Id, the corresponding mapper information
+    (the input block manager Id and the output result size).
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.SHUFFLE_MAP_TASK</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Clears the cache used for shuffle map tasks (tasks in the earlier stages of a job) - a map
+    from stageId to the serialised byte array of the shuffle map task.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.RESULT_TASK</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Clears the cache used to store result tasks (tasks in the last stage of a job) - a map
+    from stageId to the serialised byte array of the result task.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.SPARK_CONTEXT</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Cleans up all the old persistent (cached) RDDs.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.HTTP_BROADCAST</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Cleans up all broadcast files whose timestamps are older than the assigned cleanup value.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.DAG_SCHEDULER</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Clears all the maps kept inside the DAG scheduler (such as stageIdToStage, pendingTasks and
+    stageIdToJobIds) whose entries are timestamped older than the assigned cleanup value.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.BLOCK_MANAGER</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Clears the old non-broadcast blocks from memory.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.BROADCAST_VARS</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Clears the old broadcast blocks from memory.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.SHUFFLE_BLOCK_MANAGER</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Deletes the old physical files stored on disk that were created by shuffle
+    transformations/actions, such as a reduce job.
+  </td>
+</tr>
+</table>
+
 ## Viewing Spark Properties
 
 The application web UI at `http://<driver>:4040` lists Spark properties in the "Environment" tab.

From adc1d84e46aad6960e7d85246e86f39eb905ee12 Mon Sep 17 00:00:00 2001
From: Purav Aggarwal
Date: Thu, 6 Mar 2014 09:20:05 +0530
Subject: [PATCH 2/3] MetadataCleaner - fine control cleanup documentation,
 putting back the unintended removal of a property.

---
 docs/configuration.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/configuration.md b/docs/configuration.md
index 5e06033606ea7..a0ee66e3f9db0 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -201,6 +201,13 @@ Apart from these, the following properties are also available, and may be useful
     multi-user services.
   </td>
 </tr>
+<tr>
+<td>spark.scheduler.revive.interval</td>
+<td>1000</td>
+<td>
+The interval length for the scheduler to revive the worker resource offers to run tasks. (in milliseconds)
+</td>
+</tr>
 <tr>
   <td>spark.reducer.maxMbInFlight</td>
   <td>48</td>

From 58eab7dc4d48ce37227775318f0835c60c1567a0 Mon Sep 17 00:00:00 2001
From: Purav Aggarwal
Date: Thu, 6 Mar 2014 09:23:05 +0530
Subject: [PATCH 3/3] MetadataCleaner - fine control cleanup documentation,
 rectifying the indentation

---
 docs/configuration.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index a0ee66e3f9db0..1f4f89b14361d 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -202,11 +202,11 @@ Apart from these, the following properties are also available, and may be useful
   </td>
 </tr>
 <tr>
-<td>spark.scheduler.revive.interval</td>
-<td>1000</td>
-<td>
-The interval length for the scheduler to revive the worker resource offers to run tasks. (in milliseconds)
-</td>
+  <td>spark.scheduler.revive.interval</td>
+  <td>1000</td>
+  <td>
+    The interval length for the scheduler to revive the worker resource offers to run tasks. (in milliseconds)
+  </td>
 </tr>
 <tr>
   <td>spark.reducer.maxMbInFlight</td>
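---

Usage note (not part of the patches themselves): the sketch below shows how the global TTL and one of the fine-grained cleaners documented above might be set from application code. It assumes the SparkConf-based configuration API (Spark 0.9+); the application name, master URL, and TTL values are illustrative placeholders, not recommendations from this patch series.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch, assuming the SparkConf API. The property names come from
// the table added by PATCH 1/3; the app name, master URL, and TTL values are
// hypothetical.
object CleanerTtlExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cleaner-ttl-example") // hypothetical app name
      .setMaster("local[2]")             // hypothetical master URL
      // Global default: forget metadata older than one hour (in seconds).
      .set("spark.cleaner.ttl", "3600")
      // Fine-grained override: drop old broadcast blocks after 10 minutes.
      // (Per the docs above, values are subject to a 10-second minimum.)
      .set("spark.cleaner.ttl.BROADCAST_VARS", "600")

    val sc = new SparkContext(conf)
    // ... long-running or streaming workload ...
    sc.stop()
  }
}
```

The point of the per-cleaner keys is that a long-running application can reclaim one kind of metadata aggressively (here, broadcast blocks) without shortening the retention of everything else governed by the global spark.cleaner.ttl.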