From 8af169fdbb5b9b4c7de170068699a9209d48f416 Mon Sep 17 00:00:00 2001
From: Purav Aggarwal
Date: Thu, 6 Mar 2014 09:13:39 +0530
Subject: [PATCH 1/3] MetadataCleaner - fine control cleanup documentation

---
 docs/configuration.md | 99 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 82 insertions(+), 17 deletions(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index dc5553f3da770..5e06033606ea7 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -201,13 +201,6 @@ Apart from these, the following properties are also available, and may be useful
     multi-user services.
   </td>
 </tr>
-<tr>
-  <td>spark.scheduler.revive.interval</td>
-  <td>1000</td>
-  <td>
-    The interval length for the scheduler to revive the worker resource offers to run tasks. (in milliseconds)
-  </td>
-</tr>
 <tr>
   <td>spark.reducer.maxMbInFlight</td>
   <td>48</td>
@@ -353,16 +346,6 @@ Apart from these, the following properties are also available, and may be useful
     Port for the driver to listen on.
   </td>
 </tr>
-<tr>
-  <td>spark.cleaner.ttl</td>
-  <td>(infinite)</td>
-  <td>
-    Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks generated, etc.).
-    Periodic cleanups will ensure that metadata older than this duration will be forgotten. This is
-    useful for running Spark for many hours / days (for example, running 24/7 in case of Spark Streaming
-    applications). Note that any RDD that persists in memory for more than this duration will be cleared as well.
-  </td>
-</tr>
 <tr>
   <td>spark.streaming.blockInterval</td>
   <td>200</td>
@@ -487,6 +470,88 @@ Apart from these, the following properties are also available, and may be useful
 </tr>
 </table>
 
+The following properties can be used to schedule metadata cleanup at different levels.
+These tuning parameters should be set with careful consideration and only where required:
+scheduling a metadata cleanup in the middle of a job can result in a lot of unnecessary
+re-computation.
+
+<table class="table">
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+<tr>
+  <td>spark.cleaner.ttl</td>
+  <td>(infinite)</td>
+  <td>
+    Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks generated, etc.).
+    Periodic cleanups will ensure that metadata older than this duration will be forgotten. This is
+    useful for running Spark for many hours / days (for example, running 24/7 in case of Spark Streaming
+    applications). Note that any RDD that persists in memory for more than this duration will be cleared as well.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.MAP_OUTPUT_TRACKER</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Cleans up the map that stores, for each shuffle Id, the corresponding mapper information
+    (the input block manager Id and the output result size).
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.SHUFFLE_MAP_TASK</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Clears the cache used for shuffle map tasks (tasks in the earlier stages of a job) - a map
+    from stageId to the serialised byte array of the shuffle map task.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.RESULT_TASK</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Clears the cache used to store result tasks (tasks in the last stage of a job) - a map
+    from stageId to the serialised byte array of the result task.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.SPARK_CONTEXT</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Cleans up all the old persistent (cached) RDDs.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.HTTP_BROADCAST</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Cleans up all broadcast files whose timestamps are older than the assigned cleanup value.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.DAG_SCHEDULER</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Clears all the maps kept inside the DAG scheduler (such as stageIdToStage, pendingTasks and
+    stageIdToJobIds) whose entries are timestamped older than the assigned cleanup value.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.BLOCK_MANAGER</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Clears the old non-broadcast blocks from memory.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.BROADCAST_VARS</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Clears the old broadcast blocks from memory.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.SHUFFLE_BLOCK_MANAGER</td>
+  <td>spark.cleaner.ttl, with a minimum value of 10 seconds</td>
+  <td>
+    Deletes the old physical files stored on disk that were created by shuffle
+    transformations/actions, such as a reduce job.
+  </td>
+</tr>
+</table>
+
 ## Viewing Spark Properties
 
 The application web UI at `http://<driver>:4040` lists Spark properties in the "Environment" tab.

From adc1d84e46aad6960e7d85246e86f39eb905ee12 Mon Sep 17 00:00:00 2001
From: Purav Aggarwal
Date: Thu, 6 Mar 2014 09:20:05 +0530
Subject: [PATCH 2/3] MetadataCleaner - fine control cleanup documentation,
 putting back the unintended removal of a property.

---
 docs/configuration.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/configuration.md b/docs/configuration.md
index 5e06033606ea7..a0ee66e3f9db0 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -201,6 +201,13 @@ Apart from these, the following properties are also available, and may be useful
     multi-user services.
   </td>
 </tr>
+<tr>
+<td>spark.scheduler.revive.interval</td>
+<td>1000</td>
+<td>
+The interval length for the scheduler to revive the worker resource offers to run tasks. (in milliseconds)
+</td>
+</tr>
 <tr>
   <td>spark.reducer.maxMbInFlight</td>
   <td>48</td>

From 58eab7dc4d48ce37227775318f0835c60c1567a0 Mon Sep 17 00:00:00 2001
From: Purav Aggarwal
Date: Thu, 6 Mar 2014 09:23:05 +0530
Subject: [PATCH 3/3] MetadataCleaner - fine control cleanup documentation,
 rectifying the indentation

---
 docs/configuration.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index a0ee66e3f9db0..1f4f89b14361d 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -202,11 +202,11 @@ Apart from these, the following properties are also available, and may be useful
   </td>
 </tr>
 <tr>
-<td>spark.scheduler.revive.interval</td>
-<td>1000</td>
-<td>
-The interval length for the scheduler to revive the worker resource offers to run tasks. (in milliseconds)
-</td>
+  <td>spark.scheduler.revive.interval</td>
+  <td>1000</td>
+  <td>
+    The interval length for the scheduler to revive the worker resource offers to run tasks. (in milliseconds)
+  </td>
 </tr>
 <tr>
   <td>spark.reducer.maxMbInFlight</td>
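---

Usage note (not part of the patches themselves): the sketch below shows how the global TTL and one of the fine-grained cleaners documented above might be set from application code. It assumes the SparkConf-based configuration API (Spark 0.9+); the application name, master URL, and TTL values are illustrative placeholders, not recommendations from this patch series.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch, assuming the SparkConf API. The property names come from
// the table added by PATCH 1/3; the app name, master URL, and TTL values are
// hypothetical.
object CleanerTtlExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cleaner-ttl-example") // hypothetical app name
      .setMaster("local[2]")             // hypothetical master URL
      // Global default: forget metadata older than one hour (in seconds).
      .set("spark.cleaner.ttl", "3600")
      // Fine-grained override: drop old broadcast blocks after 10 minutes.
      // (Per the docs above, values are subject to a 10-second minimum.)
      .set("spark.cleaner.ttl.BROADCAST_VARS", "600")

    val sc = new SparkContext(conf)
    // ... long-running or streaming workload ...
    sc.stop()
  }
}
```

The point of the per-cleaner keys is that a long-running application can reclaim one kind of metadata aggressively (here, broadcast blocks) without shortening the retention of everything else governed by the global spark.cleaner.ttl.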