
Commit 51e2775

esjewett authored and pwendell committed
Proposal: clarify Scala programming guide on caching ...
... with regards to saved map output. Wording taken partially from Matei Zaharia's email to the Spark user list:
http://apache-spark-user-list.1001560.n3.nabble.com/performance-improvement-on-second-operation-without-caching-td5227.html

Author: Ethan Jewett <[email protected]>

Closes #668 from esjewett/Doc-update and squashes the following commits:

11793ce [Ethan Jewett] Update based on feedback
171e670 [Ethan Jewett] Clarify Scala programming guide on caching ...

(cherry picked from commit 48ba3b8)
Signed-off-by: Patrick Wendell <[email protected]>
1 parent 514ee93 commit 51e2775

File tree

1 file changed: +5, -3 lines


docs/scala-programming-guide.md

Lines changed: 5 additions & 3 deletions
@@ -145,7 +145,7 @@ RDDs support two types of operations: *transformations*, which create a new data
 
 All transformations in Spark are <i>lazy</i>, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently -- for example, we can realize that a dataset created through `map` will be used in a `reduce` and return only the result of the `reduce` to the driver, rather than the larger mapped dataset.
 
-By default, each transformed RDD is recomputed each time you run an action on it. However, you may also *persist* an RDD in memory using the `persist` (or `cache`) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting datasets on disk, or replicated across the cluster. The next section in this document describes these options.
+By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also *persist* an RDD in memory using the `persist` (or `cache`) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting datasets on disk, or replicated across the cluster. The next section in this document describes these options.
 
 The following tables list the transformations and actions currently supported (see also the [RDD API doc](api/scala/index.html#org.apache.spark.rdd.RDD) for details):
 
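To make the lazy-evaluation and caching behavior in this hunk concrete, here is a minimal Scala sketch (not part of the commit); it assumes a `SparkContext` named `sc`, as in the Spark shell, and a local file `data.txt`:

```scala
// Transformations build up a lineage lazily; nothing runs until an action.
val lines = sc.textFile("data.txt")   // transformation: the file is not read yet
val lengths = lines.map(_.length)     // transformation: still lazy

lengths.persist()                     // mark the RDD for in-memory caching (same as cache())

val total = lengths.reduce(_ + _)                       // action: computes the lineage, caches `lengths`
val longest = lengths.reduce((a, b) => math.max(a, b))  // served from the cache; the file is not re-read
```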
@@ -279,8 +279,8 @@ it is computed in an action, it will be kept in memory on the nodes. The cache i
 if any partition of an RDD is lost, it will automatically be recomputed using the transformations
 that originally created it.
 
-In addition, each RDD can be stored using a different *storage level*, allowing you, for example, to
-persist the dataset on disk, or persist it in memory but as serialized Java objects (to save space),
+In addition, each persisted RDD can be stored using a different *storage level*, allowing you, for example,
+to persist the dataset on disk, or persist it in memory but as serialized Java objects (to save space),
 or replicate it across nodes, or store the data in off-heap memory in [Tachyon](http://tachyon-project.org/).
 These levels are chosen by passing a
 [`org.apache.spark.storage.StorageLevel`](api/scala/index.html#org.apache.spark.storage.StorageLevel)
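As an illustrative sketch of the paragraph changed above (not part of the commit), a storage level is passed directly to `persist`; the RDD here is hypothetical, and the constants named are assumed to match this era of Spark:

```scala
import org.apache.spark.storage.StorageLevel

val words = sc.textFile("data.txt").flatMap(_.split(" "))  // hypothetical RDD

words.persist(StorageLevel.MEMORY_ONLY_SER)  // in memory, as serialized Java objects (saves space)
// Other levels the paragraph mentions:
//   StorageLevel.DISK_ONLY     -- persist the dataset on disk
//   StorageLevel.MEMORY_ONLY_2 -- replicate each partition on two cluster nodes
//   StorageLevel.OFF_HEAP      -- off-heap memory (backed by Tachyon in this release)
```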
@@ -330,6 +330,8 @@ available storage levels is:
 </tr>
 </table>
 
+Spark sometimes automatically persists intermediate state from RDD operations, even without users calling persist() or cache(). In particular, if a shuffle happens when computing an RDD, Spark will keep the outputs from the map side of the shuffle on disk to avoid re-computing the entire dependency graph if an RDD is re-used. We still recommend users call persist() if they plan to re-use an RDD iteratively.
+
 ### Which Storage Level to Choose?
 
 Spark's storage levels are meant to provide different trade-offs between memory usage and CPU
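The paragraph added in this hunk is easiest to see with a shuffle operation such as `reduceByKey`; the sketch below (not part of the commit) reuses the assumed `sc` and `data.txt`:

```scala
val counts = sc.textFile("data.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // introduces a shuffle; map-side outputs are written to disk

counts.count()    // first action runs the full lineage and leaves the map outputs on disk
counts.collect()  // second action skips the map stage; only the post-shuffle work re-runs

// For iterative re-use, the docs still recommend an explicit persist():
counts.persist()
```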
