
Conversation

@esjewett
Member

@esjewett esjewett commented May 6, 2014

... with regards to saved map output. Wording taken partially from Matei Zaharia's email to the Spark user list. http://apache-spark-user-list.1001560.n3.nabble.com/performance-improvement-on-second-operation-without-caching-td5227.html

@AmplabJenkins

Can one of the admins verify this patch?

Contributor


Not your change, but I think this should say "will be persisted in memory or on disk on the nodes"

Contributor


Oh sorry - nevermind, this is explained below and this case only refers to calling persist() without arguments.

@esjewett
Copy link
Member Author

esjewett commented May 6, 2014

Just putting it out there: I'm not attached to any of this wording, so change away, or don't accept it. No problem either way. I just thought my question on the user list as to whether the programming guide could be updated was better stated as a pull request ;-)

Contributor


It's a great idea to have this here. This is a totally non-obvious fact and I think many users would like to know this.

My only thought is: would you mind moving this to the end of the "RDD Persistence" section? Also, at this point in the guide I don't think the concept of stages or jobs has been introduced. So it might be good to have something like:

Spark sometimes automatically persists intermediate state from RDD operations, even without users calling
persist() or cache(). In particular, if a shuffle happens when computing an RDD, Spark will keep the outputs
from the map side of the shuffle on disk to avoid re-computing the entire dependency graph if an RDD
is re-used. We still recommend users call persist() if they plan to re-use an RDD iteratively.
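To make the recommendation concrete, here is a minimal sketch of explicitly persisting a shuffled RDD that will be re-used. It assumes a live `SparkContext` named `sc` (e.g. from `spark-shell`); the input path is a placeholder:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical input path; assumes `sc` is an existing SparkContext.
val pairs = sc.textFile("hdfs://path/to/input")
              .map(line => (line.split("\t")(0), 1))

// reduceByKey triggers a shuffle. Spark keeps the map-side shuffle output
// on disk automatically, but persist() is still recommended when the
// result will be re-used, so the whole RDD is served from the cache.
val counts = pairs.reduceByKey(_ + _).persist(StorageLevel.MEMORY_ONLY)

counts.count()   // first action: computes the RDD and materializes the cache
counts.collect() // re-use: reads the cached partitions instead of recomputing
```

Without the `persist()` call, only the map-side shuffle files would be retained; the post-shuffle reduction would still be re-run on each action.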

Text taken primarily from Patrick Wendell's comment on the pull request. Also changed wording in "RDD Operations" section so as not to imply a guarantee that RDDs are reprocessed if persist() is not run.
@esjewett
Member Author

esjewett commented May 6, 2014

@pwendell I like your wording. Switched to use it, and moved it to the end of the "RDD Persistence" section as requested. I also updated the "RDD Operations" section with a small change so as not to imply that RDDs that aren't persist()ed will always be reprocessed.

@pwendell
Contributor

pwendell commented May 7, 2014

Okay I can merge this, thanks!

asfgit pushed a commit that referenced this pull request May 7, 2014
... with regards to saved map output. Wording taken partially from Matei Zaharia's email to the Spark user list. http://apache-spark-user-list.1001560.n3.nabble.com/performance-improvement-on-second-operation-without-caching-td5227.html

Author: Ethan Jewett <[email protected]>

Closes #668 from esjewett/Doc-update and squashes the following commits:

11793ce [Ethan Jewett] Update based on feedback
171e670 [Ethan Jewett] Clarify Scala programming guide on caching ...
(cherry picked from commit 48ba3b8)

Signed-off-by: Patrick Wendell <[email protected]>
@asfgit asfgit closed this in 48ba3b8 May 7, 2014
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
agirish pushed a commit to HPEEzmeral/apache-spark that referenced this pull request May 5, 2022
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024
mapr-devops pushed a commit to mapr/spark that referenced this pull request May 8, 2025
3 participants