Proposal: clarify Scala programming guide on caching ... #668
... with regards to saved map output. Wording taken partially from Matei Zaharia's email to the Spark user list. http://apache-spark-user-list.1001560.n3.nabble.com/performance-improvement-on-second-operation-without-caching-td5227.html
Can one of the admins verify this patch?
docs/scala-programming-guide.md
Outdated
Not your change, but I think this should say "will be persisted in memory or on disk on the nodes"
Oh sorry, never mind. This is explained below, and this case only refers to calling persist() without arguments.
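For readers following the thread, here is a minimal sketch of the distinction being discussed: `persist()` with no arguments uses the default in-memory storage level, while a level that includes disk has to be requested explicitly. The object name `PersistLevelSketch`, the local master, and the sample data are assumptions made for illustration and are not part of the patch.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object, named for this sketch only.
object PersistLevelSketch {
  // Returns the storage levels assigned by persist() and by an explicit level.
  def levels(): (StorageLevel, StorageLevel) = {
    // Local master and app name are placeholders for illustration.
    val conf = new SparkConf().setMaster("local[1]").setAppName("persist-level-sketch")
    val sc = new SparkContext(conf)
    try {
      // No arguments: the default level, which keeps partitions in memory only.
      val inMemory = sc.parallelize(1 to 10).persist()
      // Explicit level: partitions that do not fit in memory spill to disk.
      val memoryAndDisk = sc.parallelize(1 to 10).persist(StorageLevel.MEMORY_AND_DISK)
      (inMemory.getStorageLevel, memoryAndDisk.getStorageLevel)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (default, explicit) = levels()
    println(s"persist() level: $default")
    println(s"persist(MEMORY_AND_DISK) level: $explicit")
  }
}
```

Calling `cache()` is equivalent to the no-argument `persist()` shown here.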
Just putting it out there: I'm not attached to any of this wording, so change away, or don't accept it. No problem either way. I just thought my question on the user list as to whether the programming guide could be updated was better stated as a pull request ;-)
docs/scala-programming-guide.md
Outdated
It's a great idea to have this here. This is a totally non-obvious fact and I think many users would like to know this.
My only thought is, would you mind moving this to the end of the "RDD Persistence" section? Also, at this point in the guide I don't think the concept of stages or jobs has been introduced, so it might be good to have something like:
Spark sometimes automatically persists intermediate state from RDD operations, even without users calling
persist() or cache(). In particular, if a shuffle happens when computing an RDD, Spark will keep the outputs
from the map side of the shuffle on disk to avoid re-computing the entire dependency graph if an RDD
is re-used. We still recommend users call persist() if they plan to re-use an RDD iteratively.
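A minimal sketch of the situation the suggested wording describes, in case it helps reviewers picture it. The object name `ShuffleReuseSketch`, the local master, and the sample words are assumptions for illustration only. The code cannot prove that the saved map output is reused, since that happens inside the scheduler; it only shows the usage pattern where the reuse can occur and where an explicit `persist()` is still the recommended choice.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
// Note: Spark releases before 1.3 also need `import org.apache.spark.SparkContext._`
// to bring reduceByKey into scope for pair RDDs.

// Hypothetical helper object, named for this sketch only.
object ShuffleReuseSketch {
  // Runs two actions on a shuffled RDD and returns their results.
  def run(): (Long, Map[String, Int]) = {
    // Local master and app name are placeholders for illustration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("shuffle-reuse-sketch")
    val sc = new SparkContext(conf)
    try {
      val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

      // reduceByKey introduces a shuffle; Spark keeps the map-side output on disk.
      val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

      // Recommended when an RDD is reused: ask for persistence explicitly
      // instead of relying on the saved shuffle output.
      counts.persist(StorageLevel.MEMORY_ONLY)

      // First action: runs the map side, the shuffle, and the reduce side.
      val distinctWords = counts.count()

      // Second action on the same RDD: served from the persisted partitions,
      // or, without persist(), possibly from the saved map output.
      val asMap = counts.collect().toMap

      (distinctWords, asMap)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (distinctWords, asMap) = run()
    println(s"distinct words: $distinctWords, counts: $asMap")
  }
}
```

Removing the `persist` call leaves the results unchanged; only the amount of recomputation on the second action may differ.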
Text taken primarily from Patrick Wendell's comment on the pull request. Also changed wording in "RDD Operations" section so as not to imply a guarantee that RDDs are reprocessed if persist() is not run.
@pwendell I like your wording. Switched to use it, and moved it to the end of the "RDD Persistence" section as requested. I also updated the "RDD Operations" section with a small change so as not to imply that RDDs that aren't persist()ed will always be reprocessed.
Okay, I can merge this. Thanks!
... with regards to saved map output. Wording taken partially from Matei Zaharia's email to the Spark user list. http://apache-spark-user-list.1001560.n3.nabble.com/performance-improvement-on-second-operation-without-caching-td5227.html

Author: Ethan Jewett <[email protected]>

Closes #668 from esjewett/Doc-update and squashes the following commits:

11793ce [Ethan Jewett] Update based on feedback
171e670 [Ethan Jewett] Clarify Scala programming guide on caching ...

(cherry picked from commit 48ba3b8)
Signed-off-by: Patrick Wendell <[email protected]>
…r log4j-1.2.17.jar (apache#664)" (apache#668) This reverts commit 8948477.