-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-10008] Ensure shuffle locality doesn't take precedence over narrow deps #8220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The shuffle locality patch made the DAGScheduler aware of shuffle data, but for RDDs that have both narrow and shuffle dependencies, it can cause them to place tasks based on the shuffle dependency instead of the narrow one. This case is common in iterative join-based algorithms like PageRank and ALS, where one RDD is hash-partitioned and one isn't.
|
@shivaram here it is.. we should merge this into branch-1.5 too if it's good. |
|
Test build #40941 has finished for PR 8220 at commit
|
|
Jenkins, retest this please |
|
Test build #40946 has finished for PR 8220 at commit
|
|
LGTM. I think we should merge this into |
|
Jenkins, retest this please |
|
Sounds good.. I'll merge it once tests pass. |
|
Test build #1626 has finished for PR 8220 at commit
|
…rrow deps The shuffle locality patch made the DAGScheduler aware of shuffle data, but for RDDs that have both narrow and shuffle dependencies, it can cause them to place tasks based on the shuffle dependency instead of the narrow one. This case is common in iterative join-based algorithms like PageRank and ALS, where one RDD is hash-partitioned and one isn't. Author: Matei Zaharia <[email protected]> Closes #8220 from mateiz/shuffle-loc-fix. (cherry picked from commit cf01607) Signed-off-by: Matei Zaharia <[email protected]>
…rrow deps The shuffle locality patch made the DAGScheduler aware of shuffle data, but for RDDs that have both narrow and shuffle dependencies, it can cause them to place tasks based on the shuffle dependency instead of the narrow one. This case is common in iterative join-based algorithms like PageRank and ALS, where one RDD is hash-partitioned and one isn't. Author: Matei Zaharia <[email protected]> Closes apache#8220 from mateiz/shuffle-loc-fix.
The shuffle locality patch made the DAGScheduler aware of shuffle data,
but for RDDs that have both narrow and shuffle dependencies, it can
cause them to place tasks based on the shuffle dependency instead of the
narrow one. This case is common in iterative join-based algorithms like
PageRank and ALS, where one RDD is hash-partitioned and one isn't.