@mateiz
Contributor

@mateiz mateiz commented Aug 15, 2015

The shuffle locality patch made the DAGScheduler aware of shuffle data,
but for RDDs that have both narrow and shuffle dependencies, it can
cause the scheduler to place their tasks based on the shuffle dependency
instead of the narrow one. This case is common in iterative join-based
algorithms like PageRank and ALS, where one RDD is hash-partitioned and
the other isn't.
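
To make the ordering concrete, here is a minimal Python sketch of "narrow dependencies win over shuffle dependencies" when choosing where to run a task. It is a toy model, not the DAGScheduler code: `Dep`, `preferred_locs`, and the host names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Dep:
    """Hypothetical stand-in for one dependency of an RDD."""
    kind: str                   # "narrow" or "shuffle"
    locs: Dict[int, List[str]]  # partition index -> hosts holding its input


def preferred_locs(deps: List[Dep], partition: int) -> List[str]:
    """Pick hosts for a task, consulting narrow dependencies first."""
    # A narrow dependency means the parent partition already lives on
    # these hosts (for example the cached, hash-partitioned side of a join).
    for dep in deps:
        if dep.kind == "narrow":
            locs = dep.locs.get(partition, [])
            if locs:
                return locs
    # Only fall back to shuffle-output locations when no narrow
    # dependency offers a preference.
    for dep in deps:
        if dep.kind == "shuffle":
            locs = dep.locs.get(partition, [])
            if locs:
                return locs
    return []


# A join where one side is hash-partitioned (narrow) and the other
# has to be shuffled. The shuffle dependency is listed first on purpose.
join_deps = [
    Dep("shuffle", {0: ["host-b"]}),
    Dep("narrow", {0: ["host-a"]}),
]
print(preferred_locs(join_deps, 0))  # ['host-a']
```

In this model the task for partition 0 goes to `host-a`, where the hash-partitioned side already sits, regardless of the order in which the dependencies are listed.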

@mateiz
Contributor Author

mateiz commented Aug 15, 2015

@shivaram here it is.. we should merge this into branch-1.5 too if it's good.

@SparkQA

SparkQA commented Aug 15, 2015

Test build #40941 has finished for PR 8220 at commit 99dd008.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

Jenkins, retest this please

@SparkQA

SparkQA commented Aug 15, 2015

Test build #40946 has finished for PR 8220 at commit 99dd008.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StringIndexerModel (

@shivaram
Contributor

LGTM. I think we should merge this into branch-1.5 as well.

@shivaram
Contributor

Jenkins, retest this please

@mateiz
Contributor Author

mateiz commented Aug 15, 2015

Sounds good.. I'll merge it once tests pass.

@SparkQA

SparkQA commented Aug 15, 2015

Test build #1626 has finished for PR 8220 at commit a7c02dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StringIndexerModel (

asfgit pushed a commit that referenced this pull request Aug 16, 2015
…rrow deps

Author: Matei Zaharia <[email protected]>

Closes #8220 from mateiz/shuffle-loc-fix.

(cherry picked from commit cf01607)
Signed-off-by: Matei Zaharia <[email protected]>
@asfgit asfgit closed this in cf01607 Aug 16, 2015
CodingCat pushed a commit to CodingCat/spark that referenced this pull request Aug 17, 2015
…rrow deps

Author: Matei Zaharia <[email protected]>

Closes apache#8220 from mateiz/shuffle-loc-fix.