[SPARK-12662][SQL] Fix DataFrame.randomSplit to avoid creating overlapping splits #10626
Conversation
Do we need to start a whole new process to test this? I think we can just run randomSplit in the normal DataFrameSuite?
We have not figured out a case that can trigger the problem in local mode.
sc.parallelize(1 to 10).mapPartitions(scala.util.Random.shuffle(_)).collect()
oh, right. We missed it. It is a good one.
That's neat! Converted it into a unit test in DataFrameStatSuite.
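A minimal sketch of the test idea (the sizes, column name, and assertions here are illustrative assumptions, not the exact test that landed in DataFrameStatSuite):

import sqlContext.implicits._ // assumes a SQLContext named sqlContext, as in the 1.6-era suites
// Shuffle each partition so the row order differs between evaluations,
// then check that randomSplit still partitions the data: the splits must
// be pairwise disjoint and together cover every input row.
val n = 600
val data = sqlContext.sparkContext.parallelize(1 to n, 2)
  .mapPartitions(scala.util.Random.shuffle(_))
  .toDF("id")
val splits = data.randomSplit(Array[Double](1, 2, 3), seed = 1)
val ids = splits.map(_.collect().map(_.getInt(0)).toSet)
assert(ids(0).intersect(ids(1)).isEmpty)
assert(ids(0).intersect(ids(2)).isEmpty)
assert(ids(1).intersect(ids(2)).isEmpty)
assert(ids.reduce(_ union _) == (1 to n).toSet)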
Test build #48893 has finished for PR 10626 at commit
since the tests are run so frequently, I don't think you need to try this many times ... doing it once should be enough.
This is just the size of the dataset. We do, however, test for 5 different seeds. Should I just test for 1?
yea 1 is fine.
also you can just run it twice and make sure the result is deterministic, i.e.
val a = df.randomSplit(...).toSeq.map(_.collect())
val b = df.randomSplit(...).toSeq.map(_.collect())
assert(a == b)
as long as these are scala collections, I think they will work.
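One caveat worth flagging on the snippet above: collect() returns Arrays, and Scala Arrays compare by reference, so the comparison only works after converting to Seqs (which is what the final version of the test does with .collect().toSeq). A quick illustration of that Scala behavior:

val x = Array(1, 2, 3)
val y = Array(1, 2, 3)
assert(x != y) // Arrays use reference equality, so equal contents still compare unequal
assert(x.toSeq == y.toSeq) // Seqs compare element-wise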
done
sure, but to be fair, this new test does test a new codepath (that of inserting a sampling operator after a shuffle)
isn't that the same code path?
once we implement sample pushdown in catalyst, it shouldn't be: http://research.microsoft.com/pubs/76565/sig99sam.pdf :)
what do you mean? the shuffle happens outside of catalyst, so the optimizer can't push it beneath it.
to be clear, i'm suggesting removing everything the previous test case already tests, and only keeping:
// Verify that the results are deterministic across multiple runs
val data = sparkContext.parallelize(1 to n, 2).mapPartitions(scala.util.Random.shuffle(_)).toDF("id")
val splits = data.randomSplit(Array[Double](1, 2, 3), seed = 1)
val firstRun = splits.toSeq.map(_.collect().toSeq)
val secondRun = data.randomSplit(Array[Double](1, 2, 3), seed = 1).toSeq.map(_.collect().toSeq)
assert(firstRun == secondRun)
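Note that determinism is only asserted for a fixed seed (seed = 1): the same seed and weights must yield identical splits across runs, which is the invariant the fix establishes.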
Force-pushed from d866339 to 252dbc3
Test build #48903 has finished for PR 10626 at commit
Force-pushed from 8e28f15 to ded1bfa
Force-pushed from ded1bfa to 1b30119
Test build #48908 has finished for PR 10626 at commit
Test build #48924 has finished for PR 10626 at commit
Test build #48901 has finished for PR 10626 at commit
Test build #48919 has finished for PR 10626 at commit
All comments addressed!
Thanks - I'm going to merge this.
[SPARK-12662][SQL] Fix DataFrame.randomSplit to avoid creating overlapping splits

https://issues.apache.org/jira/browse/SPARK-12662

cc yhuai

Author: Sameer Agarwal <[email protected]>

Closes #10626 from sameeragarwal/randomsplit.

(cherry picked from commit f194d99)
Signed-off-by: Reynold Xin <[email protected]>
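For context, the gist of the merged change (a sketch reconstructed from the commit, using 1.6-era internals such as Sort, SortOrder, and Sample; the exact merged code may differ in details): impose a deterministic per-partition ordering before carving the unit interval into disjoint sampling ranges.

// Sort within each partition (global = false) on all output columns so that
// repeated evaluations of the plan see rows in the same order.
val sorted = Sort(logicalPlan.output.map(SortOrder(_, Ascending)), global = false, logicalPlan)
// Turn the weights into cumulative bounds over [0, 1), then build one
// Sample operator per adjacent pair of bounds; the ranges are disjoint,
// so the resulting splits cannot overlap.
val sum = weights.sum
val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
normalizedCumWeights.sliding(2).map { bounds =>
  new DataFrame(sqlContext, Sample(bounds(0), bounds(1), withReplacement = false, seed, sorted))
}.toArray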
@sameeragarwal Could you take a look at the following test failure? It sounds like this is caused by this fix. If you are busy, I can work on it. Thanks!
I might not have picked up your latest code changes. Let me merge the code. Thanks!
@gatorsmile it seems like your PR is changing the behavior of SQL intersect that this test relies on. I can take a closer look at the PR, but if you think this test is using intersect in a way that is not supported in Spark SQL, we can change DataFrameStatSuite:L76 to …
It is best for us to use the local collection's intersect rather than relying on the DataFrame's.
@gatorsmile I pulled your changes and verified that the new intersect implementation no longer works when there are deterministic sampling operators in the plan, e.g.:

val plan = sparkContext.parallelize(1 to 600, 1).toDF("id").logicalPlan
val sample1Plan = new DataFrame(sqlContext, Sample(0.0, 0.1, false, 1, plan))
val sample2Plan = new DataFrame(sqlContext, Sample(0.1, 1, false, 1, plan))
assert(sample1Plan.intersect(sample2Plan).collect().isEmpty) // FAILS
assert(sample1Plan.collect().intersect(sample2Plan.collect()).isEmpty) // SUCCEEDS

If that's intentional, please let me know if you'd like me to fix the test, or whether you'd rather fold the fix into your PR. Thanks! Edit: fixed example.
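Note that the two Sample operators above draw from disjoint fractions, [0.0, 0.1) and [0.1, 1.0), of the same seeded stream over the same child plan, so their outputs should never intersect; that is exactly the invariant the test (and this PR) relies on.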
Thank you @sameeragarwal for investigating this! Sorry to bring this to you at midnight. To help anyone understand the problem, let me post the logical plan when we do not collect the data to the local node. The plan does not look right: expression IDs should be different. Let me see how to fix this issue. Thanks!
The fix is done. Thank you! It is not related to your PR. Sorry :( After the fix, the plan looks as expected.
@gatorsmile where is the fix? |