
Conversation

@sameeragarwal
Member

Contributor

Do we need to start a whole new process to test this? I think we can just run randomSplit in the normal DataFrameSuite?

Contributor

We have not figured out a case that can trigger the problem in local mode.

Contributor

sc.parallelize(1 to 10).mapPartitions(scala.util.Random.shuffle(_)).collect()
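
For context, a rough sketch (mine, not from this thread) of why that one-liner exposes the bug: each evaluation of the RDD re-shuffles its partition contents, so rows come back in a different order every time, and as I understand it the sampler behind randomSplit draws its random numbers per row in iteration order rather than per row value, so a nondeterministic ordering is exactly what can put the same row into more than one split. Assuming a SparkContext named sc, as above:

// Sketch: two evaluations of the same RDD can return rows in different orders.
val rdd = sc.parallelize(1 to 10, 2)
  .mapPartitions(it => scala.util.Random.shuffle(it.toList).iterator)
println(rdd.collect().mkString(", ")) // one ordering
println(rdd.collect().mkString(", ")) // likely a different ordering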

Contributor

Oh, right. We missed it. That's a good one.

Member Author

That's neat! Converted it into a unit test in DataFrameStatSuite.

@SparkQA

SparkQA commented Jan 7, 2016

Test build #48893 has finished for PR 10626 at commit 8da496f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Since the tests are run so frequently, I don't think you need to try this many times ... doing it once should be enough.

Member Author

This is just the size of the dataset. We do, however, test with 5 different seeds. Should I just test with 1?

Contributor

Yeah, 1 is fine.

Contributor

Also, you can just run it twice and make sure the result is deterministic, i.e.

val a = df.randomSplit(...).toSeq.map(_.collect())
val b = df.randomSplit(...).toSeq.map(_.collect())
assert(a == b)

As long as these are Scala collections, I think they will work.
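
One caveat worth flagging (my understanding, not stated above): collect() returns an Array[Row], and Scala arrays compare by reference, so the splits need to be converted to Seqs before comparing with == — as the suggested test further down does with .collect().toSeq. A tiny illustration:

// Array equality is reference equality; Seq equality is element-wise.
val x = Array(1, 2, 3)
val y = Array(1, 2, 3)
assert(x != y)             // different array instances
assert(x.toSeq == y.toSeq) // equal as sequences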

Member Author

done

Member Author

Sure, but to be fair, this new test does exercise a new code path (that of inserting a sampling operator after a shuffle).

Contributor

Isn't that the same code path?

Member Author

Once we implement sample pushdown in Catalyst, it shouldn't be: http://research.microsoft.com/pubs/76565/sig99sam.pdf :)

Contributor

What do you mean? The shuffle happens outside of Catalyst, so the optimizer can't push the sample beneath it.

Contributor

To be clear, I'm suggesting removing everything the previous test case already covers, and only keeping:

// Verify that the results are deterministic across multiple runs
val data = sparkContext.parallelize(1 to n, 2).mapPartitions(scala.util.Random.shuffle(_)).toDF("id")
val splits = data.randomSplit(Array[Double](1, 2, 3), seed = 1)
val firstRun = splits.toSeq.map(_.collect().toSeq)
val secondRun = data.randomSplit(Array[Double](1, 2, 3), seed = 1).toSeq.map(_.collect().toSeq)
assert(firstRun == secondRun)

@SparkQA

SparkQA commented Jan 7, 2016

Test build #48903 has finished for PR 10626 at commit eba673d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 7, 2016

Test build #48908 has finished for PR 10626 at commit 9a77c40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 7, 2016

Test build #48924 has finished for PR 10626 at commit 1b30119.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 7, 2016

Test build #48901 has finished for PR 10626 at commit 6e211ff.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 7, 2016

Test build #48919 has finished for PR 10626 at commit 252dbc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal
Member Author

All comments addressed!

@rxin
Contributor

rxin commented Jan 7, 2016

Thanks - I'm going to merge this.

asfgit pushed a commit that referenced this pull request Jan 7, 2016
…pping splits

https://issues.apache.org/jira/browse/SPARK-12662

cc yhuai

Author: Sameer Agarwal <[email protected]>

Closes #10626 from sameeragarwal/randomsplit.

(cherry picked from commit f194d99)
Signed-off-by: Reynold Xin <[email protected]>
@asfgit asfgit closed this in f194d99 Jan 7, 2016
@gatorsmile
Member

@sameeragarwal Could you take a look at the following test failure?
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49007/consoleFull

It sounds like this is caused by this fix. If you are busy, I can work on it. Thanks!

@gatorsmile
Member

I might not have picked up your latest code changes. Let me merge the code. Thanks!

@sameeragarwal
Member Author

@gatorsmile It seems like your PR changes the behavior of SQL intersect, which this test relies on. I can take a closer look at the PR, but if you think this test is using intersect in a way that is not supported in Spark SQL, we can change DataFrameStatSuite:L76 to assert(splits(0).collect().intersect(splits(1).collect()).isEmpty) to make this work.
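
For concreteness, a minimal sketch of what that change would look like (assuming the data and splits setup from the test suggested earlier in this thread):

// Intersect the collected local arrays instead of the DataFrames, so the
// assertion no longer depends on the SQL intersect implementation.
val splits = data.randomSplit(Array[Double](1, 2, 3), seed = 1)
assert(splits(0).collect().intersect(splits(1).collect()).isEmpty)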

@rxin
Contributor

rxin commented Jan 8, 2016

It is best for us to use the local collection's intersect rather than relying on the DataFrame's.

@sameeragarwal
Member Author

@gatorsmile I pulled your changes and verified that the new intersect implementation no longer works when there are deterministic sampling operators in the plan, e.g.:

val plan = sparkContext.parallelize(1 to 600, 1).toDF("id").logicalPlan
val sample1Plan = new DataFrame(sqlContext, Sample(0.0, 0.1, false, 1, plan))
val sample2Plan = new DataFrame(sqlContext, Sample(0.1, 1, false, 1, plan))
assert(sample1Plan.intersect(sample2Plan).collect().isEmpty) // FAILS
assert(sample1Plan.collect().intersect(sample2Plan.collect()).isEmpty) // SUCCEEDS

If that's intentional, please let me know whether you'd like me to fix the test or you'd rather fold the fix into your PR. Thanks!

Edit: fixed example.

@gatorsmile
Member

Thank you @sameeragarwal for investigating this! Sorry to bring this to you at midnight.

To help anyone understand the problem, let me post the logical plan when we do not collect the data to the local node:

Aggregate [id#1], [id#1]
+- Join LeftSemi, None
   :- Filter (id#1 <=> id#1)
   :  +- Sample 0.0, 0.4, false, 1
   :     +- Sort [id#1 ASC], false
   :        +- Project [_1#0 AS id#1]
   :           +- LogicalRDD [_1#0], MapPartitionsRDD[2] at apply at Transformer.scala:22
   +- Sample 0.4, 1.0, false, 1
      +- Project
         +- Sort [id#1 ASC], false
            +- Project [_1#0 AS id#1]
               +- LogicalRDD [_1#0], MapPartitionsRDD[2] at apply at Transformer.scala:22

This does not look right. The expression IDs should be different: because both sides of the join resolve to the same attribute id#1, the intended join condition becomes (id#1 <=> id#1), which is pushed down as a trivially true filter on the left side and leaves the LeftSemi join with no condition at all. Let me see how to fix this issue. Thanks!

@gatorsmile
Member

The fix is done. Thank you! It is not related to your PR. Sorry : (

After the fix, the plan should look like:

Aggregate [id#1], [id#1]
+- Join LeftSemi, Some((id#1 <=> id#5))
   :- Sample 0.0, 0.4, false, 1
   :  +- Sort [id#1 ASC], false
   :     +- Project [_1#0 AS id#1]
   :        +- LogicalRDD [_1#0], MapPartitionsRDD[2] at apply at Transformer.scala:22
   +- Sample 0.4, 1.0, false, 1
      +- Sort [id#5 ASC], false
         +- Project [_1#0 AS id#5]
            +- LogicalRDD [_1#0], MapPartitionsRDD[2] at apply at Transformer.scala:22

@yhuai
Contributor

yhuai commented Jan 8, 2016

@gatorsmile where is the fix?

@gatorsmile
Member

@yhuai I just submitted it to my original PR: #10630
