Skip to content

Conversation

@maropu
Copy link
Member

@maropu maropu commented Aug 25, 2016

What changes were proposed in this pull request?

This pr to fix a bug below in sampling with replacement

val df = Seq((1, 0), (2, 0), (3, 0)).toDF("a", "b")
df.sample(true, 2.0).withColumn("c", monotonically_increasing_id).select($"c").show
+---+
|  c|
+---+
|  0|
|  1|
|  1|
|  1|
|  2|
+---+

How was this patch tested?

Added a test in DataFrameSuite.

@maropu
Copy link
Member Author

maropu commented Aug 25, 2016

@HyukjinKwon @rxin could you check this?

@HyukjinKwon
Copy link
Member

Thank you for cc me @maropu
I like this change. BTW, we may have to add a condition for enforcing sampling ratio <= 1.0 too.

@rxin
Copy link
Contributor

rxin commented Aug 25, 2016

Can you describe the bug?

@maropu
Copy link
Member Author

maropu commented Aug 25, 2016

SampleExec possibly generates multiple rows from a single row when withReplacement=true.
So, CodegenContext#copyResult should be true.
See: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L107

@maropu
Copy link
Member Author

maropu commented Aug 25, 2016

@HyukjinKwon thanks your comment. I'll add the requirement.

@SparkQA
Copy link

SparkQA commented Aug 25, 2016

Test build #64392 has finished for PR 14800 at commit 97dde82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 25, 2016

Test build #64400 has finished for PR 14800 at commit 81c41d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Aug 25, 2016

Also, is it really necessary to limit the sample rate to be <= 1? It's not incoherent to want to sample 200% of a data set if it is with replacement. You'd just be generating a data set 2x the size drawn from the same empirical distribution.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Aug 25, 2016

@srowen Actually, we are already enforcing it to 100% when the replacement is disabled. So, I suggested this to match this up when it is enabled. Yes, it seems not related with the bug this PR is trying to fix. I apologise for the irrelevant comment.

Ah, it is enforced into 100% when the replacement is disabled because there should be replacements when it exceeds. I see. I thought sampling is to have a representative smaller population from a larger one and therefore, it is not sensible when it exceeds 100%.

@maropu
Copy link
Member Author

maropu commented Aug 25, 2016

In the definition of statistic terms, Sampling is to select a subset of whole data
So, I think the sample rate to be <= 1 is more reasonable.
See: https://en.wikipedia.org/wiki/Sampling_(statistics)

@srowen
Copy link
Member

srowen commented Aug 25, 2016

True, but, in the with-replacement case, you're no longer selecting a subset to begin with, because an element can appear twice. "Sample" does generally mean "take a smaller set" but it also means things like "sampling from a distribution".

I wouldn't feel strongly about it except that we're taking away behavior that worked fine.
The RDD API for example, doesn't enforce that the rate must be <= 1 (even for without replacement, which is wrong).

@maropu
Copy link
Member Author

maropu commented Aug 25, 2016

yea, I see. I also have no strong opinion on this. So, both is okay to me.
For now, I'll remove the requirement. What do u think? cc: @HyukjinKwon

@HyukjinKwon
Copy link
Member

I am okay with both too. I apologise for the irrelevant comment @maropu .

@maropu
Copy link
Member Author

maropu commented Aug 25, 2016

No problem, thanks your attention :) okay, I'll remove this.

@srowen
Copy link
Member

srowen commented Aug 25, 2016

LGTM as a targeted fix

@SparkQA
Copy link

SparkQA commented Aug 25, 2016

Test build #64412 has finished for PR 14800 at commit 97dde82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Aug 27, 2016

Merged to master/2.0

asfgit pushed a commit that referenced this pull request Aug 27, 2016
## What changes were proposed in this pull request?
This pr to fix a bug below in sampling with replacement
```
val df = Seq((1, 0), (2, 0), (3, 0)).toDF("a", "b")
df.sample(true, 2.0).withColumn("c", monotonically_increasing_id).select($"c").show
+---+
|  c|
+---+
|  0|
|  1|
|  1|
|  1|
|  2|
+---+
```

## How was this patch tested?
Added a test in `DataFrameSuite`.

Author: Takeshi YAMAMURO <[email protected]>

Closes #14800 from maropu/FixSampleBug.

(cherry picked from commit cd0ed31)
Signed-off-by: Sean Owen <[email protected]>
@asfgit asfgit closed this in cd0ed31 Aug 27, 2016
@rxin
Copy link
Contributor

rxin commented Aug 27, 2016

@maropu when I asked to describe the bug, I was referring to updating the pull request description to include the description of the bug. Please do that in the future. Thanks.

@maropu
Copy link
Member Author

maropu commented Aug 27, 2016

@rxin okay, I'll do that next time. thanks!

@maropu maropu deleted the FixSampleBug branch July 5, 2017 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants