[SPARK-15382][SQL] Fix a bug in sampling with replacement #14800
|
@HyukjinKwon @rxin could you check this? |
|
Thank you for cc'ing me, @maropu. |
|
Can you describe the bug? |
|
|
|
@HyukjinKwon thanks for your comment. I'll add the requirement. |
|
Test build #64392 has finished for PR 14800 at commit
|
|
Test build #64400 has finished for PR 14800 at commit
|
|
Also, is it really necessary to limit the sample rate to be <= 1? It's not incoherent to want to sample 200% of a data set if it is with replacement. You'd just be generating a data set 2x the size drawn from the same empirical distribution. |
|
@srowen Actually, we already cap it at 100% when replacement is disabled, so I suggested this to make the behavior match when it is enabled. Yes, it seems unrelated to the bug this PR is trying to fix. I apologise for the irrelevant comment. Ah, it is capped at 100% when replacement is disabled because exceeding 100% would require replacement. I see. I thought sampling meant drawing a representative smaller population from a larger one, and that it was therefore not sensible for it to exceed 100%. |
|
In the definition of statistic terms, Sampling is to select a |
|
True, but, in the with-replacement case, you're no longer selecting a subset to begin with, because an element can appear twice. "Sample" does generally mean "take a smaller set" but it also means things like "sampling from a distribution". I wouldn't feel strongly about it except that we're taking away behavior that worked fine. |
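
To make the distinction in this thread concrete, here is a minimal plain-Scala sketch of the two sampling modes. It is an illustration only, not Spark's implementation: the object `SamplingSketch` and its methods are hypothetical names, and using a Poisson draw per element for the with-replacement case is an assumption chosen because it gives an expected output size of `fraction * n`, which is why a fraction such as 2.0 remains meaningful there.

```scala
import scala.util.Random

// Hypothetical helper for illustration; none of these names come from Spark.
object SamplingSketch {
  // Draw one Poisson(mean) variate with Knuth's multiplication method:
  // count how many uniform draws it takes for their product to fall to e^(-mean).
  def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var count = 0
    var product = rng.nextDouble()
    while (product > limit) {
      count += 1
      product *= rng.nextDouble()
    }
    count
  }

  // With replacement: each element is emitted Poisson(fraction) times,
  // so the fraction may exceed 1.0 and elements may repeat.
  def sampleWithReplacement[A](data: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    require(fraction >= 0.0, "fraction must be non-negative")
    val rng = new Random(seed)
    data.flatMap(x => Seq.fill(poisson(fraction, rng))(x))
  }

  // Without replacement: each element is kept at most once with probability
  // `fraction`, so a fraction above 1.0 has no meaning here.
  def sampleWithoutReplacement[A](data: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }
}
```

Calling `SamplingSketch.sampleWithReplacement(1 to 10000, 2.0, seed = 42L)` should yield a sequence of roughly twice the input size with repeated elements, while `sampleWithoutReplacement` can never return more rows than it was given.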
|
Yeah, I see. I also have no strong opinion on this, so either is okay with me. |
|
I am okay with both too. I apologise for the irrelevant comment @maropu . |
|
No problem, thanks for your attention :) Okay, I'll remove this. |
|
LGTM as a targeted fix |
|
Test build #64412 has finished for PR 14800 at commit
|
|
Merged to master/2.0 |
## What changes were proposed in this pull request?
This PR fixes the bug below in sampling with replacement, where `monotonically_increasing_id` returns duplicate values:
```
val df = Seq((1, 0), (2, 0), (3, 0)).toDF("a", "b")
df.sample(true, 2.0).withColumn("c", monotonically_increasing_id).select($"c").show
+---+
| c|
+---+
| 0|
| 1|
| 1|
| 1|
| 2|
+---+
```
## How was this patch tested?
Added a test in `DataFrameSuite`.
Author: Takeshi YAMAMURO <[email protected]>
Closes #14800 from maropu/FixSampleBug.
(cherry picked from commit cd0ed31)
Signed-off-by: Sean Owen <[email protected]>
|
@maropu when I asked to describe the bug, I was referring to updating the pull request description to include the description of the bug. Please do that in the future. Thanks. |
|
@rxin okay, I'll do that next time. thanks! |