-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-22797][PySpark] Bucketizer support multi-column #19892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #84478 has finished for PR 19892 at commit
|
|
Test build #84479 has finished for PR 19892 at commit
|
|
Test build #84480 has finished for PR 19892 at commit
|
|
This PR is currently blocked by #19894 (comment) |
|
retest this please |
|
Test build #84898 has finished for PR 19892 at commit
|
|
retest this please |
|
Test build #84900 has finished for PR 19892 at commit
|
|
ping @holdenk , can you help reviewing this? |
python/pyspark/ml/feature.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update this doc too?
python/pyspark/ml/param/__init__.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toListFloat requires each list entry is a numeric (TypeConverters._is_numeric). Should toListListFloat have such requirement?
|
This needs an individual JIRA. @MLnick created SPARK-22797 for this. Please use it. |
|
@viirya Thanks a lot for reviewing this! I will update the title to use the new ticket. |
5efc94e to
e1fb379
Compare
|
Test build #85087 has finished for PR 19892 at commit
|
python/pyspark/ml/feature.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: there is a work to change this behavior to throw an exception, instead of a log warning. We should change this document later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@holdenk @zhengruifeng this comment will need to be changed as per #19993 - but that has not been merged yet. I think #19993 will block 2.3 though, so we could preemptively change the doc here to match the Scala side in #19993 about throwing an exception.
|
One minor comment, otherwise LGTM. |
|
ping @MLnick ? |
e1fb379 to
4b2fc6f
Compare
|
Test build #86009 has finished for PR 19892 at commit
|
python/pyspark/ml/feature.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it would be cleaner to do a df.show() here? Likewise above for bucketed we could change that part of the doctest too.
python/pyspark/ml/feature.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values1 & values2?
python/pyspark/ml/param/__init__.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a test case in ParamTypeConversionTests for this new method; see test_list_float for reference.
4b2fc6f to
e869e75
Compare
|
Test build #86168 has finished for PR 19892 at commit
|
|
Test build #86169 has finished for PR 19892 at commit
|
|
@MLnick Thanks for your reviewing and suggestions. I have updated this PR |
MLnick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor comment on the updated doctest.
I don't think this will make it into 2.3 given the code freeze and branch has been cut already. In which case we will need to change the @since tags.
Pending the error throwing PR for Scala Bucketizer, we can update the doc here.
python/pyspark/ml/feature.py
Outdated
| >>> bucketed[3].buckets | ||
| 2.0 | ||
| ... inputCol="values1", outputCol="buckets") | ||
| >>> bucketed = bucketizer.setHandleInvalid("keep").transform(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may actually be neater to show only values1 and bucketed - so perhaps .transform(df.select('values1'))?
|
Test build #86170 has finished for PR 19892 at commit
|
|
I mean I think it might have a chance, generally speaking we've allowed outstanding PRs to be merged after the freeze. Since there are outstanding blockers on the branch preventing us from cutting RC2 maybe its ok to move forward if we can do it quickly? Of course I defer to MLNick :) |
|
I’m generally ok with these small python api wrapper additions getting
merged as long as the risk of breaking anything is low - and here it is
since it’s just api parity
…On Fri, 19 Jan 2018 at 06:08, Holden Karau ***@***.***> wrote:
I mean I think it might have a chance, generally speaking we've allowed
outstanding PRs to be merged after the freeze. Since there are outstanding
blockers on the branch preventing us from cutting RC2 maybe its ok to move
forward if we can do it quickly? Of course I defer to MLNick :)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#19892 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AA_SBy2cr5MUJ9rN7egqwHf9GCLH0tCiks5tMBVSgaJpZM4Q2CRd>
.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM and I'll merge today (Australia time) if there are no objections. (note: this means also waiting on @MLnick switching from -1 to approve).
|
If it is going to get merged to |
|
Test build #86464 has finished for PR 19892 at commit
|
|
@MLnick you ok with this then? |
|
@holdenk everything except my comment in #19892 (comment). I'd propose to just preemptively update the doc about an exception being thrown. |
|
Test build #86519 has finished for PR 19892 at commit
|
|
RC2 has been cut - @jkbradley do you see #19993 as a blocker? I think it should be merged for |
|
Merged to master / branch-2.3. Thanks! |
## What changes were proposed in this pull request? Bucketizer support multi-column in the python side ## How was this patch tested? existing tests and added tests Author: Zheng RuiFeng <[email protected]> Closes #19892 from zhengruifeng/20542_py. (cherry picked from commit c22eaa9) Signed-off-by: Nick Pentreath <[email protected]>
|
I reverted this (see #20410 for details) - we can re-open it once that issue is solved. |
|
Can this be re-open now? |
|
Has there been any update on re-opening this? And also adding multiple column support to QuantileDiscretizer? |
|
I am sorry, I am afread I can not re-open this PR beacuse I deleted it by mistake. |
What changes were proposed in this pull request?
Bucketizer support multi-column in the python side
How was this patch tested?
existing tests and added tests