[SPARK-19732][SQL][PYSPARK] Add fill functions for nulls in bool fields of datasets #18164
self.assertEqual(row.name, None)
self.assertEqual(row.age, None)
self.assertEqual(row.height, None)
self.assertEqual(row.spy, True)
This should be None, or the subset argument of fillna() above should be ['name', 'spy']?
Hi @ueshin, indeed! Thanks for catching this; I have modified the test. BUT the test, as it stood when you commented, should have failed, shouldn't it? If I understand correctly how subsetting works in fillna, the fill should not have been applied to spy, so spy should have been None and the assertion should have failed. Yet either the test passed or it didn't run. This is weird, since I didn't change any internals of how it works; I just created the methods to enable it.
Well, I think this fails:
======================================================================
ERROR [0.452s]: test_fillna (pyspark.sql.tests.SQLTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File ".../spark/python/pyspark/sql/tests.py", line 1749, in test_fillna
self.assertEqual(row.spy, True)
AssertionError: None != True
@rberenguel I'm sorry but I didn't understand what you are getting at.
I guess that if the subset is ['name', 'spy'], as you updated it, row.spy will become True because row.spy is BooleanType and the value is a boolean.
@HyukjinKwon I passed the test in my local environment after I updated to the latest commit.
Yea, I meant your initial comment was right ...
Ah, I see. Thanks.
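The subset behaviour discussed in this thread can be sketched in plain Python. This is only a model of the semantics as the reviewers describe them (a fill value is applied to a null only when the column is in the subset and its type matches the value), not PySpark code: `fill_nulls`, the dict-based row, and the `schema` mapping are hypothetical names introduced for illustration.

```python
# Plain-Python model of type-aware, subset-restricted null filling.
# `fill_nulls` and the dict-based rows are hypothetical; this is not PySpark code.

def fill_nulls(row, schema, value, subset=None):
    """Return a copy of `row` with None replaced by `value` in eligible columns.

    A column is eligible when it is listed in `subset` (or `subset` is None)
    and its declared type is exactly the type of `value`.
    """
    columns = list(schema) if subset is None else subset
    filled = dict(row)
    for name in columns:
        if schema[name] is type(value) and filled[name] is None:
            filled[name] = value
    return filled


schema = {"name": str, "age": int, "height": float, "spy": bool}
row = {"name": None, "age": None, "height": None, "spy": None}

# A subset that omits "spy" leaves it null; adding "spy" lets the boolean fill apply.
without_spy = fill_nulls(row, schema, True, subset=["name", "age"])
with_spy = fill_nulls(row, schema, True, subset=["name", "spy"])
print(without_spy["spy"], with_spy["spy"])  # None True
```

In this model, "name" stays None in both cases because a boolean value never matches a string column, which is why only the "spy" assertion depends on the subset.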
|
@ueshin, do you think it is okay to add this? I want to help review here if so. |
|
@HyukjinKwon Yes, I think it's okay to add this. |
|
ok to test |
|
Test build #77674 has finished for PR 18164 at commit
|
HyukjinKwon
left a comment
Thanks for your quick response @ueshin.
I left some comments. @rberenguel Could you check out those please?
python/pyspark/sql/dataframe.py
Outdated
  """
  if not isinstance(value, (float, int, long, basestring, dict)):
-     raise ValueError("value should be a float, int, long, string, or dict")
+     raise ValueError("value should be a float, int, long, string, boolean or dict")
I think we should use the same term, bool or boolean (:param value: above)
Thanks, will change
python/pyspark/sql/dataframe.py
Outdated
| 50|  null|unknown|
+---+------+-------+
"""
if not isinstance(value, (float, int, long, basestring, dict)):
I know bool in Python inherits from int, but wouldn't it be clearer if we explicitly mentioned it here? I don't feel strongly about this.
BTW, this rings a bell - some Python APIs take a bool in this way and work unexpectedly in some cases IIRC ...
I omitted it only because this if wasn't failing without it, but indeed, I lean towards putting it in, even if just for completeness. It makes reading the code much saner if the if mentions bool.
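The point about bool inheriting from int can be checked directly in Python. The sketch below is a minimal illustration, not the PR's code: `check_fill_value` is a hypothetical stand-in for the validation in fillna, written for Python 3, where `long` and `basestring` do not exist.

```python
def check_fill_value(value):
    """Raise ValueError unless `value` has a type accepted for filling."""
    # bool is listed explicitly for readability; it would already pass as an int.
    if not isinstance(value, (float, int, str, bool, dict)):
        raise ValueError("value should be a float, int, string, bool or dict")
    return value


print(issubclass(bool, int))            # True
print(isinstance(True, (int, float)))   # True, even though bool is not listed
print(check_fill_value(False))          # False
```

Listing bool changes nothing at runtime here; it only documents the intent, which is the readability argument made in the thread.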
python/pyspark/sql/dataframe.py
Outdated
- if isinstance(value, (int, long)):
+ if isinstance(value, bool):
+     pass
+ elif isinstance(value, (int, long)):
Could we just make this not isinstance(value, bool) and isinstance(value, (int, long)) (maybe with a small comment)?
Thanks, indeed makes sense and makes it a bit nicer than having a pass.
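The suggested condition can be sketched as follows. This is an illustration under an assumption: the body of the integer branch is not shown in the excerpt above, so the sketch assumes it widens plain integers to float. `normalize_fill_value` is a hypothetical helper name, and the code targets Python 3, so `long` is dropped.

```python
def normalize_fill_value(value):
    """Widen plain integers to float, but leave booleans untouched."""
    # bool must be excluded explicitly because isinstance(True, int) is True.
    if not isinstance(value, bool) and isinstance(value, int):
        return float(value)  # assumed branch body, not shown in the excerpt
    return value


print(normalize_fill_value(3))     # 3.0
print(normalize_fill_value(True))  # True
print(float(True))                 # 1.0, what an unguarded int branch would produce
```

A single combined condition avoids the empty `pass` branch while still keeping booleans out of the integer path.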
/**
 * Returns a new `DataFrame` that replaces null values in boolean columns with `value`.
 */
Looks like we need @since 2.3.0 for this and for the same instances below.
I wasn't sure about this, wanted to ask actually. Thanks!
 */
def fill(value: Boolean): DataFrame = fill(value, df.columns)

/**
I think a boolean column cannot have "NaN values".
Oh, right. I copied the defs and docs from double, as you can tell. Will change; NaN booleans would be weird indeed.
  ).toDF("name", "age", "height")
}

def createBooleanDF(): DataFrame = {
It looks like this function is only used once. I think we could just move its lines into the test, "fill".
Yup, right. I added it on top to keep both together, but it's only used for the boolean tests
// boolean
checkAnswer(
  boolInput.na.fill(true).select("spy"),
  Row(false) :: Row(true) :: Row(true) ::
I think we could make this inlined.
Sorry, what do you mean by inlined here?
Ah, I meant ...
Row(false) :: Row(true) :: Row(true) :: Row(true) :: Nil
because it does not look like it exceeds the length limit, 100 - https://github.com/apache/spark/blob/master/scalastyle-config.xml#L78
|
@ueshin @HyukjinKwon thanks for giving it a very thorough look, and sorry for my previous comment, which was terribly unclear. The AppVeyor tick mark was green for commit 076ebed and I had run the tests locally (I forgot linting, though), so I was pretty sure the test was right, and I was confused about how a test with the wrong subset could still pass. I probably skipped the wrong step when running the Python tests (I'm still figuring out which corners I can cut to avoid a full compile/build cycle for the whole project, which takes ages for me), so I didn't see my local test failing. The remote result was more puzzling; I guess AppVeyor had a hiccup here. Sorry again for the confused and confusing statements above :) |
|
Ah... I see. Sorry, I misunderstood. BTW, AppVeyor only runs SparkR tests on Windows currently. |
|
BTW, would you mind fixing the title/description of the PR to be a bit more descriptive, for example, saying "null" instead of "NA"? Not a big deal, but non-R guys might get confused ... |
|
@HyukjinKwon I changed it, does it look any clearer? I have always thought of |
|
Aaa... okay, that's fine with me. NA always reminds me of R first :). |
|
Test build #77675 has finished for PR 18164 at commit
|
|
Test build #77685 has finished for PR 18164 at commit
|
|
For me, it looks good. Please let me leave this to @ueshin. |
|
LGTM. |
|
Thanks! Merging to master. |
…d fillna ## What changes were proposed in this pull request? #18164 introduces the behavior changes. We need to document it. ## How was this patch tested? N/A Author: gatorsmile <[email protected]> Closes #20234 from gatorsmile/docBehaviorChange. (cherry picked from commit b46e58b) Signed-off-by: hyukjinkwon <[email protected]>
…d fillna ## What changes were proposed in this pull request? #18164 introduces the behavior changes. We need to document it. ## How was this patch tested? N/A Author: gatorsmile <[email protected]> Closes #20234 from gatorsmile/docBehaviorChange.
What changes were proposed in this pull request?
Allow fill/replace of NAs with booleans, both in Python and Scala
How was this patch tested?
Unit tests, doctests
This PR is original work from me and I license this work to the Spark project