Skip to content

Conversation

@rberenguel
Copy link
Contributor

What changes were proposed in this pull request?

Allow fill/replace of NAs with booleans, both in Python and Scala

How was this patch tested?

Unit tests, doctests

This PR is original work from me and I license this work to the Spark project

@rberenguel rberenguel changed the title [Spark-19732][SQL][PYSPARK] fillna bools [SPARK-19732][SQL][PYSPARK] fillna bools May 31, 2017
self.assertEqual(row.name, None)
self.assertEqual(row.age, None)
self.assertEqual(row.height, None)
self.assertEqual(row.spy, True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be None or an argument subset of fillna() above should be ['name', 'spy']?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ueshin indeed! Thanks for catching this, I have modified the test. BUT, this test, as it stands on your comment, should have failed, doesn't it? The subset should not have been applied to spy (so, spy should have been None, and the assertion should have been marked as false, but either the test passed or the test didn't run), if I understood correctly how subsetting fillna's work. But this is weird, since I didn't change any internals of how it works, I just created the methods to enable it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I think this fails :

======================================================================
ERROR [0.452s]: test_fillna (pyspark.sql.tests.SQLTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File ".../spark/python/pyspark/sql/tests.py", line 1749, in test_fillna
    self.assertEqual(row.spy, True)
AssertionError: None != True

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rberenguel I'm sorry but I didn't understand what you are getting at.
I guess if the subset is ['name', 'spy'] as you updated, row.spy will become True because the row.spy is BooleanType and the value is boolean.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon I passed the test in my local environment after I updated to the latest commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I meant your initial comment was right ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Thanks.

@HyukjinKwon
Copy link
Member

@ueshin, do you think it is okay to add this? I want to help review here if so.

@ueshin
Copy link
Member

ueshin commented Jun 2, 2017

@HyukjinKwon Yes, I think it's okay to add this.

@ueshin
Copy link
Member

ueshin commented Jun 2, 2017

ok to test

@SparkQA
Copy link

SparkQA commented Jun 2, 2017

Test build #77674 has finished for PR 18164 at commit 1b3c712.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your quick response @ueshin.

I left some comments. @rberenguel Could you check out those please?

"""
if not isinstance(value, (float, int, long, basestring, dict)):
raise ValueError("value should be a float, int, long, string, or dict")
raise ValueError("value should be a float, int, long, string, boolean or dict")
Copy link
Member

@HyukjinKwon HyukjinKwon Jun 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use the same term, bool or boolean (:param value: above)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will change

| 50| null|unknown|
+---+------+-------+
"""
if not isinstance(value, (float, int, long, basestring, dict)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know a bool in Python inherits an int but wouldn't it be more clear if we explicitly mention it here? I don't strongly feel about this.

BTW, this rings a bell - some Python APIs take a bool in this way and work unexpectedly in some cases IIRC ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I omitted it just because it wasn't failing for this if, but indeed, I'm a bit more on the side of putting it in even if just for completeness. Makes reading the code much saner if we have the if for bool

if isinstance(value, (int, long)):
if isinstance(value, bool):
pass
elif isinstance(value, (int, long)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just make this not isinstance(value, bool) and isinstance(value, (int, long)) (maybe with a small comment)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, indeed makes sense and makes it a bit nicer than having a pass.


/**
* Returns a new `DataFrame` that replaces null values in boolean columns with `value`.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks we need @since 2.3.0 for this and the same instances below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure about this, wanted to ask actually. Thanks!

*/
def fill(value: Boolean): DataFrame = fill(value, df.columns)

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a boolean column could not have "NaN values".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right. I copied the defs and docs from double, as it shows. Will change, NaN booleans would be weird indeed

).toDF("name", "age", "height")
}

def createBooleanDF(): DataFrame = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks this functions is only used once. I think we could just move the lines in the functions into the test, "fill".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, right. I added it on top to keep both together, but it's only used for the boolean tests

// boolean
checkAnswer(
boolInput.na.fill(true).select("spy"),
Row(false) :: Row(true) :: Row(true) ::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could make this inlined.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, what do you mean by inlined here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I meant ...

Row(false) :: Row(true) :: Row(true) :: Row(true) :: Nil

because it does not look exceeding the length limit, 100 - https://github.com/apache/spark/blob/master/scalastyle-config.xml#L78

@rberenguel
Copy link
Contributor Author

@ueshin @HyukjinKwon thanks for giving it a very thorough look and sorry for my previous comment, that was terribly unclear. I was confused because the Appveyor tick mark was green for commit 076ebed and I had run the tests locally (forgot linting, though), so I was pretty sure the test was right and I was confused about how the subset wrong still had a passing test.

I probably skipped the wrong step for testing the Python tests (I'm still figuring out which corners I can cut to avoid a full compile/build cycle for the whole project, which takes ages for me) so I didn't see my local test failing, but the remote one was more puzzling, I guess appveyor had a hiccup here. Sorry again for the confused and confusing statements above :)

@HyukjinKwon
Copy link
Member

Ah... I see. Sorry, I misunderstood. BTW, AppVeyor only runs SparkR tests on Windows currently.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jun 2, 2017

BTW, mind fixng the title/description of the PR to be a bit more descriptive, for example, saying "null" instead of "NA"? Not a big deal but non R guys might get confused ...

@rberenguel rberenguel changed the title [SPARK-19732][SQL][PYSPARK] fillna bools [SPARK-19732][SQL][PYSPARK] Add fill functions for nulls in bool fields of datasets Jun 2, 2017
@rberenguel
Copy link
Contributor Author

rberenguel commented Jun 2, 2017

@HyukjinKwon I changed it, does it look any clearer? I have always thought of na in terms of Python (pandas) and not R anyway :)

@HyukjinKwon
Copy link
Member

Aaa...okay that's fine to me. NA always reminds me of R first :).

@SparkQA
Copy link

SparkQA commented Jun 2, 2017

Test build #77675 has finished for PR 18164 at commit 21b4f67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 2, 2017

Test build #77685 has finished for PR 18164 at commit fb65d34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

For me, it looks good. Please let me leave this to @ueshin.

@ueshin
Copy link
Member

ueshin commented Jun 3, 2017

LGTM.

@ueshin
Copy link
Member

ueshin commented Jun 3, 2017

Thanks! Merging to master.

@asfgit asfgit closed this in 6cbc61d Jun 3, 2017
asfgit pushed a commit that referenced this pull request Jan 11, 2018
…d fillna

## What changes were proposed in this pull request?
#18164 introduces the behavior changes. We need to document it.

## How was this patch tested?
N/A

Author: gatorsmile <[email protected]>

Closes #20234 from gatorsmile/docBehaviorChange.

(cherry picked from commit b46e58b)
Signed-off-by: hyukjinkwon <[email protected]>
asfgit pushed a commit that referenced this pull request Jan 11, 2018
…d fillna

## What changes were proposed in this pull request?
#18164 introduces the behavior changes. We need to document it.

## How was this patch tested?
N/A

Author: gatorsmile <[email protected]>

Closes #20234 from gatorsmile/docBehaviorChange.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants