[SPARK-19732][SQL][PYSPARK] Add fill functions for nulls in bool fields of datasets #18164

rberenguel · 2017-05-31T15:59:54Z

What changes were proposed in this pull request?

Allow fill/replace of NAs with booleans, both in Python and Scala

How was this patch tested?

Unit tests, doctests

This PR is original work from me and I license this work to the Spark project

ueshin · 2017-06-02T06:45:26Z

python/pyspark/sql/tests.py

+        self.assertEqual(row.name, None)
+        self.assertEqual(row.age, None)
+        self.assertEqual(row.height, None)
+        self.assertEqual(row.spy, True)


Should this be None, or should the subset argument of fillna() above be ['name', 'spy']?

Hi @ueshin, indeed! Thanks for catching this; I have modified the test. But as it stood when you commented, this test should have failed, shouldn't it? The subset should not have been applied to spy, so spy should have been None and the assertion should have failed; yet either the test passed or it didn't run, if I understand correctly how subsetting works in fillna. This is weird, since I didn't change any internals of how it works; I just created the methods to enable it.

Well, I think this fails :

======================================================================
ERROR [0.452s]: test_fillna (pyspark.sql.tests.SQLTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File ".../spark/python/pyspark/sql/tests.py", line 1749, in test_fillna
    self.assertEqual(row.spy, True)
AssertionError: None != True

@rberenguel I'm sorry, but I didn't understand what you are getting at.
I guess that if the subset is ['name', 'spy'], as in your update, row.spy will become True because row.spy is BooleanType and the fill value is a boolean.

@HyukjinKwon The test passed in my local environment after I updated to the latest commit.

Yea, I meant your initial comment was right ...

Ah, I see. Thanks.
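The subset behaviour discussed in this thread can be illustrated with a toy model in plain Python. This is only a sketch of the semantics described above (a null is filled only when its column is in the subset and the column type matches the fill value), not Spark's implementation. The helper `fill_nulls`, the dict-based schema, and the subset `['name', 'age']` are all hypothetical and chosen for illustration.

```python
def fill_nulls(row, schema, value, subset=None):
    """Toy model of fillna: replace None entries of `row` with `value`.

    Only columns listed in `subset` (all columns when subset is None) whose
    declared type is exactly the type of `value` are filled.
    """
    columns = subset if subset is not None else list(schema)
    filled = dict(row)
    for name in columns:
        if schema[name] is type(value) and filled[name] is None:
            filled[name] = value
    return filled


# Hypothetical schema mirroring the columns used in the test under review.
schema = {"name": str, "age": int, "height": float, "spy": bool}
row = {"name": None, "age": None, "height": None, "spy": None}

# A subset that omits "spy" leaves it null, so asserting True would fail.
without_spy = fill_nulls(row, schema, True, subset=["name", "age"])

# With "spy" in the subset, only the boolean column is filled.
with_spy = fill_nulls(row, schema, True, subset=["name", "spy"])
```

In this model `name` stays null in both calls because a boolean fill value does not match a string column, which is the reason ueshin gives for why only `row.spy` becomes True.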

HyukjinKwon · 2017-06-02T08:01:35Z

@ueshin, do you think it is okay to add this? I want to help review here if so.

ueshin · 2017-06-02T08:37:34Z

@HyukjinKwon Yes, I think it's okay to add this.

ueshin · 2017-06-02T08:48:56Z

ok to test

SparkQA · 2017-06-02T08:54:35Z

Test build #77674 has finished for PR 18164 at commit 1b3c712.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon

Thanks for your quick response @ueshin.

I left some comments. @rberenguel Could you check out those please?

HyukjinKwon · 2017-06-02T08:43:55Z

python/pyspark/sql/dataframe.py

        """
        if not isinstance(value, (float, int, long, basestring, dict)):
-            raise ValueError("value should be a float, int, long, string, or dict")
+            raise ValueError("value should be a float, int, long, string, boolean or dict")


I think we should use the same term, bool or boolean (:param value: above)

Thanks, will change

HyukjinKwon · 2017-06-02T08:45:38Z

python/pyspark/sql/dataframe.py

        | 50|  null|unknown|
        +---+------+-------+
        """
        if not isinstance(value, (float, int, long, basestring, dict)):


I know that bool in Python inherits from int, but wouldn't it be clearer if we explicitly mentioned it here? I don't feel strongly about this.

BTW, this rings a bell - some Python APIs take a bool in this way and work unexpectedly in some cases IIRC ...

I omitted it just because it wasn't failing for this if, but indeed, I'm a bit more on the side of putting it in even if just for completeness. Makes reading the code much saner if we have the if for bool
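The point about bool and int can be checked directly. The snippet below is a small sketch in Python 3 syntax, where `int` and `str` stand in for the Python 2 names `long` and `basestring` used in the diff. The function name `is_accepted_fill_value` is hypothetical and is not part of PySpark.

```python
# bool is a subclass of int, so an isinstance check against int already
# accepts True and False even when bool is not listed.
assert issubclass(bool, int)
assert isinstance(True, int)


def is_accepted_fill_value(value):
    """Hypothetical validator that lists bool explicitly for readability."""
    return isinstance(value, (bool, float, int, str, dict))


accepted = [is_accepted_fill_value(v) for v in (True, 1, 1.5, "x", {"a": 1})]
rejected = [is_accepted_fill_value(v) for v in (None, [1], (1,))]
```

Listing `bool` does not change which values pass the check; it only makes the intent visible to the reader, which matches the "for completeness" argument above.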

HyukjinKwon · 2017-06-02T08:48:53Z

python/pyspark/sql/dataframe.py

-        if isinstance(value, (int, long)):
+        if isinstance(value, bool):
+            pass
+        elif isinstance(value, (int, long)):


Could we just make this not isinstance(value, bool) and isinstance(value, (int, long)) (maybe with a small comment)?

Thanks, indeed makes sense and makes it a bit nicer than having a pass.
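A minimal sketch of the suggested guard follows, in Python 3 syntax (`int` in place of `(int, long)`). It assumes the integer branch converts the value to a float, which the diff hunk above does not show. The function name `normalize_fill_value` is hypothetical.

```python
def normalize_fill_value(value):
    """Convert plain integers to float, but leave booleans untouched."""
    # bool is a subclass of int, so it must be excluded explicitly;
    # otherwise True would be converted to 1.0.
    if not isinstance(value, bool) and isinstance(value, int):
        return float(value)
    return value


from_bool = normalize_fill_value(True)
from_int = normalize_fill_value(3)
from_str = normalize_fill_value("unknown")
```

Folding the bool test into the condition removes the empty `pass` branch while keeping the same behaviour for every other type.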

HyukjinKwon · 2017-06-02T08:55:13Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala


+  /**
+   * Returns a new `DataFrame` that replaces null values in boolean columns with `value`.
+   */


Looks like we need @since 2.3.0 for this and for the same instances below.

I wasn't sure about this, wanted to ask actually. Thanks!

HyukjinKwon · 2017-06-02T08:56:59Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala

+   */
+  def fill(value: Boolean): DataFrame = fill(value, df.columns)
+
+  /**


I think a boolean column could not have "NaN values".

Oh, right. I copied the defs and docs from double, as it shows. Will change, NaN booleans would be weird indeed

HyukjinKwon · 2017-06-02T08:58:31Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala

      ).toDF("name", "age", "height")
  }

+  def createBooleanDF(): DataFrame = {


It looks like this function is only used once. I think we could just move its lines into the "fill" test.

Yup, right. I added it on top to keep both together, but it's only used for the boolean tests

HyukjinKwon · 2017-06-02T09:00:59Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala

+    // boolean
+    checkAnswer(
+      boolInput.na.fill(true).select("spy"),
+      Row(false) :: Row(true) :: Row(true) ::


I think we could make this inlined.

Sorry, what do you mean by inlined here?

Ah, I meant ...

Row(false) :: Row(true) :: Row(true) :: Row(true) :: Nil

because it does not appear to exceed the line length limit of 100: https://github.com/apache/spark/blob/master/scalastyle-config.xml#L78

rberenguel · 2017-06-02T09:14:39Z

@ueshin @HyukjinKwon thanks for giving it a very thorough look, and sorry for my previous comment, which was terribly unclear. I was confused because the AppVeyor tick mark was green for commit 076ebed and I had run the tests locally (forgot linting, though), so I was pretty sure the test was right and couldn't see how a test with the wrong subset was still passing.

I probably skipped the wrong step when running the Python tests (I'm still figuring out which corners I can cut to avoid a full compile/build cycle for the whole project, which takes ages for me), so I didn't see my local test failing. The remote one was more puzzling; I guess AppVeyor had a hiccup here. Sorry again for the confused and confusing statements above :)

HyukjinKwon · 2017-06-02T09:17:21Z

Ah... I see. Sorry, I misunderstood. BTW, AppVeyor only runs SparkR tests on Windows currently.

HyukjinKwon · 2017-06-02T10:05:35Z

BTW, mind fixing the title/description of the PR to be a bit more descriptive, for example, saying "null" instead of "NA"? Not a big deal, but non-R guys might get confused ...

rberenguel · 2017-06-02T10:20:37Z

@HyukjinKwon I changed it, does it look any clearer? I have always thought of na in terms of Python (pandas) and not R anyway :)

HyukjinKwon · 2017-06-02T11:03:35Z

Aaa... okay, that's fine with me. NA always reminds me of R first :).

SparkQA · 2017-06-02T11:18:52Z

Test build #77675 has finished for PR 18164 at commit 21b4f67.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-06-02T22:17:23Z

Test build #77685 has finished for PR 18164 at commit fb65d34.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-06-03T01:53:24Z

For me, it looks good. Please let me leave this to @ueshin.

ueshin · 2017-06-03T05:33:45Z

LGTM.

ueshin · 2017-06-03T05:57:55Z

Thanks! Merging to master.

…d fillna

## What changes were proposed in this pull request?

#18164 introduces the behavior changes. We need to document it.

## How was this patch tested?

N/A

Author: gatorsmile <[email protected]>

Closes #20234 from gatorsmile/docBehaviorChange.

(cherry picked from commit b46e58b)
Signed-off-by: hyukjinkwon <[email protected]>

rberenguel added 3 commits May 31, 2017 00:22

SPARK-19732 fillna for booleans in place. Tomorrow testing for it and…

60b60e3

… adding to PySpark

SPARK-19732 Fillna for booleans (Spark and Scala)

fb0904b

Merge branch 'master' into SPARK-19732-fillna-bools

076ebed

rberenguel changed the title ~~[Spark-19732][SQL][PYSPARK] fillna bools~~ [SPARK-19732][SQL][PYSPARK] fillna bools May 31, 2017

ueshin reviewed Jun 2, 2017

rberenguel added 2 commits June 2, 2017 08:43

SPARK-19732 Typo in subsetting for fillna test

4c6666b

Merge branch 'SPARK-19732-fillna-bools' of https://github.com/rbereng…

1b3c712

…uel/spark into SPARK-19732-fillna-bools

SPARK-19732 Remove some long line fluff for the linter

21b4f67

HyukjinKwon reviewed Jun 2, 2017

rberenguel changed the title ~~[SPARK-19732][SQL][PYSPARK] fillna bools~~ [SPARK-19732][SQL][PYSPARK] Add fill functions for nulls in bool fields of datasets Jun 2, 2017

SPARK-19732 Implementing code review suggestions

fb65d34

asfgit closed this in 6cbc61d Jun 3, 2017

gatorsmile mentioned this pull request Jan 11, 2018

[SPARK-19732] [Follow-up] Document behavior changes made in na.fill and fillna #20234

Closed
