
Conversation

@imback82
Contributor

What changes were proposed in this pull request?

#26700 removed the ability to drop a row whose nested column value is null.

For example, for the following df:

```
val schema = new StructType()
  .add("c1", new StructType()
    .add("c1-1", StringType)
    .add("c1-2", StringType))
val data = Seq(Row(Row(null, "a2")), Row(Row("b1", "b2")), Row(null))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```

In Spark 2.4.4,

```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```

In Spark 2.4.5 or Spark 3.0.0-preview2, if nested columns are specified, they are ignored.

```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```

Why are the changes needed?

This seems like a regression.
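
For reference, the intended behavior of `df.na.drop("any", Seq("c1.c1-1"))` is equivalent to filtering on the nested field directly. The snippet below is a minimal sketch of that equivalent predicate (a user-level workaround, not this PR's implementation), reusing the example `df` above:

```
import org.apache.spark.sql.functions.col

// Keep only rows where the nested field c1.c1-1 is non-null.
// Rows where c1 itself is null are dropped as well, since c1.c1-1 is then null.
val dropped = df.where(col("c1.c1-1").isNotNull)
dropped.show()
// +--------+
// |      c1|
// +--------+
// |[b1, b2]|
// +--------+
```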

Does this PR introduce any user-facing change?

Now, the nested column can be specified:

```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```

Also, if * is specified as a column, it will throw an AnalysisException that * cannot be resolved, which was the behavior in 2.4.4. Currently, in master, it has no effect.

How was this patch tested?

Updated existing tests.

```
val exception = intercept[AnalysisException] {
  df.na.drop("any", Seq("*"))
}
assert(exception.getMessage.contains("Cannot resolve column name \"*\""))
```
Contributor Author


Note that this was the behavior in Spark 2.4.4. We can handle this more gracefully (e.g., use outputAttributes) if we need to.

On a side note, for fill, * is ignored in Spark 2.4.4.

```
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data), schema)
// Nested columns are ignored for fill().
checkAnswer(df.na.fill("a1", Seq("c1.c1-1")), df)
```
Contributor Author


Note that nested columns are ignored for fill in Spark 2.4.4.
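
Since `fill` ignores nested columns, one possible user-level workaround is to rebuild the struct with the filled field. The following is a minimal sketch under the example schema above (the fill value "a1" is the one used in the test); note that it also materializes a non-null `c1` for rows where the struct itself was null:

```
import org.apache.spark.sql.functions.{coalesce, col, lit, struct}

// Replace nulls in c1.c1-1 with "a1" by reconstructing the c1 struct.
// Caveat: rows where c1 itself is null end up with a non-null struct.
val filled = df.withColumn("c1",
  struct(
    coalesce(col("c1.c1-1"), lit("a1")).as("c1-1"),
    col("c1.c1-2").as("c1-2")))
```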

@imback82
Contributor Author

@cloud-fan Please let me know if this PR (going back to 2.4.4 behavior) makes sense. Thanks!

@imback82 imback82 changed the title [SPARK-31256][SQL] Dropna should work for nested columns [SPARK-31256][SQL] DataFrameNaFunctions.drop should work for nested columns Apr 19, 2020
@TJX2014
Contributor

TJX2014 commented Apr 20, 2020

Nice. Duplicate columns hit the same issue as struct aliases, which do not work in the toAttributes method of DataFrameNaFunctions.

@SparkQA

SparkQA commented Apr 20, 2020

Test build #121490 has finished for PR 28266 at commit 283fee1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@cloud-fan cloud-fan left a comment


good catch!

@cloud-fan cloud-fan closed this in d7499ae Apr 20, 2020
cloud-fan pushed a commit that referenced this pull request Apr 20, 2020
[SPARK-31256][SQL] DataFrameNaFunctions.drop should work for nested columns

### What changes were proposed in this pull request?

#26700 removed the ability to drop a row whose nested column value is null.

For example, for the following `df`:
```
val schema = new StructType()
  .add("c1", new StructType()
    .add("c1-1", StringType)
    .add("c1-2", StringType))
val data = Seq(Row(Row(null, "a2")), Row(Row("b1", "b2")), Row(null))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```
In Spark 2.4.4,
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```
In Spark 2.4.5 or Spark 3.0.0-preview2, if nested columns are specified, they are ignored.
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```
### Why are the changes needed?

This seems like a regression.

### Does this PR introduce any user-facing change?

Now, the nested column can be specified:
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```

Also, if `*` is specified as a column, it will throw an `AnalysisException` that `*` cannot be resolved, which was the behavior in 2.4.4. Currently, in master, it has no effect.

### How was this patch tested?

Updated existing tests.

Closes #28266 from imback82/SPARK-31256.

Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d7499ae)
Signed-off-by: Wenchen Fan <[email protected]>

```
checkAnswer(df.select("c1.c1-1"),
  Row(null) :: Row("b1") :: Row(null) :: Nil)

test("drop with nested columns") {
```
Member


nit: This looks like a bug fix, so could you add the SPARK-31256 prefix to the test name?
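
For illustration, the suggested convention would look like this (only the test-name prefix changes):

```
test("SPARK-31256: drop with nested columns") {
  // ...
}
```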

Member


nvm, a bit late...

Contributor


😂

cloud-fan pushed a commit that referenced this pull request Apr 20, 2020
[SPARK-31256][SQL] DataFrameNaFunctions.drop should work for nested columns

@cloud-fan
Contributor

merging to master/3.0/2.4

@dongjoon-hyun
Member

So, SPARK-31256 is a regression introduced in 2.4.5, and this PR recovers from it?

@cloud-fan
Contributor

@dongjoon-hyun yes

@dongjoon-hyun
Member

Thank you for confirmation~
