-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-31256][SQL] DataFrameNaFunctions.drop should work for nested columns #28266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| val exception = intercept[AnalysisException] { | ||
| df.na.drop("any", Seq("*")) | ||
| } | ||
| assert(exception.getMessage.contains("Cannot resolve column name \"*\"")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this was the behavior in Spark 2.4.4. We can handle this more gracefully (e.g., use outputAttributes) if we need to.
On a side note, for fill, * is ignored in Spark 2.4.4.
| val df = spark.createDataFrame( | ||
| spark.sparkContext.parallelize(data), schema) | ||
| // Nested columns are ignored for fill(). | ||
| checkAnswer(df.na.fill("a1", Seq("c1.c1-1")), df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that nested columns are ignored for fill in Spark 2.4.4.
|
@cloud-fan Please let me know if this PR (going back to 2.4.4 behavior) makes sense. Thanks! |
|
nice,duplicate columns is same as struct alias which not works at toAttributes method in DataFrameNaFunctions. |
|
Test build #121490 has finished for PR 28266 at commit
|
cloud-fan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch!
…olumns ### What changes were proposed in this pull request? #26700 removed the ability to drop a row whose nested column value is null. For example, for the following `df`: ``` val schema = new StructType() .add("c1", new StructType() .add("c1-1", StringType) .add("c1-2", StringType)) val data = Seq(Row(Row(null, "a2")), Row(Row("b1", "b2")), Row(null)) val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema) df.show +--------+ | c1| +--------+ | [, a2]| |[b1, b2]| | null| +--------+ ``` In Spark 2.4.4, ``` df.na.drop("any", Seq("c1.c1-1")).show +--------+ | c1| +--------+ |[b1, b2]| +--------+ ``` In Spark 2.4.5 or Spark 3.0.0-preview2, if nested columns are specified, they are ignored. ``` df.na.drop("any", Seq("c1.c1-1")).show +--------+ | c1| +--------+ | [, a2]| |[b1, b2]| | null| +--------+ ``` ### Why are the changes needed? This seems like a regression. ### Does this PR introduce any user-facing change? Now, the nested column can be specified: ``` df.na.drop("any", Seq("c1.c1-1")).show +--------+ | c1| +--------+ |[b1, b2]| +--------+ ``` Also, if `*` is specified as a column, it will throw an `AnalysisException` that `*` cannot be resolved, which was the behavior in 2.4.4. Currently, in master, it has no effect. ### How was this patch tested? Updated existing tests. Closes #28266 from imback82/SPARK-31256. Authored-by: Terry Kim <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit d7499ae) Signed-off-by: Wenchen Fan <[email protected]>
|
|
||
| checkAnswer(df.select("c1.c1-1"), | ||
| Row(null) :: Row("b1") :: Row(null) :: Nil) | ||
| test("drop with nested columns") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This looks like a bug, so could you add the prefix: SPARK-31256.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm, a bit late...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😂
…olumns
For example, for the following `df`:
```
val schema = new StructType()
.add("c1", new StructType()
.add("c1-1", StringType)
.add("c1-2", StringType))
val data = Seq(Row(Row(null, "a2")), Row(Row("b1", "b2")), Row(null))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show
+--------+
| c1|
+--------+
| [, a2]|
|[b1, b2]|
| null|
+--------+
```
In Spark 2.4.4,
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
| c1|
+--------+
|[b1, b2]|
+--------+
```
In Spark 2.4.5 or Spark 3.0.0-preview2, if nested columns are specified, they are ignored.
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
| c1|
+--------+
| [, a2]|
|[b1, b2]|
| null|
+--------+
```
This seems like a regression.
Now, the nested column can be specified:
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
| c1|
+--------+
|[b1, b2]|
+--------+
```
Also, if `*` is specified as a column, it will throw an `AnalysisException` that `*` cannot be resolved, which was the behavior in 2.4.4. Currently, in master, it has no effect.
Updated existing tests.
Closes #28266 from imback82/SPARK-31256.
Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d7499ae)
Signed-off-by: Wenchen Fan <[email protected]>
|
merging to master/3.0/2.4 |
|
So, SPARK-31256 made a regression at 2.4.5 and this recovers it? |
|
@dongjoon-hyun yes |
|
Thank you for confirmation~ |
What changes were proposed in this pull request?
#26700 removed the ability to drop a row whose nested column value is null.
For example, for the following
df:In Spark 2.4.4,
In Spark 2.4.5 or Spark 3.0.0-preview2, if nested columns are specified, they are ignored.
Why are the changes needed?
This seems like a regression.
Does this PR introduce any user-facing change?
Now, the nested column can be specified:
Also, if
*is specified as a column, it will throw anAnalysisExceptionthat*cannot be resolved, which was the behavior in 2.4.4. Currently, in master, it has no effect.How was this patch tested?
Updated existing tests.