
Conversation

@imback82
Contributor

What changes were proposed in this pull request?

#26700 removed the ability to drop a row whose nested column value is null.

For example, for the following df:

```
val schema = new StructType()
  .add("c1", new StructType()
    .add("c1-1", StringType)
    .add("c1-2", StringType))
val data = Seq(Row(Row(null, "a2")), Row(Row("b1", "b2")), Row(null))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```

In Spark 2.4.4,

```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```

In Spark 2.4.5 or Spark 3.0.0-preview2, if nested columns are specified, they are ignored.

```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```

Why are the changes needed?

This seems like a regression.
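
For reference, the intended behavior of `df.na.drop("any", Seq("c1.c1-1"))` is equivalent to filtering on the nested field directly. The snippet below is a minimal sketch of that equivalent predicate (a user-level workaround, not this PR's implementation), reusing the example `df` above:

```
import org.apache.spark.sql.functions.col

// Keep only rows where the nested field c1.c1-1 is non-null.
// Rows where c1 itself is null are dropped as well, since c1.c1-1 is then null.
val dropped = df.where(col("c1.c1-1").isNotNull)
dropped.show()
// +--------+
// |      c1|
// +--------+
// |[b1, b2]|
// +--------+
```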

Does this PR introduce any user-facing change?

Now, the nested column can be specified:

```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```

Also, if * is specified as a column, it will throw an AnalysisException that * cannot be resolved, which was the behavior in 2.4.4. Currently, in master, it has no effect.

How was this patch tested?

Updated existing tests.

```
val exception = intercept[AnalysisException] {
  df.na.drop("any", Seq("*"))
}
assert(exception.getMessage.contains("Cannot resolve column name \"*\""))
```
Contributor Author


Note that this was the behavior in Spark 2.4.4. We can handle this more gracefully (e.g., use outputAttributes) if we need to.

On a side note, for fill, * is ignored in Spark 2.4.4.

```
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data), schema)
// Nested columns are ignored for fill().
checkAnswer(df.na.fill("a1", Seq("c1.c1-1")), df)
```
Contributor Author


Note that nested columns are ignored for fill in Spark 2.4.4.
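
Since `fill` ignores nested columns, one possible user-level workaround is to rebuild the struct with the filled field. The following is a minimal sketch under the example schema above (the fill value "a1" is the one used in the test); note that it also materializes a non-null `c1` for rows where the struct itself was null:

```
import org.apache.spark.sql.functions.{coalesce, col, lit, struct}

// Replace nulls in c1.c1-1 with "a1" by reconstructing the c1 struct.
// Caveat: rows where c1 itself is null end up with a non-null struct.
val filled = df.withColumn("c1",
  struct(
    coalesce(col("c1.c1-1"), lit("a1")).as("c1-1"),
    col("c1.c1-2").as("c1-2")))
```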

@imback82
Contributor Author

@cloud-fan Please let me know if this PR (going back to 2.4.4 behavior) makes sense. Thanks!

@imback82 imback82 changed the title [SPARK-31256][SQL] Dropna should work for nested columns [SPARK-31256][SQL] DataFrameNaFunctions.drop should work for nested columns Apr 19, 2020
@TJX2014
Contributor

TJX2014 commented Apr 20, 2020

Nice. Duplicate columns hit the same issue as struct aliases, which do not work in the toAttributes method of DataFrameNaFunctions.

@SparkQA

SparkQA commented Apr 20, 2020

Test build #121490 has finished for PR 28266 at commit 283fee1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@cloud-fan cloud-fan left a comment


good catch!

@cloud-fan cloud-fan closed this in d7499ae Apr 20, 2020
cloud-fan pushed a commit that referenced this pull request Apr 20, 2020
[SPARK-31256][SQL] DataFrameNaFunctions.drop should work for nested columns

### What changes were proposed in this pull request?

#26700 removed the ability to drop a row whose nested column value is null.

For example, for the following `df`:
```
val schema = new StructType()
  .add("c1", new StructType()
    .add("c1-1", StringType)
    .add("c1-2", StringType))
val data = Seq(Row(Row(null, "a2")), Row(Row("b1", "b2")), Row(null))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```
In Spark 2.4.4,
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```
In Spark 2.4.5 or Spark 3.0.0-preview2, if nested columns are specified, they are ignored.
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```
### Why are the changes needed?

This seems like a regression.

### Does this PR introduce any user-facing change?

Now, the nested column can be specified:
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```

Also, if `*` is specified as a column, it will throw an `AnalysisException` that `*` cannot be resolved, which was the behavior in 2.4.4. Currently, in master, it has no effect.

### How was this patch tested?

Updated existing tests.

Closes #28266 from imback82/SPARK-31256.

Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d7499ae)
Signed-off-by: Wenchen Fan <[email protected]>

```
checkAnswer(df.select("c1.c1-1"),
  Row(null) :: Row("b1") :: Row(null) :: Nil)

test("drop with nested columns") {
```
Member


nit: This looks like a bug fix, so could you add the SPARK-31256 prefix to the test name?
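
For illustration, the suggested convention would look like this (only the test-name prefix changes):

```
test("SPARK-31256: drop with nested columns") {
  // ...
}
```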

Member


nvm, a bit late...

Contributor


😂

cloud-fan pushed a commit that referenced this pull request Apr 20, 2020
[SPARK-31256][SQL] DataFrameNaFunctions.drop should work for nested columns

@cloud-fan
Contributor

merging to master/3.0/2.4

@dongjoon-hyun
Member

So, SPARK-31256 is a regression introduced in 2.4.5, and this PR recovers from it?

@cloud-fan
Contributor

@dongjoon-hyun yes

@dongjoon-hyun
Member

Thank you for confirmation~
