[SPARK-34833][SQL] Apply right-padding correctly for correlated subqueries #31940
Conversation
cc: @cloud-fan @yaooqinn
cc @peter-toth |
The PR LGTM. But I have a quick question, a bit unrelated to this PR but related to fixed-length string columns: is it expected that we get this in Spark, but in PostgreSQL
@peter-toth Yea, that should be a bug. Do you want to fix it? If you don't have time, I'll take it.
cc @cloud-fan, we seem to have discussed this when we prohibited char/varchar in UDFs.
This is a bit hard to generalize and I can't come up with a general pattern that can trigger char type padding. What about something like |
sql/core/src/test/scala/org/apache/spark/sql/CharVarcharTestSuite.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Outdated
Show resolved
Hide resolved
|
Test build #136462 has started for PR 31940 at commit |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
[SPARK-34833][SQL] Apply right-padding correctly for correlated subqueries
### What changes were proposed in this pull request?
This PR fixes a bug where right-padding is not applied to char types inside correlated subqueries.
For example, the query below returns nothing in master, but the correct result is `c`.
```
scala> sql(s"CREATE TABLE t1(v VARCHAR(3), c CHAR(5)) USING parquet")
scala> sql(s"CREATE TABLE t2(v VARCHAR(5), c CHAR(7)) USING parquet")
scala> sql("INSERT INTO t1 VALUES ('c', 'b')")
scala> sql("INSERT INTO t2 VALUES ('a', 'b')")
scala> val df = sql("""
|SELECT v FROM t1
|WHERE 'a' IN (SELECT v FROM t2 WHERE t2.c = t1.c )""".stripMargin)
scala> df.show()
+---+
| v|
+---+
+---+
```
This is because `ApplyCharTypePadding` does not handle the case above: the outer reference `t1.c` is not right-padded to the length of `t2.c`. This PR modifies the code in `ApplyCharTypePadding` to handle it correctly.
```
// Before this PR:
scala> df.explain(true)
== Analyzed Logical Plan ==
v: string
Project [v#13]
+- Filter a IN (list#12 [c#14])
: +- Project [v#15]
: +- Filter (c#16 = outer(c#14))
: +- SubqueryAlias spark_catalog.default.t2
: +- Relation default.t2[v#15,c#16] parquet
+- SubqueryAlias spark_catalog.default.t1
+- Relation default.t1[v#13,c#14] parquet
scala> df.show()
+---+
| v|
+---+
+---+
// After this PR:
scala> df.explain(true)
== Analyzed Logical Plan ==
v: string
Project [v#43]
+- Filter a IN (list#42 [c#44])
: +- Project [v#45]
: +- Filter (c#46 = rpad(outer(c#44), 7, ))
: +- SubqueryAlias spark_catalog.default.t2
: +- Relation default.t2[v#45,c#46] parquet
+- SubqueryAlias spark_catalog.default.t1
+- Relation default.t1[v#43,c#44] parquet
scala> df.show()
+---+
| v|
+---+
| c|
+---+
```
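As a quick sanity check of why the padding target is 7 here (a minimal spark-shell sketch, assuming the `t1`/`t2` tables created above): char values are stored right-padded to their declared length, so `t1.c` and `t2.c` can only compare equal once the shorter side is padded as well.
```
scala> sql("SELECT c, length(c) FROM t1").show()  // CHAR(5): 'b' is stored right-padded to length 5
scala> sql("SELECT c, length(c) FROM t2").show()  // CHAR(7): 'b' is stored right-padded to length 7
```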
This fix is related to TPCDS q17; the query returns nothing because of this bug: https://github.com/apache/spark/pull/31886/files#r599333799
### Why are the changes needed?
Bugfix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests added.
Closes #31940 from maropu/FixCharPadding.
Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit 150769b)
Signed-off-by: Takeshi Yamamuro <[email protected]>
Thanks for the reviews, all! Merged to master/branch-3.1.
### What changes were proposed in this pull request?
This is a follow-up of #31940. This PR generalizes the matching of attributes and outer references, so that outer references are handled everywhere. Note that correlated subqueries currently have a lot of limitations in Spark, and the newly covered cases cannot happen yet, so this PR is a code refactor.
### Why are the changes needed?
Code cleanup.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #31959 from cloud-fan/follow.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit 658e95c)
Signed-off-by: Takeshi Yamamuro <[email protected]>
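As a rough sketch of the generalization described above (illustrative only, not the actual Spark code; the extractor name is hypothetical), the idea is to treat a plain attribute and an attribute wrapped in `OuterReference` uniformly when looking for char-typed expressions to pad:
```
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, OuterReference}

// Hypothetical extractor: matches both `attr` and `outer(attr)`, so padding logic
// written against attributes also fires on correlated outer references.
object AttrOrOuterRef {
  def unapply(e: Expression): Option[Attribute] = e match {
    case a: Attribute                 => Some(a)
    case OuterReference(a: Attribute) => Some(a)
    case _                            => None
  }
}
```
A padding rule can then match `AttrOrOuterRef(a)` on either side of a comparison and apply the same `rpad`, regardless of whether the char-typed operand comes from the current query or from the outer query.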