[SPARK-34833][SQL] Apply right-padding correctly for correlated subqueries #31940

maropu · 2021-03-23T07:06:51Z

What changes were proposed in this pull request?

This PR fixes a bug where right-padding is not applied to char types inside correlated subqueries.
For example, the query below returns nothing on master, but the correct result is c.

scala> sql(s"CREATE TABLE t1(v VARCHAR(3), c CHAR(5)) USING parquet")
scala> sql(s"CREATE TABLE t2(v VARCHAR(5), c CHAR(7)) USING parquet")
scala> sql("INSERT INTO t1 VALUES ('c', 'b')")
scala> sql("INSERT INTO t2 VALUES ('a', 'b')")
scala> val df = sql("""
  |SELECT v FROM t1
  |WHERE 'a' IN (SELECT v FROM t2 WHERE t2.c = t1.c )""".stripMargin)

scala> df.show()
+---+
|  v|
+---+
+---+

This is because ApplyCharTypePadding does not handle the case above: the outer reference t1.c is not right-padded before it is compared with t2.c. This PR modifies ApplyCharTypePadding to handle it correctly.
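The mismatch can be illustrated outside Spark with a minimal plain-Scala sketch. It assumes each CHAR(n) value is held space-padded to n characters; the object name `CharPaddingSketch` and the helper `rpad` are hypothetical and are not Spark's implementation.

```scala
// A sketch only: hypothetical names, not code from ApplyCharTypePadding.
object CharPaddingSketch {
  // Right-pads `s` with spaces up to `len` characters.
  // Unlike SQL rpad, this helper does not truncate longer inputs.
  def rpad(s: String, len: Int): String = s.padTo(len, ' ')

  // Assumed in-memory values of the CHAR columns from the example.
  val t1c: String = rpad("b", 5) // t1.c is CHAR(5)
  val t2c: String = rpad("b", 7) // t2.c is CHAR(7)

  // Comparing without padding the outer reference: lengths differ, so no match.
  val withoutPadding: Boolean = t2c == t1c

  // Comparing after padding the outer reference to 7 characters: match.
  val withPadding: Boolean = t2c == rpad(t1c, 7)
}
```

Under these assumptions `withoutPadding` is false and `withPadding` is true, which mirrors the empty result before the fix and the `rpad(outer(c), 7, ' ')` comparison in the analyzed plan after it.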

// Before this PR:
scala> df.explain(true)
== Analyzed Logical Plan ==
v: string
Project [v#13]
+- Filter a IN (list#12 [c#14])
   :  +- Project [v#15]
   :     +- Filter (c#16 = outer(c#14))
   :        +- SubqueryAlias spark_catalog.default.t2
   :           +- Relation default.t2[v#15,c#16] parquet
   +- SubqueryAlias spark_catalog.default.t1
      +- Relation default.t1[v#13,c#14] parquet

scala> df.show()
+---+
|  v|
+---+
+---+

// After this PR:
scala> df.explain(true)
== Analyzed Logical Plan ==
v: string
Project [v#43]
+- Filter a IN (list#42 [c#44])
   :  +- Project [v#45]
   :     +- Filter (c#46 = rpad(outer(c#44), 7,  ))
   :        +- SubqueryAlias spark_catalog.default.t2
   :           +- Relation default.t2[v#45,c#46] parquet
   +- SubqueryAlias spark_catalog.default.t1
      +- Relation default.t1[v#43,c#44] parquet

scala> df.show()
+---+
|  v|
+---+
|  c|
+---+

This fix is related to TPCDS q17; the query returns nothing because of this bug: https://github.com/apache/spark/pull/31886/files#r599333799

Why are the changes needed?

Bugfix.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests added.

maropu · 2021-03-23T07:09:46Z

cc: @cloud-fan @yaooqinn

sql/core/src/test/scala/org/apache/spark/sql/CharVarcharTestSuite.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

SparkQA · 2021-03-23T08:12:37Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40970/

SparkQA · 2021-03-23T08:22:21Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40970/

SparkQA · 2021-03-23T08:45:20Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40973/

SparkQA · 2021-03-23T10:03:58Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40973/

dongjoon-hyun · 2021-03-23T11:28:01Z

cc @peter-toth

SparkQA · 2021-03-23T13:24:04Z

Test build #136386 has finished for PR 31940 at commit da849aa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-23T14:40:51Z

Test build #136389 has finished for PR 31940 at commit d02fbec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-23T15:35:29Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40989/

SparkQA · 2021-03-23T16:51:34Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40989/

peter-toth · 2021-03-23T16:57:53Z

The PR LGTM.

But I have a quick question, a bit unrelated to this PR but related to fixed-length string columns. Is the following behavior expected in Spark?

sql(s"CREATE TABLE t(c3 CHAR(3), c5 CHAR(5)) USING parquet")
sql("INSERT INTO t VALUES ('a', 'a')")
sql("SELECT c3, c5, c3 = c5, upper(c3) = upper(c5) FROM t").show()

+---+-----+---------+-----------------------+
| c3|   c5|(c3 = c5)|(upper(c3) = upper(c5))|
+---+-----+---------+-----------------------+
|a  |a    |     true|                  false|
+---+-----+---------+-----------------------+

But in PostgreSQL upper(c3) = upper(c5) is true:

petertoth=# SELECT c3, c5, c3 = c5, upper(c3) = upper(c5) FROM t;
 c3  |  c5   | ?column? | ?column?
-----+-------+----------+----------
 a   | a     | t        | t

sql/core/src/test/scala/org/apache/spark/sql/CharVarcharTestSuite.scala

SparkQA · 2021-03-23T19:32:29Z

Test build #136405 has finished for PR 31940 at commit f869736.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2021-03-23T22:35:54Z

sql("SELECT c3, c5, c3 = c5, upper(c3) = upper(c5) FROM t").show()

@peter-toth Yea, that should be a bug. Do you want to fix it? If you don't have time, I'll take it.

SparkQA · 2021-03-24T01:12:43Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41011/

SparkQA · 2021-03-24T02:04:42Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41011/

yaooqinn · 2021-03-24T02:05:21Z

The PR LGTM.

But I have a quick question, a bit unrelated to this PR but related to fixed-length string columns. Is the following behavior expected in Spark?

sql(s"CREATE TABLE t(c3 CHAR(3), c5 CHAR(5)) USING parquet")
sql("INSERT INTO t VALUES ('a', 'a')")
sql("SELECT c3, c5, c3 = c5, upper(c3) = upper(c5) FROM t").show()

+---+-----+---------+-----------------------+
| c3|   c5|(c3 = c5)|(upper(c3) = upper(c5))|
+---+-----+---------+-----------------------+
|a  |a    |     true|                  false|
+---+-----+---------+-----------------------+

But in PostgreSQL upper(c3) = upper(c5) is true:

petertoth=# SELECT c3, c5, c3 = c5, upper(c3) = upper(c5) FROM t;
 c3  |  c5   | ?column? | ?column?
-----+-------+----------+----------
 a   | a     | t        | t

cc @cloud-fan, we seem to have discussed this when we prohibited char/varchar in UDFs

SparkQA · 2021-03-24T03:55:38Z

Test build #136427 has finished for PR 31940 at commit acdff36.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-03-24T06:59:58Z

upper(c3) = upper(c5)

This is a bit hard to generalize and I can't come up with a general pattern that can trigger char type padding. What about something like concat(c1, c2) = c3? Ideas are welcome!
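The discrepancy raised in this thread can be reproduced with a small plain-Scala sketch. It assumes CHAR(n) values are space-padded to n characters, that a direct comparison of two char columns pads the shorter side, and that the result of `upper` is compared as a plain string with no padding. The object name `UpperComparisonSketch` and the helper `rpad` are hypothetical, not Spark code.

```scala
// A sketch only: hypothetical names; models the behavior discussed above.
object UpperComparisonSketch {
  // Right-pads `s` with spaces up to `len` characters (no truncation).
  def rpad(s: String, len: Int): String = s.padTo(len, ' ')

  val c3: String = rpad("a", 3) // value of a CHAR(3) column
  val c5: String = rpad("a", 5) // value of a CHAR(5) column

  // c3 = c5: the shorter char column is padded to the longer length.
  val directEq: Boolean = rpad(c3, 5) == c5

  // upper(c3) = upper(c5): assumed to be compared without any padding.
  val upperEq: Boolean = c3.toUpperCase == c5.toUpperCase

  // Padding both results to a common length makes the comparison succeed.
  val upperEqPadded: Boolean =
    rpad(c3.toUpperCase, 5) == rpad(c5.toUpperCase, 5)
}
```

Here `directEq` is true and `upperEq` is false, matching the Spark output quoted earlier, while `upperEqPadded` is true, matching the PostgreSQL result. Which expressions should trigger that padding is the open question in this comment.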

sql/core/src/test/scala/org/apache/spark/sql/CharVarcharTestSuite.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

SparkQA · 2021-03-24T13:18:01Z

Test build #136462 has started for PR 31940 at commit 9142bfc.

SparkQA · 2021-03-24T17:07:04Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41046/

SparkQA · 2021-03-24T18:33:56Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41046/

Closes #31940 from maropu/FixCharPadding.
Authored-by: Takeshi Yamamuro <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]> (cherry picked from commit 150769b) Signed-off-by: Takeshi Yamamuro <[email protected]>

maropu · 2021-03-24T23:32:46Z

Thanks for the reviews, all! Merged to master/branch-3.1.

### What changes were proposed in this pull request? This is a follow-up of #31940 . This PR generalizes the matching of attributes and outer references, so that outer references are handled everywhere. Note that, currently correlated subquery has a lot of limitations in Spark, and the newly covered cases are not possible to happen. So this PR is a code refactor. ### Why are the changes needed? code cleanup ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #31959 from cloud-fan/follow. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]>

### What changes were proposed in this pull request? This is a follow-up of #31940 . This PR generalizes the matching of attributes and outer references, so that outer references are handled everywhere. Note that, currently correlated subquery has a lot of limitations in Spark, and the newly covered cases are not possible to happen. So this PR is a code refactor. ### Why are the changes needed? code cleanup ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #31959 from cloud-fan/follow. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]> (cherry picked from commit 658e95c) Signed-off-by: Takeshi Yamamuro <[email protected]>


Fix

da849aa

github-actions bot added the SQL label Mar 23, 2021

maropu mentioned this pull request Mar 23, 2021

[SPARK-34795][SQL][TESTS] Adds a new job in GitHub Actions to check the output of TPC-DS queries #31886

Closed

yaooqinn reviewed Mar 23, 2021

sql/core/src/test/scala/org/apache/spark/sql/CharVarcharTestSuite.scala Outdated

review

d02fbec

cloud-fan reviewed Mar 23, 2021

sql/core/src/test/scala/org/apache/spark/sql/CharVarcharTestSuite.scala

cloud-fan reviewed Mar 23, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Review

f869736

viirya reviewed Mar 23, 2021

sql/core/src/test/scala/org/apache/spark/sql/CharVarcharTestSuite.scala Outdated

viirya reviewed Mar 23, 2021

sql/core/src/test/scala/org/apache/spark/sql/CharVarcharTestSuite.scala Outdated

Add more tests

acdff36

yaooqinn approved these changes Mar 24, 2021

cloud-fan reviewed Mar 24, 2021

sql/core/src/test/scala/org/apache/spark/sql/CharVarcharTestSuite.scala Outdated

cloud-fan reviewed Mar 24, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala Outdated

review

9142bfc

cloud-fan approved these changes Mar 24, 2021

viirya approved these changes Mar 24, 2021

maropu closed this in 150769b Mar 24, 2021

maropu mentioned this pull request Mar 25, 2021

[SPARK-34822][SQL] Update the plan stability golden files even if only the explain.txt changes #31957

Closed

cloud-fan mentioned this pull request Mar 25, 2021

[SPARK-34833][SQL][FOLLOWUP] Handle outer references in all the places #31959

Closed
