[SPARK-31590][SQL] Metadata-only queries should not include subquery in partition filters #28383

cxzl25 · 2020-04-28T04:17:56Z

What changes were proposed in this pull request?

Metadata-only queries should not include subquery in partition filters.

Why are the changes needed?

When the OptimizeMetadataOnlyQuery rule is applied again, it fails with the exception Cannot evaluate expression: scalar-subquery.

Does this PR introduce any user-facing change?

Yes. When spark.sql.optimizer.metadataOnly is enabled, queries that include a subquery in partition filters now succeed.

How was this patch tested?

Added a unit test.

sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala

HyukjinKwon · 2020-05-01T04:26:17Z

ok to test

HyukjinKwon · 2020-05-01T04:57:54Z

Looks fine to me.

sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuerySuite.scala

SparkQA · 2020-05-01T08:57:34Z

Test build #122157 has finished for PR 28383 at commit 86f28d5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-05-01T09:50:56Z

Test build #122161 has finished for PR 28383 at commit c848508.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-05-01T11:46:51Z

Test build #122165 has finished for PR 28383 at commit 4abf2f5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-05-01T16:41:38Z

@viirya and @cloud-fan fyi

viirya

Is it a problem that normalizedFilters contains a subquery expression?

Running a query like the following:

"""
      |SELECT partcol1, MAX(partcol2) AS partcol2
      |FROM srcpart
      |WHERE partcol1 = (
      |  SELECT MAX(col1)
      |  FROM srcpart
      |)
      |AND partcol2= 'event'
      |GROUP BY partcol1
      |""".stripMargin

== Physical Plan ==
SortAggregate(key=[partcol1#28], functions=[max(partcol2#29)])
+- *(2) Sort [partcol1#28 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(partcol1#28, 5), true, [id=#3464]
      +- SortAggregate(key=[partcol1#28], functions=[partial_max(partcol2#29)])
         +- *(1) Sort [partcol1#28 ASC NULLS FIRST], false, 0
            +- *(1) Filter (((isnotnull(partcol1#28) AND isnotnull(partcol2#29)) AND (partcol1#28 = Subquery scalar-subquery#247, [id=#3452])) AND (partcol2#29 = event))
               :  +- Subquery scalar-subquery#247, [id=#3452]
               :     +- *(2) HashAggregate(keys=[], functions=[max(col1#26)])
               :        +- Exchange SinglePartition, true, [id=#3448]
               :           +- *(1) HashAggregate(keys=[], functions=[partial_max(col1#26)])
               :              +- *(1) Project [col1#26]
               :                 +- *(1) ColumnarToRow
               :                    +- FileScan parquet default.srcpart[col1#26,partcol1#28,partcol2#29] Batched: true, DataFilters: [], Format: Parquet, Location: CatalogFileIndex[file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col1:int>
               +- *(1) LocalTableScan <empty>, [partcol1#28, partcol2#29]

It looks OK.

HyukjinKwon · 2020-05-02T03:15:37Z

@cxzl25 can you revert the test back to the original one and focus on the cleanup? The previous test case was valid and failed on master. The fix itself seems right too.

SparkQA · 2020-05-02T07:05:01Z

Test build #122190 has finished for PR 28383 at commit a7638d6.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-05-02T07:05:01Z

Test build #122189 has finished for PR 28383 at commit 7046db8.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-05-02T08:48:15Z

retest this please

SparkQA · 2020-05-02T15:14:15Z

Test build #122210 has finished for PR 28383 at commit a7638d6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-05-03T02:29:45Z

@cxzl25, I think #28383 (comment) isn't fully addressed. Can you fix the PR description to explain fully what this PR proposes? This PR doesn't filter unevaluable expressions but only sub-queries because their results are only available during runtime.

sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuerySuite.scala

SparkQA · 2020-05-04T13:03:09Z

Test build #122256 has finished for PR 28383 at commit d601af4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-05-05T01:20:38Z

sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala

        case a: AttributeReference =>
          a.withName(relation.output.find(_.semanticEquals(a)).get.name)
      }
-    }


Could you filter out this unsupported case outside replaceTableScanWithPartitionMetadata? (I think this filtering is not related to normalization.) For example, in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala#L53-L55

maropu · 2020-05-05T01:24:42Z

Applying OptimizeMetadataOnlyQuery rule will generate scalar-subquery.

Is this statement true? It seems the test query itself has a subquery.

// Analyzed plan of the test query
Aggregate [partcol1#40], [partcol1#40, max(partcol2#41) AS partcol2#71]
+- Filter ((partcol1#40 = scalar-subquery#70 []) AND (partcol2#41 = even))
   :  +- Aggregate [max(partcol1#40) AS max(partcol1)#73]
   :     +- SubqueryAlias spark_catalog.default.srcpart
   :        +- Relation[col1#38,col2#39,partcol1#40,partcol2#41] parquet
   +- SubqueryAlias spark_catalog.default.srcpart
      +- Relation[col1#38,col2#39,partcol1#40,partcol2#41] parquet

I think the root cause is just that unsupported partitionFilters (those containing a subquery) are passed into FileIndex.listFiles.
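The root cause described above can be illustrated with a small, self-contained sketch. It uses a toy expression model rather than Spark's Catalyst classes: `Expr`, `PartitionListingSketch`, `listPartitions`, and the local `hasSubquery` are hypothetical names introduced only for illustration, and the real rule works on Catalyst expressions and `FileIndex.listFiles`. The sketch assumes that listing partitions means evaluating each filter against partition values at planning time, which is impossible for a scalar subquery whose result exists only at run time.

```scala
// Toy model only: these types stand in for Catalyst expressions and are not Spark APIs.
sealed trait Expr
final case class Attr(name: String) extends Expr
final case class Lit(value: String) extends Expr
final case class EqualTo(left: Expr, right: Expr) extends Expr
final case class ScalarSubquery(sql: String) extends Expr // result exists only at run time

object PartitionListingSketch {
  type Partition = Map[String, String] // partition column name -> partition value

  // Resolve a leaf expression against one partition's values.
  // A subquery has no value at planning time, so evaluation fails.
  private def value(e: Expr, p: Partition): String = e match {
    case Attr(n)           => p(n)
    case Lit(v)            => v
    case ScalarSubquery(_) =>
      throw new UnsupportedOperationException("Cannot evaluate expression: scalar-subquery")
    case _: EqualTo        =>
      throw new IllegalArgumentException("nested comparisons are not modelled")
  }

  // Evaluate one predicate against one partition.
  private def matches(f: Expr, p: Partition): Boolean = f match {
    case EqualTo(l, r) => value(l, p) == value(r, p)
    case other         => throw new IllegalArgumentException(s"not a predicate: $other")
  }

  // True if the expression tree contains a subquery anywhere.
  def hasSubquery(e: Expr): Boolean = e match {
    case _: ScalarSubquery => true
    case EqualTo(l, r)     => hasSubquery(l) || hasSubquery(r)
    case _                 => false
  }

  // Hypothetical stand-in for a metadata-only partition listing.
  def listPartitions(all: Seq[Partition], filters: Seq[Expr]): Seq[Partition] =
    all.filter(p => filters.forall(matches(_, p)))
}
```

With filters equivalent to `partcol1 = (subquery)` and `partcol2 = 'event'`, calling `listPartitions` on the full filter list throws, while calling it on `filters.filterNot(PartitionListingSketch.hasSubquery)` prunes on `partcol2` alone. The subquery predicate is not lost in the real plan: in the physical plan posted earlier in this thread it still appears in the Filter node above the LocalTableScan.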

cloud-fan · 2020-05-05T05:21:01Z

Shall we remove OptimizeMetadataOnlyQuery? IIRC it has a correctness issue and we disable it by default. cc @gengliangwang

gengliangwang · 2020-05-05T06:19:14Z

Shall we remove OptimizeMetadataOnlyQuery? IIRC it has a correctness issue and we disable it by default. cc @gengliangwang

On second thought: I think we should keep it for two reasons:

1. When users are 100% sure that their data won't contain empty partitions, they can still turn it on.
2. Future developers may come up with the same idea, create exactly the same rule, and enable it by default...
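The thread does not spell out the correctness issue. Assuming it is the mismatch that arises when a partition exists in the catalog but holds no rows, the following toy sketch shows how an answer computed from partition metadata can differ from one computed by scanning data. `Table`, `maxPartitionFromMetadata`, and `maxPartitionFromData` are hypothetical names, not Spark APIs.

```scala
object EmptyPartitionSketch {
  // Hypothetical table: partition value -> rows stored in that partition.
  final case class Table(partitions: Map[Int, Seq[String]])

  // Answer derived from partition metadata only: every registered partition counts,
  // even one that contains no rows.
  def maxPartitionFromMetadata(t: Table): Option[Int] =
    t.partitions.keys.reduceOption(_ max _)

  // Answer derived from scanning rows: a partition with no rows contributes nothing.
  def maxPartitionFromData(t: Table): Option[Int] =
    t.partitions.collect { case (k, rows) if rows.nonEmpty => k }.reduceOption(_ max _)
}
```

For a table with a populated partition `1` and an empty partition `2`, the metadata-based answer is `Some(2)` while the data-based answer is `Some(1)`. The two agree only when no partition is empty, which matches the condition stated in the first reason above.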

HyukjinKwon · 2020-05-05T11:52:28Z

I think we can just mention that using that configuration is discouraged for now. We can't just remove the configuration without deprecation anyway, and the fix looks correct.

HyukjinKwon · 2020-05-06T01:56:21Z

Merged to master, branch-3.0, and branch-2.4.

…in partition filters

### What changes were proposed in this pull request?

Metadata-only queries should not include subquery in partition filters.

### Why are the changes needed?

Apply the `OptimizeMetadataOnlyQuery` rule again, will get the exception `Cannot evaluate expression: scalar-subquery`.

### Does this PR introduce any user-facing change?

Yes. When `spark.sql.optimizer.metadataOnly` is enabled, it succeeds when the queries include subquery in partition filters.

### How was this patch tested?

add UT

Closes #28383 from cxzl25/fix_SPARK-31590.

Authored-by: sychen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 588966d)
Signed-off-by: HyukjinKwon <[email protected]>

The filter used by Metadata-only queries should not have Unevaluable

c34f030

probot-autolabeler bot added the SQL label Apr 28, 2020

HyukjinKwon reviewed May 1, 2020

sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala (outdated)

use filterNot(SubqueryExpression.hasSubquery)

86f28d5

HyukjinKwon reviewed May 1, 2020

sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuerySuite.scala (outdated)

cxzl25 added 2 commits May 1, 2020 13:28

change ut

c848508

reuse testMetadataOnly

4abf2f5

HyukjinKwon approved these changes May 1, 2020

maropu reviewed May 1, 2020

sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuerySuite.scala (outdated)

maropu reviewed May 1, 2020

sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala

maropu reviewed May 1, 2020

sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuerySuite.scala

viirya reviewed May 1, 2020

cxzl25 changed the title ~~[SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable~~ [SPARK-31590][SQL] The filter used by Metadata-only queries should filter out all the unevaluable expr May 2, 2020

cxzl25 added 2 commits May 2, 2020 11:18

nit

7046db8

ut use existing partition value

a7638d6

HyukjinKwon changed the title ~~[SPARK-31590][SQL] The filter used by Metadata-only queries should filter out all the unevaluable expr~~ [SPARK-31590][SQL] Metadata-only queries should not include subquery in partition filters May 3, 2020

HyukjinKwon reviewed May 3, 2020

sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuerySuite.scala (outdated)

change ut name

d601af4

maropu reviewed May 5, 2020

HyukjinKwon closed this in 588966d May 6, 2020
