Skip to content

Conversation

@cxzl25
Copy link
Contributor

@cxzl25 cxzl25 commented Apr 28, 2020

What changes were proposed in this pull request?

Metadata-only queries should not include subquery in partition filters.

Why are the changes needed?

Apply the OptimizeMetadataOnlyQuery rule again, will get the exception Cannot evaluate expression: scalar-subquery.

Does this PR introduce any user-facing change?

Yes. When spark.sql.optimizer.metadataOnly is enabled, it succeeds when the queries include subquery in partition filters.

How was this patch tested?

add UT

@HyukjinKwon
Copy link
Member

ok to test

@HyukjinKwon
Copy link
Member

Looks fine to me.

@SparkQA
Copy link

SparkQA commented May 1, 2020

Test build #122157 has finished for PR 28383 at commit 86f28d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 1, 2020

Test build #122161 has finished for PR 28383 at commit c848508.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 1, 2020

Test build #122165 has finished for PR 28383 at commit 4abf2f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

@viirya and @cloud-fan fyi

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a problem that normalizedFilters contains subquery expression?

By running a query like:

"""
      |SELECT partcol1, MAX(partcol2) AS partcol2
      |FROM srcpart
      |WHERE partcol1 = (
      |  SELECT MAX(col1)
      |  FROM srcpart
      |)
      |AND partcol2= 'event'
      |GROUP BY partcol1
      |""".stripMargin
== Physical Plan ==
SortAggregate(key=[partcol1#28], functions=[max(partcol2#29)])
+- *(2) Sort [partcol1#28 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(partcol1#28, 5), true, [id=#3464]
      +- SortAggregate(key=[partcol1#28], functions=[partial_max(partcol2#29)])
         +- *(1) Sort [partcol1#28 ASC NULLS FIRST], false, 0
            +- *(1) Filter (((isnotnull(partcol1#28) AND isnotnull(partcol2#29)) AND (partcol1#28 = Subquery scalar-subquery#247, [id=#3452])) AND (partcol2#29 = event))
               :  +- Subquery scalar-subquery#247, [id=#3452]
               :     +- *(2) HashAggregate(keys=[], functions=[max(col1#26)])
               :        +- Exchange SinglePartition, true, [id=#3448]
               :           +- *(1) HashAggregate(keys=[], functions=[partial_max(col1#26)])
               :              +- *(1) Project [col1#26]
               :                 +- *(1) ColumnarToRow
               :                    +- FileScan parquet default.srcpart[col1#26,partcol1#28,partcol2#29] Batched: true, DataFilters: [], Format: Parquet, Location: CatalogFileIndex[file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col1:int>
               +- *(1) LocalTableScan <empty>, [partcol1#28, partcol2#29]

Looks it is ok.

@HyukjinKwon
Copy link
Member

@cxzl25 can you revert the test back to the original one and focus on the cleanup? The case before was a valid, and failed in the master. The fix itself seems right too.

@cxzl25 cxzl25 changed the title [SPARK-31590][SQL] The filter used by Metadata-only queries should not have Unevaluable [SPARK-31590][SQL] The filter used by Metadata-only queries should filter out all the unevaluable expr May 2, 2020
@SparkQA
Copy link

SparkQA commented May 2, 2020

Test build #122190 has finished for PR 28383 at commit a7638d6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 2, 2020

Test build #122189 has finished for PR 28383 at commit 7046db8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented May 2, 2020

retest this please

@SparkQA
Copy link

SparkQA commented May 2, 2020

Test build #122210 has finished for PR 28383 at commit a7638d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-31590][SQL] The filter used by Metadata-only queries should filter out all the unevaluable expr [SPARK-31590][SQL] Metadata-only queries should not include subquery in partition filters May 3, 2020
@HyukjinKwon
Copy link
Member

@cxzl25, I think #28383 (comment) isn't fully addressed. Can you fix the PR description to explain fully what this PR proposes? This PR doesn't filter unevaluable expressions but only sub-queries because their results are only available during runtime.

@SparkQA
Copy link

SparkQA commented May 4, 2020

Test build #122256 has finished for PR 28383 at commit d601af4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case a: AttributeReference =>
a.withName(relation.output.find(_.semanticEquals(a)).get.name)
}
}
Copy link
Member

@maropu maropu May 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you filter out this unsupported case outside replaceTableScanWithPartitionMetadata(I think this filtering is not related to normalization)? e.g., in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala#L53-L55

@maropu
Copy link
Member

maropu commented May 5, 2020

Applying OptimizeMetadataOnlyQuery rule will generate scalar-subquery.

Is this statement true? It seems the test query itself has a subquery.

// Analyzed plan of the test query
Aggregate [partcol1#40], [partcol1#40, max(partcol2#41) AS partcol2#71]
+- Filter ((partcol1#40 = scalar-subquery#70 []) AND (partcol2#41 = even))
   :  +- Aggregate [max(partcol1#40) AS max(partcol1)#73]
   :     +- SubqueryAlias spark_catalog.default.srcpart
   :        +- Relation[col1#38,col2#39,partcol1#40,partcol2#41] parquet
   +- SubqueryAlias spark_catalog.default.srcpart
      +- Relation[col1#38,col2#39,partcol1#40,partcol2#41] parquet

I think the root cause is just that unsupported partitionFilters (subquery) is passed into FileIndex.listFiles.

@cloud-fan
Copy link
Contributor

Shall we remove OptimizeMetadataOnlyQuery? IIRC it has a correcness issue and we disable it by default. cc @gengliangwang

@gengliangwang
Copy link
Member

gengliangwang commented May 5, 2020

Shall we remove OptimizeMetadataOnlyQuery? IIRC it has a correcness issue and we disable it by default. cc @gengliangwang

On second thought: I think we should keep it for two reasons:

  1. when users are 100% sure about their data won't contain empty partition, they can still turn it on.
  2. the future developers may come up with the same idea and create exactly the same rule and enable it by default...

@HyukjinKwon
Copy link
Member

I think we can just mention that it is discouraged to use that configuration for now. We cant just remove the configuration without deprecation anyway and the fix looks correct.

@HyukjinKwon
Copy link
Member

Merged to master, branch-3.0, and branch-2.4.

HyukjinKwon pushed a commit that referenced this pull request May 6, 2020
…in partition filters

### What changes were proposed in this pull request?
Metadata-only queries should not include subquery in partition filters.

### Why are the changes needed?

Apply the `OptimizeMetadataOnlyQuery` rule again, will get the exception `Cannot evaluate expression: scalar-subquery`.

### Does this PR introduce any user-facing change?
Yes. When `spark.sql.optimizer.metadataOnly` is enabled, it succeeds when the queries include subquery in partition filters.

### How was this patch tested?

add UT

Closes #28383 from cxzl25/fix_SPARK-31590.

Authored-by: sychen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 588966d)
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request May 6, 2020
…in partition filters

### What changes were proposed in this pull request?
Metadata-only queries should not include subquery in partition filters.

### Why are the changes needed?

Apply the `OptimizeMetadataOnlyQuery` rule again, will get the exception `Cannot evaluate expression: scalar-subquery`.

### Does this PR introduce any user-facing change?
Yes. When `spark.sql.optimizer.metadataOnly` is enabled, it succeeds when the queries include subquery in partition filters.

### How was this patch tested?

add UT

Closes #28383 from cxzl25/fix_SPARK-31590.

Authored-by: sychen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 588966d)
Signed-off-by: HyukjinKwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants