[SPARK-31590][SQL] Metadata-only queries should not include subquery in partition filters #28383
Conversation
ok to test

Looks fine to me.
Test build #122157 has finished for PR 28383 at commit

Test build #122161 has finished for PR 28383 at commit

Test build #122165 has finished for PR 28383 at commit

@viirya and @cloud-fan fyi
viirya left a comment
Is it a problem that normalizedFilters contains subquery expression?
By running a query like:
"""
|SELECT partcol1, MAX(partcol2) AS partcol2
|FROM srcpart
|WHERE partcol1 = (
| SELECT MAX(col1)
| FROM srcpart
|)
|AND partcol2 = 'event'
|GROUP BY partcol1
|""".stripMargin
== Physical Plan ==
SortAggregate(key=[partcol1#28], functions=[max(partcol2#29)])
+- *(2) Sort [partcol1#28 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(partcol1#28, 5), true, [id=#3464]
+- SortAggregate(key=[partcol1#28], functions=[partial_max(partcol2#29)])
+- *(1) Sort [partcol1#28 ASC NULLS FIRST], false, 0
+- *(1) Filter (((isnotnull(partcol1#28) AND isnotnull(partcol2#29)) AND (partcol1#28 = Subquery scalar-subquery#247, [id=#3452])) AND (partcol2#29 = event))
: +- Subquery scalar-subquery#247, [id=#3452]
: +- *(2) HashAggregate(keys=[], functions=[max(col1#26)])
: +- Exchange SinglePartition, true, [id=#3448]
: +- *(1) HashAggregate(keys=[], functions=[partial_max(col1#26)])
: +- *(1) Project [col1#26]
: +- *(1) ColumnarToRow
: +- FileScan parquet default.srcpart[col1#26,partcol1#28,partcol2#29] Batched: true, DataFilters: [], Format: Parquet, Location: CatalogFileIndex[file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col1:int>
+- *(1) LocalTableScan <empty>, [partcol1#28, partcol2#29]
Looks like it is ok.
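The fix boils down to excluding any partition filter that contains a subquery before attempting metadata-only pruning. Below is a minimal self-contained sketch of that check, using a hypothetical simplified expression tree rather than Spark's actual `Expression` hierarchy (in Spark the equivalent test is `SubqueryExpression.hasSubquery`):

```scala
// Hypothetical simplified expression tree (not Spark's Expression classes).
sealed trait Expr { def children: Seq[Expr] }
case class Attr(name: String) extends Expr { def children: Seq[Expr] = Nil }
case class Lit(value: Any) extends Expr { def children: Seq[Expr] = Nil }
case class EqualTo(left: Expr, right: Expr) extends Expr {
  def children: Seq[Expr] = Seq(left, right)
}
case class ScalarSubquery(sql: String) extends Expr { def children: Seq[Expr] = Nil }

// Recursive containment check, analogous to SubqueryExpression.hasSubquery.
def hasSubquery(e: Expr): Boolean = e match {
  case _: ScalarSubquery => true
  case other             => other.children.exists(hasSubquery)
}

// Partition filters from the example query above: one references a
// subquery, one is a plain literal comparison.
val filters = Seq(
  EqualTo(Attr("partcol1"), ScalarSubquery("SELECT MAX(col1) FROM srcpart")),
  EqualTo(Attr("partcol2"), Lit("event")))

// Only subquery-free filters can be evaluated against partition metadata
// at optimization time; the subquery filter is left to runtime.
val usableForPruning = filters.filterNot(hasSubquery)
```

So for the query in this comment, only the `partcol2 = 'event'` predicate would participate in metadata-only pruning.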
@cxzl25 can you revert the test back to the original one and focus on the cleanup? The case before was valid, and failed in master. The fix itself seems right too.
Test build #122190 has finished for PR 28383 at commit

Test build #122189 has finished for PR 28383 at commit

retest this please

Test build #122210 has finished for PR 28383 at commit
@cxzl25, I think #28383 (comment) isn't fully addressed. Can you fix the PR description to explain fully what this PR proposes? This PR doesn't filter unevaluable expressions but only sub-queries, because their results are only available during runtime.
Test build #122256 has finished for PR 28383 at commit
    case a: AttributeReference =>
      a.withName(relation.output.find(_.semanticEquals(a)).get.name)
  }
}
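The hunk above renames each filter attribute to the exact name used in the relation's output. Here is a minimal self-contained sketch of that normalization, using case-insensitive name matching as a stand-in for Spark's `semanticEquals` (which actually compares expression IDs); the types are hypothetical simplifications, not Spark's `AttributeReference`:

```scala
// Hypothetical simplified attribute (not Spark's AttributeReference).
case class Attr(name: String) {
  // Stand-in for semanticEquals: the real check compares expression IDs,
  // so it matches regardless of how the name is capitalized.
  def semanticEquals(other: Attr): Boolean = name.equalsIgnoreCase(other.name)
  def withName(n: String): Attr = copy(name = n)
}

// The relation's output defines the canonical attribute names.
val relationOutput = Seq(Attr("partCol1"), Attr("partCol2"))

// Mirror of the hunk: rewrite a filter attribute to the canonical name.
def normalize(a: Attr): Attr =
  a.withName(relationOutput.find(_.semanticEquals(a)).get.name)
```

With this, a filter written as `partcol1` is normalized to the relation's `partCol1`, so later name-based comparisons against partition metadata line up.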
Could you filter out this unsupported case outside replaceTableScanWithPartitionMetadata (I think this filtering is not related to normalization)? e.g., in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala#L53-L55
Is this statement true? It seems the test query itself has a subquery. I think the root cause is just that unsupported
Shall we remove
On second thought: I think we should keep it for two reasons:

I think we can just mention that it is discouraged to use that configuration for now. We can't just remove the configuration without deprecation anyway, and the fix looks correct.
Merged to master, branch-3.0, and branch-2.4. |
Commit: [SPARK-31590][SQL] Metadata-only queries should not include subquery in partition filters

Closes #28383 from cxzl25/fix_SPARK-31590.

Authored-by: sychen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 588966d)
What changes were proposed in this pull request?

Metadata-only queries should not include subquery in partition filters.

Why are the changes needed?

Applying the `OptimizeMetadataOnlyQuery` rule again will get the exception `Cannot evaluate expression: scalar-subquery`.

Does this PR introduce any user-facing change?

Yes. When `spark.sql.optimizer.metadataOnly` is enabled, queries that include a subquery in partition filters now succeed.

How was this patch tested?

Added a unit test.
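The reported failure, `Cannot evaluate expression: scalar-subquery`, comes from eagerly evaluating a partition filter whose subquery result only exists at runtime. A minimal sketch of that failure mode, with hypothetical simplified types rather than Spark's actual code:

```scala
// Hypothetical simplified expressions (not Spark's Expression classes).
sealed trait Expr { def eval(): Any }
case class Lit(v: Any) extends Expr { def eval(): Any = v }
case class ScalarSubquery(sql: String) extends Expr {
  // A subquery has no value until the query actually runs, so eager
  // evaluation during optimization can only throw.
  def eval(): Any =
    throw new UnsupportedOperationException("Cannot evaluate expression: scalar-subquery")
}

// Metadata-only pruning evaluates filters against partition values at
// optimization time; a filter containing a subquery makes that throw.
def tryPrune(filter: Expr): Either[String, Any] =
  try Right(filter.eval())
  catch { case e: UnsupportedOperationException => Left(e.getMessage) }
```

A literal filter like `partcol2 = 'event'` evaluates fine, while a subquery filter surfaces the same kind of error this PR fixes by skipping such filters during the metadata-only rewrite.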