[SPARK-20718][SQL] FileSourceScanExec with different filter orders should be the same after canonicalization #17959
also cc @gatorsmile
LGTM
Test build #76843 has finished for PR 17959 at commit
    None)
  }

  private def canonicalizeFilters(filters: Seq[Expression], output: Seq[Attribute])
Add a function description?
OK
How about
…ould be the same after canonicalization

## What changes were proposed in this pull request?

Since `constraints` in `QueryPlan` is a set, the order of filters can differ. Usually this is ok because of canonicalization. However, in `FileSourceScanExec`, its data filters and partition filters are sequences, and their orders are not canonicalized. So `def sameResult` returns different results for different orders of data/partition filters. This leads to, e.g. different decision for `ReuseExchange`, and thus results in unstable performance.

## How was this patch tested?

Added a new test for `FileSourceScanExec.sameResult`.

Author: wangzhenhua <[email protected]>

Closes #17959 from wzhfy/canonicalizeFileSourceScanExec.

(cherry picked from commit c8da535)
Signed-off-by: Wenchen Fan <[email protected]>
Merged to master/2.2. Please send a follow-up PR to address @gatorsmile's comments, thanks!
@gatorsmile Right, thanks for pointing this out!
What changes were proposed in this pull request?
Since `constraints` in `QueryPlan` is a set, the order of filters can differ. Usually this is OK because of canonicalization. However, in `FileSourceScanExec`, the data filters and partition filters are sequences, and their order is not canonicalized. So `def sameResult` returns different results for different orders of data/partition filters. This leads to, e.g., different decisions for `ReuseExchange`, and thus results in unstable performance.
How was this patch tested?
Added a new test for `FileSourceScanExec.sameResult`.
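
The ordering problem described above can be illustrated with a small self-contained sketch. It does not use Spark: `Expr`, `Attr`, `Lit`, `GreaterThan`, `LessThan`, and `ToyScan` are hypothetical stand-ins for Catalyst expressions and a scan node, and sorting by `toString` is only one simple way to obtain a deterministic order, not necessarily what this patch does.

```scala
// Toy stand-ins for Catalyst expressions. These are NOT Spark classes.
sealed trait Expr
final case class Attr(name: String) extends Expr
final case class Lit(value: Int) extends Expr
final case class GreaterThan(left: Expr, right: Expr) extends Expr
final case class LessThan(left: Expr, right: Expr) extends Expr

// Hypothetical scan node that stores its filters in a Seq,
// so case-class equality depends on filter order.
final case class ToyScan(table: String, dataFilters: Seq[Expr]) {
  // Order-sensitive comparison: reproduces the problem from the description.
  def naiveSameResult(other: ToyScan): Boolean = this == other

  // Canonical form (assumption: sort by a deterministic string key).
  def canonicalized: ToyScan =
    copy(dataFilters = dataFilters.sortBy(_.toString))

  // Compare canonical forms so filter order no longer matters.
  def sameResult(other: ToyScan): Boolean =
    canonicalized == other.canonicalized
}

object CanonicalizeDemo {
  def main(args: Array[String]): Unit = {
    val f1 = GreaterThan(Attr("a"), Lit(1))
    val f2 = LessThan(Attr("b"), Lit(10))
    val s1 = ToyScan("t", Seq(f1, f2))
    val s2 = ToyScan("t", Seq(f2, f1))

    println(s1.naiveSameResult(s2)) // false: same filters, different order
    println(s1.sameResult(s2))      // true: order removed by canonicalization
  }
}
```

In this toy model, two scans that differ only in filter order compare as different until both are canonicalized, which is the kind of mismatch that would prevent a rule such as `ReuseExchange` from recognizing equivalent plans.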