-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-36444][SQL][3.1] Remove OptimizeSubqueries from batch of PartitionPruning #35431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @wangyum, I know it's a perf only change, but consider the huge improvement of sql compile performance, I think it's worth to backport to branch-3.1 |
|
Also cc @cloud-fan @maryannxue @yaooqinn |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @pan3793 .
Thank you for making a PR, but Apache Spark community follows Semantic Versioning and doesn't allow backporting of new feature or improvements.
I'd like to recommend you to use Apache Spark 3.2.1 instead. If you need this in your fork, you can backport it into your fork directly instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, it would be great if you can upgrade Kyuubi to Apache Spark 3.2.1 instead of Apache Spark 3.1.2.
Thanks @dongjoon-hyun I think maybe we can treat it as a bug based on #33664 (comment), comparing to 2.4s Analyzer/Optimizer costs 15821.6 seconds looks really unreasonable.
Thanks for tips, Kyuubi supports both Spark 3.0/3.1/3.2, and builds against Spark 3.1 in default, we are planning to switch the default to Spark 3.2 |
### What changes were proposed in this pull request? This PR propose to materialize `QueryPlan#subqueries` and pruned by `PLAN_EXPRESSION` on searching to improve the SQL compile performance. ### Why are the changes needed? We found a query in production that cost lots of time in optimize phase (also include AQE optimize phase) when enable DPP, the SQL pattern likes ``` select <cols...> from a left join b on a.<col> = b.<col> left join c on b.<col> = c.<col> left join d on c.<col> = d.<col> left join e on d.<col> = e.<col> left join f on e.<col> = f.<col> left join g on f.<col> = g.<col> left join h on g.<col> = h.<col> ... ``` SPARK-36444 significantly reduces the optimize time (exclude AQE phase), see detail at #35431, but there are still lots of time costs in `InsertAdaptiveSparkPlan` on AQE optimize phase. Before this change, the query costs 658s, after this change only costs 65s. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. Closes #35438 from pan3793/subquery. Authored-by: Cheng Pan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This is a backport PR of #33664 for branch-3.1.
Why are the changes needed?
We found a query in production that cost lots of time in optimize phase when enable DPP, the SQL pattern likes
Before this PR, Analyzer/Optimizer costs 15821.6 seconds
After this PR, Analyzer/Optimizer costs 2.4 seconds seconds
The original description of SPARK-36444 did not show this improvement, but it does significantly improve the SQL compile performance for such cases.
Does this PR introduce any user-facing change?
Significant SQL compile performance improvement for some cases.
How was this patch tested?
Added UT.