@pan3793
Member

@pan3793 pan3793 commented Feb 8, 2022

What changes were proposed in this pull request?

This PR proposes to materialize QueryPlan#subqueries and to prune the search with the PLAN_EXPRESSION tree pattern, in order to improve SQL compilation performance.
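As a rough illustration of the two ideas (caching the result and skipping expression trees that cannot contain a subquery), here is a toy sketch. It is not the Spark code: `Expr`, `Leaf`, `Op`, `Subquery`, `Plan`, `containsSubquery`, and `traversals` are hypothetical names invented for this example, with `containsSubquery` standing in for a `containsPattern(PLAN_EXPRESSION)` check.

```scala
object SubquerySketch {
  // Hypothetical stand-in for a Catalyst expression tree.
  sealed trait Expr {
    def children: Seq[Expr]
    def isSubquery: Boolean = false
    // Stand-in for a tree-pattern bit: computed once per node, then cached.
    lazy val containsSubquery: Boolean =
      isSubquery || children.exists(_.containsSubquery)
    // Pre-order collect over this node and all of its descendants.
    def collect[T](pf: PartialFunction[Expr, T]): Seq[T] =
      pf.lift(this).toSeq ++ children.flatMap(_.collect(pf))
  }
  case class Leaf(name: String) extends Expr {
    def children: Seq[Expr] = Nil
  }
  case class Op(children: Seq[Expr]) extends Expr
  case class Subquery(plan: Plan) extends Expr {
    def children: Seq[Expr] = Nil
    override def isSubquery: Boolean = true
  }

  // Hypothetical stand-in for a query plan node.
  case class Plan(name: String, expressions: Seq[Expr]) {
    // Counts how many times the expressions are actually searched.
    var traversals: Int = 0

    // As a `def`, this body would run on every call. As a `lazy val`,
    // it runs once and the result is reused on later accesses.
    lazy val subqueries: Seq[Plan] = {
      traversals += 1
      expressions
        .filter(_.containsSubquery) // prune trees without any subquery
        .flatMap(_.collect { case Subquery(p) => p })
    }
  }
}
```

In this sketch, accessing `subqueries` repeatedly on the same `Plan` walks the expressions only once, and expressions whose `containsSubquery` flag is false are never traversed at all.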

Why are the changes needed?

We found a production query that spends a lot of time in the optimization phase (including the AQE optimization phase) when DPP is enabled. The SQL pattern looks like this:

select <cols...>
from a
left join b on a.<col> = b.<col>
left join c on b.<col> = c.<col>
left join d on c.<col> = d.<col>
left join e on d.<col> = e.<col>
left join f on e.<col> = f.<col>
left join g on f.<col> = g.<col>
left join h on g.<col> = h.<col>
...

SPARK-36444 significantly reduces the optimization time (excluding the AQE phase); see #35431 for details. However, a lot of time is still spent in InsertAdaptiveSparkPlan during the AQE optimization phase.

Before this change, the query took 658s; after this change, it takes only 65s.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.

@github-actions github-actions bot added the SQL label Feb 8, 2022
@pan3793
Member Author

pan3793 commented Feb 8, 2022

cc @wangyum @cloud-fan @yaooqinn

@HyukjinKwon
Member

cc @maryannxue @allisonwang-db @sigmod FYI

   */
-  def subqueries: Seq[PlanType] = {
-    expressions.flatMap(_.collect {
+  lazy val subqueries: Seq[PlanType] = {
Contributor

please add @transient

Member Author

Thanks for the tip, updated.

Contributor

Just for educational purposes: why is @transient useful here?

Member Author

@pan3793 pan3793 Feb 9, 2022

SparkPlan is a subclass of QueryPlan and needs to be sent to executors. Marking the field @transient excludes the cached value from serialization, which reduces executor memory usage.

abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable
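To make the effect concrete, here is a small sketch using plain Java serialization. It is only an illustration under assumptions: `WithTransient`, `WithoutTransient`, and `serializedSize` are hypothetical names, and the large array stands in for a cached value such as the materialized subqueries.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object TransientSketch {
  // Hypothetical node whose cached value is excluded from serialization.
  final class WithTransient(val name: String) extends Serializable {
    @transient lazy val cached: Array[Int] = Array.fill(50000)(1)
  }

  // Same shape, but the cached value is serialized once it has been computed.
  final class WithoutTransient(val name: String) extends Serializable {
    lazy val cached: Array[Int] = Array.fill(50000)(1)
  }

  // Size in bytes of the Java-serialized form of an object.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.size()
  }
}
```

If both `cached` values are forced before serializing, the `WithTransient` instance should serialize to a far smaller payload than the `WithoutTransient` one, because the array is not written out. On the receiving side the transient lazy value is simply recomputed on first access if it is ever needed.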

@amaliujia
Contributor

cc @amaliujia

@AmplabJenkins

Can one of the admins verify this patch?

@pan3793
Member Author

pan3793 commented Feb 16, 2022

@cloud-fan would you please take a look? Thanks.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 0fcb560 Feb 18, 2022
@pan3793 pan3793 deleted the subquery branch April 4, 2022 09:20