-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-16456][SQL] Reuse the uncorrelated scalar subqueries with the same logical plan in a query #14111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #62007 has finished for PR 14111 at commit
|
|
Test build #62010 has finished for PR 14111 at commit
|
|
Test build #62017 has finished for PR 14111 at commit
|
|
Test build #62018 has finished for PR 14111 at commit
|
|
Test build #62021 has finished for PR 14111 at commit
|
|
Test build #62056 has finished for PR 14111 at commit
|
| def subqueries: Seq[PlanType] = { | ||
| expressions.flatMap(_.collect {case e: SubqueryExpression => e.plan.asInstanceOf[PlanType]}) | ||
| expressions.flatMap(_.collect { | ||
| case e: SubqueryExpression => e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use ExpressionCanonicalizer to canonicalize the expression before call distinct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because we use ReusedScalarSubquery, not Alias to indicate the reused SubqueryExpression, I think we don't use ExpressionCanonicalizer.
|
I have a simpler idea to implement this feature:
I think this approach need much less code change, what do you think? |
|
@cloud-fan At firstly I have implemented it with you said. But the following situation that has broadcast join will have a error 'ScalarSubquery has not finished', example (from SPARK-14791): The steps are as follows:
But after this PR, they are the same subquery, the steps are as follows:
|
|
@lianhuiwang, after taking a look at this example, I think this is a very special case: 2 physical plans reference to one same subquery(same instance). However, I don't think this is a valid case, I'd rather treat it as a bug of constraints propagation. Except this case, do we have another case that the simpler approach can't handle? |
|
@cloud-fan I don't think it is a bug of constraints propagation because filter with the uncorrelated scalar subquery needs to push down due to it can filter many records. When BroadcastExchangeExec has one same subquery that also appears in other places of this query, The first place will at firstly prepare subquery, But the second place in BroadcastExchangeExec will firstly wait for subquery because BroadcastExchangeExec.doPrepare will execute child plan. |
|
ah you are right, I think we need a better approach to execute subqueries and reuse the results globally. cc @hvanhovell to comment more on this. |
|
I think this is duplicate to the merged one #14548? |
|
@lianhuiwang we have merged PR #14548 implementing similar functionality. Could you close this PR? Thanks for your work! |
|
OK. Thanks. |
What changes were proposed in this pull request?
In TPCDS-Q14 the same physical plan of uncorrelated scalar subqueries from a CTE could be executed multiple times, we should re-use the same result to avoid the duplicated computing.
Before
After
How was this patch tested?
Pass the Jenkins tests (including a new testcase).