[SPARK-7440][SQL] Remove physical Distinct operator in favor of Aggregate #6637
…gate. Distinct is very similar to Aggregate, which is an important operator to optimize for. This patch replaces Distinct with Aggregate in the optimizer, so Distinct will become more efficient over time as we optimize Aggregate.
Minor: Seems that Once is enough. Also applies to the "Remove SubQueries" batch above.
LGTM except for a minor issue.
An example in the comment can be useful for understanding:
SELECT DISTINCT f1, f2 FROM t ==> SELECT f1, f2 FROM t GROUP BY f1, f2
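The equivalence in the example above can be checked on a small table. This is only a sketch: it uses SQLite from the Python standard library rather than Spark, and the table contents and the extra column `f3` are made up for illustration.

```python
import sqlite3

# In-memory SQLite database; the table t and its rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (f1 INTEGER, f2 TEXT, f3 REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, "a", 0.5), (1, "a", 1.5), (2, "b", 2.5), (2, "a", 3.5), (1, "a", 4.5)],
)

# The two forms from the review comment.
distinct_rows = conn.execute("SELECT DISTINCT f1, f2 FROM t").fetchall()
grouped_rows = conn.execute("SELECT f1, f2 FROM t GROUP BY f1, f2").fetchall()

# Neither query guarantees row order, so compare the sorted results.
same = sorted(distinct_rows) == sorted(grouped_rows)
print(same)
```

Grouping by every selected column with no aggregate functions keeps one row per distinct combination, which is why the rewrite preserves the result.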
Test build #34170 has finished for PR 6637 at commit
I updated the comment but left the Once/Fixed in place. If we want to change that, we can do it in the future. Since Michael wrote the original code, I'm not sure if there are things that'd require running this to fixed point.
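To illustrate the Once versus fixed-point question for this one rule, here is a toy model in Python. It is not Spark's Catalyst code: the `Relation`, `Distinct`, and `Aggregate` classes and the `run_once` / `run_fixed_point` helpers are hypothetical stand-ins. It shows only that a rule which rewrites every Distinct in a single traversal has nothing left to do on a second pass; it says nothing about whether other rules in the same batch need to run to fixed point.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Relation:
    """Leaf node: a named table with its output column names."""
    name: str
    output: Tuple[str, ...]


@dataclass(frozen=True)
class Distinct:
    """Logical DISTINCT over a child plan."""
    child: object

    @property
    def output(self):
        return self.child.output


@dataclass(frozen=True)
class Aggregate:
    """GROUP BY `grouping`, projecting `aggregates`, over a child plan."""
    grouping: Tuple[str, ...]
    aggregates: Tuple[str, ...]
    child: object

    @property
    def output(self):
        return self.aggregates


def replace_distinct(plan):
    """One full traversal that rewrites every Distinct into an Aggregate."""
    if isinstance(plan, Relation):
        return plan
    if isinstance(plan, Distinct):
        child = replace_distinct(plan.child)
        # Group by all child columns and project the same columns.
        return Aggregate(child.output, child.output, child)
    if isinstance(plan, Aggregate):
        return Aggregate(plan.grouping, plan.aggregates, replace_distinct(plan.child))
    raise TypeError(f"unknown plan node: {plan!r}")


def run_once(rule, plan):
    """Apply the rule a single time."""
    return rule(plan)


def run_fixed_point(rule, plan, max_iterations=100):
    """Apply the rule until the plan stops changing; return (plan, passes)."""
    for passes in range(1, max_iterations + 1):
        new_plan = rule(plan)
        if new_plan == plan:
            return plan, passes
        plan = new_plan
    return plan, max_iterations


# Nested Distinct nodes over a hypothetical table t(f1, f2).
plan = Distinct(Distinct(Relation("t", ("f1", "f2"))))
once = run_once(replace_distinct, plan)
fixed, passes = run_fixed_point(replace_distinct, plan)
print(once == fixed, passes)
```

In this toy model the fixed-point run takes two passes: one that does the rewrite and one that only confirms nothing changed, so the single-pass strategy yields the same plan with less work.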
Test build #34198 has finished for PR 6637 at commit
Test build #34200 has finished for PR 6637 at commit
…gate

This patch replaces Distinct with Aggregate in the optimizer, so Distinct will become more efficient over time as we optimize Aggregate (via Tungsten).

Author: Reynold Xin <[email protected]>

Closes apache#6637 from rxin/replace-distinct and squashes the following commits:

b3cc50e [Reynold Xin] Mima excludes.
93d6117 [Reynold Xin] Code review feedback.
87e4741 [Reynold Xin] [SPARK-7440][SQL] Remove physical Distinct operator in favor of Aggregate.
This patch replaces Distinct with Aggregate in the optimizer, so Distinct will become
more efficient over time as we optimize Aggregate (via Tungsten).