-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-9251][SQL] do not order by expressions which still need evaluation #7593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an existing and small bug, sometimes seed is large and can not be represented as int literal, add a L at end to make it a long literal.
|
Test build #38064 has finished for PR 7593 at commit
|
|
retest this please. |
|
Test build #56 has finished for PR 7593 at commit
|
|
Test build #38065 has finished for PR 7593 at commit
|
|
retest this please. |
|
Test build #60 has finished for PR 7593 at commit
|
|
Test build #38079 has finished for PR 7593 at commit
|
|
cc @yhuai can you review this? |
|
Test build #1165 has finished for PR 7593 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a comment at here to explain we need a Project at the top to get the expected output attributes.
|
Test build #38156 has finished for PR 7593 at commit
|
|
retest this please |
|
Test build #38166 has finished for PR 7593 at commit
|
|
Test build #74 has finished for PR 7593 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible that we will have multiple conditions needed to alias?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's definitely possible, but the alias name here doesn't matter, we'll call toAttribute later, and thus bind it with expression id.
|
@rxin , I'm wondering should we do this for all kind of expressions? We will copy rows before sort, with this change, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This even introduce complicity.
I'm wondering what's the reason we should do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The origin motivation is adding a project to materialize nondeterministic expressions in ORDER BY to avoid extra evaluation that lead to wrong answer, see JIRA. In an offline discussion we decided to apply this rule for all still-need-evaluate expressions. But now I think it maybe overkill. @rxin What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most optimal way is we have a perfect cost model that can predict what we are trading off (network vs cpu). Minus that, I think just always projecting is the approach that makes more sense in most common cases, because:
- It is hard to quantify the difference.
- I/O (network, disk) is rarely the bottleneck here, especially with more SSDs and 10Gbps network.
- Most of the time order by is just ordering by a field, and this won't hurt that case.
- If there is a complex expression, doing the eval many times during sorting is bad.
The alternative, which is probably even better, is for the sorter itself to always project out the sort key. It might make more sense there, but is slightly more complicated to write I think.
|
@cloud-fan would be great to add unit test for this analysis rule too. |
|
Test build #38434 has finished for PR 7593 at commit
|
d9f0b6e to
b2a2c8c
Compare
|
Test build #38436 has finished for PR 7593 at commit
|
|
Test build #38435 has finished for PR 7593 at commit
|
|
Test build #38439 has finished for PR 7593 at commit
|
|
cc @yhuai for review |
|
Test build #38446 has finished for PR 7593 at commit
|
|
LGTM. Will merge it once it passes the test. |
|
Test build #38803 has finished for PR 7593 at commit
|
|
Thanks - I've merged this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is probably more expensive than just doing the transformation always. If its a noop we will detect that through reference equality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also add a test to make sure we don't project unnecessarily when there is an alias?
as an offline discussion with @rxin , it's weird to be computing stuff while doing sorting, we should only order by bound reference during execution.