[SPARK-11982] [SQL] improve performance of cartesian product #9969
Test build #46683 has finished for PR 9969 at commit
|
Can you document what the difference is between this iterator and the sorted iterator? Is it simply that one is sorted and the other is not?
@davies are you trying to save an in-memory sort here?
|
Test build #46702 has finished for PR 9969 at commit
|
Does the UnsafeExternalSorter preserve record order if it spills?
We may also need to update the CartesianProduct strategy to put the smaller child on the right side.
As we discussed in #7417, it's not yet clear which metric should be used as the size of a table; that could be a separate change.
Even if the right table is larger than the left, this approach is still much better than the current one (building a partition is usually much more expensive than loading it from memory or disk). It also fixes another problem: the right table could be nondeterministic.
@cloud-fan For the first question, yes.
a3b3957 to 074f2a7
|
Test build #2117 has finished for PR 9969 at commit
|
|
Test build #46755 has finished for PR 9969 at commit
|
This doesn't work if iterators contains empty iterators. Fix it, or assert that this can't happen.
It checks that iterators is not empty.
That's not what I mean.
Suppose iterators contains an empty one, so iterators is:
(1, 2) : empty : (3, 4)
When you move to the second iterator (current is empty), you will stop and never iterate over the iterator containing (3, 4).
Oh, I see, thanks, will fix it.
For UnsafeExternalSorter, it's not possible to have an empty iterator in the middle, since they are spilled files. It's still good to be defensive about that.
Yea. I figured it would not be empty but I agree about being defensive. If the implementation of UnsafeExternalSorter changes, we don't want to debug this.
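For reference, the defensive behavior discussed above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in this PR: the class name `ChainedIteratorSketch` and the helper `drain` are hypothetical, and it chains plain `java.util.Iterator`s rather than the sorter's own iterator type. The point is that `hasNext()` loops until it finds a non-empty iterator, so an empty one in the middle, as in (1, 2) : empty : (3, 4), is skipped instead of ending the iteration early.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Queue;

// Hypothetical sketch: chains several iterators and tolerates empty ones.
final class ChainedIteratorSketch<T> implements Iterator<T> {
  private final Queue<Iterator<T>> remaining;
  private Iterator<T> current = Collections.emptyIterator();

  ChainedIteratorSketch(List<Iterator<T>> iterators) {
    this.remaining = new ArrayDeque<>(iterators);
  }

  @Override
  public boolean hasNext() {
    // Loop instead of advancing once, so any number of consecutive
    // empty iterators is skipped before deciding we are done.
    while (!current.hasNext() && !remaining.isEmpty()) {
      current = remaining.poll();
    }
    return current.hasNext();
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.next();
  }

  // Helper for the example: collects everything an iterator yields.
  static <E> List<E> drain(Iterator<E> it) {
    List<E> out = new ArrayList<>();
    while (it.hasNext()) {
      out.add(it.next());
    }
    return out;
  }

  public static void main(String[] args) {
    List<Iterator<Integer>> parts = Arrays.asList(
        Arrays.asList(1, 2).iterator(),
        Collections.<Integer>emptyIterator(),
        Arrays.asList(3, 4).iterator());
    System.out.println(drain(new ChainedIteratorSketch<>(parts)));
  }
}
```

An alternative, also mentioned above, is to assert at construction time that no iterator is empty; the loop is cheaper to reason about if the sorter's spilling behavior ever changes.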
Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
|
LGTM |
|
Test build #46898 has finished for PR 9969 at commit
|
|
Test build #2131 has finished for PR 9969 at commit
|
|
After this patch, the query time of TPC-DS Q65 goes down from 28 minutes to 4 seconds (420X faster). |
|
Scale factor 1 and 10 (1G and 10G). |
This PR improves the performance of CartesianProduct by caching the result of the right plan.
After this patch, the query time of TPC-DS Q65 goes down from 28 minutes to 4 seconds (420X faster).
cc @nongli
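To illustrate the idea in the description, here is a rough standalone sketch of why caching the right side helps. It is not the code in this PR: `CachedCartesianSketch`, `naive`, and `cached` are hypothetical names, and an in-memory `ArrayList` stands in for the buffer (the discussion above indicates the real buffer is an UnsafeExternalSorter that may spill to disk). In the naive version the right side is rebuilt once per left row; in the cached version it is built once and replayed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch of a cartesian product with and without a cached right side.
final class CachedCartesianSketch {

  // Rebuilds the right side for every left row (expensive if building is costly).
  static <L, R> List<String> naive(Iterable<L> left, Supplier<Iterator<R>> buildRight) {
    List<String> out = new ArrayList<>();
    for (L l : left) {
      Iterator<R> right = buildRight.get();
      while (right.hasNext()) {
        out.add(l + "," + right.next());
      }
    }
    return out;
  }

  // Builds the right side once, buffers it, and replays the buffer per left row.
  // Assumption: the buffer fits in memory; a real implementation may spill to disk.
  static <L, R> List<String> cached(Iterable<L> left, Supplier<Iterator<R>> buildRight) {
    List<R> buffer = new ArrayList<>();
    Iterator<R> right = buildRight.get();
    while (right.hasNext()) {
      buffer.add(right.next());
    }
    List<String> out = new ArrayList<>();
    for (L l : left) {
      for (R r : buffer) {
        out.add(l + "," + r);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    AtomicInteger builds = new AtomicInteger();
    Supplier<Iterator<String>> right = () -> {
      builds.incrementAndGet(); // counts how often the right side is rebuilt
      return Arrays.asList("a", "b").iterator();
    };
    List<Integer> left = Arrays.asList(1, 2, 3);

    naive(left, right);
    System.out.println("naive builds:  " + builds.getAndSet(0));
    cached(left, right);
    System.out.println("cached builds: " + builds.get());
  }
}
```

Because the right side is evaluated only once, every left row sees the same right rows, which is also why buffering addresses the nondeterminism concern raised in the review.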