[SPARK-9066][SQL] Improve cartesian performance #7417
Conversation
Test build #37347 has finished for PR 7417 at commit
Test build #37350 has finished for PR 7417 at commit
Jenkins, retest this please.
Test build #23 has finished for PR 7417 at commit
Test build #37354 has finished for PR 7417 at commit
Quick question. Why not use sizeInBytes? I assume we want to move as little data as possible? Using sizeInBytes would be a bit more involved, since this would involve the planner, and (probably) adding a BuildSide parameter to CartesianProduct...
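To make that suggestion concrete, here is a minimal sketch of a size-based choice. BuildSide/BuildLeft/BuildRight are defined locally so the snippet stands alone; they mirror the trait Spark's broadcast joins already use, and the comparison itself is the only assumed logic:

    // Hedged sketch only: pick the smaller side as the build side based on
    // the planner's sizeInBytes estimate, analogous to broadcast-join planning.
    sealed trait BuildSide
    case object BuildLeft extends BuildSide
    case object BuildRight extends BuildSide

    def chooseBuildSide(leftSizeInBytes: BigInt, rightSizeInBytes: BigInt): BuildSide =
      if (leftSizeInBytes <= rightSizeInBytes) BuildLeft else BuildRight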
Yes, using the number of partitions here is not accurate. Consider an RDD with 100 partitions where each partition holds one record, versus an RDD with 10 partitions where each partition holds 100 million records: the method above would cause more scans from HDFS.
@hvanhovell Yes, using sizeInBytes is better, but it also has a problem: if leftResults has only one record and that record is large, while rightResults has many records whose total size is small, then this scenario will perform worse. The best way would be to check the total record count for each partition, but we cannot get that at the moment.
Test build #37467 has finished for PR 7417 at commit
Test build #37471 has finished for PR 7417 at commit
Test build #37495 has finished for PR 7417 at commit
Test build #37557 has finished for PR 7417 at commit
Test build #37566 has finished for PR 7417 at commit
Test build #37807 has finished for PR 7417 at commit
Do you have any benchmarking results for this? Would be great to see how much this improves the current situation.
How is this different from a BroadcastNestedLoopJoin?
BroadcastNestedLoopJoin is only used for outer joins, right? But this is used for cartesian products.
The inner join variant with the (degenerate) condition 1 = 1 would do the same.
All I am saying is that this is also a way to get a broadcasting cartesian join going, and it saves some lines of code.
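For reference, a sketch of that equivalence (the table names a and b are made up for the example, and an existing sqlContext with both tables registered is assumed):

    // An inner join whose condition is always true yields exactly the
    // cartesian product of the two tables.
    sqlContext.sql("SELECT * FROM a JOIN b ON 1 = 1")
    sqlContext.sql("SELECT * FROM a, b")  // same result set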
Test build #38024 has finished for PR 7417 at commit
@hvanhovell I used TPC-DS to test. For the SQL clause below, with this patch the query runs in 1h55min; without this patch, half of the tasks took 16.7h.
This code is almost the same as the code above. I would put it in a method, i.e. createCartesianProduct, and wrap the result in a Filter operator.
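A possible shape for that helper, sketched against Spark 1.5-era physical operators. The name createCartesianProduct comes from the comment above; the signature and the Option-typed condition are assumptions about how it would slot into the planner:

    import org.apache.spark.sql.catalyst.expressions.Expression
    import org.apache.spark.sql.execution.{Filter, SparkPlan}
    import org.apache.spark.sql.execution.joins.CartesianProduct

    // Build the cartesian product once; if a join condition exists, wrap the
    // result in a Filter operator instead of duplicating the planning code.
    def createCartesianProduct(
        left: SparkPlan,
        right: SparkPlan,
        condition: Option[Expression]): SparkPlan = {
      val cartesian = CartesianProduct(left, right)
      condition.map(Filter(_, cartesian)).getOrElse(cartesian)
    }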
@Sephiroth-Lin can you rebase this?
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/CartesianProduct.scala
Test build #42133 has finished for PR 7417 at commit
I think BroadcastNestedLoopJoin can support the condition, and pushing the filter down into the operator can also reduce the memory overhead, as BroadcastNestedLoopJoin puts all of the valid tuples into a CompactBuffer.
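To illustrate the memory point with plain Scala collections (a toy, not Spark's BroadcastNestedLoopJoin code): applying the predicate while streaming means rejected pairs are never materialized, whereas filtering afterwards buffers every pair first.

    // Buffers all |l| * |r| pairs before filtering.
    def joinThenFilter[A, B](l: Seq[A], r: Seq[B], p: (A, B) => Boolean): Seq[(A, B)] =
      (for (a <- l; b <- r) yield (a, b)).filter(p.tupled)

    // Applies the predicate inside the loop, so only matching pairs are kept.
    def filterWhileJoining[A, B](l: Seq[A], r: Seq[B], p: (A, B) => Boolean): Seq[(A, B)] =
      for (a <- l; b <- r; if p(a, b)) yield (a, b)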
Test build #42193 has finished for PR 7417 at commit
BTW, can you add some unit tests like what I did in #8652?
Test build #43092 has finished for PR 7417 at commit
@zsxwing the RDDs' order does matter for the for-comprehension. Anyway, this PR LGTM.
Instead of passing a BuildSide to CartesianProduct, why not just change the parameter order according to the data size? Like:

    if (left < right) {
      CartesianProduct(left, right)
    } else {
      CartesianProduct(right, left)
    }
Actually I am a little concerned about the side switch based on the statistics, as I commented previously. And also, as @cloud-fan pointed out:

    for (x <- rdd1.iterator(currSplit.s1, context); y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)

What we actually care about is the average number of records in each partition on both sides, and I don't think we can say that the side with the bigger file size in the statistics will also have more records per partition on average (most likely the average number of records per partition will be the same).
Probably we'd better add more statistics, e.g. the partition number of the logical plan or the average file size of each partition, and in order not to cause confusion for further improvements, I think we'd better remove this optimization rule for cartesian joins. And that's why I didn't do it in #8652.
What do you think?
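A toy demonstration of the cost argument above (plain Scala, standing in for CartesianRDD's compute): the right side is re-scanned once per left record, so the left side's record count, not its byte size, determines the number of rescans.

    var rightScans = 0
    def rightIterator(): Iterator[Int] = { rightScans += 1; Iterator(1, 2, 3) }

    val leftPartition = Seq.fill(1000)(0)
    val pairs = (for (x <- leftPartition.iterator; y <- rightIterator()) yield (x, y)).toList
    // rightScans == 1000: one full pass over the right partition per left record.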
Good point! This optimization should depend on record counts, not data size.
Hi @Sephiroth-Lin, according to the previous discussion, I think we should NOT optimize according to data size. Do you mind closing this PR and helping us review #8652? It contains the part of your optimization which is still valid.
@cloud-fan OK.
See JIRA: https://issues.apache.org/jira/browse/SPARK-9066
Tested with TPC-DS; for the SQL clause below, with this patch the run takes 1h55min, while without this patch half of the tasks took 16.7h.
@scwf