@chenghao-intel
Contributor

In some cases, we can broadcast the smaller relation in a cartesian join, which improves performance significantly.

@SparkQA

SparkQA commented Sep 8, 2015

Test build #42126 has finished for PR 8652 at commit 98efb3d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Sep 8, 2015

Looks like this is a part of #7417.

@jameszhouyi

Hi @chenghao-intel,
After applying the patch, our cross join case passes, and performance is better than before the optimization.

@chenghao-intel
Contributor Author

Oh, I didn't notice that. Thank you @zsxwing, I left some comments in #7417.

@jameszhouyi

With the patch applied, the physical plan for the cross join shows "CartesianProduct" replaced by "BroadcastNestedLoopJoin". The benchmark result showed a ~42% performance gain (15m1s vs. 26m37s).

== Physical Plan ==
TungstenProject [concat(cast(s_store_sk#454L as string),_,s_store_name#455) AS store_ID#444,pr_review_date#447,pr_review_content#453]
 BroadcastNestedLoopJoin BuildRight, Inner, Some((locate(lower(s_store_name#455),lower(pr_review_content#453),1) >= 1))
  HiveTableScan [pr_review_date#447,pr_review_content#453], (MetastoreRelation bigbench, product_reviews, Some(pr))
  HiveTableScan [s_store_sk#454L,s_store_name#455], (MetastoreRelation bigbench, temp_stores_with_regression, Some(stores_with_regression))
Code Generation: true
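
The idea behind the plan above can be sketched in plain Python. This is only a conceptual model, not Spark's implementation: the function name `broadcast_nested_loop_join`, the sample rows, and the helper `mentions_store` are all hypothetical. It assumes the small relation fits in memory, so every partition of the large relation can loop over one full copy of it instead of shuffling both sides.

```python
def broadcast_nested_loop_join(streamed_partitions, broadcast_rows, condition):
    """Inner join in which every streamed partition scans one shared copy of the small side."""
    out = []
    for partition in streamed_partitions:   # partitions of the large relation are processed in place
        for left in partition:
            for right in broadcast_rows:    # the small relation is fully visible to each partition
                if condition(left, right):
                    out.append((left, right))
    return out


# Hypothetical sample data: (review_date, review_content) and (store_sk, store_name).
reviews = [
    [("2015-01-01", "Great service at Able Store")],
    [("2015-01-02", "Nothing to report")],
]
stores = [(1, "able store"), (2, "Baker Shop")]


def mentions_store(review, store):
    # Rough analogue of locate(lower(name), lower(content), 1) >= 1 in the plan.
    return store[1].lower() in review[1].lower()


matches = broadcast_nested_loop_join(reviews, stores, mentions_store)
```

Only the first review mentions a store name, so `matches` holds a single pair.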

@jameszhouyi

Hi @yhuai, this PR has been ready for some time. Could you help review it? Hopefully it can make it into 1.5.1.

Contributor

Contributor Author

I am not sure what you mean about the exception at BroadcastNestedLoopJoin.scala. I also changed BroadcastNestedLoopJoin.scala below to support the INNER join; is that what you mean?

Contributor

Ah, sorry, I missed that line. For a CROSS JOIN, we will use INNER as the join type and the condition should be None. Can we match these explicitly? To make the code easier for others to understand, case logical.Join(CanBroadcast(left), right, JoinType.Inner, None) would be better, right?

Contributor Author

Actually, we are trying to support the Cartesian join by using the Broadcast join, so it will support all of the join types.

Contributor

Cartesian means that there is no join condition, right?

Contributor

Also, the join type is CROSS_JOIN. Here we just use INNER as a placeholder because the type does not really matter.
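
The explicit match discussed in this thread (inner join type, no condition, one broadcastable side) can be illustrated outside Spark. The following is a minimal Python sketch, not Catalyst code: `Relation`, `Join`, `is_broadcastable_cross_join`, and the `small` flag (a stand-in for a size check such as `CanBroadcast`) are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Relation:
    name: str
    small: bool  # hypothetical stand-in for a size check such as CanBroadcast


@dataclass
class Join:
    left: Relation
    right: Relation
    join_type: str             # e.g. "inner", "left_outer"
    condition: Optional[str]   # None means the join has no condition


def is_broadcastable_cross_join(join: Join) -> bool:
    """True only for the explicitly matched shape: inner type, no condition, one small side."""
    return (
        join.join_type == "inner"
        and join.condition is None
        and (join.left.small or join.right.small)
    )


big = Relation("product_reviews", small=False)
tiny = Relation("stores", small=True)

cross = Join(big, tiny, "inner", None)
filtered = Join(big, tiny, "inner", "a.id < b.id")
```

Spelling out both the join type and the missing condition keeps the rule from silently capturing joins it was not written for, which is the readability point raised above.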

@SparkQA

SparkQA commented Oct 15, 2015

Test build #43763 has finished for PR 8652 at commit 07144ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chenghao-intel
Contributor Author

@yhuai, after some investigation, I think we can still optimize the left/right/full outer joins by broadcasting if they have no condition, but LeftSemi is not included, and that should be a bug in the previous implementation. I also added more unit tests that compare the results with Hive.
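
To make the outer-join case concrete, here is a minimal Python sketch of a left outer join with no condition. It is an illustration of the semantics only, under the assumption that the right side is small enough to broadcast; the function name and sample values are hypothetical and do not come from the patch.

```python
def left_outer_join_no_condition(left_rows, broadcast_right):
    """Left outer join without a condition against a fully broadcast right side."""
    if not broadcast_right:
        # No right rows at all: every left row survives, padded with a null.
        return [(left, None) for left in left_rows]
    # Otherwise every left row pairs with every right row, as in a cartesian product.
    return [(left, right) for left in left_rows for right in broadcast_right]


padded = left_outer_join_no_condition(["a", "b"], [])
paired = left_outer_join_no_condition(["a", "b"], [1, 2])
```

The only difference from a plain cartesian product is the empty-right-side case, where the left rows must still be emitted.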

@SparkQA

SparkQA commented Oct 15, 2015

Test build #43775 has finished for PR 8652 at commit 26818c7.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chenghao-intel
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 15, 2015

Test build #43779 has finished for PR 8652 at commit 26818c7.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 16, 2015

Test build #43820 has finished for PR 8652 at commit 4ee1797.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Indentation is off here.

@chenghao-intel
Contributor Author

Thank you @liancheng, updated!

@SparkQA

SparkQA commented Oct 19, 2015

Test build #43906 has finished for PR 8652 at commit fab8923.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

I may be missing some context here. This is a strategy for CartesianProduct. We have a strategy BroadcastNestedLoopJoin to handle other cases.

Contributor Author

Yes, I noticed that too. However, without these changes, the cases I added are planned as the operator CartesianProduct instead of BroadcastNestedLoopJoin, because the strategy BroadcastNestedLoopJoin comes after CartesianProduct.

Besides, explicitly listing the rules for this optimization will probably help people understand the logic behind it.

PS: The rule BroadcastNestedLoopJoin has to be the last gate, as it's supposed to handle all kinds of joins.
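
The ordering problem described here can be modeled with a first-match rule list. This is a simplified Python sketch that assumes the planner tries rules in order and takes the first one that applies, as the comment describes; `first_match`, the predicates, and the dictionary fields are hypothetical and are not the planner's real API.

```python
def first_match(join, rules):
    """Return the name of the first rule whose predicate accepts the join."""
    for name, accepts in rules:
        if accepts(join):
            return name
    raise ValueError("no rule matched")


def no_condition(join):
    return join["condition"] is None


def broadcastable_without_condition(join):
    return no_condition(join) and join["has_small_side"]


def catch_all(join):
    # The "last gate": accepts every join, so it must come last.
    return True


# Rule order without the change: the catch-all sits behind CartesianProduct.
before = [("CartesianProduct", no_condition), ("BroadcastNestedLoopJoin", catch_all)]
# Rule order with an explicit, higher-priority broadcast rule placed first.
after = [("BroadcastNestedLoopJoin", broadcastable_without_condition)] + before

cross_join = {"type": "inner", "condition": None, "has_small_side": True}
plan_before = first_match(cross_join, before)
plan_after = first_match(cross_join, after)
```

In this model `plan_before` is "CartesianProduct" and `plan_after` is "BroadcastNestedLoopJoin": a catch-all rule at the end never sees a join that an earlier rule already accepts.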

@JoshRosen
Contributor

/cc @harsha2010, this PR may interest you given your similar broadcast cartesian join optimizations in Magellan.

Contributor

Is this comment related to this case? Also, I still do not think these two rules should be in this Strategy, because the name of the strategy is CartesianProduct but the first two cases are not for CartesianProduct. Actually, can we combine the CartesianProduct and BroadcastNestedLoopJoin strategies?

Contributor

Also, why does the condition need to be None?

Contributor Author

Yes, the comment is stale.

If we restrict the outer join condition to None here, then it's more like a CartesianProduct. That's why I put the rules in CartesianProduct and, more importantly, we'd like those 2 rules to take priority over the rule in Line 292.

I totally agree with combining CartesianProduct and BroadcastNestedLoopJoin, as the latter is just a special case of the former.

Will update the code soon.

Contributor

Is it a proper name?

Contributor Author

Any suggestion for the name?

Contributor

How about NonEquiJoinSelection, by analogy with EquiJoinSelection?

Contributor Author

BroadcastNestedLoopJoin actually supports equi-joins, although we never run into that case, because earlier rules provide a better solution for it.

On second thought, I am a little hesitant to combine the rules in CartesianProduct and BroadcastNestedLoopJoin, as the latter is supposed to be the last gate for JOIN and works for all JOIN types with or without a join condition; the others can be considered optimizations compared to it.

I am going to revert the code change if @yhuai is not strongly opposed to it. Or we can refactor the JOIN strategy after this PR has been merged.

What do you think, @yhuai?

Contributor

I am not sure if BroadcastNestedLoopJoin is a good last gate. It is possible that we join two large tables and we cannot really use broadcast join, right?

Contributor

But, it is fine to revert it since we will not make that case worse.

Contributor

I think these two rules using CartesianProduct should go to the bottom.

@SparkQA

SparkQA commented Oct 26, 2015

Test build #44329 has finished for PR 8652 at commit 564abd3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 26, 2015

Test build #44331 has finished for PR 8652 at commit 293e5ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

The renaming is intentional: some users extend SQLContext, and they may not be aware of the change here.

@chenghao-intel
Contributor Author

Thank you @yhuai, I've just split BroadcastNestedLoopJoin into BroadcastNestedLoop and DefaultJoin, which probably makes more sense.

@SparkQA

SparkQA commented Oct 26, 2015

Test build #44341 has finished for PR 8652 at commit 975eb46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 27, 2015

Test build #44410 has finished for PR 8652 at commit 7fda511.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chenghao-intel
Contributor Author

@yhuai Any more comments on this?

@yhuai
Contributor

yhuai commented Oct 28, 2015

LGTM. Merging to master.

@asfgit asfgit closed this in d9c6039 Oct 28, 2015
@chenghao-intel chenghao-intel deleted the cartesian branch October 28, 2015 04:38

8 participants