@ulysses-you
Contributor

@ulysses-you ulysses-you commented Jun 15, 2022

What changes were proposed in this pull request?

Add a new optimizer rule, PullOutComplexJoinKeys, to the Optimizer.

A plan change example:

                                              Project [c1, c2]
 Join Inner, ((c1 % 2) = c2)                  +- Join Inner, (_complexjoinkey_0 = c2)
 :- Relation default.t1[c1] parquet     =>       :- Project [c1, (c1 % 2) AS _complexjoinkey_0]
 +- Relation default.t2[c2] parquet              :  +- Relation default.t1[c1] parquet
                                                 +- Relation default.t2[c2] parquet

Note that the rule skips the pull-out for a join side that can be broadcast.
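
To make the rewrite concrete, here is a minimal sketch on a toy plan model. The types below (Expr, Plan, Relation, Project, Join) and the canBroadcast flag are hypothetical stand-ins, not Spark's Catalyst classes, and this is not the code in the patch. The sketch assumes the inserted Project keeps the side's original columns so that the outer Project can restore the join's original output.

```scala
// Toy model only: these types are illustrative and are NOT Spark's Catalyst classes.
object PullOutSketch {
  sealed trait Expr
  case class Col(name: String) extends Expr
  case class Mod(child: Expr, divisor: Int) extends Expr
  case class Alias(child: Expr, name: String) extends Expr

  sealed trait Plan { def output: Seq[String] }
  // canBroadcast is a hypothetical flag; a real optimizer would derive it from statistics.
  case class Relation(table: String, output: Seq[String], canBroadcast: Boolean = false) extends Plan
  case class Project(exprs: Seq[Expr], child: Plan) extends Plan {
    def output: Seq[String] = exprs.map {
      case Col(n)      => n
      case Alias(_, n) => n
      case other       => other.toString
    }
  }
  // Equi-join on leftKey = rightKey.
  case class Join(left: Plan, right: Plan, leftKey: Expr, rightKey: Expr) extends Plan {
    def output: Seq[String] = left.output ++ right.output
  }

  // Anything other than a plain column reference counts as "complex" here.
  private def isComplex(e: Expr): Boolean = e match {
    case Col(_) => false
    case _      => true
  }

  private def sideCanBroadcast(p: Plan): Boolean = p match {
    case r: Relation => r.canBroadcast
    case _           => false
  }

  def pullOut(join: Join): Plan = {
    var counter = 0
    // Wrap a side in a Project that computes the key once, unless the side can be broadcast.
    def rewrite(side: Plan, key: Expr): (Plan, Expr) =
      if (isComplex(key) && !sideCanBroadcast(side)) {
        val name = s"_complexjoinkey_$counter"
        counter += 1
        val exprs = side.output.map(n => Col(n): Expr) :+ Alias(key, name)
        (Project(exprs, side), Col(name))
      } else (side, key)

    val (l, lk) = rewrite(join.left, join.leftKey)
    val (r, rk) = rewrite(join.right, join.rightKey)
    if ((l eq join.left) && (r eq join.right)) join
    // The outer Project drops the helper columns and restores the original output.
    else Project(join.output.map(n => Col(n)), Join(l, r, lk, rk))
  }
}
```

Calling PullOutSketch.pullOut on a join of t1 and t2 with keys Mod(Col("c1"), 2) and Col("c2") yields the shape shown in the plan change example; marking t1 as broadcastable leaves the join unchanged.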

Why are the changes needed?

In a sort merge join, a complex join key may be evaluated up to three times:

  1. exchange
  2. sort
  3. join

We can pull it out into a Project so that it is evaluated only once.
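
The counting argument can be sketched with a small simulation. Everything here is a simplified stand-in: exchange, sort and joinKeys are hypothetical helpers that each evaluate the key once per row, which is an assumption of the sketch and not a description of Spark's physical operators.

```scala
object KeyEvalCount {
  // Hypothetical stand-in for a complex join key such as c1 % 2; counts its evaluations.
  final class CountingKey {
    var calls = 0
    def apply(c1: Int): Int = { calls += 1; c1 % 2 }
  }

  // Simplified operators; each one evaluates the key exactly once per input row.
  def exchange[R](rows: Seq[R], key: R => Int, partitions: Int): Iterable[Seq[R]] =
    rows.map(r => (java.lang.Math.floorMod(key(r), partitions), r))
      .groupBy(_._1).values.map(_.map(_._2))

  def sort[R](rows: Seq[R], key: R => Int): Seq[R] =
    rows.map(r => (key(r), r)).sortBy(_._1).map(_._2)

  def joinKeys[R](rows: Seq[R], key: R => Int): Seq[(Int, R)] =
    rows.map(r => (key(r), r))

  def pipeline[R](rows: Seq[R], key: R => Int): Unit =
    exchange(rows, key, 2).foreach(p => joinKeys(sort(p, key), key))

  // The key expression is evaluated inside exchange, sort and join.
  def inlineEvaluations(rows: Seq[Int]): Int = {
    val k = new CountingKey
    pipeline[Int](rows, c1 => k(c1))
    k.calls
  }

  // The key is computed once in a projection; the operators only read the column.
  def projectedEvaluations(rows: Seq[Int]): Int = {
    val k = new CountingKey
    val projected: Seq[(Int, Int)] = rows.map(c1 => (c1, k(c1))) // (c1, _complexjoinkey_0)
    pipeline[(Int, Int)](projected, row => row._2)
    k.calls
  }
}
```

Under these assumptions, inlineEvaluations performs three evaluations per row and projectedEvaluations performs one.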

Does this PR introduce any user-facing change?

No. This is only a plan change for performance.

How was this patch tested?

Added a new test.

@github-actions github-actions bot added the SQL label Jun 15, 2022
@wangyum
Member

wangyum commented Jun 15, 2022

There is a similar earlier implementation; please see the previous comments in #33522.

@ulysses-you
Contributor Author

Thank you @wangyum for pointing out that PR. It seems the main concern there is why a Python UDF would run two more times?

Comment on lines +213 to +214
Batch("Pull Out Complex Join Keys", Once,
PullOutComplexJoinKeys) :+
Member

There is an advantage to putting it here:

  • It reduces the evaluations of a complex join key from 3 to 1 for a sort merge join (SMJ).

However, there is an unavoidable disadvantage:

  • It may increase the shuffle data size, for example when the join key is concat(col1, col2, col3, col4 ...).
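
A rough sketch of the shuffle-size concern, under loud assumptions: rowBytes is a hypothetical estimate that just sums UTF-8 string lengths, it ignores Spark's actual row format and any compression, and it assumes the original columns are still needed after the join, so they are shuffled alongside the materialized key.

```scala
object ShuffleSizeSketch {
  // Hypothetical per-row payload estimate in bytes; not how Spark sizes shuffle data.
  def rowBytes(cols: Seq[String]): Int = cols.map(_.getBytes("UTF-8").length).sum

  // Without the pull-out, only the original columns are shuffled.
  def withoutPullOut(cols: Seq[String]): Int = rowBytes(cols)

  // With the pull-out, the materialized concat(...) key travels as an extra column.
  def withPullOut(cols: Seq[String]): Int = rowBytes(cols :+ cols.mkString)
}
```

In this model a concat key over all columns is as large as the columns combined, so the estimated payload per row doubles.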

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Sep 25, 2022
@github-actions github-actions bot closed this Sep 26, 2022