-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-35363][SQL] Refactor sort merge join code-gen be agnostic to join type #32495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @cloud-fan and @maropu , thanks. |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Thank you @maropu for review. There was a maven compilation error in github action. It looks transient to me. Waiting for jenkins test result now. |
|
Ah, Jenkins does not check the java11 compilation? I think It's be better to re-invoke the GA tests just in case for checking if the compilation is okay. (though I also think that's transient error as you said). |
|
Thanks @maropu, restarted GA test here - https://github.com/c21/spark/actions/runs/829684422 . |
|
okay, I've checked that it passed. I will merge this. |
|
All the GA tests passed. Merged to master. Thank you, @c21 |
|
Thank you @maropu for monitoring and review! |
|
Test build #138343 has finished for PR 32495 at commit
|
| | } | ||
| | return false; // unreachable | ||
| |} | ||
| """.stripMargin, inlineToOuterClass = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have multiple SMJ in one whole-stage, will we have multiple findNextJoinRows methods in the outer class and fail the compilation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is applying to inner join before this PR as well. There won't be an issue and we will never have multiple findNextJoinRows in one class. The reason is with current design, sort merge join will always do code-gen for its children separately, so there won't be two SortMergeJoinExecs code-gen-ed in the same class.
Verified with example query:
val df1 = spark.range(10).select($"id".as("k1"))
val df2 = spark.range(4).select($"id".as("k2"))
val df3 = spark.range(6).select($"id".as("k3"))
df3.join(df2.hint("SHUFFLE_MERGE"), $"k3" === $"k2", "left_outer")
.join(df1.hint("SHUFFLE_MERGE"), $"k3" === $"k1", "right_outer")
.explain("codegen")
Query plan:
*(8) SortMergeJoin [k3#10L], [k1#2L], RightOuter
:- *(5) SortMergeJoin [k3#10L], [k2#6L], LeftOuter
: :- *(2) Sort [k3#10L ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(k3#10L, 5), ENSURE_REQUIREMENTS, [id=#43]
: : +- *(1) Project [id#8L AS k3#10L]
: : +- *(1) Range (0, 6, step=1, splits=2)
: +- *(4) Sort [k2#6L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(k2#6L, 5), ENSURE_REQUIREMENTS, [id=#49]
: +- *(3) Project [id#4L AS k2#6L]
: +- *(3) Range (0, 4, step=1, splits=2)
+- *(7) Sort [k1#2L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(k1#2L, 5), ENSURE_REQUIREMENTS, [id=#58]
+- *(6) Project [id#0L AS k1#2L]
+- *(6) Range (0, 10, step=1, splits=2)
All generated code is in https://gist.github.com/c21/873775bcd08583105b289e67221f6e17.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
worth to add a comment for it, in case we changed this in the future (codegen one side with SMJ together)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or we just make it future proof and create a fresh function name here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan - thanks for calling it out. I would like to make it future proof by giving it a fresh name. Let me do it in a followup PR. Actually I was wondering the same question as you when implementing and spent some time to figuring it out.
…ead of hardcoding it ### What changes were proposed in this pull request? This is a followup from discussion in #32495 (comment) . The hardcoded function name `findNextJoinRows` is not a real problem now as we always do code generation for SMJ's children separately. But this change is to make it future proof in case this assumption changed in the future. ### Why are the changes needed? Fix the potential reliability issue. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #32548 from c21/smj-followup. Authored-by: Cheng Su <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
This is a pre-requisite of #32476, in discussion of #32476 (comment) . This is to refactor sort merge join code-gen to depend on streamed/buffered terminology, which makes the code-gen agnostic to different join types and can be extended to support other join types than inner join.
Why are the changes needed?
Pre-requisite of #32476.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing unit test in
InnerJoinSuite.scalafor inner join code-gen.