-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-22746][SQL] Avoid the generation of useless mutable states by SortMergeJoin #19937
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -507,32 +507,38 @@ case class SortMergeJoinExec( | |
| } | ||
|
|
||
| /** | ||
| * Creates variables for left part of result row. | ||
| * Creates variables and declarations for left part of result row. | ||
| * | ||
| * In order to defer the access after condition and also only access once in the loop, | ||
| * the variables should be declared separately from accessing the columns, we can't use the | ||
| * codegen of BoundReference here. | ||
| */ | ||
| private def createLeftVars(ctx: CodegenContext, leftRow: String): Seq[ExprCode] = { | ||
| private def createLeftVars(ctx: CodegenContext, leftRow: String): (Seq[ExprCode], Seq[String]) = { | ||
| ctx.INPUT_ROW = leftRow | ||
| left.output.zipWithIndex.map { case (a, i) => | ||
| val value = ctx.freshName("value") | ||
| val valueCode = ctx.getValue(leftRow, a.dataType, i.toString) | ||
| // declare it as class member, so we can access the column before or in the loop. | ||
| ctx.addMutableState(ctx.javaType(a.dataType), value) | ||
| val javaType = ctx.javaType(a.dataType) | ||
| val defaultValue = ctx.defaultValue(a.dataType) | ||
| if (a.nullable) { | ||
| val isNull = ctx.freshName("isNull") | ||
| ctx.addMutableState(ctx.JAVA_BOOLEAN, isNull) | ||
| val code = | ||
| s""" | ||
| |$isNull = $leftRow.isNullAt($i); | ||
| |$value = $isNull ? ${ctx.defaultValue(a.dataType)} : ($valueCode); | ||
| |$value = $isNull ? $defaultValue : ($valueCode); | ||
| """.stripMargin | ||
| ExprCode(code, isNull, value) | ||
| val leftVarsDecl = | ||
| s""" | ||
| |boolean $isNull = false; | ||
| |$javaType $value = $defaultValue; | ||
| """.stripMargin | ||
| (ExprCode(code, isNull, value), leftVarsDecl) | ||
| } else { | ||
| ExprCode(s"$value = $valueCode;", "false", value) | ||
| val code = s"$value = $valueCode;" | ||
| val leftVarsDecl = s"""$javaType $value = $defaultValue;""" | ||
| (ExprCode(code, "false", value), leftVarsDecl) | ||
| } | ||
| } | ||
| }.unzip | ||
| } | ||
|
|
||
| /** | ||
|
|
@@ -580,7 +586,7 @@ case class SortMergeJoinExec( | |
| val (leftRow, matches) = genScanner(ctx) | ||
|
|
||
| // Create variables for row from both sides. | ||
| val leftVars = createLeftVars(ctx, leftRow) | ||
| val (leftVars, leftVarDecl) = createLeftVars(ctx, leftRow) | ||
| val rightRow = ctx.freshName("rightRow") | ||
| val rightVars = createRightVar(ctx, rightRow) | ||
|
|
||
|
|
@@ -617,6 +623,7 @@ case class SortMergeJoinExec( | |
|
|
||
| s""" | ||
| |while (findNextInnerJoinRows($leftInput, $rightInput)) { | ||
| | ${leftVarDecl.mkString("\n")} | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can't we define them before the loop and reuse them?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It could be. Would it be possible to let us know advantages compare to the current method?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since I have no strong preference, I would like to hear opinions from others.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think that the advantage would be to reuse them, avoiding creating and destroying them at every loop.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since they are local variable, it takes almost no cost in native code. If they are on CPU registers, there is no cost. If they are in stack frame, it is up to one instruction to increase or decrease stack frame size. WDYT? Did you see huge overhead to create and destroy local variables?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We would appreciate it if you would past the kernel code what you are thinking about.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Here is my benchmark result. Is there any your result?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If no performance difference, we prefer to the simpler codes.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since these local variables are only needed inside the loop, I feel the current code is more readable. Performance is not a concern here as the overhead is very low or none.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @kiszk thanks for the test. Then I agree this is the best option, thanks. |
||
| | ${beforeLoop.trim} | ||
| | scala.collection.Iterator<UnsafeRow> $iterator = $matches.generateIterator(); | ||
| | while ($iterator.hasNext()) { | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to update the function description.