@mgaido91
Contributor

@mgaido91 mgaido91 commented Sep 1, 2019

What changes were proposed in this pull request?

The PR proposes to split the subexpression elimination code, rather than inlining all of its function calls in the apply method of Generate[Mutable|Unsafe]Projection.

Why are the changes needed?

Before this PR, code generation could fail due to the JVM's 64KB method size limit when many subexpression elimination functions were generated. The added UT reproduces the issue (thanks to the JIRA reporter and @HyukjinKwon for it).
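
As a rough illustration of the idea, here is a minimal Scala sketch of one-level splitting. `SplitSketch`, `splitCalls`, and `maxPerMethod` are hypothetical names used for this example only; this is not Spark's `splitExpressions`, and it splits by statement count purely for simplicity.

```scala
object SplitSketch {
  // Hypothetical helper for illustration; NOT Spark's CodegenContext.splitExpressions.
  // Groups `calls` into helper methods holding at most `maxPerMethod` statements each.
  // Returns (helper method definitions, code that invokes the helpers in order).
  def splitCalls(
      calls: Seq[String],
      funcName: String,
      args: Seq[(String, String)], // (Java type, variable name) pairs
      maxPerMethod: Int): (List[String], String) = {
    require(maxPerMethod > 0, "maxPerMethod must be positive")
    if (calls.isEmpty) {
      // Nothing to split: no helpers and an empty body.
      (Nil, "")
    } else {
      val params = args.map { case (tpe, name) => tpe + " " + name }.mkString(", ")
      val argNames = args.map(_._2).mkString(", ")
      val helpers = calls.grouped(maxPerMethod).toList.zipWithIndex.map {
        case (chunk, idx) =>
          val indented = chunk.map(c => "  " + c).mkString("\n")
          s"private void ${funcName}_$idx($params) {\n$indented\n}"
      }
      // The caller now contains one call per helper instead of one per expression.
      val body = helpers.indices
        .map(idx => s"${funcName}_$idx($argNames);")
        .mkString("\n")
      (helpers, body)
    }
  }
}
```

With three calls and `maxPerMethod = 2`, the caller ends up with two helper invocations instead of three inlined calls; an empty input yields an empty string, so the caller needs no special case.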

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a UT.

@mgaido91
Contributor Author

mgaido91 commented Sep 1, 2019

cc @cloud-fan @kiszk @maropu

@SparkQA

SparkQA commented Sep 1, 2019

Test build #109992 has finished for PR 25642 at commit 9295731.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

test("SPARK-28916: subexpression elimination can cause 64kb code limit") {
Contributor

how long does this test run? We can write a unit test instead if the end-to-end test is too expensive to run.

Contributor Author

yes, it takes about 3 mins. I'll try and find a way to create a UT...

* Returns the code for subexpression elimination after splitting it if necessary.
*/
def subexprFunctionsCode: String = {
// Wholestage codegen does not allow subexpression elimination
Contributor

shall we add assert(currentVars == null) to guarantee it?

Contributor

cc @maropu, I vaguely remember that we do subexpression elimination in hash aggregate in one of your PRs.

Contributor Author

I added it, thanks. Anyway, when we generate subexpression elimination functions they all take an InternalRow as input. Unless that mechanism changes, I doubt it is usable in wholestage codegen.

Contributor Author

shall we add assert(currentVars == null) to guarantee it?

I cannot do that, it doesn't work. In that case just an empty string is returned because the subexpression functions seq is empty.

Member

If so, how about assert(currentVars != null && subexprFunctions.isEmpty) for strict checks?

Contributor Author

I can add that, thanks

@cloud-fan
Contributor

@mgaido91 thanks for the fix! One question: AFAIK we fall back to the interpreted code path if the generated code fails to compile. Why doesn't that work here?

@mgaido91
Contributor Author

mgaido91 commented Sep 2, 2019

AFAIK we will fallback to interpreted code path if codegen fails compilation. Why it doesn't work here

It does work. We could see this as an improvement, indeed.

@SparkQA

SparkQA commented Sep 2, 2019

Test build #110006 has finished for PR 25642 at commit 0167929.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Returns the code for subexpression elimination after splitting it if necessary.
*/
def subexprFunctionsCode: String = {
// Wholestage codegen does not allow subexpression elimination
Member

Could you elaborate on the reasoning behind this comment?
IMHO, this change is not directly related to whether whole-stage codegen performs subexpression elimination, since that is controlled in other places.

Contributor Author

The reason for this comment is that here we pass the InternalRow as the argument, not the currentVars that whole-stage codegen would need. This is because the subexpression elimination functions always take an InternalRow as their argument. For this reason, whole-stage codegen disables subexpression elimination.
So I put this comment here to highlight that if there is future work to support subexpression elimination in whole-stage codegen too, this method will need to be modified as well.

@mgaido91
Contributor Author

mgaido91 commented Sep 2, 2019

retest this please

@SparkQA

SparkQA commented Sep 2, 2019

Test build #110010 has finished for PR 25642 at commit 0167929.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Returns the code for subexpression elimination after splitting it if necessary.
*/
def subexprFunctionsCode: String = {
// Wholestage codegen does not allow subexpression elimination: in that case, subexprFunctions
Contributor

@cloud-fan cloud-fan Sep 2, 2019

I think this is not true, see subexpressionEliminationForWholeStageCodegen in this class. @mgaido91 can you double check to see if this fix works for whole-stage-codegen as well?

Contributor

Seems like whole-stage-codegen has the same issue, but unfortunately splitExpressionsWithCurrentInputs is not completed yet.

I think the right description here is: whole-stage-codegen supports subexpression elimination, but we are not able to split the code for it yet.

Contributor

cc @maropu @viirya to double-check.

Contributor Author

@cloud-fan yes, sorry, this is not very well explained. I mean that whole-stage codegen does not honor spark.sql.subexpressionElimination.enabled: even when it is true, whole-stage codegen doesn't use subexprFunctions. So in this method we don't have to deal with whole-stage codegen. Do you have a suggestion for rewording this comment?

Contributor

How about: "whole-stage-codegen supports subexpression elimination, which is handled by another code path"?

Contributor Author

thanks

Member

currently looks better.

Member

looks ok, too.

@SparkQA

SparkQA commented Sep 2, 2019

Test build #110013 has finished for PR 25642 at commit 2c6b64e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 2, 2019

Test build #110019 has finished for PR 25642 at commit 6f4c524.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

df.createOrReplaceTempView("spark64kb")
val data = spark.sql("select * from spark64kb limit 10")
// This fails if 64Kb limit is reached in code generation
data.describe()
Member

Shall we disable fallback to interpreter in this test?

Contributor Author

@mgaido91 mgaido91 Sep 4, 2019

I need to revisit it according to Wenchen's suggestion... unfortunately this might take some time as I am a bit busy these days...

test("SPARK-28916: subexpression elimination can cause 64kb code limit") {
val df = spark.range(2).selectExpr((0 to 5000).map(i => s"id as field_$i"): _*)
df.createOrReplaceTempView("spark64kb")
val data = spark.sql("select * from spark64kb limit 10")
Member

Is this a test for GenerateMutableProjection? What about the GenerateUnsafeProjection case?

*/
def subexprFunctionsCode: String = {
// Whole-stage codegen's subexpression elimination is handled in another code path
splitExpressions(subexprFunctions, "subexprFunc_split", Seq("InternalRow" -> INPUT_ROW))
Member

Don't we need to check if it's empty?

if (subexprFunctions.nonEmpty) {
  splitExpressions(...
} else {
  ""
}

Contributor Author

It is not necessary: if it is empty, splitExpressions returns an empty string. Anyway, I can add the check if you think it is clearer.

@maropu
Member

maropu commented Sep 4, 2019

btw (this is off-topic, though), does the HashAggregateExec code for common subexpr elimination have the same issue? It also expands all the generated code for CSE in a single method now:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L287

@maropu
Member

maropu commented Sep 4, 2019

To make the title more precise, how about adding "for Generate[Mutable|Unsafe]Projection" to it?

@rednaxelafx
Contributor

The PR as it is now is one level better than the status quo. That's probably good enough. But I was curious whether or not it makes more sense to perform a tree-splitting instead of a fixed-level splitting.

Basically, CodegenContext.splitExpressions only performs a fixed one-level splitting, so it splits

orig_func() {
  expr1
  expr2
  expr3
  expr4
  expr5
  expr6
}

into something like the following, assuming the split threshold is an imaginary 2 expressions:

func1() { expr1; expr2 }
func2() { expr3; expr4 }
func3() { expr5; expr6 }

orig_func() {
  func1()
  func2()
  func3()
}

Now, given that we assume the split threshold is 2, after the split this top-level code is still above the split threshold, which is not good.

Instead, it'd be really nice if the splitExpressions utility method can perform tree splitting within itself, without a 1- or 2-level fixed depth split limit.

Doing so would make it more likely that we can cap the generated method size below not only 64KB but also some lower thresholds like 8KB.
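
The tree-splitting idea above can be sketched as follows. This is a hypothetical Scala illustration, not code from Spark: `TreeSplitSketch`, `treeSplit`, and the `func_l<level>_<index>` naming are assumptions made for the example, and the threshold counts statements rather than bytes.

```scala
object TreeSplitSketch {
  // Hypothetical illustration of tree splitting; not part of Spark.
  // Keeps grouping statements into helper methods, level by level, until the
  // top-level body holds at most `threshold` statements.
  // Returns (all helper method definitions, statements left in the top-level method).
  def treeSplit(
      stmts: Seq[String],
      threshold: Int,
      level: Int = 0): (List[String], List[String]) = {
    require(threshold > 1, "threshold must be at least 2 so that each level shrinks")
    if (stmts.length <= threshold) {
      // Small enough already: nothing more to split at this level.
      (Nil, stmts.toList)
    } else {
      val grouped = stmts.grouped(threshold).toList.zipWithIndex.map {
        case (chunk, idx) =>
          val name = s"func_l${level}_$idx"
          val body = chunk.map(s => "  " + s).mkString("\n")
          (s"void $name() {\n$body\n}", s"$name();")
      }
      val (defs, calls) = grouped.unzip
      // The calls to the new helpers may still exceed the threshold, so recurse on them.
      val (moreDefs, top) = treeSplit(calls, threshold, level + 1)
      (defs ++ moreDefs, top)
    }
  }
}
```

With the six expressions and the imaginary threshold of 2 from the example above, this yields three level-0 helpers, two level-1 helpers, and a top-level body containing only two calls.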

@kiszk
Member

kiszk commented Sep 4, 2019

@maropu good catch. In addition to that, this line may also produce a very large method. I think another PR can address these issues.

@kiszk
Member

kiszk commented Sep 4, 2019

@rednaxelafx I had a similar thought in mind. I was optimistic that the HotSpot compiler would apply inlining wherever possible.

It would be great if splitExpressions can handle splitting better.

@mgaido91
Contributor Author

mgaido91 commented Sep 4, 2019

btw (this is off-topic, though), does the HashAggregateExec code for common subexpr elimination have the same issue? It also expands all the generated code for CSE in a single method now:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L287

@maropu I think the point there is that with whole-stage codegen we are not able to split things at the moment, so I am not sure whether it is possible to do anything better there.

@cloud-fan
Contributor

@mgaido91 It will be possible after #20965. We can apply this patch to HashAggregateExec later.

@mgaido91
Contributor Author

mgaido91 commented Sep 4, 2019

But I was curious whether or not it makes more sense to perform a tree-splitting instead of a fixed-level splitting.

@rednaxelafx actually we are doing something similar to what you are suggesting. We already support splitting at two or more levels. The point is that the split points are determined by the inner classes. In other words, we currently assume that the number of function calls that fits in a single inner class is a safe threshold for the number of function calls inside a single function. This is not exactly what you are proposing, as it is not driven by the method size conf, but it is close. For more details please check generateInnerClassesFunctionCalls and SPARK-22226.

@mgaido91 mgaido91 changed the title from "[SPARK-28916][SQL] Split subexpression elimination functions code" to "[SPARK-28916][SQL] Split subexpression elimination functions code for Generate[Mutable|Unsafe]Projection" Sep 4, 2019
@SparkQA

SparkQA commented Sep 6, 2019

Test build #110259 has finished for PR 25642 at commit 2d4b8f8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Sep 6, 2019

retest this please

@SparkQA

SparkQA commented Sep 7, 2019

Test build #110267 has finished for PR 25642 at commit 2d4b8f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in c411579 Sep 9, 2019
PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
… Generate[Mutable|Unsafe]Projection

Closes apache#25642 from mgaido91/SPARK-28916.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
moritzmeister pushed a commit to moritzmeister/spark that referenced this pull request Jul 23, 2020
… Generate[Mutable|Unsafe]Projection

(cherry picked from commit c411579)
(cherry picked from commit 50c88609dc7e7dea6f747f4edc7b6349c0e2d644)