[SPARK-22215][SQL] Add configuration to set the threshold for generated class #19447
Conversation
Can one of the admins verify this patch?
```diff
- // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
- // sub-class.
+ // threshold to determine when a function should be inlined to a private, nested sub-class
+ val generatedClassLengthThreshold = SparkEnv.get.conf.getInt(
```
Can you ensure SparkEnv.get is always non-null?
Can this be empty? In spark/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java (line 80 in 83488cc):

```java
final SparkEnv sparkEnv = SparkEnv.get();
```
The code that you pointed out is not in CodeGenerator.scala. In my case, in CodeGenerator.scala, there are some cases where SparkEnv.get is null. If you have not seen this null in your case, that is good.
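A minimal null-safe sketch of what guarding against that could look like (my own illustration, not the PR's final code; it falls back to the conf's default when SparkEnv.get returns null):

```scala
import org.apache.spark.SparkEnv
import org.apache.spark.sql.internal.SQLConf

// Sketch: read the threshold defensively, falling back to the default
// when SparkEnv.get is null (e.g. in contexts where no SparkEnv is set).
val generatedClassLengthThreshold = Option(SparkEnv.get)
  .map(_.conf.getInt(
    SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.key,
    SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.defaultValue.get))
  .getOrElse(SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.defaultValue.get)
```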
@gatorsmile will implement a better approach to get this conf soon.
```scala
// threshold to determine when a function should be inlined to a private, nested sub-class
val generatedClassLengthThreshold = SparkEnv.get.conf.getInt(
  SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.key,
  SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.defaultValue.get)
```
Could you simplify the above three lines into the following?

```scala
val generatedClassLengthThreshold = SparkEnv.get.conf.get(
  SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD)
```
```scala
val GENERATED_CLASS_LENGTH_THRESHOLD =
  buildConf("spark.sql.codegen.generatedClass.size.threshold")
    .doc("Threshold in bytes for the size of a generated class. If the generated class " +
```
Please add .internal(). I think this is not a parameter for users.
Sorry for my ignorance. If users cannot set it, how can this parameter be changed?
I think this is not a frequently-used parameter for users. Also, other JVM implementation-specific parameters are internal, e.g., WHOLESTAGE_HUGE_METHOD_LIMIT. cc: @rednaxelafx
I agree with you, but could you please explain to me how to set a parameter which is internal? Thanks.
Users can change this value in the same way as other public parameters; internal just means that the parameter is not explicitly exposed to users.
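In other words, an internal conf can still be set like any public conf. A sketch (assuming an active SparkSession named `spark`; 1000000 is just the example value mentioned later in this thread):

```scala
// Internal confs are simply undocumented for end users; they can still be
// set like any other conf on the session:
spark.conf.set("spark.sql.codegen.generatedClass.size.threshold", "1000000")

// or when submitting the application:
// spark-submit --conf spark.sql.codegen.generatedClass.size.threshold=1000000 ...
```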
thanks for the explanation, I'll add it immediately.
thank you for your comments @kiszk and @dongjoon-hyun, I changed the approach a bit, following a similar approach in the same file: https://github.com/mgaido91/spark/blob/c69be31314d9aa96c3920073beaf7cca46d507fa/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L1079 Is it good for you like this?
Could you add a test case for this? Since it's configurable, you can set a small number and catch some exceptions?
@dongjoon-hyun I am not sure how I can test it. The use case in which this was useful is quite complex and I have not been able to reproduce it in a simpler way.
Do you mean Spark will work with a small value like 1000?
| "size is higher of this value, a private nested class is created and used." + | ||
| "This is useful to limit the number of named constants in the class " + | ||
| "and therefore its Constant Pool. The default is 1600k.") | ||
| .intConf |
Please add .checkValue here with a valid minimum and maximum.
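For reference, a sketch of how `.checkValue` could be attached to the conf above (the bounds here are my illustrative assumptions, not values agreed on in this thread):

```scala
val GENERATED_CLASS_LENGTH_THRESHOLD =
  buildConf("spark.sql.codegen.generatedClass.size.threshold")
    .internal()
    .doc("Threshold in bytes for the size of a generated class. If the generated class " +
      "size is higher than this value, a private nested class is created and used. " +
      "This is useful to limit the number of named constants in the class " +
      "and therefore its Constant Pool. The default is 1600k.")
    .intConf
    // Hypothetical bounds: the threshold must be positive and should not
    // exceed the previous hard-coded 1600k default.
    .checkValue(t => t > 0 && t <= 1600000, "The threshold must be in (0, 1600000].")
    .createWithDefault(1600000)
```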
Yes, with small values it will produce a lot of small nested classes.
Instead of simply introducing a new conf, we need to answer three questions; otherwise, this is not useful to users.
Here are the answers to your questions @gatorsmile; please tell me if I need to elaborate more deeply. … then he/she must reduce this value.
We still need to add a test case. We should also capture the exception and issue a better one; otherwise, users will not know what they should do when they hit such a confusing error. cc @kiszk @rednaxelafx Could you take a look at this PR? Do we still hit this using the current hard-coded value? If so, what is the easy way to write a test case?
How about making a separate function for checking the threshold and then testing it like #18810: https://github.com/apache/spark/pull/18810/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR363
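A minimal sketch of that idea (the object, function, and values here are hypothetical, showing only the shape of the refactoring):

```scala
// Hypothetical refactoring: isolate the inlining decision in a tiny helper
// so a unit test can exercise it directly instead of compiling huge classes.
object CodegenThresholds {
  def shouldCreateNestedClass(currentClassSize: Int, threshold: Int): Boolean =
    currentClassSize > threshold
}

// A test can then cover both sides of the boundary cheaply:
assert(CodegenThresholds.shouldCreateNestedClass(currentClassSize = 1600001, threshold = 1600000))
assert(!CodegenThresholds.shouldCreateNestedClass(currentClassSize = 1600000, threshold = 1600000))
```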
@gatorsmile I am happy to take it. We have one test case that causes the error … Ideally, we want to find a better metric than the string length of the class file. @maropu it would be good to make …
Yes, what I can do as a test case is check how many inner classes are generated when changing the configuration value; is that ok? Or do what @maropu suggested. If these approaches are ok, I can go on with them.
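A rough sketch of that counting idea (the helper is hypothetical; a real test would still need a way to capture the generated source text):

```scala
// Hypothetical check: count nested class declarations in the generated
// source. With a lower threshold, the same query should produce more
// nested classes than with the 1600k default.
def countNestedClasses(generatedCode: String): Int =
  """private class \w+""".r.findAllMatchIn(generatedCode).length
```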
@mgaido91 When I reduced this value to 10000, this test case caused an exception. I would appreciate it if you would double-check this to avoid the possibility of a mistake on my part. If you cannot find that this PR helps this test case, could you please clarify, by using test cases, what cases can be resolved by this PR? Then we can discuss whether this PR would be useful for users or not. What do you think?
Thanks for your comment @kiszk. It makes sense, actually, that a very small number like 10000 can cause exceptions. I will explain later why this happens; let me start with a brief introduction to the problem, summarizing my knowledge of the topic.

Code generation can create too many entries in the constant pool of the generated classes. This happens when we have thousands of columns. Many things can add new entries to the constant pool. After the PR you referenced for SPARK-18016, we have many constant pools: one for the outer class and one for each of the inner classes. In particular, in Spark code generation, the constant pool gets entries for:

1 - each attribute declared for a class: all the attributes are added to the outer class, so here we have a limit. We should split them as well, but this is out of my focus now.
2 - …
3 - …

Point 1 explains why a very low number like 10000 can create the problem: since the outer class has an attribute for each nested class, if we have a lot of them we can reach the limit. With the current value or similar values (in the use case I am working on I set it to 1000000), even with thousands of columns the number of inner classes is very small, so this is not a problem. But since we have multiple fields for each column, the number of columns is limited by this problem.

In the use case I am working on I hit problems 2 and 3, in the reverse order actually. I have a fix for 3 too, even though I have not yet created a PR, because I was not able to reproduce that problem in a simple way usable in a UT, and I was not sure about the right approach for creating a PR for it (if someone has a suggestion for this, I'd really appreciate some advice). As I said, the use case is rather complex, and all my attempts to reproduce it in an easy way were unsuccessful, since I always hit issue 1.

After this very long introduction, I hope it is clear that this PR's goal is to address those cases in which we have a lot of very small functions, and this is the driver to reach the constant pool's limit. So far I have been unsuccessful in reproducing this issue in an easy way, I am very sorry. What I would be able to do in a test case is check the number of inner classes which are created (…).

I hope this was clear enough. Please don't hesitate to ask for clarifications if needed. Thanks.
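For a sense of scale (the 65,535-entry cap is the JVM's actual constant pool limit; the fields-per-column count is a hypothetical illustration, not a measured number):

```scala
// A class file indexes its constant pool with a 16-bit value, so a single
// class can hold at most 65535 entries.
val constantPoolLimit = 65535

// If codegen emitted, say, 3 outer-class fields per column, the column
// count alone would hit the limit well before "thousands of columns"
// stops being a realistic schema size:
val hypotheticalFieldsPerColumn = 3
val roughColumnCeiling = constantPoolLimit / hypotheticalFieldsPerColumn // ~21845
```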
Thank you for your clarification. I understand that you originally addressed …

For 2, my idea is that it would be good to monitor the number of method entries and to split the class if it is large (e.g. 16384). Does that work for your case? Finally, we could control the number of method entries in problems 2 and 3 without a new config. What do you think?

When we submit a PR, I think it is necessary to have a test case; we hope that you can create it. If it is hard to write a program that causes the issue, you can directly call internal APIs (e.g. …).
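A sketch of that method-count alternative (all names here are hypothetical; 16384 is the example limit from the comment above):

```scala
// Hypothetical: split into a fresh nested class once the current one holds
// too many methods, instead of splitting by accumulated source length.
final case class NestedClassState(name: String, methodCount: Int)

val maxMethodsPerClass = 16384 // example value from the comment above

def placeNextMethod(current: NestedClassState): NestedClassState =
  if (current.methodCount >= maxMethodsPerClass)
    NestedClassState(current.name + "Next", methodCount = 1) // start a new class
  else
    current.copy(methodCount = current.methodCount + 1)
```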
Thanks @kiszk, that would most probably work, but I am not sure about which "number", which "limit", to set there. Moreover, that number also depends on the number of other items which contribute to adding entries to the constant pool. This is the reason why I thought that a configuration might be better, being more flexible. For problem 3, no new configuration: I am just adding a new method to the inner classes which wraps all the small methods, so they don't contribute to making the outer class's constant pool grow. I know that a UT is necessary. Unfortunately, so far I have not been able to create it. I will try with lower-level APIs and I hope I'll be able to do that. Any suggestion is of course very welcome. If I manage to create a UT, should I open a PR as a follow-up of SPARK-18016 or create a new ticket for it? Thanks.
About which "number", which "limit" to set there, we want to see what problem happens in test cases or your precise description (what class for code generation causes a lot of small methods while the number of instance variables are not very large under a certain condition). Then, we would discuss what is a better solution. For the PR for the problem 3, I think that it would be good to update this PR or create a new PR. |
I feel it is a bit annoying to add a parameter for each constant pool issue, and we had better look for solutions where fewer parameters (e.g., other metrics, as @kiszk suggested) can mostly solve the issue. I think splitting classes is not a silver bullet; if we can't compile generated code, we should fall back to interpreted mode (this is filed in SPARK-21320).
```diff
- // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
- // sub-class.
+ // threshold to determine when a function should be inlined to a private, nested sub-class
+ val generatedClassLengthThreshold = SQLConf.get.generatedClassLengthThreshold
```
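For context, `SQLConf.get` returns the SQLConf of the active session, so this version also implies a typed getter on SQLConf. A sketch of what that getter would look like, given the conf entry above:

```scala
// In SQLConf (sketch): expose the entry through a typed getter, mirroring
// how other codegen-related confs are surfaced.
def generatedClassLengthThreshold: Int = getConf(GENERATED_CLASS_LENGTH_THRESHOLD)
```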
Good to know about this discussion.
thanks for the interesting link. I assume using SparkEnv has the same problem, doesn't it?
I am closing this because, as @kiszk pointed out in his comment, there is no reliable way to get …
What changes were proposed in this pull request?
SPARK-18016 introduced an arbitrary threshold for the size of a generated class (spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala, line 286 in 83488cc).
In some cases, making it smaller can help avoid the error of exceeding the maximum number of entries in the constant pool. This PR introduces a new configuration parameter, which defaults to the previous value but allows setting it to a smaller one if needed.
How was this patch tested?
manual tests