Skip to content

Conversation

@mgaido91
Copy link
Contributor

@mgaido91 mgaido91 commented Oct 6, 2017

What changes were proposed in this pull request?

SPARK-18016 introduced an arbitrary threshold for the size of a generated class (

). This value is hardcoded.
In some cases making it smaller can help avoiding the error of exceeding the maximum number of entries in the Constant Pool.
This PR introduces a new configuration parameter, which defaults to the previous value, but it allows to set this to a smaller one if needed.

How was this patch tested?

manual tests

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

// threshold of 1600k bytes to determine when a function should be inlined to a private, nested
// sub-class.
// threshold to determine when a function should be inlined to a private, nested sub-class
val generatedClassLengthThreshold = SparkEnv.get.conf.getInt(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you ensure SparkEnv.get is always true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be empty? In

there is no check about it being None...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code that you pointed out is not in CodeGenerator.scala. In my case at CodeGenerator.scala, there are some cases that SparkEnv.get is null. If you have not seen this null in you case, it would be good.
@gatorsmile will implement a better approach to get this conf soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// threshold to determine when a function should be inlined to a private, nested sub-class
val generatedClassLengthThreshold = SparkEnv.get.conf.getInt(
SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.key,
SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.defaultValue.get)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you simplify the above three lines into the following?

val generatedClassLengthThreshold = SparkEnv.get.conf.get(
  SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD)


val GENERATED_CLASS_LENGTH_THRESHOLD =
buildConf("spark.sql.codegen.generatedClass.size.threshold")
.doc("Threshold in bytes for the size of a generated class. If the generated class " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add .internal(). I think this is not a parameter for users.

Copy link
Contributor Author

@mgaido91 mgaido91 Oct 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my ignorance. If users cannot set it, how can this parameter be changed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not a frequently-used parameter for users. Also, other JVM implimentation-specific parameters are internal, e.g., WHOLESTAGE_HUGE_METHOD_LIMIT. cc: @rednaxelafx

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you, but might you please explain me how to set a parameter which is internal? Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users can change this value in the same way with other public parameters, but internal just means that parameters are not explicitly exposed to users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the explanation, I'll add it immediately.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Oct 6, 2017

thank you for your comments @kiszk and @dongjoon-hyun, I changed a bit the approach according to a similar approach in the same file: https://github.com/mgaido91/spark/blob/c69be31314d9aa96c3920073beaf7cca46d507fa/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L1079

Is it good to you like this?

@dongjoon-hyun
Copy link
Member

Could you add a test case for this? Since it's configurable, you can set a small number and catch some exceptions?

@mgaido91
Copy link
Contributor Author

mgaido91 commented Oct 6, 2017

@dongjoon-hyun I am not sure how I can test it. The use case in which this was useful is quite complex and I have not been able to reproduce it in a simpler way.

@dongjoon-hyun
Copy link
Member

Do you mean Spark will work with small value like 1000?

"size is higher of this value, a private nested class is created and used." +
"This is useful to limit the number of named constants in the class " +
"and therefore its Constant Pool. The default is 1600k.")
.intConf
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add .checkValue here with a valid minimum and maximum.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Oct 6, 2017

Yes, with small values it will produce a lot of small NestedClasses, but it will work. Instead, if the value is too high this, all the functions (methods) which are created are inlined in the same class and this can cause an Exception because of exceeding the maximum number of entries in the Constant Pool for the class.

@gatorsmile
Copy link
Member

Instead of simply introducing a new conf, we need to answer three questions; otherwise, this is not useful to the users.

  • What is the perf impact of this conf?
  • When should users increase/reduce the conf value?
  • Should we adjust the existing default values?

@mgaido91
Copy link
Contributor Author

mgaido91 commented Oct 6, 2017

Here the answers to your questions @gatorsmile , please tell me if I need to elaborate more deeply.
This conf controls how many inner classes are generated. A big value means that we will have few inner classes and that the outer class and the inner classes will have a lot of methods. A small one means a lot of classes with few values. So,
1 - I don't think there is a big perf impact. Maybe, if this value is very small, we might have a lot nested class instances and therefore a waste of memory. But since these classes don't have any variable, this is quite negligible to me.
2 - If the user encounters an Exception like

Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass has grown past JVM limit of 0xFFFF

then he/she must reduce this value
3 - I don't think so, because I have not been able to reproduce the problem other than in a very complex use case. So probably that use case is a corner case which can benefit by this change.

@gatorsmile
Copy link
Member

We still need to add a test case. We also should capture the exception and issue a better one; otherwise, users will not know what they should do when they hit such a confusing error.

cc @kiszk @rednaxelafx Could you take a look at this PR? Do we still hit this using the current hard-coded value? If so, what is the easy way to write a test case?

@maropu
Copy link
Member

maropu commented Oct 7, 2017

How about making a separated function for checking the threshold and then test it like #18810: https://github.com/apache/spark/pull/18810/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR363

@kiszk
Copy link
Member

kiszk commented Oct 7, 2017

@gatorsmile I am happy to take it. We have one test case that cause the error JVM limit of 0xFFFF. I am not sure this new option may help the test case or not. I will try it this long weekend (Monday is a holiday in Japan).

Ideally, we want to find better metric rather than string length of the class file.
While constant pool entry is used for various purposes, I imagine that the majorities in this cause is CONSTANT_Fieldref. (i.e. instance variable). It may be helpful to watch # of instance variables in CodegenContext.mutableStates. cc: @rednaxelafx

@maropu it would be good to make currClassSize more precise.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Oct 7, 2017

Yes, what I can do as a test case is that I can check how many inner classes are generated changing the configuration value, is it ok? Or doing what @maropu suggested. If these approaches are ok, I can go on with them.

@kiszk
Copy link
Member

kiszk commented Oct 7, 2017

@mgaido91 When I reduced this value to 10000, this test case causes an exception JVM limit of 0xFFFF. Unfortunately, my proposal to watch # of CodegenContext.mutableStates did not work, too. This is because to generate nested classes does not reduce the number of instance variables (more than 64000 in this test case) in the top class.

I would appreciate it if you would double-check this to avoid possibility of my mistake. If you could not find this PR can help this test case, could you please clarify what cases can be resolved by this PR by using test cases? Then, we can discuss whether this PR would be useful for users or not.

What do you think?

@mgaido91
Copy link
Contributor Author

mgaido91 commented Oct 7, 2017

Thanks for your comment @kiszk. It make sense, actually, that a very small number like 10000 can cause exceptions. I will explain later why this happens, now I will start with a brief introduction about the problem, making a brief summary of my knowledge of the topic.

Code generation can create too many entries for the constant pool of the generated classes. This happens when we have thousands of columns. Many things can add new entries in the constant pool. After the PR you referenced for SPARK-18016, we have many constant pool: one for the outer class and one for each of the inner classes. In particular, in Spark code generation, the problem is that in the constant pool there are entries for:

1 - each attribute declared for a class: all the attributes are added to the outer class. Thus, here we have a limit. We should split them as well, but this is out of my focus now.
2 - each method declared for a class: the number of methods is limited by splitting them among the inner classes. This is controlled by the "number" which is the subject of this PR.
3 - each method which is referenced in a class: at the moment, even though the methods are declared in the inner classes, they are always used in the outer class. This is also a limit.

The 1 point explains you the reason why a very low number like 10000 can create the problem. Since in the outer class we have an attribute for each nested class, if we have a lot of them we can reach the limit. With the current value or similar values (in the use case I am working on I set it to 1000000), even with thousands of columns the number of inner classes is very small, thus this is not a problem. But since we have multiple fields for each column, the number of columns is limited by this problem.

In the use case I am working on I hit the problem 2 and 3, in the reverse order actually. I have a fix for the 3 too, even though I have not yet created a PR because I was not able to reproduce that problem either is a simple way which can be used in a UT and I was not sure about which might be the right approach for creating a PR for it (if someone has suggestion for this, please I'd really appreciate some advice). As I told, the use case is rather complex, and all my trials to reproduce it in an easy way were unsuccessful, since I always hit the 1 issue.

Then, after this very large introduction, I hope that it is clear that this PR's goal is to address those cases in which we have a lot of very small functions and this is the driver to reach the constant pool's limit. So far I have been unsuccessful reproducing this issue is an easy way, I am very sorry.

What I would be able to do, in a test case, is to look for the number of inner classes which are created (CodegenContext.classes): with a lower number they should be more that with a bigger one. If this is ok as UT I can do that.

Hope that this was clear enough. Please, don't hesitate asking for clarifications if needed.

Thanks.

@kiszk
Copy link
Member

kiszk commented Oct 7, 2017

Thank you for your clarification. I understand that you originally addressed JVM limit of 0xFFFF caused by a lot of small methods.

For 2, my idea is that it would be good to monitor the number of method entries and to split the class if it is large (e.g. 16384). Does it work for your case?
if (currClassSize > 1600000 || (the number of method entries [= # of new function names] > 16384))
For 3, you already have a solution (hopefully without introducing new config).

Finally, we could control # of entries for methods in problems 2 and 3 without new config. What do you think?

When we submit a PR, I think that it is necessary to have test case. We hope that you can create it. If it is hard to write a program to cause the issue, you can directly call internal APIs (e.g. CodeGeneratorSuite).

@mgaido91
Copy link
Contributor Author

mgaido91 commented Oct 7, 2017

Thanks @kiszk, that would most probably work, but I am not sure about which "number", which "limit" to set there. Moreover, that number depends also on the number of other items which contributes to add entries to the constant pool. This is the reason why I thought that a configuration might have been better, being more flexible.

For the problem 3, no new configuration. I am just adding a new method to the inner classes which wraps all the small methods, in this way they don't contribute to make the outer class constant pool growing. I know that a UT is necessary. Unfortunately so far I have not been able to create it. I will try with lower level APIs and I hope I'll be able to do that. Any suggestion of course is very welcomed. If I manage to find a UT, should I create a PR as a followup of SPARK-18016 or create a new ticket for it?

Thanks.

@kiszk
Copy link
Member

kiszk commented Oct 7, 2017

About which "number", which "limit" to set there, we want to see what problem happens in test cases or your precise description (what class for code generation causes a lot of small methods while the number of instance variables are not very large under a certain condition). Then, we would discuss what is a better solution.

For the PR for the problem 3, I think that it would be good to update this PR or create a new PR.

@maropu
Copy link
Member

maropu commented Oct 9, 2017

I feel it is a bit annoying to add a parameters for each Constant Pool issue and we better look for solutions so that less parameters (e.g., other metrics as @kiszk suggested) can almost solve the issue. I think splitting classes is not a silver bullet; if we can't compile gen'd code, we should fail over interpreted mode (this is filed in SPARK-21320).

// threshold of 1600k bytes to determine when a function should be inlined to a private, nested
// sub-class.
// threshold to determine when a function should be inlined to a private, nested sub-class
val generatedClassLengthThreshold = SQLConf.get.generatedClassLengthThreshold
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know this discussion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the interesting link. I assume using SparkEnv has the same problem, hasn't it?

@mgaido91
Copy link
Contributor Author

mgaido91 commented Nov 3, 2017

I am closing this because as @kiszk pointed out in his comment, there is no reliable way to get SQLConf here.

@mgaido91 mgaido91 closed this Nov 3, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants