[SPARK-22215][SQL] Add configuration to set the threshold for generated class #19447
Conversation
Can one of the admins verify this patch?
```diff
- // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
- // sub-class.
+ // threshold to determine when a function should be inlined to a private, nested sub-class
+ val generatedClassLengthThreshold = SparkEnv.get.conf.getInt(
```
Can you ensure SparkEnv.get is always non-null?
Can this be empty? In spark/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java (line 80 in 83488cc):

```java
final SparkEnv sparkEnv = SparkEnv.get();
```
The code that you pointed out is not in CodeGenerator.scala. In my case, in CodeGenerator.scala, there are some cases where SparkEnv.get is null. If you have not seen this null in your case, that is good.
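A minimal null-safe sketch of what guarding against that could look like (my own illustration, not the PR's final code; it falls back to the conf's default when SparkEnv.get returns null):

```scala
import org.apache.spark.SparkEnv
import org.apache.spark.sql.internal.SQLConf

// Sketch: read the threshold defensively, falling back to the default
// when SparkEnv.get is null (e.g. in contexts where no SparkEnv is set).
val generatedClassLengthThreshold = Option(SparkEnv.get)
  .map(_.conf.getInt(
    SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.key,
    SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.defaultValue.get))
  .getOrElse(SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.defaultValue.get)
```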
@gatorsmile will implement a better approach to get this conf soon.
```scala
// threshold to determine when a function should be inlined to a private, nested sub-class
val generatedClassLengthThreshold = SparkEnv.get.conf.getInt(
  SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.key,
  SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.defaultValue.get)
```
Could you simplify the above three lines into the following?

```scala
val generatedClassLengthThreshold = SparkEnv.get.conf.get(
  SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD)
```
```scala
val GENERATED_CLASS_LENGTH_THRESHOLD =
  buildConf("spark.sql.codegen.generatedClass.size.threshold")
    .doc("Threshold in bytes for the size of a generated class. If the generated class " +
```
Please add .internal(). I think this is not a parameter for users.
Sorry for my ignorance. If users cannot set it, how can this parameter be changed?
I think this is not a frequently-used parameter for users. Also, other JVM implementation-specific parameters are internal, e.g., WHOLESTAGE_HUGE_METHOD_LIMIT. cc: @rednaxelafx
I agree with you, but could you please explain to me how to set a parameter which is internal? Thanks.
Users can change this value in the same way as other public parameters; internal just means that the parameter is not explicitly exposed to users.
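In other words, an internal conf can still be set like any public conf. A sketch (assuming an active SparkSession named `spark`; 1000000 is just the example value mentioned later in this thread):

```scala
// Internal confs are simply undocumented for end users; they can still be
// set like any other conf on the session:
spark.conf.set("spark.sql.codegen.generatedClass.size.threshold", "1000000")

// or when submitting the application:
// spark-submit --conf spark.sql.codegen.generatedClass.size.threshold=1000000 ...
```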
thanks for the explanation, I'll add it immediately.
thank you for your comments @kiszk and @dongjoon-hyun, I changed the approach a bit, following a similar approach in the same file: https://github.com/mgaido91/spark/blob/c69be31314d9aa96c3920073beaf7cca46d507fa/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L1079 Is it good for you like this?
Could you add a test case for this? Since it's configurable, you can set a small number and catch some exceptions?
@dongjoon-hyun I am not sure how I can test it. The use case in which this was useful is quite complex and I have not been able to reproduce it in a simpler way.
Do you mean Spark will work with a small value like 1000?
| "size is higher of this value, a private nested class is created and used." + | ||
| "This is useful to limit the number of named constants in the class " + | ||
| "and therefore its Constant Pool. The default is 1600k.") | ||
| .intConf |
Please add .checkValue here with a valid minimum and maximum.
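For reference, a sketch of how `.checkValue` could be attached to the conf above (the bounds here are my illustrative assumptions, not values agreed on in this thread):

```scala
val GENERATED_CLASS_LENGTH_THRESHOLD =
  buildConf("spark.sql.codegen.generatedClass.size.threshold")
    .internal()
    .doc("Threshold in bytes for the size of a generated class. If the generated class " +
      "size is higher than this value, a private nested class is created and used. " +
      "This is useful to limit the number of named constants in the class " +
      "and therefore its Constant Pool. The default is 1600k.")
    .intConf
    // Hypothetical bounds: the threshold must be positive and should not
    // exceed the previous hard-coded 1600k default.
    .checkValue(t => t > 0 && t <= 1600000, "The threshold must be in (0, 1600000].")
    .createWithDefault(1600000)
```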
Yes, with small values it will produce a lot of small nested classes.
Instead of simply introducing a new conf, we need to answer three questions; otherwise, this is not useful to users.
Here are the answers to your questions @gatorsmile; please tell me if I need to elaborate more deeply. … then he/she must reduce this value.
We still need to add a test case. We should also capture the exception and issue a better one; otherwise, users will not know what they should do when they hit such a confusing error. cc @kiszk @rednaxelafx Could you take a look at this PR? Do we still hit this using the current hard-coded value? If so, what is the easy way to write a test case?
How about making a separate function for checking the threshold and then testing it like #18810: https://github.com/apache/spark/pull/18810/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR363
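A minimal sketch of that idea (the object, function, and values here are hypothetical, showing only the shape of the refactoring):

```scala
// Hypothetical refactoring: isolate the inlining decision in a tiny helper
// so a unit test can exercise it directly instead of compiling huge classes.
object CodegenThresholds {
  def shouldCreateNestedClass(currentClassSize: Int, threshold: Int): Boolean =
    currentClassSize > threshold
}

// A test can then cover both sides of the boundary cheaply:
assert(CodegenThresholds.shouldCreateNestedClass(currentClassSize = 1600001, threshold = 1600000))
assert(!CodegenThresholds.shouldCreateNestedClass(currentClassSize = 1600000, threshold = 1600000))
```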
@gatorsmile I am happy to take it. We have one test case that causes the error … Ideally, we want to find a better metric than the string length of the class file. @maropu it would be good to make …
Yes, what I can do as a test case is check how many inner classes are generated when changing the configuration value; is that ok? Or do what @maropu suggested. If these approaches are ok, I can go on with them.
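A rough sketch of that counting idea (the helper is hypothetical; a real test would still need a way to capture the generated source text):

```scala
// Hypothetical check: count nested class declarations in the generated
// source. With a lower threshold, the same query should produce more
// nested classes than with the 1600k default.
def countNestedClasses(generatedCode: String): Int =
  """private class \w+""".r.findAllMatchIn(generatedCode).length
```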
@mgaido91 When I reduced this value to 10000, this test case caused an exception. I would appreciate it if you would double-check this to avoid the possibility of a mistake on my part. If you cannot find that this PR helps this test case, could you please clarify, by using test cases, what cases can be resolved by this PR? Then we can discuss whether this PR would be useful for users or not. What do you think?
Thanks for your comment @kiszk. It makes sense, actually, that a very small number like 10000 can cause exceptions. I will explain later why this happens; let me start with a brief introduction to the problem, summarizing my knowledge of the topic.

Code generation can create too many entries in the constant pool of the generated classes. This happens when we have thousands of columns. Many things can add new entries to the constant pool. After the PR you referenced for SPARK-18016, we have many constant pools: one for the outer class and one for each of the inner classes. In particular, in Spark code generation, the constant pool gets entries for:

1 - each attribute declared for a class: all the attributes are added to the outer class, so here we have a limit. We should split them as well, but this is out of my focus now.
2 - …
3 - …

Point 1 explains why a very low number like 10000 can create the problem: since the outer class has an attribute for each nested class, if we have a lot of them we can reach the limit. With the current value or similar values (in the use case I am working on I set it to 1000000), even with thousands of columns the number of inner classes is very small, so this is not a problem. But since we have multiple fields for each column, the number of columns is limited by this problem.

In the use case I am working on I hit problems 2 and 3, in the reverse order actually. I have a fix for 3 too, even though I have not yet created a PR, because I was not able to reproduce that problem in a simple way usable in a UT, and I was not sure about the right approach for creating a PR for it (if someone has a suggestion for this, I'd really appreciate some advice). As I said, the use case is rather complex, and all my attempts to reproduce it in an easy way were unsuccessful, since I always hit issue 1.

After this very long introduction, I hope it is clear that this PR's goal is to address those cases in which we have a lot of very small functions, and this is the driver to reach the constant pool's limit. So far I have been unsuccessful in reproducing this issue in an easy way, I am very sorry. What I would be able to do in a test case is check the number of inner classes which are created (…).

I hope this was clear enough. Please don't hesitate to ask for clarifications if needed. Thanks.
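For a sense of scale (the 65,535-entry cap is the JVM's actual constant pool limit; the fields-per-column count is a hypothetical illustration, not a measured number):

```scala
// A class file indexes its constant pool with a 16-bit value, so a single
// class can hold at most 65535 entries.
val constantPoolLimit = 65535

// If codegen emitted, say, 3 outer-class fields per column, the column
// count alone would hit the limit well before "thousands of columns"
// stops being a realistic schema size:
val hypotheticalFieldsPerColumn = 3
val roughColumnCeiling = constantPoolLimit / hypotheticalFieldsPerColumn // ~21845
```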
Thank you for your clarification. I understand that you originally addressed …

For 2, my idea is that it would be good to monitor the number of method entries and to split the class if it is large (e.g. 16384). Does that work for your case? Finally, we could control the number of method entries in problems 2 and 3 without a new config. What do you think?

When we submit a PR, I think it is necessary to have a test case; we hope that you can create it. If it is hard to write a program that causes the issue, you can directly call internal APIs (e.g. …).
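A sketch of that method-count alternative (all names here are hypothetical; 16384 is the example limit from the comment above):

```scala
// Hypothetical: split into a fresh nested class once the current one holds
// too many methods, instead of splitting by accumulated source length.
final case class NestedClassState(name: String, methodCount: Int)

val maxMethodsPerClass = 16384 // example value from the comment above

def placeNextMethod(current: NestedClassState): NestedClassState =
  if (current.methodCount >= maxMethodsPerClass)
    NestedClassState(current.name + "Next", methodCount = 1) // start a new class
  else
    current.copy(methodCount = current.methodCount + 1)
```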
Thanks @kiszk, that would most probably work, but I am not sure about which "number", which "limit", to set there. Moreover, that number also depends on the number of other items which contribute to adding entries to the constant pool. This is the reason why I thought that a configuration might be better, being more flexible. For problem 3, no new configuration: I am just adding a new method to the inner classes which wraps all the small methods, so they don't contribute to making the outer class's constant pool grow. I know that a UT is necessary. Unfortunately, so far I have not been able to create it. I will try with lower-level APIs and I hope I'll be able to do that. Any suggestion is of course very welcome. If I manage to create a UT, should I open a PR as a follow-up of SPARK-18016 or create a new ticket for it? Thanks.
About which "number", which "limit" to set there, we want to see what problem happens in test cases or your precise description (what class for code generation causes a lot of small methods while the number of instance variables are not very large under a certain condition). Then, we would discuss what is a better solution. For the PR for the problem 3, I think that it would be good to update this PR or create a new PR. |
I feel it is a bit annoying to add a parameter for each constant pool issue, and we had better look for solutions where fewer parameters (e.g., other metrics, as @kiszk suggested) can mostly solve the issue. I think splitting classes is not a silver bullet; if we can't compile generated code, we should fall back to interpreted mode (this is filed in SPARK-21320).
```diff
- // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
- // sub-class.
+ // threshold to determine when a function should be inlined to a private, nested sub-class
+ val generatedClassLengthThreshold = SQLConf.get.generatedClassLengthThreshold
```
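For context, `SQLConf.get` returns the SQLConf of the active session, so this version also implies a typed getter on SQLConf. A sketch of what that getter would look like, given the conf entry above:

```scala
// In SQLConf (sketch): expose the entry through a typed getter, mirroring
// how other codegen-related confs are surfaced.
def generatedClassLengthThreshold: Int = getConf(GENERATED_CLASS_LENGTH_THRESHOLD)
```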
Good to know about this discussion.
thanks for the interesting link. I assume using SparkEnv has the same problem, doesn't it?
I am closing this because, as @kiszk pointed out in his comment, there is no reliable way to get …
What changes were proposed in this pull request?
SPARK-18016 introduced an arbitrary threshold for the size of a generated class (spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala, line 286 in 83488cc).
In some cases, making it smaller can help avoid the error of exceeding the maximum number of entries in the constant pool. This PR introduces a new configuration parameter, which defaults to the previous value but allows setting it to a smaller one if needed.
How was this patch tested?
manual tests