-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-20871][SQL] limit logging of Janino code #18658
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Jenkins this is ok to test |
|
I don't have super strong opinions here, but in my experience its not always easy to get users to rerun a failed query with a different logging level. Have we considered truncating or special casing the 64k limitation instead? |
|
@marmbrus How about something like?: Users with info logging enabled would get the code output when an issue arises but users who don't want to log the code could apply a log level of Warn or Error to the class. |
|
That seems reasonable. I'm kind of pro-truncation for very very large code. Even though its not great to have something truncated, outputting GBs of logs is also pretty bad for downstream consumers. |
|
Test build #79677 has finished for PR 18658 at commit
|
| // Only add extra debugging info to byte code when we are going to print the source code. | ||
| evaluator.setDebuggingInformation(true, true, false) | ||
| s"\n$formatted" | ||
| s"\n${CodeFormatter.format(code)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest dropping the string concatenation with \n here -- it requires an additional copy of the code to be held in-memory and for errors where the code is too long, this causes unnecessary additional pressure on the heap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
FYI for future reviewers as well, we've been running an extremely similar patch to PJ's on our distribution of Spark for the past several months and had no problems. Without this change, a code compilation failure often escalates into a heap OOM as the too-large source code is replicated in memory through the log statement. |
|
Test build #79683 has finished for PR 18658 at commit
|
| val msg = s"failed to compile: $e\n$formatted" | ||
| val msg = s"failed to compile: $e" | ||
| logError(msg, e) | ||
| logInfo(s"\n${CodeFormatter.format(code, 1000)}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1000 -> maxLines = 1000
| val msg = s"failed to compile: $e\n$formatted" | ||
| val msg = s"failed to compile: $e" | ||
| logError(msg, e) | ||
| logInfo(s"\n${CodeFormatter.format(code, 1000)}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same here
|
When the size is too large, which exception will be thrown? Truncate for all the cases? or just when the size is too large? |
|
@gatorsmile I made the change you requested. |
| def format(code: CodeAndComment): String = { | ||
| def format(code: CodeAndComment): String = format(code, -1) | ||
|
|
||
| def format(code: CodeAndComment, maxLines: Int): String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can maxLines just be an optional argument rather than declare two methods? this is an internal method so its signature is OK to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srowen I can make that change. Is there any need to scaladoc the maxLines param? Or is the function definition easy enough to follow?
def format(code: CodeAndComment, maxLines: Int = -1): String
|
Test build #79830 has finished for PR 18658 at commit
|
|
Test build #79834 has finished for PR 18658 at commit
|
| val msg = s"failed to compile: $e\n$formatted" | ||
| val msg = s"failed to compile: $e" | ||
| logError(msg, e) | ||
| logInfo(s"\n${CodeFormatter.format(code, maxLines = 1000)}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only concern whether we should add an internal SQLConf for this hard-coded value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile I can add a config item to org.apache.spark.sql.internal.SQLConf.scala if you think this is appropriate.
Maybe something like:
val CODEGEN_LOGGING_MAX_LINES = buildConf("spark.sql.codegen.logging.maxLines")
.internal()
.doc("The maximum number of codegen lines to log when errors occur.")
.intConf
.createWithDefault(1000)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me
|
Test build #79871 has finished for PR 18658 at commit
|
| val msg = s"failed to compile: $e\n$formatted" | ||
| val msg = s"failed to compile: $e" | ||
| logError(msg, e) | ||
| val maxLines = new SQLConf().loggingMaxLinesForCodegen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> SQLConf.get.loggingMaxLinesForCodegen
| val msg = s"failed to compile: $e\n$formatted" | ||
| val msg = s"failed to compile: $e" | ||
| logError(msg, e) | ||
| val maxLines = new SQLConf().loggingMaxLinesForCodegen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> SQLConf.get.loggingMaxLinesForCodegen
| val CODEGEN_LOGGING_MAX_LINES = buildConf("spark.sql.codegen.logging.maxLines") | ||
| .internal() | ||
| .doc("The maximum number of codegen lines to log when errors occur.") | ||
| .intConf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add checkValue
…codegen.logging.maxLines
|
@gatorsmile I've made the changes you requested. |
|
Test build #79882 has finished for PR 18658 at commit
|
|
LGTM |
|
Thanks! Merging to master. |
What changes were proposed in this pull request?
When the code that is generated is greater than 64k, then Janino compile will fail and CodeGenerator.scala will log the entire code at Error level.
SPARK-20871 suggests only logging the code at Debug level.
Since, the code is already logged at debug level, this Pull Request proposes not including the formatted code in the Error logging and exception message at all.
When an exception occurs, the code will be logged at Info level but truncated if it is more than 1000 lines long.
How was this patch tested?
Existing tests were run.
An extra test test case was added to CodeFormatterSuite to test the new maxLines parameter,