[SPARK-15165][SQL] Codegen can break because toCommentSafeString is not actually safe #12939
Test build #57927 has finished for PR 12939 at commit
  val suffix = if (str.length > len) "..." else ""
- str.substring(0, len).replace("*/", "\\*\\/").replace("\\u", "\\\\u") + suffix
+ str.substring(0, len).replace("*/", "\\*\\/")
+   .replaceAll("(^|[^\\\\])(\\\\(\\\\\\\\)*u)", "$1\\\\$2") + suffix
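To make the difference between the removed and the added replacement concrete, here is a minimal sketch. The object name and the `backslashesBeforeU` helper are hypothetical and not part of Spark; only the two replacement expressions are copied from the diff. Inputs are built with string repetition so that the number of backslashes is explicit.

```scala
object CommentEscapeSketch {
  // Replacement from the removed line: one backslash is prepended to
  // every backslash-u pair, regardless of what precedes it.
  def oldEscape(s: String): String =
    s.replace("\\u", "\\\\u")

  // Replacement from the added lines: one backslash is prepended only
  // when an odd-length run of backslashes sits directly before 'u'.
  def newEscape(s: String): String =
    s.replaceAll("(^|[^\\\\])(\\\\(\\\\\\\\)*u)", "$1\\\\$2")

  // Hypothetical helper: length of the backslash run right before the first 'u'.
  def backslashesBeforeU(s: String): Int =
    s.take(s.indexOf('u')).reverse.takeWhile(_ == '\\').length

  def main(args: Array[String]): Unit = {
    for (n <- 1 to 4) {
      val in = ("\\" * n) + "u002A/"
      val o = backslashesBeforeU(oldEscape(in))
      val m = backslashesBeforeU(newEscape(in))
      println(s"$n backslashes -> old: $o, new: $m")
    }
  }
}
```

With the old replacement, runs of 2 and 4 backslashes become runs of 3 and 5, which is odd, so the unicode escape becomes active in the generated Java source. The new replacement leaves even runs alone and pads odd runs to the next even length.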
How about we also add a comment here?
Yeah, makes sense.
I've added one.
Is the implementation of org.apache.commons.lang3.StringEscapeUtils.escapeJava(...) sufficient to cover this case, instead of a custom regex?
Thanks for the suggestion, @mhseiden.
I tried escaping with escapeJava, and it may fix this issue, but I noticed that it escapes every "\", which means the number of "\" is doubled.
For example, \\\u0022 becomes \\\\\\u0022, but I expect only the "\" just before "u" to be escaped, and only if the number of "\" is odd.
Test build #58228 has finished for PR 12939 at commit
…5165 Conflicts: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
Test build #58333 has finished for PR 12939 at commit
…5165 Resolved conflict
Test build #58460 has finished for PR 12939 at commit
def toCommentSafeString(str: String): String = {
  val len = math.min(str.length, 128)
  val suffix = if (str.length > len) "..." else ""
  str.substring(0, len).replace("*/", "\\*\\/").replace("\\u", "\\\\u") + suffix
We only need to make sure that the comment string does not contain */; *\/ is OK. A simpler solution could be:
str.substring(0, len).replaceAll("(\\*|(u002A))(/|(\\\\u002F))", "$1\\\\/") + suffix
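As a rough illustration of what the suggested expression does, the sketch below applies it to a string containing a comment terminator. The object and method names are hypothetical; the pattern and replacement are copied from the suggestion above, with the truncation to `len` and the suffix left out for brevity.

```scala
object SlashOnlyEscapeSketch {
  // Pattern and replacement copied from the review suggestion: a '*'
  // (or the text u002A) followed by '/' (or an escaped u002F) gets a
  // backslash inserted before the slash.
  def escapeCommentEnd(s: String): String =
    s.replaceAll("(\\*|(u002A))(/|(\\\\u002F))", "$1\\\\/")

  def main(args: Array[String]): Unit = {
    println(escapeCommentEnd("a*/b"))
  }
}
```

This only targets the comment terminator itself; it does not try to neutralise other unicode escapes in the string.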
Thanks for the advice.
I think "\u" should be escaped too; otherwise, compilation will fail when invalid unicode escapes, such as \u002X or \u001, appear in literals.
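The failure mode mentioned here can be sketched as a small scanner that looks for a malformed unicode escape the way a Java compiler would see it: a backslash preceded by an even number of backslashes, followed by one or more 'u' characters, but not by four hex digits. All names below are hypothetical, and the scanner is a simplification for illustration, not the check Spark or javac performs.

```scala
object UnicodeEscapeCheckSketch {
  private def isHex(c: Char): Boolean =
    (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')

  // True if `src` contains an active backslash-u sequence that is not
  // followed by exactly the four hex digits a unicode escape requires.
  def hasMalformedUnicodeEscape(src: String): Boolean = {
    var i = 0
    var run = 0 // contiguous backslashes seen immediately before position i
    while (i < src.length) {
      if (src(i) != '\\') {
        run = 0
        i += 1
      } else if (run % 2 == 0 && i + 1 < src.length && src(i + 1) == 'u') {
        var j = i + 1
        while (j < src.length && src(j) == 'u') j += 1 // several 'u's are allowed
        val digits = src.slice(j, j + 4)
        if (digits.length < 4 || !digits.forall(isHex)) return true
        i = j + 4
        run = 0
      } else {
        run += 1
        i += 1
      }
    }
    false
  }

  def main(args: Array[String]): Unit = {
    println(hasMalformedUnicodeEscape("\\" + "u002X"))
    println(hasMalformedUnicodeEscape("\\" + "u001"))
    println(hasMalformedUnicodeEscape(("\\" * 2) + "u002X"))
  }
}
```

The first two inputs are the examples from the comment and are reported as malformed; the third has an even run of backslashes, so the sequence is not an active escape and is not reported.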
Good point, LGTM
Test build #58675 has finished for PR 12939 at commit
LGTM,
…not actually safe
## What changes were proposed in this pull request?
The toCommentSafeString method replaces "\u" with "\\\\u" to keep codegen from breaking.
But if an even number of "\" precedes "u", like "\\\\u", in a string literal in the query, codegen can still break.
The following code causes a compilation error.
```
val df = Seq(...).toDF
df.select("'\\\\\\\\u002A/'").show
```
The compilation error occurs because "\\\\\\\\\\\\\\\\u002A/" is translated into "*/" (the end of a comment).
Because of this, arbitrary code can be injected as follows.
```
val df = Seq(...).toDF
// Inject "System.exit(1)"
df.select("'\\\\\\\\u002A/{System.exit(1);}/*'").show
```
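To illustrate why an active unicode escape is enough to break out of a comment, the sketch below imitates the unicode-escape translation that a Java compiler applies to source text before it recognises comments. The `translate` function and the object name are hypothetical, the function assumes every active escape is well formed, and the input assumes that three backslashes reach the generated source; none of this is taken from Spark itself.

```scala
object UnicodeTranslationSketch {
  // Simplified imitation of Java's unicode-escape pre-pass. A backslash
  // starts an escape only when an even number of backslashes precede it.
  // Assumption: four valid hex digits follow every active escape.
  def translate(src: String): String = {
    val out = new StringBuilder
    var i = 0
    var run = 0 // contiguous backslashes emitted immediately before position i
    while (i < src.length) {
      val c = src(i)
      if (c == '\\' && run % 2 == 0 && i + 1 < src.length && src(i + 1) == 'u') {
        var j = i + 1
        while (j < src.length && src(j) == 'u') j += 1
        out.append(Integer.parseInt(src.slice(j, j + 4), 16).toChar)
        i = j + 4
        run = 0
      } else {
        out.append(c)
        run = if (c == '\\') run + 1 else 0
        i += 1
      }
    }
    out.toString
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical generated source: a block comment whose body carries the payload.
    val generated = "/* " + ("\\" * 3) + "u002A/{System.exit(1);}/* */"
    println(translate(generated))
  }
}
```

After translation, the comment body contains a real `*/`, so the block `{System.exit(1);}` sits outside any comment and would be compiled as code.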
## How was this patch tested?
Added new test cases.
Author: Kousuke Saruta <[email protected]>
Author: sarutak <[email protected]>
Closes #12939 from sarutak/SPARK-15165.
(cherry picked from commit c0c3ec3)
Signed-off-by: Davies Liu <[email protected]>
@sarutak Could you send another PR for 1.6 branch?
You mean that if a character is not in the whitelist, the character should be escaped, right?
Just remove the character.
Lots of punctuation characters like
On the other hand, I noticed that my other PR (#12979) can keep both the readability of the comment and the safety.
We can just expand the whitelist and add * + and such to it, can't we? My main worry is that security is very difficult to get right, and having a whitelist substantially reduces the chance of hitting corner cases that we didn't expect.
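The whitelisting idea discussed here could look roughly like the following sketch, which simply drops every character that is not explicitly allowed. The object name and the particular set of allowed characters are assumptions made for illustration; the thread does not fix a concrete whitelist.

```scala
object WhitelistCommentSketch {
  // Hypothetical whitelist: ASCII letters, digits and a few punctuation
  // marks. '/' and the backslash are deliberately left out.
  private val allowed: Set[Char] =
    (('a' to 'z') ++ ('A' to 'Z') ++ ('0' to '9')).toSet ++
      Set(' ', '_', '.', ',', ':', '(', ')', '=', '<', '>', '+', '-', '*')

  // Remove (rather than escape) every character outside the whitelist.
  def sanitize(s: String): String = s.filter(allowed)

  def main(args: Array[String]): Unit = {
    println(sanitize("a */ b"))
    println(sanitize("\\" + "u002A/"))
  }
}
```

Note that safety here depends on the combination of allowed characters, not only on each character alone: if both `*` and `/` were allowed, the sequence `*/` would pass through unchanged.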
O.K. Initially we add some characters to the whitelist, and if we need more later, we consider case by case whether each should be added. How about this idea?
@rxin If any new bug is found in this, we could switch to a whitelist; otherwise I'd like to keep the current solution.
@sarutak That's true. Should
Either way works for me.
I have one concern about the whitelisting approach. Even if each single character is safe, it's difficult to ensure that every sequence built from those safe characters is also safe. The placeholder approach I mentioned above (#12979) may be safer because the place holder consists of only
Anyway, I'll try the whitelisting approach. Let's discuss more.
hm that's true. @davies want to review that one?
Yeah, we could repurpose #12979 for security reasons; I will review that.
… in generated code
## What changes were proposed in this pull request?
This PR introduces placeholders for comments in generated code. Its purpose is the same as #12939, but it is much safer.
The generated code that gets compiled doesn't include the actual comments; it includes placeholders instead.
The placeholders in the generated code are replaced with the actual comments only at logging time.
Also, this PR can resolve SPARK-15205.
## How was this patch tested?
Existing tests.
Author: Kousuke Saruta <[email protected]>
Closes #12979 from sarutak/SPARK-15205.
(cherry picked from commit 22947cd)
Signed-off-by: Davies Liu <[email protected]>
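The placeholder idea described in this commit message can be sketched as follows. This is only an illustration under stated assumptions: `CommentRegistry`, `register`, `render`, and the placeholder format are hypothetical names chosen here, not the identifiers used in Spark.

```scala
import scala.collection.mutable

object CommentPlaceholderSketch {
  final class CommentRegistry {
    // Placeholder -> original comment text, in registration order.
    private val comments = mutable.LinkedHashMap.empty[String, String]

    // Store the untrusted text and hand back a placeholder made only of
    // fixed, known-safe characters. Only the placeholder goes into the
    // code that is actually compiled.
    def register(text: String): String = {
      val placeholder = s"/*__c${comments.size}__*/"
      comments(placeholder) = text
      placeholder
    }

    // For logging only: substitute the real comment text back in.
    def render(code: String): String =
      comments.foldLeft(code) { case (acc, (ph, text)) =>
        acc.replace(ph, s"/* $text */")
      }
  }

  def main(args: Array[String]): Unit = {
    val registry = new CommentRegistry
    val ph = registry.register("*/ {System.exit(1);} /*")
    val toCompile = ph + "\nint x = 1;"
    println(toCompile)                  // what the compiler would see
    println(registry.render(toCompile)) // what the log would show
  }
}
```

Because the untrusted text never reaches the compiler, no escaping rule has to be exactly right; the rendered version exists only for humans reading the log.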
… in generated code (branch-1.6)
## What changes were proposed in this pull request?
This PR introduces placeholders for comments in generated code. Its purpose is the same as #12939, but it is much safer.
The generated code that gets compiled doesn't include the actual comments; it includes placeholders instead.
The placeholders in the generated code are replaced with the actual comments only at logging time.
Also, this PR can resolve SPARK-15205.
## How was this patch tested?
Added new test cases.
Author: Kousuke Saruta <[email protected]>
Closes #13230 from sarutak/SPARK-15165-branch-1.6.
… in generated code (branch-1.6)
(cherry picked from commit 9a18115)
Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/GenerateColumnAccessor.scala