[SPARK-15165][SQL] Codegen can break because toCommentSafeString is not actually safe #12939

sarutak · 2016-05-05T22:30:59Z

What changes were proposed in this pull request?

The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking codegen.
But if an even number of "\" characters precedes "u", like "\\u", in a string literal in the query, codegen can still break.

The following code causes a compilation error.

val df = Seq(...).toDF
df.select("'\\\\\\\\u002A/'").show

The compilation error occurs because "\\\\u002A/" is translated into "*/" (the end of a comment).

Because of this weakness, arbitrary code can be injected, as follows.

val df = Seq(...).toDF
// Inject "System.exit(1)"
df.select("'\\\\\\\\u002A/{System.exit(1);}/*'").show
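To make the failure mode concrete, here is a minimal sketch. It copies the pre-patch escaping from the diff and pairs it with a hypothetical helper, `translateUnicodeEscapes`, that only approximates the Unicode-escape translation the Java lexer performs before it recognizes comments. The object and helper names are illustrative assumptions, not Spark code.

```scala
object OldEscapeDemo {
  // The pre-patch method, as shown on the removed line of the diff.
  def oldToCommentSafeString(str: String): String = {
    val len = math.min(str.length, 128)
    val suffix = if (str.length > len) "..." else ""
    str.substring(0, len).replace("*/", "\\*\\/").replace("\\u", "\\\\u") + suffix
  }

  // Hypothetical helper: approximates Java's Unicode-escape translation.
  // A backslash starts an escape only when an even number of backslashes precede it.
  def translateUnicodeEscapes(src: String): String = {
    val out = new StringBuilder
    var i = 0
    var preceding = 0 // contiguous backslashes immediately before position i
    while (i < src.length) {
      val c = src.charAt(i)
      val startsEscape =
        c == '\\' && preceding % 2 == 0 && i + 1 < src.length && src.charAt(i + 1) == 'u'
      if (startsEscape) {
        var j = i + 1
        while (j < src.length && src.charAt(j) == 'u') j += 1
        val hex = src.slice(j, j + 4)
        require(hex.length == 4 && hex.forall(ch => Character.digit(ch, 16) >= 0),
          "illegal unicode escape")
        out.append(Integer.parseInt(hex, 16).toChar)
        i = j + 4
        preceding = 0
      } else {
        out.append(c)
        preceding = if (c == '\\') preceding + 1 else 0
        i += 1
      }
    }
    out.toString
  }

  def main(args: Array[String]): Unit = {
    // Literal value: two backslashes, then u002A and a slash.
    val literal = "\\\\u002A/"
    // The old escaping turns the two backslashes into three.
    val escaped = oldToCommentSafeString(literal)
    // With three backslashes, the last one begins a Unicode escape for '*'.
    val seen = translateUnicodeEscapes(escaped)
    println(seen.contains("*/")) // prints true
  }
}
```

With an odd number of backslashes in front of "u", the last backslash is eligible to start an escape, so the comment text seen by the compiler ends with "*/" even though the escaped string never contained those two characters next to each other.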

How was this patch tested?

Added new test cases.

SparkQA · 2016-05-05T23:55:23Z

Test build #57927 has finished for PR 12939 at commit 30ad081.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sarutak · 2016-05-07T07:04:27Z

CC: @rxin , @davies

yhuai · 2016-05-09T20:59:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala

    val suffix = if (str.length > len) "..." else ""
-    str.substring(0, len).replace("*/", "\\*\\/").replace("\\u", "\\\\u") + suffix
+    str.substring(0, len).replace("*/", "\\*\\/")
+      .replaceAll("(^|[^\\\\])(\\\\(\\\\\\\\)*u)", "$1\\\\$2") + suffix


How about we also add a comment here?

Yeah, makes sense.
I've added one.

Is the implementation of org.apache.commons.lang3.StringEscapeUtils.escapeJava(...) sufficient to cover this case, instead of a custom regex?

Thanks for the suggestion @mhseiden .
I tried escaping with escapeJava and it may fix this issue, but I noticed it escapes every "\", which means the number of "\" characters is doubled.
For example, \\\u0022 becomes \\\\\\u0022, but I expect only the "\" just before "u" to be escaped, and only if the number of "\" characters is odd.
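The behavior described here can be checked with a small sketch. It applies the regex and replacement string from the added lines of the diff to inputs with one to four backslashes before "u". The object name and the counting helper are assumptions made for illustration only.

```scala
object NewEscapeSketch {
  // Regex and replacement copied from the added lines of the diff.
  def escapeUnicodePrefix(s: String): String =
    s.replaceAll("(^|[^\\\\])(\\\\(\\\\\\\\)*u)", "$1\\\\$2")

  // Hypothetical helper: counts the backslashes immediately before the first 'u'.
  def backslashesBeforeU(s: String): Int = {
    val u = s.indexOf('u')
    s.substring(0, u).reverse.takeWhile(_ == '\\').length
  }

  def main(args: Array[String]): Unit = {
    val backslash = "\\"
    for (n <- 1 to 4) {
      val in = backslash * n + "u0022"
      val out = escapeUnicodePrefix(in)
      // An odd count gains one backslash; an even count is left alone.
      println(s"$n -> ${backslashesBeforeU(out)}")
    }
  }
}
```

In this sketch only odd runs are changed, so the result always has an even number of backslashes before "u" and the text stays closer to the original than it would with every backslash doubled.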

SparkQA · 2016-05-10T10:46:22Z

Test build #58228 has finished for PR 12939 at commit 15a23aa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…5165 Conflicts: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

SparkQA · 2016-05-11T07:26:34Z

Test build #58333 has finished for PR 12939 at commit 7106f23.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…5165 Resolved conflict

SparkQA · 2016-05-12T09:12:53Z

Test build #58460 has finished for PR 12939 at commit 1140642.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-05-16T18:49:41Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala

  def toCommentSafeString(str: String): String = {
    val len = math.min(str.length, 128)
    val suffix = if (str.length > len) "..." else ""
-    str.substring(0, len).replace("*/", "\\*\\/").replace("\\u", "\\\\u") + suffix


We only need to make sure that the comment string does not contain */; *\/ would be OK. One simpler solution could be:

str.substring(0, len).replaceAll("(\\*|(u002A))(/|(\\\\u002F))", "$1\\\\/") + suffix

Thanks for the advice.
I think "\u" should be escaped too; otherwise, compilation will fail when invalid Unicode escapes, like \u002X or \u001, appear in literals.

Good point, LGTM
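As a rough illustration of this exchange, the sketch below applies the suggested regex to a few inputs. The object and method names are hypothetical. It shows that the regex neutralizes the comment terminator but passes a malformed escape such as \u002X through unchanged, which is the case raised in the reply.

```scala
object SimplerRegexSketch {
  // Regex and replacement copied from the suggestion above.
  def escapeCommentEnd(s: String): String =
    s.replaceAll("(\\*|(u002A))(/|(\\\\u002F))", "$1\\\\/")

  def main(args: Array[String]): Unit = {
    // A plain comment terminator gets a backslash inserted before the slash.
    println(escapeCommentEnd("*/"))
    // An escaped asterisk followed by a slash is handled as well.
    println(escapeCommentEnd("\\u002A/"))
    // A malformed escape is not touched by this regex at all.
    println(escapeCommentEnd("\\u002X") == "\\u002X")
  }
}
```

Because javac reports malformed Unicode escapes even inside comments, leaving such input untouched would still break compilation of the generated code, assuming the codegen compiler follows the same rule.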

SparkQA · 2016-05-17T09:45:06Z

Test build #58675 has finished for PR 12939 at commit f2b7adb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-05-17T17:06:43Z

LGTM,
Merging this into master and 2.0, thanks!

…not actually safe

## What changes were proposed in this pull request?

toCommentSafeString method replaces "\u" with "\\\\u" to avoid codegen breaking.
But if the even number of "\" is put before "u", like "\\\\u", in the string literal in the query, codegen can break.

Following code causes compilation error.

```
val df = Seq(...).toDF
df.select("'\\\\\\\\u002A/'").show
```

The reason of the compilation error is because "\\\\\\\\\\\\\\\\u002A/" is translated into "*/" (the end of comment).

Due to this unsafety, arbitrary code can be injected like as follows.

```
val df = Seq(...).toDF
// Inject "System.exit(1)"
df.select("'\\\\\\\\u002A/{System.exit(1);}/*'").show
```

## How was this patch tested?

Added new test cases.

Author: Kousuke Saruta <[email protected]>
Author: sarutak <[email protected]>

Closes #12939 from sarutak/SPARK-15165.

(cherry picked from commit c0c3ec3)
Signed-off-by: Davies Liu <[email protected]>

davies · 2016-05-17T17:10:35Z

@sarutak Could you send another PR for 1.6 branch?

rxin · 2016-05-18T01:54:02Z

@davies @sarutak I'm wondering if we should go with a whitelist approach, i.e. only whitelisting a-z0-9 and () []. It wouldn't sacrifice readability as much, but would be a lot safer. WDYT?

sarutak · 2016-05-18T02:04:41Z

You mean that if a character is not in the whitelist, the character should be escaped, right?
E.g. abcd*012\u{}[] in a comment should be escaped to abcd\*012\\u{}[]?
Or should we just remove those characters from the comment?

rxin · 2016-05-18T02:28:08Z

Just remove the character.

sarutak · 2016-05-18T05:50:53Z

Lots of punctuation characters like *, + can be used as an operator in expressions so I'm afraid comments in generated code will be difficult to read if characters are removed based on the whitelist.

On the other hand, I noticed that my other PR (#12979) can preserve both the readability of the comment and its safety.

rxin · 2016-05-18T05:53:47Z

We can just expand the whitelist and add * + and such into that can't we? My main worry is that security is very difficult to get right, and having a whitelist substantially reduces the chance of corner cases that we didn't expect happening.

sarutak · 2016-05-18T06:00:10Z

O.K. Initially we add some characters to the whitelist, and if we need more characters later, we'll consider whether each one should be added. How about this idea?

davies · 2016-05-18T06:02:59Z

@rxin If there is any new bug found on this, we could switch to white list, otherwise I'd like to have the current solution.

rxin · 2016-05-18T06:04:57Z

@davies this is the 2nd security bug with codegen we found already.

@sarutak sgtm.

davies · 2016-05-18T06:13:27Z

@sarutak That's true. Should / also be commonly used?

davies · 2016-05-18T06:19:59Z

Either way works for me.

sarutak · 2016-05-18T06:42:15Z

/ is used as the division operator and * is used as the multiplication operator, so it's good to add those characters, but we should remove */. So we need to add \*(?!/) and (?<!\*)/ to the whitelist instead of just / and *.

sarutak · 2016-05-18T07:10:22Z

I have one concern about the whitelisting approach. Even if each single character is safe, it's difficult to ensure that every character sequence built from those safe characters is always safe.
E.g. * and / are safe by themselves, but */ is not. It's difficult to ensure there are no such unsafe combinations.

The placeholder approach I mentioned above (#12979) may be safer because the placeholder consists of only {, comment_placeholder, numbers and }.

Anyway, I'll try the whitelisting approach. Let's discuss more.
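To illustrate the concern, here is a hedged sketch of two whitelist filters. The character set, the token pattern built around the two lookaround expressions mentioned earlier in the thread, and all names are assumptions for illustration. Neither filter is taken from Spark.

```scala
object WhitelistSketch {
  // Hypothetical whitelist: lowercase letters, digits, space, brackets, and a few operators.
  val allowed: Set[Char] = "abcdefghijklmnopqrstuvwxyz0123456789 ()[]*+/".toSet

  // Per-character filtering: drops anything not in the whitelist.
  def filterByChar(s: String): String = s.filter(allowed)

  // Pattern-based variant: '*' is kept only when not followed by '/',
  // and '/' is kept only when not preceded by '*' in the original input.
  private val token = """[a-z0-9()\[\] +]|\*(?!/)|(?<!\*)/""".r
  def filterByPattern(s: String): String = token.findAllIn(s).mkString

  def main(args: Array[String]): Unit = {
    // Each character is whitelisted, yet the terminator survives intact.
    println(filterByChar("a */ b").contains("*/"))
    // The lookarounds remove an adjacent pair.
    println(filterByPattern("a */ b"))
    // Removing a non-whitelisted character between '*' and '/' joins them again.
    println(filterByPattern("1 *\"/ 2").contains("*/"))
  }
}
```

The last case is one example of how removal itself can create an unsafe sequence: the lookarounds are evaluated against the input, while the hazard exists in the output.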

rxin · 2016-05-18T07:15:59Z

hm that's true. @davies want to review that one?

davies · 2016-05-18T17:07:19Z

Yeah, we could repurpose #12979 for security reasons; I will review that.

… in generated code

## What changes were proposed in this pull request?

This PR introduce place holder for comment in generated code and the purpose is same for #12939 but much safer.

Generated code to be compiled doesn't include actual comments but includes place holder instead.

Place holders in generated code will be replaced with actual comments only at the time of logging.

Also, this PR can resolve SPARK-15205.

## How was this patch tested?

Existing tests.

Author: Kousuke Saruta <[email protected]>

Closes #12979 from sarutak/SPARK-15205.

(cherry picked from commit 22947cd)
Signed-off-by: Davies Liu <[email protected]>
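A minimal sketch of the placeholder idea follows. The placeholder format, the registry, and the method names are hypothetical and are not taken from #12979. The sketch only illustrates the principle that user-controlled text never reaches the code handed to the compiler and is substituted back in for logging only.

```scala
import scala.collection.mutable

object PlaceholderSketch {
  // Hypothetical registry mapping each placeholder to the comment text it stands for.
  private val comments = mutable.LinkedHashMap.empty[String, String]

  // Returns a placeholder built only from fixed text and a counter.
  def registerComment(text: String): String = {
    val placeholder = s"/*{comment_placeholder_${comments.size}}*/"
    comments(placeholder) = text
    placeholder
  }

  // Substitutes the real comment text back in; intended for log output only.
  def renderForLogging(code: String): String =
    comments.foldLeft(code) { case (acc, (placeholder, text)) =>
      acc.replace(placeholder, s"/* $text */")
    }

  def main(args: Array[String]): Unit = {
    // Hostile text: two backslashes, u002A, a slash, then an injected block.
    val hostile = "\\\\u002A/{System.exit(1);}/*"
    val code = registerComment(hostile) + "\nint x = 1;"
    // The code to be compiled carries only the placeholder.
    println(code.contains("System.exit")) // prints false
    // The logged form shows the original text for readability.
    println(renderForLogging(code).contains("System.exit")) // prints true
  }
}
```

Since the placeholder is generated entirely by the framework, there is no escaping step to get wrong for the compiled code. The logged form is never compiled, so it can show the text as written.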

… in generated code (branch-1.6)

## What changes were proposed in this pull request?

This PR introduce place holder for comment in generated code and the purpose is same for #12939 but much safer.

Generated code to be compiled doesn't include actual comments but includes place holder instead.

Place holders in generated code will be replaced with actual comments only at the time of logging.

Also, this PR can resolve SPARK-15205.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Added new test cases.

Author: Kousuke Saruta <[email protected]>

Closes #13230 from sarutak/SPARK-15165-branch-1.6.

… in generated code (branch-1.6)

This PR introduce place holder for comment in generated code and the purpose is same for apache#12939 but much safer.

Generated code to be compiled doesn't include actual comments but includes place holder instead.

Place holders in generated code will be replaced with actual comments only at the time of logging.

Also, this PR can resolve SPARK-15205.

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Added new test cases.

Author: Kousuke Saruta <[email protected]>

Closes apache#13230 from sarutak/SPARK-15165-branch-1.6.

(cherry picked from commit 9a18115)

Conflicts:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/GenerateColumnAccessor.scala

Made toCommentSafeString method safer

30ad081

yhuai reviewed May 9, 2016

sarutak added 2 commits May 10, 2016 13:19

Merge branch 'master' of git://git.apache.org/spark into SPARK-15165

ed2e1e7

Added comment

15a23aa

Merge branch 'master' of https://github.com/apache/spark into SPARK-1…

7106f23

…5165 Conflicts: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

Merge branch 'master' of https://github.com/apache/spark into SPARK-1…

1140642

…5165 Resolved conflict

davies reviewed May 16, 2016

sarutak added 3 commits May 17, 2016 13:07

Merge branch 'master' of git://git.apache.org/spark into SPARK-15165

447b24d

Added some more test cases

43a340f

Minor fixes

f2b7adb

asfgit closed this in c0c3ec3 May 17, 2016

sarutak mentioned this pull request May 18, 2016

[SPARK-15165][SPARK-15205][SQL] Introduce place holder for comments in generated code #12979

Closed

sarutak mentioned this pull request May 20, 2016

[SPARK-15165][SPARK-15205][SQL] Introduce place holder for comments in generated code (branch-1.6) #13230

Closed

sarutak deleted the SPARK-15165 branch June 4, 2021 20:47
