Skip to content

Conversation

@sarutak
Copy link
Member

@sarutak sarutak commented May 5, 2016

What changes were proposed in this pull request?

toCommentSafeString method replaces "\u" with "\u" to avoid codegen breaking.
But if the even number of "" is put before "u", like "\u", in the string literal in the query, codegen can break.

Following code causes compilation error.

val df = Seq(...).toDF
df.select("'\\\\\\\\u002A/'").show

The reason of the compilation error is because "\\\\u002A/" is translated into "*/" (the end of comment).

Due to this unsafety, arbitrary code can be injected like as follows.

val df = Seq(...).toDF
// Inject "System.exit(1)"
df.select("'\\\\\\\\u002A/{System.exit(1);}/*'").show

How was this patch tested?

Added new test cases.

@SparkQA
Copy link

SparkQA commented May 5, 2016

Test build #57927 has finished for PR 12939 at commit 30ad081.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sarutak
Copy link
Member Author

sarutak commented May 7, 2016

CC: @rxin , @davies

val suffix = if (str.length > len) "..." else ""
str.substring(0, len).replace("*/", "\\*\\/").replace("\\u", "\\\\u") + suffix
str.substring(0, len).replace("*/", "\\*\\/")
.replaceAll("(^|[^\\\\])(\\\\(\\\\\\\\)*u)", "$1\\\\$2") + suffix
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we also have a comment at here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, make sense.
I've added.

Copy link

@mhseiden mhseiden May 10, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the implementation of org.apache.commons.lang3.StringEscapeUtils.escapeJava(...) sufficient to cover this case, instead of a custom regex?

Copy link
Member Author

@sarutak sarutak May 10, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion @mhseiden .
I tried escaping by escapeJava and it may fix this issue but I noticed it escapes all of "", means the number of "" will be doubled.
For example, \\\u0022 will be \\\\\\u0022 but I expects only "" just before "u" will be escaped if the number of "" is odd.

@SparkQA
Copy link

SparkQA commented May 10, 2016

Test build #58228 has finished for PR 12939 at commit 15a23aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…5165

Conflicts:
	sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@SparkQA
Copy link

SparkQA commented May 11, 2016

Test build #58333 has finished for PR 12939 at commit 7106f23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 12, 2016

Test build #58460 has finished for PR 12939 at commit 1140642.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def toCommentSafeString(str: String): String = {
val len = math.min(str.length, 128)
val suffix = if (str.length > len) "..." else ""
str.substring(0, len).replace("*/", "\\*\\/").replace("\\u", "\\\\u") + suffix
Copy link
Contributor

@davies davies May 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need to make sure that the comment string does not have */ in it, *\/ will be OK, one simpler solution could be

str.substring(0, len).replaceAll("(\\*|(u002A))(/|(\\\\u002F))", "$1\\\\/") + suffix

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the advice.
I think "\u" should be escaped too otherwise, the compilation will fail when invalid unicode characters, like \u002X or \u001, are in literals.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, LGTM

@SparkQA
Copy link

SparkQA commented May 17, 2016

Test build #58675 has finished for PR 12939 at commit f2b7adb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Copy link
Contributor

davies commented May 17, 2016

LGTM,
Merging this into master and 2.0, thanks!

@asfgit asfgit closed this in c0c3ec3 May 17, 2016
asfgit pushed a commit that referenced this pull request May 17, 2016
…not actually safe

## What changes were proposed in this pull request?

toCommentSafeString method replaces "\u" with "\\\\u" to avoid codegen breaking.
But if the even number of "\" is put before "u", like "\\\\u", in the string literal in the query, codegen can break.

Following code causes compilation error.

```
val df = Seq(...).toDF
df.select("'\\\\\\\\u002A/'").show
```

The reason of the compilation error is because "\\\\\\\\\\\\\\\\u002A/" is translated into "*/" (the end of comment).

Due to this unsafety, arbitrary code can be injected like as follows.

```
val df = Seq(...).toDF
// Inject "System.exit(1)"
df.select("'\\\\\\\\u002A/{System.exit(1);}/*'").show
```

## How was this patch tested?

Added new test cases.

Author: Kousuke Saruta <[email protected]>
Author: sarutak <[email protected]>

Closes #12939 from sarutak/SPARK-15165.

(cherry picked from commit c0c3ec3)
Signed-off-by: Davies Liu <[email protected]>
@davies
Copy link
Contributor

davies commented May 17, 2016

@sarutak Could you send another PR for 1.6 branch?

@rxin
Copy link
Contributor

rxin commented May 18, 2016

@davies @sarutak I'm wondering if we should go with a whitelist approach, i.e. only whitelisting a-z0-9 and () []. It wouldn't sacrifice readability as much, but would be a lot safer. WDYT?

@sarutak
Copy link
Member Author

sarutak commented May 18, 2016

You mean that if a character is not in the whitelist, the character should be escaped right?
e.g. abcd*012\u{}[] in comment should be escaped to abcd\*012\\u{}[] ?
Or, just remove those characters from comment?

@rxin
Copy link
Contributor

rxin commented May 18, 2016

Just remove the character.

@sarutak
Copy link
Member Author

sarutak commented May 18, 2016

Lots of punctuation characters like *, + can be used as an operator in expressions so I'm afraid comments in generated code will be difficult to read if characters are removed based on the whitelist.

On the other hand, I noticed my another PR (#12979) can keep the readability of the comment and the safety.

@rxin
Copy link
Contributor

rxin commented May 18, 2016

We can just expand the whitelist and add * + and such into that can't we? My main worry is that security is very difficult to get right, and having a whitelist substantially reduces the chance of corner cases that we didn't expect happening.

@sarutak
Copy link
Member Author

sarutak commented May 18, 2016

O.K. Initially we add some characters to the whitelist and if we need some more characters, we'll consider whether it should be add or not at any time. How about this idea?

@davies
Copy link
Contributor

davies commented May 18, 2016

@rxin If there is any new bug found on this, we could switch to white list, otherwise I'd like to have the current solution.

@rxin
Copy link
Contributor

rxin commented May 18, 2016

@davies this is the 2nd security bug with codegen we found already.

@sarutak sgtm.

@davies
Copy link
Contributor

davies commented May 18, 2016

@sarutak That's true. Should / also be common used?

@davies
Copy link
Contributor

davies commented May 18, 2016

Either way works for me.

@sarutak
Copy link
Member Author

sarutak commented May 18, 2016

/ is used as the division operator and * is used as the multiplication operator so it's good to add those characters but we should remove */ so we need to add \*(?!/) and (?<!\*)/ to the whitelist instead of just / and *.

@sarutak
Copy link
Member Author

sarutak commented May 18, 2016

I have one concern about the whitelisting approach. Even if each single character is safe, it's difficult to ensure any character sequences which consist of those safe characters are always safe.
E.g. * and / are safe themselves but */ is not safe. It's difficult to ensure there are no unsafe combination such.

The place holder approach I mentioned above (#12979) may be safer because the place holder consists of only {, comment_placeholder, numbers and }.

Anyway, I'll try the whitelisting approach. Let's discuss more.

@rxin
Copy link
Contributor

rxin commented May 18, 2016

hm that's true. @davies want to review that one?

@davies
Copy link
Contributor

davies commented May 18, 2016

Yeah, we could repurpose #12979 for security reason, will review that.

asfgit pushed a commit that referenced this pull request May 20, 2016
… in generated code

## What changes were proposed in this pull request?

This PR introduce place holder for comment in generated code and the purpose  is same for #12939 but much safer.

Generated code to be compiled doesn't include actual comments but includes place holder instead.

Place holders in generated code will be replaced with actual comments only at the time of  logging.

Also, this PR can resolve SPARK-15205.

## How was this patch tested?

Existing tests.

Author: Kousuke Saruta <[email protected]>

Closes #12979 from sarutak/SPARK-15205.

(cherry picked from commit 22947cd)
Signed-off-by: Davies Liu <[email protected]>
asfgit pushed a commit that referenced this pull request May 20, 2016
… in generated code (branch-1.6)

## What changes were proposed in this pull request?

This PR introduce place holder for comment in generated code and the purpose is same for #12939 but much safer.

Generated code to be compiled doesn't include actual comments but includes place holder instead.

Place holders in generated code will be replaced with actual comments only at the time of logging.

Also, this PR can resolve SPARK-15205.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Added new test cases.

Author: Kousuke Saruta <[email protected]>

Closes #13230 from sarutak/SPARK-15165-branch-1.6.
zzcclp pushed a commit to zzcclp/spark that referenced this pull request May 22, 2016
… in generated code (branch-1.6)

This PR introduce place holder for comment in generated code and the purpose is same for apache#12939 but much safer.

Generated code to be compiled doesn't include actual comments but includes place holder instead.

Place holders in generated code will be replaced with actual comments only at the time of logging.

Also, this PR can resolve SPARK-15205.

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Added new test cases.

Author: Kousuke Saruta <[email protected]>

Closes apache#13230 from sarutak/SPARK-15165-branch-1.6.

(cherry picked from commit 9a18115)

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/GenerateColumnAccessor.scala
@sarutak sarutak deleted the SPARK-15165 branch June 4, 2021 20:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants