Skip to content

Commit e5fee3e

Browse files
joderskyrxin
authored andcommitted
[SPARK-17647][SQL] Fix backslash escaping in 'LIKE' patterns.
## What changes were proposed in this pull request? This patch fixes a bug in the way LIKE patterns are translated to Java regexes. The bug causes any character following an escaped backslash to be escaped, i.e. there is double-escaping. A concrete example is the following pattern:`'%\\%'`. The expected Java regex that this pattern should correspond to (according to the behavior described below) is `'.*\\.*'`, however the current situation leads to `'.*\\%'` instead. --- Update: in light of the discussion that ensued, we should explicitly define the expected behaviour of LIKE expressions, especially in certain edge cases. With the help of gatorsmile, we put together a list of different RDBMS and their variations wrt to certain standard features. | RDBMS\Features | Wildcards | Default escape [1] | Case sensitivity | | --- | --- | --- | --- | | [MS SQL Server](https://msdn.microsoft.com/en-us/library/ms179859.aspx) | _, %, [], [^] | none | no | | [Oracle](https://docs.oracle.com/cd/B12037_01/server.101/b10759/conditions016.htm) | _, % | none | yes | | [DB2 z/OS](http://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_likepredicate.html) | _, % | none | yes | | [MySQL](http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html) | _, % | none | no | | [PostreSQL](https://www.postgresql.org/docs/9.0/static/functions-matching.html) | _, % | \ | yes | | [Hive](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) | _, % | none | yes | | Current Spark | _, % | \ | yes | [1] Default escape character: most systems do not have a default escape character, instead the user can specify one by calling a like expression with an escape argument [A] LIKE [B] ESCAPE [C]. This syntax is currently not supported by Spark, however I would volunteer to implement this feature in a separate ticket. The specifications are often quite terse and certain scenarios are undocumented, so here is a list of scenarios that I am uncertain about and would appreciate any input. Specifically I am looking for feedback on whether or not Spark's current behavior should be changed. 1. [x] Ending a pattern with the escape sequence, e.g. `like 'a\'`. PostreSQL gives an error: 'LIKE pattern must not end with escape character', which I personally find logical. Currently, Spark allows "non-terminated" escapes and simply ignores them as part of the pattern. According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), ending a pattern in an escape character is invalid. _Proposed new behaviour in Spark: throw AnalysisException_ 2. [x] Empty input, e.g. `'' like ''` Postgres and DB2 will match empty input only if the pattern is empty as well, any other combination of empty input will not match. Spark currently follows this rule. 3. [x] Escape before a non-special character, e.g. `'a' like '\a'`. Escaping a non-wildcard character is not really documented but PostgreSQL just treats it verbatim, which I also find the least surprising behavior. Spark does the same. According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), it is invalid to follow an escape character with anything other than an escape character, an underscore or a percent sign. _Proposed new behaviour in Spark: throw AnalysisException_ The current specification is also described in the operator's source code in this patch. ## How was this patch tested? Extra case in regex unit tests. Author: Jakob Odersky <[email protected]> This patch had conflicts when merged, resolved by Committer: Reynold Xin <[email protected]> Closes #15398 from jodersky/SPARK-17647.
1 parent 01ff035 commit e5fee3e

File tree

4 files changed

+153
-87
lines changed

4 files changed

+153
-87
lines changed

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

Lines changed: 24 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -69,7 +69,30 @@ abstract class StringRegexExpression extends BinaryExpression
6969
* Simple RegEx pattern matching function
7070
*/
7171
@ExpressionDescription(
72-
usage = "str _FUNC_ pattern - Returns true if `str` matches `pattern`, or false otherwise.")
72+
usage = "str _FUNC_ pattern - Returns true if str matches pattern, " +
73+
"null if any arguments are null, false otherwise.",
74+
extended = """
75+
Arguments:
76+
str - a string expression
77+
pattern - a string expression. The pattern is a string which is matched literally, with
78+
exception to the following special symbols:
79+
80+
_ matches any one character in the input (similar to . in posix regular expressions)
81+
82+
% matches zero ore more characters in the input (similar to .* in posix regular
83+
expressions)
84+
85+
The escape character is '\'. If an escape character precedes a special symbol or another
86+
escape character, the following character is matched literally. It is invalid to escape
87+
any other character.
88+
89+
Examples:
90+
> SELECT '%SystemDrive%\Users\John' _FUNC_ '\%SystemDrive\%\\Users%'
91+
true
92+
93+
See also:
94+
Use RLIKE to match with standard regular expressions.
95+
""")
7396
case class Like(left: Expression, right: Expression) extends StringRegexExpression {
7497

7598
override def escape(v: String): String = StringUtils.escapeLikeRegex(v)

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala

Lines changed: 31 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -19,32 +19,44 @@ package org.apache.spark.sql.catalyst.util
1919

2020
import java.util.regex.{Pattern, PatternSyntaxException}
2121

22+
import org.apache.spark.sql.AnalysisException
2223
import org.apache.spark.unsafe.types.UTF8String
2324

2425
object StringUtils {
2526

26-
// replace the _ with .{1} exactly match 1 time of any character
27-
// replace the % with .*, match 0 or more times with any character
28-
def escapeLikeRegex(v: String): String = {
29-
if (!v.isEmpty) {
30-
"(?s)" + (' ' +: v.init).zip(v).flatMap {
31-
case (prev, '\\') => ""
32-
case ('\\', c) =>
33-
c match {
34-
case '_' => "_"
35-
case '%' => "%"
36-
case _ => Pattern.quote("\\" + c)
37-
}
38-
case (prev, c) =>
27+
/**
28+
* Validate and convert SQL 'like' pattern to a Java regular expression.
29+
*
30+
* Underscores (_) are converted to '.' and percent signs (%) are converted to '.*', other
31+
* characters are quoted literally. Escaping is done according to the rules specified in
32+
* [[org.apache.spark.sql.catalyst.expressions.Like]] usage documentation. An invalid pattern will
33+
* throw an [[AnalysisException]].
34+
*
35+
* @param pattern the SQL pattern to convert
36+
* @return the equivalent Java regular expression of the pattern
37+
*/
38+
def escapeLikeRegex(pattern: String): String = {
39+
val in = pattern.toIterator
40+
val out = new StringBuilder()
41+
42+
def fail(message: String) = throw new AnalysisException(
43+
s"the pattern '$pattern' is invalid, $message")
44+
45+
while (in.hasNext) {
46+
in.next match {
47+
case '\\' if in.hasNext =>
48+
val c = in.next
3949
c match {
40-
case '_' => "."
41-
case '%' => ".*"
42-
case _ => Pattern.quote(Character.toString(c))
50+
case '_' | '%' | '\\' => out ++= Pattern.quote(Character.toString(c))
51+
case _ => fail(s"the escape character is not allowed to precede '$c'")
4352
}
44-
}.mkString
45-
} else {
46-
v
53+
case '\\' => fail("it is not allowed to end with the escape character")
54+
case '_' => out ++= "."
55+
case '%' => out ++= ".*"
56+
case c => out ++= Pattern.quote(Character.toString(c))
57+
}
4758
}
59+
"(?s)" + out.result() // (?s) enables dotall mode, causing "." to match new lines
4860
}
4961

5062
private[this] val trueStrings = Set("t", "true", "y", "yes", "1").map(UTF8String.fromString)

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/RegexpExpressionsSuite.scala

Lines changed: 96 additions & 65 deletions
Original file line numberDiff line numberDiff line change
@@ -18,16 +18,38 @@
1818
package org.apache.spark.sql.catalyst.expressions
1919

2020
import org.apache.spark.SparkFunSuite
21+
import org.apache.spark.sql.AnalysisException
2122
import org.apache.spark.sql.catalyst.dsl.expressions._
22-
import org.apache.spark.sql.types.StringType
23+
import org.apache.spark.sql.types.{IntegerType, StringType}
2324

2425
/**
2526
* Unit tests for regular expression (regexp) related SQL expressions.
2627
*/
2728
class RegexpExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
2829

29-
test("LIKE literal Regular Expression") {
30-
checkEvaluation(Literal.create(null, StringType).like("a"), null)
30+
/**
31+
* Check if a given expression evaluates to an expected output, in case the input is
32+
* a literal and in case the input is in the form of a row.
33+
* @tparam A type of input
34+
* @param mkExpr the expression to test for a given input
35+
* @param input value that will be used to create the expression, as literal and in the form
36+
* of a row
37+
* @param expected the expected output of the expression
38+
* @param inputToExpression an implicit conversion from the input type to its corresponding
39+
* sql expression
40+
*/
41+
def checkLiteralRow[A](mkExpr: Expression => Expression, input: A, expected: Any)
42+
(implicit inputToExpression: A => Expression): Unit = {
43+
checkEvaluation(mkExpr(input), expected) // check literal input
44+
45+
val regex = 'a.string.at(0)
46+
checkEvaluation(mkExpr(regex), expected, create_row(input)) // check row input
47+
}
48+
49+
test("LIKE Pattern") {
50+
51+
// null handling
52+
checkLiteralRow(Literal.create(null, StringType).like(_), "a", null)
3153
checkEvaluation(Literal.create("a", StringType).like(Literal.create(null, StringType)), null)
3254
checkEvaluation(Literal.create(null, StringType).like(Literal.create(null, StringType)), null)
3355
checkEvaluation(
@@ -39,45 +61,64 @@ class RegexpExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
3961
checkEvaluation(
4062
Literal.create(null, StringType).like(NonFoldableLiteral.create(null, StringType)), null)
4163

42-
checkEvaluation("abdef" like "abdef", true)
43-
checkEvaluation("a_%b" like "a\\__b", true)
44-
checkEvaluation("addb" like "a_%b", true)
45-
checkEvaluation("addb" like "a\\__b", false)
46-
checkEvaluation("addb" like "a%\\%b", false)
47-
checkEvaluation("a_%b" like "a%\\%b", true)
48-
checkEvaluation("addb" like "a%", true)
49-
checkEvaluation("addb" like "**", false)
50-
checkEvaluation("abc" like "a%", true)
51-
checkEvaluation("abc" like "b%", false)
52-
checkEvaluation("abc" like "bc%", false)
53-
checkEvaluation("a\nb" like "a_b", true)
54-
checkEvaluation("ab" like "a%b", true)
55-
checkEvaluation("a\nb" like "a%b", true)
56-
}
64+
// simple patterns
65+
checkLiteralRow("abdef" like _, "abdef", true)
66+
checkLiteralRow("a_%b" like _, "a\\__b", true)
67+
checkLiteralRow("addb" like _, "a_%b", true)
68+
checkLiteralRow("addb" like _, "a\\__b", false)
69+
checkLiteralRow("addb" like _, "a%\\%b", false)
70+
checkLiteralRow("a_%b" like _, "a%\\%b", true)
71+
checkLiteralRow("addb" like _, "a%", true)
72+
checkLiteralRow("addb" like _, "**", false)
73+
checkLiteralRow("abc" like _, "a%", true)
74+
checkLiteralRow("abc" like _, "b%", false)
75+
checkLiteralRow("abc" like _, "bc%", false)
76+
checkLiteralRow("a\nb" like _, "a_b", true)
77+
checkLiteralRow("ab" like _, "a%b", true)
78+
checkLiteralRow("a\nb" like _, "a%b", true)
79+
80+
// empty input
81+
checkLiteralRow("" like _, "", true)
82+
checkLiteralRow("a" like _, "", false)
83+
checkLiteralRow("" like _, "a", false)
84+
85+
// SI-17647 double-escaping backslash
86+
checkLiteralRow("""\\\\""" like _, """%\\%""", true)
87+
checkLiteralRow("""%%""" like _, """%%""", true)
88+
checkLiteralRow("""\__""" like _, """\\\__""", true)
89+
checkLiteralRow("""\\\__""" like _, """%\\%\%""", false)
90+
checkLiteralRow("""_\\\%""" like _, """%\\""", false)
91+
92+
// unicode
93+
// scalastyle:off nonascii
94+
checkLiteralRow("a\u20ACa" like _, "_\u20AC_", true)
95+
checkLiteralRow("a€a" like _, "_€_", true)
96+
checkLiteralRow("a€a" like _, "_\u20AC_", true)
97+
checkLiteralRow("a\u20ACa" like _, "_€_", true)
98+
// scalastyle:on nonascii
99+
100+
// invalid escaping
101+
val invalidEscape = intercept[AnalysisException] {
102+
evaluate("""a""" like """\a""")
103+
}
104+
assert(invalidEscape.getMessage.contains("pattern"))
105+
106+
val endEscape = intercept[AnalysisException] {
107+
evaluate("""a""" like """a\""")
108+
}
109+
assert(endEscape.getMessage.contains("pattern"))
110+
111+
// case
112+
checkLiteralRow("A" like _, "a%", false)
113+
checkLiteralRow("a" like _, "A%", false)
114+
checkLiteralRow("AaA" like _, "_a_", true)
57115

58-
test("LIKE Non-literal Regular Expression") {
59-
val regEx = 'a.string.at(0)
60-
checkEvaluation("abcd" like regEx, null, create_row(null))
61-
checkEvaluation("abdef" like regEx, true, create_row("abdef"))
62-
checkEvaluation("a_%b" like regEx, true, create_row("a\\__b"))
63-
checkEvaluation("addb" like regEx, true, create_row("a_%b"))
64-
checkEvaluation("addb" like regEx, false, create_row("a\\__b"))
65-
checkEvaluation("addb" like regEx, false, create_row("a%\\%b"))
66-
checkEvaluation("a_%b" like regEx, true, create_row("a%\\%b"))
67-
checkEvaluation("addb" like regEx, true, create_row("a%"))
68-
checkEvaluation("addb" like regEx, false, create_row("**"))
69-
checkEvaluation("abc" like regEx, true, create_row("a%"))
70-
checkEvaluation("abc" like regEx, false, create_row("b%"))
71-
checkEvaluation("abc" like regEx, false, create_row("bc%"))
72-
checkEvaluation("a\nb" like regEx, true, create_row("a_b"))
73-
checkEvaluation("ab" like regEx, true, create_row("a%b"))
74-
checkEvaluation("a\nb" like regEx, true, create_row("a%b"))
75-
76-
checkEvaluation(Literal.create(null, StringType) like regEx, null, create_row("bc%"))
116+
// example
117+
checkLiteralRow("""%SystemDrive%\Users\John""" like _, """\%SystemDrive\%\\Users%""", true)
77118
}
78119

79-
test("RLIKE literal Regular Expression") {
80-
checkEvaluation(Literal.create(null, StringType) rlike "abdef", null)
120+
test("RLIKE Regular Expression") {
121+
checkLiteralRow(Literal.create(null, StringType) rlike _, "abdef", null)
81122
checkEvaluation("abdef" rlike Literal.create(null, StringType), null)
82123
checkEvaluation(Literal.create(null, StringType) rlike Literal.create(null, StringType), null)
83124
checkEvaluation("abdef" rlike NonFoldableLiteral.create("abdef", StringType), true)
@@ -87,42 +128,32 @@ class RegexpExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
87128
checkEvaluation(
88129
Literal.create(null, StringType) rlike NonFoldableLiteral.create(null, StringType), null)
89130

90-
checkEvaluation("abdef" rlike "abdef", true)
91-
checkEvaluation("abbbbc" rlike "a.*c", true)
131+
checkLiteralRow("abdef" rlike _, "abdef", true)
132+
checkLiteralRow("abbbbc" rlike _, "a.*c", true)
92133

93-
checkEvaluation("fofo" rlike "^fo", true)
94-
checkEvaluation("fo\no" rlike "^fo\no$", true)
95-
checkEvaluation("Bn" rlike "^Ba*n", true)
96-
checkEvaluation("afofo" rlike "fo", true)
97-
checkEvaluation("afofo" rlike "^fo", false)
98-
checkEvaluation("Baan" rlike "^Ba?n", false)
99-
checkEvaluation("axe" rlike "pi|apa", false)
100-
checkEvaluation("pip" rlike "^(pi)*$", false)
134+
checkLiteralRow("fofo" rlike _, "^fo", true)
135+
checkLiteralRow("fo\no" rlike _, "^fo\no$", true)
136+
checkLiteralRow("Bn" rlike _, "^Ba*n", true)
137+
checkLiteralRow("afofo" rlike _, "fo", true)
138+
checkLiteralRow("afofo" rlike _, "^fo", false)
139+
checkLiteralRow("Baan" rlike _, "^Ba?n", false)
140+
checkLiteralRow("axe" rlike _, "pi|apa", false)
141+
checkLiteralRow("pip" rlike _, "^(pi)*$", false)
101142

102-
checkEvaluation("abc" rlike "^ab", true)
103-
checkEvaluation("abc" rlike "^bc", false)
104-
checkEvaluation("abc" rlike "^ab", true)
105-
checkEvaluation("abc" rlike "^bc", false)
143+
checkLiteralRow("abc" rlike _, "^ab", true)
144+
checkLiteralRow("abc" rlike _, "^bc", false)
145+
checkLiteralRow("abc" rlike _, "^ab", true)
146+
checkLiteralRow("abc" rlike _, "^bc", false)
106147

107148
intercept[java.util.regex.PatternSyntaxException] {
108149
evaluate("abbbbc" rlike "**")
109150
}
110-
}
111-
112-
test("RLIKE Non-literal Regular Expression") {
113-
val regEx = 'a.string.at(0)
114-
checkEvaluation("abdef" rlike regEx, true, create_row("abdef"))
115-
checkEvaluation("abbbbc" rlike regEx, true, create_row("a.*c"))
116-
checkEvaluation("fofo" rlike regEx, true, create_row("^fo"))
117-
checkEvaluation("fo\no" rlike regEx, true, create_row("^fo\no$"))
118-
checkEvaluation("Bn" rlike regEx, true, create_row("^Ba*n"))
119-
120151
intercept[java.util.regex.PatternSyntaxException] {
121-
evaluate("abbbbc" rlike regEx, create_row("**"))
152+
val regex = 'a.string.at(0)
153+
evaluate("abbbbc" rlike regex, create_row("**"))
122154
}
123155
}
124156

125-
126157
test("RegexReplace") {
127158
val row1 = create_row("100-200", "(\\d+)", "num")
128159
val row2 = create_row("100-200", "(\\d+)", "###")

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/StringUtilsSuite.scala

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -24,9 +24,9 @@ class StringUtilsSuite extends SparkFunSuite {
2424

2525
test("escapeLikeRegex") {
2626
assert(escapeLikeRegex("abdef") === "(?s)\\Qa\\E\\Qb\\E\\Qd\\E\\Qe\\E\\Qf\\E")
27-
assert(escapeLikeRegex("a\\__b") === "(?s)\\Qa\\E_.\\Qb\\E")
27+
assert(escapeLikeRegex("a\\__b") === "(?s)\\Qa\\E\\Q_\\E.\\Qb\\E")
2828
assert(escapeLikeRegex("a_%b") === "(?s)\\Qa\\E..*\\Qb\\E")
29-
assert(escapeLikeRegex("a%\\%b") === "(?s)\\Qa\\E.*%\\Qb\\E")
29+
assert(escapeLikeRegex("a%\\%b") === "(?s)\\Qa\\E.*\\Q%\\E\\Qb\\E")
3030
assert(escapeLikeRegex("a%") === "(?s)\\Qa\\E.*")
3131
assert(escapeLikeRegex("**") === "(?s)\\Q*\\E\\Q*\\E")
3232
assert(escapeLikeRegex("a_b") === "(?s)\\Qa\\E.\\Qb\\E")

0 commit comments

Comments
 (0)