-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-17647][SQL] Fix backslash escaping in 'LIKE' patterns. #15398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I'll add some more tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we get rid of the fold here? The fold makes it more difficult to parse (e.g. what's the return type, what's escaping, what's next).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I updated it in the latest commit
|
Test build #66547 has finished for PR 15398 at commit
|
|
Test build #66552 has finished for PR 15398 at commit
|
|
Test build #66549 has finished for PR 15398 at commit
|
|
Every non-java-pattern character is quoted now, updated the StringUtils test suite |
|
Test build #66561 has finished for PR 15398 at commit
|
|
cc @viirya, I changed your original code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about "\\a"? Previously it is \Q\a\E, now it seems becoming \Qa\E.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every character after a backslash is quoted, so \\a becomes \Q\\E\Qa\E. Is this not the intended behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\Q\\E\Qa\E is correct. But doesn't it become \Qa\E in this change?
For \\a, the prefixing \\ will go the next branch and enable escaping. Then the next char a will be quoted here. So it becomes \Qa\E. BTW, before this change, it will be \Q\a\E.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're misunderstanding each other, there are so many different contexts in which a backslash is used :)
The prefix '\\' that matches the escape char here is the scala source escape for the literal character '\' (If scala had triple quotes for chars as it does for strings, I could have written '''\'''). So, in the if statement \\ will consume one character, a \.
The subsequent \ will be quoted in the escaping case, and the final a will also be quoted in the last case, resulting in a \Q\\E\Qa\E.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. @jodersky. Thanks for clarifying.
|
Also cc @yhuai and @JoshRosen @mengxr Please check whether the changes here can satisfy what you want. Thanks! |
|
@jodersky Thank you for the patch! How about we add a summary related to the behavior of escape in other databases in pr description and jira? Also, let's document our behavior in the doc. |
|
I think we should consider other databases behavior and see if our behavior makes sense. |
|
I see. Agree. Will do an investigation with @jodersky Thanks! |
|
Test build #66918 has finished for PR 15398 at commit
|
|
The style issues are weird. Running scalastyle in sbt worked fine...anyway, will investigate once we decide on the like behaviour to adopt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to turn off scalastyle when having unicode.
// scalastyle:off
// non ascii characters are not allowed in the source code, so we disable the scalastyle.
checkEvaluation("a€a" like "_€_", true)
...
// scalastyle:onThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more than 100 characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this empty line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I just saw those. It turns out that scalastyle is a scoped task, so you have to run test:scalastyle in sbt to make sure everything gets checked.
|
Test build #66974 has finished for PR 15398 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated: I got an exception from DB2 like Postgre after explicitly specifying the escape. DB2 follows the standard. It does not have a default escape. See the example:
select actkwd from act where actkwd like '%A%\' escape '\'
SQL0130N The ESCAPE clause is not a single character, or the pattern string contains an invalid occurrence of the escape character. SQLSTATE=22025
cc @yhuai
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the description of SQL -130
SQL0130N
The ESCAPE clause is not a single character, or the pattern string contains an invalid occurrence of the escape character.
Explanation
The escape character must be a single character no more than two bytes in length. It can only appear in the pattern string if it is followed by itself, a percent sign, or an underscore. For more information about the ESCAPE clause on the LIKE predicate, refer to the SQL Reference.
User response
Correct the pattern string or the escape character accordingly.`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The escape character must be a single character no more than two bytes in length. It can only appear in the pattern string if it is followed by itself, a percent sign, or an underscore.
I guess this means that point 3 should also throw an exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To answer your point 3, I did a try in DB2.
db2 => select actkwd from act where actkwd like '%A%\a' escape '\'
SQL0130N The ESCAPE clause is not a single character, or the pattern string
contains an invalid occurrence of the escape character. SQLSTATE=22025
In DB2, normally, our design is very conservative. If we think this could be a user error, we will stop it with an error. We do not want to give users any surprise.
|
As @jodersky said, SQL2003 syntax is like
@yhuai @rxin Should we allow users to specify the ESCAPE characters? Thanks! |
|
Sounds like a good idea. It doesn't need to be in this PR though ... |
|
@rxin thanks for the feedback. It was my plan to add the escape syntax in a separate PR. The main questions that I'm concerned with here are the points I enumerate in this PRs description, namely:
|
|
Also cc @yhuai , @JoshRosen and @mengxr |
|
For escaping before a non-special character, I don't know if DB2 is special. Because as I try, MySQL behaving like PostgreSQL. |
|
That is a design decision we need to make. Personally, MySQL and PostgreSQL are not good examples we should follow. Let us ask ourselves a simple question:
|
|
Also checked SQL Server. It behaves differently from both
Let me check Oracle now |
|
Also cc @hvanhovell who is the major reviewer about parser-related PRs. |
|
@hvanhovell, I just wanted to add that option 2 from my last comment won't work: if we replace Considering that, another potential solution would be change the ANTLR parser to handle escaping differently in LIKE expressions. This would however introduce a special case which complicates the overall rules. |
|
Test build #69932 has finished for PR 15398 at commit
|
|
@mengxr, @rxin, @yhuai, @hvanhovell, what's the status of this PR? How much works remains before this is good to go? What decisions are we blocking on? I ask because this conflicts with a small perf. optimization PR in |
|
Ping @jodersky, @rxin, @mengxr, @yhuai, @hvanhovell we also run into this bug. Will be great we have it by 2.2 release. Thanks. |
| _ matches any one character in the input (similar to . in posix regular expressions) | ||
| % matches zero ore more characters in the input (similar to .* in posix regular |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ore -> or?
|
Re-checked the current change, I think it is in a good shape. Do we have unsolved issues or decisions on this? ping @jodersky Would you like to update this with master? Thanks. |
|
LGTM |
|
I've resolved the conflict and merged this in master/branch-2.1. Thanks. |
This patch fixes a bug in the way LIKE patterns are translated to Java regexes. The bug causes any character following an escaped backslash to be escaped, i.e. there is double-escaping. A concrete example is the following pattern:`'%\\%'`. The expected Java regex that this pattern should correspond to (according to the behavior described below) is `'.*\\.*'`, however the current situation leads to `'.*\\%'` instead. --- Update: in light of the discussion that ensued, we should explicitly define the expected behaviour of LIKE expressions, especially in certain edge cases. With the help of gatorsmile, we put together a list of different RDBMS and their variations wrt to certain standard features. | RDBMS\Features | Wildcards | Default escape [1] | Case sensitivity | | --- | --- | --- | --- | | [MS SQL Server](https://msdn.microsoft.com/en-us/library/ms179859.aspx) | _, %, [], [^] | none | no | | [Oracle](https://docs.oracle.com/cd/B12037_01/server.101/b10759/conditions016.htm) | _, % | none | yes | | [DB2 z/OS](http://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_likepredicate.html) | _, % | none | yes | | [MySQL](http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html) | _, % | none | no | | [PostreSQL](https://www.postgresql.org/docs/9.0/static/functions-matching.html) | _, % | \ | yes | | [Hive](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) | _, % | none | yes | | Current Spark | _, % | \ | yes | [1] Default escape character: most systems do not have a default escape character, instead the user can specify one by calling a like expression with an escape argument [A] LIKE [B] ESCAPE [C]. This syntax is currently not supported by Spark, however I would volunteer to implement this feature in a separate ticket. The specifications are often quite terse and certain scenarios are undocumented, so here is a list of scenarios that I am uncertain about and would appreciate any input. Specifically I am looking for feedback on whether or not Spark's current behavior should be changed. 1. [x] Ending a pattern with the escape sequence, e.g. `like 'a\'`. PostreSQL gives an error: 'LIKE pattern must not end with escape character', which I personally find logical. Currently, Spark allows "non-terminated" escapes and simply ignores them as part of the pattern. According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), ending a pattern in an escape character is invalid. _Proposed new behaviour in Spark: throw AnalysisException_ 2. [x] Empty input, e.g. `'' like ''` Postgres and DB2 will match empty input only if the pattern is empty as well, any other combination of empty input will not match. Spark currently follows this rule. 3. [x] Escape before a non-special character, e.g. `'a' like '\a'`. Escaping a non-wildcard character is not really documented but PostgreSQL just treats it verbatim, which I also find the least surprising behavior. Spark does the same. According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), it is invalid to follow an escape character with anything other than an escape character, an underscore or a percent sign. _Proposed new behaviour in Spark: throw AnalysisException_ The current specification is also described in the operator's source code in this patch. Extra case in regex unit tests. Author: Jakob Odersky <[email protected]> This patch had conflicts when merged, resolved by Committer: Reynold Xin <[email protected]> Closes #15398 from jodersky/SPARK-17647. (cherry picked from commit e5fee3e) Signed-off-by: Reynold Xin <[email protected]>
|
This seems to have broken all of the builds in branch-2.1, e.g. https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.1-compile-maven-hadoop-2.6/591/consoleFull https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/ |
|
I pushed a commit. Hopefully that fixes it. |
|
Glad to see this being merged. Let me know if there is anything more I can do to help with any issues. |
|
Submit a PR to fix the issues in my PR. Thanks! |
## What changes were proposed in this pull request? This patch fixes a bug in the way LIKE patterns are translated to Java regexes. The bug causes any character following an escaped backslash to be escaped, i.e. there is double-escaping. A concrete example is the following pattern:`'%\\%'`. The expected Java regex that this pattern should correspond to (according to the behavior described below) is `'.*\\.*'`, however the current situation leads to `'.*\\%'` instead. --- Update: in light of the discussion that ensued, we should explicitly define the expected behaviour of LIKE expressions, especially in certain edge cases. With the help of gatorsmile, we put together a list of different RDBMS and their variations wrt to certain standard features. | RDBMS\Features | Wildcards | Default escape [1] | Case sensitivity | | --- | --- | --- | --- | | [MS SQL Server](https://msdn.microsoft.com/en-us/library/ms179859.aspx) | _, %, [], [^] | none | no | | [Oracle](https://docs.oracle.com/cd/B12037_01/server.101/b10759/conditions016.htm) | _, % | none | yes | | [DB2 z/OS](http://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_likepredicate.html) | _, % | none | yes | | [MySQL](http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html) | _, % | none | no | | [PostreSQL](https://www.postgresql.org/docs/9.0/static/functions-matching.html) | _, % | \ | yes | | [Hive](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) | _, % | none | yes | | Current Spark | _, % | \ | yes | [1] Default escape character: most systems do not have a default escape character, instead the user can specify one by calling a like expression with an escape argument [A] LIKE [B] ESCAPE [C]. This syntax is currently not supported by Spark, however I would volunteer to implement this feature in a separate ticket. The specifications are often quite terse and certain scenarios are undocumented, so here is a list of scenarios that I am uncertain about and would appreciate any input. Specifically I am looking for feedback on whether or not Spark's current behavior should be changed. 1. [x] Ending a pattern with the escape sequence, e.g. `like 'a\'`. PostreSQL gives an error: 'LIKE pattern must not end with escape character', which I personally find logical. Currently, Spark allows "non-terminated" escapes and simply ignores them as part of the pattern. According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), ending a pattern in an escape character is invalid. _Proposed new behaviour in Spark: throw AnalysisException_ 2. [x] Empty input, e.g. `'' like ''` Postgres and DB2 will match empty input only if the pattern is empty as well, any other combination of empty input will not match. Spark currently follows this rule. 3. [x] Escape before a non-special character, e.g. `'a' like '\a'`. Escaping a non-wildcard character is not really documented but PostgreSQL just treats it verbatim, which I also find the least surprising behavior. Spark does the same. According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), it is invalid to follow an escape character with anything other than an escape character, an underscore or a percent sign. _Proposed new behaviour in Spark: throw AnalysisException_ The current specification is also described in the operator's source code in this patch. ## How was this patch tested? Extra case in regex unit tests. Author: Jakob Odersky <[email protected]> This patch had conflicts when merged, resolved by Committer: Reynold Xin <[email protected]> Closes apache#15398 from jodersky/SPARK-17647.
What changes were proposed in this pull request?
This patch fixes a bug in the way LIKE patterns are translated to Java regexes. The bug causes any character following an escaped backslash to be escaped, i.e. there is double-escaping.
A concrete example is the following pattern:
'%\\%'. The expected Java regex that this pattern should correspond to (according to the behavior described below) is'.*\\.*', however the current situation leads to'.*\\%'instead.Update: in light of the discussion that ensued, we should explicitly define the expected behaviour of LIKE expressions, especially in certain edge cases. With the help of @gatorsmile, we put together a list of different RDBMS and their variations wrt to certain standard features.
[1] Default escape character: most systems do not have a default escape character, instead the user can specify one by calling a like expression with an escape argument [A] LIKE [B] ESCAPE [C]. This syntax is currently not supported by Spark, however I would volunteer to implement this feature in a separate ticket.
The specifications are often quite terse and certain scenarios are undocumented, so here is a list of scenarios that I am uncertain about and would appreciate any input. Specifically I am looking for feedback on whether or not Spark's current behavior should be changed.
like 'a\'.PostreSQL gives an error: 'LIKE pattern must not end with escape character', which I personally find logical. Currently, Spark allows "non-terminated" escapes and simply ignores them as part of the pattern.
According to DB2's documentation, ending a pattern in an escape character is invalid.
Proposed new behaviour in Spark: throw AnalysisException
'' like ''Postgres and DB2 will match empty input only if the pattern is empty as well, any other combination of empty input will not match. Spark currently follows this rule.
'a' like '\a'.Escaping a non-wildcard character is not really documented but PostgreSQL just treats it verbatim, which I also find the least surprising behavior. Spark does the same.
According to DB2's documentation, it is invalid to follow an escape character with anything other than an escape character, an underscore or a percent sign.
Proposed new behaviour in Spark: throw AnalysisException
The current specification is also described in the operator's source code in this patch.
How was this patch tested?
Extra case in regex unit tests.