[SPARK-21261][DOCS] SQL Regex document fix #18477
Conversation
test this please
There is another example that needs the same change near the end of the file too.
Do we need to fix this? I remember in the doc, we use unescaped characters.
@viirya I'm not an expert here, but reading the docs on line 160, I think this needs to be escaped in order to be consistent with Spark 2's default behavior? My assumption was that this was just never updated.
Hmm, when I wrote the docs on line 160, it was suggested that I use unescaped characters.
Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser. For example, to match "\abc", a regular expression for regexp can be "^\abc$".
Actually, you need to write like this in spark-shell:
scala> sql("SELECT like('\\\\abc', '\\\\\\\\abc')").show
+---------------+
|\abc LIKE \\abc|
+---------------+
| true|
+---------------+
scala> sql("SELECT regexp_replace('100-200', '(\\\\d+)', 'num')").show
+-----------------------------------+
|regexp_replace(100-200, (\d+), num)|
+-----------------------------------+
| num-num|
+-----------------------------------+
In spark-shell, Spark 2's string-literal parsing means that '\\\\abc' in the Scala source ends up as \abc, and '(\\\\d+)' ends up as (\d+), by the time the regex engine sees them.
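To make the two unescaping layers explicit (an illustrative aside, not from the original comment), the spark-shell REPL echo already shows the first layer:

scala> val sqlText = "SELECT regexp_replace('100-200', '(\\\\d+)', 'num')"
sqlText: String = SELECT regexp_replace('100-200', '(\\d+)', 'num')

The four backslashes in the Scala source literal become two in the String value; Spark's SQL parser then unescapes the string literal once more, so the regex engine finally sees (\d+).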
But in spark-sql, you write the queries like this:
spark-sql> SELECT like('\\abc', '\\\\abc');
true
Time taken: 0.061 seconds, Fetched 1 row(s)
spark-sql> SELECT regexp_replace('100-200', '(\\d+)', 'num');
num-num
Time taken: 0.117 seconds, Fetched 1 row(s)
So depending on how the shell environment processes string escaping, the same query looks different. In the docs, it seems to me that writing in the unescaped style can avoid this confusion?
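As an aside (not from the original thread), one way to sidestep the Scala escaping layer in spark-shell is a triple-quoted string, so the query text matches what is typed at the spark-sql prompt. The output below is a sketch assuming the default Spark 2.x parser behavior and mirrors the earlier example:

scala> sql("""SELECT regexp_replace('100-200', '(\\d+)', 'num')""").show
+-----------------------------------+
|regexp_replace(100-200, (\d+), num)|
+-----------------------------------+
|                            num-num|
+-----------------------------------+

Triple quotes keep both backslashes literal in Scala, so the SQL parser receives (\\d+) and unescapes it to the regex (\d+), just as in the spark-sql session above.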
Is the better fix to make it clear that this example uses the unescaped style, @viirya?
Yeah, if we can.
I added spark-sql and scala prompts to make it clear.
@visaxin Could you address the comment?
Force-pushed 1850d87 to 8a7dd55
Force-pushed 8a7dd55 to 08adf17
@gatorsmile Done
spark-sql> SELECT _FUNC_('100-200', '(\\d+)-(\\d+)', 1);
100
scala> SELECT _FUNC_('100-200', '(\\\\d+)-(\\\\d+)', 1);
scala> spark.sql("SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 1)").collect()
100
scala> SELECT _FUNC_('100-200', '(\\\\d+)-(\\\\d+)', 1);
100
Array([100])
num-num
scala> SELECT _FUNC_('100-200', '(\\\\d+)', 'num');
num-num
scala> spark.sql("SELECT regexp_replace('100-200', '(\\d+)', 'num')").collect()
Array([num-num])
LGTM except the above three comments.
@visaxin this one is old but could you update it per the last review comments?
Can one of the admins verify this patch?
Ping @visaxin
Ping @visaxin
I took this over at #21808
## What changes were proposed in this pull request?

Fix regexes in spark-sql command examples. This takes over #18477.

## How was this patch tested?

Existing tests. I verified that the existing example doesn't work in spark-sql, but the new ones do.

Author: Sean Owen <[email protected]>

Closes #21808 from srowen/SPARK-21261.
SQL regex docs change:
SELECT _FUNC_('100-200', '(\d+)', 'num') => SELECT _FUNC_('100-200', '(\\d+)', 'num')
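For reference, a quick sanity check of the corrected examples at the spark-sql prompt (a sketch based on the outputs quoted in the review above; "Time taken" lines omitted and exact formatting may vary by version):

spark-sql> SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 1);
100
spark-sql> SELECT regexp_replace('100-200', '(\\d+)', 'num');
num-num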