
Conversation


@visaxin visaxin commented Jun 30, 2017

SQL regex docs change:
SELECT _FUNC_('100-200', '(\d+)', 'num') => SELECT _FUNC_('100-200', '(\\d+)', 'num')


gf53520 commented Jun 30, 2017

test this please

Member

There is another example that needs the same change near the end of the file too.

Member

Do we need to fix this? As I recall, the doc uses unescaped characters.

Member

@viirya I'm not an expert here, but reading the docs on line 160, I think this needs to be escaped in order to be consistent with Spark 2's default behavior. My assumption was that this example was just never updated.

Member

Hmm, when I wrote the docs on line 160, it was suggested that I use unescaped characters:

Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser. For example, to match "\abc", a regular expression for regexp can be "^\abc$".

Actually, you need to write it like this in spark-shell:

scala> sql("SELECT like('\\\\abc', '\\\\\\\\abc')").show
+---------------+
|\abc LIKE \\abc|
+---------------+
|           true|
+---------------+

scala> sql("SELECT regexp_replace('100-200', '(\\\\d+)', 'num')").show
+-----------------------------------+
|regexp_replace(100-200, (\d+), num)|
+-----------------------------------+
|                            num-num|
+-----------------------------------+

When parsing SQL string literals, Spark 2 reads \\\\abc as \abc and (\\\\d+) as (\d+) in spark-shell.

But in spark-sql, you write the queries like this:

spark-sql> SELECT like('\\abc', '\\\\abc');
true
Time taken: 0.061 seconds, Fetched 1 row(s)

spark-sql> SELECT regexp_replace('100-200', '(\\d+)', 'num');
num-num
Time taken: 0.117 seconds, Fetched 1 row(s)

So depending on how the shell environment processes string escaping, the query looks different. It seems to me that writing the docs in unescaped style avoids this confusion.
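To see why the same pattern needs a different number of backslashes in each shell, it may help to simulate the layers of unescaping by hand. The sketch below is plain Python, not Spark: `unescape_layer` is a hypothetical helper that models one string-literal pass (Scala's in spark-shell, then the Spark 2 SQL parser's), each of which collapses `\\` into `\`.

```python
def unescape_layer(s: str) -> str:
    """Model one escaping layer: a backslash makes the next character
    literal, so the two-character sequence '\\' collapses to one '\'."""
    out = []
    i = 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            out.append(s[i + 1])  # keep the escaped character as-is
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

# spark-shell: you type (\\\\d+) inside sql("...") -- four backslashes.
typed_in_spark_shell = r"(\\\\d+)"                 # raw string: ( \ \ \ \ d + )
after_scala = unescape_layer(typed_in_spark_shell)  # Scala literal pass
after_sql = unescape_layer(after_scala)             # SQL parser pass
print(after_scala)  # (\\d+)  -- what the SQL parser receives
print(after_sql)    # (\d+)   -- what the regex engine receives

# spark-sql: there is no Scala layer, so you type (\\d+) directly.
typed_in_spark_sql = r"(\\d+)"                      # ( \ \ d + )
print(unescape_layer(typed_in_spark_sql))           # (\d+)
```

Two passes over the spark-shell input and one pass over the spark-sql input both end at the same regex, `(\d+)`, which is why the two shells show the query differently.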

Member

Is the better fix to make it clear that this example uses unescaped style, @viirya?

Member

Yeah, if we can.

Author

I added spark-sql and scala prompts to make it clear.

@gatorsmile

@visaxin Could you address the comment?


visaxin commented Oct 24, 2017

@gatorsmile Done

spark-sql> SELECT _FUNC_('100-200', '(\\d+)-(\\d+)', 1);
100
scala> SELECT _FUNC_('100-200', '(\\\\d+)-(\\\\d+)', 1);
Member

scala> spark.sql("SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 1)").collect()

100
scala> SELECT _FUNC_('100-200', '(\\\\d+)-(\\\\d+)', 1);
100
Member

Array([100])

num-num
scala> SELECT _FUNC_('100-200', '(\\\\d+)', 'num');
num-num
Member

scala> spark.sql("SELECT regexp_replace('100-200', '(\\d+)', 'num')").collect()
Array([num-num])

@gatorsmile

LGTM except the above three comments.


srowen commented May 11, 2018

@visaxin this one is old but could you update it per the last review comments?

@AmplabJenkins

Can one of the admins verify this patch?


srowen commented Jul 2, 2018

Ping @visaxin

@HyukjinKwon

Ping @visaxin


srowen commented Jul 18, 2018

I took this over at #21808
I don't think this change is even right, as it introduces scala-shell examples.

@srowen srowen mentioned this pull request Jul 18, 2018
asfgit pushed a commit that referenced this pull request Jul 18, 2018
## What changes were proposed in this pull request?

Fix regexes in spark-sql command examples.
This takes over #18477

## How was this patch tested?

Existing tests. I verified the existing example doesn't work in spark-sql, but the new ones do.

Author: Sean Owen <[email protected]>

Closes #21808 from srowen/SPARK-21261.
@asfgit asfgit closed this in 1a4fda8 Jul 19, 2018