[SPARK-18952] Regex strings not properly escaped in codegen for aggregations #16361

brkyvz · 2016-12-20T23:17:03Z

What changes were proposed in this pull request?

If I use the function regexp_extract with a regex string that contains \ (the escape character), codegen fails, because the \ is not escaped when the string is emitted into the generated code.

Example generated code and stack trace:

/* 059 */     private int maxSteps = 2;
/* 060 */     private int numRows = 0;
/* 061 */     private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add("date_format(window#325.start, yyyy-MM-dd HH:mm)", org.apache.spark.sql.types.DataTypes.StringType)
/* 062 */     .add("regexp_extract(source#310.description, ([a-zA-Z]+)\[.*, 1)", org.apache.spark.sql.types.DataTypes.StringType);
/* 063 */     private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add("sum", org.apache.spark.sql.types.DataTypes.LongType);
/* 064 */     private Object emptyVBase;

...

org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 62, Column 58: Invalid escape sequence
	at org.codehaus.janino.Scanner.scanLiteralCharacter(Scanner.java:918)
	at org.codehaus.janino.Scanner.produce(Scanner.java:604)
	at org.codehaus.janino.Parser.peekRead(Parser.java:3239)
	at org.codehaus.janino.Parser.parseArguments(Parser.java:3055)
	at org.codehaus.janino.Parser.parseSelector(Parser.java:2914)
	at org.codehaus.janino.Parser.parseUnaryExpression(Parser.java:2617)
	at org.codehaus.janino.Parser.parseMultiplicativeExpression(Parser.java:2573)
	at org.codehaus.janino.Parser.parseAdditiveExpression(Parser.java:2552)

In the codegen'd expression, the literal should use \\ instead of \.

A similar problem was solved here: #15156.
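To make the escaping problem concrete, here is a minimal Java sketch of what "use \\ instead of \" means when a field name is interpolated into generated source. `JavaLiteralEscaper` and `escape` are hypothetical names used only for illustration; they are not Spark APIs, and the sketch covers only the most common escapes.

```java
// Hypothetical helper, not part of Spark: escapes text so it can be placed
// between double quotes in generated Java source.
final class JavaLiteralEscaper {
    static String escape(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 8);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break; // one backslash becomes two
                case '"':  sb.append("\\\""); break; // keep quotes from ending the literal
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The field name from the failing generated code (a single backslash before '[').
        String fieldName = "regexp_extract(source#310.description, ([a-zA-Z]+)\\[.*, 1)";
        // Unsafe interpolation: the emitted source contains "\[", which is not a valid Java escape.
        System.out.println(".add(\"" + fieldName + "\", ...)");
        // Escaped interpolation: the emitted source contains "\\[", which compiles.
        System.out.println(".add(\"" + escape(fieldName) + "\", ...)");
    }
}
```

Escaping by hand is easy to get subtly wrong, which is one reason to avoid placing arbitrary text in generated source at all.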

How was this patch tested?

Regression test in DataFrameAggregationSuite

brkyvz · 2016-12-20T23:17:43Z

cc @JoshRosen

JoshRosen · 2016-12-20T23:21:24Z

...ore/src/main/scala/org/apache/spark/sql/execution/aggregate/VectorizedHashMapGenerator.scala

    val generatedSchema: String =
      s"new org.apache.spark.sql.types.StructType()" +
        (groupingKeySchema ++ bufferSchema).map { key =>
+          val keyName = ctx.addReferenceObj("keyName", key.name)


You can use the unnamed / anonymous version of addReferenceObj here:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

Line 95 in 84284e8

def addReferenceObj(obj: Any): String = {

done. Thanks for the suggestion!

JoshRosen · 2016-12-20T23:21:37Z

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/RowBasedHashMapGenerator.scala

    val generatedKeySchema: String =
      s"new org.apache.spark.sql.types.StructType()" +
        groupingKeySchema.map { key =>
+          val keyName = ctx.addReferenceObj("keyName", key.name)


Same here; I'd use the unnamed overload.
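For readers unfamiliar with the reference-object approach, the following Java sketch shows the general idea: the value is stored in an array that is handed to the generated class, and the generated source contains only a lookup expression, so no user-supplied text is ever parsed as Java. `ReferenceTable` is a hypothetical stand-in written for this illustration; it is not Spark's `CodegenContext`, and its indexing and naming are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a codegen context; not Spark's implementation.
final class ReferenceTable {
    private final List<Object> references = new ArrayList<>();

    // Stores obj and returns a Java expression that reads it back at runtime.
    String addReferenceObj(Object obj) {
        int idx = references.size();
        references.add(obj);
        return "((" + obj.getClass().getName() + ") references[" + idx + "])";
    }

    // The array that would be passed to the generated class.
    Object[] toArray() {
        return references.toArray();
    }

    public static void main(String[] args) {
        ReferenceTable ctx = new ReferenceTable();
        String keyName =
            ctx.addReferenceObj("regexp_extract(source#310.description, ([a-zA-Z]+)\\[.*, 1)");
        // The emitted source contains only the lookup expression, never the raw name.
        System.out.println("new org.apache.spark.sql.types.StructType().add(" + keyName
            + ", org.apache.spark.sql.types.DataTypes.StringType)");
    }
}
```

Because the returned expression is already valid Java, it should be spliced into the generated source as-is, without surrounding quotes.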

SparkQA · 2016-12-21T02:01:21Z

Test build #70432 has finished for PR 16361 at commit b858204.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-21T02:19:29Z

Test build #70433 has finished for PR 16361 at commit c2de5ee.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-12-23T01:58:24Z

It seems to me that the grouping key alias is only used for execution (the logical Aggregate node doesn't need grouping expressions to be named). Can we just alias them with k1, k2, ... to avoid this problem? i.e. in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L221

JoshRosen · 2017-01-03T23:19:10Z

@cloud-fan, while I agree that the field names aren't really used here, I don't think patterns.scala is the right place to fix this because it seems a little dodgy to be picking a safe name way over there to fix an unsafe interpolation over here. I'd be supportive of a change to set dummy names here in the *HashMapGenerator classes or to make a builder for StructTypes with anonymous / safe field names, but that's a larger change that I think should be done separately (it could also be accompanied by a change to remove the need to pass in full StructTypes when we only need the field count and types).

Given this, I'm going to retest and merge this as-is.
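As a rough illustration of the "dummy names" alternative discussed above (not what this PR does), the Java sketch below builds the schema source from positional names only, so expression text never reaches the generated code. `SafeSchemaSource` and `schemaSource` are hypothetical names, and the data-type expressions are assumed to be trusted strings produced by the generator itself.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: generates StructType source with positional field names.
final class SafeSchemaSource {
    // dataTypeExprs holds Java expressions for each field's type, in order.
    static String schemaSource(List<String> dataTypeExprs) {
        StringBuilder sb = new StringBuilder("new org.apache.spark.sql.types.StructType()");
        for (int i = 0; i < dataTypeExprs.size(); i++) {
            // Names are k1, k2, ...; they contain nothing that needs escaping.
            sb.append(".add(\"k").append(i + 1).append("\", ")
              .append(dataTypeExprs.get(i)).append(")");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(schemaSource(Arrays.asList(
            "org.apache.spark.sql.types.DataTypes.StringType",
            "org.apache.spark.sql.types.DataTypes.LongType")));
    }
}
```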

JoshRosen · 2017-01-03T23:19:15Z

Jenkins, retest this please

SparkQA · 2017-01-03T23:28:31Z

Test build #70838 has finished for PR 16361 at commit c2de5ee.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2017-01-03T23:49:59Z

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/RowBasedHashMapGenerator.scala

    val generatedKeySchema: String =
      s"new org.apache.spark.sql.types.StructType()" +
        groupingKeySchema.map { key =>
+          val keyName = ctx.addReferenceObj(key.name)


Looks like you now need to use addReferenceMinorObj after a conflicting change was merged. We'll need to use the current version of the patch when backporting into branch-2.1, though.

JoshRosen · 2017-01-09T20:01:05Z

LGTM pending Jenkins. Thanks!

SparkQA · 2017-01-09T22:13:09Z

Test build #71088 has finished for PR 16361 at commit df3a852.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2017-01-09T23:17:58Z

Merged to master.

… for aggregations

## What changes were proposed in this pull request?

Backport for #16361 to 2.1 branch.

## How was this patch tested?

Unit tests

Author: Burak Yavuz <[email protected]>

Closes #16518 from brkyvz/reg-break-2.1.

…ors.

## What changes were proposed in this pull request?

When fixing schema field names using escape characters with `addReferenceMinorObj()` at [SPARK-18952](https://issues.apache.org/jira/browse/SPARK-18952) (apache#16361), the double-quotes around the names remained, so the names became something like `"((java.lang.String) references[1])"`.

```java
/* 055 */     private int maxSteps = 2;
/* 056 */     private int numRows = 0;
/* 057 */     private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add("((java.lang.String) references[1])", org.apache.spark.sql.types.DataTypes.StringType);
/* 058 */     private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add("((java.lang.String) references[2])", org.apache.spark.sql.types.DataTypes.LongType);
/* 059 */     private Object emptyVBase;
```

We should remove the double-quotes so that the values in `references` are referenced properly:

```java
/* 055 */     private int maxSteps = 2;
/* 056 */     private int numRows = 0;
/* 057 */     private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[1]), org.apache.spark.sql.types.DataTypes.StringType);
/* 058 */     private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[2]), org.apache.spark.sql.types.DataTypes.LongType);
/* 059 */     private Object emptyVBase;
```

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <[email protected]>

Closes apache#19491 from ueshin/issues/SPARK-22273.
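The effect of the leftover quotes can be shown with a small Java sketch. The two methods below mirror what the generated code evaluates with and without the quotes; `QuotedReferenceDemo` and its method names are hypothetical and exist only for this illustration.

```java
// Hypothetical demo of the quoted-vs-unquoted reference expression.
final class QuotedReferenceDemo {
    // With quotes: the expression is just a string literal, so the field name
    // is the text of the lookup expression and `references` is never read.
    static String quotedFieldName(Object[] references) {
        return "((java.lang.String) references[1])";
    }

    // Without quotes: the expression is a cast plus array lookup, so the field
    // name is the value stored in the references array.
    static String unquotedFieldName(Object[] references) {
        return ((java.lang.String) references[1]);
    }

    public static void main(String[] args) {
        Object[] references = {
            null,
            "regexp_extract(source#310.description, ([a-zA-Z]+)\\[.*, 1)"
        };
        System.out.println(quotedFieldName(references));
        System.out.println(unquotedFieldName(references));
    }
}
```

Both forms compile, which is why the quoted version did not show up as a compilation failure.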

brkyvz added 2 commits December 20, 2016 14:51

Save

88e29bb

Fixed

b858204

JoshRosen reviewed Dec 20, 2016

use unnamed version

c2de5ee

JoshRosen reviewed Jan 3, 2017

fix mc

df3a852

brkyvz mentioned this pull request Jan 9, 2017

[BACKPORT][SPARK-18952] Regex strings not properly escaped in codegen for aggregations #16518

Closed

asfgit closed this in faabe69 Jan 9, 2017

ueshin mentioned this pull request Oct 13, 2017

[SPARK-22273][SQL] Fix key/value schema field names in HashMapGenerators. #19491

Closed

brkyvz deleted the reg-break branch February 3, 2019 20:58

4 participants
