Skip to content

Conversation

@TongWei1105
Copy link
Contributor

@TongWei1105 TongWei1105 commented Feb 10, 2022

What changes were proposed in this pull request?

bug fix

Why are the changes needed?

When spark.sql.parser.quotedRegexColumnNames=true

SELECT `(C3)?+.+`,`C1` * C2 FROM (SELECT 3 AS C1,2 AS C2,1 AS C3) T;

The above query will throw an exception

Error: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'multiply'
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:370)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:266)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
        at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:44)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:266)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:261)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:275)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'multiply'
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:155)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1700)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1671)
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:342)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:342)
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:339)
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408)
        at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
        at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406)
        at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:339)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.expandStarExpression(Analyzer.scala:1671)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$buildExpandedProjectList$1(Analyzer.scala:1656) 

It works fine in hive, because hive treats a pattern with all alphabets/digits and "_" as a normal string

  /**
   * Returns whether the pattern is a regex expression (instead of a normal
   * string). Normal string is a string with all alphabets/digits and "_".
   */
  static boolean isRegex(String pattern, HiveConf conf) {
    String qIdSupport = HiveConf.getVar(conf, HiveConf.ConfVars.HIVE_QUOTEDID_SUPPORT);
    if ( "column".equals(qIdSupport)) {
      return false;
    }
    for (int i = 0; i < pattern.length(); i++) {
      if (!Character.isLetterOrDigit(pattern.charAt(i))
          && pattern.charAt(i) != '_') {
        return true;
      }
    }
    return false;
  }
0: jdbc:hive2://hiveserver-inc.> set hive.support.quoted.identifiers=none;
No rows affected (0.003 seconds)
0: jdbc:hive2://hiveserver-inc.> SELECT `(C3)?+.+`,`C1` * C2 FROM (SELECT 3 AS C1,2 AS C2,1 AS C3) T;
22/02/10 19:01:43 INFO ql.Driver: OK
+-------+-------+------+
| t.c1  | t.c2  | _c1  |
+-------+-------+------+
| 3     | 2     | 6    |
+-------+-------+------+
1 row selected (0.136 seconds)

In this pr, we add the isRegex method to check whether the pattern is a regex expression

Does this PR introduce any user-facing change?

NO

How was this patch tested?

UT

@github-actions github-actions bot added the SQL label Feb 10, 2022
@TongWei1105
Copy link
Contributor Author

@AngersZhuuuu

@TongWei1105 TongWei1105 changed the title [SPARK-38173][SQL][WIP] Quoted column cannot be recognized correctly when quotedRegexColumnNa… [SPARK-38173][SQL] Quoted column cannot be recognized correctly when quotedRegexColumnNa… Feb 10, 2022
@AngersZhuuuu
Copy link
Contributor

AngersZhuuuu commented Feb 11, 2022

please change the title and enable GA. Also you can paste hive code in the PR desc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can make it more like a scala code.
to use

pattern.find()
or 
pattern.forall()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@AngersZhuuuu AngersZhuuuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AngersZhuuuu
Copy link
Contributor

ping @cloud-fan

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@wugx
Copy link

wugx commented Feb 14, 2022

  private def canApplyRegex(ctx: ParserRuleContext): Boolean = withOrigin(ctx) {
    var parent = ctx.getParent
    if(!isRegex(ctx.getText)) return false

    while (parent != null) {
      if (parent.isInstanceOf[NamedExpressionContext]) return true
      parent = parent.getParent
    }
    return false
  }

The isRegex function only needs to be called once, If it is not regular, it can return false directly.

@TongWei1105
Copy link
Contributor Author

  private def canApplyRegex(ctx: ParserRuleContext): Boolean = withOrigin(ctx) {
    var parent = ctx.getParent
    if(!isRegex(ctx.getText)) return false

    while (parent != null) {
      if (parent.isInstanceOf[NamedExpressionContext]) return true
      parent = parent.getParent
    }
    return false
  }

The isRegex function only needs to be called once, If it is not regular, it can return false directly.

Thanks. Updated

@AngersZhuuuu
Copy link
Contributor

better use

if conf.supportQuotedRegexColumnName && canApplyRegex(ctx) && isRegex(columnNameRegex)

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 1ef5638 Feb 16, 2022
dongjoon-hyun pushed a commit that referenced this pull request Jan 10, 2023
…when quotedRegexColumnNames is true

### What changes were proposed in this pull request?
backporting #35476 to 3.2

### Why are the changes needed?
bug fixing in 3.2

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
new UT

Closes #39473 from huaxingao/3.2.

Authored-by: huaxingao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…when quotedRegexColumnNames is true

### What changes were proposed in this pull request?
backporting apache#35476 to 3.2

### Why are the changes needed?
bug fixing in 3.2

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
new UT

Closes apache#39473 from huaxingao/3.2.

Authored-by: huaxingao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants