[SPARK-49902][SQL] Catch underlying runtime errors in RegExpReplace
#48379
@cloud-fan Can you look at this PR? Thanks!
sql(s"CREATE TABLE IF NOT EXISTS $tableName(s STRING)")
sql(s"INSERT INTO $tableName VALUES('first last')")
val query = s"SELECT regexp_replace(s, '(?<first>[a-zA-Z]+) (?<last>[a-zA-Z]+)', " +
  s"'$$3 $$1') FROM $tableName"
I don't know. Do any other databases support regexp_replace(s, '(?<first>[a-zA-Z]+) (?<last>[a-zA-Z]+)', '$$3 $$1')?
I guess the designer of RegExpReplace didn't anticipate this feature (in fact, it is a bug here). We should forbid it or consider a better approach.
Personally, I think checking whether the rep expression contains $1, $2, ... looks easier.
@beliefer The designer of RegExpReplace may very well have thought of this case, and in my opinion it works exactly as intended. It's just that sometimes these group indexes are out of bounds, and in that case we simply surface the library error. This PR aims to handle these exceptions better.
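For reference, the out-of-bounds case discussed in this thread can be reproduced outside Spark. The following is a minimal sketch that calls java.util.regex directly with the pattern and input from the test snippet above. The object name NoGroupDemo and the helper replace are hypothetical names used only for illustration, and this is not how RegExpReplace itself is implemented.

```scala
import java.util.regex.Pattern
import scala.util.Try

// Hypothetical demo object; it only exercises java.util.regex, not Spark.
object NoGroupDemo {
  // Pattern taken from the test snippet in this thread: two named groups.
  val pattern: Pattern = Pattern.compile("(?<first>[a-zA-Z]+) (?<last>[a-zA-Z]+)")

  // Applies a replacement string through the JDK regex engine and
  // captures any exception instead of letting it propagate.
  def replace(source: String, replacement: String): Try[String] =
    Try(pattern.matcher(source).replaceAll(replacement))

  def main(args: Array[String]): Unit = {
    // Both references are valid, so the two words are swapped.
    println(replace("first last", "$2 $1"))
    // The pattern has only two groups, so $3 is out of bounds and the
    // JDK throws java.lang.IndexOutOfBoundsException: No group 3.
    println(replace("first last", "$3 $1"))
  }
}
```

The second call fails only when a match is found and the replacement is expanded, which is why the error shows up at runtime on specific rows rather than when the query is analyzed.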
@harshmotw-db Could you re-trigger only the failed GitHub action, please.
+1, LGTM. Merging to master.


What changes were proposed in this pull request?
Previously, runtime errors thrown by underlying libraries were not caught in the RegExpReplace expression and were surfaced directly to the user. For example, it was not uncommon to see errors like java.lang.IndexOutOfBoundsException: No group 3. This PR catches these underlying errors and throws a SparkException instead, which details the input on which the expression failed. The new SparkException looks something like:
org.apache.spark.SparkException: Could not perform regexp_replace for source = <source>, pattern = <pattern>, replacement = <replacement> and position = <position>.
Why are the changes needed?
Two reasons. First, the new exception reports the input row on which the error occurred, which makes it easier for users to debug the query and for Spark developers to identify bugs. Second, a SparkException is generally treated as expected behavior, indicating that query execution did not hit an unintended internal issue.
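The catch-and-rethrow idea described in this PR can be sketched in a few lines. This is an illustration under stated assumptions, not the code in regexpExpressions.scala: RegexpReplaceFailure is a hypothetical stand-in for SparkException so the example runs without Spark, SafeRegexpReplace and regexpReplace are made-up names, and the handling of position (1-based, with text before it left untouched) is a simplification.

```scala
import java.util.regex.Pattern
import scala.util.control.NonFatal

// Hypothetical stand-in for SparkException so the sketch runs without Spark.
final class RegexpReplaceFailure(message: String, cause: Throwable)
  extends RuntimeException(message, cause)

// Hypothetical helper; an assumption-laden sketch of the wrapping idea only.
object SafeRegexpReplace {
  def regexpReplace(
      source: String,
      pattern: String,
      replacement: String,
      position: Int): String = {
    try {
      // Assumed semantics: position is 1-based and matching starts there.
      val start = position - 1
      if (start >= source.length) {
        source
      } else {
        val matcher = Pattern.compile(pattern).matcher(source)
        matcher.region(start, source.length)
        val out = new StringBuffer
        while (matcher.find()) {
          // May throw IndexOutOfBoundsException for a reference such as $3.
          matcher.appendReplacement(out, replacement)
        }
        matcher.appendTail(out)
        out.toString
      }
    } catch {
      // Wrap the library error with the inputs that triggered it and
      // keep the original exception as the cause.
      case NonFatal(e) =>
        throw new RegexpReplaceFailure(
          s"Could not perform regexp_replace for source = $source, " +
            s"pattern = $pattern, replacement = $replacement and " +
            s"position = $position.",
          e)
    }
  }
}
```

Keeping the original exception as the cause preserves the library's own message (for example, No group 3) while the outer message identifies the failing input.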
Does this PR introduce any user-facing change?
Yes, a better exception is thrown when RegExpReplace fails.
How was this patch tested?
Unit tests in both codegen and interpreted modes.
Was this patch authored or co-authored using generative AI tooling?
No.