Skip to content

Conversation

@harshmotw-db
Copy link
Contributor

What changes were proposed in this pull request?

Earlier, runtime errors in underlying libraries were not caught during runtime in the RegExpReplace expression. The underlying errors were thrown directly to the user. For example, it wouldn't be uncommon to see issues like java.lang.IndexOutOfBoundsException: No group 3. This PR introduces a change to catch these underlying issues and throw a SparkException instead which details the input on which the exception failed. The new Spark Exception looks something like org.apache.spark.SparkException: Could not perform regexp_replace for source = <source>, pattern = <pattern>, replacement = <replacement> and position = <position>.

Why are the changes needed?

Two reasons. First, the new exception details which row the given error occurred on, which makes it easier for the user to debug the query or Spark developers to identify bugs. Second, a Spark Exception is generally considered expected behavior indicating that there were no unintended issues in the query's execution.

Does this PR introduce any user-facing change?

Yes, a better exception is thrown when RegExpReplace fails.

How was this patch tested?

Unit test in both codegen as well as interpreted mode.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot removed the CONNECT label Oct 8, 2024
@harshmotw-db harshmotw-db marked this pull request as ready for review October 8, 2024 02:07
@harshmotw-db
Copy link
Contributor Author

@cloud-fan Can you look at this PR? Thanks!

@harshmotw-db harshmotw-db requested a review from MaxGekk October 10, 2024 22:39
sql(s"CREATE TABLE IF NOT EXISTS $tableName(s STRING)")
sql(s"INSERT INTO $tableName VALUES('first last')")
val query = s"SELECT regexp_replace(s, '(?<first>[a-zA-Z]+) (?<last>[a-zA-Z]+)', " +
s"'$$3 $$1') FROM $tableName"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. Is there any other databases support regexp_replace(s, '(?<first>[a-zA-Z]+) (?<last>[a-zA-Z]+)', '$$3 $$1') ?

Copy link
Contributor Author

@harshmotw-db harshmotw-db Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that MySQL behaves similar to SparkSQL.
Valid Query (notice $2 instead of $3):
Screenshot 2024-10-16 at 3 18 57 PM

Invalid Query (notice $3):
Screenshot 2024-10-16 at 3 19 25 PM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the designer of RegExpReplace didn't realize the feature(In fact, a bug here). We should forbid it or consider a better way.

Copy link
Contributor

@beliefer beliefer Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I think check the rep expression if contains $1, $2 ... looks easier.

Copy link
Contributor Author

@harshmotw-db harshmotw-db Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@beliefer The designer of RegExpReplace very well may have thought of this case and in my opinion it works exactly as intended. It's just that sometimes these indexes are out of bounds and we just throw the library error in this case. This PR aims to handle these exceptions better.

@harshmotw-db harshmotw-db requested a review from MaxGekk October 18, 2024 06:00
@harshmotw-db harshmotw-db requested a review from beliefer October 22, 2024 04:38
@harshmotw-db
Copy link
Contributor Author

@beliefer @MaxGekk I don't believe the linters fail has anything to do with this PR. The error says:

python/pyspark/sql/types.py:3276: error: Module has no attribute "isdtype"  [attr-defined]

and the PR doesn't make any changes in types.py.

@MaxGekk
Copy link
Member

MaxGekk commented Oct 22, 2024

@harshmotw-db Could you re-trigger only the failed GitHub action, please.

@MaxGekk MaxGekk changed the title [SPARK-49902][SQL] Catch underlying runtime errors in RegExpReplace [SPARK-49902][SQL] Catch underlying runtime errors in RegExpReplace Oct 23, 2024
@MaxGekk
Copy link
Member

MaxGekk commented Oct 23, 2024

+1, LGTM. Merging to master.
Thank you, @harshmotw-db and @beliefer for review.

@MaxGekk MaxGekk closed this in 35d2ef9 Oct 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants