-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Fix painless's regex lexer and error messages #23634
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Without this change, if write a script with multiple regexes *sometimes* the lexer will decide to look at them like one big regex and then some trailing garbage. Like this discuss post: https://discuss.elastic.co/t/error-with-the-split-function-in-painless-script/79021 ``` def val = /\\\\/.split(ctx._source.event_data.param17); if (val[2] =~ /\\./) { def val2 = /\\./.split(val[2]); ctx._source['user_crash'] = val2[0] } else { ctx._source['user_crash'] = val[2] } ``` The error message you get from the lexer is `lexer_no_viable_alt_exception` right after the *second* regex. With this change each regex is just a single regex like it ought to be. As a bonus, while looking into this issue I found that the error reporting for regexes wasn't very nice. If you specify an invalid pattern then you get an error marker on the start of the pattern with the JVM's regex error message which attempts to point you to the location in the regex but is totally unreadable in the JSON response. This change fixes the location to point to the appropriate spot inside the pattern and removes the portion of the JVM's error message that doesn't render well. It is no longer needed now that we point users to the appropriate spot in the pattern.
|
|
||
| STRING: ( '"' ( '\\"' | '\\\\' | ~[\\"] )*? '"' ) | ( '\'' ( '\\\'' | '\\\\' | ~[\\'] )*? '\'' ); | ||
| REGEX: '/' ( ~('/' | '\n') | '\\' ~'\n' )+ '/' [cilmsUux]* { slashIsRegex() }?; | ||
| REGEX: '/' ( '\\' ~'\n' | ~('/' | '\n') )+? '/' [cilmsUux]* { slashIsRegex() }?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This non-greedy pattern works well for strings so I figured I should use it for regexes as well. I had to reorder the escape characters to the front so they wouldn't be ignored because order is important in non-greedy rules.
| } | ||
|
|
||
| public void testRegexIsNonGreedy() { | ||
| assertEquals(true, exec("def s = /\\\\/.split('.\\\\.'); return s[1] ==~ /\\./")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This blew up mightily without the lexer change.
| public void testSlashesEscapePattern() { | ||
| assertEquals(true, exec("return '//' ==~ /\\/\\//")); | ||
| public void testBackslashEscapesForwardSlash() { | ||
| assertEquals(true, exec("'//' ==~ /\\/\\//")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one blows up if you don't reorder the lexer rules after you switch it to non-greedy.
| } catch (PatternSyntaxException exception) { | ||
| throw createError(exception); | ||
| } catch (PatternSyntaxException e) { | ||
| throw new Location(location.getSourceName(), location.getOffset() + 1 + e.getIndex()).createError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this while debugging this issue and it bothered me more than it should have. I'm quite happy PatternSyntaxException seems designed for you to extract this information.
The 1 is for the leading slash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not possible to fix this value where the Location object is originally generated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need the offset from the exception to get the location right. The +1 bit is to get the leading /. What kind of fix were you thinking about? An override of createError maybe?
jdconrad
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for fixing this. Left one minor comment.
* master: Docs: Fix language on a few snippets Painless: Fix regex lexer and error messages (elastic#23634) Skip 5.4 bwc test for new name for now Count through the primary in list of strings test Skip testing new name if it isn't known Wait for all shards in list of strings test Deprecate request_cache for clear-cache (elastic#23638)
Without this change, if write a script with multiple regexes *sometimes* the lexer will decide to look at them like one big regex and then some trailing garbage. Like this discuss post: https://discuss.elastic.co/t/error-with-the-split-function-in-painless-script/79021 ``` def val = /\\\\/.split(ctx._source.event_data.param17); if (val[2] =~ /\\./) { def val2 = /\\./.split(val[2]); ctx._source['user_crash'] = val2[0] } else { ctx._source['user_crash'] = val[2] } ``` The error message you get from the lexer is `lexer_no_viable_alt_exception` right after the *second* regex. With this change each regex is just a single regex like it ought to be. As a bonus, while looking into this issue I found that the error reporting for regexes wasn't very nice. If you specify an invalid pattern then you get an error marker on the start of the pattern with the JVM's regex error message which attempts to point you to the location in the regex but is totally unreadable in the JSON response. This change fixes the location to point to the appropriate spot inside the pattern and removes the portion of the JVM's error message that doesn't render well. It is no longer needed now that we point users to the appropriate spot in the pattern.
Without this change, if write a script with multiple regexes
sometimes the lexer will decide to look at them like one
big regex and then some trailing garbage. Like this discuss post:
https://discuss.elastic.co/t/error-with-the-split-function-in-painless-script/79021
The error message you get from the lexer is
lexer_no_viable_alt_exceptionright after the second regex.
With this change each regex is just a single regex like it ought to be.
As a bonus, while looking into this issue I found that the error
reporting for regexes wasn't very nice. If you specify an invalid
pattern then you get an error marker on the start of the pattern
with the JVM's regex error message which attempts to point you to the
location in the regex but is totally unreadable in the JSON response.
This change fixes the location to point to the appropriate spot
inside the pattern and removes the portion of the JVM's error message
that doesn't render well. It is no longer needed now that we point
users to the appropriate spot in the pattern.