Description
When I say "no matches" I mean matches that `processLexeme` cannot reproduce when it tries to identify the rule that brought the regex engine to a halt. This can happen for several reasons:
- rules that match 0 width strings
- rules with look-behinds/look-aheads that no longer match the smaller string
Possibly others. The second one may not be easy to fix (though I'm working on that in a separate PR), but I have a feeling there are more cases like the two examples below that COULD be fixed.
The two I noticed (the same rule in both files) are in fortran.js and irpf90.js:
```js
className: 'number',
begin: '(?=\\b|\\+|\\-|\\.)(?=\\.\\d|\\d)(?:\\d+)?(?:\\.?\\d*)(?:[de][+-]?\\d+)?\\b\\.?'
```

Everything in this pattern is either a look-ahead or optional, so it can actually match zero-width spans of text around digits (and does so plenty). The existing system thankfully filters these out, but there must be a cost (especially when doing auto-analysis) of firing this rule and then ignoring it over and over.
I'd recommend we consider changing these rules so it's impossible for them to match a zero-width piece of text (and, in general, discourage rules like that)... or maybe this is the only way to match such numbers? I dunno. If so, though, it seems the filter should perhaps be tweaked a bit and not labeled "Parser should not reach this point" if it's just normal daily operation to filter out rules that don't ALWAYS match content, but sometimes do.
Here is the current code that handles this (console logging is mine):
```js
/*
Parser should not reach this point as all types of lexemes should be caught
earlier, but if it does due to some bug make sure it advances at least one
character forward to prevent infinite looping.
*/
console.log("should never reach this point")
mode_buffer += lexeme;
return lexeme.length || 1;
```

Running the full test suite (all green) results in 1,911 occurrences of reaching the point we should never reach.
Removing both the number rules mentioned above still results in 1,693 occurrences.
Thoughts?