Some existing rules are unrepeatable and should be corrected #2140

@joshgoebel

By "unrepeatable" (or "no matches") I mean a match that processLexeme cannot reproduce when it tries to identify the rule that brought the regex engine to a halt. This could happen for several reasons:

  • rules that match 0 width strings
  • rules with look-behinds/look-aheads that no longer match the smaller string processLexeme has access to

There may be others. The second one may not be easy to fix (though I'm working on that in a separate PR), but I have a feeling there are more rules like the two examples below that COULD be fixed.
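To make the look-ahead case concrete, here is a minimal sketch. It assumes the lexeme is re-tested on its own, without the surrounding text, as described above. The helper `matchesLexemeAlone` and the `foo(?=bar)` rule are hypothetical and are not part of highlight.js.

```javascript
// Hypothetical helper: find the first match of `rule` in `text`, then
// re-test the rule against the matched lexeme alone (no surrounding context).
// Returns null when the rule does not match `text` at all.
function matchesLexemeAlone(rule, text) {
  // Copy the regex without the g/y flags so lastIndex state cannot interfere.
  const re = new RegExp(rule.source, rule.flags.replace(/[gy]/g, ""));
  const m = re.exec(text);
  if (m === null) return null;
  return re.test(m[0]);
}

// The look-ahead needs "bar", which is not part of the lexeme "foo".
console.log(matchesLexemeAlone(/foo(?=bar)/, "foobar")); // false
// Without the look-ahead the lexeme matches again on its own.
console.log(matchesLexemeAlone(/foo/, "foobar")); // true
```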


The two I noticed (same rule) are in fortran.js and irpf90.js:

      className: 'number',
      (?=\\b|\\+|\\-|\\.)(?=\\.\\d|\\d)(?:\\d+)?(?:\\.?\\d*)(?:[de][+-]?\\d+)?\\b\\.?

Every part of this pattern is either a look-ahead or optional, so it can actually match 0-width positions in the text around digits (and does so plenty). Thankfully the existing system filters these out, but there must be a cost (esp. when doing auto-analysis) to firing this rule and then ignoring it over and over.
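A quick way to see the 0-width behaviour is to run the pattern on its own. The sketch below writes the quoted pattern as a regex literal (with the string escaping removed) and leaves out any flags the grammar may apply. `firstMatch` is a hypothetical helper, and the inputs are just illustrative strings where a digit is followed directly by a word character.

```javascript
// The `number` pattern quoted above, as a regex literal.
const NUMBER_RE = /(?=\b|\+|\-|\.)(?=\.\d|\d)(?:\d+)?(?:\.?\d*)(?:[de][+-]?\d+)?\b\.?/;

// Hypothetical helper: text of the first match in `input`, or null if none.
function firstMatch(input) {
  const m = NUMBER_RE.exec(input);
  return m === null ? null : m[0];
}

// A plain integer is consumed as expected.
console.log(JSON.stringify(firstMatch("42")));  // "42"
// Here the trailing \b fails after the digit, so the engine backtracks
// until every optional group is empty and reports a 0-width match.
console.log(JSON.stringify(firstMatch("1x")));  // ""
console.log(JSON.stringify(firstMatch("1_8"))); // ""
```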

I'd recommend we consider changing these rules so it's impossible for them to match a 0-width piece of text, and in general discourage rules like that. Or maybe this is the only way to match such numbers? I dunno. If so, though, it seems the filter should be tweaked a bit and not labeled "Parser should not reach this point", since filtering out rules that only sometimes match content would then be just normal daily operations.


Here is the current code that handles this (console logging is mine):

      /*
      Parser should not reach this point as all types of lexemes should be caught
      earlier, but if it does due to some bug make sure it advances at least one
      character forward to prevent infinite looping.
      */
      console.log("should never reach this point")
      mode_buffer += lexeme;
      return lexeme.length || 1;
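As a rough illustration of why the `|| 1` fallback matters, the sketch below is a simplified stand-in for a scanning loop, not highlight.js's actual implementation. The `scan` function and its return shape are assumptions made for this example. It counts how often a pattern produces a 0-width match and forces progress by one character each time, mirroring `lexeme.length || 1`.

```javascript
// Hypothetical, simplified scanner: repeatedly applies `pattern` to `text`.
function scan(pattern, text) {
  const re = new RegExp(pattern.source, "g");
  const tokens = [];
  let zeroWidthHits = 0;
  let index = 0;
  while (index < text.length) {
    re.lastIndex = index;
    const m = re.exec(text);
    if (m === null) break;
    if (m[0].length === 0) {
      // Nothing was consumed: count it and step one character forward,
      // otherwise the loop would match the same position forever.
      zeroWidthHits += 1;
      index = m.index + 1;
    } else {
      tokens.push(m[0]);
      index = m.index + m[0].length;
    }
  }
  return { tokens, zeroWidthHits };
}

// \d* can match the empty string, so two of the three matches are 0-width.
console.log(scan(/\d*/, "a1b")); // { tokens: [ '1' ], zeroWidthHits: 2 }
```

A counter like `zeroWidthHits` is one way to measure how often a given rule fires without consuming anything.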

Running the full test suite (all green) results in 1,911 occurrences of reaching the point we should never reach.

Removing both the number rules mentioned above still results in 1,693 occurrences.

Thoughts?

Labels: help welcome (Could use help from community)
