Request: Support regex look-ahead for begin and end matchers #1349
It's also possible to fix this just by automatically detecting look-aheads and doing the same thing (but without requiring any manual changes to files). My use case was regarding begin matches, but I think the same strategy could apply to end as well; not sure. Here is what my fixer-upper looks like:

```js
const lookaheadRegex = /(?<=\/|\\\\\\\\|[^\\][^\\])\(\?\=/

// Rewrite an unescaped look-ahead into a plain group and flag the rule
// so the lexer knows the match must be trimmed later.
function fixupLookAhead(regex, rule) {
  let str = regex
  if (typeof regex === "object") {
    str = regex.toString().slice(1, -1) // strip the surrounding slashes
  }
  if (str.match(lookaheadRegex)) {
    str = str.replace(lookaheadRegex, "(")
    rule.hasLookahead = true
  }
  return new RegExp(str)
}
```

Might get hard though to do all the proper escape detecting you'd need to do to be sure it was a lookahead... I THINK the above does that... but welcome to be proven wrong.
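For illustration, here's how that fixup might be applied to a hypothetical rule object (the rule shape and usage here are assumptions, not highlight.js internals):

```js
const rule = { begin: /\w+(?=\s*:)/ }
rule.begin = fixupLookAhead(rule.begin, rule)
console.log(rule.begin)        // /\w+(\s*:)/  (lookahead rewritten to a plain group)
console.log(rule.hasLookahead) // true
```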
@marcoscaceres Would a PR along these lines be acceptable for merging? IE, auto-detecting look-aheads as in the snippet above. This seems to be a very natural fit with how the parser is currently designed, and it would resolve 2-3 open issues here (all related to look-ahead matching). OR, more generally, could we just feed the lexer a bit of extra content past the match? That would be a much simpler fix (no rewriting complex regexes with even more complex regexes). It wouldn't solve ALL cases, but many; most of the cases I've seen involve very small lookaheads.
Seems quite nice:

```js
while (true) {
  top.terminators.lastIndex = index;
  match = top.terminators.exec(value);
  if (!match)
    break;
  // grab a bit more than we matched, for lookahead matching later
  lookahead = value.substring(match.index, match.index + match[0].length + extra_lookahead - 1)
  count = processLexeme(value.substring(index, match.index), lookahead, match[0].length);
  index = match.index + count;
}

// and in processLexeme:
var new_mode = subMode(lexeme, top);
lexeme = lexeme.substring(0, actual_length)
```

And of course extra_lookahead could be a tuneable knob (for those doing really crazy things)... though I'm not sure what the harm would be in making this larger. I set mine to 10 for testing; I'm only doing single-character look-aheads though.
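To make the idea concrete, here's a toy illustration (all names and values hypothetical) of why the extra window matters when a begin regex contains a lookahead:

```js
// The terminator matched "foo" for begin /foo(?=:)/ inside "foo:bar".
// Re-testing against the bare lexeme loses the lookahead context;
// a window that extends slightly past the match keeps it.
const begin = /^foo(?=:)/;
const lexeme = "foo";                         // what the match actually consumed
const window = "foo:bar".substring(0, 3 + 1); // match plus 1 char of lookahead
console.log(begin.test(lexeme)); // false: the ":" lies outside the lexeme
console.log(begin.test(window)); // true:  the window includes it
```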
The first solution is really just short of opening a portal to the realm of Cthulhu: there is always some escaping edge case that will slip past the detection. Looking ahead some fixed number of characters just seems faulty. Reasonable, but faulty. Tuning it is fine until someone really wants it to be unbounded.
At first glance the extra-lookahead approach seems practical and reasonable. Yes, in some cases it may be faulty, but at least that can be documented.
I believe my solution would work for all those cases, or are you merely pointing out something I left off?
Honestly it's not that bad; it only looks scary because of the escaping. One could probably write a less scary version that instead called an escapeRegex-style helper...
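For instance, a minimal sketch of such a detector (a hypothetical helper, not actual highlight.js code) that tracks escaping explicitly while scanning for an unescaped `(?=`:

```js
// Walk the pattern source, skipping escaped characters, and report
// whether an unescaped "(?=" appears anywhere.
function hasLookahead(source) {
  for (let i = 0; i < source.length; i++) {
    if (source[i] === "\\") { i++; continue; } // skip the escaped character
    if (source.startsWith("(?=", i)) return true;
  }
  return false;
}

console.log(hasLookahead("\\w+(?=\\s*:)")); // true
console.log(hasLookahead("\\(\\?="));       // false: the "(" is escaped
```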
Is there a third solution here I'm missing?
Do we still need @logicplace to sign off? They seem kind of against both approaches, at least that's the way I'm reading it. :(
It also wouldn't be that hard to rework it so it's simply unbounded... I was just trying to get consensus before I spent a LOT of time on this. So far locally I have BOTH of the above solutions working and they seem to work just fine.
Also, it seems a HUGE improvement to me to turn something that is entirely broken now into a tiny edge case that's only broken if, say, someone has a 50-character or larger lookahead (easy to document the edge case). Am I mistaken in this?
I'm only really against the first one. It's not about how scary it looks; there's an old article on the problems with people trying to parse recursive syntaxes (HTML in that case) with regex. There is always going to exist an edge case that will screw this up and do something bad. If you need me to find one as proof I can spend some time doing so.

The second one is not ideal to me, but acceptable. My only worry is that if someone tries to make it unbounded and then parses something large, it'll copy a silly amount of data many times. Hopefully they'd notice this, but probably not.

If you're adding extra parameters to tune it anyway, I also don't see what's unacceptable about my original solution, which solves all these issues efficiently in the first place. Did I miss some conversation about it? I'm only coming into here seeing "why don't we do it this way instead". It strikes me a bit funny, but I'll be happy to see the issue solved either way. It's not my project and you don't need my blessing; I'm just someone who also ran into this issue and wanted it solved.
I'm personally not a huge fan of inventing new attributes when the job can be done within the standard regexp syntax (although I'm guilty of inventing one such attribute myself). That's why I vote for Josh's second solution, which seems less error-prone to me.
Yes, completely. Something that "just works" as people would expect (since it's a built-in feature of regex) is WAY better than requiring a new special attribute that then must be learned/discovered.
Are there any performance regression tests lying around anywhere? If so I'd be happy to use them to test this.

Are JS engines stupid about strings? Does slicing do a full copy in RAM? From all my quick googling, the string handling seems very advanced/performant, so there might be no reason not to just slice the WHOLE remainder (then you could look ahead all the way to the end of the string). That's what Sam Stephenson does with his string scanner: https://github.com/sstephenson/strscan-js/blob/master/src/strscan.coffee. That's old code, but if it worked back then... though I'm not sure it was ever used on HUGE strings. Most of the complaints I found were about the OPPOSITE problem: you start with a huge string, then you slice it, and now JS keeps the original around forever (likely because the slices are views into the original string) when people might REALLY like it to be GCed. But that's not a problem for us.

Based on this, my vote would be to just use the remainder of the whole string, with no need for tuning at all. If that sounds reasonable I can make a PR.
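A quick, unscientific way to check that claim (assuming Node.js; V8 is known to back large substrings with views into the parent string rather than copies):

```js
// If substring copied its result, a million calls on a 50 MB string
// would take ages; with string views it finishes almost immediately.
const big = "x".repeat(50_000_000);
console.time("substring x 1e6");
let total = 0;
for (let i = 0; i < 1_000_000; i++) {
  total += big.substring(i, big.length).length; // length itself is O(1)
}
console.timeEnd("substring x 1e6");
console.log(total > 0); // use the result so the loop isn't optimized away
```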
I think it's reasonable to start with that. Then we can test for performance regressions (by comparing before/after speed) and decide whether we can stop there or whether it will require further improvements, tuning knobs, etc.
From what I'm reading it shouldn't matter, or should be inconsequential, so I'm not that worried. I'll whip something up, test it with my new theme, then push a PR for review/testing.
For reference: #2135
Ok, this is NOT as trivial as I thought, but I haven't given up yet. Lots of nuance and edge cases. My first attempt still passed all but 7-10 tests (which is something), but dialing this in will be fun.
I have a working version now with only 1 failing test, and that one is an actual issue with Highlight.js that needs to be clarified. It's still very rough and quite slow.

It drives all begin matches from the main content itself (as the original does), BUT it does NOT re-apply any of the begin rules: it simply keeps track of which rule matched in the first place and uses that to set the mode. So whatever a rule matches the first time is what counts, period, including any look-behind or look-ahead (the rule has access to the FULL content).

This makes it obvious we have a few broken rules (IMHO) that will match NOTHING repeatedly, so I have a small check for those: if a rule has fired both BEGIN and END without advancing, we just advance the cursor 1 step and try again. I'll open a separate issue on this point.

This currently isn't efficient because I got it working by running the regexes in sequence (vs in parallel). Tests take 10s vs 6s on my machine, so obviously the idea to make them run in parallel was the smart move. Pretty sure I have a way to have my cake and eat it too, though, so I'll circle back in a few days and see if I can get the speed back up to the original. There's even a fair chance it would be faster, since we're only running the regexes a single time, instead of every begin regex essentially running twice as it works now.
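A minimal, self-contained sketch of that "advance on zero progress" guard (the names and loop structure here are assumptions, not the actual patch):

```js
// Scan `value` with a global terminator regex; when a zero-width rule
// (e.g. a pure lookahead) fires without consuming input, nudge the
// cursor forward one character so the loop can't spin forever.
function scan(value, terminator) {
  let index = 0;
  const lexemes = [];
  while (index < value.length) {
    terminator.lastIndex = index;
    const match = terminator.exec(value);
    if (!match) break;
    if (match[0].length === 0) {
      index = match.index + 1; // no progress made: skip one character
      continue;
    }
    lexemes.push(match[0]);
    index = match.index + match[0].length;
  }
  return lexemes;
}

console.log(scan("a:b:c", /(?=:)/g)); // [] (terminates instead of looping forever)
```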
A more comprehensive parser is likely going to find a few actual bugs/glitches in the existing syntaxes as well. First one:
@yyyc514, I will look into these.
Regarding the rules that match nothing repeatedly (mentioned above): |
It works. And it's faster. :-) All tests pass. Now I'm looking into the complexity that is end mode matching. :-)
Cool!
Ok, I pushed a clean working PR. I'm going to label it WIP because I think I'd like to clean up some of the existing language hacks, but I think it's ready for review and discussion.
@logicplace Can you think of the types of situations you might use a lookahead on the END expression? |
Solves #1214 at least in the positive case. At the time, I'd thought negative lookaheads would work without such a change but maybe I was wrong?
As a note, I chose to do new tags instead because parsing the regex for lookaheads seemed more costly than useful, especially in edge cases.
Note: when specifying the contents of these tags, do not use lookahead syntax; it is added for you.
Example:

```js
{ begin: /\w+/, afterBegin: /\s*:/, end: /,/ }
```

As an example of the issue this solves:

```js
{ begin: /\w+(?=\s*:)/, end: /,/ }
```

This adds `\w+(?=\s*:)` to the terminator, but it then lexes just the `\w+` part because the lookahead is 0-width. `subMode` will therefore fail to find the begin tag it had matched originally (and may match something else inappropriately!).
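For illustration, a sketch of how the begin terminator might be assembled from the proposed tags; the builder function and mode shape are assumptions based on the "it adds it for you" behavior described above, not the actual PR code:

```js
// Combine `begin` with an optional `afterBegin` by wrapping the latter
// in a lookahead, so rule authors never write "(?=" themselves.
function buildBeginSource(mode) {
  let source = mode.begin.source;
  if (mode.afterBegin) {
    source += "(?=" + mode.afterBegin.source + ")";
  }
  return source;
}

const mode = { begin: /\w+/, afterBegin: /\s*:/, end: /,/ };
console.log(buildBeginSource(mode)); // "\w+(?=\s*:)"
```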