
Conversation

@logicplace
Contributor

Solves #1214, at least in the positive case. At the time, I'd thought negative lookaheads would work without such a change, but maybe I was wrong?

As a note, I chose to add new tags instead because parsing the regex for lookaheads seemed more costly than useful, especially in edge cases.

Note: when specifying the contents of these tags, do not use lookahead syntax; it is added for you.

Example:
    { begin: /\w+/, afterBegin: /\s*:/, end: /,/ }

As an example of the issue this solves:
    { begin: /\w+(?=\s*:)/, end: /,/ }

This adds \w+(?=\s*:) to the terminator, but it then lexes just the \w+ part because the lookahead is zero-width; subMode will therefore fail to find the begin rule it had matched originally (and may match something else inappropriately!).
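
Roughly, the idea is that compilation would do something like this (a sketch of the idea only; terminatorSource is a hypothetical name, not the PR's actual code):

    // the lookahead is appended automatically when building the
    // terminator, while subMode keeps matching against the plain
    // `begin`, so the zero-width part can no longer break re-matching
    function terminatorSource(mode) {
      let src = mode.begin.source
      if (mode.afterBegin) {
        src += "(?=" + mode.afterBegin.source + ")" // added for you
      }
      return src
    }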

@joshgoebel
Member

joshgoebel commented Sep 11, 2019

It's also possible to fix this just by automatically detecting look-aheads and doing this same thing (but without requiring any manual changes to files):

  • Rewrite the regex for the terminators to include the FULL match (begin never changes)
  • Set a flag that the rule includes a lookahead so we can reprocess it later
  • In processLexeme if the flag is set, recalculate lexeme

My use case was regarding beginning matches, but I think this same strategy could apply to end matches as well; not sure.

Here is what my fixer-upper looks like:

    // Matches "(?=" only when it actually opens a lookahead group: the
    // lookbehind rules out an escaped "\(" by requiring the "(" to follow
    // the leading "/", escaped backslashes, or two non-backslash characters.
    // NOTE: no /g flag, so only the first lookahead found is rewritten.
    const lookaheadRegex = /(?<=\/|\\\\\\\\|[^\\][^\\])\(\?\=/
    function fixupLookAhead(regex, rule) {
      let str = regex
      if (typeof regex === "object") {
        // a RegExp object: strip the surrounding slashes
        str = regex.toString().slice(1, -1)
      }
      if (str.match(lookaheadRegex)) {
        // turn the lookahead into a plain capture group and flag the
        // rule so the lexeme can be recalculated later
        str = str.replace(lookaheadRegex, "(")
        rule.hasLookahead = true
      }
      return new RegExp(str)
    }

It might get hard, though, to do all the proper escape detection you'd need to be sure it was a lookahead... I THINK the above does that... but I welcome being proven wrong.
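
For example, running it over the rule from the issue above (hypothetical usage; results shown in comments):

    // hypothetical usage of the helper above
    const rule = { begin: /\w+(?=\s*:)/ }
    rule.begin = fixupLookAhead(rule.begin, rule)
    // rule.begin is now /\w+(\s*:)/ and rule.hasLookahead === true,
    // so the terminator match will include the ":" part as group 1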

@joshgoebel
Member

joshgoebel commented Sep 11, 2019

@marcoscaceres Would a PR along these lines be acceptable for merging? i.e.:

  • Rewriting look-aheads so that the termination match loop INCLUDES the lookahead as part of the match
  • (processLexeme needs this EXTRA part of the match so that the rule that matches the WHOLE buffer will still match the small segment we're evaluating at the moment)
  • Inside processLexeme, recalculate lexeme [AFTER new_mode is calculated] (to reduce its scope) IF the expression that originally caused the match was a lookahead (see the sketch below)

This seems to be a very natural fit with how the parser is currently designed. Seems it would resolve 2-3 open issues here (all related to look-ahead matching).
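
Roughly, the recalculation in that last step could look like this (a sketch only, reusing the hypothetical hasLookahead flag from the helper above; it assumes the lookahead sat at the end of the pattern and became the first capture group):

    // after new_mode is calculated inside processLexeme: if the rule's
    // begin originally contained a lookahead, the rewritten regex now
    // captures that text as a trailing group, so trim it back off
    function trimLookahead(lexeme, mode) {
      if (!mode.hasLookahead) return lexeme
      const m = mode.beginRe.exec(lexeme)
      if (m && m[1]) {
        return lexeme.substring(0, lexeme.length - m[1].length)
      }
      return lexeme
    }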


OR, more generally, could we just feed processLexeme its own lookahead buffer (say 10-20 chars)... so it constantly has MORE data than it actually needs (to allow lookaheads), and after new_mode is calculated it just truncates lexeme, removing the extra lookahead chars.

That would be a much simpler fix (no rewriting complex regexes with more complex regexes). It wouldn't solve ALL cases, but many; most of the cases I've seen involve very small lookaheads.

@joshgoebel
Member

joshgoebel commented Sep 11, 2019

Seems quite nice:

    while (true) {
      top.terminators.lastIndex = index;
      match = top.terminators.exec(value);
      if (!match)
        break;
      // grab a bit more than the match itself so lookaheads still have
      // something to look at
      lookahead = value.substring(match.index, match.index + match[0].length + extra_lookahead - 1)
      count = processLexeme(value.substring(index, match.index), lookahead, match[0].length);
      index = match.index + count;
    }

    // and in processLexeme:
    var new_mode = subMode(lexeme, top);
    // once the new mode is known, shrink the lexeme back down to the
    // actual (non-lookahead) match length
    lexeme = lexeme.substring(0, actual_length)

And of course extra_lookahead could be a tunable knob (for those doing really crazy things)... though I'm not sure what the harm would be in making this larger... I set mine to 10 for testing; I'm only doing single-character look-aheads though.

@logicplace
Contributor Author

The first solution is really just short of opening a portal to the realm of Cthulhu. What about cases like (?=a)|(?=b) or, god forbid, cases like (?=a(?=b))? Parsing these out, especially with regex, is an unreasonable feat, which is why I didn't do it.

Looking ahead some fixed number of characters just seems faulty. Reasonable, but faulty. Tuning it is fine until someone really wants it to be unbounded.

@egor-rogov
Collaborator

At first glance the extra-lookahead approach seems practical and reasonable. Yes, in some cases it may be faulty, but at least that can be documented.
I'm willing to review and test such a PR.

@joshgoebel
Member

The first solution is really just short of opening a portal to the realm of Cthulhu. What about cases like (?=a)|(?=b) or, god forbid, cases like (?=a(?=b))? Parsing these out, especially with regex, is an unreasonable feat, which is why I didn't do it.

I believe my solution would work for all those cases, or are you merely pointing out that I left off the /g? I agree it gets a little crazy, but it is testable and deterministic. They would all merely be converted to simple matches. Perhaps there are edge cases, but the samples you provided aren't among them - unless I'm missing something.

opening a portal to the realm of Cthulhu

Honestly it's not that bad; it only looks scary because of the escaping. One could probably write a less scary version that instead called an escapeRegex function...
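
For reference, such a helper usually looks like this (a generic utility, not something from this codebase):

    // standard "escape a string for literal use inside a RegExp" utility
    function escapeRegex(s) {
      return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
    }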

Looking ahead some fixed number of characters just seems faulty. Reasonable, but faulty. Tuning it is fine until someone really wants it to be unbounded.

Is there a third solution here I'm missing?

I'm willing to review and test such a PR.

Do we still need @logicplace to sign off? They seem kind of against both approaches - at least that's the way I'm reading it. :(

@joshgoebel
Member

joshgoebel commented Sep 14, 2019

Tuning it is fine until someone really wants it to be unbounded.

It also wouldn't be that hard to rework it so it's simply unbounded... I was just trying to get consensus before I spent a LOT of time on this.

So far I have BOTH of the above solutions working locally and they seem to work just fine.

Looking ahead some fixed number of characters just seems faulty. Reasonable, but faulty. Tuning it is fine until someone really wants it to be unbounded.

Also, it seems a HUGE improvement to me to turn something that is entirely broken now into a tiny edge case that's only broken if, say, someone has a 50-character or larger lookahead (easy to document the edge case), etc... am I mistaken in this?

@logicplace
Contributor Author

I'm only really against the first one. It's not about how scary it looks; there's an old article on the problems with people trying to parse recursive syntaxes (HTML in that case) with regex. There is always going to be an edge case that will screw this up and do something bad. If you need me to find one as proof, I can spend some time doing so.

The second one is not ideal to me, but acceptable. My only worry is that if someone makes it unbounded and then parses something large, it'll copy a silly amount of data many times. Hopefully they'd notice this, but probably not.

If you're adding extra parameters to tune it anyway, I also don't see what's unacceptable about my original solution, which solves all these issues efficiently in the first place. Did I miss some conversation about it? Because I'm only coming in here seeing "why don't we do it this way instead". It strikes me as a bit funny, but I'll be happy to see the issue solved anyway. It's not my project and you don't need my blessing; I'm just someone who also ran into this issue and wanted it solved.

@egor-rogov
Collaborator

I also don't see what's unacceptable about my original solution, which solves all these issues efficiently in the first place.

I'm personally not a huge fan of inventing new attributes when the job can be done within the standard regexp syntax (although I'm guilty of inventing one such attribute myself).

That's why I vote for Josh's second solution, which seems less error-prone to me.

@joshgoebel
Member

I'm personally not a huge fan of inventing new attributes when the job can be done within the standard regexp syntax (although I'm guilty of inventing one such attribute myself).

Yes, completely. "Just works" as people would expect (since it's a built-in feature of regex) is WAY better than requiring a new special attribute (that then must be learned/discovered).

@joshgoebel
Member

joshgoebel commented Sep 20, 2019

The second one is not ideal to me, but acceptable. My only worry is that if someone makes it unbounded and then parses something large, it'll copy a silly amount of data many times. Hopefully they'd notice this, but probably not.

Are there any performance regression tests lying around anywhere? If so, I'd be happy to use them to test this... Are JS engines stupid about strings? Does slicing do a full copy in RAM?

From all my quick googling, the string handling seems very advanced/performant... there might be no reason not to just slice the WHOLE remainder (then you could look ahead all the way to the end of the string). That's what Sam Stephenson does with StringScanner:

https://github.com/sstephenson/strscan-js/blob/master/src/strscan.coffee

That's old code, but if it worked back then... though I'm not sure they ever used it on HUGE strings... most of the complaining I found was about the OPPOSITE problem... you start with a huge string, then you slice it, and now JS keeps the original around forever (likely because the slices are views into the original, etc.)... when people might REALLY like it to be GCed. But that's not a problem for us.
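
For anyone who wants to sanity-check the slicing cost locally, something as crude as this would do (assumes Node; numbers are obviously environment-dependent):

    // crude sanity check: repeatedly take large substrings of a big
    // string; V8 is generally smart here (longer substrings become
    // views onto the parent string rather than copies)
    const big = "x".repeat(10000000)
    console.time("substring")
    for (let i = 0; i < 100000; i++) {
      const tail = big.substring(i) // "remainder of the whole string"
      if (tail.length === 0) break
    }
    console.timeEnd("substring")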

Based on this, my vote would be to just use the remainder of the whole string, with no need for tuning at all. If that sounds reasonable I can make a PR.

@egor-rogov
Collaborator

Based on this, my vote would be to just use the remainder of the whole string, with no need for tuning at all. If that sounds reasonable I can make a PR.

I think it's reasonable to start with that. Then we can test for performance regression (by comparing before/after speed) and decide whether we can stop there or whether it will require further improvements, tuning knobs, etc.

@joshgoebel
Member

Then we can test for performance regression (by comparing before/after speed) and decide whether we can stop there or whether it will require further improvements, tuning knobs, etc.

From what I'm reading, it shouldn't matter or should be inconsequential (so I'm not that worried). I'll whip something up and test it with my new theme, then push a PR for review/testing.

@joshgoebel
Member

For reference: #2135

@joshgoebel
Member

Ok, this is NOT as trivial as I thought, but I haven't given up yet. Lots of nuance and edge cases. My first attempt did still pass all but like 7-10 tests (which is something), but dialing this down will be fun.

@joshgoebel
Member

joshgoebel commented Sep 25, 2019

I have a working version now with only 1 failing test:

#2138

This is an actual issue with Highlight.js that needs to be clarified though.


It's still very rough and quite slow. It drives all beginning matches from the main content itself (as the original does)... BUT it does NOT re-apply any of the begin rules - it simply keeps track of which rule matched in the first place and uses that to set the mode. So whatever a rule matches the first time is what counts, period, including any look-behind or look-ahead (the rule has access to the FULL content).

This makes it obvious we have a few broken rules (IMHO) that will match NOTHING repeatedly, so I have a small check for those: if a rule has fired both BEGIN/END without advancing, we just advance the cursor 1 step and try again. I'll open a separate issue on this point.
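
The check is essentially this (hypothetical names, not the exact code):

    // a rule that fires both begin and end without consuming input
    // would loop forever, so nudge the cursor forward and try again
    if (cursor === cursorBeforeRule) {
      cursor += 1
      continue
    }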

This currently isn't efficient because I got it working by running the regexes in sequence (vs parallel). Tests take 10s vs 6s on my machine, so obviously the idea to make them run in parallel was the smart move.

Pretty sure I have a way to have my cake and eat it too though, so I'll circle back in a few days and see if I can get the speed back up to the original. There is even a fair chance it would be faster, since we're only running the regexes a single time - instead of every begin regex essentially running twice as it works now.
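
For the curious, "running them in parallel" here means compiling one big alternation and remembering which branch fired. A rough sketch (assuming the individual begin regexes contain no capture groups of their own; names are hypothetical):

    // one capture group per rule, so the index of the group that
    // matched identifies the rule directly - no re-applying each
    // begin regex afterwards
    function compileMatcher(rules) {
      const source = rules.map(r => "(" + r.begin.source + ")").join("|")
      return new RegExp(source, "g")
    }

    const matcher = compileMatcher(rules)
    const m = matcher.exec(value)
    if (m) {
      // the first defined capture group tells us which rule matched
      const which = m.slice(1).findIndex(g => g !== undefined)
      const rule = rules[which] // use `rule` to set the new mode
    }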

@joshgoebel
Member

joshgoebel commented Sep 25, 2019

A more comprehensive parser is likely going to find a few actual bugs/glitches in the syntaxes also. First one:

#2139

@egor-rogov
Collaborator

@yyyc514, I will look into these.

@joshgoebel
Member

Regarding the rules that match nothing repeatedly (mentioned above):

#2140

@joshgoebel joshgoebel changed the title Adds afterBegin and afterEnd tags to perform end-of-element lookaheads correctly Request: Support regex look-ahead for begin and end matchers Oct 7, 2019
@joshgoebel joshgoebel self-assigned this Oct 7, 2019
@joshgoebel joshgoebel added the enhancement An enhancement or new feature label Oct 7, 2019
@joshgoebel
Member

It works. And it's faster. :-) All tests pass. Now I'm looking into the complexity that is end-mode matching. :-)

@egor-rogov
Collaborator

Cool!

@joshgoebel
Member

Ok, I pushed a clean working PR. I'm going to label it WIP because I think I'd like to clean up some of the existing language hacks, but I think it's ready for review and discussion.

@joshgoebel
Member

Closing as we don’t plan to merge this (for reasons mentioned above).

Ongoing discussion can continue at #1696 and #2135. My PR is working quite well now, for anyone wanting to test.

@joshgoebel
Member

@logicplace Can you think of the types of situations you might use a lookahead on the END expression?
