Request: Support regex look-ahead for begin and end matchers #1349
It's also possible to fix this just by automatically detecting look-aheads and doing the same thing (but without requiring any manual changes to files). My use case was regarding begin matches, but I think the same strategy could apply to end as well; not sure. Here is what my fixer-upper looks like:

```js
const lookaheadRegex = /(?<=\/|\\\\\\\\|[^\\][^\\])\(\?\=/

// Rewrite an unescaped look-ahead into a plain group and flag the rule
// so the lexer knows the match must be trimmed later.
function fixupLookAhead(regex, rule) {
  let str = regex
  if (typeof regex === "object") {
    str = regex.toString().slice(1, -1) // strip the surrounding slashes
  }
  if (str.match(lookaheadRegex)) {
    str = str.replace(lookaheadRegex, "(")
    rule.hasLookahead = true
  }
  return new RegExp(str)
}
```

Might get hard though to do all the proper escape detecting you'd need to do to be sure it was a lookahead... I THINK the above does that... but welcome to be proven wrong.
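For illustration, here's how that fixup might be applied to a hypothetical rule object (the rule shape and usage here are assumptions, not highlight.js internals):

```js
const rule = { begin: /\w+(?=\s*:)/ }
rule.begin = fixupLookAhead(rule.begin, rule)
console.log(rule.begin)        // /\w+(\s*:)/  (lookahead rewritten to a plain group)
console.log(rule.hasLookahead) // true
```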
@marcoscaceres Would a PR along these lines be acceptable for merging? IE, auto-detecting look-aheads as in the snippet above. This seems to be a very natural fit with how the parser is currently designed, and it would resolve 2-3 open issues here (all related to look-ahead matching). OR, more generally, could we just feed the lexer a bit of extra content past the match? That would be a much simpler fix (no rewriting complex regexes with even more complex regexes). It wouldn't solve ALL cases, but many; most of the cases I've seen involve very small lookaheads.
Seems quite nice:

```js
while (true) {
  top.terminators.lastIndex = index;
  match = top.terminators.exec(value);
  if (!match)
    break;
  // grab a bit more than we matched, for lookahead matching later
  lookahead = value.substring(match.index, match.index + match[0].length + extra_lookahead - 1)
  count = processLexeme(value.substring(index, match.index), lookahead, match[0].length);
  index = match.index + count;
}

// and in processLexeme:
var new_mode = subMode(lexeme, top);
lexeme = lexeme.substring(0, actual_length)
```

And of course extra_lookahead could be a tuneable knob (for those doing really crazy things)... though I'm not sure what the harm would be in making this larger. I set mine to 10 for testing; I'm only doing single-character look-aheads though.
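To make the idea concrete, here's a toy illustration (all names and values hypothetical) of why the extra window matters when a begin regex contains a lookahead:

```js
// The terminator matched "foo" for begin /foo(?=:)/ inside "foo:bar".
// Re-testing against the bare lexeme loses the lookahead context;
// a window that extends slightly past the match keeps it.
const begin = /^foo(?=:)/;
const lexeme = "foo";                         // what the match actually consumed
const window = "foo:bar".substring(0, 3 + 1); // match plus 1 char of lookahead
console.log(begin.test(lexeme)); // false: the ":" lies outside the lexeme
console.log(begin.test(window)); // true:  the window includes it
```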
The first solution is really just short of opening a portal to the realm of Cthulhu: there is always some escaping edge case that will slip past the detection. Looking ahead some fixed number of characters just seems faulty. Reasonable, but faulty. Tuning it is fine until someone really wants it to be unbounded.
At first glance the extra-lookahead approach seems practical and reasonable. Yes, in some cases it may be faulty, but at least that can be documented.
I believe my solution would work for all those cases, or are you merely pointing out something I left off?
Honestly it's not that bad; it only looks scary because of the escaping. One could probably write a less scary version that instead called an escapeRegex-style helper...
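For instance, a minimal sketch of such a detector (a hypothetical helper, not actual highlight.js code) that tracks escaping explicitly while scanning for an unescaped `(?=`:

```js
// Walk the pattern source, skipping escaped characters, and report
// whether an unescaped "(?=" appears anywhere.
function hasLookahead(source) {
  for (let i = 0; i < source.length; i++) {
    if (source[i] === "\\") { i++; continue; } // skip the escaped character
    if (source.startsWith("(?=", i)) return true;
  }
  return false;
}

console.log(hasLookahead("\\w+(?=\\s*:)")); // true
console.log(hasLookahead("\\(\\?="));       // false: the "(" is escaped
```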
Is there a third solution here I'm missing?
Do we still need @logicplace to sign off? They seem kind of against both approaches, at least that's the way I'm reading it. :(
It also wouldn't be that hard to rework it so it's simply unbounded... I was just trying to get consensus before I spent a LOT of time on this. So far locally I have BOTH of the above solutions working and they seem to work just fine.
Also, it seems a HUGE improvement to me to turn something that is entirely broken now into a tiny edge case that's only broken if, say, someone has a 50-character or larger lookahead (easy to document the edge case). Am I mistaken in this?
I'm only really against the first one. It's not about how scary it looks; there's an old article on the problems with people trying to parse recursive syntaxes (HTML in that case) with regex. There is always going to exist an edge case that will screw this up and do something bad. If you need me to find one as proof I can spend some time doing so.

The second one is not ideal to me, but acceptable. My only worry is that if someone tries to make it unbounded and then parses something large, it'll copy a silly amount of data many times. Hopefully they'd notice this, but probably not.

If you're adding extra parameters to tune it anyway, I also don't see what's unacceptable about my original solution, which solves all these issues efficiently in the first place. Did I miss some conversation about it? I'm only coming into here seeing "why don't we do it this way instead". It strikes me a bit funny, but I'll be happy to see the issue solved either way. It's not my project and you don't need my blessing; I'm just someone who also ran into this issue and wanted it solved.
I'm personally not a huge fan of inventing new attributes when the job can be done within the standard regexp syntax (although I'm guilty of inventing one such attribute myself). That's why I vote for Josh's second solution, which seems less error-prone to me.
Yes, completely. Something that "just works" as people would expect (since it's a built-in feature of regex) is WAY better than requiring a new special attribute that then must be learned/discovered.
Are there any performance regression tests lying around anywhere? If so I'd be happy to use them to test this.

Are JS engines stupid about strings? Does slicing do a full copy in RAM? From all my quick googling, the string handling seems very advanced/performant, so there might be no reason not to just slice the WHOLE remainder (then you could look ahead all the way to the end of the string). That's what Sam Stephenson does with his string scanner: https://github.com/sstephenson/strscan-js/blob/master/src/strscan.coffee. That's old code, but if it worked back then... though I'm not sure it was ever used on HUGE strings. Most of the complaints I found were about the OPPOSITE problem: you start with a huge string, then you slice it, and now JS keeps the original around forever (likely because the slices are views into the original string) when people might REALLY like it to be GCed. But that's not a problem for us.

Based on this, my vote would be to just use the remainder of the whole string, with no need for tuning at all. If that sounds reasonable I can make a PR.
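A quick, unscientific way to check that claim (assuming Node.js; V8 is known to back large substrings with views into the parent string rather than copies):

```js
// If substring copied its result, a million calls on a 50 MB string
// would take ages; with string views it finishes almost immediately.
const big = "x".repeat(50_000_000);
console.time("substring x 1e6");
let total = 0;
for (let i = 0; i < 1_000_000; i++) {
  total += big.substring(i, big.length).length; // length itself is O(1)
}
console.timeEnd("substring x 1e6");
console.log(total > 0); // use the result so the loop isn't optimized away
```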
I think it's reasonable to start with that. Then we can test for performance regressions (by comparing before/after speed) and decide whether we can stop there or whether it will require further improvements, tuning knobs, etc.
From what I'm reading it shouldn't matter, or should be inconsequential, so I'm not that worried. I'll whip something up, test it with my new theme, then push a PR for review/testing.
For reference: #2135
Ok, this is NOT as trivial as I thought, but I haven't given up yet. Lots of nuance and edge cases. My first attempt still passed all but 7-10 tests (which is something), but dialing this in will be fun.
I have a working version now with only 1 failing test, and that one is an actual issue with Highlight.js that needs to be clarified. It's still very rough and quite slow.

It drives all begin matches from the main content itself (as the original does), BUT it does NOT re-apply any of the begin rules: it simply keeps track of which rule matched in the first place and uses that to set the mode. So whatever a rule matches the first time is what counts, period, including any look-behind or look-ahead (the rule has access to the FULL content).

This makes it obvious we have a few broken rules (IMHO) that will match NOTHING repeatedly, so I have a small check for those: if a rule has fired both BEGIN and END without advancing, we just advance the cursor 1 step and try again. I'll open a separate issue on this point.

This currently isn't efficient because I got it working by running the regexes in sequence (vs in parallel). Tests take 10s vs 6s on my machine, so obviously the idea to make them run in parallel was the smart move. Pretty sure I have a way to have my cake and eat it too, though, so I'll circle back in a few days and see if I can get the speed back up to the original. There's even a fair chance it would be faster, since we're only running the regexes a single time, instead of every begin regex essentially running twice as it works now.
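A minimal, self-contained sketch of that "advance on zero progress" guard (the names and loop structure here are assumptions, not the actual patch):

```js
// Scan `value` with a global terminator regex; when a zero-width rule
// (e.g. a pure lookahead) fires without consuming input, nudge the
// cursor forward one character so the loop can't spin forever.
function scan(value, terminator) {
  let index = 0;
  const lexemes = [];
  while (index < value.length) {
    terminator.lastIndex = index;
    const match = terminator.exec(value);
    if (!match) break;
    if (match[0].length === 0) {
      index = match.index + 1; // no progress made: skip one character
      continue;
    }
    lexemes.push(match[0]);
    index = match.index + match[0].length;
  }
  return lexemes;
}

console.log(scan("a:b:c", /(?=:)/g)); // [] (terminates instead of looping forever)
```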
A more comprehensive parser is likely going to find a few actual bugs/glitches in the existing syntaxes as well. First one:
@yyyc514, I will look into these.
Regarding the rules that match nothing repeatedly (mentioned above): |
It works. And it's faster. :-) All tests pass. Now I'm looking into the complexity that is end mode matching. :-)
Cool!
Ok, I pushed a clean working PR. I'm going to label it WIP because I think I'd like to clean up some of the existing language hacks, but I think it's ready for review and discussion.
@logicplace Can you think of the types of situations you might use a lookahead on the END expression? |
Solves #1214 at least in the positive case. At the time, I'd thought negative lookaheads would work without such a change but maybe I was wrong?
As a note, I chose to do new tags instead because parsing the regex for lookaheads seemed more costly than useful, especially in edge cases.
Note: when specifying the contents of these tags, do not use lookahead syntax; it is added for you.
Example:

```js
{ begin: /\w+/, afterBegin: /\s*:/, end: /,/ }
```

As an example of the issue this solves:

```js
{ begin: /\w+(?=\s*:)/, end: /,/ }
```

This adds `\w+(?=\s*:)` to the terminator, but it then lexes just the `\w+` part because the lookahead is 0-width. `subMode` will therefore fail to find the begin tag it had matched originally (and may match something else inappropriately!).
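For illustration, a sketch of how the begin terminator might be assembled from the proposed tags; the builder function and mode shape are assumptions based on the "it adds it for you" behavior described above, not the actual PR code:

```js
// Combine `begin` with an optional `afterBegin` by wrapping the latter
// in a lookahead, so rule authors never write "(?=" themselves.
function buildBeginSource(mode) {
  let source = mode.begin.source;
  if (mode.afterBegin) {
    source += "(?=" + mode.afterBegin.source + ")";
  }
  return source;
}

const mode = { begin: /\w+/, afterBegin: /\s*:/, end: /,/ };
console.log(buildBeginSource(mode)); // "\w+(?=\s*:)"
```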