-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Expose dissect and grok to painless #67825
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This adds two new "flavored regex" constructs in addition to the `/foo/` style java regexes we have now. Now you can do `g/foo/` to get a "grok flavored" regular expression and `d/foo/` to ge ta dissect flavored regular expresion. These return a `FlavoredPattern` interface which is intentionally very very simple so we can abstract the common bits of grok and dissect and because, quite frankly, we're not sure what else we should add to it yet.
|
I'm creating this as draft for a little while so I can review it in github and leave questions for reviewers. |
nik9000
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another open question that I didn't have a place for inline: do we actually want the syntax to be g/pattern/ and d/pattern/ or to be something else. g might be confused with perl's global syntax. Perl is awesome and terrible and where everyone stole the /pattern/ syntax from in the first place. It's good to look to your forebearers, I guess.
One thing that is kind of debateable and kind of not - are these really regular expressions at all? I mean, grok certainly is a regular expression dialect, descended from joni, descended from onigurama. Dissect is a pattern matching language, certainly, but it doesn't have any of the syntax you'd expect from regular expressions. In the end I figure it just simplifies our life to think of the all as "flavors of regular expressions".
|
|
||
| if (name == null || (name.isEmpty() && !skip)) { | ||
| throw new DissectException.KeyParse(key, "The key name could be determined"); | ||
| throw new DissectException.KeyParse(key, "The key name could not be determined"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I bumped into a few small things in disect - error messages and figured I'd clean them up while I was looking at them. I can certainly break them into a separate PR if it'd make life easier.
| api 'org.ow2.asm:asm-analysis:7.2' | ||
| api 'org.ow2.asm:asm:7.2' | ||
| api project(':libs:elasticsearch-grok') | ||
| api project(':libs:elasticsearch-dissect') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First big question: do we want this in "core painless" or do we want to make an extension point? Grok and dissect are libraries, but grok has a few dependencies.
Second question that impacts on the first: do we want to limit these regex flavors to certain contexts? Like just runtime fields. I think it'd be complex to explain that to folks though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nik9000 I gave this specific issue a lot more thought over the weekend, and it's definitely my greatest concern. I think when we try to do the separation of core-painless from the plugin, this will be really hard to separate if it's part of the grammar. (Though, I guess we already have a way to turn on/off regex, so maybe this just needs a similar way to do that?). I do wonder if we should instead consider making grok/dissect instance bindings and have them called as static methods. This would allow the grok instance (singleton for Painless) to have both the watchdog and a cache independent of core-painless, and then they become dependent on whitelisting as opposed to grammar changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a grok method makes sense. While these are similar to regexes, they are also distinctly different. I would prefer we not add these dependencies directly to painless. Instead, as Jack suggested, we can make them available through our normal extension mechanisms. But we don't need grok in eg score scripts, so it's not something that needs to be available to all contexts, and we should continue to strive to keep core painless unencumbered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we don't need grok in eg score scripts, so it's not something that needs to be available to all contexts, and we should continue to strive to keep core painless unencumbered.
I'm not sure that is true. If we're comfortable exposing grok for runtime fields then I think they'd end up in all contexts anyway, if just transitively through runtime fields.
I certainly understand wanting to keep painless modular. Y'all prefer grok and dissect to be methods that compile the pattern instance binding style? That'd work, and it'd keep all of the watchdog stuff out of core painless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not overly concerned with the contexts simply because someone could use grok in a runtime field that could then be used as part of a score script. I get that it's hard to remove things once they're part of the context whitelist, but since it can be used indirectly anyway I don't have a strong desire to keep it out of other contexts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a fair point that through runtime fields it can probably effectively be used everywhere. I do think the modular argument is strong through; there is nothing about grok that implies to me it needs to be part of the language itself. It can work just as well as a method, which would keep painless free of additional external deps.
|
|
||
| STRING: ( '"' ( '\\"' | '\\\\' | ~[\\"] )*? '"' ) | ( '\'' ( '\\\'' | '\\\\' | ~[\\'] )*? '\'' ); | ||
| REGEX: '/' ( '\\' ~'\n' | ~('/' | '\n') )+? '/' [cilmsUux]* { isSlashRegex() }?; | ||
| REGEX: [dg]? '/' ( '\\' ~'\n' | ~('/' | '\n') )+? '/' [cilmsUux]* { isSlashRegex() }?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we made the regex flavor pluggable then we're replace [dg]? with [a-z]? or something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would propose a more descriptive name if possible e.g. grok and dissect rather than g and d. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with single letters because the regex flags are single letter and it "felt similar". I'd kind of prefer single letters over longer things, but I'm not super attached either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do stick with grammar changes, I prefer the single letters as well. Something like grok/.../ seems quite awkward.
| /** | ||
| * Suppliers the watchdog that prevents grok from running forever. | ||
| */ | ||
| private final Supplier<MatcherWatchdog> grokWatchdog; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This watchdog is what prevents grok from "running forever". It basically calls Grok#interrupt when grok's have run too long. Grok itself periodically checks its own Grok#interrupt flag and Thread.currentThread.interrupted. It'd be better for us if we could register a callback that it checks. But it doesn't support that. So we need this watchdog thing. The watchdog is a shared component that tracks groks and interrupts them if they'd run too long. It seems heavy, especially for painless. And the plumbing is unpleasant. But without modifying joni we're stuck with it. And modifying joni doesn't seem like a small project.
Another way to do it might be to plumb a listener for task cancellation into Task and have it call any running Groks. The more I type this the more I'm thinking modifying joni is the right thing.....
| return grokPatternBank; | ||
| } | ||
|
|
||
| public void addToGrokPatternBank(String name, String pattern) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grok allows you to add things to the pattern bank and I started plumbing that through using the scripts options. But it turns out that we have an assertion that options contains some very specific stuff off in Script so I never finished it. I've left this here so we can talk about it. Grok really does want to let folks write a pattern bank and options seems like an OK way to do it, its just a change to how we'd been using options. So I left this in so we could discuss it.
| assertEquals(1, exec("int d = 1; return d/1;")); | ||
| assertEquals(1, exec("int g = 1; return g/1;")); | ||
| assertEquals(1, exec("int j = 1; return j/1;")); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to be extra paranoid here that the compiler doesn't see g/ as the key for a grok regex when it is really division. It doesn't, but let's add an explicit test to make sure it never does.
| ); | ||
| throw errorLocation.createError( | ||
| new IllegalArgumentException( | ||
| "Could not compile java regex constant [" + pattern + "] with flags [" + flags + "]: " + pse.getDescription(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I was debugging stuff I noticed that I wasn't getting good error message from bad java regexes. It turns out that pse.getDescription() has the actual cause of the syntax error in it and it isn't included in toString! I've added this because I wanted it when I was debugging stuff but it isn't really related to the rest of the change.
|
|
||
| // ml deps | ||
| api project(':libs:elasticsearch-grok') | ||
| compileOnly project(':libs:elasticsearch-grok') // Extending painless gets us the grok runtime classes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unfortunate.
| g/^%{IP:clientip} - -/.map(doc['message.keyword'].value).clientip | ||
| - match: { hits.total.value: 1 } | ||
| - match: { hits.hits.0.fields.clientip.0: 40.135.0.0 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't specify the pattern bank over http right now because we don't allow extra options. Presumably we could build a special syntax for the pattern bank if we wanted, but that seems like something worth doing in a follow up rather than here.
| @@ -0,0 +1,73 @@ | |||
| --- | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm adding these to runtime fields just to prove that everything is plugged in. I expect the scripts themselves to get simpler and simpler over time. But you can use grok for runtime fields after this PR!
|
Pinging @elastic/es-core-infra (Team:Core/Infra) |
|
I expect some failures in ml that I have yet to debug. I don't think they'll force us to change the design, but they might. And we might get their fix for free if we decide that we want these to be outside of core painless. |
|
I left a comment, but wanted to leave a more general explanation of my thoughts here. I'm happy to see this PR is relatively simple, great work! However, I do think we should keep grok and painless separated. There are some aspects of grok that it is unclear how to expose here (custom patterns), so we should stay flexible for how the user can add that in the future. In the case of grok, we have discussed before the idea with ingest of having a GrokService that handle's compilation of the patterns and registration of custom patterns. With that idea in mind, much of the development over 7.x in painless was geared toward allowing methods backed not only by static methods, but also even internal services. Maintaining Painless is difficult just with the existing feature set, so we should continue to endeavor to keep it as small as possible to minimize that maintenance burden, and separate concerns for user facing features vs fundamental concepts in the language. |
This adds a `grok` and a `dissect` method to runtime fields which
returns a `Matcher` style object you can use to get the matched
patterns. A fairly simple script to extract the "verb" from an apache
log line with `grok` would look like this:
```
String verb = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.verb;
if (verb != null) {
emit(verb);
}
```
And `dissect` would look like:
```
String verb = dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.verb;
if (verb != null) {
emit(verb);
}
```
We'll work later to get it down to a clean looking one liner, but for
now, this'll do.
The `grok` and `dissect` methods are special in that they only run at
script compile time. You can't pass non-constants to them. They'll
produce compile errors if you send in a bad pattern. This is nice
because they can be expensive to "compile" and there are many other
optimizations we can make when the patterns are available up front.
Closes elastic#67825
|
Replaced by #68088. |
This adds a `grok` and a `dissect` method to runtime fields which
returns a `Matcher` style object you can use to get the matched
patterns. A fairly simple script to extract the "verb" from an apache
log line with `grok` would look like this:
```
String verb = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.verb;
if (verb != null) {
emit(verb);
}
```
And `dissect` would look like:
```
String verb = dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.verb;
if (verb != null) {
emit(verb);
}
```
We'll work later to get it down to a clean looking one liner, but for
now, this'll do.
The `grok` and `dissect` methods are special in that they only run at
script compile time. You can't pass non-constants to them. They'll
produce compile errors if you send in a bad pattern. This is nice
because they can be expensive to "compile" and there are many other
optimizations we can make when the patterns are available up front.
Closes #67825
…8088) This adds a `grok` and a `dissect` method to runtime fields which returns a `Matcher` style object you can use to get the matched patterns. A fairly simple script to extract the "verb" from an apache log line with `grok` would look like this: ``` String verb = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.verb; if (verb != null) { emit(verb); } ``` And `dissect` would look like: ``` String verb = dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.verb; if (verb != null) { emit(verb); } ``` We'll work later to get it down to a clean looking one liner, but for now, this'll do. The `grok` and `dissect` methods are special in that they only run at script compile time. You can't pass non-constants to them. They'll produce compile errors if you send in a bad pattern. This is nice because they can be expensive to "compile" and there are many other optimizations we can make when the patterns are available up front. Closes elastic#67825
…68332) This adds a `grok` and a `dissect` method to runtime fields which returns a `Matcher` style object you can use to get the matched patterns. A fairly simple script to extract the "verb" from an apache log line with `grok` would look like this: ``` String verb = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.verb; if (verb != null) { emit(verb); } ``` And `dissect` would look like: ``` String verb = dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.verb; if (verb != null) { emit(verb); } ``` We'll work later to get it down to a clean looking one liner, but for now, this'll do. The `grok` and `dissect` methods are special in that they only run at script compile time. You can't pass non-constants to them. They'll produce compile errors if you send in a bad pattern. This is nice because they can be expensive to "compile" and there are many other optimizations we can make when the patterns are available up front. Closes #67825
This adds two new "flavored regex" constructs in addition to the
/foo/style java regexes we have now. Now you can dog/foo/to get a"grok flavored" regular expression and
d/foo/to ge ta dissectflavored regular expresion. These return a
FlavoredPatterninterfacewhich is intentionally very very simple so we can abstract the common
bits of grok and dissect and because, quite frankly, we're not sure what
else we should add to it yet.
Closes #67825