Skip to content

Conversation

@nik9000
Copy link
Member

@nik9000 nik9000 commented Jan 21, 2021

This adds two new "flavored regex" constructs in addition to the
/foo/ style java regexes we have now. Now you can do g/foo/ to get a
"grok flavored" regular expression and d/foo/ to ge ta dissect
flavored regular expresion. These return a FlavoredPattern interface
which is intentionally very very simple so we can abstract the common
bits of grok and dissect and because, quite frankly, we're not sure what
else we should add to it yet.

Closes #67825

This adds two new "flavored regex" constructs in addition to the
`/foo/` style java regexes we have now. Now you can do `g/foo/` to get a
"grok flavored" regular expression and `d/foo/` to ge ta dissect
flavored regular expresion. These return a `FlavoredPattern` interface
which is intentionally very very simple so we can abstract the common
bits of grok and dissect and because, quite frankly, we're not sure what
else we should add to it yet.
@nik9000
Copy link
Member Author

nik9000 commented Jan 21, 2021

I'm creating this as draft for a little while so I can review it in github and leave questions for reviewers.

Copy link
Member Author

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another open question that I didn't have a place for inline: do we actually want the syntax to be g/pattern/ and d/pattern/ or to be something else. g might be confused with perl's global syntax. Perl is awesome and terrible and where everyone stole the /pattern/ syntax from in the first place. It's good to look to your forebearers, I guess.

One thing that is kind of debateable and kind of not - are these really regular expressions at all? I mean, grok certainly is a regular expression dialect, descended from joni, descended from onigurama. Dissect is a pattern matching language, certainly, but it doesn't have any of the syntax you'd expect from regular expressions. In the end I figure it just simplifies our life to think of the all as "flavors of regular expressions".


if (name == null || (name.isEmpty() && !skip)) {
throw new DissectException.KeyParse(key, "The key name could be determined");
throw new DissectException.KeyParse(key, "The key name could not be determined");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bumped into a few small things in disect - error messages and figured I'd clean them up while I was looking at them. I can certainly break them into a separate PR if it'd make life easier.

api 'org.ow2.asm:asm-analysis:7.2'
api 'org.ow2.asm:asm:7.2'
api project(':libs:elasticsearch-grok')
api project(':libs:elasticsearch-dissect')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First big question: do we want this in "core painless" or do we want to make an extension point? Grok and dissect are libraries, but grok has a few dependencies.

Second question that impacts on the first: do we want to limit these regex flavors to certain contexts? Like just runtime fields. I think it'd be complex to explain that to folks though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nik9000 I gave this specific issue a lot more thought over the weekend, and it's definitely my greatest concern. I think when we try to do the separation of core-painless from the plugin, this will be really hard to separate if it's part of the grammar. (Though, I guess we already have a way to turn on/off regex, so maybe this just needs a similar way to do that?). I do wonder if we should instead consider making grok/dissect instance bindings and have them called as static methods. This would allow the grok instance (singleton for Painless) to have both the watchdog and a cache independent of core-painless, and then they become dependent on whitelisting as opposed to grammar changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a grok method makes sense. While these are similar to regexes, they are also distinctly different. I would prefer we not add these dependencies directly to painless. Instead, as Jack suggested, we can make them available through our normal extension mechanisms. But we don't need grok in eg score scripts, so it's not something that needs to be available to all contexts, and we should continue to strive to keep core painless unencumbered.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we don't need grok in eg score scripts, so it's not something that needs to be available to all contexts, and we should continue to strive to keep core painless unencumbered.

I'm not sure that is true. If we're comfortable exposing grok for runtime fields then I think they'd end up in all contexts anyway, if just transitively through runtime fields.

I certainly understand wanting to keep painless modular. Y'all prefer grok and dissect to be methods that compile the pattern instance binding style? That'd work, and it'd keep all of the watchdog stuff out of core painless.

Copy link
Contributor

@jdconrad jdconrad Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not overly concerned with the contexts simply because someone could use grok in a runtime field that could then be used as part of a score script. I get that it's hard to remove things once they're part of the context whitelist, but since it can be used indirectly anyway I don't have a strong desire to keep it out of other contexts.

Copy link
Member

@rjernst rjernst Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a fair point that through runtime fields it can probably effectively be used everywhere. I do think the modular argument is strong through; there is nothing about grok that implies to me it needs to be part of the language itself. It can work just as well as a method, which would keep painless free of additional external deps.


STRING: ( '"' ( '\\"' | '\\\\' | ~[\\"] )*? '"' ) | ( '\'' ( '\\\'' | '\\\\' | ~[\\'] )*? '\'' );
REGEX: '/' ( '\\' ~'\n' | ~('/' | '\n') )+? '/' [cilmsUux]* { isSlashRegex() }?;
REGEX: [dg]? '/' ( '\\' ~'\n' | ~('/' | '\n') )+? '/' [cilmsUux]* { isSlashRegex() }?;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we made the regex flavor pluggable then we're replace [dg]? with [a-z]? or something like that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would propose a more descriptive name if possible e.g. grok and dissect rather than g and d. Thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with single letters because the regex flags are single letter and it "felt similar". I'd kind of prefer single letters over longer things, but I'm not super attached either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do stick with grammar changes, I prefer the single letters as well. Something like grok/.../ seems quite awkward.

/**
* Suppliers the watchdog that prevents grok from running forever.
*/
private final Supplier<MatcherWatchdog> grokWatchdog;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This watchdog is what prevents grok from "running forever". It basically calls Grok#interrupt when grok's have run too long. Grok itself periodically checks its own Grok#interrupt flag and Thread.currentThread.interrupted. It'd be better for us if we could register a callback that it checks. But it doesn't support that. So we need this watchdog thing. The watchdog is a shared component that tracks groks and interrupts them if they'd run too long. It seems heavy, especially for painless. And the plumbing is unpleasant. But without modifying joni we're stuck with it. And modifying joni doesn't seem like a small project.

Another way to do it might be to plumb a listener for task cancellation into Task and have it call any running Groks. The more I type this the more I'm thinking modifying joni is the right thing.....

return grokPatternBank;
}

public void addToGrokPatternBank(String name, String pattern) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grok allows you to add things to the pattern bank and I started plumbing that through using the scripts options. But it turns out that we have an assertion that options contains some very specific stuff off in Script so I never finished it. I've left this here so we can talk about it. Grok really does want to let folks write a pattern bank and options seems like an OK way to do it, its just a change to how we'd been using options. So I left this in so we could discuss it.

assertEquals(1, exec("int d = 1; return d/1;"));
assertEquals(1, exec("int g = 1; return g/1;"));
assertEquals(1, exec("int j = 1; return j/1;"));
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to be extra paranoid here that the compiler doesn't see g/ as the key for a grok regex when it is really division. It doesn't, but let's add an explicit test to make sure it never does.

);
throw errorLocation.createError(
new IllegalArgumentException(
"Could not compile java regex constant [" + pattern + "] with flags [" + flags + "]: " + pse.getDescription(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I was debugging stuff I noticed that I wasn't getting good error message from bad java regexes. It turns out that pse.getDescription() has the actual cause of the syntax error in it and it isn't included in toString! I've added this because I wanted it when I was debugging stuff but it isn't really related to the rest of the change.


// ml deps
api project(':libs:elasticsearch-grok')
compileOnly project(':libs:elasticsearch-grok') // Extending painless gets us the grok runtime classes
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unfortunate.

g/^%{IP:clientip} - -/.map(doc['message.keyword'].value).clientip
- match: { hits.total.value: 1 }
- match: { hits.hits.0.fields.clientip.0: 40.135.0.0 }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't specify the pattern bank over http right now because we don't allow extra options. Presumably we could build a special syntax for the pattern bank if we wanted, but that seems like something worth doing in a follow up rather than here.

@@ -0,0 +1,73 @@
---
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm adding these to runtime fields just to prove that everything is plugged in. I expect the scripts themselves to get simpler and simpler over time. But you can use grok for runtime fields after this PR!

@nik9000 nik9000 marked this pull request as ready for review January 21, 2021 15:25
@nik9000 nik9000 added the :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache label Jan 21, 2021
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Jan 21, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@nik9000 nik9000 added v7.12.0 v8.0.0 and removed Team:Core/Infra Meta label for core/infra team labels Jan 21, 2021
@nik9000 nik9000 requested review from javanna, jdconrad, rjernst, romseygeek and stu-elastic and removed request for rjernst and stu-elastic January 21, 2021 15:25
@nik9000
Copy link
Member Author

nik9000 commented Jan 21, 2021

I expect some failures in ml that I have yet to debug. I don't think they'll force us to change the design, but they might. And we might get their fix for free if we decide that we want these to be outside of core painless.

@rjernst
Copy link
Member

rjernst commented Jan 25, 2021

I left a comment, but wanted to leave a more general explanation of my thoughts here. I'm happy to see this PR is relatively simple, great work! However, I do think we should keep grok and painless separated. There are some aspects of grok that it is unclear how to expose here (custom patterns), so we should stay flexible for how the user can add that in the future. In the case of grok, we have discussed before the idea with ingest of having a GrokService that handle's compilation of the patterns and registration of custom patterns. With that idea in mind, much of the development over 7.x in painless was geared toward allowing methods backed not only by static methods, but also even internal services. Maintaining Painless is difficult just with the existing feature set, so we should continue to endeavor to keep it as small as possible to minimize that maintenance burden, and separate concerns for user facing features vs fundamental concepts in the language.

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jan 27, 2021
This adds a `grok` and a `dissect` method to runtime fields which
returns a `Matcher` style object you can use to get the matched
patterns. A fairly simple script to extract the "verb" from an apache
log line with `grok` would look like this:
```
String verb = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}
```

And `dissect` would look like:
```
String verb = dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}
```

We'll work later to get it down to a clean looking one liner, but for
now, this'll do.

The `grok` and `dissect` methods are special in that they only run at
script compile time. You can't pass non-constants to them. They'll
produce compile errors if you send in a bad pattern. This is nice
because they can be expensive to "compile" and there are many other
optimizations we can make when the patterns are available up front.

Closes elastic#67825
@nik9000
Copy link
Member Author

nik9000 commented Jan 27, 2021

Replaced by #68088.

nik9000 added a commit that referenced this pull request Feb 1, 2021
This adds a `grok` and a `dissect` method to runtime fields which
returns a `Matcher` style object you can use to get the matched
patterns. A fairly simple script to extract the "verb" from an apache
log line with `grok` would look like this:
```
String verb = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}
```

And `dissect` would look like:
```
String verb = dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}
```

We'll work later to get it down to a clean looking one liner, but for
now, this'll do.

The `grok` and `dissect` methods are special in that they only run at
script compile time. You can't pass non-constants to them. They'll
produce compile errors if you send in a bad pattern. This is nice
because they can be expensive to "compile" and there are many other
optimizations we can make when the patterns are available up front.

Closes #67825
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Feb 1, 2021
…8088)

This adds a `grok` and a `dissect` method to runtime fields which
returns a `Matcher` style object you can use to get the matched
patterns. A fairly simple script to extract the "verb" from an apache
log line with `grok` would look like this:
```
String verb = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}
```

And `dissect` would look like:
```
String verb = dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}
```

We'll work later to get it down to a clean looking one liner, but for
now, this'll do.

The `grok` and `dissect` methods are special in that they only run at
script compile time. You can't pass non-constants to them. They'll
produce compile errors if you send in a bad pattern. This is nice
because they can be expensive to "compile" and there are many other
optimizations we can make when the patterns are available up front.

Closes elastic#67825
nik9000 added a commit that referenced this pull request Feb 1, 2021
…68332)

This adds a `grok` and a `dissect` method to runtime fields which
returns a `Matcher` style object you can use to get the matched
patterns. A fairly simple script to extract the "verb" from an apache
log line with `grok` would look like this:
```
String verb = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}
```

And `dissect` would look like:
```
String verb = dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}
```

We'll work later to get it down to a clean looking one liner, but for
now, this'll do.

The `grok` and `dissect` methods are special in that they only run at
script compile time. You can't pass non-constants to them. They'll
produce compile errors if you send in a bad pattern. This is nice
because they can be expensive to "compile" and there are many other
optimizations we can make when the patterns are available up front.

Closes #67825
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Core/Infra/Scripting Scripting abstractions, Painless, and Mustache >enhancement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants