Expose dissect and grok to painless #67825

nik9000 · 2021-01-21T13:19:42Z

This adds two new "flavored regex" constructs in addition to the
`/foo/` style Java regexes we have now. Now you can do `g/foo/` to get a
"grok flavored" regular expression and `d/foo/` to get a dissect
flavored regular expression. These return a `FlavoredPattern` interface
which is intentionally very simple so we can abstract the common
bits of grok and dissect and because, quite frankly, we're not sure what
else we should add to it yet.

Closes #67825
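
To make the "intentionally very simple" interface concrete, here is a minimal sketch of what such an abstraction could look like. Only the name `FlavoredPattern` and the `map(...)` call (seen in the YAML test in this PR) come from the change itself; the return type, the `NamedGroupPattern` class, and the use of `java.util.regex` named groups as a stand-in for grok or dissect are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical shape: the PR only says the interface is deliberately small.
interface FlavoredPattern {
    // Returns the captured fields, or null when the input does not match.
    Map<String, String> map(String input);
}

// Stand-in "flavor" built on java.util.regex named groups, NOT on grok or dissect.
final class NamedGroupPattern implements FlavoredPattern {
    private final Pattern pattern;
    private final List<String> groupNames;

    NamedGroupPattern(String regex, List<String> groupNames) {
        this.pattern = Pattern.compile(regex);
        this.groupNames = List.copyOf(groupNames);
    }

    @Override
    public Map<String, String> map(String input) {
        Matcher m = pattern.matcher(input);
        if (!m.find()) {
            return null;
        }
        // Copy each named group into an ordered map of field name to value.
        Map<String, String> fields = new LinkedHashMap<>();
        for (String name : groupNames) {
            fields.put(name, m.group(name));
        }
        return fields;
    }
}
```

A grok or dissect flavor would plug in behind the same single method, which is roughly the point of keeping the interface this small.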


nik9000 · 2021-01-21T13:20:11Z

I'm creating this as a draft for a little while so I can review it in GitHub and leave questions for reviewers.

nik9000

Another open question that I didn't have a place for inline: do we actually want the syntax to be g/pattern/ and d/pattern/, or something else? g might be confused with Perl's global syntax. Perl is awesome and terrible and where everyone stole the /pattern/ syntax from in the first place. It's good to look to your forebears, I guess.

One thing that is kind of debatable and kind of not: are these really regular expressions at all? I mean, grok certainly is a regular expression dialect, descended from joni, descended from Oniguruma. Dissect is a pattern matching language, certainly, but it doesn't have any of the syntax you'd expect from regular expressions. In the end I figure it just simplifies our life to think of them all as "flavors of regular expressions".

nik9000 · 2021-01-21T13:21:28Z

libs/dissect/src/main/java/org/elasticsearch/dissect/DissectKey.java


        if (name == null || (name.isEmpty() && !skip)) {
-            throw new DissectException.KeyParse(key, "The key name could be determined");
+            throw new DissectException.KeyParse(key, "The key name could not be determined");


I bumped into a few small things in dissect, mostly error messages, and figured I'd clean them up while I was looking at them. I can certainly break them into a separate PR if it'd make life easier.

nik9000 · 2021-01-21T13:24:18Z

modules/lang-painless/build.gradle

  api 'org.ow2.asm:asm-analysis:7.2'
  api 'org.ow2.asm:asm:7.2'
+  api project(':libs:elasticsearch-grok')
+  api project(':libs:elasticsearch-dissect')


First big question: do we want this in "core painless" or do we want to make an extension point? Grok and dissect are libraries, but grok has a few dependencies.

Second question that impacts on the first: do we want to limit these regex flavors to certain contexts? Like just runtime fields. I think it'd be complex to explain that to folks though.

@nik9000 I gave this specific issue a lot more thought over the weekend, and it's definitely my greatest concern. I think when we try to do the separation of core-painless from the plugin, this will be really hard to separate if it's part of the grammar. (Though, I guess we already have a way to turn on/off regex, so maybe this just needs a similar way to do that?). I do wonder if we should instead consider making grok/dissect instance bindings and have them called as static methods. This would allow the grok instance (singleton for Painless) to have both the watchdog and a cache independent of core-painless, and then they become dependent on whitelisting as opposed to grammar changes.

I think a grok method makes sense. While these are similar to regexes, they are also distinctly different. I would prefer we not add these dependencies directly to painless. Instead, as Jack suggested, we can make them available through our normal extension mechanisms. But we don't need grok in eg score scripts, so it's not something that needs to be available to all contexts, and we should continue to strive to keep core painless unencumbered.

> But we don't need grok in eg score scripts, so it's not something that needs to be available to all contexts, and we should continue to strive to keep core painless unencumbered.

I'm not sure that is true. If we're comfortable exposing grok for runtime fields then I think they'd end up in all contexts anyway, if just transitively through runtime fields.

I certainly understand wanting to keep painless modular. Y'all prefer grok and dissect to be methods that compile the pattern instance binding style? That'd work, and it'd keep all of the watchdog stuff out of core painless.

I'm not overly concerned with the contexts simply because someone could use grok in a runtime field that could then be used as part of a score script. I get that it's hard to remove things once they're part of the context whitelist, but since it can be used indirectly anyway I don't have a strong desire to keep it out of other contexts.

It's a fair point that through runtime fields it can probably effectively be used everywhere. I do think the modular argument is strong though; there is nothing about grok that implies to me it needs to be part of the language itself. It can work just as well as a method, which would keep painless free of additional external deps.
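
As a rough illustration of the "method backed by a singleton" idea discussed in this thread, the sketch below shows a service that owns its own cache, so callers only ever see a method call. Everything here is hypothetical: `PatternService` and its methods are invented for the example, and `java.util.regex.Pattern` stands in for a compiled grok pattern. It is not how Painless instance bindings are actually wired up.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical singleton-style service; java.util.regex stands in for grok.
final class PatternService {
    private final Map<String, Pattern> cache = new ConcurrentHashMap<>();

    // Compiles each distinct pattern source once and reuses the instance afterwards.
    Pattern compile(String source) {
        return cache.computeIfAbsent(source, Pattern::compile);
    }

    // Number of distinct patterns compiled so far.
    int cachedPatterns() {
        return cache.size();
    }
}
```

Because the cache (and, in the real proposal, the watchdog) lives in the service, the language grammar would not need to know about it; exposure would be controlled by whitelisting the method.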

nik9000 · 2021-01-21T13:24:58Z

modules/lang-painless/src/main/antlr/PainlessLexer.g4


 STRING: ( '"' ( '\\"' | '\\\\' | ~[\\"] )*? '"' ) | ( '\'' ( '\\\'' | '\\\\' | ~[\\'] )*? '\'' );
-REGEX: '/' ( '\\' ~'\n' | ~('/' | '\n') )+? '/' [cilmsUux]* { isSlashRegex() }?;
+REGEX: [dg]? '/' ( '\\' ~'\n' | ~('/' | '\n') )+? '/' [cilmsUux]* { isSlashRegex() }?;


If we made the regex flavor pluggable then we'd replace [dg]? with [a-z]? or something like that.

I would propose a more descriptive name if possible e.g. grok and dissect rather than g and d. Thoughts?

I went with single letters because the regex flags are single letter and it "felt similar". I'd kind of prefer single letters over longer things, but I'm not super attached either way.

If we do stick with grammar changes, I prefer the single letters as well. Something like grok/.../ seems quite awkward.

nik9000 · 2021-01-21T13:28:43Z

modules/lang-painless/src/main/java/org/elasticsearch/painless/Compiler.java

+    /**
+     * Suppliers the watchdog that prevents grok from running forever.
+     */
+    private final Supplier<MatcherWatchdog> grokWatchdog;


This watchdog is what prevents grok from "running forever". It basically calls Grok#interrupt when groks have run too long. Grok itself periodically checks its own Grok#interrupt flag and Thread.currentThread.interrupted. It'd be better for us if we could register a callback that it checks. But it doesn't support that. So we need this watchdog thing. The watchdog is a shared component that tracks groks and interrupts them if they've run too long. It seems heavy, especially for painless. And the plumbing is unpleasant. But without modifying joni we're stuck with it. And modifying joni doesn't seem like a small project.

Another way to do it might be to plumb a listener for task cancellation into Task and have it call any running Groks. The more I type this the more I'm thinking modifying joni is the right thing.....
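
For readers unfamiliar with the pattern, the watchdog idea described above can be sketched as follows. This is a simplified illustration, not the `MatcherWatchdog` from the grok library: the class names, the explicit `nowMillis` parameters, and the polling method are all assumptions chosen to keep the example small and deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a running match that can be told to stop.
final class InterruptibleMatch {
    private final AtomicBoolean interrupted = new AtomicBoolean();

    void interrupt() {
        interrupted.set(true);
    }

    // A real matcher would poll this flag periodically while matching.
    boolean isInterrupted() {
        return interrupted.get();
    }
}

// Hypothetical watchdog: tracks matches by start time and interrupts overdue ones.
final class SimpleWatchdog {
    private final long maxRunMillis;
    private final Map<InterruptibleMatch, Long> running = new ConcurrentHashMap<>();

    SimpleWatchdog(long maxRunMillis) {
        this.maxRunMillis = maxRunMillis;
    }

    void register(InterruptibleMatch match, long nowMillis) {
        running.put(match, nowMillis);
    }

    void unregister(InterruptibleMatch match) {
        running.remove(match);
    }

    // Meant to be called periodically; returns how many matches were interrupted.
    int interruptLongRunning(long nowMillis) {
        int count = 0;
        for (Map.Entry<InterruptibleMatch, Long> e : running.entrySet()) {
            if (nowMillis - e.getValue() > maxRunMillis) {
                e.getKey().interrupt();
                running.remove(e.getKey());
                count++;
            }
        }
        return count;
    }
}
```

The shared map plus a periodic sweep is what makes this "heavy": every match has to register and unregister itself, which is the plumbing cost mentioned above. A callback checked by the matcher itself would avoid the shared state.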

nik9000 · 2021-01-21T13:30:20Z

modules/lang-painless/src/main/java/org/elasticsearch/painless/CompilerSettings.java

+        return grokPatternBank;
+    }
+
+    public void addToGrokPatternBank(String name, String pattern) {


Grok allows you to add things to the pattern bank and I started plumbing that through using the script's options. But it turns out that we have an assertion that options contains some very specific stuff off in Script, so I never finished it. Grok really does want to let folks write a pattern bank and options seems like an OK way to do it; it's just a change to how we'd been using options. I've left this here so we can talk about it.

nik9000 · 2021-01-21T15:12:26Z

modules/lang-painless/src/test/java/org/elasticsearch/painless/DivisionTests.java

+        assertEquals(1, exec("int d = 1; return d/1;"));
+        assertEquals(1, exec("int g = 1; return g/1;"));
+        assertEquals(1, exec("int j = 1; return j/1;"));
+    }


We want to be extra paranoid here that the compiler doesn't see g/ as the key for a grok regex when it is really division. It doesn't, but let's add an explicit test to make sure it never does.

nik9000 · 2021-01-21T15:14:31Z

...ng-painless/src/main/java/org/elasticsearch/painless/phase/DefaultSemanticAnalysisPhase.java

+            );
+            throw errorLocation.createError(
+                new IllegalArgumentException(
+                    "Could not compile java regex constant [" + pattern + "] with flags [" + flags + "]: " + pse.getDescription(),


When I was debugging stuff I noticed that I wasn't getting good error messages from bad java regexes. It turns out that pse.getDescription() has the actual cause of the syntax error in it and it isn't included in toString! I've added this because I wanted it when I was debugging stuff but it isn't really related to the rest of the change.
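
A small standalone sketch of the same idea, using only `java.util.regex`: catch the `PatternSyntaxException` and build the message from its individual fields. The `RegexErrors` class and `describeFailure` method are hypothetical names for this example; the message format loosely mirrors the one in the diff above.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class RegexErrors {
    // Returns null when the regex compiles, otherwise a message with the syntax error description.
    static String describeFailure(String pattern, int flags) {
        try {
            Pattern.compile(pattern, flags);
            return null;
        } catch (PatternSyntaxException pse) {
            return "Could not compile java regex constant [" + pse.getPattern()
                + "] with flags [" + flags + "]: " + pse.getDescription();
        }
    }
}
```

Calling `RegexErrors.describeFailure("(", 0)` yields a message that names the offending pattern and the reason it was rejected, while a valid pattern such as `"abc"` yields `null`.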

nik9000 · 2021-01-21T15:14:45Z

x-pack/plugin/ml/build.gradle


  // ml deps
-  api project(':libs:elasticsearch-grok')
+  compileOnly project(':libs:elasticsearch-grok') // Extending painless gets us the grok runtime classes


This is unfortunate.

nik9000 · 2021-01-21T15:15:47Z

modules/lang-painless/src/yamlRestTest/resources/rest-api-spec/test/painless/30_search.yml

+                            g/^%{IP:clientip} - -/.map(doc['message.keyword'].value).clientip
+
+    - match: { hits.total.value: 1 }
+    - match: { hits.hits.0.fields.clientip.0: 40.135.0.0 }


You can't specify the pattern bank over http right now because we don't allow extra options. Presumably we could build a special syntax for the pattern bank if we wanted, but that seems like something worth doing in a follow up rather than here.

nik9000 · 2021-01-21T15:17:25Z

x-pack/plugin/src/test/resources/rest-api-spec/test/runtime_fields/250_grok.yml

@@ -0,0 +1,73 @@
+---


I'm adding these to runtime fields just to prove that everything is plugged in. I expect the scripts themselves to get simpler and simpler over time. But you can use grok for runtime fields after this PR!

elasticmachine · 2021-01-21T15:25:30Z

Pinging @elastic/es-core-infra (Team:Core/Infra)

nik9000 · 2021-01-21T15:32:37Z

I expect some failures in ml that I have yet to debug. I don't think they'll force us to change the design, but they might. And we might get their fix for free if we decide that we want these to be outside of core painless.

rjernst · 2021-01-25T21:19:17Z

I left a comment, but wanted to leave a more general explanation of my thoughts here. I'm happy to see this PR is relatively simple, great work! However, I do think we should keep grok and painless separated. There are some aspects of grok that it is unclear how to expose here (custom patterns), so we should stay flexible for how the user can add that in the future. In the case of grok, we have discussed before the idea with ingest of having a GrokService that handles compilation of the patterns and registration of custom patterns. With that idea in mind, much of the development over 7.x in painless was geared toward allowing methods backed not only by static methods, but also even internal services. Maintaining Painless is difficult just with the existing feature set, so we should continue to endeavor to keep it as small as possible to minimize that maintenance burden, and separate concerns for user facing features vs fundamental concepts in the language.

This adds a `grok` and a `dissect` method to runtime fields which returns a `Matcher` style object you can use to get the matched patterns. A fairly simple script to extract the "verb" from an apache log line with `grok` would look like this:

```
String verb = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}
```

And `dissect` would look like:

```
String verb = dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}
```

We'll work later to get it down to a clean looking one liner, but for now, this'll do.

The `grok` and `dissect` methods are special in that they only run at script compile time. You can't pass non-constants to them. They'll produce compile errors if you send in a bad pattern. This is nice because they can be expensive to "compile" and there are many other optimizations we can make when the patterns are available up front.

Closes elastic#67825

nik9000 · 2021-01-27T22:55:08Z

Replaced by #68088.


nik9000 marked this pull request as ready for review January 21, 2021 15:25

nik9000 added the :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache label Jan 21, 2021

elasticmachine added the Team:Core/Infra Meta label for core/infra team label Jan 21, 2021

nik9000 added v7.12.0 v8.0.0 and removed Team:Core/Infra Meta label for core/infra team labels Jan 21, 2021

nik9000 requested review from javanna, jdconrad, rjernst, romseygeek and stu-elastic and removed request for rjernst and stu-elastic January 21, 2021 15:25

nik9000 added the >enhancement label Jan 21, 2021

Add example that doesn't have any matches

55da281

nik9000 mentioned this pull request Jan 27, 2021

Add grok and dissect methods to runtime fields #68088

Merged

nik9000 closed this Jan 27, 2021

nik9000 removed v7.12.0 v8.0.0 labels Jan 27, 2021

nik9000 mentioned this pull request Feb 1, 2021

Add grok and dissect methods to runtime fields (backport of #68088) #68332

Merged

stu-elastic mentioned this pull request Feb 4, 2021

Painless convenience functions for parsing log lines #60669

Closed

stevejgordon mentioned this pull request Feb 22, 2021

7.12.0 Meta Ticket elastic/elasticsearch-net#5337

Closed

34 tasks
