Start building analysis-common module #23614

nik9000 · 2017-03-16T16:12:57Z

The goal is to (eventually) move all the analyzers in lucene-analyzers-common.jar into a module so the core ES jar doesn't have to depend on lucene-analyzers-common.jar. This would only affect the high level rest client and the transport client. The download would still include the jar, just in a different spot.

So the real question here is, "is this worth our time?" This would shave maybe 2mb off of the high level rest client and transport client. lucene-analyzers-common.jar is 1.4mb. I'm generously assuming that the code I extract from core will be 600kb.

Some of the extraction is fairly easy - move files around, rig them up like plugins, etc. The effect on tests isn't super difficult, but means that the process takes time.

In the end the only analyzer available to tests in core would be standard because that is all that is in lucene-core. The standard tokenizer would be available in core as well. No token filters would be available though. Only the lowercase and mock token filters are available in lucene-core, and we can't expose them in core: lowercase in Elasticsearch is linked with GreekLowerCaseFilter, IrishLowerCaseFilter, and TurkishLowerCaseFilter, and mock is just for testing. Useful for writing tests, mind, but not a thing we can expose.
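To illustrate why a "plain" lowercase filter is hard to expose on its own, here is a rough sketch of a lowercase step that picks language-specific rules. It is not the Elasticsearch factory: the class and method names are hypothetical, the language values are assumptions, and java.util.Locale stands in for the Lucene language filters. Only Turkish is shown.

```java
import java.util.Locale;

// Illustrative sketch only: names are hypothetical, and java.util.Locale
// stands in for the Lucene language-specific lowercase filters.
final class LowercaseSketch {
    /** Lowercases a token, choosing language-specific rules when a language is given. */
    static String lowercase(String token, String language) {
        if (language == null) {
            // Plain lowercase, the only variant lucene-core could back
            return token.toLowerCase(Locale.ROOT);
        }
        switch (language) {
            case "turkish":
                // Turkish maps capital I to dotless i (U+0131), unlike the root locale
                return token.toLowerCase(Locale.forLanguageTag("tr"));
            default:
                throw new IllegalArgumentException("unsupported language [" + language + "]");
        }
    }
}
```

The point of the sketch is that the single name "lowercase" fans out to implementations that live outside lucene-core, so exposing it from core would drag those along.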

So I'm opening this up for discussion: should we do it?

My thoughts:

1. It'd make the core theoretically cleaner. Lucene has had this separation for a long time.
2. It'd be a pretty big change. Not as low-hanging a fruit as we thought. This PR isn't small and it only does three token filters.
3. It wouldn't really save that many bits unless I'm reading it wrong.

I'm quite happy to kill this if we decide it isn't a useful savings.

These tests check the interaction of analyzers and core components so they have to move to where the analyzers are available.

nik9000 · 2017-03-16T16:17:55Z

Another cool thing this'd do:
4. It'd make super duper sure that analyzers are completely pluggable. That is actually a pretty cool side effect. We couldn't do anything in core that can't be done in plugins because there aren't any analyzers in core.
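A minimal sketch of what "completely pluggable" could look like, assuming core keeps only a registry and every token filter arrives through a plugin-style interface. All names here are hypothetical and simplified; this is not the real Elasticsearch plugin API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not the Elasticsearch plugin API: every name is invented for illustration.
final class PluggableAnalysisSketch {
    /** Stand-in for a token filter: turns one token into another. */
    interface TokenFilter extends Function<String, String> {}

    /** Stand-in for a plugin or module that contributes token filters by name. */
    interface AnalysisPluginSketch {
        Map<String, TokenFilter> getTokenFilters();
    }

    /** Core owns only the registry; it ships no filters of its own. */
    static final class Registry {
        private final Map<String, TokenFilter> filters = new HashMap<>();

        void register(AnalysisPluginSketch plugin) {
            for (Map.Entry<String, TokenFilter> e : plugin.getTokenFilters().entrySet()) {
                // Reject duplicates so two plugins cannot silently shadow each other
                if (filters.putIfAbsent(e.getKey(), e.getValue()) != null) {
                    throw new IllegalArgumentException("token filter [" + e.getKey() + "] already registered");
                }
            }
        }

        TokenFilter get(String name) {
            TokenFilter filter = filters.get(name);
            if (filter == null) {
                throw new IllegalArgumentException("unknown token filter [" + name + "]");
            }
            return filter;
        }
    }

    /** What a module such as analysis-common would contribute in this sketch. */
    static final class CommonAnalysisSketch implements AnalysisPluginSketch {
        @Override
        public Map<String, TokenFilter> getTokenFilters() {
            Map<String, TokenFilter> filters = new HashMap<>();
            filters.put("uppercase", token -> token.toUpperCase(Locale.ROOT));
            return filters;
        }
    }
}
```

With this shape, anything the built-in module does goes through the same register call a third-party plugin would use, which is the side effect described above.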

nik9000 · 2017-03-16T16:21:26Z

I don't have a guess as to how much time this'd take to complete. Maybe two weeks? Maybe one. Maybe three. Probably not a month. And I'd need a review buddy like I needed when I did all the NamedWriteable work. Someone to hold the project in their head and do reviews, but they wouldn't need to do the coding, I think.

dadoonet · 2017-03-16T16:33:02Z

"is this worth our time?"

I'm all for modules. So IMO yes.

I take this as a start on splitting the core project. Even though we don't yet see a positive effect on the size, I'm sure it's the right direction.
Note that we can also think about them as jars instead of plugins. I mean that we can totally have elasticsearch which depends on a jar we don't have to package in our clients.

Exposing a jar through a plugin is another question IMO.

nik9000 · 2017-03-20T14:25:28Z

@s1monw, this was your idea. Do you think it is worth doing? Sinking another couple of days into it and reevaluating my estimates?

I mean that we can totally have elasticsearch which depends on a jar we don't have to package in our clients.

I like doing it as a module because that allows us to easily rely on bits of Elasticsearch core and to really validate that plugins can add any sort of analyzer.

nik9000 · 2017-03-20T16:44:36Z

Talked with a group of folks at Elastic interested in the client and got consensus to do this. I've switched this issue from discuss to review. Please have a look and see if you like the patterns I've been using for the migration. My plan is to create a big tracking issue for this work and to do it incrementally so we can release Elasticsearch while we're doing it.

nik9000 · 2017-03-27T18:36:45Z

core/src/main/java/org/elasticsearch/index/analysis/EdgeNGramTokenFilterFactory.java

    }
+
+    @Override
+    public boolean breaksFastVectorHighlighter() {


I should put javadoc on this.

Er, on the interface method, rather.

++, it would help as it's not clear to me what the method's purpose is, since it's currently used in the containsBrokenAnalysis check

abeyad

Left just a few comments / questions.

abeyad · 2017-03-28T18:37:42Z

core/src/main/java/org/elasticsearch/index/analysis/EdgeNGramTokenFilterFactory.java

    }
+
+    @Override
+    public boolean breaksFastVectorHighlighter() {


++, it would help as it's not clear to me what the method's purpose is, since it's currently used in the containsBrokenAnalysis check

abeyad · 2017-03-29T01:33:25Z

core/src/test/java/org/elasticsearch/action/admin/indices/TransportAnalyzeActionTests.java

        AnalyzeRequest request = new AnalyzeRequest();
-        request.analyzer("standard");
        request.text("the quick brown fox");
+


nit: extra empty line

abeyad · 2017-03-29T01:37:20Z

core/src/test/java/org/elasticsearch/action/admin/indices/TransportAnalyzeActionTests.java

+        assertEquals(4, tokens.size());
        assertEquals("the", tokens.get(0).getTerm());
-        assertEquals("qu1ck", tokens.get(1).getTerm());
+        assertEquals("quick", tokens.get(1).getTerm());


this block of the test now seems redundant, it duplicates the one above it

This one tests using a tokenizer instead of a fully built analyzer. I'll push a comment explaining

abeyad · 2017-03-29T01:55:25Z

test/framework/src/main/java/org/elasticsearch/AnalysisFactoryTestCase.java

        .put("arabicnormalization",       ArabicNormalizationFilterFactory.class)
        .put("arabicstem",                ArabicStemTokenFilterFactory.class)
-        .put("asciifolding",              ASCIIFoldingTokenFilterFactory.class)
+        .put("asciifolding",              Void.class)  // TODO remove this when core no longer depends on analysis-common


Why is this necessary? Why can't these filters just be removed from the map?

The test is written in such a way that it'll fail if there are any unmapped analysis components. It is our reminder to expose whatever is in Lucene. So we can't remove it. Instead we have to pretend we're intentionally not exposing it, which we do by marking it as exposed by Void. At least, that is the pattern the test uses. It is a little funny, but the TODO notes that it should go away when we finally drop the analyzers dependency from core.

I'll push something that makes this a bit more clear/fancy. It'll be temporary while we do the migration, but it'll help.
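A rough sketch of the pattern described above, assuming the test compares the names Lucene offers against a map of known names. The class and method names are hypothetical; only the use of Void.class as the "deliberately not exposed here" marker comes from the diff.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the "every upstream component must be mapped" check.
final class ExposureCheckSketch {
    /** Returns the upstream names the map does not mention at all; a test would fail if this is non-empty. */
    static Set<String> unmapped(Set<String> upstreamNames, Map<String, Class<?>> known) {
        Set<String> missing = new TreeSet<>(upstreamNames);
        missing.removeAll(known.keySet());
        return missing;
    }

    /** A name mapped to Void.class is acknowledged but deliberately not exposed from this place. */
    static boolean isIntentionallyUnexposed(Map<String, Class<?>> known, String name) {
        return known.get(name) == Void.class;
    }
}
```

Removing "asciifolding" from the map would put it in the unmapped set and fail the check, while mapping it to Void.class keeps the reminder quiet without claiming core still provides it.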

abeyad · 2017-03-29T02:15:57Z

...ysis-common/src/test/java/org/elasticsearch/analysis/common/TransportAnalyzeActionTests.java

+import static java.util.Collections.singletonList;
+
+/**
+ * More "intense" version of a unit test with the same name that is in core. This one has access to the analyzers in this module.


It's not clear to me why we still need the core version of TransportAnalyzeActionTests?

I want to have something close to the code, but you are right, it isn't nearly as good as it used to be. I toyed with moving the TransportAnalyzeAction itself into analysis-common but I don't think that is a great choice for the high level rest client. It'll want to depend on request and response objects and the whole point of this project is to make the high level rest client not need them.

What does TransportAnalyzeAction have to do with request/response objects? The action should not be used by the high level client?

abeyad · 2017-03-29T02:16:47Z

...ysis-common/src/test/java/org/elasticsearch/analysis/common/TransportAnalyzeActionTests.java

+    }
+
+    public void testWithIndexAnalyzers() throws IOException {
+


Nit: extra empty line

abeyad · 2017-03-29T02:18:30Z

core/src/test/java/org/elasticsearch/index/analysis/AnalysisRegistryTests.java

+            @Override
+            public Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
+                Map<String, AnalysisProvider<TokenFilterFactory>> filters = new HashMap<>();
+                filters.put("mock", MockFactory::new);


Collections.singletonMap(...)?

I guess so! I think I wrote this like this at first because I thought I'd be adding more. I'm not sure, honestly. I'll change it.
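For reference, the suggestion amounts to something like the following sketch. A Supplier stands in for the real provider type and the class name is hypothetical. One thing to keep in mind is that the map returned by Collections.singletonMap is immutable, so it only fits if no more entries will be added.

```java
import java.util.Collections;
import java.util.Map;
import java.util.function.Supplier;

// Sketch only: Supplier<String> stands in for AnalysisProvider<TokenFilterFactory>.
final class SingletonMapSketch {
    /** One entry, no mutable HashMap needed. */
    static Map<String, Supplier<String>> tokenFilters() {
        return Collections.singletonMap("mock", () -> "mock-filter");
    }
}
```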

abeyad · 2017-03-29T02:20:11Z

...is-common/src/test/java/org/elasticsearch/analysis/common/QueryStringWithAnalyzersTests.java

+
+    @Override
+    protected int maximumNumberOfShards() {
+        return 7;


Why the number 7?

I have no clue. This came from QueryStringIT. I can certainly add a comment about its unknown origin. Hell, I can experiment with dropping it entirely but I wanted to get this PR up for review quickly so I didn't want to destabilize the test.

Because it is weird I'll drop it. If something starts failing I'll dig.

nik9000 · 2017-03-31T00:31:44Z

Are you proposing having the request and response in core and the implementation in the module? That isn't a thing we've done in the past but I don't have any objections. It might be conceptually simpler to keep the implementation in core and move the tests wholesale to the module. Do you have a preference? I did what I felt was the least intrusive thing but if you'd prefer something else I'm fine with it.


rjernst · 2017-03-31T02:21:05Z

In a theoretical future world, I could see the request/response objects separated, and upstream of core, so that both core and the rest client could use them. So I don't have any problem with the implementation of a particular request being separate from the definition of the api (request/response classes). Those are just simple containers for stuff, I don't think they normally contain much, if any, logic.

nik9000 · 2017-03-31T02:28:47Z

I don't think they normally contain much, if any, logic

Pretty much just validation, parsing, and streaming, yeah.

Yeah, I understand wanting to move the requests and responses out one day. From that perspective, sure, I'm fine to move the action to analysis-common.

The analysis-common module will have all the analyzers in it anyway so the only way we're going to get a good test of the analyze action is in that module.

nik9000 · 2017-04-01T16:58:02Z

@rjernst and @abeyad, I've pushed a bunch of updates.

I talked with @rjernst about moving the TransportAnalyzeAction into analysis-common. You can see in the commit history above that I tried doing it. It turned out to be kind of a mess. Instead I'm cutting the analyzers out of the tests for TransportAnalyzeAction in core and moving them to smoke tests of the analyzer in analysis-common. Each analyzer I move will get a simple smoke test for it in addition to the more comprehensive unit tests that they mostly already have.

rjernst

LGTM, one minor suggestion, do with it as you will. And I like the simpler test in core, with a simple example token filter, very nice!

rjernst · 2017-04-05T19:04:35Z

core/src/main/java/org/elasticsearch/index/analysis/TokenFilterFactory.java

+     * {@link FastVectorHighlighter}? If this is {@code true} then the
+     * {@linkplain FastVectorHighlighter} will attempt to work around the broken offsets.
+     */
+    default boolean breaksFastVectorHighlighter() {


Is this something we should really "allow"? Perhaps the hack could continue to exist as it did before, but with checking the name of the class instead of instanceof?

I'm not sure. I kind of like that the hack is at least more visible this way. For now I think we should keep it. Maybe we can pitch it if we ever go to 100% unified highlighter....
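To make the trade-off concrete, here is a simplified sketch of the opt-in approach. The interface is a cut-down stand-in for TokenFilterFactory, and the shape of containsBrokenAnalysis is an assumption based on the review comments, not a copy of the real check.

```java
import java.util.List;

// Simplified sketch: a cut-down stand-in for TokenFilterFactory and an assumed
// shape for the containsBrokenAnalysis check mentioned in the review.
final class BrokenOffsetsSketch {
    interface TokenFilterFactory {
        String name();

        /** Most filters keep offsets intact, so the default is false. */
        default boolean breaksFastVectorHighlighter() {
            return false;
        }
    }

    /** A filter that relies on the default. */
    static TokenFilterFactory plain(String name) {
        return () -> name;
    }

    /** A filter that opts in, the way EdgeNGramTokenFilterFactory does in the diff. */
    static TokenFilterFactory breaking(String name) {
        return new TokenFilterFactory() {
            @Override
            public String name() {
                return name;
            }

            @Override
            public boolean breaksFastVectorHighlighter() {
                return true;
            }
        };
    }

    /** True if any filter in the chain says it breaks offsets; no instanceof or class-name checks. */
    static boolean containsBrokenAnalysis(List<TokenFilterFactory> chain) {
        for (TokenFilterFactory factory : chain) {
            if (factory.breaksFastVectorHighlighter()) {
                return true;
            }
        }
        return false;
    }
}
```

Compared with matching on class names, the check no longer needs to know which concrete filters exist, so filters living in a module or plugin can take part in it.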

abeyad

LGTM

nik9000 · 2017-04-19T22:54:34Z

I just attempted a backport and it was fairly unclean. Any objections to me not backporting to 5.x and just leaving this master only?

rjernst · 2017-04-19T23:00:35Z

+1 to leaving master only

nik9000 added 4 commits March 15, 2017 17:53

Move asciifolding into analysis-common

1869db8

Tests

a4ed494

Fix some tests

1e4a2d9

Move some tests

e0cef65

These tests check the interaction of analyzers and core components so they have to move to where the analyzers are available.

nik9000 added the :Search Relevance/Analysis and discuss labels Mar 16, 2017

nik9000 mentioned this pull request Mar 16, 2017

Java High Level REST Client plan for first release #23331

Closed

58 tasks

More tests

2310f3c

nik9000 added review v5.4.0 v6.0.0-alpha1 and removed discuss labels Mar 20, 2017

nik9000 mentioned this pull request Mar 20, 2017

Move analysis components to a module #23658

Open

75 tasks

nik9000 commented Mar 27, 2017


abeyad self-requested a review March 27, 2017 18:38

abeyad suggested changes Mar 29, 2017


nik9000 added 9 commits March 30, 2017 12:40

Merge branch 'master' into module_analyzers_common

ed8d9be

Javadoc for breaksFastVectorHighlighter

dcbc7cf

Line length

c20aca4

Import

a09c37f

Line length

c92a6a8

line length

39df4c9

Line length

1ba49a6

Line length

1caf03f

Line length

16102ff

nik9000 added 10 commits March 31, 2017 11:20

Merge branch 'master' into module_analyzers_common

5b555a7

Merge branch 'master' into module_analyzers_common

929b83f

Move analyze action implementation to module

7301a6a

The analysis-common module will have all the analyzers in it anyway so the only way we're going to get a good test of the analyze action is in that module.

Hack transportCient so this works

047173b

Let's not

5fb1442

Break out tests

5d45b69

Add smoke tests for moved analyzers

fced812

Cleanup tests

eaafe40

Merge branch 'master' into module_analyzers_common

f6b907c

Remove now duplicate tests

d4bf67c

rjernst approved these changes Apr 5, 2017


abeyad approved these changes Apr 5, 2017


nik9000 added 5 commits April 12, 2017 14:56

Merge branch 'master' into module_analyzers_common

96531c7

Cleanup after merge

dfdc53c

Remove test accidentally added in merge

51609e1

Merge branch 'master' into module_analyzers_common

8608652

Catch exception

fac537e

nik9000 added v5.5.0 and removed v5.4.0 labels Apr 19, 2017

nik9000 merged commit caf376c into elastic:master Apr 19, 2017

nik9000 removed the v5.5.0 label Apr 19, 2017

clintongormley added >enhancement >non-issue and removed >enhancement labels Apr 24, 2017
