Skip to content

Conversation

@nik9000
Copy link
Member

@nik9000 nik9000 commented Mar 16, 2017

Relates to #23331 (comment).

The goal is to (eventually) move all the analyzers in lucene-analzyers-common.jar into a module so the core ES jar doesn't have to depend on lucene-analyzer-common.jar. This would only affect the high level rest client and the transport client. The download would still include the the jar, just in a different spot.

So the real question here is, "is this worth our time?" This would shave maybe 2mb off of the high level rest client and transport client. lucene-analzyers-common.jar is 1.4mb. I'm generously assuming that the code I extract from core will be 600kb.

Some of the extraction is fairly easy - move files around, rig them up like plugins, etc. The effect on tests isn't super difficult, but means that the process takes time.

In the end the only analyzer available to tests in core would be standard because that is all that is in lucene-core. The standard tokenizer would be available in core as well. No token would be available though. Only the lowercase and mock tokenfilters are available in lucene-core and we can't expose them in core because lowercase in Elasticsearch is linked with GreekLowerCaseFilter, IrishLowerCaseFilter, and TurkishLowerCaseFilter and mock is just for testing. Useful for writing tests, mind, but not a thing we can expose.

So I'm opening this up for discussion: should we do it?

My thoughts:

  1. It'd make the core theoretically cleaner. Lucene has had this separation for a long time.
  2. It'd be a pretty big change. Not as long hanging fruit as we thought. This PR isn't small and it only does three token filters.
  3. It wouldn't really save that many bits unless I'm reading it wrong.

I'm quite happy to kill this if we decide it isn't a useful savings.

nik9000 added 4 commits March 15, 2017 17:53
These tests check the interaction of analyzers and core components
so they have to move to where the analyzers are available.
@nik9000
Copy link
Member Author

nik9000 commented Mar 16, 2017

Another cool thing this'd do:
4. It'd make super duper sure that analyzers are completely pluggable. That is actually a pretty cool side effect. We couldn't do anything it core that can't be done in plugins because there aren't any analyzers in core.

@nik9000
Copy link
Member Author

nik9000 commented Mar 16, 2017

I don't have a guess as to how much time this'd take to complete. Maybe two weeks? Maybe one. Maybe three. Probably not a month. And I'd need a review buddy like I needed when I did all the NamedWriteable work. Someone to hold the project in their head and do reviews, but they wouldn't need to do the coding, I think.

@dadoonet
Copy link
Contributor

"is this worth our time?

I'm all for modules. So IMO yes.

I take that as a start to split the core project. Even though we don't see yet the positive effect on the size I'm sure it's the right direction.
Note that we can also think about them as jars instead of plugins. I mean that we can totally have elasticsearch which depends on a jar we don't have to package in our clients.

Exposing a jar through a plugin is another question IMO.

@nik9000
Copy link
Member Author

nik9000 commented Mar 20, 2017

@s1monw, this was your idea. Do you think it is worth doing? Sinking another couple of days into and reevaluating my estimates?

I mean that we can totally have elasticsearch which depends on a jar we don't have to package in our clients.

I like doing it as a module because that allows us to easily rely on bits of Elasticsearch core and to really validate that plugins can add any sort of analyzer.

@nik9000
Copy link
Member Author

nik9000 commented Mar 20, 2017

Talked with a group of folks at Elastic interested in the client and got consensus to do this. I've switch this issue from discuss to review. Please have a look at see if you like the patterns I've been using for the migration. My plan is to create a big tracking issue for this work and to do it incrementally so we can release Elasticsearch while we're doing it.

}

@Override
public boolean breaksFastVectorHighlighter() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should put javadoc on this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Er, on the interface method, rather.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++, it would help as its not clear to me what the method's purpose is, since its currently used in the containsBrokenAnalysis check

@abeyad abeyad self-requested a review March 27, 2017 18:38
Copy link

@abeyad abeyad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left just a few comments / questions.

}

@Override
public boolean breaksFastVectorHighlighter() {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++, it would help as its not clear to me what the method's purpose is, since its currently used in the containsBrokenAnalysis check

AnalyzeRequest request = new AnalyzeRequest();
request.analyzer("standard");
request.text("the quick brown fox");

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra empty line

assertEquals(4, tokens.size());
assertEquals("the", tokens.get(0).getTerm());
assertEquals("qu1ck", tokens.get(1).getTerm());
assertEquals("quick", tokens.get(1).getTerm());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this block of the test now seems redundant, it duplicates the one above it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one tests using a tokenizer instead of a fully built analyzer. I'll push a comment explaining

.put("arabicnormalization", ArabicNormalizationFilterFactory.class)
.put("arabicstem", ArabicStemTokenFilterFactory.class)
.put("asciifolding", ASCIIFoldingTokenFilterFactory.class)
.put("asciifolding", Void.class) // TODO remove this when core no longer depends on analysis-common
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary? Why can't these filters just be removed from the map?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is written in such a way that it'll fail if there are any unmapped analysis components. It is our reminder to expose whatever is in Lucene. So we can't remove it. Instead we have to pretend we're intentionally not exposing it which we do by marking it as exposed by Void. At least, that is the pattern the test uses. It is a little funny, but the TODO notes that it should go away when we finally drop analyzer's dependency from core.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll push something that makes this a bit more clear/fancy. It'll be temporary while we do the migration, but it'll help.

import static java.util.Collections.singletonList;

/**
* More "intense" version of a unit test with the same name that is in core. This one has access to the analyzers in this module.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not clear to me why we still need the core version of TransportAnalyzeActionTests?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to have something close to the code, but you are right, it isn't nearly as good as it used to be. I toyed with moving the TransportAnalyzeAction itself into analysis-common but I don't think that is a great choice for the high level rest client. It'll want to depend on request and response objects and the whole point of this project is to make the high level rest client not need them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does TransportAnalyzeAction have to do with request/response objects? The action should not be used by the high level client?

}

public void testWithIndexAnalyzers() throws IOException {

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: extra empty line

@Override
public Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
Map<String, AnalysisProvider<TokenFilterFactory>> filters = new HashMap<>();
filters.put("mock", MockFactory::new);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collections.singletonMap(...)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess so! I think I wrote this like this at first because I thought I'd be adding more. I'm not sure, honestly. I'll change it.


@Override
protected int maximumNumberOfShards() {
return 7;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the number 7?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no clue. This came from QueryStringIT. I can certainly add a comment about its unknown origin. Hell, I can experiment with dropping it entirely but I wanted to get this PR up for review quickly so I didn't want to destabilize the test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it is weird I'll drop it. If something starts failing I'll dig.

@nik9000
Copy link
Member Author

nik9000 commented Mar 31, 2017 via email

@rjernst
Copy link
Member

rjernst commented Mar 31, 2017

In a theoretical future world, I could see the request/response objects separated, and upstream of core, so that both core and the rest client could use them. So I don't have any problem with the implementation of a particular request being separate from the definition of the api (request/response classes). Those are just simple containers for stuff, I don't think they normally contain much, if any, logic.

@nik9000
Copy link
Member Author

nik9000 commented Mar 31, 2017

I don't think they normally contain much, if any, logic

Pretty much just validation, parsing, and streaming, yeah.

Yeah, I understand wanting to move the requests and responses out one day. From that perspective, sure, I'm fine to move the action to analysis-common.

@nik9000
Copy link
Member Author

nik9000 commented Apr 1, 2017

@rjernst and @abeyad, I've pushed a bunch of updates.

I talked with @rjernst about moving the TransportAnalyzeAction into analysis-common. You can see in the commit history above that I tried doing it. It turned out to be kind of a mess. Instead I'm cutting the analyzers out of the tests for TransportAnalyzerAction in core and moving them to smoke tests of the analyzer in analysis-common. Each analyzer I move will get a simple smoke test for it in addition to the more comprehensive unit tests that they mostly already have.

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one minor suggestion, do with it as you will. And I like the simpler test in core, with a simple example token filter, very nice!

* {@link FastVectorHighlighter}? If this is {@code true} then the
* {@linkplain FastVectorHighlighter} will attempt to work around the broken offsets.
*/
default boolean breaksFastVectorHighlighter() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something we should really "allow"? Perhaps the hack could continue to exist as it did before, but with checking the name of the class instead of instanceof?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I kind of like that the hack is at least more visible this way. For now I think we should keep it. Maybe we can pitch it if we ever go to 100% unified highlighter....

Copy link

@abeyad abeyad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nik9000 nik9000 added v5.5.0 and removed v5.4.0 labels Apr 19, 2017
@nik9000 nik9000 merged commit caf376c into elastic:master Apr 19, 2017
@nik9000 nik9000 removed the v5.5.0 label Apr 19, 2017
@nik9000
Copy link
Member Author

nik9000 commented Apr 19, 2017

I just attempted a backport and it was fairly unclean. Any objections to me not backporting to 5.x and just leaving this master only?

@rjernst
Copy link
Member

rjernst commented Apr 19, 2017

+1 to leaving master only

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants