Start building analysis-common module #23614
These tests check the interaction of analyzers and core components so they have to move to where the analyzers are available.
Another cool thing this'd do:

I don't have a guess as to how much time this'd take to complete. Maybe two weeks? Maybe one. Maybe three. Probably not a month. And I'd need a review buddy like I needed when I did all the NamedWriteable work. Someone to hold the project in their head and do reviews, but they wouldn't need to do the coding, I think.

I'm all for modules. So IMO yes. I take that as a start to split the core project. Even though we don't yet see the positive effect on the size, I'm sure it's the right direction. Exposing a jar through a plugin is another question IMO.

@s1monw, this was your idea. Do you think it is worth doing? Sinking another couple of days into it and reevaluating my estimates?
I like doing it as a module because that allows us to easily rely on bits of Elasticsearch core and to really validate that plugins can add any sort of analyzer.

Talked with a group of folks at Elastic interested in the client and got consensus to do this. I've switched this issue from
    }

    @Override
    public boolean breaksFastVectorHighlighter() {
I should put javadoc on this.
Er, on the interface method, rather.
++, it would help as it's not clear to me what the method's purpose is, since it's currently used in the containsBrokenAnalysis check
abeyad
left a comment
Left just a few comments / questions.
    AnalyzeRequest request = new AnalyzeRequest();
    request.analyzer("standard");
    request.text("the quick brown fox");

nit: extra empty line
    assertEquals(4, tokens.size());
    assertEquals("the", tokens.get(0).getTerm());
    assertEquals("qu1ck", tokens.get(1).getTerm());
    assertEquals("quick", tokens.get(1).getTerm());
this block of the test now seems redundant, it duplicates the one above it
This one tests using a tokenizer instead of a fully built analyzer. I'll push a comment explaining
    .put("arabicnormalization", ArabicNormalizationFilterFactory.class)
    .put("arabicstem", ArabicStemTokenFilterFactory.class)
    .put("asciifolding", ASCIIFoldingTokenFilterFactory.class)
    .put("asciifolding", Void.class) // TODO remove this when core no longer depends on analysis-common
Why is this necessary? Why can't these filters just be removed from the map?
The test is written in such a way that it'll fail if there are any unmapped analysis components. It is our reminder to expose whatever is in Lucene. So we can't remove it. Instead we have to pretend we're intentionally not exposing it, which we do by marking it as exposed by Void. At least, that is the pattern the test uses. It is a little funny, but the TODO notes that it should go away when we finally drop the analyzers dependency from core.
I'll push something that makes this a bit more clear/fancy. It'll be temporary while we do the migration, but it'll help.
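The "exposed by Void" pattern discussed in this thread can be sketched roughly as below. This is a minimal, self-contained illustration and not the actual Elasticsearch test: the class `VoidMarkerSketch`, the method `findUnmapped`, the stand-in `AsciiFoldingFactory`, and the sample component names are all hypothetical. The assumption is that every component name known to the library must have an entry in the map, and an entry of `Void.class` records that the component is deliberately not exposed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; none of these names come from the Elasticsearch code base.
class VoidMarkerSketch {
    // Stand-in for a real factory class that a module would expose.
    static class AsciiFoldingFactory {}

    /** Returns, in sorted order, the known component names that have no entry in the map. */
    static List<String> findUnmapped(Set<String> known, Map<String, Class<?>> mapped) {
        List<String> missing = new ArrayList<>();
        for (String name : new TreeSet<>(known)) {
            if (!mapped.containsKey(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> known = new TreeSet<>(Arrays.asList("arabicstem", "asciifolding"));

        Map<String, Class<?>> mapped = new HashMap<>();
        // Void.class means "we know about this one and intentionally do not expose it here".
        mapped.put("asciifolding", Void.class);

        // "arabicstem" has no entry at all, so a check like this would flag it.
        System.out.println("Unmapped components: " + findUnmapped(known, mapped));
    }
}
```

Deleting the `asciifolding` entry instead of mapping it to `Void.class` would make the check report it as forgotten, which is why the entry stays in the map during the migration.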
    import static java.util.Collections.singletonList;

    /**
     * More "intense" version of a unit test with the same name that is in core. This one has access to the analyzers in this module.
It's not clear to me why we still need the core version of TransportAnalyzeActionTests.
I want to have something close to the code, but you are right, it isn't nearly as good as it used to be. I toyed with moving the TransportAnalyzeAction itself into analysis-common but I don't think that is a great choice for the high level rest client. It'll want to depend on request and response objects and the whole point of this project is to make the high level rest client not need them.
What does TransportAnalyzeAction have to do with request/response objects? The action should not be used by the high level client?
    }

    public void testWithIndexAnalyzers() throws IOException {

Nit: extra empty line
    @Override
    public Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
        Map<String, AnalysisProvider<TokenFilterFactory>> filters = new HashMap<>();
        filters.put("mock", MockFactory::new);
Collections.singletonMap(...)?
I guess so! I think I wrote this like this at first because I thought I'd be adding more. I'm not sure, honestly. I'll change it.
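For reference, the `Collections.singletonMap(...)` suggestion would look something like the sketch below. It uses `java.util.function.Supplier` and a hypothetical `MockFactory` stand-in rather than the real `AnalysisProvider` and `TokenFilterFactory` types, so treat it as an illustration of the shape only. One thing worth keeping in mind is that the returned map is immutable, so this only fits if nothing adds to the map later.

```java
import java.util.Collections;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-ins: Supplier replaces AnalysisProvider, MockFactory is a dummy class.
class SingletonMapSketch {
    static class MockFactory {
        String name() {
            return "mock";
        }
    }

    // One entry, built in a single expression instead of new HashMap<>() plus put(...).
    static Map<String, Supplier<MockFactory>> getTokenFilters() {
        return Collections.singletonMap("mock", MockFactory::new);
    }

    public static void main(String[] args) {
        Map<String, Supplier<MockFactory>> filters = getTokenFilters();
        System.out.println(filters.get("mock").get().name());

        try {
            // singletonMap returns an immutable map, so adding a second entry fails.
            filters.put("other", MockFactory::new);
        } catch (UnsupportedOperationException e) {
            System.out.println("map is immutable");
        }
    }
}
```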
    @Override
    protected int maximumNumberOfShards() {
        return 7;
Why the number 7?
I have no clue. This came from QueryStringIT. I can certainly add a comment about its unknown origin. Hell, I can experiment with dropping it entirely but I wanted to get this PR up for review quickly so I didn't want to destabilize the test.
Because it is weird I'll drop it. If something starts failing I'll dig.
Are you proposing having the request and response in core and the implementation in the module? That isn't a thing we've done in the past but I don't have any objections. It might be simpler conceptually to keep the implementation in core and move the tests wholesale to the module. Do you have a preference? I did what I felt was the least intrusive thing but if you'd prefer something else I'm fine with it.
On Thu, Mar 30, 2017, 5:23 PM Ryan Ernst wrote:
In modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/TransportAnalyzeActionTests.java:
> +import org.elasticsearch.env.Environment;
+import org.elasticsearch.index.IndexSettings;
+import org.elasticsearch.index.analysis.AnalysisRegistry;
+import org.elasticsearch.index.analysis.IndexAnalyzers;
+import org.elasticsearch.index.mapper.AllFieldMapper;
+import org.elasticsearch.indices.analysis.AnalysisModule;
+import org.elasticsearch.test.ESTestCase;
+import org.elasticsearch.test.IndexSettingsModule;
+
+import java.io.IOException;
+import java.util.List;
+
+import static java.util.Collections.singletonList;
+
+/**
+ * More "intense" version of a unit test with the same name that is in core. This one has access to the analyzers in this module.
What does TransportAnalyzeAction have to do with request/response
objects? The action should not be used by the high level client?
In a theoretical future world, I could see the request/response objects separated, and upstream of core, so that both core and the rest client could use them. So I don't have any problem with the implementation of a particular request being separate from the definition of the api (request/response classes). Those are just simple containers for stuff, I don't think they normally contain much, if any, logic.

Pretty much just validation, parsing, and streaming, yeah. Yeah, I understand wanting to move the requests and responses out one day. From that perspective, sure, I'm fine to move the action to analysis-common.
The analysis-common module will have all the analyzers in it anyway so the only way we're going to get a good test of the analyze action is in that module.

@rjernst and @abeyad, I've pushed a bunch of updates. I talked with @rjernst about moving the TransportAnalyzeAction into analysis-common. You can see in the commit history above that I tried doing it. It turned out to be kind of a mess. Instead I'm cutting the analyzers out of the tests for TransportAnalyzeAction in core and moving them to smoke tests of the analyzer in analysis-common. Each analyzer I move will get a simple smoke test for it in addition to the more comprehensive unit tests that they mostly already have.
rjernst
left a comment
LGTM, one minor suggestion, do with it as you will. And I like the simpler test in core, with a simple example token filter, very nice!
     * {@link FastVectorHighlighter}? If this is {@code true} then the
     * {@linkplain FastVectorHighlighter} will attempt to work around the broken offsets.
     */
    default boolean breaksFastVectorHighlighter() {
Is this something we should really "allow"? Perhaps the hack could continue to exist as it did before, but with checking the name of the class instead of instanceof?
I'm not sure. I kind of like that the hack is at least more visible this way. For now I think we should keep it. Maybe we can pitch it if we ever go to 100% unified highlighter....
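To make the trade-off in this thread concrete, here is a rough sketch of the default-method approach. The names `breaksFastVectorHighlighter` and `containsBrokenAnalysis` come from the discussion above, but the bodies are guesses, and the interface shown is a simplified stand-in rather than the real `TokenFilterFactory`; `HighlighterHackSketch`, `BrokenOffsetsFactory`, `wellBehaved`, and `name()` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical stand-ins; the real Elasticsearch interface has more methods.
class HighlighterHackSketch {
    interface TokenFilterFactory {
        String name();

        /** Does this filter produce offsets that break the fast vector highlighter? */
        default boolean breaksFastVectorHighlighter() {
            return false; // most filters are fine, so they simply inherit this
        }
    }

    /** A filter that keeps the default answer. */
    static TokenFilterFactory wellBehaved(String name) {
        return () -> name;
    }

    /** A filter that opts in to the workaround by overriding the default. */
    static class BrokenOffsetsFactory implements TokenFilterFactory {
        @Override
        public String name() {
            return "broken_offsets";
        }

        @Override
        public boolean breaksFastVectorHighlighter() {
            return true;
        }
    }

    /** The caller asks each factory, with no instanceof or class-name checks. */
    static boolean containsBrokenAnalysis(List<TokenFilterFactory> filters) {
        for (TokenFilterFactory filter : filters) {
            if (filter.breaksFastVectorHighlighter()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsBrokenAnalysis(Arrays.asList(wellBehaved("lowercase"))));
        System.out.println(containsBrokenAnalysis(
            Arrays.asList(wellBehaved("lowercase"), new BrokenOffsetsFactory())));
    }
}
```

With this shape the highlighter code does not need to know about concrete filter classes that live in another module, at the cost of making the hack part of the public interface, which is the concern raised above.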
abeyad
left a comment
LGTM
I just attempted a backport and it was fairly unclean. Any objections to me not backporting to 5.x and just leaving this master only?

+1 to leaving master only
Relates to #23331 (comment).
The goal is to (eventually) move all the analyzers in lucene-analyzers-common.jar into a module so the core ES jar doesn't have to depend on lucene-analyzers-common.jar. This would only affect the high level rest client and the transport client. The download would still include the jar, just in a different spot.

So the real question here is, "is this worth our time?" This would shave maybe 2 MB off of the high level rest client and transport client. lucene-analyzers-common.jar is 1.4 MB. I'm generously assuming that the code I extract from core will be 600 KB.

Some of the extraction is fairly easy - move files around, rig them up like plugins, etc. The effect on tests isn't super difficult, but means that the process takes time.

In the end the only analyzer available to tests in core would be standard because that is all that is in lucene-core. The standard tokenizer would be available in core as well. No token filters would be available though. Only the lowercase and mock token filters are available in lucene-core and we can't expose them in core because lowercase in Elasticsearch is linked with GreekLowerCaseFilter, IrishLowerCaseFilter, and TurkishLowerCaseFilter and mock is just for testing. Useful for writing tests, mind, but not a thing we can expose.

So I'm opening this up for discussion: should we do it?
My thoughts:
I'm quite happy to kill this if we decide it isn't a useful savings.