-
Notifications
You must be signed in to change notification settings - Fork 25.6k
add nori_number token filter in analysis-nori #53583
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Pinging @elastic/es-search (:Search/Analysis) |
|
Pinging @elastic/es-docs (>docs) |
jimczi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this filter danmuzi ! I left one comment regarding the discard_punctuation option of the tokenizer that was added to handle the number filter correctly.
|
Thanks for your review, @jimczi 👍 |
jimczi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left two minor comments but the change looks good to me.
But I'm not sure it's right to include the discard_punctuation option in this PR.
Because this PR is for nori_number token filter.
I think it's ok since the discard_punctuation option was added specifically for the number token filter. Let's add both in the same pr, thanks for separating the commits though.
plugins/analysis-nori/src/test/java/org/elasticsearch/index/analysis/NoriAnalysisTests.java
Outdated
Show resolved
Hide resolved
docs/plugins/analysis-nori.asciidoc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add add a NOTE: to emphasize this part ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
|
Thanks Jim. |
|
@elasticmachine ok to test |
3864137 to
1a4367e
Compare
|
I'm not sure why elasticsearch-ci/2 and elasticsearch-ci/bwc and elasticsearch-ci/default-distro are failed. |
|
Because of the my rebase mistake, the previous Jenkins build history has been lost in this conversation. After the rebase, all Jenkins builds passed. |
jimczi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @danmuzi !
|
Thanks for your kind reviews! @jimczi |
This change adds the `nori_number` token filter. It also adds a `discard_punctuation` option in nori_tokenizer that should be used in conjunction with the new filter.
The
KoreanNumberFilterhas included in Nori after Lucene 8.2.0. (LUCENE-8812)However, it isn't supported now in Nori Analysis plugin. (Kuromoji supports
kuromoji_number)It seems to be missing(#30397) because
KoreanNumberFilterdidn't exist at Lucene 7.4.0 that supports Nori first time.This PR is for that.