-
Notifications
You must be signed in to change notification settings - Fork 284
Add Thai word list from ICU BreakIterator dictionary #879
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hello @pavaris-pm! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2023-12-06 10:17:06 UTC |
|
Hello! Thank you for your pull request. Can you add filter the word that start with |
SPDX-License-Identifier: Unicode-DFS-2016
Sure. Did you mean add a parameters for user to control whether to return a corpus with the text starts with # or not right? by True if you want a returned corpus including words starts with #, and returned the corpus with filtered out word starts with # (no word start with # in corpus) otherwise. |
Yes 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once license info has moved to corpus_license.md AND the comment lines are properly discarded, I can merge this.
I think we can do this in Maybe add the boolean parameter Or, we can utilize the existing Python standard library |
@bact @wannaphong i already add comment filtering by adding a new parameters named |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for help me sorting it alphabetically krub 👍🏻
@bact @wannaphong I've made some experiment to test the |
wannaphong
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It look great for me.
Filename: icubrk_th.txt License: Unicode-DFS-2016
Also rename `thai_icu()` to `thai_icu_words()` to make it more explicit and consistent with others, like: `thai_orst_words()`
Change thai_icu to thai_icu_words
|
Kudos, SonarCloud Quality Gate passed!
|
| def get_corpus(filename: str, as_is: bool = False) -> Union[frozenset, list]: | ||
| def get_corpus(filename: str, | ||
| as_is: bool = False, | ||
| comments: bool = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed this to comments instead of discard_comments (as I suggested earlier) to avoid double negation.
The semantic now is:
- if
comments= True, then keep comments - if
comments= False, then discard comments
|
|
||
| if not comments: | ||
| # take only text before character '#' | ||
| lines = [line.split("#", 1)[0] for line in lines] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will allowed the comment to be at any position of the line.
bact
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved.
Few modification to get_corpus() to make the code more generic.
I have changed the module/function name to
corpus.icuinstead ofcorpus.thai_icu- to make the module name more genericthai_icu_wordsinstead ofthai_icu- to make the function name inline withthai_wordsandthai_orst_words
So when import, it will be like:
from pythainlp.corpus.icu import thai_icu_wordsNote: I will also after this rename the wikipedia (#869) and volubilis (#870) corpora as well, to make them more consistent:
So instead of having:
from pythainlp.corpus.volubilis import volubilis
from pythainlp.corpus.wikipedia_titles import wikipedia_titleswe should have:
from pythainlp.corpus.volubilis import thai_volubilis_words
from pythainlp.corpus.wikipedia import thai_wikipedia_titles|
Merged thank you. |








What does this changes
@wannaphong @bact from issue #877 since ICU are included to almost all web browser, i've added ICU dictionary to PyThaiNLP where file of ICU dictionary are named as
icubrk_th.txtand their python file to load the corpus are named asthai_icu.pykrub.Will resolve #877
Your checklist for this pull request
🚨Please review the guidelines for contributing to this repository.