Add Thai word list from ICU BreakIterator dictionary #879
Merged
17 commits
3ebf721 add thai ICU corpus (pavaris-pm)
6338646 fix pep8 (pavaris-pm)
7677cf9 Add SPDX tags to thai_icu.txt (bact)
4328b3a Sort imports in __init__.py (bact)
42f60c4 add comment filtering and update corpus license (pavaris-pm)
ee85e1b fix pep8 (pavaris-pm)
b2473a0 fix pep8 (trailing whitespaces) (pavaris-pm)
52ea875 fix bug in thai_icu (pavaris-pm)
0bed068 fix typo (pavaris-pm)
73378c6 Add more get_corpus docs (wannaphong)
bf7212a Add license for Thai dict from ICU (bact)
276be53 Rename thai_icu.txt to icubrk_th.txt (bact)
f8ccc3a Adjust comment discard method in get_corpus() (bact)
171214c Update and rename thai_icu.py to icu.py (bact)
5cf8809 Update __init__.py (bact)
3323b82 Update core.py (bact)
82322da Update test_corpus.py (bact)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -70,7 +70,10 @@ def path_pythainlp_corpus(filename: str) -> str: |
| | | return os.path.join(corpus_path(), filename) |
| | | |
| | | |
| | | def get_corpus(filename: str, as_is: bool = False) -> Union[frozenset, list]: |
| | | def get_corpus(filename: str, |
| | | as_is: bool = False, |
| | | comments: bool = True |
| | | ) -> Union[frozenset, list]: |
| | | """ |
| | | Read corpus data from file and return a frozenset or a list. |
| | | |
| | | @@ -82,8 +85,12 @@ def get_corpus(filename: str, as_is: bool = False) -> Union[frozenset, list]: |
| | | If as_is is True, a list will be return, with no modifications |
| | | in member values and their orders. |
| | | |
| | | If comments is False, any text at any position after the character |
| | | '#' in each line will be discarded. |
| | | |
| | | :param str filename: filename of the corpus to be read |
| | | :param bool as_is: no modification to the text, and return a list |
| | | :param bool comments: keep comments |
| | | |
| | | :return: :class:`frozenset` or :class:`list` consisting of lines in the file |
| | | :rtype: :class:`frozenset` or :class:`list` |
| | | @@ -93,26 +100,61 @@ def get_corpus(filename: str, as_is: bool = False) -> Union[frozenset, list]: |
| | | |
| | | from pythainlp.corpus import get_corpus |
| | | |
| | | get_corpus('negations_th.txt') |
| | | # input file (negations_th.txt): |
| | | # แต่ |
| | | # ไม่ |
| | | |
| | | get_corpus("negations_th.txt") |
| | | # output: |
| | | # frozenset({'แต่', 'ไม่'}) |
| | | |
| | | get_corpus('ttc_freq.txt') |
| | | get_corpus("negations_th.txt", as_is=True) |
| | | # output: |
| | | # ['แต่', 'ไม่'] |
| | | |
| | | # input file (ttc_freq.txt): |
| | | # ตัวบท<tab>10 |
| | | # โดยนัยนี้<tab>1 |
| | | |
| | | get_corpus("ttc_freq.txt") |
| | | # output: |
| | | # frozenset({'โดยนัยนี้\\t1', |
| | | # 'ตัวบท\\t10', |
| | | # 'หยิบยื่น\\t3', |
| | | # ...}) |
| | | |
| | | # input file (icubrk_th.txt): |
| | | # # Thai Dictionary for ICU BreakIterator |
| | | # กก |
| | | # กกขนาก |
| | | |
| | | get_corpus("icubrk_th.txt") |
| | | # output: |
| | | # frozenset({'กกขนาก', |
| | | # '# Thai Dictionary for ICU BreakIterator', |
| | | # 'กก', |
| | | # ...}) |
| | | |
| | | get_corpus("icubrk_th.txt", comments=False) |
| | | # output: |
| | | # frozenset({'กกขนาก', |
| | | # 'กก', |
| | | # ...}) |
| | | |
| | | """ |
| | | path = path_pythainlp_corpus(filename) |
| | | lines = [] |
| | | with open(path, "r", encoding="utf-8-sig") as fh: |
| | | lines = fh.read().splitlines() |
| | | |
| | | if not comments: |
| | | # take only text before character '#' |
| | | lines = [line.split("#", 1)[0] for line in lines] |
Member
This will allow the comment to be at any position in the line.
| | | |
| | | |
| | | if as_is: |
| | | return lines |
| | | |
| | | lines = [line.strip() for line in lines] |
| | | |
| | | return frozenset(filter(None, lines)) |
| | | |
| | | |
| | | |
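To see how the `comments` and `as_is` flags interact, here is a minimal sketch that applies the same post-read steps as the diff above to an in-memory list instead of a corpus file. The helper name `filter_corpus_lines` and the sample lines (including the inline note) are assumptions made for illustration; they are not part of the PR.

```python
from typing import FrozenSet, List, Union


def filter_corpus_lines(
    lines: List[str], as_is: bool = False, comments: bool = True
) -> Union[FrozenSet[str], List[str]]:
    """Hypothetical helper mirroring the post-read steps of get_corpus()."""
    if not comments:
        # Keep only the text before the first '#', wherever it appears.
        lines = [line.split("#", 1)[0] for line in lines]

    if as_is:
        # No stripping and no de-duplication; the original order is kept.
        return lines

    # Strip whitespace, then drop empty strings and duplicates.
    return frozenset(filter(None, (line.strip() for line in lines)))


# Made-up sample in the style of icubrk_th.txt, plus one inline comment.
sample = [
    "# Thai Dictionary for ICU BreakIterator",
    "กก",
    "กกขนาก  # inline note",
    "",
]

print(filter_corpus_lines(sample))
print(filter_corpus_lines(sample, comments=False))
print(filter_corpus_lines(sample, as_is=True, comments=False))
```

One consequence worth noting: with `as_is=True` and `comments=False`, a comment-only line comes back as an empty string and trailing whitespace before an inline `#` is kept, because stripping and empty-line filtering happen only on the `frozenset` path.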
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -0,0 +1,26 @@ |
| | | # -*- coding: utf-8 -*- |
| | | # SPDX-FileCopyrightText: Copyright 2016-2023 PyThaiNLP Project |
| | | # SPDX-License-Identifier: Apache-2.0 |
| | | """ |
| | | Provides an optional word list from International Components for Unicode (ICU) dictionary. |
| | | """ |
| | | from typing import FrozenSet |
| | | |
| | | from pythainlp.corpus.common import get_corpus |
| | | |
| | | |
| | | _THAI_ICU_FILENAME = "icubrk_th.txt" |
| | | |
| | | |
| | | def thai_icu_words() -> FrozenSet[str]: |
| | | """ |
| | | Return a frozenset of words from the Thai dictionary for BreakIterator of the |
| | | International Components for Unicode (ICU). |
| | | |
| | | :return: :class:`frozenset` containing `str` |
| | | :rtype: :class:`frozenset` |
| | | """ |
| | | |
| | | _WORDS = get_corpus(_THAI_ICU_FILENAME, comments=False) |
| | | |
| | | return _WORDS |
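As written in the diff, `thai_icu_words()` calls `get_corpus()` on every invocation, so the file is read again each time. If repeated reads ever mattered, one option would be to memoize the result. The sketch below illustrates that idea only; it is not what the PR does. It uses a temporary file as a stand-in for `icubrk_th.txt`, and the names `cached_icu_words`, `_SAMPLE` and `_SAMPLE_PATH` are hypothetical.

```python
import os
import tempfile
from functools import lru_cache
from typing import FrozenSet

# Made-up stand-in for the bundled icubrk_th.txt word list.
_SAMPLE = "# Thai Dictionary for ICU BreakIterator\nกก\nกกขนาก\n"


def _write_sample() -> str:
    """Write the sample word list to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(_SAMPLE)
    return path


_SAMPLE_PATH = _write_sample()


@lru_cache(maxsize=1)
def cached_icu_words() -> FrozenSet[str]:
    """Read the word list once; later calls reuse the same frozenset."""
    with open(_SAMPLE_PATH, "r", encoding="utf-8-sig") as fh:
        lines = fh.read().splitlines()
    # Same comment handling as get_corpus(..., comments=False).
    words = (line.split("#", 1)[0].strip() for line in lines)
    return frozenset(filter(None, words))


first = cached_icu_words()
second = cached_icu_words()  # served from the cache, no second file read
os.remove(_SAMPLE_PATH)      # the cached result survives file removal

print(first is second)
print("กก" in first)
```

Because a `frozenset` is immutable, handing the same cached object to every caller is safe; the trade-off is that the word list stays in memory for the life of the process.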
I have changed this to `comments` instead of `discard_comments` (as I suggested earlier) to avoid double negation.
The semantics now are:
`comments=True`: keep comments
`comments=False`: discard comments