Add Thai word list from ICU BreakIterator dictionary #879

pavaris-pm · 2023-12-05T05:32:07Z

What does this change

@wannaphong @bact Following issue #877: since ICU is included in almost all web browsers, I've added the ICU dictionary to PyThaiNLP. The dictionary file is named icubrk_th.txt and the Python file that loads the corpus is named thai_icu.py krub.

Will resolve #877

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

Passed code styles and structures
Passed code linting checks and unit test

pep8speaks · 2023-12-05T05:32:17Z

Hello @pavaris-pm! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2023-12-06 10:17:06 UTC

wannaphong · 2023-12-05T06:06:58Z

Hello! Thank you for your pull request. Can you filter out the words that start with #?

SPDX-License-Identifier: Unicode-DFS-2016

pavaris-pm · 2023-12-05T08:03:08Z

Sure. Do you mean adding a parameter that lets the user control whether the returned corpus includes entries that start with #? That is, True returns the corpus including words that start with #, and False returns the corpus with those words filtered out.

wannaphong · 2023-12-05T08:13:43Z

Yes 👍

bact

Once license info has moved to corpus_license.md AND the comment lines are properly discarded, I can merge this.

pythainlp/corpus/thai_icu.py

bact · 2023-12-05T11:44:03Z

I think we can do this in get_corpus().

Maybe add the boolean parameter discard_comments to get_corpus()?
The default is probably False.

Or we can use shlex from the Python standard library for this. shlex can ignore comment lines when it reads its input.

https://docs.python.org/3/library/shlex.html
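
For reference, a minimal sketch of the shlex option, using only the standard library. The sample text is made up for illustration and is not taken from icubrk_th.txt. One caveat: shlex also interprets quotes and backslashes, which may not be what a plain word list wants.

```python
import shlex

# Hypothetical sample in the shape of a word-list file:
# one word per line, with "#" starting a comment.
sample = "# header comment\nกา\nขา # trailing note\n"

# comments=True makes shlex drop everything from "#" to the end of the line.
words = shlex.split(sample, comments=True)
print(words)
```

The result is a plain list of tokens split on whitespace, so the caller would still need to wrap it in a frozenset if that is the expected return type.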

pavaris-pm · 2023-12-05T13:20:55Z

@bact @wannaphong I have added comment filtering through a new parameter named discard_comments, whose default value is False. You can review the code in the latest commit krub

pythainlp/corpus/corpus_license.md

pavaris-pm

Thanks for helping me sort it alphabetically krub 👍🏻

pavaris-pm · 2023-12-05T15:02:03Z

@bact @wannaphong I ran some experiments to test the discard_comments parameter and fixed some bugs in it. It now works as expected, so feel free to review it krub. It's done 💯

wannaphong

It looks great to me.

Filename: icubrk_th.txt License: Unicode-DFS-2016

Also rename `thai_icu()` to `thai_icu_words()` to make it more explicit and consistent with others, like: `thai_orst_words()`

Change thai_icu to thai_icu_words

sonarqubecloud · 2023-12-06T10:17:38Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
0.0% Duplication

bact · 2023-12-06T10:19:20Z

pythainlp/corpus/core.py

-def get_corpus(filename: str, as_is: bool = False) -> Union[frozenset, list]:
+def get_corpus(filename: str,
+               as_is: bool = False,
+               comments: bool = True


I have changed this to comments instead of discard_comments (as I suggested earlier) to avoid double negation.

The semantics are now:

if comments = True, then keep comments

if comments = False, then discard comments
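
As a rough illustration of these semantics, here is a sketch with a hypothetical helper named `read_words`. It is not the actual `get_corpus()` implementation, and the sample lines are invented.

```python
from typing import Iterable, List


def read_words(lines: Iterable[str], comments: bool = True) -> List[str]:
    """Hypothetical helper, not the real get_corpus()."""
    if not comments:
        # comments=False: keep only the text before the first "#"
        lines = [line.split("#", 1)[0] for line in lines]
    # Strip surrounding whitespace and drop lines that end up empty
    return [line.strip() for line in lines if line.strip()]


raw = ["# license header", "กา", "ขา  # note"]
kept = read_words(raw, comments=True)      # comment text is preserved
dropped = read_words(raw, comments=False)  # comment text is discarded
print(kept)
print(dropped)
```

Naming the flag for what is kept means a call like `comments=False` reads directly, without the reader having to resolve a negated name.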

bact · 2023-12-06T10:20:12Z

pythainlp/corpus/core.py


+    if not comments:
+        # take only text before character '#'
+        lines = [line.split("#", 1)[0] for line in lines]


This allows a comment to appear at any position in the line.
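
A small sketch of how that split behaves on a few line shapes. The sample lines are invented for illustration; the final strip-and-filter step is an assumption about what a caller would want, not something shown in the diff above.

```python
# Invented sample lines covering the cases of interest.
lines = ["# whole-line comment", "กา", "ขา # trailing", "no#space"]

# Same expression as in the diff: keep the text before the first "#".
before_hash = [line.split("#", 1)[0] for line in lines]

# A whole-line comment becomes an empty string and a trailing comment
# leaves trailing whitespace, so strip and drop empties afterwards.
words = [s.strip() for s in before_hash if s.strip()]
print(before_hash)
print(words)
```

The trade-off is that an entry which legitimately contains "#" would be truncated at that character.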

bact

Approved.

I made a few modifications to get_corpus() to make the code more generic.

I have changed the module and function names:

corpus.icu instead of corpus.thai_icu, to make the module name more generic
thai_icu_words instead of thai_icu, to bring the function name in line with thai_words and thai_orst_words

So the import will look like:

from pythainlp.corpus.icu import thai_icu_words

Note: After this, I will also rename the wikipedia (#869) and volubilis (#870) corpora to make them more consistent.

So instead of having:

from pythainlp.corpus.volubilis import volubilis
from pythainlp.corpus.wikipedia_titles import wikipedia_titles

we should have:

from pythainlp.corpus.volubilis import thai_volubilis_words
from pythainlp.corpus.wikipedia import thai_wikipedia_titles

bact · 2023-12-06T12:23:07Z

Merged. Thank you.

add thai ICU corpus

3ebf721

pavaris-pm mentioned this pull request Dec 5, 2023

Add ICU wordbreak dictionary (Thai) #877

Closed

fix pep8

6338646

Add SPDX tags to thai_icu.txt

7677cf9

SPDX-License-Identifier: Unicode-DFS-2016

bact added enhancement enhance functionalities corpus corpus/dataset-related issues labels Dec 5, 2023

Sort imports in __init__.py

4328b3a

bact requested changes Dec 5, 2023

pythainlp/corpus/thai_icu.py (outdated, resolved)

bact added this to the 5.0 milestone Dec 5, 2023

bact changed the title ~~add Thai ICU Dict into PyThaiNLP corpus~~ Add Thai ICU wordbreak dictionary to PyThaiNLP corpus Dec 5, 2023

add comment filtering and update corpus license

42f60c4

pavaris-pm commented Dec 5, 2023

pythainlp/corpus/corpus_license.md (outdated, resolved)

pavaris-pm commented Dec 5, 2023

pavaris-pm added 4 commits December 5, 2023 13:35

fix pep8

ee85e1b

fix pep8 (trailing whitespaces)

b2473a0

fix bug in thai_icu

52ea875

fix typo

0bed068

pavaris-pm requested a review from bact December 5, 2023 15:02

Add more get_corpus docs

73378c6

wannaphong approved these changes Dec 5, 2023

bact added 3 commits December 6, 2023 09:43

Add license for Thai dict from ICU

bf7212a

Filename: icubrk_th.txt License: Unicode-DFS-2016

Rename thai_icu.txt to icubrk_th.txt

276be53

Adjust comment discard method in get_corpus()

f8ccc3a

bact added 4 commits December 6, 2023 10:12

Update and rename thai_icu.py to icu.py

171214c

Also rename `thai_icu()` to `thai_icu_words()` to make it more explicit and consistent with others, like: `thai_orst_words()`

Update __init__.py

5cf8809

Change thai_icu to thai_icu_words

Update core.py

3323b82

Update test_corpus.py

82322da

bact reviewed Dec 6, 2023

bact approved these changes Dec 6, 2023

bact merged commit 297aadc into PyThaiNLP:dev Dec 6, 2023

bact changed the title ~~Add Thai ICU wordbreak dictionary to PyThaiNLP corpus~~ Add Thai word list from ICU BreakIterator dictionary Dec 15, 2023

bact mentioned this pull request Dec 15, 2023

PyThaiNLP 5.0 Change Log #788

Closed
