Add extra segmentation style for paragraph_tokenize function
#844
add segmentation style
add segmentation style
Paragraph segmentation style
fix segmentation style
Hello @pavaris-pm! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:
Kudos, SonarCloud Quality Gate passed!
wannaphong
left a comment
Thank you!

According to issue #843, this PR concerns the `wtpsplit` engine used in the `paragraph_tokenize` function. wtpsplit can adapt to the Universal Dependencies, OPUS100, or Ersatz corpus segmentation styles in many languages. As of 2023, it supports Thai in the `OPUS100` corpus style.

Since we both agreed on adding the segmentation style as an option, I've added `style` as a new argument of the `paragraph_tokenize` function.

Usage:

1. `paragraph_tokenize` with the default `paragraph_threshold=0.5` (the current version in PyThaiNLP).
2. `paragraph_tokenize` after adding the `style` argument:
   - The available styles are `newline` and `opus100` (as supported in wtpsplit).
   - `paragraph_threshold` is set to 0.5 in every call to show how the segmentation styles differ.
   - `paragraph_tokenize` with `style='newline'`, the default style in the current version of PyThaiNLP. In other words, this is the same as case 1.
   - `paragraph_tokenize` with `style="opus100"`, the newly added style, which the wtpsplit paper reports as supported for Thai. This lets the tokenizer adapt to the `OPUS100` style for segmentation.

Apart from the `style` argument, I also added a condition to handle the case where the given segmentation style is not one of the available styles; a `ValueError` is raised in that case.
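The style check described above could look roughly like the following. This is a minimal sketch, not the code from this PR: `SUPPORTED_STYLES` and `resolve_style` are hypothetical names introduced for illustration, and the error message wording is an assumption. It does not call PyThaiNLP or wtpsplit; it only shows the validation step that would run before segmentation.

```python
# Hypothetical helper illustrating the validation of a `style` argument.
# The two style names come from the PR description; everything else is assumed.
SUPPORTED_STYLES = ("newline", "opus100")


def resolve_style(style: str = "newline") -> str:
    """Return the style unchanged if it is supported, else raise ValueError."""
    if style not in SUPPORTED_STYLES:
        raise ValueError(
            f"Segmentation style '{style}' is not available; "
            f"choose one of {', '.join(SUPPORTED_STYLES)}."
        )
    return style


if __name__ == "__main__":
    # The default and the newly added style both pass the check.
    print(resolve_style())
    print(resolve_style("opus100"))

    # An unknown style is rejected before any segmentation would happen.
    try:
        resolve_style("ersatz")
    except ValueError as err:
        print(f"ValueError: {err}")
```

Validating the argument up front keeps the failure close to the caller's mistake, rather than letting an unsupported style name reach the underlying engine.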