Closed
Labels: enhancement
Description
Detailed description
Recently, wtpsplit was published, and it also supports multilingual sentence segmentation (with Thai being one of the languages). I'm not sure how it compares to the existing crfcut method, as I'm not familiar enough with Thai to check the segmentation results, but it might be a good addition.
Context
- We automatically create statistical text corpora from crawled web texts. "Sentence" is a common segmentation unit we use in this case, especially to deduplicate common sentences. Paragraphs (often containing multiple sentences) hinder deduplication efforts.
- Benefit: Another possible (robust?) option for segmenting Thai text into "sentences", since not all texts mark sentence boundaries with whitespace (e.g. social media?). More options seem better (if they work well) and might support different text genres and formats.
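To make the deduplication use case concrete, here is a minimal sketch of sentence-level deduplication with a pluggable segmenter. The names `dedupe_sentences` and `newline_split` are hypothetical and only for illustration; any callable that maps a text to a list of sentences (crfcut, wtpsplit, ...) could be passed in instead of the stand-in splitter.

```python
from typing import Callable, Iterable, Iterator, List


def dedupe_sentences(
    docs: Iterable[str],
    split: Callable[[str], List[str]],
) -> Iterator[str]:
    """Yield each distinct sentence once, in order of first appearance.

    `split` is any sentence segmenter (hypothetical interface:
    text in, list of sentence strings out).
    """
    seen = set()
    for doc in docs:
        for sentence in split(doc):
            # Collapse runs of whitespace so trivially different
            # copies of the same sentence compare equal.
            key = " ".join(sentence.split())
            if key and key not in seen:
                seen.add(key)
                yield key


def newline_split(text: str) -> List[str]:
    """Stand-in segmenter for the example: one sentence per line."""
    return text.splitlines()


if __name__ == "__main__":
    docs = ["a b\nc", "c\n a  b \nd"]
    for sentence in dedupe_sentences(docs, newline_split):
        print(sentence)
```

With paragraph-level units, the two documents above would not share any duplicate; with sentence-level units, the repeated lines are dropped.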
Possible implementation
- https://github.com/bminixhofer/wtpsplit
  - with language code th
  - different models seem to exist (small and large ones)
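A rough sketch of how the two options might sit side by side. It assumes the `WtP` class, the `wtp-bert-mini` model name, and the `lang_code` argument as described in the wtpsplit README at the time of writing (these may have changed in later releases), and PyThaiNLP's `sent_tokenize` with `engine="crfcut"`. The factory function names are hypothetical, and this is not a proposal for the final PyThaiNLP API.

```python
from typing import Callable, List


def make_wtp_splitter(
    model_name: str = "wtp-bert-mini",  # assumed model name; larger ones exist
    lang_code: str = "th",
) -> Callable[[str], List[str]]:
    """Return a function that splits text into sentences with wtpsplit."""
    # Imported lazily so this module loads even without wtpsplit installed.
    from wtpsplit import WtP

    wtp = WtP(model_name)  # downloads the model on first use

    def split(text: str) -> List[str]:
        return list(wtp.split(text, lang_code=lang_code))

    return split


def make_crfcut_splitter() -> Callable[[str], List[str]]:
    """Return a function that splits text with PyThaiNLP's crfcut engine."""
    from pythainlp.tokenize import sent_tokenize

    def split(text: str) -> List[str]:
        return sent_tokenize(text, engine="crfcut")

    return split


if __name__ == "__main__":
    text = "..."  # put a Thai paragraph here
    for name, factory in [("wtpsplit", make_wtp_splitter),
                          ("crfcut", make_crfcut_splitter)]:
        print(name, factory()(text))
```

Running both splitters on the same texts would give a quick way to compare their output across genres before deciding whether to add a new engine.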