Add sentence segmentation with 'wtpsplit' #803

@Querela

Detailed description

wtpsplit was recently published and supports multilingual sentence segmentation, with Thai among the supported languages. I don't know how it compares to the existing crfcut method, since I'm not familiar enough with Thai to judge segmentation results, but it might be a good addition.

Context

  • We automatically create statistical text corpora from crawled web texts. The sentence is a common segmentation unit for us, especially for deduplicating frequently repeated sentences. Paragraphs, which often contain multiple sentences, hinder deduplication.
  • Benefit: another possible (robust?) option for segmenting Thai text into sentences, since not all texts mark sentence boundaries with whitespace (e.g. social media?). More options seem better, provided they work well, and might cover different text genres and formats.

Possible implementation

Labels: enhancement