Closed
Labels
enhancement (enhance functionalities)
Description
A common normalize function that does more than reorder Thai characters. Something that can be used quickly for matching, searching, sorting, and preparing data for classification tasks.
Some ideas:
- Remove non-visible characters, like zero-width characters (see #373, "Add a function to remove zero-width characters")
- Remove unnecessary spaces
- Normalize repetitions
- Normalize "obvious" mistakes, like:
  - consonant + tone mark A + tone mark B (maybe we can remove tone mark A)
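To make the ideas above concrete, here is a rough sketch of what such a function could look like. Everything in it is an assumption for discussion, not an existing API: the name `normalize_sketch`, the list of zero-width characters (illustrative, not exhaustive), the rule of collapsing three or more identical Thai characters down to one, and the rule of keeping only the last tone mark in a run.

```python
import re

# Assumption: an illustrative, non-exhaustive set of invisible characters
# (ZWSP, ZWNJ, ZWJ, BOM / zero-width no-break space).
ZERO_WIDTH = "\u200b\u200c\u200d\ufeff"
# Thai tone marks: mai ek, mai tho, mai tri, mai chattawa (U+0E48..U+0E4B).
TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"

_ZERO_WIDTH_RE = re.compile(f"[{ZERO_WIDTH}]")
# One or more tone marks that are followed by another tone mark,
# i.e. everything in a run except the last mark.
_TONE_RUN_RE = re.compile(f"[{TONE_MARKS}]+(?=[{TONE_MARKS}])")
# Assumption: 3+ repeats of the same Thai character count as a "repetition".
# Restricted to U+0E01..U+0E4E so that numbers like "1000" are left alone.
_REPEAT_RE = re.compile(r"([\u0e01-\u0e4e])\1{2,}")
_SPACES_RE = re.compile(r"\s+")


def normalize_sketch(text: str) -> str:
    """Hypothetical normalizer covering the ideas listed in this issue."""
    text = _ZERO_WIDTH_RE.sub("", text)       # remove non-visible characters
    text = _TONE_RUN_RE.sub("", text)         # tone mark A + B -> keep B
    text = _REPEAT_RE.sub(r"\1", text)        # collapse long repetitions
    text = _SPACES_RE.sub(" ", text).strip()  # remove unnecessary spaces
    return text
```

Note that this sketch collapses all whitespace (including newlines) to a single space, which may be too aggressive for some uses. Doubled characters are kept as-is, since two identical consonants in a row can be legitimate.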
Note that for Unicode normalization, Python already has unicodedata.normalize().
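For reference, a small illustration of what `unicodedata.normalize()` covers and what it leaves untouched. The helper name `nfkc` is just for this example. NFKC folds compatibility forms and composes combining sequences, but a zero-width space passes through unchanged, so it would not replace the clean-up steps proposed here.

```python
import unicodedata


def nfkc(text: str) -> str:
    # Thin wrapper around the standard library call, for illustration only.
    return unicodedata.normalize("NFKC", text)


# Fullwidth Latin letters are folded to their ASCII counterparts.
fullwidth = nfkc("\uff21\uff22")   # "ＡＢ" -> "AB"
# A base letter plus combining acute accent is composed into one code point.
composed = nfkc("e\u0301")         # -> "\u00e9"
# SARA AM (U+0E33) is decomposed into NIKHAHIT + SARA AA under NFKC.
sara_am = nfkc("\u0e33")           # -> "\u0e4d\u0e32"
# A zero-width space has no decomposition, so it is left in place.
zwsp = nfkc("ก\u200bข")
```

The SARA AM case is worth keeping in mind: applying NFKC changes how that vowel is encoded, which may or may not be what a Thai-specific normalizer wants.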