Add a more advanced normalize function #374

@bact

A common normalize function that does more than reorder Thai characters: something that can be used quickly for matching, searching, sorting, and preparing data for classification tasks.

Some ideas:

  • Remove non-visible characters, like zero-width chars (see "Add a function to remove zero-width characters", #373)
  • Remove unnecessary spaces
  • Normalize repetitions
  • Normalize "obvious" mistakes like
    • consonant + tonemark A + tonemark B <--- maybe we can remove tonemark A
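
To make the ideas above concrete, here is a minimal sketch in Python. It is not an existing PyThaiNLP API: the function name `normalize_sketch`, the `max_repeat` parameter, and the exact set of zero-width characters are all assumptions for illustration. "Normalize repetitions" is interpreted here as capping runs of the same character, which is only one possible reading.

```python
import re

# Assumed set of invisible characters to strip (ZWSP, ZWNJ, ZWJ, word joiner, BOM).
ZERO_WIDTH = "\u200b\u200c\u200d\u2060\ufeff"
_ZW_TABLE = str.maketrans("", "", ZERO_WIDTH)

# Thai tone marks: mai ek, mai tho, mai tri, mai chattawa (U+0E48..U+0E4B).
TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"

# Runs of spaces/tabs, to be collapsed into a single space.
_RE_SPACES = re.compile(r"[ \t]+")
# One or more tone marks that are followed by another tone mark.
_RE_EXTRA_TONES = re.compile(f"[{TONE_MARKS}]+(?=[{TONE_MARKS}])")


def normalize_sketch(text: str, max_repeat: int = 3) -> str:
    """Hypothetical normalizer covering the ideas listed in this issue."""
    # 1. Remove zero-width characters.
    text = text.translate(_ZW_TABLE)
    # 2. Collapse unnecessary spaces and trim both ends.
    text = _RE_SPACES.sub(" ", text).strip()
    # 3. For stacked tone marks, keep only the last one (drop "tonemark A").
    text = _RE_EXTRA_TONES.sub("", text)
    # 4. Cap runs of the same character at max_repeat (assumed policy).
    pattern = r"(.)\1{%d,}" % max_repeat
    text = re.sub(pattern, lambda m: m.group(1) * max_repeat, text)
    return text
```

Keeping the last tone mark follows the suggestion in the list above; whether that is always the right choice is still open. Stripping ZWJ/ZWNJ may be too aggressive for text that mixes in other scripts, so the character set would probably need to be configurable.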

Note that for Unicode normalization, Python already has unicodedata.normalize().
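
For reference, a short sketch of what `unicodedata.normalize()` does and does not cover. The helper name `unicode_forms` is hypothetical; only the standard-library call is real. It suggests why the extra steps proposed in this issue would still be needed on top of Unicode normalization.

```python
import unicodedata


def unicode_forms(text: str) -> dict:
    """Return the four standard Unicode normalization forms of text."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


# NFC composes "e" + combining acute accent into a single code point.
composed = unicodedata.normalize("NFC", "e\u0301")

# NFKC decomposes Thai SARA AM (U+0E33) into NIKHAHIT + SARA AA,
# which may or may not be desirable for matching Thai text.
sara_am = unicodedata.normalize("NFKC", "\u0e33")

# Zero-width space is left untouched by Unicode normalization.
with_zwsp = unicodedata.normalize("NFKC", "a\u200bb")
```

Note that the compatibility forms (NFKC/NFKD) change how SARA AM is encoded, so a Thai-specific normalizer would need to decide deliberately which form, if any, to apply.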


Labels: enhancement (enhance functionalities)
