
Question: newmm tokenizer, why not just thai characters?  #855

@konbraphat51

Description


Hi, I have a question about the implementation of the newmm tokenizer.

https://github.com/PyThaiNLP/pythainlp/blob/e3a01772f1dbe578e81119214d85226c0cbde466/pythainlp/tokenize/newmm.py#L38C1-L46C2

Here, why not simply permit only Thai characters?

I am having trouble with punctuation marks sometimes being included in the tokens.
E.g. "ถ้าไม่รังเกียจสีหน้า(รถ)" -> ถ้า / ไม่รังเกียจ / สีหน้า / (รถ)  // "รถ" is in the dictionary used

Also, if this is "dictionary-based maximal matching word segmentation", why didn't it take just "รถ"?
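For context, here is a minimal sketch of plain dictionary-based longest matching. This is NOT PyThaiNLP's newmm implementation (newmm also applies Thai Character Cluster rules and handles non-Thai runs separately); it is only a hypothetical illustration of the behavior I expected, where "รถ" would come out as its own token and the parentheses as separate tokens:

```python
# Hypothetical illustration, not PyThaiNLP's newmm algorithm.
def longest_match_tokenize(text, dictionary):
    """Greedy longest-match segmentation against a word set."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest dictionary word starting at position i.
        match = None
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            # Unknown character: emit it as a single-character token.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

dictionary = {"รถ", "สีหน้า"}
print(longest_match_tokenize("สีหน้า(รถ)", dictionary))
# → ['สีหน้า', '(', 'รถ', ')']
```

Under this naive scheme the parentheses would never be merged with "รถ", which is why the actual output "(รถ)" surprised me.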

Metadata

Labels: bug (bugs in the library)
Status: In progress