
Question: newmm tokenizer, why not just thai characters?  #855

@konbraphat51

Description


Hi, I have a question about the implementation of the newmm tokenizer.

https://github.com/PyThaiNLP/pythainlp/blob/e3a01772f1dbe578e81119214d85226c0cbde466/pythainlp/tokenize/newmm.py#L38C1-L46C2

Here, why not simply permit only Thai characters?

I am having trouble with punctuation marks sometimes being included in the tokens.
E.g. "ถ้าไม่รังเกียจสีหน้า(รถ)" -> ถ้า / ไม่รังเกียจ / สีหน้า / (รถ)  // "รถ" is in the dictionary used

Also, if this is "dictionary-based maximal matching word segmentation", why didn't it take just "รถ"?
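For context, here is a minimal sketch of plain dictionary-based longest matching. This is NOT PyThaiNLP's newmm implementation (newmm also applies Thai Character Cluster rules and handles non-Thai runs separately); it is only a hypothetical illustration of the behavior I expected, where "รถ" would come out as its own token and the parentheses as separate tokens:

```python
# Hypothetical illustration, not PyThaiNLP's newmm algorithm.
def longest_match_tokenize(text, dictionary):
    """Greedy longest-match segmentation against a word set."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest dictionary word starting at position i.
        match = None
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            # Unknown character: emit it as a single-character token.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

dictionary = {"รถ", "สีหน้า"}
print(longest_match_tokenize("สีหน้า(รถ)", dictionary))
# → ['สีหน้า', '(', 'รถ', ')']
```

Under this naive scheme the parentheses would never be merged with "รถ", which is why the actual output "(รถ)" surprised me.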

Metadata

Labels: bug (bugs in the library)
Status: In progress