Problem with syllable tokenization

**Describe the bug**
I've observed a behavior that is worth discussing here. In short, when the text contains punctuation, the syllable tokenizers return some incorrect syllables, with both the default engine and `ssg`.


<img width="517" alt="image" src="https://user-images.githubusercontent.com/1214890/89706026-820af500-d962-11ea-8e21-15461a2f7b7c.png">

**To Reproduce**
Please see: https://colab.research.google.com/drive/12gxSmskjHCQzqV1-Nb4IOaBD-LJ0ARl5?usp=sharing

**Expected behavior**
IMHO, the expected result is `['หน้า', 'ที่', ' ', '19', '...']`. To achieve this, we could split the sentence on punctuation first, then run syllable tokenization on each part.
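A rough sketch of that idea, to make the proposal concrete. This is not PyThaiNLP's implementation: `tokenize_with_punctuation` and its `syllable_tokenizer` parameter are hypothetical names, and the separator set (whitespace plus ASCII punctuation) is an assumption. The actual syllable tokenizer is passed in as a callable, so either engine could be plugged in.

```python
import re
import string
from typing import Callable, List

# Assumption: separators are runs of whitespace or runs of ASCII punctuation.
# The capture group makes re.split() keep the separators in its output.
_SEPARATORS = re.compile(r"(\s+|[" + re.escape(string.punctuation) + r"]+)")


def tokenize_with_punctuation(
    text: str, syllable_tokenizer: Callable[[str], List[str]]
) -> List[str]:
    """Split on punctuation/whitespace first, then syllable-tokenize each part."""
    tokens: List[str] = []
    for part in _SEPARATORS.split(text):
        if not part:
            continue  # re.split() can yield empty strings at the edges
        if _SEPARATORS.fullmatch(part):
            tokens.append(part)  # keep the separator run as a single token
        else:
            tokens.extend(syllable_tokenizer(part))
    return tokens
```

With this shape the syllable tokenizer only ever sees text that contains no punctuation or spaces. One trade-off: a run such as `19.5` would be split at the dot, so numbers may need special handling.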


What do you think?

Problem with syllable tokenization #461
