This repository was archived by the owner on Sep 10, 2025. It is now read-only.

MosesTokenizer has been moved out of NLTK due to licensing issues #306

@alvations


@jekbradbury great work here!

Due to nltk/nltk#2000, we had to remove MosesTokenizer from NLTK, but it's now hosted at https://github.com/alvations/sacremoses

pip install sacremoses

The silver lining is that the package comes with the data needed for tokenization so there's no need to keep the nltk_data directory =)
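For reference, a minimal sketch of what usage looks like once the package is installed. The helper name `moses_roundtrip` is only for illustration and is not part of sacremoses; it assumes the `MosesTokenizer` and `MosesDetokenizer` classes that sacremoses exports.

```python
def moses_roundtrip(text, lang="en"):
    """Tokenize `text` with sacremoses, then detokenize it again.

    `moses_roundtrip` is a hypothetical helper for this example only.
    """
    # Imported lazily so sacremoses is only required when the helper is called.
    from sacremoses import MosesTokenizer, MosesDetokenizer

    # tokenize() returns a list of token strings by default.
    tokens = MosesTokenizer(lang=lang).tokenize(text)
    # detokenize() joins a list of tokens back into a single string.
    restored = MosesDetokenizer(lang=lang).detokenize(tokens)
    return tokens, restored


if __name__ == "__main__":
    tokens, restored = moses_roundtrip("Hello, world! This is a test.")
    print(tokens)
    print(restored)
```

No `nltk.download(...)` step is involved, since the data ships with the package.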


I would propose adding sacremoses in addition to NLTK, because NLTK still has another port of a nice tokenizer (by @jonsafari) that people overlook: https://github.com/nltk/nltk/blob/develop/nltk/tokenize/toktok.py (I think it's fast too)
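Roughly what I have in mind, as a sketch only: a lookup that offers both tokenizers by name. The function name `get_tokenizer` and the keys `"split"`, `"moses"` and `"toktok"` are assumptions for illustration, not this repository's actual API. Imports are done lazily so neither sacremoses nor NLTK is required unless that tokenizer is requested.

```python
def get_tokenizer(name="split"):
    """Return a callable mapping a string to a list of tokens.

    Hypothetical dispatcher; names and keys are illustrative assumptions.
    """
    if name == "split":
        # Plain whitespace splitting, no third-party dependency.
        return str.split
    if name == "moses":
        # Needs `pip install sacremoses`; the tokenizer data ships with it.
        from sacremoses import MosesTokenizer
        return MosesTokenizer(lang="en").tokenize
    if name == "toktok":
        # Needs NLTK; ToktokTokenizer lives in nltk/tokenize/toktok.py.
        from nltk.tokenize.toktok import ToktokTokenizer
        return ToktokTokenizer().tokenize
    raise ValueError(f"Unknown tokenizer: {name!r}")


if __name__ == "__main__":
    tokenize = get_tokenizer("split")
    print(tokenize("sacremoses and toktok side by side"))
```

Keeping both behind one lookup means people can switch between the Moses and toktok tokenizers without touching the rest of their pipeline.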
