Add Character Level BPE Tokenizer (#1936) #1946
Conversation
Summary:
Pull Request resolved: #1936
This change adds a character-level BPE tokenizer to the set of available transforms. It takes a pre-trained encoder dict (i.e. a vocab dict) and a merge list as input. It does not use C++ for encoding/decoding at this time.
Reviewed By: langong347
Differential Revision: D40186470
fbshipit-source-id: 48bacc631f537e941a495e39ef9ccb17d3ef7896
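The summary above describes a character-level BPE tokenizer driven by a pre-trained vocab dict and merge list. A minimal Python sketch of that encoding scheme is shown below; the function name, toy vocab, and merge list are illustrative assumptions, not the PR's actual API.

```python
# Hypothetical character-level BPE encoder, assuming a pre-trained
# vocab (token -> id) and an ordered merge list as inputs, as the
# PR summary describes. Names here are illustrative only.

def bpe_encode(word, merges, vocab):
    """Encode one word into token ids by greedily applying learned merges."""
    # Start from individual characters.
    tokens = list(word)
    # Rank merges by their position in the pre-trained merge list
    # (earlier merges were learned first and take priority).
    ranks = {pair: i for i, pair in enumerate(merges)}
    while len(tokens) > 1:
        # Find the adjacent pair with the lowest (earliest) merge rank.
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies to any remaining pair
        # Merge every occurrence of the best-ranked pair.
        merged = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    # Map final subword tokens to ids via the pre-trained vocab.
    return [vocab[t] for t in tokens]

# Toy example: two merges learned in order ('l','o'), then ('lo','w').
merges = [("l", "o"), ("lo", "w")]
vocab = {"l": 0, "o": 1, "w": 2, "lo": 3, "low": 4}
print(bpe_encode("low", merges, vocab))  # [4]
```

With this toy merge list, "low" collapses character by character ('l'+'o' → 'lo', then 'lo'+'w' → 'low') into a single vocab id, while unseen pairs are left as individual character tokens.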
LGTM
@joecummings should we still be seeing these CI test failures now that #1945 has been merged into main?
@joecummings @Nayef211 I can't reproduce the unit test errors locally. Is this fine to land?
Not yet - let me look into these failures. How was this PR opened?
I just kicked off another rerun for all the CI jobs. Let's see if they succeed the 2nd time around.
@rshraga looks like the docs build is failing because we need to add Can you also add
docs/source/transforms.rst
Outdated
.. automethod:: forward

CharBPETokenizer
-------------
Looks like CI is still failing because the underline isn't the same length as the title: https://app.circleci.com/pipelines/github/pytorch/text/6981/workflows/1b6e4090-be21-49a9-bca7-84714a050c11/jobs/241539?invite=true#step-105-146
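For reference, reStructuredText requires a section underline to be at least as long as the title text, which is why the docs build fails on the snippet above. A sketch of the fix (title name taken from the snippet; surrounding directives omitted):

```rst
CharBPETokenizer
----------------
```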
We also need to add
Summary: Add regex as a new dependency, as it is needed by torchtext: pytorch/text#1946. Test workflow: https://github.com/pytorch/benchmark/actions/runs/3284313972 Pull Request resolved: #1253 Reviewed By: davidberard98 Differential Revision: D40497588 Pulled By: xuzhao9 fbshipit-source-id: 6b936ceda26af61f2fd57dc366cd2703efe3ef57