Add Character Level BPE Tokenizer (#1936) #1946
Conversation
Summary:
Pull Request resolved: #1936
This change adds a character-level BPE tokenizer to the set of available transforms. It takes a pre-trained encoder dict (i.e. a vocab dict) and a merge list as input. It does not use C++ for encoding/decoding at this time.
Reviewed By: langong347
Differential Revision: D40186470
fbshipit-source-id: 48bacc631f537e941a495e39ef9ccb17d3ef7896
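The summary above describes a character-level BPE tokenizer driven by a pre-trained vocab dict and merge list. A minimal Python sketch of that encoding scheme is shown below; the function name, toy vocab, and merge list are illustrative assumptions, not the PR's actual API.

```python
# Hypothetical character-level BPE encoder, assuming a pre-trained
# vocab (token -> id) and an ordered merge list as inputs, as the
# PR summary describes. Names here are illustrative only.

def bpe_encode(word, merges, vocab):
    """Encode one word into token ids by greedily applying learned merges."""
    # Start from individual characters.
    tokens = list(word)
    # Rank merges by their position in the pre-trained merge list
    # (earlier merges were learned first and take priority).
    ranks = {pair: i for i, pair in enumerate(merges)}
    while len(tokens) > 1:
        # Find the adjacent pair with the lowest (earliest) merge rank.
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies to any remaining pair
        # Merge every occurrence of the best-ranked pair.
        merged = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    # Map final subword tokens to ids via the pre-trained vocab.
    return [vocab[t] for t in tokens]

# Toy example: two merges learned in order ('l','o'), then ('lo','w').
merges = [("l", "o"), ("lo", "w")]
vocab = {"l": 0, "o": 1, "w": 2, "lo": 3, "low": 4}
print(bpe_encode("low", merges, vocab))  # [4]
```

With this toy merge list, "low" collapses character by character ('l'+'o' → 'lo', then 'lo'+'w' → 'low') into a single vocab id, while unseen pairs are left as individual character tokens.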
LGTM
@joecummings should we still be seeing these CI test failures now that #1945 has been merged into main?
@joecummings @Nayef211 I can't reproduce the unit test errors locally. Is this fine to land?
Not yet - let me look into these failures. How was this PR opened?
I just kicked off another rerun for all the CI jobs. Let's see if they succeed the 2nd time around.
@rshraga looks like the docs build is failing because we need to add Can you also add
docs/source/transforms.rst
Outdated
.. automethod:: forward

CharBPETokenizer
-------------
Looks like CI is still failing because the underline isn't the same length as the title: https://app.circleci.com/pipelines/github/pytorch/text/6981/workflows/1b6e4090-be21-49a9-bca7-84714a050c11/jobs/241539?invite=true#step-105-146
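For reference, reStructuredText requires a section underline to be at least as long as the title text, which is why the docs build fails on the snippet above. A sketch of the fix (title name taken from the snippet; surrounding directives omitted):

```rst
CharBPETokenizer
----------------
```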
We also need to add
Summary: Add regex as a new dependency, as it is needed by torchtext: pytorch/text#1946. Test workflow: https://github.com/pytorch/benchmark/actions/runs/3284313972 Pull Request resolved: #1253 Reviewed By: davidberard98 Differential Revision: D40497588 Pulled By: xuzhao9 fbshipit-source-id: 6b936ceda26af61f2fd57dc366cd2703efe3ef57