Skip to content
This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@rshraga
Copy link
Contributor

@rshraga rshraga commented Oct 17, 2022

Summary:
Pull Request resolved: #1936

This change adds a character level BPE tokenizer to the set of available transforms. It takes a pre-trained encoder dict (i.e vocab dict) and merge list as input. It is not using C++ for encoding / decoding at this time.

Reviewed By: langong347

Differential Revision: D40186470

fbshipit-source-id: 48bacc631f537e941a495e39ef9ccb17d3ef7896

Summary:
Pull Request resolved: #1936

This change adds a character level BPE tokenizer to the set of available transforms. It takes a pre-trained encoder dict (i.e vocab dict) and merge list as input. It is not using C++ for encoding / decoding at this time.

Reviewed By: langong347

Differential Revision: D40186470

fbshipit-source-id: 48bacc631f537e941a495e39ef9ccb17d3ef7896
Copy link
Contributor

@Nayef211 Nayef211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Nayef211
Copy link
Contributor

@joecummings should we still be seeing these CI test failures now that #1945 has been merge into main?

@rshraga
Copy link
Contributor Author

rshraga commented Oct 17, 2022

@joecummings @Nayef211 I can't reproduce the unit test errors locally. Is this fine to land?

@joecummings
Copy link
Member

@joecummings @Nayef211 I can't reproduce the unit test errors locally. Is this fine to land?

Not yet - let me look into these failures. How was this PR opened?

@Nayef211
Copy link
Contributor

@joecummings @Nayef211 I can't reproduce the unit test errors locally. Is this fine to land?

I just kicked off another rerun for all the CI jobs. Let's see if they succeed the 2nd time around.

Not yet - let me look into these failures. How was this PR opened?
I believe @rshraga cherrypicked the following commit from the fbsync branch to the main branch

@Nayef211
Copy link
Contributor

Nayef211 commented Oct 17, 2022

@rshraga looks like the docs build is failing because we need to add regex as into the docs/requirements.txt file. I expect this dependency to be removed when you move the underlying regex implementation to a C++ operator.

Can you also add CharBPETokenizer to docs/source/transforms.rst so that the new transform can show up in the published torchtext docs.

.. automethod:: forward

CharBPETokenizer
-------------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Nayef211
Copy link
Contributor

We also need to add regex as a runtime dependency here fix all the conda build jobs

@rshraga rshraga merged commit 5eb33ce into main Oct 18, 2022
facebook-github-bot pushed a commit to pytorch/benchmark that referenced this pull request Oct 19, 2022
Summary:
Add regex as a new dependency as it is needed by torchtext: pytorch/text#1946

Test workflow:
https://github.com/pytorch/benchmark/actions/runs/3284313972

Pull Request resolved: #1253

Reviewed By: davidberard98

Differential Revision: D40497588

Pulled By: xuzhao9

fbshipit-source-id: 6b936ceda26af61f2fd57dc366cd2703efe3ef57
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants