
Conversation

@reachsumit
Contributor

Description

Add add_special_tokens feature to GPT2BPETokenizer

This change addresses a requirement posted internally. It adds a new function, add_special_tokens, to GPT2BPETokenizer so that the user can specify a dictionary of tokens that should be left intact during tokenization. Any newly specified token is also added to the vocabulary.

The change also adds the capability to decode token IDs back to text. This addition follows from the internal discussion.
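For context, a minimal usage sketch of the two additions (asset paths, the special-tokens dictionary format, and the exact method signatures are illustrative assumptions, not taken verbatim from this PR):

```python
from torchtext.transforms import GPT2BPETokenizer

# Asset paths are placeholders; point these at local copies of the
# GPT-2 BPE encoder/vocab files.
tokenizer = GPT2BPETokenizer(
    encoder_json_path="gpt2_bpe_encoder.json",
    vocab_bpe_path="gpt2_bpe_vocab.bpe",
)

# Register tokens that should be left intact during tokenization.
# The dictionary format shown here is illustrative; any token not
# already in the vocabulary is appended to it.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|endofprompt|>"]})

ids = tokenizer("hello world <|endofprompt|>")  # tokenize to token IDs
text = tokenizer.decode(ids)                    # new in this PR: IDs -> text
```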

Types of changes

[x] New feature (non-breaking change which adds core functionality)

Changes made

  • Made changes to the C++ code (GPT2BPEEncoder class) to support the decode functionality.
  • Updated the torch bindings and Python bindings as required.
  • Added a decode function to the Python interface (GPT2BPETokenizer class).
  • Added a unit test for the decoding process (a sketch follows this list).
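A rough sketch of what the decoding round-trip test could look like (the class name, asset paths, and sample string are hypothetical, not copied from the PR's test suite):

```python
import unittest

from torchtext.transforms import GPT2BPETokenizer


class TestGPT2BPEDecode(unittest.TestCase):
    def test_decode_round_trip(self):
        # Asset paths are placeholders for locally available BPE files.
        tokenizer = GPT2BPETokenizer(
            encoder_json_path="gpt2_bpe_encoder.json",
            vocab_bpe_path="gpt2_bpe_vocab.bpe",
        )
        text = "Hello World!, how are you?"
        ids = tokenizer(text)
        # Decoding the token IDs should reproduce the original string.
        self.assertEqual(tokenizer.decode(ids), text)


if __name__ == "__main__":
    unittest.main()
```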

Testing

  • No issues identified by pre-commit.
  • No issues identified with any of the unit tests.

Member

@joecummings left a comment


This looks good to me. Thanks for getting this up so quickly. cc @Nayef211 for another approval.

Contributor

@Nayef211 left a comment


LGTM!

@Nayef211 Nayef211 merged commit 258a356 into pytorch:main Oct 3, 2022