
Conversation

@reachsumit
Contributor

Description

Add add_special_tokens feature to GPT2BPETokenizer

This change addresses a requirement posted internally. It adds a new function, add_special_tokens, to GPT2BPETokenizer so that the user can specify a dictionary of tokens that should be left intact during tokenization. Any newly specified token is also added to the vocabulary.

The change also adds the capability to decode token IDs back to text. This addition follows from the internal discussion.
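For context, a minimal usage sketch of the two additions (asset paths, the special-tokens dictionary format, and the exact method signatures are illustrative assumptions, not taken verbatim from this PR):

```python
from torchtext.transforms import GPT2BPETokenizer

# Asset paths are placeholders; point these at local copies of the
# GPT-2 BPE encoder/vocab files.
tokenizer = GPT2BPETokenizer(
    encoder_json_path="gpt2_bpe_encoder.json",
    vocab_bpe_path="gpt2_bpe_vocab.bpe",
)

# Register tokens that should be left intact during tokenization.
# The dictionary format shown here is illustrative; any token not
# already in the vocabulary is appended to it.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|endofprompt|>"]})

ids = tokenizer("hello world <|endofprompt|>")  # tokenize to token IDs
text = tokenizer.decode(ids)                    # new in this PR: IDs -> text
```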

Types of changes

[x] New feature (non-breaking change which adds core functionality)

Changes made

  • Made changes to the C++ code (GPT2BPEEncoder class) to support the decode functionality.
  • Updated the torch bindings and Python bindings as required.
  • Added a decode function to the Python interface (GPT2BPETokenizer class).
  • Added a unit test for the decoding process (a sketch follows this list).
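A rough sketch of what the decoding round-trip test could look like (the class name, asset paths, and sample string are hypothetical, not copied from the PR's test suite):

```python
import unittest

from torchtext.transforms import GPT2BPETokenizer


class TestGPT2BPEDecode(unittest.TestCase):
    def test_decode_round_trip(self):
        # Asset paths are placeholders for locally available BPE files.
        tokenizer = GPT2BPETokenizer(
            encoder_json_path="gpt2_bpe_encoder.json",
            vocab_bpe_path="gpt2_bpe_vocab.bpe",
        )
        text = "Hello World!, how are you?"
        ids = tokenizer(text)
        # Decoding the token IDs should reproduce the original string.
        self.assertEqual(tokenizer.decode(ids), text)


if __name__ == "__main__":
    unittest.main()
```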

Testing

  • No issues identified by pre-commit.
  • No issues identified with any of the unit tests.

Member

@joecummings left a comment


This looks good to me. Thanks for getting this up so quickly. cc @Nayef211 for another approval.

Contributor

@Nayef211 left a comment


LGTM!

@Nayef211 Nayef211 merged commit 258a356 into pytorch:main Oct 3, 2022