This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@reachsumit
Contributor

Description

Update decoding logic to handle special tokens

This change is a follow-up to the recently added add_special_tokens and decode features of GPT2BPETokenizer.

The change adds the capability to decode special token ids back to text.

Types of changes

[x] New feature (non-breaking change which adds core functionality)

Changes made

  • Made changes to C++ code (GPT2BPEEncoder class) to update the decode functionality for special tokens.
  • Added unit test to test the new decoding process.

Testing

  • No issues identified in pre-commit checks.
  • No issues identified with unit tests.

Contributor

@Nayef211 Nayef211 left a comment


Overall changes LGTM. I would also wait for a stamp from @abhinavarora before merging this.


// fix left space(s) for special tokens
if (special_token_flags[tok_idx] == true &&
(tok_idx > 0 && special_token_flags[tok_idx - 1] == false)) {
Contributor


Just want to double check, does && special_token_flags[tok_idx - 1] == false only ensure that we append a space if the token before the special token is a regular token? If so, why do we do this?

Contributor Author


It ensures that we insert a space only if the previous token wasn't special; if the previous token was special, a space would already have been appended after it (through the code at line 438). This avoids an extra space when two special tokens are next to each other.

@joecummings
Member

@Nayef211 @reachsumit What is the status of this? Can we rebase, merge and let Jin's team know they have this capability now?

@Nayef211
Contributor

> @Nayef211 @reachsumit What is the status of this? Can we rebase, merge and let Jin's team know they have this capability now?

Changes looked fine to me but wanted a review from @abhinavarora as well to make sure I didn't miss anything!

@reachsumit
Contributor Author

@joecummings I believe no additional PR is required to support Jin's team's use case. Once this PR is approved and merged, we should be good to go.


std::string GPT2BPEEncoder::Decode(const std::vector<int64_t>& tokens) {
std::string text;
std::vector<bool> special_token_flags(tokens.size());
Contributor


Creating a vector seems to be overkill. From the logic, it seems we should be good with 2 bools. Is this correct, or am I missing something?

Contributor

@abhinavarora abhinavarora left a comment


Left a few comments. @reachsumit could you take a look? Logic looks fine to me.

@reachsumit reachsumit force-pushed the add_special_tok_decoding branch from ac53d38 to 866d826 on October 15, 2022 02:01
@reachsumit
Contributor Author

Rebased on latest main to fix conflicts, and addressed recent comments.

  • Added examples for adding back spaces while decoding.
  • Replaced the boolean array with two boolean variables.

// get output character from byte decoder for each wide character
unsigned char uchr = byte_decoder_.at(converter.to_bytes(wchr));
decoded_token.push_back(uchr);
is_current_special = false;
Contributor


How about setting this to false once at the beginning of the loop?

Contributor

@abhinavarora abhinavarora left a comment


Overall LGTM; left a minor comment.

@Nayef211 Nayef211 merged commit 238b342 into pytorch:main Oct 17, 2022