This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@reachsumit
Contributor

Description

Update decoding logic to handle special tokens

This change is a follow-up to the recently added add_special_tokens and decode features of GPT2BPETokenizer.

The change adds the capability to decode special token ids back to text.

Types of changes

[x] New feature (non-breaking change which adds core functionality)

Changes made

  • Made changes to C++ code (GPT2BPEEncoder class) to update the decode functionality for special tokens.
  • Added unit test to test the new decoding process.

Testing

  • No issues identified in pre-commit checks.
  • No issues identified with unit tests.

Contributor

@Nayef211 Nayef211 left a comment


Overall changes LGTM. I would also wait for a stamp from @abhinavarora before merging this.


// fix left space(s) for special tokens
if (special_token_flags[tok_idx] == true &&
(tok_idx > 0 && special_token_flags[tok_idx - 1] == false)) {
Contributor


Just want to double check, does && special_token_flags[tok_idx - 1] == false only ensure that we append a space if the token before the special token is a regular token? If so, why do we do this?

Contributor Author


It ensures that we insert a space only if the previous token wasn't special; if the previous token was special, a space would already have been appended after it (through the code at line 438). This avoids an extra space when two special tokens are next to each other.

@joecummings
Member

@Nayef211 @reachsumit What is the status of this? Can we rebase, merge and let Jin's team know they have this capability now?

@Nayef211
Contributor

> @Nayef211 @reachsumit What is the status of this? Can we rebase, merge and let Jin's team know they have this capability now?

Changes looked fine to me but wanted a review from @abhinavarora as well to make sure I didn't miss anything!

@reachsumit
Contributor Author

@joecummings I believe no additional PR is required to support Jin's team's use case. Once this PR is approved and merged, we should be good to go.


std::string GPT2BPEEncoder::Decode(const std::vector<int64_t>& tokens) {
std::string text;
std::vector<bool> special_token_flags(tokens.size());
Contributor


Creating a vector seems to be overkill. From the logic, it seems we should be good with 2 bools. Is this correct, or am I missing something?

Contributor

@abhinavarora abhinavarora left a comment


Left a few comments. @reachsumit could you take a look? Logic looks fine to me.

@reachsumit reachsumit force-pushed the add_special_tok_decoding branch from ac53d38 to 866d826 on October 15, 2022 02:01
@reachsumit
Contributor Author

Rebased on latest main to fix conflicts, and addressed recent comments.

  • Added examples for adding back spaces while decoding.
  • Replaced the boolean array with two boolean variables.

// get output character from byte decoder for each wide character
unsigned char uchr = byte_decoder_.at(converter.to_bytes(wchr));
decoded_token.push_back(uchr);
is_current_special = false;
Contributor


How about setting this to false once at the beginning of the loop?

Contributor

@abhinavarora abhinavarora left a comment


Overall LGTM; left a minor comment.

@Nayef211 Nayef211 merged commit 238b342 into pytorch:main Oct 17, 2022