Update decoding logic to handle special tokens #1925
Conversation
Nayef211 left a comment
Overall changes LGTM. I would also wait for a stamp from @abhinavarora before merging this.
// fix left space(s) for special tokens
if (special_token_flags[tok_idx] == true &&
    (tok_idx > 0 && special_token_flags[tok_idx - 1] == false)) {
Just want to double check, does && special_token_flags[tok_idx - 1] == false only ensure that we append a space if the token before the special token is a regular token? If so, why do we do this?
It ensures that we insert a space only if the last token wasn't special, because if the last token was special, a space would already have been added after it (through the code at line 438). So we avoid inserting an extra space in the case where two special tokens are adjacent.
@Nayef211 @reachsumit What is the status of this? Can we rebase, merge and let Jin's team know they have this capability now?
Changes looked fine to me but wanted a review from @abhinavarora as well to make sure I didn't miss anything!
@joecummings I believe there is no additional PR required to support Jin's team's use case. After this PR is approved and merged, we should be good to go.
std::string GPT2BPEEncoder::Decode(const std::vector<int64_t>& tokens) {
  std::string text;
  std::vector<bool> special_token_flags(tokens.size());
Creating a vector seems to be overkill. From the logic, it seems we should be fine with two bools. Is this correct, or am I missing something?
abhinavarora left a comment
Left a few comments. @reachsumit, could you take a look? The logic looks fine to me.
Force-pushed ac53d38 to 866d826
Rebased on latest
// get output character from byte decoder for each wide character
unsigned char uchr = byte_decoder_.at(converter.to_bytes(wchr));
decoded_token.push_back(uchr);
is_current_special = false;
How about setting this as false once at the beginning of the loop?
abhinavarora left a comment
Overall LGTM, left a minor comment.
Description
Update decoding logic to handle special tokens
This change is a follow-up to the recently added add_special_tokens and decode features of GPT2BPETokenizer. The change adds the capability to decode special token ids back to text.
Types of changes
[x] New feature (non-breaking change which adds core functionality)
Changes made
GPT2BPEEncoder class) to update the decode functionality for special tokens.
Testing
pre-commit