Commit d0aa2c2
Address the feedback on the tokenizer's library (#7024)
* Fix cache when calling EncodeToIds
* Make EnglishRoberta _mergeRanks thread safe
* Delete Trainer
* Remove the setters on the Bpe properties
* Remove Roberta and Tiktoken special casing in the Tokenizer and support the cases in the Model abstraction
* Support text-embedding-3-small/large embedding
* Remove redundant TokenToId abstraction and keep the one with the extra parameters
* Enable creating Tiktoken asynchronously or directly using the tokenizer data
* Add cancellationToken support in CreateAsync APIs
* Rename sequence to text and Tokenize to Encode
* Rename skipSpecialTokens to considerSpecialTokens
* Rename TokenizerResult to EncodingResult
* Make Token publicly immutable
* Change offset tuples from (Index, End) to (Index, Length)
* Rename NormalizedString method's parameters
* Rename Model's methods to start with verb
* Convert Model.GetVocab() method to a Vocab property
* Rename some method parameters and variables
* Remove Vocab and VocabSize from the abstraction
* Cleanup normalization support
* Minor Bpe cleanup
* Resolve rebase change
* Address the feedback
1 parent 4b89d98 commit d0aa2c2
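The offset change listed in the commit message, from (Index, End) to (Index, Length), can be illustrated with a small language-neutral sketch. This is not the library's API: the helper name `end_to_length` and the sample offsets are hypothetical, and the sketch assumes the old `End` value was exclusive.

```python
from typing import List, Tuple


def end_to_length(offsets: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Convert (index, end) pairs to (index, length) pairs.

    Assumes `end` is exclusive, so length = end - index.
    """
    return [(index, end - index) for index, end in offsets]


text = "Hello world"

# Hypothetical token offsets in the old (Index, End) form.
old_offsets = [(0, 5), (6, 11)]

# The same spans in the new (Index, Length) form.
new_offsets = end_to_length(old_offsets)

# With (Index, Length), a token's text is recovered by slicing
# from index to index + length.
tokens = [text[index:index + length] for index, length in new_offsets]
```

Callers that previously read the second tuple element as an end position would need to add the index to it under the new form.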
File tree
31 files changed, +838 -6033 lines changed
- src
  - Microsoft.ML.Tokenizers
    - Model
    - Normalizer
    - PreTokenizer
    - Utils
  - Microsoft.ML.TorchSharp
    - Extensions
    - NasBert
    - Roberta
- test/Microsoft.ML.Tokenizers.Tests
  - Data
Lines changed: 7 additions & 7 deletions
Changed lines (each a one-line replacement): 14, 17, 23, 50, 124, 127, 141