Phi-4 uses the Tiktoken tokenizer (100k vocab). From the model's description:

> we now use the tiktoken tokenizer (for better multilingual support) with a padded vocabulary size of 100,352 (including unused tokens)

Consider adding it as an option to the encoding map so that a tokenizer for Phi-4 is easier to create.
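In the meantime, a tokenizer can be built by naming the underlying encoding directly. This is only a sketch: it assumes the `TiktokenTokenizer.CreateForEncoding` factory, and it produces plain cl100k_base without Phi-4's padded vocabulary or model-specific special tokens.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Workaround sketch: create a cl100k_base tokenizer by encoding name instead of
// model name. Note this does not include Phi-4's padded vocabulary (100,352) or
// any model-specific special tokens.
TiktokenTokenizer tokenizer = TiktokenTokenizer.CreateForEncoding("cl100k_base");

IReadOnlyList<int> ids = tokenizer.EncodeToIds("Hello from Phi-4!");
Console.WriteLine(string.Join(", ", ids));   // token ids
Console.WriteLine(tokenizer.Decode(ids));    // round-trips back to the input text
```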
The relevant map in machinelearning/src/Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs, lines 1025 to 1035 at 01c4164:
```csharp
private static readonly (string Prefix, ModelEncoding Encoding)[] _modelPrefixToEncoding =
[
    // chat
    ( "o1-", ModelEncoding.O200kBase ),       // e.g. o1-mini
    ( "gpt-4o-", ModelEncoding.O200kBase ),   // e.g., gpt-4o-2024-05-13
    ( "gpt-4-", ModelEncoding.Cl100kBase ),   // e.g., gpt-4-0314, etc., plus gpt-4-32k
    ( "gpt-3.5-", ModelEncoding.Cl100kBase ), // e.g, gpt-3.5-turbo-0301, -0401, etc.
    ( "gpt-35-", ModelEncoding.Cl100kBase )   // Azure deployment name
];

private static readonly Dictionary<string, ModelEncoding> _modelToEncoding =
```
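Below is a self-contained sketch of what resolving `"phi-4"` could look like once an entry exists. The enum, field names, and lookup order only mimic the snippet above and are not the library's actual internals; the `phi-4` entries are the hypothetical addition this issue is requesting.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: mimics the prefix/exact-name lookup style shown above,
// with hypothetical "phi-4" entries mapped to cl100k_base.
enum ModelEncoding { Cl100kBase, O200kBase }

static class EncodingMapSketch
{
    private static readonly (string Prefix, ModelEncoding Encoding)[] s_prefixToEncoding =
    [
        ( "gpt-4o-", ModelEncoding.O200kBase ),
        ( "gpt-4-", ModelEncoding.Cl100kBase ),
        ( "phi-4-", ModelEncoding.Cl100kBase )   // hypothetical: versioned Phi-4 names
    ];

    private static readonly Dictionary<string, ModelEncoding> s_modelToEncoding = new()
    {
        ["gpt-4"] = ModelEncoding.Cl100kBase,
        ["phi-4"] = ModelEncoding.Cl100kBase     // hypothetical: exact model name
    };

    public static ModelEncoding Resolve(string modelName)
    {
        // Exact model name first, then longest-prefix style fallback.
        if (s_modelToEncoding.TryGetValue(modelName, out ModelEncoding encoding))
        {
            return encoding;
        }

        foreach ((string prefix, ModelEncoding enc) in s_prefixToEncoding)
        {
            if (modelName.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            {
                return enc;
            }
        }

        throw new NotSupportedException($"No encoding registered for model '{modelName}'.");
    }
}
```

With an entry like that in place, a call such as `TiktokenTokenizer.CreateForModel("phi-4")` could resolve to the cl100k_base data the same way the GPT model names do.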