
Add Phi-4 to Tiktoken encoding map #7337

@luisquintanilla

Description


Phi-4 uses the Tiktoken tokenizer (100K vocab).

From the Phi-4 technical report (arXiv:2412.08905v1):

we now use the tiktoken tokenizer (for better multilingual support) with a padded vocabulary size of 100,352 (including unused tokens)

Consider adding it as an option to the encoding map so the tokenizer is easier to create.
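To illustrate the intended outcome, here is a small usage sketch (not part of the issue): it assumes the existing TiktokenTokenizer.CreateForModel factory would resolve "phi-4" through the encoding map once the entry exists.

using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Hypothetical usage once "phi-4" is registered in the encoding map.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("phi-4");
IReadOnlyList<int> ids = tokenizer.EncodeToIds("Hello, Phi-4!");
Console.WriteLine($"{ids.Count} tokens");

Without a mapping, callers have to know the underlying encoding and create the tokenizer directly (or supply the vocabulary by hand), which is the friction this request addresses. For reference, the current prefix-to-encoding map in the library: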

private static readonly (string Prefix, ModelEncoding Encoding)[] _modelPrefixToEncoding =
[
    // chat
    ( "o1-", ModelEncoding.O200kBase ),       // e.g., o1-mini
    ( "gpt-4o-", ModelEncoding.O200kBase ),   // e.g., gpt-4o-2024-05-13
    ( "gpt-4-", ModelEncoding.Cl100kBase ),   // e.g., gpt-4-0314, etc., plus gpt-4-32k
    ( "gpt-3.5-", ModelEncoding.Cl100kBase ), // e.g., gpt-3.5-turbo-0301, -0401, etc.
    ( "gpt-35-", ModelEncoding.Cl100kBase )   // Azure deployment name
];

private static readonly Dictionary<string, ModelEncoding> _modelToEncoding = // ... (exact model-name map, truncated here)
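A minimal sketch of the proposed addition, not the final implementation: it assumes Phi-4's padded 100,352-token vocabulary resolves to the existing Cl100kBase encoding, and the dictionary initializer shape is illustrative (only the "phi-4" entry is new).

// Proposed (illustrative) addition; assumes the 100,352 padded vocab is cl100k-based.
private static readonly Dictionary<string, ModelEncoding> _modelToEncoding =
    new(StringComparer.OrdinalIgnoreCase)
    {
        // ... existing exact model names ...

        // phi (proposed)
        { "phi-4", ModelEncoding.Cl100kBase }
    };

An exact-name entry may be safer than a "phi-4" prefix in _modelPrefixToEncoding, since future Phi-4 variants are not guaranteed to share the same tokenizer.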
