Phi-4 uses the Tiktoken tokenizer (100k vocab). From the model's description:

> we now use the tiktoken tokenizer (for better multilingual support) with a padded vocabulary size of 100,352 (including unused tokens)

Consider adding it as an option to the encoding map so that a tokenizer for Phi-4 is easier to create.
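In the meantime, a tokenizer can be built by naming the underlying encoding directly. This is only a sketch: it assumes the `TiktokenTokenizer.CreateForEncoding` factory, and it produces plain cl100k_base without Phi-4's padded vocabulary or model-specific special tokens.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Workaround sketch: create a cl100k_base tokenizer by encoding name instead of
// model name. Note this does not include Phi-4's padded vocabulary (100,352) or
// any model-specific special tokens.
TiktokenTokenizer tokenizer = TiktokenTokenizer.CreateForEncoding("cl100k_base");

IReadOnlyList<int> ids = tokenizer.EncodeToIds("Hello from Phi-4!");
Console.WriteLine(string.Join(", ", ids));   // token ids
Console.WriteLine(tokenizer.Decode(ids));    // round-trips back to the input text
```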
The relevant map in machinelearning/src/Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs, lines 1025 to 1035 at 01c4164:
```csharp
private static readonly (string Prefix, ModelEncoding Encoding)[] _modelPrefixToEncoding =
[
    // chat
    ( "o1-", ModelEncoding.O200kBase ),       // e.g. o1-mini
    ( "gpt-4o-", ModelEncoding.O200kBase ),   // e.g., gpt-4o-2024-05-13
    ( "gpt-4-", ModelEncoding.Cl100kBase ),   // e.g., gpt-4-0314, etc., plus gpt-4-32k
    ( "gpt-3.5-", ModelEncoding.Cl100kBase ), // e.g, gpt-3.5-turbo-0301, -0401, etc.
    ( "gpt-35-", ModelEncoding.Cl100kBase )   // Azure deployment name
];

private static readonly Dictionary<string, ModelEncoding> _modelToEncoding =
```
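Below is a self-contained sketch of what resolving `"phi-4"` could look like once an entry exists. The enum, field names, and lookup order only mimic the snippet above and are not the library's actual internals; the `phi-4` entries are the hypothetical addition this issue is requesting.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: mimics the prefix/exact-name lookup style shown above,
// with hypothetical "phi-4" entries mapped to cl100k_base.
enum ModelEncoding { Cl100kBase, O200kBase }

static class EncodingMapSketch
{
    private static readonly (string Prefix, ModelEncoding Encoding)[] s_prefixToEncoding =
    [
        ( "gpt-4o-", ModelEncoding.O200kBase ),
        ( "gpt-4-", ModelEncoding.Cl100kBase ),
        ( "phi-4-", ModelEncoding.Cl100kBase )   // hypothetical: versioned Phi-4 names
    ];

    private static readonly Dictionary<string, ModelEncoding> s_modelToEncoding = new()
    {
        ["gpt-4"] = ModelEncoding.Cl100kBase,
        ["phi-4"] = ModelEncoding.Cl100kBase     // hypothetical: exact model name
    };

    public static ModelEncoding Resolve(string modelName)
    {
        // Exact model name first, then longest-prefix style fallback.
        if (s_modelToEncoding.TryGetValue(modelName, out ModelEncoding encoding))
        {
            return encoding;
        }

        foreach ((string prefix, ModelEncoding enc) in s_prefixToEncoding)
        {
            if (modelName.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            {
                return enc;
            }
        }

        throw new NotSupportedException($"No encoding registered for model '{modelName}'.");
    }
}
```

With an entry like that in place, a call such as `TiktokenTokenizer.CreateForModel("phi-4")` could resolve to the cl100k_base data the same way the GPT model names do.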