
[Docs] Clarify Whitespace tokenizer behaviour with tokens longer than 255 characters #26641

@cbuescher

Description

Currently the Whitespace tokenizer splits tokens longer than 255 characters into separate tokens by default. This is surprising to some users (see #26601 for an example of why this can be confusing). We should document this better (other tokenizers like the Standard tokenizer have some explanation for this in the docs, where we talk about the `max_token_length` parameter). Maybe we should also check that other tokenizers exhibiting this behaviour have a small note in the docs.
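
For illustration, a minimal sketch of the behaviour described above, using the `_analyze` API from Python. The host URL and the use of the `requests` library are assumptions for the example, not part of this issue:

```python
# Sketch: observe the whitespace tokenizer's default 255-character split
# via the _analyze API (host URL is an assumption).
import requests

# A single 300-character "word" with no whitespace anywhere.
long_token = "a" * 300

resp = requests.post(
    "http://localhost:9200/_analyze",
    json={"tokenizer": "whitespace", "text": long_token},
)
tokens = resp.json()["tokens"]

# With the default limit of 255 characters, the 300-character input is
# expected to come back as two tokens (255 + 45) rather than one.
for t in tokens:
    print(len(t["token"]), t["start_offset"], t["end_offset"])
```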
