
[Docs] Clarify Whitespace tokenizer behaviour with tokens longer than 255 characters #26641

@cbuescher

Description

Currently the Whitespace tokenizer splits tokens longer than 255 characters into separate tokens by default. This is surprising to some users (see #26601 for an example of why this can be confusing). We should document this better (other tokenizers like the Standard tokenizer have some explanation for this in the docs, where we talk about the `max_token_length` parameter). Maybe we should also check that other tokenizers exhibiting this behaviour have a small note in the docs.
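
For illustration, a minimal sketch of the behaviour described above, using the `_analyze` API from Python. The host URL and the use of the `requests` library are assumptions for the example, not part of this issue:

```python
# Sketch: observe the whitespace tokenizer's default 255-character split
# via the _analyze API (host URL is an assumption).
import requests

# A single 300-character "word" with no whitespace anywhere.
long_token = "a" * 300

resp = requests.post(
    "http://localhost:9200/_analyze",
    json={"tokenizer": "whitespace", "text": long_token},
)
tokens = resp.json()["tokens"]

# With the default limit of 255 characters, the 300-character input is
# expected to come back as two tokens (255 + 45) rather than one.
for t in tokens:
    print(len(t["token"]), t["start_offset"], t["end_offset"])
```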
