Currently the Whitespace tokenizer splits tokens longer than 255 characters into separate tokens by default. This is surprising to some users (see #26601 for an example of why this can be confusing). We should document this better; other tokenizers like the Standard tokenizer already explain it in the docs where the `max_token_length` parameter is described. We should probably also check that other tokenizers exhibiting this behaviour have a short note about it in their docs.
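
For reference, the behaviour is easy to reproduce with the `_analyze` API. Below is a minimal sketch in Python using `requests`, assuming a node running locally at `http://localhost:9200`; the 300-character test string and variable names are just illustrative:

```python
# Reproduction sketch: whitespace tokenizer splitting an over-long token.
# Assumes a local node at http://localhost:9200.
import requests

# A single "word" of 300 characters, above the 255-character default limit.
long_token = "a" * 300

resp = requests.post(
    "http://localhost:9200/_analyze",
    json={"tokenizer": "whitespace", "text": long_token},
)
resp.raise_for_status()

tokens = [t["token"] for t in resp.json()["tokens"]]
# On a default setup this prints two lengths (e.g. [255, 45]) rather than
# a single 300, because the tokenizer splits at the default max token length.
print([len(t) for t in tokens])
```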