`word_to_chars()` doesn't work as expected for Llama3.1-8b

### System Info

- `transformers` version: 4.45.1
- Platform: Linux-5.15.0-92-generic-x86_64-with-glibc2.35
- Python version: 3.12.4
- Huggingface_hub version: 0.25.1
- Safetensors version: 0.4.3
- Accelerate version: 0.32.1
- Accelerate config:    not found
- PyTorch version (GPU?): 2.3.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: NVIDIA RTX 6000 Ada Generation

### Who can help?

@ArthurZucker @itazap 

### Information

- [ ] The official example scripts
- [X] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)

### Reproduction

I am trying to recover each "word", as defined by `word_ids()`, by looking up its character span with `word_to_chars()`.

```Python
from transformers import AutoTokenizer

# Load the Llama 3.1 tokenizer
model_name = "meta-llama/Meta-Llama-3.1-8B"
this_tokenizer = AutoTokenizer.from_pretrained(model_name)

this_sent = "Hello World!"
this_encode = this_tokenizer.encode_plus(this_sent)
# Ask for the character span of the first word (word index 0)
print(this_encode.word_to_chars(0))
```

The output is:
```Python
CharSpan(start=0, end=0)
```

This does not happen with other models such as BERT:
```Python
from transformers import AutoTokenizer

# Same steps as above, with the BERT tokenizer for comparison
model_name = "bert-base-uncased"
this_tokenizer = AutoTokenizer.from_pretrained(model_name)

this_sent = "Hello World!"
this_encode = this_tokenizer.encode_plus(this_sent)
print(this_encode.word_to_chars(0))
```

The output is:
```Python
CharSpan(start=0, end=5)
```
The word "Hello" can then be extracted with `this_sent[0:5]`. Could this be related to the tokenizer? I have tried BERT, RoBERTa, GPT-2, and Qwen2.5 so far, and none of them showed this problem.

For Llama models, I have tried llama3-8b, llama3.1-8b, llama3.2-1b, and llama3.2-3b, and none of them returned the correct span.
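
As a possible cross-check, the spans can also be derived without `word_to_chars()` by merging token offsets per word id. The sketch below uses a hypothetical helper (`word_spans_from_offsets` is not part of `transformers`). It assumes its inputs come from `encoding.word_ids()` and from `encoding["offset_mapping"]` after calling the tokenizer with `return_offsets_mapping=True`. The sample values are hand-written to mirror the BERT case above, not captured from a run, and whether the Llama tokenizers report correct offsets is not verified here.

```Python
def word_spans_from_offsets(word_ids, offsets):
    """Merge per-token (start, end) offsets into one (start, end) span per word id.

    Hypothetical helper for illustration; not a transformers API.
    """
    spans = {}
    for word_id, (start, end) in zip(word_ids, offsets):
        if word_id is None:
            # Special tokens (e.g. [CLS], [SEP], BOS) carry no word id
            continue
        if word_id in spans:
            prev_start, prev_end = spans[word_id]
            spans[word_id] = (min(prev_start, start), max(prev_end, end))
        else:
            spans[word_id] = (start, end)
    return spans


# Hand-written inputs shaped like a BERT encoding of "Hello World!" (assumed values)
this_sent = "Hello World!"
word_ids = [None, 0, 1, 2, None]
offsets = [(0, 0), (0, 5), (6, 11), (11, 12), (0, 0)]

spans = word_spans_from_offsets(word_ids, offsets)
first_start, first_end = spans[0]
print(this_sent[first_start:first_end])
```

If the spans computed this way look right for a Llama tokenizer while `word_to_chars()` still returns an empty span, that would suggest the problem is in the word-to-character lookup rather than in the offsets themselves.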


### Expected behavior

`word_to_chars()` should return the correct character span for Llama models, as it does for the other tokenizers listed above.
