
apply_chat_template method not working correctly for llama 3 tokenizer #33091

@sirluk

Description

System Info

  • transformers version: 4.44.1
  • Platform: Linux-4.18.0-553.8.1.el8_10.x86_64-x86_64-with-glibc2.28
  • Python version: 3.10.14
  • Huggingface_hub version: 0.24.5
  • Safetensors version: 0.4.4
  • Accelerate version: 0.33.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

@ArthurZucker
I noticed that the apply_chat_template for the PreTrainedTokenizerBase class does not work correctly when return_assistant_tokens_mask=True. We would expect to get back a list of indices for each example where 1 indicates the token is part of an assistant message and 0 otherwise. This is the case for the Llama 2 tokenizer for example. I am sharing a minimal example to reproduce this issue.

Looking deeper into the apply_chat_template method, the issue seems to be related to the char_to_token method of the tokenizers.Encoding class, and could be related to the fact that the Llama 3 tokenizer was trained with tiktoken as opposed to sentencepiece.
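
For reference, char_to_token is supposed to map a character position in the rendered chat string back to a token index. A minimal illustration of the expected behavior (gpt2 is used here purely as an example of a fast tokenizer, any fast tokenizer would do):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
enc = tok("### assistant:\nhello world", add_special_tokens=False)
# Map the character position of "hello" (char 15) back to its token index.
# In the failing case described below, this kind of lookup returns None instead.
print(enc[0].char_to_token(15))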

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer
from datasets import load_dataset

dataset_name = "m-a-p/Code-Feedback"

model_name = "meta-llama/Meta-Llama-3.1-8B" # apply_chat_template does not work correctly
#model_name = "meta-llama/Llama-2-7b-hf" # apply_chat_template works correctly

chat_template = """{% if messages[0]['role'] == 'system' %}
    {% set offset = 1 %}
{% else %}
    {% set offset = 0 %}
{% endif %}

{% for message in messages %}
    {% if (message['role'] == 'user') != (loop.index0 % 2 == offset) %}
        {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
    {% endif %}

    {{ '### ' + message['role'] + ':\n'}}
    {% if (message['role'] == 'assistant') %}
        {% generation %} {{ message['content'] | trim + eos_token }} {% endgeneration %}
    {% else %}
        {{ message['content'] | trim + eos_token }}
    {% endif %}

{% endfor %}

{% if add_generation_prompt %}
    {{ '### ' + 'assistant' + ':\n' }}
{% endif %}"""

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.chat_template = chat_template
datasets = load_dataset(dataset_name, trust_remote_code=True)

# assistant_mask is all zeros for llama3 tokenizer
chat = tokenizer.apply_chat_template(
    datasets["train"][0]["messages"],
    add_generation_prompt=False,
    return_dict=True,
    tokenize=True,
    return_assistant_tokens_mask=True
)
print("assistant_masks", chat["assistant_masks"])

Executing the steps apply_chat_template uses to build the assistant mask shows that the char_to_token method of the tokenizers.Encoding class does not seem to work correctly:

compiled_template = tokenizer._compile_jinja_template(chat_template)
template_kwargs = {**tokenizer.special_tokens_map}
rendered_chat, generation_indices = tokenizer._render_with_assistant_indices(
    compiled_template=compiled_template,
    messages=datasets["train"][0]["messages"],
    tools=[],
    documents=None,
    add_generation_prompt=False,
    **template_kwargs
)
out = tokenizer(
    rendered_chat,
    padding=False,
    truncation=False,
    max_length=None,
    add_special_tokens=False,
    return_tensors=None
)
first_assistant_start_char, first_assistant_end_char = generation_indices[0]
# returns None for the Llama 3 tokenizer (expected: the token index of the first assistant character)
print("char_to_token", out.char_to_token(0, first_assistant_start_char))

Expected behavior

If we assume the tokenized chat is 10 tokens long and the assistant tokens occupy positions 4-6 and 8-9 (counting from 1), the expected output would look like this:
[0, 0, 0, 1, 1, 1, 0, 1, 1, 0]
The actual output for the Llama 3 tokenizer is always all zeros:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
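
The same expectation expressed as a tiny check (the spans below are the hypothetical 1-based positions from the example above):

expected_len = 10
assistant_positions = [(4, 6), (8, 9)]  # inclusive, 1-based positions
mask = [0] * expected_len
for start, end in assistant_positions:
    for pos in range(start, end + 1):
        mask[pos - 1] = 1
print(mask)  # [0, 0, 0, 1, 1, 1, 0, 1, 1, 0]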
