
apply_chat_template method not working correctly for llama 3 tokenizer #33091

@sirluk

Description

System Info

  • transformers version: 4.44.1
  • Platform: Linux-4.18.0-553.8.1.el8_10.x86_64-x86_64-with-glibc2.28
  • Python version: 3.10.14
  • Huggingface_hub version: 0.24.5
  • Safetensors version: 0.4.4
  • Accelerate version: 0.33.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

@ArthurZucker
I noticed that the apply_chat_template for the PreTrainedTokenizerBase class does not work correctly when return_assistant_tokens_mask=True. We would expect to get back a list of indices for each example where 1 indicates the token is part of an assistant message and 0 otherwise. This is the case for the Llama 2 tokenizer for example. I am sharing a minimal example to reproduce this issue.

Looking deeper into the apply_chat_template method, the issue seems to be related to the char_to_token method of the tokenizers.Encoding class, and could be related to the fact that the Llama 3 tokenizer was trained with tiktoken as opposed to sentencepiece.
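
For reference, char_to_token is supposed to map a character position in the rendered chat string back to a token index. A minimal illustration of the expected behavior (gpt2 is used here purely as an example of a fast tokenizer, any fast tokenizer would do):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
enc = tok("### assistant:\nhello world", add_special_tokens=False)
# Map the character position of "hello" (char 15) back to its token index.
# In the failing case described below, this kind of lookup returns None instead.
print(enc[0].char_to_token(15))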

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer
from datasets import load_dataset

dataset_name = "m-a-p/Code-Feedback"

model_name = "meta-llama/Meta-Llama-3.1-8B" # apply_chat_template does not work correctly
#model_name = "meta-llama/Llama-2-7b-hf" # apply_chat_template works correctly

chat_template = """{% if messages[0]['role'] == 'system' %}
    {% set offset = 1 %}
{% else %}
    {% set offset = 0 %}
{% endif %}

{% for message in messages %}
    {% if (message['role'] == 'user') != (loop.index0 % 2 == offset) %}
        {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
    {% endif %}

    {{ '### ' + message['role'] + ':\n'}}
    {% if (message['role'] == 'assistant') %}
        {% generation %} {{ message['content'] | trim + eos_token }} {% endgeneration %}
    {% else %}
        {{ message['content'] | trim + eos_token }}
    {% endif %}

{% endfor %}

{% if add_generation_prompt %}
    {{ '### ' + 'assistant' + ':\n' }}
{% endif %}"""

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.chat_template = chat_template
datasets = load_dataset(dataset_name, trust_remote_code=True)

# assistant_mask is all zeros for llama3 tokenizer
chat = tokenizer.apply_chat_template(
    datasets["train"][0]["messages"],
    add_generation_prompt=False,
    return_dict=True,
    tokenize=True,
    return_assistant_tokens_mask=True
)
print("assistant_masks", chat["assistant_masks"])

Executing the steps apply_chat_template uses to build the assistant mask shows that the char_to_token method of the tokenizers.Encoding class does not seem to work correctly:

compiled_template = tokenizer._compile_jinja_template(chat_template)
template_kwargs = {**tokenizer.special_tokens_map}
rendered_chat, generation_indices = tokenizer._render_with_assistant_indices(
    compiled_template=compiled_template,
    messages=datasets["train"][0]["messages"],
    tools=[],
    documents=None,
    add_generation_prompt=False,
    **template_kwargs
)
out = tokenizer(
    rendered_chat,
    padding=False,
    truncation=False,
    max_length=None,
    add_special_tokens=False,
    return_tensors=None
)
first_assistant_start_char, first_assistant_end_char = generation_indices[0]
# returns None for the Llama 3 tokenizer (expected: the token index of the first assistant character)
print("char_to_token", out.char_to_token(0, first_assistant_start_char))

Expected behavior

If we assume the tokenized chat is 10 tokens long and the assistant tokens occupy positions 4-6 and 8-9 (counting from 1), the expected output would look like this:
[0, 0, 0, 1, 1, 1, 0, 1, 1, 0]
The actual output for the Llama 3 tokenizer is always all zeros:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
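
The same expectation expressed as a tiny check (the spans below are the hypothetical 1-based positions from the example above):

expected_len = 10
assistant_positions = [(4, 6), (8, 9)]  # inclusive, 1-based positions
mask = [0] * expected_len
for start, end in assistant_positions:
    for pos in range(start, end + 1):
        mask[pos - 1] = 1
print(mask)  # [0, 0, 0, 1, 1, 1, 0, 1, 1, 0]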
