Description
System Info
- transformers version: 4.44.1
- Platform: Linux-4.18.0-553.8.1.el8_10.x86_64-x86_64-with-glibc2.28
- Python version: 3.10.14
- Huggingface_hub version: 0.24.5
- Safetensors version: 0.4.4
- Accelerate version: 0.33.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA A100-SXM4-80GB
Who can help?
@ArthurZucker
I noticed that apply_chat_template on the PreTrainedTokenizerBase class does not work correctly when return_assistant_tokens_mask=True. For each example we would expect to get back a mask, i.e. a list of 0s and 1s aligned with the tokens, where 1 indicates that the token is part of an assistant message and 0 otherwise. This works correctly with the Llama 2 tokenizer, for example. I am sharing a minimal example to reproduce the issue.
Looking deeper into the apply_chat_template method, the issue seems to be related to the char_to_token method of the tokenizers.Encoding class, and could be related to the fact that the Llama 3 tokenizer was trained with tiktoken as opposed to sentencepiece.
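For reference, this is roughly what char_to_token is expected to return on a tokenizer where things work (a minimal sketch; the choice of bert-base-uncased is only illustrative and not part of the report):
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("hello world", add_special_tokens=False)
# character 6 is the first character of "world"; char_to_token should map it to that token's index
print(enc.char_to_token(0, 6))  # expected: 1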
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
from transformers import AutoTokenizer
from datasets import load_dataset
dataset_name = "m-a-p/Code-Feedback"
model_name = "meta-llama/Meta-Llama-3.1-8B" # apply_chat_template does not work correctly
#model_name = "meta-llama/Llama-2-7b-hf" # apply_chat_template works correctly
chat_template = """{% if messages[0]['role'] == 'system' %}
{% set offset = 1 %}
{% else %}
{% set offset = 0 %}
{% endif %}
{% for message in messages %}
{% if (message['role'] == 'user') != (loop.index0 % 2 == offset) %}
{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
{% endif %}
{{ '### ' + message['role'] + ':\n'}}
{% if (message['role'] == 'assistant') %}
{% generation %} {{ message['content'] | trim + eos_token }} {% endgeneration %}
{% else %}
{{ message['content'] | trim + eos_token }}
{% endif %}
{% endfor %}
{% if add_generation_prompt %}
{{ '### ' + 'assistant' + ':\n' }}
{% endif %}"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.chat_template = chat_template
datasets = load_dataset(dataset_name, trust_remote_code=True)
# assistant_masks is all zeros for the Llama 3 tokenizer
chat = tokenizer.apply_chat_template(
    datasets["train"][0]["messages"],
    add_generation_prompt=False,
    return_dict=True,
    tokenize=True,
    return_assistant_tokens_mask=True,
)
print("assistant_masks", chat["assistant_masks"])
Executing the steps that apply_chat_template uses internally to build the assistant mask shows that the char_to_token method of the tokenizers.Encoding class does not seem to work correctly:
compiled_template = tokenizer._compile_jinja_template(chat_template)
template_kwargs = {**tokenizer.special_tokens_map}
rendered_chat, generation_indices = tokenizer._render_with_assistant_indices(
    compiled_template=compiled_template,
    messages=datasets["train"][0]["messages"],
    tools=[],
    documents=None,
    add_generation_prompt=False,
    **template_kwargs,
)
out = tokenizer(
    rendered_chat,
    padding=False,
    truncation=False,
    max_length=None,
    add_special_tokens=False,
    return_tensors=None,
)
first_assistant_start_char, first_assistant_end_char = generation_indices[0]
# look up the token containing the first assistant character (batch index 0),
# mirroring what apply_chat_template does internally; returns None for Llama 3
print("char_to_token", out.char_to_token(0, first_assistant_start_char))
Expected behavior
If we assume the tokenized chat is 10 tokens long and that the 4th-6th and 8th-9th tokens belong to assistant messages, the expected output would look like this:
[0, 0, 0, 1, 1, 1, 0, 1, 1, 0]
The actual output for the Llama 3 tokenizer is always all 0s:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
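In other words, given those spans, the expected mask could be built like this (a toy sketch; the 0-based positions 3-5 and 7-8 correspond to the 4th-6th and 8th-9th tokens):
expected_mask = [0] * 10
for start_tok, end_tok in [(3, 5), (7, 8)]:  # inclusive 0-based token spans
    for idx in range(start_tok, end_tok + 1):
        expected_mask[idx] = 1
print(expected_mask)  # [0, 0, 0, 1, 1, 1, 0, 1, 1, 0]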