System Info
- `transformers` version: 4.45.1
- Platform: Linux-5.15.0-92-generic-x86_64-with-glibc2.35
- Python version: 3.12.4
- Huggingface_hub version: 0.25.1
- Safetensors version: 0.4.3
- Accelerate version: 0.32.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.3.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA RTX 6000 Ada Generation
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am trying to retrieve the "word", as defined by `word_ids()`, by looking up its character span:
```python
from transformers import AutoTokenizer

model_name = "meta-llama/Meta-Llama-3.1-8B"
this_tokenizer = AutoTokenizer.from_pretrained(model_name)
this_sent = "Hello World!"
this_encode = this_tokenizer.encode_plus(this_sent)
print(this_encode.word_to_chars(0))
```
And the output is:
```
CharSpan(start=0, end=0)
```
This does not happen with some other models, such as BERT:
```python
model_name = "bert-base-uncased"
this_tokenizer = AutoTokenizer.from_pretrained(model_name)
this_sent = "Hello World!"
this_encode = this_tokenizer.encode_plus(this_sent)
print(this_encode.word_to_chars(0))
```
With the output being:
```
CharSpan(start=0, end=5)
```
And the word "Hello" can be extracted via this_sent[0:5]
easily. I wonder if it might have something to do with the tokenizer? I have tried BERT, RoBERTa, GPT-2, Qwen2.5 so far, and there were no problems.
For Llama models, I have tried llama3-8b, llama3.1-8b, llama3.2-1b and llama3.2-3b without success.
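To help narrow this down, here is a minimal diagnostic sketch (not part of the report above) that prints each token alongside its word id and the characters its offsets point at. It assumes a fast (Rust-backed) tokenizer, since `return_offsets_mapping=True` is only available there:

```python
from transformers import AutoTokenizer

model_name = "meta-llama/Meta-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
sent = "Hello World!"

# offset_mapping gives (start, end) character offsets per token
encoding = tokenizer(sent, return_offsets_mapping=True)
for token, word_id, (start, end) in zip(
    encoding.tokens(), encoding.word_ids(), encoding["offset_mapping"]
):
    print(f"{token!r:>12}  word_id={word_id}  chars={sent[start:end]!r}")
```

Comparing this dump between the Llama and BERT tokenizers should show whether the underlying offsets are wrong or only the `word_to_chars()` lookup is.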
Expected behavior
`word_to_chars()` should give the correct character span for Llama models.
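In the meantime, a possible workaround is to rebuild the span from `word_ids()` and `token_to_chars()` instead of calling `word_to_chars()` directly. This is only a sketch; `word_span` is a hypothetical helper, and it assumes `token_to_chars()` itself returns correct offsets for these tokenizers:

```python
from transformers import AutoTokenizer

def word_span(encoding, word_index):
    """Hypothetical helper: union the char spans of all tokens in a word."""
    starts, ends = [], []
    for token_index, word_id in enumerate(encoding.word_ids()):
        if word_id == word_index:
            span = encoding.token_to_chars(token_index)
            if span is not None:  # special tokens map to None
                starts.append(span.start)
                ends.append(span.end)
    return (min(starts), max(ends)) if starts else None

this_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
this_sent = "Hello World!"
this_encode = this_tokenizer.encode_plus(this_sent)
start, end = word_span(this_encode, 0)
print(this_sent[start:end])
```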