Conversation

@lopuhin
Contributor

@lopuhin lopuhin commented May 22, 2024

This PR fixes a few errors which appear when following the README https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md#smoothquant on the current latest commit.

Note: the first commit looks quite obvious (although I'm not sure how this could have worked before), while I'm less sure about the second one; I was just going by the error messages during engine conversion, so there might be a better place for the fix. Feel free to treat this as a bug report instead. I verified that an engine built this way produces reasonable output and the expected performance. The model I tested this on is Mistral 7B (mistral-7b-v0.1-instruct), but I assume other Llama 2 and 3 models should also work (I didn't get to Llama 3 yet).

lopuhin added 3 commits May 22, 2024 11:00
without this it errors out with:

Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 456, in <module>
    main()
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 448, in main
    convert_and_save_hf(args)
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 353, in convert_and_save_hf
    LLaMAForCausalLM.quantize(args.model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 405, in quantize
    convert.quantize(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1395, in quantize
    weights = load_weights_from_hf(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1437, in load_weights_from_hf
    weights = convert_hf_llama(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1089, in convert_hf_llama
    convert_layer(l)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 725, in convert_layer
    get_tllm_linear_sq_weight(int8_weights,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 610, in get_tllm_linear_sq_weight
    results[prefix + 'per_channel_scale'] = torch.Tensor([
ValueError: only one element tensors can be converted to Python scalars

we can also check the shapes:
cur_per_channel_value.shape -> torch.Size([6144])
col_shape -> [1, 6144]

so it's clear that the tensor was meant to be converted without wrapping it in [] (see the sketch below)
with these changes the model works and provides sensible output
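
A minimal sketch of what the first fix likely looks like. The names cur_per_channel_value and col_shape come from get_tllm_linear_sq_weight in convert.py; the tensor contents here are stand-ins, so treat this as an approximation of the change rather than the exact patch:

```python
import torch

# Stand-in for the real per-channel scale tensor computed during SmoothQuant
# conversion; its shape matches the torch.Size([6144]) observed above.
cur_per_channel_value = torch.randn(6144)
col_shape = [1, 6144]

# Before: wrapping the tensor in a Python list makes torch.Tensor() try to turn
# each element into a scalar, which raises
# "ValueError: only one element tensors can be converted to Python scalars".
# results[prefix + 'per_channel_scale'] = torch.Tensor([cur_per_channel_value])

# After: convert the tensor directly and reshape it to the expected column shape.
per_channel_scale = cur_per_channel_value.to(torch.float32).reshape(col_shape).contiguous()
print(per_channel_scale.shape)  # torch.Size([1, 6144])
```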
@kaiyux kaiyux mentioned this pull request May 28, 2024
@kaiyux
Member

kaiyux commented May 28, 2024

Hi @lopuhin, the changes are integrated in #1688 and we've credited you as a co-author, hence I'm closing this PR now. Thanks a lot.

@kaiyux kaiyux closed this May 28, 2024
@lopuhin
Contributor Author

lopuhin commented Jun 4, 2024

Hi @kaiyux, great, thank you! I think only the first commit was integrated; the other two were not, but they are also required, since they fix the error that would happen when running the engine. I'm experimenting with SmoothQuant Llama 3 right now and need all the commits to get it working. Do you mind having another look?

