
Conversation


@CISC CISC commented May 28, 2025

Adds test-tokenizers-remote that downloads vocab files from HF ggml-org/vocabs and runs test-tokenizer-0 on the files.

This incidentally sent me down the rabbit hole of trying to find out what went wrong with the RWKV tokenizer, and it turns out it was the HF tokenizer all along! :P

@github-actions github-actions bot added the testing Everything test related label May 28, 2025
std::pair<long, std::vector<char>> common_remote_get_content(const std::string & url, const common_remote_params & params);

// download one single file from remote URL to local path
bool common_download_file_single(const std::string & url, const std::string & path, const std::string & bearer_token, bool offline);
Collaborator
maybe we don't need to expose these functions. instead, use common_remote_get_content, then write the response content to a file using fstream

Collaborator Author

Maybe, but then I'll lose all the fancy functionality (caching, multi-threaded download, etc).


CISC commented May 28, 2025

Sigh, not sure where Release in the binary path for windows-latest-cmake comes from, nor how to detect it. Anyone got any clues?

Test command: D:\a\llama.cpp\llama.cpp\build\bin\Release\test-tokenizers-remote.exe
Working Directory: D:/a/llama.cpp/llama.cpp/build/bin


CISC commented May 30, 2025

Oh, Exit code 0xc0000135 means DLL missing...


ngxson commented May 30, 2025

curl is not installed by default on windows; it is downloaded to a tmp directory, then we pass the unarchived path to cmake.

you need to copy the DLL from the unarchived path to the build/bin

@github-actions github-actions bot added the devops improvements to build systems and github actions label May 31, 2025

CISC commented May 31, 2025

Bah, Exit code 0xc0000409 now, so it basically does not work in the Windows build for whatever reason.

@CISC CISC requested review from ggerganov and ngxson May 31, 2025 21:19
Comment on lines +90 to +111
json tree = get_hf_repo_dir(repo, true, {}, {});

if (!tree.empty()) {
    std::vector<std::pair<std::string, std::string>> files;

    for (const auto & item : tree) {
        if (item.at("type") == "file") {
            std::string path = item.at("path");

            if (string_ends_with(path, ".gguf") || string_ends_with(path, ".gguf.inp") || string_ends_with(path, ".gguf.out")) {
                // this is to avoid different repos having the same file name, or the same file name in different subdirs
                std::string filepath = repo + "_" + path;
                // make sure we don't have any slashes in the filename
                string_replace_all(filepath, "/", "_");
                // make sure we don't have any quotes in the filename
                string_replace_all(filepath, "'", "_");
                filepath = fs_get_cache_file(filepath);

                files.push_back({endpoint + repo + "/resolve/main/" + path, filepath});
            }
        }
    }
}
@ggerganov ggerganov Jun 1, 2025

We should factor this in a libcommon function. Two main reasons:

  • Be able to reuse this
  • Contain the json.hpp within libcommon and avoid compiling it one more time just for this test

json tree = get_hf_repo_dir(repo, true, {}, {});

if (!tree.empty()) {
std::vector<std::pair<std::string, std::string>> files;
Member

Btw, avoid these nested STL containers - I find this very difficult to understand. This can be:

struct common_file_info {
    std::string whatever_the_first_string_is;
    std::string whatever_the_second_string_is;
    // etc ...
};

std::vector<common_file_info> files;

It's much easier to read and extend in the future with additional information if needed.

Collaborator Author

Maybe refactor common_download_file_multiple separately first then?

@ngxson ngxson Jun 1, 2025

I think the problem is that you're trying to reuse common_download_file_multiple which is intended to be an internal function inside arg.cpp. But I still don't think we need to expose it to the outside world.

Tbh, I think this test is being quite over-engineered. I adhere to the KISS principle and here are my thoughts:

  • Caching is not necessary, as we don't have the CI set up to use a cache anyway
  • Locally, if you want to run it, you can simply set up a scripts/get-*.sh. We already have some scripts like this
  • I don't see why we need to filter files by extension and download them manually via curl. Just download all files via git clone https://huggingface.co/ggml-org/vocabs. And that's even better, git already handles the caching
  • ~While we cannot run git clone on windows CI, we don't even need it. The tokenizer logic is guaranteed to be deterministic cross-platform; we only need to run it on one of the linux CI jobs.~ Edit: windows runners on github CI have git pre-installed

Collaborator Author

I think the problem is that you're trying to reuse common_download_file_multiple which is intended to be an internal function inside arg.cpp. But I still don't think we need to expose it to the outside world.

Well, maybe adding some new functions, as @ggerganov suggested, would be a better idea?

  • Getting repo file info
  • Downloading repo files by name/extension
  • Adding batch functionality to common_download_file_multiple

Tbh, I think this test is being quite over-engineered. I adhere to the KISS principle and here are my thoughts:

Undeniably. :)

* Caching is not necessary, as we don't have CI setup to use cache anyway

True.

* Locally, if you want to run it, you can simply set up a `scripts/get-*.sh`. We already have some scripts like this

There shouldn't be any reason to run this locally though; this is meant for CI.

* I don't see why we need to filter files by extension and download them manually via curl. Just download all files via `git clone https://huggingface.co/ggml-org/vocabs`. And that's even better, `git` already handles the caching

I don't want to rely on any specific tree structure, so in this case I would have to traverse the checkout to find all the right files instead, which adds just as much logic as before.

* ~While we cannot run `git clone` on windows CI, we don't even need it. The tokenizer logic is guaranteed to be deterministic cross-platform; we only need to run it on **one of** the linux CI jobs.~ Edit: windows runners on github CI have git pre-installed

Yeah, I guess.

@ngxson ngxson Jun 1, 2025

I don't want to rely on any specific tree structure, so in this case I would have to traverse the checkout to find all the right files instead, which adds just as much logic as before.

Just add a script to copy all *.gguf.* to a temp directory before running it, something like this:

find . -type f -name "*.gguf.*" -exec cp {} ./my_tmp_dir \;
test-tokenizers-all ./my_tmp_dir

Collaborator Author

Adding batching to common_download_file_multiple would be useful though?

Collaborator

No. That function should be kept simple. Batching is just a wrapper around it

Collaborator Author

The thinking was that it would mean less duplicated effort, and that model downloading would benefit from it, as it will now most likely get throttled on many splits.

Collaborator

I don't want to spend too much of my time arguing which way is better, but if you want to do it - do it.

Still, my concerns about whether the whole thing can be just a bash script seem to be ignored at this point.

Collaborator Author

Sorry, didn't mean to come off as ignoring it, just mulling it over. :)

@CISC CISC closed this Jun 4, 2025