tests : add test-tokenizers-remote #13846
Conversation
```cpp
std::pair<long, std::vector<char>> common_remote_get_content(const std::string & url, const common_remote_params & params);

// download one single file from remote URL to local path
bool common_download_file_single(const std::string & url, const std::string & path, const std::string & bearer_token, bool offline);
```
maybe we don't need to expose these functions. instead, use `common_remote_get_content`, then write the response content to a file using `fstream`
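A minimal sketch of what that could look like, built only on the `common_remote_get_content` declaration quoted above (the helper name, the default-constructed params, and the assumption that the `long` is the HTTP status code are all mine):

```cpp
#include <fstream>
#include <string>

// hypothetical helper: fetch the content into memory, then write it out
// with fstream - the download functions themselves stay private to arg.cpp
static bool download_to_file(const std::string & url, const std::string & path) {
    common_remote_params params; // assuming the defaults are fine here
    auto res = common_remote_get_content(url, params);
    if (res.first != 200) { // assuming the long is the HTTP status code
        return false;
    }
    std::ofstream out(path, std::ios::binary);
    out.write(res.second.data(), (std::streamsize) res.second.size());
    return out.good();
}
```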
Maybe, but then I'll lose all the fancy functionality (caching, multi-threaded download, etc).
...and change the cache file name as per the suggestion.
Sigh, not sure where …

Oh, …

curl is not installed by default on Windows; it is downloaded to a tmp directory, then we pass the unarchived path to CMake. You need to copy the DLL from the unarchived path to the …

Bah, …
```cpp
json tree = get_hf_repo_dir(repo, true, {}, {});

if (!tree.empty()) {
    std::vector<std::pair<std::string, std::string>> files;

    for (const auto & item : tree) {
        if (item.at("type") == "file") {
            std::string path = item.at("path");

            if (string_ends_with(path, ".gguf") || string_ends_with(path, ".gguf.inp") || string_ends_with(path, ".gguf.out")) {
                // this is to avoid different repos having the same file name, or the same file name in different subdirs
                std::string filepath = repo + "_" + path;
                // make sure we don't have any slashes in the filename
                string_replace_all(filepath, "/", "_");
                // make sure we don't have any quotes in the filename
                string_replace_all(filepath, "'", "_");
                filepath = fs_get_cache_file(filepath);

                files.push_back({endpoint + repo + "/resolve/main/" + path, filepath});
            }
        }
    }
```
We should factor this into a `libcommon` function. Two main reasons:
- Be able to reuse this
- Contain `json.hpp` within `libcommon` and avoid compiling it one more time just for this test
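A possible shape for that helper, purely as a sketch (names are hypothetical): the JSON parsing would live behind the declaration, so callers never see `json.hpp`.

```cpp
#include <string>
#include <vector>

// hypothetical libcommon API: parses the HF repo listing internally and
// returns plain strings, keeping json.hpp out of the callers entirely
struct common_hf_file_info {
    std::string path; // path of the file inside the repo
};

std::vector<common_hf_file_info> common_list_hf_repo_files(
    const std::string & repo, const std::string & endpoint);
```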
```cpp
json tree = get_hf_repo_dir(repo, true, {}, {});

if (!tree.empty()) {
    std::vector<std::pair<std::string, std::string>> files;
```
Btw, avoid these nested STL containers - I find this very difficult to understand. This can be:

```cpp
struct common_file_info {
    std::string whatever_the_first_string_is;
    std::string whatever_the_second_string_is;
    // etc ...
};

std::vector<common_file_info> files;
```

It's much easier to read and extend in the future with additional information if needed.
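For example, the `push_back` from the quoted test would then read as follows, with the placeholder fields given concrete names (`url`, `path`) based on what the two strings hold in the test:

```cpp
common_file_info info;
info.url  = endpoint + repo + "/resolve/main/" + path; // first string: download URL
info.path = filepath;                                  // second string: local cache path
files.push_back(info);
```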
Maybe refactor `common_download_file_multiple` separately first, then?
I think the problem is that you're trying to reuse `common_download_file_multiple`, which is intended to be an internal function inside `arg.cpp`. But I still don't think we need to expose it to the outside world.

Tbh, I think this test is being quite over-engineered. I adhere to the KISS principle, and here are my thoughts:

- Caching is not necessary, as we don't have a CI setup to use the cache anyway
- Locally, if you want to run it, you can simply set up a `scripts/get-*.sh`. We already have some scripts like this
- I don't see why we need to filter files by extension and download them manually via curl. Just download all files via `git clone https://huggingface.co/ggml-org/vocabs`. And that's even better: `git` already handles the caching stuff
- ~~While we cannot run `git clone` on windows CI, we don't even need it. The tokenizer logic is guaranteed to be deterministic cross-platform, so we only need to run it on **one of** the Linux CI jobs.~~ Edit: windows runners on github CI have git pre-installed
> I think the problem is that you're trying to reuse `common_download_file_multiple`, which is intended to be an internal function inside `arg.cpp`. But I still don't think we need to expose it to the outside world.

Well, maybe adding some new functions, as @ggerganov suggested, would be a better idea? (A possible shape is sketched after this list.)

- Getting repo file info
- Downloading repo files by name/extension
- Adding batch functionality to `common_download_file_multiple`
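Something along these lines, purely as a sketch (all names and signatures are hypothetical):

```cpp
#include <string>
#include <utility>
#include <vector>

// getting repo file info (reusing the struct sketched earlier in the thread)
std::vector<common_hf_file_info> common_list_hf_repo_files(
    const std::string & repo, const std::string & endpoint);

// downloading repo files whose names end with one of the given suffixes
bool common_download_hf_repo_files(
    const std::string & repo, const std::vector<std::string> & suffixes,
    const std::string & dst_dir);

// batch functionality layered on top of common_download_file_multiple
bool common_download_file_batched(
    const std::vector<std::pair<std::string, std::string>> & urls_and_paths,
    size_t batch_size);
```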
> Tbh, I think this test is being quite over-engineered. I adhere to the KISS principle, and here are my thoughts:

Undeniably. :)

> * Caching is not necessary, as we don't have a CI setup to use the cache anyway

True.

> * Locally, if you want to run it, you can simply set up a `scripts/get-*.sh`. We already have some scripts like this

Should not be any reason to run this locally though; this is meant for CI.

> * I don't see why we need to filter files by extension and download them manually via curl. Just download all files via `git clone https://huggingface.co/ggml-org/vocabs`. And that's even better: `git` already handles the caching stuff

I don't want to rely on any specific tree structure, so in this case I would have to traverse the checkout to find all the right files instead, which adds just as much logic as before.

> * ~~While we cannot run `git clone` on windows CI, we don't even need it. The tokenizer logic is guaranteed to be deterministic cross-platform, so we only need to run it on **one of** the Linux CI jobs.~~ Edit: windows runners on github CI have git pre-installed

Yeah, I guess.
> I don't want to rely on any specific tree structure, so in this case I would have to traverse the checkout to find all the right files instead, which adds just as much logic as before.

Just add a script to copy all `*.gguf.*` files to a temp directory before running it, something like this:

```sh
find . -type f -name "*.gguf.*" -exec cp {} ./my_tmp_dir \;
test-tokenizers-all ./my_tmp_dir
```
Adding batching to `common_download_file_multiple` would be useful though?
No. That function should be kept simple. Batching is just a wrapper around it.
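To illustrate, such a wrapper could be as small as the sketch below: it chunks the request list and processes one chunk at a time. It is written against the `common_download_file_single` declaration quoted at the top of the thread, since that is the only signature shown here; the real wrapper would presumably sit on top of `common_download_file_multiple` instead.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// hypothetical batching wrapper: split the requests into fixed-size chunks
// so a long file list doesn't fire off everything at once
static bool download_in_batches(
        const std::vector<std::pair<std::string, std::string>> & urls_and_paths,
        size_t batch_size) {
    for (size_t i = 0; i < urls_and_paths.size(); i += batch_size) {
        const size_t end = std::min(i + batch_size, urls_and_paths.size());
        for (size_t j = i; j < end; ++j) {
            if (!common_download_file_single(urls_and_paths[j].first, urls_and_paths[j].second,
                                             /*bearer_token =*/ "", /*offline =*/ false)) {
                return false;
            }
        }
    }
    return true;
}
```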
The thinking was that it would mean less duplicated effort, and that model downloading would benefit from it too, as it will now most likely get throttled on many splits.
I don't want to spend too much of my time arguing which way is better, but if you want to do it - do it.
Still, my concerns about whether the whole thing can be just a bash script seem to be ignored at this point.
Sorry, didn't mean to come off as ignoring it, just mulling it over. :)
Adds `test-tokenizers-remote`, which downloads vocab files from the HF repo `ggml-org/vocabs` and runs `test-tokenizer-0` on the files.

This incidentally sent me down the rabbit hole of trying to find out what was wrong with the RWKV tokenizer; turns out it was the HF tokenizer all along! :P