unify data loading from HF and from disk #287
Conversation
wanchaol
left a comment
please see comment
```python
if dataset_name == "c4":
    # c4 is huge, and requires both streaming and language selection
    # (we default to en)
    ds = load_dataset(dataset_path, name="en", split="train", streaming=True)
```
Can we add `streaming` as an arg to `HuggingFaceDataset`, and then, when creating the dataloader for c4, just pass `streaming=True` to the `HuggingFaceDataset` constructor?
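A minimal sketch of what this suggestion could look like. The class shape and method names here are assumptions for illustration, not the repo's actual implementation; only the `load_dataset` call mirrors the snippet above.

```python
class HuggingFaceDataset:
    """Hypothetical wrapper exposing `streaming` as a constructor arg,
    instead of special-casing c4 inside the loading code."""

    def __init__(self, dataset_path, name=None, split="train", streaming=False):
        self.dataset_path = dataset_path
        self.name = name
        self.split = split
        self.streaming = streaming
        self._ds = None  # loaded lazily, so construction stays cheap

    def _load(self):
        # Imported here so the sketch can be constructed without `datasets`.
        from datasets import load_dataset
        return load_dataset(
            self.dataset_path, name=self.name,
            split=self.split, streaming=self.streaming,
        )

    def __iter__(self):
        if self._ds is None:
            self._ds = self._load()
        return iter(self._ds)


# The c4 call site would then shrink to something like:
# ds = HuggingFaceDataset(dataset_path, name="en", streaming=True)
```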
We certainly can, although this would spread if-c4 statements to other places as well, and add more args to function calls in general. Since we only have two datasets right now, maybe let's keep it as is until the change becomes necessary.
ghstack-source-id: 932e7cc Pull Request resolved: pytorch#287
Stack from ghstack (oldest at bottom):
As titled. We can just use the `load_dataset` HF API to unify different use cases.

1. `load_dataset` is flexible in that it can take a HF hub dataset repository or a local directory. The behavior is consistent as long as the underlying data is the same.
2. It supports common data formats such as .txt, .json, .json.gz, .csv, .parquet, etc.
3. `load_from_disk` can only load a dataset saved by `save_to_disk` (in arrow format), which can be viewed as a way to load a "preprocessed" dataset.
4. We can set `streaming=True` for `load_dataset`, even if the dataset is stored in a local directory.

One might think `load_from_disk` is better because of point 3 above; however, to preprocess the huge dataset and call `save_to_disk`, one needs to load it in memory in the first place.

For all the reasons listed above, let's not use `load_from_disk`, which assumes preprocessed data in arrow format. Let's use `load_dataset`, which supports common data formats, and set `streaming=True` for large datasets, no matter whether they come from HF or from local disk.

P.S.: changed the format of the test dataset from `arrow` to `json`, while keeping the same data (first 45,000 entries of `c4`). `c4` is now available to run large scale experiments. Performance verified.