Skip to content

Conversation

@tianyu-l
Copy link
Contributor

@tianyu-l tianyu-l commented Apr 30, 2024

Stack from ghstack (oldest at bottom):

As titled. We can just use the load_dataset HF API to unify different use cases.

  1. load_dataset is flexible in that, it can take a HF hub dataset repository or a local directory. The behavior is consistent as long as the underlying data is the same. It supports common data formats such as .txt, .json, .json.gz, .csv, .parquet, etc.
  2. According to this post,

load_dataset works in three steps: download the dataset, then prepare it as an arrow dataset, and finally return a memory mapped arrow dataset. In particular it creates a cache directory to store the arrow data and the subsequent cache files for map.

  1. Previously used load_from_disk can only load dataset saved by save_to_disk (in arrow format), which can be viewed as a way to load "preprocessed" dataset:

load_from_disk directly returns a memory mapped dataset from the arrow file (similar to Dataset.from_file). It doesn't create a cache diretory, instead all the subsequent map calls write in the same directory as the original data.

  1. For large dataset (which cannot fit in memory), we need to set streaming=True for load_dataset, even if it is stored in a local directory. One might think load_from_diskis better because of point 3 above; however, to preprocess the huge dataset and call save_to_disk, one needs to load it in memory in the first place.

For all the reasons listed above, let's not use load_from_disk which assumes preprocessed data in arrow format.

Let's use load_dataset which supports common data formats, and set streaming=True for large dataset, no matter it is from HF or from local disk.

P.S.:

  1. This PR updates the data file from arrow to json, while keeping the same data (first 45,000 entries of c4).
  2. c4 is now available to run large scale experiments. Performance verified.

tianyu-l added a commit that referenced this pull request Apr 30, 2024
ghstack-source-id: 63f2c0e
Pull Request resolved: #287
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 30, 2024
tianyu-l added a commit that referenced this pull request Apr 30, 2024
ghstack-source-id: a8f5228
Pull Request resolved: #287
tianyu-l added a commit that referenced this pull request Apr 30, 2024
ghstack-source-id: 932e7cc
Pull Request resolved: #287
@tianyu-l tianyu-l requested review from awgu, lessw2020 and wanchaol April 30, 2024 19:57
Copy link
Collaborator

@wanchaol wanchaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please see comment

if dataset_name == "c4":
# c4 is huge, and requires both streaming and language selection
# (we default to en)
ds = load_dataset(dataset_path, name="en", split="train", streaming=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a streaming as an arg the HuggingFaceDataset, and then when creating dataloader in c4, we just pass streaming=true to the HuggingFaceDataset constructor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We certainly can, although this would create if-c4 statements to other places as well and more args in function call in general. Since we only have two datasets right now, maybe let's keep it until the change is necessary.

@tianyu-l tianyu-l merged commit 9a611e2 into gh/tianyu-l/10/base Apr 30, 2024
tianyu-l added a commit that referenced this pull request Apr 30, 2024
ghstack-source-id: 932e7cc
Pull Request resolved: #287
@tianyu-l tianyu-l deleted the gh/tianyu-l/10/head branch April 30, 2024 22:23
tianyu-l added a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 16, 2024
ghstack-source-id: 932e7cc
Pull Request resolved: pytorch#287
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
ghstack-source-id: 932e7cc
Pull Request resolved: pytorch#287
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants