unify data loading from HF and from disk #287

tianyu-l · 2024-04-30T19:24:19Z

Stack from ghstack (oldest at bottom):

-> unify data loading from HF and from disk #287

As titled. We can just use the load_dataset HF API to unify different use cases.

1. load_dataset is flexible in that it can take either a HF hub dataset repository or a local directory. The behavior is consistent as long as the underlying data is the same. It supports common data formats such as .txt, .json, .json.gz, .csv, .parquet, etc.
2. According to this post,

load_dataset works in three steps: download the dataset, then prepare it as an arrow dataset, and finally return a memory mapped arrow dataset. In particular it creates a cache directory to store the arrow data and the subsequent cache files for map.

3. The previously used load_from_disk can only load a dataset saved by save_to_disk (in arrow format), which can be viewed as a way to load a "preprocessed" dataset:

load_from_disk directly returns a memory mapped dataset from the arrow file (similar to Dataset.from_file). It doesn't create a cache directory, instead all the subsequent map calls write in the same directory as the original data.
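To make the "hub repository or local directory" point concrete, here is a minimal sketch, not the code in this PR. The helper names `write_sample_jsonl` and `load_train_split` are hypothetical, and it assumes the `datasets` package is installed and that a local directory containing a JSON Lines file is picked up by `load_dataset` through file-extension inference.

```python
import json
import tempfile
from pathlib import Path


def write_sample_jsonl(directory, texts, filename="sample.json"):
    """Write one JSON object per line (hypothetical helper for this sketch)."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / filename
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")
    return path


def load_train_split(path_or_repo, **kwargs):
    """Load the train split from a hub repo id or a local directory.

    The same call is used for both sources; only the first argument differs.
    """
    # Imported lazily so the file-writing helper works without `datasets`.
    from datasets import load_dataset

    return load_dataset(path_or_repo, split="train", **kwargs)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_sample_jsonl(tmp, ["hello", "world"])
        ds = load_train_split(tmp)  # local directory instead of a hub repo id
        print(ds[0]["text"])
```

Passing a hub repository id to `load_train_split` instead of `tmp` would go through the same code path, which is the unification this PR argues for.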

For a large dataset (one that cannot fit in memory), we need to set streaming=True for load_dataset, even if it is stored in a local directory. One might think load_from_disk is better because of point 3 above; however, to preprocess the huge dataset and call save_to_disk, one needs to load it into memory in the first place.

For all the reasons listed above, let's not use load_from_disk, which assumes preprocessed data in arrow format.

Let's use load_dataset, which supports common data formats, and set streaming=True for large datasets, regardless of whether the data comes from HF or from local disk.
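As a rough illustration of what streaming=True changes for the caller, the sketch below assumes the `datasets` package is installed; `load_streaming` and `first_n_texts` are hypothetical names, and the `"text"` field is an assumption about the sample schema. With streaming enabled, `load_dataset` returns an iterable dataset, so samples are consumed by iteration rather than by index.

```python
from itertools import islice


def load_streaming(path_or_repo, **kwargs):
    """Open the train split lazily, from a hub repo id or a local directory."""
    # Imported lazily so `first_n_texts` can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset(path_or_repo, split="train", streaming=True, **kwargs)


def first_n_texts(samples, n, key="text"):
    """Take the first n samples from any iterable of dict-like samples.

    islice stops after n items, so a streamed dataset is never fully read.
    """
    return [sample[key] for sample in islice(samples, n)]
```

For example, `first_n_texts(load_streaming(some_path), 3)` would read only the first three samples; `some_path` is a placeholder for a hub repository id or a local directory.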

P.S.:

This PR updates the data file from arrow to json, while keeping the same data (first 45,000 entries of c4).
c4 is now available for running large-scale experiments. Performance verified.

wanchaol

please see comment

wanchaol · 2024-04-30T22:10:00Z

torchtitan/datasets/hf_datasets.py

+        if dataset_name == "c4":
+            # c4 is huge, and requires both streaming and language selection
+            # (we default to en)
+            ds = load_dataset(dataset_path, name="en", split="train", streaming=True)


Can we add streaming as an arg to HuggingFaceDataset, and then, when creating the dataloader for c4, just pass streaming=True to the HuggingFaceDataset constructor?

We certainly can, although this would add if-c4 statements in other places as well, and more args to function calls in general. Since we only have two datasets right now, maybe let's keep it as is until the change is necessary.
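For reference, the reviewer's alternative could look roughly like the sketch below. This is not the torchtitan implementation: the class name `HuggingFaceDatasetSketch` and the `load_fn` parameter are hypothetical, with `load_fn` added only so the sketch can be exercised without downloading anything. Note that the c4 language selection still needs a branch somewhere, which is the trade-off raised in the reply above.

```python
class HuggingFaceDatasetSketch:
    """Hypothetical stand-in for HuggingFaceDataset with a `streaming` arg."""

    def __init__(self, dataset_name, dataset_path, streaming=False, load_fn=None):
        if load_fn is None:
            # Default to the real HF API; imported lazily on purpose.
            from datasets import load_dataset as load_fn

        kwargs = {"split": "train", "streaming": streaming}
        if dataset_name == "c4":
            # c4 still needs a language config (the diff above defaults to en).
            kwargs["name"] = "en"
        self._data = load_fn(dataset_path, **kwargs)

    def __iter__(self):
        # Iteration works for both streamed and in-memory datasets.
        return iter(self._data)
```

The caller that builds the c4 dataloader would then pass streaming=True explicitly, while other datasets keep the default.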

unify data loading from HF and from disk

d83ffe0

tianyu-l added a commit that referenced this pull request Apr 30, 2024

unify data loading from HF and from disk

8d73609

ghstack-source-id: 63f2c0e Pull Request resolved: #287

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 30, 2024

Update on "unify data loading from HF and from disk"

ab9ce1c

tianyu-l added a commit that referenced this pull request Apr 30, 2024

unify data loading from HF and from disk

d5e2113

ghstack-source-id: a8f5228 Pull Request resolved: #287

Update on "unify data loading from HF and from disk"

9a611e2

tianyu-l added a commit that referenced this pull request Apr 30, 2024

unify data loading from HF and from disk

b24ed42

ghstack-source-id: 932e7cc Pull Request resolved: #287

tianyu-l requested review from awgu, lessw2020 and wanchaol April 30, 2024 19:57

wanchaol approved these changes Apr 30, 2024

tianyu-l merged commit 9a611e2 into gh/tianyu-l/10/base Apr 30, 2024

tianyu-l added a commit that referenced this pull request Apr 30, 2024

unify data loading from HF and from disk

4e5ffaf

ghstack-source-id: 932e7cc Pull Request resolved: #287

tianyu-l deleted the gh/tianyu-l/10/head branch April 30, 2024 22:23

tianyu-l added a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 16, 2024

unify data loading from HF and from disk

ac090ba

ghstack-source-id: 932e7cc Pull Request resolved: pytorch#287

philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024

unify data loading from HF and from disk

258f608

ghstack-source-id: 932e7cc Pull Request resolved: pytorch#287
