[Experiment] Add YesNo dataset based on IterDataPipe #1382

mthrok · 2021-03-10T15:57:56Z

This is the initial PoC / study for adopting IterDataPipe in torchaudio.
The YesNo dataset is very simple, probably too simple to benefit from the advantages of IterDataPipe, but I wanted to use it as an opportunity to learn how IterDataPipe should be used.

ejguan

Hey @mthrok ,
Thank you for trying IterDataPipe. Several comments below:

One goal of IterDataPipe is to reduce memory usage in the pipeline by using __iter__ to generate one item per iteration. Since this will serve as an example for users, it's better not to hold all filenames in memory as a list and then iterate over that list.
If the existing DataPipes in the core repo already cover the functionality you need, there is no need to implement a new DataPipe.
Please let me know your opinion about it. Feel free to ping me when you have further questions. ☺️
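
To illustrate the one-item-per-iteration point, here is a minimal sketch in plain Python. It does not use the actual IterDataPipe classes; `CountingSource` and `Upper` are hypothetical stand-ins that only show how chained `__iter__` generators avoid materializing the whole sequence.

```python
from typing import Iterable, Iterator


class CountingSource:
    """Hypothetical source stage that records how many items it has produced."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.produced = 0

    def __iter__(self) -> Iterator[str]:
        for i in range(self.n):
            self.produced += 1
            yield f"file_{i}.wav"


class Upper:
    """Hypothetical downstream stage that transforms one item per iteration."""

    def __init__(self, source: Iterable[str]) -> None:
        self.source = source

    def __iter__(self) -> Iterator[str]:
        for name in self.source:
            yield name.upper()


source = CountingSource(1_000_000)
pipe = Upper(source)

# Pull a single item through both stages.
first = next(iter(pipe))

# Only one filename has been generated at this point; a list holding
# a million names was never built.
print(first, source.produced)
```

Storing the filenames in a list inside `__init__` would instead pay the full memory cost before the first item is consumed.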

ejguan · 2021-03-11T16:04:32Z

torchaudio/datasets/yesno.py

+class LoadYesNoItem(IterDataPipe):
+    def __init__(self, data_pipe):
+        self.data_pipe = data_pipe
+
+    def __iter__(self):
+        for path, label in self.data_pipe:
+            waveform, sample_rate = torchaudio.load(path)
+            yield YesNoItem(path, label, waveform, sample_rate)


You can use Map to apply a function to each item in the pipeline.

ejguan · 2021-03-11T16:08:55Z

torchaudio/datasets/yesno.py

+    def __iter__(self):
+        for filename in self.files:
+            path = os.path.join(self.data_dir, filename)
+            label = [int(c) for c in path.split("_")]
+            yield path, label


Can you use ListDirFiles to generate one filename per iteration?
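
A rough sketch of what that could look like, written in plain Python rather than with ListDirFiles itself. `iter_wav_paths` and `parse_label` are hypothetical helper names, and the sketch assumes YesNo filenames of the form `0_1_0_0_1_1_0_0.wav`. Note that it parses the label from the filename stem: splitting the full path on "_", as in the hunk above, would hand the directory prefix and the ".wav" suffix to int().

```python
import os
from typing import Iterator, List


def parse_label(path: str) -> List[int]:
    """Parse a label from the filename stem, e.g. '0_1_1_0.wav' -> [0, 1, 1, 0].

    Assumes the stem consists of integers separated by underscores.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    return [int(c) for c in stem.split("_")]


def iter_wav_paths(root_dir: str) -> Iterator[str]:
    """Yield one .wav path at a time (a stand-in for ListDirFiles).

    os.scandir returns a lazy iterator, so no list of filenames is kept.
    """
    with os.scandir(root_dir) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith(".wav"):
                yield entry.path
```

Each path can then be paired with its label on the fly, for example `((p, parse_label(p)) for p in iter_wav_paths(root_dir))`.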

ejguan · 2021-03-11T16:10:33Z

torchaudio/datasets/yesno.py

+    # TODO: download dataset if necessary
+    return LoadYesNoItem(ListYesNoItems(root_dir))


You can use existing DataPipes:

    from torch.utils.data import datapipes
    dp = datapipes.iter.ListDirFiles(root_dir)
    dp = datapipes.iter.Map(dp, fn=loading_fn)
    return dp

Or you can use the functional API (I prefer this way):

    return ListDirFiles(root_dir).map(fn=loading_fn)

where loading_fn converts a file path to a YesNoItem.
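
For reference, loading_fn could look roughly like the sketch below. This is an assumption about its shape, not code from the PR: the `YesNoItem` field names are guessed from the positional order in the diff, `make_loading_fn` is a hypothetical factory that lets the loader be swapped out, and the built-in `map` stands in for the DataPipe `.map(fn=...)` call so the sketch runs without torch installed.

```python
import os
from typing import Any, Callable, List, NamedTuple, Tuple


class YesNoItem(NamedTuple):
    # Field names are assumed; the PR only shows the positional order.
    path: str
    label: List[int]
    waveform: Any
    sample_rate: int


def _default_load(path: str) -> Tuple[Any, int]:
    # Imported lazily so the sketch can run without torchaudio installed.
    import torchaudio

    return torchaudio.load(path)


def make_loading_fn(
    load: Callable[[str], Tuple[Any, int]] = _default_load,
) -> Callable[[str], YesNoItem]:
    """Build a per-item function that turns a file path into a YesNoItem."""

    def loading_fn(path: str) -> YesNoItem:
        # Assumes filenames such as '1_0_1_1.wav'.
        stem = os.path.splitext(os.path.basename(path))[0]
        label = [int(c) for c in stem.split("_")]
        waveform, sample_rate = load(path)
        return YesNoItem(path, label, waveform, sample_rate)

    return loading_fn


def fake_load(path: str) -> Tuple[Any, int]:
    """Placeholder loader returning dummy audio, for demonstration only."""
    return [0.0, 0.0], 8000


# Built-in map is lazy, like the DataPipe version: one item per iteration.
items = map(make_loading_fn(fake_load), ["waves_yesno/1_0_1_1.wav"])
item = next(items)
print(item.label, item.sample_rate)
```

Keeping the per-item logic in a plain function means no custom DataPipe subclass is needed; the existing listing and mapping DataPipes do the iteration.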

* Update build.sh * Update audio tutorial for release pytorch 1.8 / torchaudio 0.8 (pytorch#1379) * [wip] replace audio tutorial * Update * Update * Update * fixup * Update requirements.txt * update * Update Co-authored-by: Brian Johnson <[email protected]> * [1.8 release] Switch to the new datasets in torchtext 0.9.0 release - text classification tutorial (pytorch#1352) * switch to the new dataset API * checkpoint * checkpoint * checkpoint * update docs * checkpoint * switch to legacy vocab * update to follow the master API * checkpoint * checkpoint * address reviewer's comments Co-authored-by: Guanheng Zhang <[email protected]> Co-authored-by: Brian Johnson <[email protected]> * [1.8 release] Switch to LM dataset in torchtext 0.9.0 release (pytorch#1349) * switch to raw text dataset in torchtext 0.9.0 release * follow the new API in torchtext master Co-authored-by: Guanheng Zhang <[email protected]> Co-authored-by: Brian Johnson <[email protected]> * [WIP][FX] CPU Performance Profiling with FX (pytorch#1319) Co-authored-by: Brian Johnson <[email protected]> * [FX] Added fuser tutorial (pytorch#1356) * Added fuser tutorial * updated index.rst * fixed conclusion * responded to some comments * responded to comments * respond Co-authored-by: Brian Johnson <[email protected]> * Update numeric_suite_tutorial.py * Tutorial combining DDP with Pipeline Parallelism to Train Transformer models (pytorch#1347) * Tutorial combining DDP with Pipeline Parallelism to Train Transformer models. Summary: Tutorial which places a pipe on GPUs 0 and 1 and another Pipe on GPUs 2 and 3. Both pipe replicas are replicated via DDP. One process drives GPUs 0 and 1 and another drives GPUs 2 and 3. * Polish out some of the docs. * Add thumbnail and address some comments. 
Co-authored-by: pritam <[email protected]> * More updates to numeric_suite * Even more updates * Update numeric_suite_tutorial.py Hopefully that's the last one * Update numeric_suite_tutorial.py Last one * Update build.sh Co-authored-by: moto <[email protected]> Co-authored-by: Guanheng George Zhang <[email protected]> Co-authored-by: Guanheng Zhang <[email protected]> Co-authored-by: James Reed <[email protected]> Co-authored-by: Horace He <[email protected]> Co-authored-by: Pritam Damania <[email protected]> Co-authored-by: pritam <[email protected]> Co-authored-by: Nikita Shulga <[email protected]>

[PoC] Add YesNo dataset based on IterDataPipe

f6dc822

facebook-github-bot added the CLA Signed label Mar 10, 2021

fix import

9cae276

mthrok changed the title ~~[PoC] Add YesNo dataset based on IterDataPipe~~ [Experiment] Add YesNo dataset based on IterDataPipe Mar 10, 2021

ejguan reviewed Mar 11, 2021

mthrok closed this Feb 21, 2023

mthrok deleted the datapipe-yesno branch February 21, 2023 17:12
