@mthrok (Contributor) commented Mar 10, 2021

cc @VitalyFedyunin @ejguan @glaringlee

This is the initial PoC / study for adopting IterDataPipe in torchaudio.
The YesNo dataset is very simple, probably too simple to benefit from the advantages of IterDataPipe, but I wanted to use it as an opportunity to learn how IterDataPipe should be used.

@mthrok changed the title from "[PoC] Add YesNo dataset based on IterDataPipe" to "[Experiment] Add YesNo dataset based on IterDataPipe" Mar 10, 2021
@ejguan left a comment

Hey @mthrok,
Thank you for trying IterDataPipe. Several comments below:

  • One goal of IterDataPipe is to reduce memory usage in the pipeline by using __iter__ to generate one item per iteration. As an example for users, it's better not to keep the filenames in memory as a list and iterate over that list.
  • If the existing DataPipes in the core repo already cover the functionality you want, there is no need to implement a new DataPipe.

Please let me know what you think. Feel free to ping me if you have further questions. ☺️
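The first point above can be illustrated with a minimal sketch in plain Python. `LazyFileLister` is a hypothetical stand-in written for this illustration, not a class from torch; it assumes the files of interest end in `.wav`. It yields one path per iteration through `__iter__` rather than building a list of filenames first.

```python
import os
import tempfile


class LazyFileLister:
    """Hypothetical stand-in for a directory-listing pipe (not a torch class)."""

    def __init__(self, root_dir, suffix=".wav"):
        self.root_dir = root_dir
        self.suffix = suffix  # assumed extension, for illustration only

    def __iter__(self):
        # os.scandir returns an iterator, so entries are read on demand
        # instead of being collected into a list up front.
        with os.scandir(self.root_dir) as entries:
            for entry in entries:
                if entry.is_file() and entry.name.endswith(self.suffix):
                    yield entry.path


def demo_lazy_listing():
    # Build a throwaway directory so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("0_1.wav", "1_0.wav", "notes.txt"):
            open(os.path.join(tmp, name), "w").close()
        # Directory order is not guaranteed, so sort for a stable result.
        return sorted(os.path.basename(p) for p in LazyFileLister(tmp))
```

Calling `demo_lazy_listing()` walks the temporary directory once and returns only the `.wav` names; the object itself never holds more than the current entry.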

Comment on lines +109 to +116
class LoadYesNoItem(IterDataPipe):
    def __init__(self, data_pipe):
        self.data_pipe = data_pipe

    def __iter__(self):
        for path, label in self.data_pipe:
            waveform, sample_rate = torchaudio.load(path)
            yield YesNoItem(path, label, waveform, sample_rate)

You can use Map to apply a function to each item in the pipeline.
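As a rough picture of what a Map-style pipe does, here is a sketch in plain Python. `MapPipe` and `tracked_double` are hypothetical names made up for this illustration; this is not the torch implementation, only the general pattern of wrapping a source iterable and applying a function lazily to each item.

```python
class MapPipe:
    """Hypothetical stand-in that mimics a Map-style pipe; not the torch class."""

    def __init__(self, source, fn):
        self.source = source
        self.fn = fn

    def __iter__(self):
        # fn is applied only when the consumer asks for the next item.
        for item in self.source:
            yield self.fn(item)


calls = []


def tracked_double(x):
    # Record each call so the laziness is observable.
    calls.append(x)
    return 2 * x


pipe = MapPipe([1, 2, 3], tracked_double)
it = iter(pipe)
first = next(it)  # at this point only the first item has been processed
```

With a pipe like this, the custom `LoadYesNoItem` class reduces to a single function that is handed to the map step.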

Comment on lines +102 to +106
def __iter__(self):
    for filename in self.files:
        path = os.path.join(self.data_dir, filename)
        label = [int(c) for c in path.split("_")]
        yield path, label

Can you use ListDirFiles to generate one filename per iteration?
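Whichever pipe supplies the filenames, the label still has to be derived from each name. The sketch below assumes YesNo filenames of the form `0_1_0_0_1_1_0_0.wav` (underscore-separated digits plus an extension); `label_from_path` is a hypothetical helper. It parses the basename without its extension rather than the full path, because the directory part and the `.wav` suffix are not digits and would make `int()` raise.

```python
import os


def label_from_path(path):
    """Hypothetical helper: turn '.../0_1_0.wav' into [0, 1, 0]."""
    # Drop the directory and the extension before splitting on underscores.
    stem, _ = os.path.splitext(os.path.basename(path))
    return [int(c) for c in stem.split("_")]


# The directory name contains an underscore on purpose: it is ignored.
example = label_from_path("data/waves_yesno/0_1_0_0_1_1_0_0.wav")
```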

Comment on lines +120 to +121
# TODO: download dataset if necessary
return LoadYesNoItem(ListYesNoItems(root_dir))

You can use existing DataPipes:

from torch.utils.data import datapipes

dp = datapipes.iter.ListDirFiles(root_dir)
dp = datapipes.iter.Map(dp, fn=loading_fn)
return dp

Or you can use the functional API (I prefer this way):

return ListDirFiles(root_dir).map(fn=loading_fn)

where loading_fn converts a file to a YesNoItem.
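One possible shape for `loading_fn` is sketched below. Several things here are assumptions: the field names of `YesNoItem` (the PR only shows it being called positionally), the filename-to-label rule, and the `make_loading_fn` factory, which exists so the audio loader can be swapped out. In the PR the loader would be `torchaudio.load`, which returns a `(waveform, sample_rate)` pair; the sketch uses a fake loader so it runs without torchaudio.

```python
import os
from collections import namedtuple

# Field names are assumed from the positional call in the PR.
YesNoItem = namedtuple("YesNoItem", ["path", "label", "waveform", "sample_rate"])


def make_loading_fn(load):
    """Hypothetical factory: `load` is any callable returning (waveform, sample_rate)."""

    def loading_fn(path):
        # Assumed naming scheme: underscore-separated digits, e.g. '1_0.wav'.
        stem, _ = os.path.splitext(os.path.basename(path))
        label = [int(c) for c in stem.split("_")]
        waveform, sample_rate = load(path)
        return YesNoItem(path, label, waveform, sample_rate)

    return loading_fn


def fake_load(path):
    # Stand-in for an audio loader; returns dummy samples and a dummy rate.
    return [0.0, 0.0], 8000


loading_fn = make_loading_fn(fake_load)
item = loading_fn("root/1_0.wav")
```

Passing `torchaudio.load` instead of `fake_load` would give a function suitable for the `map` step, under the same assumptions.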

mthrok added a commit to mthrok/audio that referenced this pull request Dec 13, 2022
* Update build.sh

* Update audio tutorial for release pytorch 1.8 / torchaudio 0.8 (pytorch#1379)

* [wip] replace audio tutorial

* Update

* Update

* Update

* fixup

* Update requirements.txt

* update

* Update

Co-authored-by: Brian Johnson <[email protected]>

* [1.8 release] Switch to the new datasets in torchtext 0.9.0 release - text classification tutorial (pytorch#1352)

* switch to the new dataset API

* checkpoint

* checkpoint

* checkpoint

* update docs

* checkpoint

* switch to legacy vocab

* update to follow the master API

* checkpoint

* checkpoint

* address reviewer's comments

Co-authored-by: Guanheng Zhang <[email protected]>
Co-authored-by: Brian Johnson <[email protected]>

* [1.8 release] Switch to LM dataset in torchtext 0.9.0 release (pytorch#1349)

* switch to raw text dataset in torchtext 0.9.0 release

* follow the new API in torchtext master

Co-authored-by: Guanheng Zhang <[email protected]>
Co-authored-by: Brian Johnson <[email protected]>

* [WIP][FX] CPU Performance Profiling with FX (pytorch#1319)

Co-authored-by: Brian Johnson <[email protected]>

* [FX] Added fuser tutorial (pytorch#1356)

* Added fuser tutorial

* updated index.rst

* fixed conclusion

* responded to some comments

* responded to comments

* respond

Co-authored-by: Brian Johnson <[email protected]>

* Update numeric_suite_tutorial.py

* Tutorial combining DDP with Pipeline Parallelism to Train Transformer models (pytorch#1347)

* Tutorial combining DDP with Pipeline Parallelism to Train Transformer models.

Summary: Tutorial which places a pipe on GPUs 0 and 1 and another Pipe
on GPUs 2 and 3. Both pipe replicas are replicated via DDP. One process
drives GPUs 0 and 1 and another drives GPUs 2 and 3.

* Polish out some of the docs.

* Add thumbnail and address some comments.

Co-authored-by: pritam <[email protected]>

* More updates to numeric_suite

* Even more updates

* Update numeric_suite_tutorial.py

Hopefully that's the last one

* Update numeric_suite_tutorial.py

Last one

* Update build.sh

Co-authored-by: moto <[email protected]>
Co-authored-by: Guanheng George Zhang <[email protected]>
Co-authored-by: Guanheng Zhang <[email protected]>
Co-authored-by: James Reed <[email protected]>
Co-authored-by: Horace He <[email protected]>
Co-authored-by: Pritam Damania <[email protected]>
Co-authored-by: pritam <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
@mthrok mthrok closed this Feb 21, 2023
@mthrok mthrok deleted the datapipe-yesno branch February 21, 2023 17:12

3 participants