
Conversation

@pmabbo13
Contributor

@pmabbo13 pmabbo13 commented Jun 24, 2022

Description

Following the CNNDM dataset implementation #1789, the outstanding items were to (1) cache the extraction of the tar files, and (2) optimize reading in the filenames belonging to each split.

Process

  1. To cache the extraction process, we created a filepath function for the on_disk_cache method called _extracted_folder_fn, which returns the list of expected cached locations of every story extracted from the tar file. For the end_caching method, we created a separate filepath function called _extracted_filepath_fn, which returns the expected cached location of a single story (a rough sketch of this two-stage cache follows this list).
  2. The previous implementation for retrieving the stories relevant to the dataset split was sub-optimal because it ran upon instantiation of the dataset object and was used to filter the dataset after the CNN and DailyMail datapipes were combined. To make this more performant, we now retrieve the list of files relevant to the split and data source in _extracted_folder_fn, which gets called once per tar file. We store the list in a global variable so that it is built only on the first iteration of the dataset and remains available for the filtering step on subsequent iterations. We settled on this approach after benchmarking several alternatives; the results can be found in a comment below.
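
For reference, here is a minimal sketch of the two-stage cache described in item (1). It is not the exact torchtext code: the URL, the cache directory, and the _story_names_for_split stub are placeholders for illustration, while on_disk_cache, end_caching, FileOpener, HttpReader, and load_from_tar are the torchdata datapipe APIs the dataset builds on.

import os

from torchdata.datapipes.iter import FileOpener, HttpReader, IterableWrapper

URL = "https://example.com/cnn_stories.tgz"  # placeholder, not the real source URL
_CACHE_DIR = os.path.expanduser("~/.torchtext/cache/CNNDM")


def _story_names_for_split(split):
    # Stub standing in for the real logic, which derives each story's filename
    # from the split's url_list.
    return [f"placeholder_{split}_{i}.story" for i in range(3)]


def _extracted_folder_fn(split, archive_path):
    # Expected cached location of every story extracted from the tar file;
    # on_disk_cache treats the cache as valid only if all of these paths exist.
    return [os.path.join(_CACHE_DIR, split, name) for name in _story_names_for_split(split)]


def _extracted_filepath_fn(split, file_path):
    # Expected cached location for a single extracted story.
    return os.path.join(_CACHE_DIR, split, os.path.basename(file_path))


def _build_datapipe(split):
    url_dp = IterableWrapper([URL])

    # First cache: the downloaded tar file itself.
    cache_compressed_dp = url_dp.on_disk_cache(
        filepath_fn=lambda url: os.path.join(_CACHE_DIR, os.path.basename(url))
    )
    cache_compressed_dp = HttpReader(cache_compressed_dp).end_caching(
        mode="wb", same_filepath_fn=True
    )

    # Second cache (added in this PR): the extracted stories.
    cache_decompressed_dp = cache_compressed_dp.on_disk_cache(
        filepath_fn=lambda path: _extracted_folder_fn(split, path)
    )
    cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").load_from_tar()
    cache_decompressed_dp = cache_decompressed_dp.end_caching(
        mode="wb", filepath_fn=lambda path: _extracted_filepath_fn(split, path)
    )
    return cache_decompressed_dp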

Testing

pytest test/datasets/test_cnndm.py

Benchmarking results before vs. after caching the tar extraction, for the train split:

from torchtext.datasets.cnndm import CNNDM
from benchmark.utils import Timer


# train split
# First 2 iterations
dp = CNNDM(split="train")
dp = iter(dp)

for i in range(2):
    with Timer(f"Iteration: {i}"):
        next(dp)

# Iteration through entire dataset
with Timer("Iteration through entire train split"):
    dp = CNNDM(split="train")
    dp = iter(dp)
    print(len(list(dp)))

Tar files already downloaded, before the tar extraction is cached:

Iteration: 0 ... Total running time: 17.349151186004747
Iteration: 1 ... Total running time: 0.0048055590013973415
Iteration through entire train split ... Total running time: 1340.499610778992

Tar files already downloaded, after the tar extraction is cached:

Iteration: 0 ... Total running time: 33.93104401102755
Iteration: 1 ... Total running time: 0.005427820957265794
Iteration through entire train split ... Total running time: 995.5637674510363

@pmabbo13
Contributor Author

pmabbo13 commented Jun 28, 2022

@Nayef211 just to quickly get something up, I have two files, one for each implementation. cnndm.py downloads and processes the story filenames at every iteration of the dataset; cnndm_v1.py only does so on the first iteration and stores the filenames in a global variable, as sketched below.
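
For context, a minimal sketch of the cnndm_v1.py idea (simplified, not the actual file): _load_story_fnames and _filter_fn are placeholder helpers for illustration.

import os

_STORY_FNAMES = {}  # split -> set of story filenames, filled lazily on first use


def _load_story_fnames(split):
    # Stub standing in for the expensive step of fetching and parsing the
    # split's story-name list.
    return [f"placeholder_{split}_{i}.story" for i in range(3)]


def _get_story_fnames(split):
    # Runs the expensive step only once per split; later iterations of the
    # datapipe reuse the module-level global.
    if split not in _STORY_FNAMES:
        _STORY_FNAMES[split] = set(_load_story_fnames(split))
    return _STORY_FNAMES[split]


def _filter_fn(split, item):
    # Used with datapipe.filter(...) to keep only the stories that belong to
    # the requested split after the CNN and DailyMail datapipes are combined.
    fname, _stream = item
    return os.path.basename(fname) in _get_story_fnames(split)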

To get the benchmarking results, I ran python cnndm.py and python cnndm_v1.py.

Results for python cnndm.py

initialize dataset ... Total running time: 2.902001142501831e-06
iteration: 0 ... Total running time: 1.8253453625366092
iteration: 1 ... Total running time: 1.5163669474422932

Results for python cnndm_v1.py

initialize dataset ... Total running time: 3.1366944313049316e-06
iteration: 0 ... Total running time: 1.8413675921037793
iteration: 1 ... Total running time: 0.0005632331594824791

@pmabbo13 pmabbo13 changed the title from [WIP] Cache cnndm url_list to Cache CNNDM extraction and optimize reading in filenames Jul 8, 2022
@pmabbo13 pmabbo13 marked this pull request as ready for review July 8, 2022 19:35
@pmabbo13 pmabbo13 requested review from Nayef211 and parmeet July 8, 2022 19:36
Contributor

@Nayef211 Nayef211 left a comment


LGTM. Thanks for addressing all of the follow-ups for the CNNDM dataset!

Contributor

@parmeet parmeet left a comment


Thanks @pmabbo13 for adding the secondary cache. Overall LGTM!

nit: If you get a chance, could you add some benchmark numbers showing how much we gained by adding the secondary cache? You can refer to the summary we reported for AmazonReviewPolarity in #1527 (comment).

@pmabbo13 pmabbo13 merged commit 5ce9c42 into pytorch:main Jul 14, 2022
@pmabbo13
Contributor Author

Thanks @pmabbo13 for adding the secondary cache. Overall LGTM!

nit: If you get a chance, could you add some benchmark numbers showing how much we gained by adding the secondary cache? You can refer to the summary we reported for AmazonReviewPolarity in #1527 (comment).

@parmeet The PR description has been updated accordingly

