
Conversation

@pmabbo13
Contributor

@pmabbo13 pmabbo13 commented Jun 24, 2022

Description

Following the CNNDM dataset implementation #1789, the outstanding items were to (1) cache the extraction of the tar files, and (2) optimize reading in the filenames belonging to each split.

Process

  1. To cache the extraction process, we created a filepath function for the on_disk_cache method called _extracted_folder_fn, which returns the list of expected cached locations of every story extracted from the tar file. For the end_caching method, we created a separate filepath function called _extracted_filepath_fn, which returns the expected cached location of a single story (a rough sketch of this two-stage cache follows this list).
  2. The previous implementation for retrieving the stories relevant to the dataset split was sub-optimal because it ran upon instantiation of the dataset object and was used to filter the dataset after the CNN and DailyMail datapipes were combined. To make this more performant, we now retrieve the list of files relevant to the split and data source in _extracted_folder_fn, which gets called once per tar file. We store the list in a global variable so that it is built only on the first iteration of the dataset and remains available for the filtering step on subsequent iterations. We settled on this approach after benchmarking several alternatives; the results can be found in a comment below.
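
For reference, here is a minimal sketch of the two-stage cache described in item (1). It is not the exact torchtext code: the URL, the cache directory, and the _story_names_for_split stub are placeholders for illustration, while on_disk_cache, end_caching, FileOpener, HttpReader, and load_from_tar are the torchdata datapipe APIs the dataset builds on.

import os

from torchdata.datapipes.iter import FileOpener, HttpReader, IterableWrapper

URL = "https://example.com/cnn_stories.tgz"  # placeholder, not the real source URL
_CACHE_DIR = os.path.expanduser("~/.torchtext/cache/CNNDM")


def _story_names_for_split(split):
    # Stub standing in for the real logic, which derives each story's filename
    # from the split's url_list.
    return [f"placeholder_{split}_{i}.story" for i in range(3)]


def _extracted_folder_fn(split, archive_path):
    # Expected cached location of every story extracted from the tar file;
    # on_disk_cache treats the cache as valid only if all of these paths exist.
    return [os.path.join(_CACHE_DIR, split, name) for name in _story_names_for_split(split)]


def _extracted_filepath_fn(split, file_path):
    # Expected cached location for a single extracted story.
    return os.path.join(_CACHE_DIR, split, os.path.basename(file_path))


def _build_datapipe(split):
    url_dp = IterableWrapper([URL])

    # First cache: the downloaded tar file itself.
    cache_compressed_dp = url_dp.on_disk_cache(
        filepath_fn=lambda url: os.path.join(_CACHE_DIR, os.path.basename(url))
    )
    cache_compressed_dp = HttpReader(cache_compressed_dp).end_caching(
        mode="wb", same_filepath_fn=True
    )

    # Second cache (added in this PR): the extracted stories.
    cache_decompressed_dp = cache_compressed_dp.on_disk_cache(
        filepath_fn=lambda path: _extracted_folder_fn(split, path)
    )
    cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").load_from_tar()
    cache_decompressed_dp = cache_decompressed_dp.end_caching(
        mode="wb", filepath_fn=lambda path: _extracted_filepath_fn(split, path)
    )
    return cache_decompressed_dp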

Testing

pytest test/datasets/test_cnndm.py

Benchmarking results before vs. after caching the tar extraction, for the train split:

from torchtext.datasets.cnndm import CNNDM
from benchmark.utils import Timer


# train split
# First 2 iterations
dp = CNNDM(split="train")
dp = iter(dp)

for i in range(2):
    with Timer(f"Iteration: {i}"):
        next(dp)

# Iteration through entire dataset
with Timer("Iteration through entire train split"):
    dp = CNNDM(split="train")
    dp = iter(dp)
    print(len(list(dp)))

Tar files already downloaded, before the tar extraction is cached:

Iteration: 0 ... Total running time: 17.349151186004747
Iteration: 1 ... Total running time: 0.0048055590013973415
Iteration through entire train split ... Total running time: 1340.499610778992

Tar files already downloaded, after the tar extraction is cached:

Iteration: 0 ... Total running time: 33.93104401102755
Iteration: 1 ... Total running time: 0.005427820957265794
Iteration through entire train split ... Total running time: 995.5637674510363

@pmabbo13
Contributor Author

pmabbo13 commented Jun 28, 2022

@Nayef211 just to quickly get something up, I have two files, one for each implementation. cnndm.py downloads and processes the story filenames at every iteration of the dataset; cnndm_v1.py only does so on the first iteration and stores the filenames in a global variable, as sketched below.
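
For context, a minimal sketch of the cnndm_v1.py idea (simplified, not the actual file): _load_story_fnames and _filter_fn are placeholder helpers for illustration.

import os

_STORY_FNAMES = {}  # split -> set of story filenames, filled lazily on first use


def _load_story_fnames(split):
    # Stub standing in for the expensive step of fetching and parsing the
    # split's story-name list.
    return [f"placeholder_{split}_{i}.story" for i in range(3)]


def _get_story_fnames(split):
    # Runs the expensive step only once per split; later iterations of the
    # datapipe reuse the module-level global.
    if split not in _STORY_FNAMES:
        _STORY_FNAMES[split] = set(_load_story_fnames(split))
    return _STORY_FNAMES[split]


def _filter_fn(split, item):
    # Used with datapipe.filter(...) to keep only the stories that belong to
    # the requested split after the CNN and DailyMail datapipes are combined.
    fname, _stream = item
    return os.path.basename(fname) in _get_story_fnames(split)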

To get the benchmarking results, I ran python cnndm.py and python cnndm_v1.py.

Results for python cnndm.py

initialize dataset ... Total running time: 2.902001142501831e-06
iteration: 0 ... Total running time: 1.8253453625366092
iteration: 1 ... Total running time: 1.5163669474422932

Results for python cnndm_v1.py

initialize dataset ... Total running time: 3.1366944313049316e-06
iteration: 0 ... Total running time: 1.8413675921037793
iteration: 1 ... Total running time: 0.0005632331594824791

@pmabbo13 pmabbo13 changed the title from [WIP] Cache cnndm url_list to Cache CNNDM extraction and optimize reading in filenames Jul 8, 2022
@pmabbo13 pmabbo13 marked this pull request as ready for review July 8, 2022 19:35
@pmabbo13 pmabbo13 requested review from Nayef211 and parmeet July 8, 2022 19:36
Contributor

@Nayef211 Nayef211 left a comment


LGTM. Thanks for addressing all of the follow-ups for the CNNDM dataset!

Contributor

@parmeet parmeet left a comment


Thanks @pmabbo13 for adding the secondary cache. Overall LGTM!

nit: If you get a chance, could you add some benchmark numbers showing how much we gained by adding the secondary cache? You can refer to the summary we reported for AmazonReviewPolarity in #1527 (comment).

@pmabbo13 pmabbo13 merged commit 5ce9c42 into pytorch:main Jul 14, 2022
@pmabbo13
Contributor Author

Thanks @pmabbo13 for adding the secondary cache. Overall LGTM!

nit: If you get a chance, could you add some benchmark numbers showing how much we gained by adding the secondary cache? You can refer to the summary we reported for AmazonReviewPolarity in #1527 (comment).

@parmeet The PR description has been updated accordingly

