Cache CNNDM extraction and optimize reading in filenames #1809
@Nayef211 Just to quickly get something up, I have two files, one for each implementation. cnndm.py downloads and processes the story filenames at every iteration of the dataset. cnndm_v1.py downloads and processes them only at the first iteration and stores the filenames in a global variable. To get the benchmarking results I ran Results for Results for
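The difference between the two variants can be illustrated with a minimal sketch. It is not the code from cnndm.py or cnndm_v1.py: the names `_STORY_FILENAMES`, `get_story_filenames`, and `fake_load` are hypothetical, and the loader is a stand-in for the real download-and-process step.

```python
from typing import Callable, Dict, List

# Module-level cache mapping a split name to its story filenames.
# Hypothetical name; the PR only says the filenames live in a global variable.
_STORY_FILENAMES: Dict[str, List[str]] = {}


def get_story_filenames(split: str, load: Callable[[str], List[str]]) -> List[str]:
    """Return the filenames for `split`, calling `load` only on first access."""
    if split not in _STORY_FILENAMES:
        _STORY_FILENAMES[split] = load(split)
    return _STORY_FILENAMES[split]


# Stand-in for the expensive download-and-process step; records each call.
load_calls: List[str] = []


def fake_load(split: str) -> List[str]:
    load_calls.append(split)
    return [f"{split}_story_{i}.story" for i in range(3)]


# Simulate five iterations over the dataset.
for _ in range(5):
    names = get_story_filenames("train", fake_load)
```

With this pattern the loader runs once regardless of how many times the dataset is iterated, which is the behaviour described for cnndm_v1.py. The trade-off is that the cached list persists for the lifetime of the process.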
…text into feature/cache-url-list
Nayef211
left a comment
LGTM. Thanks for addressing all of the followups for the CNNDM dataset!
parmeet
left a comment
Thanks @pmabbo13 for adding the secondary cache. Overall LGTM!
nit: If you get a chance, could you add some benchmark numbers showing how much we gained by adding the secondary cache? You can refer to the summary of what we reported for AmazonReviewPolarity here: #1527 (comment)
@parmeet The PR description has been updated accordingly.
Description
Following the CNNDM dataset implementation #1789, the outstanding items were to (1) cache the extraction of the tar files, and (2) optimize reading in the filenames belonging to each split.
Process
For the on_disk_cache method, we created a filepath function called _extracted_folder_fn, which returns a list of the expected cached locations of every story extracted from the tar file. For the end_caching method, we created a separate filepath function called _extracted_filepath_fn, which returns the expected cached location of a single story.
The list of story filenames is used by _extracted_folder_fn, which gets called once per tar file. We store the list in a global variable so that it is only instantiated at the first iteration of the dataset and remains accessible for the filtering step at subsequent iterations. This approach was settled on after benchmarking different approaches, the results of which can be found in a comment below.
Testing
pytest test/datasets/test_cnndm.py
Benchmarking results before vs after caching extraction for train split:
Tar files already downloaded and before tar extraction is cached
Tar files already downloaded and after tar extraction is cached
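The two filepath functions described under Process can be sketched roughly as follows. This is an illustration under assumptions, not the code merged in this PR: the function signatures, the `extraction_is_cached` helper, and the example paths are all hypothetical, and the sketch uses only the standard library rather than the datapipe caching methods themselves.

```python
import os
import tempfile
from typing import List


def extracted_folder_fn(root: str, story_filenames: List[str]) -> List[str]:
    """Expected cached location of every story extracted from one tar file."""
    return [os.path.join(root, name) for name in story_filenames]


def extracted_filepath_fn(root: str, member_path: str) -> str:
    """Expected cached location of a single extracted story."""
    return os.path.join(root, os.path.basename(member_path))


def extraction_is_cached(expected_paths: List[str]) -> bool:
    """Hypothetical helper: extraction can be skipped if every file exists."""
    return all(os.path.exists(path) for path in expected_paths)


with tempfile.TemporaryDirectory() as root:
    expected = extracted_folder_fn(root, ["a.story", "b.story"])
    cached_before = extraction_is_cached(expected)

    # Simulate extraction by writing each tar member to its single-file location.
    for member in ["cnn/stories/a.story", "cnn/stories/b.story"]:
        with open(extracted_filepath_fn(root, member), "w") as handle:
            handle.write("placeholder story text")

    cached_after = extraction_is_cached(expected)
```

The list-returning function answers "is everything from this tar file already on disk?", while the single-path function decides where each extracted story is written. Both must agree on the location for the cache check to succeed on later runs.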