This repository was archived by the owner on Sep 10, 2025. It is now read-only.

🚀 Feature
We need to update our dataset implementations to add secondary caching for extracted files, as a follow-up to #1494.
Motivation
Some of our datasets have both a cache_compressed_dp and a cache_decompressed_dp, which is the behavior we want (e.g. EnWik9). Other datasets cache only the downloaded archive file and not the files extracted from that archive (e.g. SogouNews), so the archive has to be extracted again on every use.
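To make the two cache levels concrete, here is a minimal standard-library sketch of the idea. It does not use the datapipe API or reproduce any dataset implementation in this repository; `cache_archive`, `cache_extracted`, and `fake_fetch` are hypothetical names, and a small zip file written locally stands in for the download.

```python
import os
import tempfile
import zipfile


def cache_archive(archive_path, fetch):
    """First-level cache: call `fetch` only if the archive is not on disk."""
    if not os.path.exists(archive_path):
        fetch(archive_path)
    return archive_path


def cache_extracted(archive_path, member, extract_dir):
    """Second-level cache: extract `member` only if it is not on disk."""
    target = os.path.join(extract_dir, member)
    if not os.path.exists(target):
        with zipfile.ZipFile(archive_path) as zf:
            zf.extract(member, extract_dir)
    return target


# Hypothetical stand-in for a download: writes a tiny zip and records each call.
fetch_calls = []


def fake_fetch(path):
    fetch_calls.append(path)
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("train.csv", "label,text\n1,hello\n")


root = tempfile.mkdtemp()
archive = os.path.join(root, "data.zip")

# Run the pipeline twice; the second pass should hit both caches.
for _ in range(2):
    cache_archive(archive, fake_fetch)
    train_path = cache_extracted(archive, "train.csv", root)

with open(train_path) as f:
    first_line = f.readline().strip()
```

With only the first level in place, every pass would reopen the archive and extract `train.csv` again. With the second level, the extracted file is reused even if the archive is no longer available.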
Backlog of Dataset Tests
The following datasets need to be updated to add the secondary caching mechanism:
cc @parmeet @abhinavarora @VirgileHlav @erip