
Conversation

@parmeet
Contributor

@parmeet parmeet commented Jan 17, 2022

This PR fixes a couple of things:

  1. It introduces caching for extracted files, which helps avoid repetitive extraction on every iteration (see the sketch after this list).
  2. It fixes the parsed CSV content.
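For context, here is a minimal sketch of the two-level caching pattern this PR converges on, pieced together from the snippets quoted below. The function name, the placeholder constants, and the GDriveReader import path are assumptions for illustration, not the final torchtext code:

import os

from torchdata.datapipes.iter import FileOpener, IterableWrapper
from torchtext._download_hooks import GDriveReader

# Placeholders standing in for the dataset module's real constants.
_URL = "https://drive.google.com/placeholder"
_PATH = "amazon_review_polarity_csv.tar.gz"
_EXTRACTED_FILES = {
    "train": os.path.join("amazon_review_polarity_csv", "train.csv"),
    "test": os.path.join("amazon_review_polarity_csv", "test.csv"),
}

def amazon_review_polarity_sketch(root, split):
    # 1) Cache the compressed archive: download only if it is not on disk yet.
    url_dp = IterableWrapper([_URL])
    cache_compressed_dp = url_dp.on_disk_cache(
        filepath_fn=lambda x: os.path.join(root, _PATH)
    )
    cache_compressed_dp = GDriveReader(cache_compressed_dp).end_caching(
        mode="wb", same_filepath_fn=True
    )

    # 2) Cache the extracted CSV: read from the tar only when it is missing,
    #    instead of re-extracting on every iteration.
    cache_decompressed_dp = cache_compressed_dp.on_disk_cache(
        filepath_fn=lambda x: os.path.join(root, _EXTRACTED_FILES[split])
    )
    cache_decompressed_dp = (
        FileOpener(cache_decompressed_dp, mode="b")
        .read_from_tar()
        .filter(lambda x: _EXTRACTED_FILES[split] in x[0])
        .end_caching(mode="wb", same_filepath_fn=True)
    )

    # 3) Parse the cached CSV; casting the label to int is the "parsed CSV content" fix.
    data_dp = FileOpener(cache_decompressed_dp, mode="b")
    return data_dp.parse_csv().map(fn=lambda t: (int(t[0]), t[1]))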

@facebook-github-bot added the CLA Signed label Jan 17, 2022
filepath_fn=lambda x: os.path.join(root, os.path.dirname(_EXTRACTED_FILES[split]), os.path.basename(x)))
cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").read_from_tar()
cache_compressed_dp = cache_decompressed_dp.end_caching(mode="wb", same_filepath_fn=True)
data_dp = FileOpener(cache_decompressed_dp.filter(lambda x: _EXTRACTED_FILES[split] in x[0]).map(lambda x: x[0]), mode='b')
Contributor Author


Hi @ejguan, it seems like the decompressed files from end_caching are not being saved to disk. I am getting the following error:
NotADirectoryError: [Errno 20] Not a directory: '/Users/parmeetbhatia/.torchtext/cache/AmazonReviewPolarity/amazon_review_polarity_csv.tar.gz/amazon_review_polarity_csv/train.csv'

Not sure what I am doing wrong here. Could you please help investigate this?

Contributor


After read_from_tar, the data becomes "(decompressed file path, file handle)", where the decompressed file path becomes amazon_review_polarity_csv/amazon_review_polarity_csv/train.csv. But since you used same_filepath_fn=True, the same function from on_disk_cache is applied to each item, so each file ends up at amazon_review_polarity_csv.tar.gz/amazon_review_polarity_csv/train.csv.

You can use end_caching(mode="wb") without specifying same_filepath_fn to get amazon_review_polarity_csv/amazon_review_polarity_csv/train.csv for each decompressed file.
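In sketch form (reusing the datapipe names from the snippet above; illustrative only):

# After read_from_tar(), each item is (decompressed file path, file handle).
cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").read_from_tar()
# Without same_filepath_fn, end_caching keeps each item's own path instead of
# re-applying the on_disk_cache filepath_fn to every archive member.
cache_decompressed_dp = cache_decompressed_dp.end_caching(mode="wb")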

return check_filter_extracted_files.parse_csv().map(fn=lambda t: (int(t[0]), t[1]))
cache_compressed_dp = GDriveReader(cache_compressed_dp).end_caching(mode="wb", same_filepath_fn=True)
cache_decompressed_dp = cache_compressed_dp.on_disk_cache(
filepath_fn=lambda x: os.path.join(root, os.path.dirname(_EXTRACTED_FILES[split]), os.path.basename(x)))
Contributor


I think there is a problem. When we do the cache check at this stage, we want to check whether the decompressed files exist on the local file system. Then we need to use a generator function as the filepath_fn:

def decompressed_file_fn(x):
    for f in ["train.csv", "test.csv", "readme.txt"]:
        yield os.path.join(root, os.path.dirname(_EXTRACTED_FILES[split]), ..., f)
cache_decompressed_dp = cache_compressed_dp.on_disk_cache(filepath_fn=decompressed_file_fn)
...

cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").\
read_from_tar().\
filter(lambda x: _EXTRACTED_FILES[split] in x[0]).\
map(lambda x: (x[0].replace('_PATH' + '/', ''), x[1]))
Contributor Author


@ejguan I think the path returned by read_from_tar is still amazon_review_polarity_csv.tar.gz/amazon_review_polarity_csv/train.csv. I have to remove amazon_review_polarity_csv.tar.gz to make it work properly.

Contributor Author


Ah, actually ignore the "make it work properly" part. By using same_filepath_fn=True in end_caching, this part is irrelevant. Thanks to @Nayef211 for catching that I am not using _PATH properly. That said, the extracted filename still contains amazon_review_polarity_csv.tar.gz; this is not really a concern though :)


Just double checking that we still plan to replace the '_PATH' with f'{_PATH}'?

Contributor Author


I think we can get rid of the whole map operation here.


def extracted_filepath_fn(x):
    file_path = os.path.join(root, _EXTRACTED_FILES[split])
    dir_path = os.path.dirname(file_path)
Contributor Author


@ejguan Another thing I realized while working on this: end_caching won't be able to save a file to disk if the corresponding parent directory doesn't exist, so I have to create it explicitly. Do you think it would make sense to upstream directory creation into torchdata? For instance, tar does the same and creates the appropriate directories when extracting files. I guess it would be non-intuitive for users to have to explicitly create directories for end_caching?
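In sketch form, the workaround being described (this reuses the extracted_filepath_fn snippet quoted in this thread; the explicit os.makedirs call is the part in question, and a later comment finds it is not actually required):

import os

def extracted_filepath_fn(x):
    # root, split, and _EXTRACTED_FILES are as in the dataset code above.
    file_path = os.path.join(root, _EXTRACTED_FILES[split])
    dir_path = os.path.dirname(file_path)
    # Explicitly create the parent directory so end_caching can write the file.
    os.makedirs(dir_path, exist_ok=True)
    return file_path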

Contributor Author


Got it, I might have been doing something wrong. Just checked in the new code, and yes, it does not require explicitly creating the directory.

return check_filter_extracted_files.parse_csv().map(fn=lambda t: (int(t[0]), t[1]))
cache_compressed_dp = GDriveReader(cache_compressed_dp).end_caching(mode="wb", same_filepath_fn=True)

def extracted_filepath_fn(x):
Contributor Author


@ejguan I am not sure why it is necessary to pass all the files to the generator, as per your comment here (#169 (comment)). I guess it is sufficient to only check for the file(s) we are concerned with, right? Not sure if I am missing anything here?

Contributor


The second on_disk_cache is used to cache the decompressed files from an archive. So on_disk_cache only gets the input archive file path, such as xxx.tar.gz, but we want to check whether all decompressed files exist on your file system. This is a 1-to-N operation, so in order to check all of the decompressed files, we have to accept a generator function as the filepath_fn.

Contributor Author

@parmeet parmeet Jan 19, 2022


I see, got it. Yeah, in this case I only need to check the existence of a single file (even though there are more in the archive). I guess it would be OK to specify only the files we are concerned with and filter out the remaining ones in the pipe?
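In sketch form (same names as the snippets above; illustrative only), the cache check targets just the split's CSV and the other archive members are filtered out after extraction:

cache_decompressed_dp = cache_compressed_dp.on_disk_cache(
    filepath_fn=lambda x: os.path.join(root, _EXTRACTED_FILES[split])
)
cache_decompressed_dp = (
    FileOpener(cache_decompressed_dp, mode="b")
    .read_from_tar()
    .filter(lambda x: _EXTRACTED_FILES[split] in x[0])
    .end_caching(mode="wb", same_filepath_fn=True)
)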


Comment on lines 56 to 58
# data_dp = FileOpener(cache_compressed_dp, mode='b')
# data_dp = data_dp.read_from_tar()
# data_dp = data_dp.filter(lambda x: _EXTRACTED_FILES[split] in x[0])


Any reason for keeping these lines commented out?

Contributor Author


Not really, I was benchmarking and it seems like I forgot to remove this piece of code :)

@parmeet
Contributor Author

parmeet commented Jan 19, 2022

BTW, I think this PR is now ready for final review :) @ejguan

@ejguan
Contributor

ejguan commented Jan 19, 2022

@parmeet Thanks a lot. Could you please add a commit to incorporate the lint requirements? You can use pre-commit: https://github.com/pytorch/data/blob/main/CONTRIBUTING.md#code-style

It will automatically apply lint changes when you use git commit.

@parmeet
Contributor Author

parmeet commented Jan 19, 2022

@parmeet Thanks a lot. Could you please add a commit to incorporate the lint requirements? You can use pre-commit: https://github.com/pytorch/data/blob/main/CONTRIBUTING.md#code-style

It will automatically apply lint changes when you use git commit.

Done! @Nayef211, @abhinavarora this looks cool, we should probably do this for torchtext?

Contributor

@ejguan ejguan left a comment


Thanks, LGTM

@Nayef211

@parmeet Thanks a lot. Could you please add a commit to incorporate the lint requirements? You can use pre-commit: https://github.com/pytorch/data/blob/main/CONTRIBUTING.md#code-style
It will automatically apply lint changes when you use git commit.

Done! @Nayef211, @abhinavarora this looks cool, we should probably do this for torchtext?

This does look really cool. @ejguan do you know if you can set this up with the formatter of your choice (i.e., black or autopep8)?

@parmeet
Contributor Author

parmeet commented Jan 19, 2022

This does look really cool. @ejguan do you know if you can set this up with the formatter of your choice (i.e., black or autopep8)?

I guess this might help answer some of it: https://github.com/pytorch/data/blob/main/.pre-commit-config.yaml

@ejguan
Contributor

ejguan commented Jan 19, 2022

This does look really cool. @ejguan do you know if you can set this up with the formatter of your choice (i.e., black or autopep8)?

Here is the configuration file https://github.com/pytorch/data/blob/main/.pre-commit-config.yaml and ufmt can be used for black. I believe you can add your own formatter to the configuration file.

Credit belongs to @pmeier, who helped us set up this auto-formatter in PR #147.

Edit: A workflow was added as well in https://github.com/pytorch/data/blob/c06066ae360fc6054fb826ae041b1cb0c09b2f3b/.github/workflows/lint.yml#L9-L27

@pmeier
Contributor

pmeier commented Jan 19, 2022

Let me know if another PyTorch repository needs help setting this up.

@parmeet
Contributor Author

parmeet commented Jan 19, 2022

Let me know if another PyTorch repository needs help setting this up.

Would be happy to get contributions to torchtext :)

@Nayef211

@ejguan just following up to see whether this PR is good to be merged?

@facebook-github-bot
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
