
Conversation

@parmeet
Contributor

@parmeet parmeet commented Jan 17, 2022

This PR fixes a couple of things:

  1. It introduces caching for extracted files, which helps avoid repetitive extraction on every iteration (see the sketch after this list).
  2. It fixes the parsed CSV content.
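For context, here is a minimal sketch of the two-level caching pattern this PR converges on, pieced together from the snippets quoted below. The function name, the placeholder constants, and the GDriveReader import path are assumptions for illustration, not the final torchtext code:

import os

from torchdata.datapipes.iter import FileOpener, IterableWrapper
from torchtext._download_hooks import GDriveReader

# Placeholders standing in for the dataset module's real constants.
_URL = "https://drive.google.com/placeholder"
_PATH = "amazon_review_polarity_csv.tar.gz"
_EXTRACTED_FILES = {
    "train": os.path.join("amazon_review_polarity_csv", "train.csv"),
    "test": os.path.join("amazon_review_polarity_csv", "test.csv"),
}

def amazon_review_polarity_sketch(root, split):
    # 1) Cache the compressed archive: download only if it is not on disk yet.
    url_dp = IterableWrapper([_URL])
    cache_compressed_dp = url_dp.on_disk_cache(
        filepath_fn=lambda x: os.path.join(root, _PATH)
    )
    cache_compressed_dp = GDriveReader(cache_compressed_dp).end_caching(
        mode="wb", same_filepath_fn=True
    )

    # 2) Cache the extracted CSV: read from the tar only when it is missing,
    #    instead of re-extracting on every iteration.
    cache_decompressed_dp = cache_compressed_dp.on_disk_cache(
        filepath_fn=lambda x: os.path.join(root, _EXTRACTED_FILES[split])
    )
    cache_decompressed_dp = (
        FileOpener(cache_decompressed_dp, mode="b")
        .read_from_tar()
        .filter(lambda x: _EXTRACTED_FILES[split] in x[0])
        .end_caching(mode="wb", same_filepath_fn=True)
    )

    # 3) Parse the cached CSV; casting the label to int is the "parsed CSV content" fix.
    data_dp = FileOpener(cache_decompressed_dp, mode="b")
    return data_dp.parse_csv().map(fn=lambda t: (int(t[0]), t[1]))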

@facebook-github-bot added the CLA Signed label Jan 17, 2022
filepath_fn=lambda x: os.path.join(root, os.path.dirname(_EXTRACTED_FILES[split]), os.path.basename(x)))
cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").read_from_tar()
cache_compressed_dp = cache_decompressed_dp.end_caching(mode="wb", same_filepath_fn=True)
data_dp = FileOpener(cache_decompressed_dp.filter(lambda x: _EXTRACTED_FILES[split] in x[0]).map(lambda x: x[0]), mode='b')
Contributor Author


Hi @ejguan, it seems like the decompressed files from end_caching are not being saved to disk. I am getting the following error:
NotADirectoryError: [Errno 20] Not a directory: '/Users/parmeetbhatia/.torchtext/cache/AmazonReviewPolarity/amazon_review_polarity_csv.tar.gz/amazon_review_polarity_csv/train.csv'

Not sure what I am doing wrong here. Could you please help investigate this?

Contributor


After read_from_tar, the data becomes "(decompressed file path, file handle)", where the decompressed file path becomes amazon_review_polarity_csv/amazon_review_polarity_csv/train.csv. But since you used same_filepath_fn=True, the same function from on_disk_cache is applied to each item, so each file ends up at amazon_review_polarity_csv.tar.gz/amazon_review_polarity_csv/train.csv.

You can use end_caching(mode="wb") without specifying same_filepath_fn to get amazon_review_polarity_csv/amazon_review_polarity_csv/train.csv for each decompressed file.
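In sketch form (reusing the datapipe names from the snippet above; illustrative only):

# After read_from_tar(), each item is (decompressed file path, file handle).
cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").read_from_tar()
# Without same_filepath_fn, end_caching keeps each item's own path instead of
# re-applying the on_disk_cache filepath_fn to every archive member.
cache_decompressed_dp = cache_decompressed_dp.end_caching(mode="wb")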

return check_filter_extracted_files.parse_csv().map(fn=lambda t: (int(t[0]), t[1]))
cache_compressed_dp = GDriveReader(cache_compressed_dp).end_caching(mode="wb", same_filepath_fn=True)
cache_decompressed_dp = cache_compressed_dp.on_disk_cache(
filepath_fn=lambda x: os.path.join(root, os.path.dirname(_EXTRACTED_FILES[split]), os.path.basename(x)))
Contributor


I think there is a problem. When we do the cache check at this stage, we want to check whether the decompressed files exist on the local file system. Then we need to use a generator function as the filepath_fn:

def decompressed_file_fn(x):
    for f in ["train.csv", "test.csv", "readme.txt"]:
        yield os.path.join(root, os.path.dirname(_EXTRACTED_FILES[split]), ..., f)
cache_decompressed_dp = cache_compressed_dp.on_disk_cache(filepath_fn=decompressed_file_fn)
...

cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").\
read_from_tar().\
filter(lambda x: _EXTRACTED_FILES[split] in x[0]).\
map(lambda x: (x[0].replace('_PATH' + '/', ''), x[1]))
Contributor Author


@ejguan I think the path returned by read_from_tar is still amazon_review_polarity_csv.tar.gz/amazon_review_polarity_csv/train.csv. I have to remove amazon_review_polarity_csv.tar.gz to make it work properly.

Contributor Author


Ah, actually ignore the "make it work properly" part. By using same_filepath_fn=True in end_caching, this part is irrelevant. Thanks to @Nayef211 for catching that I am not using _PATH properly. That said, the extracted filename still contains amazon_review_polarity_csv.tar.gz; this is not really a concern though :)


Just double checking that we still plan to replace the '_PATH' with f'{_PATH}'?

Contributor Author


I think we can get rid of the whole map operation here.


def extracted_filepath_fn(x):
    file_path = os.path.join(root, _EXTRACTED_FILES[split])
    dir_path = os.path.dirname(file_path)
Contributor Author


@ejguan Another thing I realized while working on this: end_caching won't be able to save a file to disk if the corresponding parent directory doesn't exist, so I have to create it explicitly. Do you think it would make sense to upstream directory creation into torchdata? For instance, tar does the same and creates the appropriate directories when extracting files. I guess it would be non-intuitive for users to have to explicitly create directories for end_caching?
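In sketch form, the workaround being described (this reuses the extracted_filepath_fn snippet quoted in this thread; the explicit os.makedirs call is the part in question, and a later comment finds it is not actually required):

import os

def extracted_filepath_fn(x):
    # root, split, and _EXTRACTED_FILES are as in the dataset code above.
    file_path = os.path.join(root, _EXTRACTED_FILES[split])
    dir_path = os.path.dirname(file_path)
    # Explicitly create the parent directory so end_caching can write the file.
    os.makedirs(dir_path, exist_ok=True)
    return file_path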

Contributor Author


Got it, I might have been doing something wrong. Just checked in the new code, and yes, it does not require explicitly creating the directory.

return check_filter_extracted_files.parse_csv().map(fn=lambda t: (int(t[0]), t[1]))
cache_compressed_dp = GDriveReader(cache_compressed_dp).end_caching(mode="wb", same_filepath_fn=True)

def extracted_filepath_fn(x):
Contributor Author


@ejguan I am not sure why it is necessary to pass all the files to the generator, as per your comment here (#169 (comment)). I guess it is sufficient to only check for the file(s) we are concerned with, right? Not sure if I am missing anything here?

Contributor


The second on_disk_cache is used to cache the decompressed files from an archive. So on_disk_cache only gets the input archive file path, such as xxx.tar.gz, but we want to check whether all decompressed files exist on your file system. This is a 1-to-N operation, so in order to check all of the decompressed files, we have to accept a generator function as the filepath_fn.

Contributor Author

@parmeet parmeet Jan 19, 2022


I see, got it. Yeah, in this case I only need to check the existence of a single file (even though there are more in the archive). I guess it would be OK to specify only the files we are concerned with and filter out the remaining ones in the pipe?
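In sketch form (same names as the snippets above; illustrative only), the cache check targets just the split's CSV and the other archive members are filtered out after extraction:

cache_decompressed_dp = cache_compressed_dp.on_disk_cache(
    filepath_fn=lambda x: os.path.join(root, _EXTRACTED_FILES[split])
)
cache_decompressed_dp = (
    FileOpener(cache_decompressed_dp, mode="b")
    .read_from_tar()
    .filter(lambda x: _EXTRACTED_FILES[split] in x[0])
    .end_caching(mode="wb", same_filepath_fn=True)
)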


Comment on lines 56 to 58
# data_dp = FileOpener(cache_compressed_dp, mode='b')
# data_dp = data_dp.read_from_tar()
# data_dp = data_dp.filter(lambda x: _EXTRACTED_FILES[split] in x[0])


Any reason for keeping these lines commented out?

Contributor Author


Not really, I was benchmarking and it seems like I forgot to remove this piece of code :)

@parmeet
Contributor Author

parmeet commented Jan 19, 2022

BTW, I think this PR is now ready for final review :) @ejguan

@ejguan
Contributor

ejguan commented Jan 19, 2022

@parmeet Thanks a lot. Could you please add a commit to incorporate the lint requirements? You can use pre-commit: https://github.com/pytorch/data/blob/main/CONTRIBUTING.md#code-style

It will automatically apply lint changes when you use git commit.

@parmeet
Contributor Author

parmeet commented Jan 19, 2022

@parmeet Thanks a lot. Could you please add a commit to incorporate the lint requirements? You can use pre-commit: https://github.com/pytorch/data/blob/main/CONTRIBUTING.md#code-style

It will automatically apply lint changes when you use git commit.

Done! @Nayef211, @abhinavarora this looks cool, we should probably do this for torchtext?

Contributor

@ejguan ejguan left a comment


Thanks, LGTM

@Nayef211

@parmeet Thanks a lot. Could you please add a commit to incorporate the lint requirements? You can use pre-commit: https://github.com/pytorch/data/blob/main/CONTRIBUTING.md#code-style
It will automatically apply lint changes when you use git commit.

Done! @Nayef211, @abhinavarora this looks cool, we should probably do this for torchtext?

This does look really cool. @ejguan do you know if you can set this up with the formatter of your choice (i.e., black or autopep8)?

@parmeet
Contributor Author

parmeet commented Jan 19, 2022

This does look really cool. @ejguan do you know if you can set this up with the formatter of your choice (i.e., black or autopep8)?

I guess this might help answer some of it: https://github.com/pytorch/data/blob/main/.pre-commit-config.yaml

@ejguan
Contributor

ejguan commented Jan 19, 2022

This does look really cool. @ejguan do you know if you can set this up with the formatter of your choice (i.e., black or autopep8)?

Here is the configuration file https://github.com/pytorch/data/blob/main/.pre-commit-config.yaml and ufmt can be used for black. I believe you can add your own formatter to the configuration file.

Credit belongs to @pmeier, who helped us set up this auto-formatter in PR #147.

Edit: A workflow was added as well in https://github.com/pytorch/data/blob/c06066ae360fc6054fb826ae041b1cb0c09b2f3b/.github/workflows/lint.yml#L9-L27

@pmeier
Contributor

pmeier commented Jan 19, 2022

Let me know if another PyTorch repository needs help setting this up.

@parmeet
Contributor Author

parmeet commented Jan 19, 2022

Let me know if another PyTorch repository needs help setting this up.

Would be happy to get contributions to torchtext :)

@Nayef211

@ejguan just following up to see whether this PR is good to be merged?

@facebook-github-bot
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
