fixing amazonreviewpolarity example #169
Conversation
```python
    filepath_fn=lambda x: os.path.join(root, os.path.dirname(_EXTRACTED_FILES[split]), os.path.basename(x)))
cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").read_from_tar()
cache_compressed_dp = cache_decompressed_dp.end_caching(mode="wb", same_filepath_fn=True)
data_dp = FileOpener(cache_decompressed_dp.filter(lambda x: _EXTRACTED_FILES[split] in x[0]).map(lambda x: x[0]), mode='b')
```
Hi @ejguan, it seems like end_caching for the decompressed files is not saving them to disk. I am getting the following error:
NotADirectoryError: [Errno 20] Not a directory: '/Users/parmeetbhatia/.torchtext/cache/AmazonReviewPolarity/amazon_review_polarity_csv.tar.gz/amazon_review_polarity_csv/train.csv'
Not sure what I am doing wrong here; could you please help investigate this?
After read_from_tar, the data becomes "(decompressed file path, file handle)", where the decompressed file path is amazon_review_polarity_csv/amazon_review_polarity_csv/train.csv. But since you used same_filepath_fn=True, the same function from on_disk_cache is applied to each item, so each file ends up at amazon_review_polarity_csv.tar.gz/amazon_review_polarity_csv/train.csv.
You can use end_caching(mode="wb") without specifying same_filepath_fn to get amazon_review_polarity_csv/amazon_review_polarity_csv/train.csv for each decompressed file.
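To make that concrete, here is a rough sketch of the suggested change, not the final fix; cache_decompressed_dp, _EXTRACTED_FILES, and split are the names from the diff above, and the exact paths produced depend on the archive layout:

```python
from torchdata.datapipes.iter import FileOpener

# Open the cached archive and iterate over (member path, stream) pairs.
cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").read_from_tar()

# Write each extracted member to disk. Without same_filepath_fn, end_caching
# derives the target from the member's own path instead of reusing the
# ".tar.gz" path from the outer on_disk_cache, avoiding the NotADirectoryError.
cache_decompressed_dp = cache_decompressed_dp.end_caching(mode="wb")

# end_caching yields the on-disk file paths, so we can filter on the path
# string and re-open only the csv for the requested split.
data_dp = FileOpener(
    cache_decompressed_dp.filter(lambda path: _EXTRACTED_FILES[split] in path),
    mode="b",
)
```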
```python
return check_filter_extracted_files.parse_csv().map(fn=lambda t: (int(t[0]), t[1]))
cache_compressed_dp = GDriveReader(cache_compressed_dp).end_caching(mode="wb", same_filepath_fn=True)
cache_decompressed_dp = cache_compressed_dp.on_disk_cache(
    filepath_fn=lambda x: os.path.join(root, os.path.dirname(_EXTRACTED_FILES[split]), os.path.basename(x)))
```
I think there is a problem. When we do the cache check at this stage, we want to check whether the decompressed files exist on the local file system. Then we need to use a generator function as the filepath_fn:
```python
def decompressed_file_fn(x):
    for f in ["train.csv", "test.csv", "readme.txt"]:
        yield os.path.join(root, os.path.dirname(_EXTRACTED_FILES[split]), ..., f)

cache_decompressed_dp = cache_compressed_dp.on_disk_cache(filepath_fn=decompressed_file_fn)
...
```

```python
cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").\
    read_from_tar().\
    filter(lambda x: _EXTRACTED_FILES[split] in x[0]).\
    map(lambda x: (x[0].replace('_PATH' + '/', ''), x[1]))
```
@ejguan I think the path returned by read_from_tar is still amazon_review_polarity_csv.tar.gz/amazon_review_polarity_csv/train.csv. I have to remove amazon_review_polarity_csv.tar.gz to make it work properly.
Ahh, actually ignore the "make it work properly" part. With same_filepath_fn=True in end_caching this part is irrelevant; thanks to @Nayef211 for catching that I am not using _PATH properly. That said, the extracted filename still contains amazon_review_polarity_csv.tar.gz; this is not really a concern though :)
Just double checking that we still plan to replace the '_PATH' with f'{_PATH}'?
I think we can get rid of the whole map operation here.
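Presumably something like the following, a sketch only, reusing the names from the diff above:

```python
# With same_filepath_fn=True handling the final on-disk location, rewriting the
# member path is unnecessary, so the chain reduces to open -> extract -> filter.
cache_decompressed_dp = (
    FileOpener(cache_decompressed_dp, mode="b")
    .read_from_tar()
    .filter(lambda x: _EXTRACTED_FILES[split] in x[0])
)
```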
```python
def extracted_filepath_fn(x):
    file_path = os.path.join(root, _EXTRACTED_FILES[split])
    dir_path = os.path.dirname(file_path)
```
@ejguan Another thing I realized while working on this is that end_caching won't be able to save a file to disk if the corresponding parent directory doesn't exist; I had to create it explicitly. Do you think it would make sense to upstream directory creation into torchdata? For instance, tar does the same and creates the appropriate directories when extracting files. I guess it would be non-intuitive for users to have to create directories explicitly before end_caching.
As you can see from the code https://github.com/pytorch/data/blob/c06066ae360fc6054fb826ae041b1cb0c09b2f3b/torchdata/datapipes/iter/util/saver.py#L39-L40, Saver creates the corresponding parent directory.
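For anyone hitting the same question, a tiny self-contained check along these lines should show the behavior; this is illustrative only, written against the Saver datapipe linked above, with an invented file name and path:

```python
import os
import tempfile

from torchdata.datapipes.iter import IterableWrapper, Saver

with tempfile.TemporaryDirectory() as tmp:
    # Target file sits under directories that do not exist yet.
    target = os.path.join(tmp, "nested", "dir", "out.txt")
    dp = Saver(
        IterableWrapper([("out.txt", b"hello")]),  # (metadata, data) pairs
        mode="wb",
        filepath_fn=lambda name: target,
    )
    list(dp)  # consuming the pipe triggers the write
    assert os.path.exists(target)  # parent directories were created by Saver
```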
Got it, I might have been doing something wrong. Just checked in the new code, and indeed it does not require explicitly creating the directory.
```python
return check_filter_extracted_files.parse_csv().map(fn=lambda t: (int(t[0]), t[1]))
cache_compressed_dp = GDriveReader(cache_compressed_dp).end_caching(mode="wb", same_filepath_fn=True)


def extracted_filepath_fn(x):
```
@ejguan I am not sure why it is necessary to pass all the files to the generator, as per your comment here (#169 (comment)). I guess it is sufficient to check only for the file(s) we are concerned with, right? Not sure if I am missing anything here.
The second on_disk_cache is used to cache the decompressed files from an archive. So on_disk_cache only gets the input archive file path, like xxx.tar.gz, but we want to check whether all decompressed files exist on your file system. This is a 1-to-N operation, so in order to check all of the decompressed files we have to accept a generator function as the filepath_fn.
I see, got it. Yeah, in this case I only need to check the existence of a single file (even though there are more in the archive). I guess it would be OK to specify only the files we are concerned with and filter out the remaining ones in the pipe?
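A sketch of that approach, for reference; the names follow the diff above, while the filepath_fn body and the filter are illustrative rather than the merged code:

```python
def extracted_filepath_fn(_):
    # Check only the csv this split needs (e.g. train.csv or test.csv); the
    # other members of the archive are filtered out of the pipe below.
    return os.path.join(root, _EXTRACTED_FILES[split])


cache_decompressed_dp = cache_compressed_dp.on_disk_cache(filepath_fn=extracted_filepath_fn)
cache_decompressed_dp = (
    FileOpener(cache_decompressed_dp, mode="b")
    .read_from_tar()
    # Keep only the split's csv so exactly one member reaches end_caching.
    .filter(lambda x: _EXTRACTED_FILES[split] in x[0])
    # Reuse extracted_filepath_fn so the member is written to the checked path.
    .end_caching(mode="wb", same_filepath_fn=True)
)
```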
```python
# data_dp = FileOpener(cache_compressed_dp, mode='b')
# data_dp = data_dp.read_from_tar()
# data_dp = data_dp.filter(lambda x: _EXTRACTED_FILES[split] in x[0])
```
Any reason for keeping these lines in comments?
Not really, I was benchmarking and it seems I forgot to remove this other piece of code :)
BTW, I think now this PR is ready for final review :) @ejguan

@parmeet Thanks a lot. Could you please add a commit to incorporate the lint requirements? You can use pre-commit: https://github.com/pytorch/data/blob/main/CONTRIBUTING.md#code-style It would automatically apply the lint changes when you commit.

Done! @Nayef211, @abhinavarora this looks cool; we should probably do this for torchtext too?
ejguan left a comment
Thanks, LGTM
This does look really cool. @ejguan do you know if you can set this up with the formatter of your choice (i.e. black or autopep8)?

I guess this might help answer some of that: https://github.com/pytorch/data/blob/main/.pre-commit-config.yaml

Here is the configuration file https://github.com/pytorch/data/blob/main/.pre-commit-config.yaml, and credit belongs to @pmeier, who helped us set up this auto-formatter in PR #147. Edit: a workflow was added as well in https://github.com/pytorch/data/blob/c06066ae360fc6054fb826ae041b1cb0c09b2f3b/.github/workflows/lint.yml#L9-L27

Let me know if another PyTorch repository needs help setting this up.

Would be happy to get contributions to torchtext :)

@ejguan just following up to see whether this PR is good to be merged?

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This PR fixes a couple of things: