Description
🐛 Bug
All downloaded files are empty when downloading from Google Drive via torchvision.datasets.utils.download_file_from_google_drive (or via functions that resolve to it, e.g. download_url).
To Reproduce
Steps to reproduce the behavior:
- Install torchvision (tested with master branch at 9596668)
- Run the code below
from torchvision.datasets.utils import download_url, download_file_from_google_drive

try:
    download_url(
        "http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz",
        "./caltech101",
        filename="101_ObjectCategories.tar.gz",
        md5="b224c7392d521a49829488ab0f1120d9")
except:
    pass
finally:
    folder = './miniimagenet'
    gdrive_id = '16V_ZlkW4SsnNDtnGmaBRq2OoPmUOc5mY'
    gz_filename = 'mini-imagenet.tar.gz'
    gz_md5 = 'b38f1eb4251fb9459ecc8e7febf9b2eb'
    download_file_from_google_drive(gdrive_id, folder, gz_filename, md5=gz_md5)
- Afterwards, both 101_ObjectCategories.tar.gz and mini-imagenet.tar.gz are empty files
Expected behavior
The download fails explicitly if the Google Drive quota is exceeded, and succeeds otherwise.
Additional context & Error source
Related Issues #3708 #2992
Related PRs: #3710 #3035
The issue stems from the fact that Google Drive enforces a quota on downloads and that the returned response.status_code cannot be used to check the quota consistently (see #2992). As a workaround, we therefore need to check the payload for the corresponding string via _quota_exceeded. Before #3710, this required parsing the whole payload/content, which was infeasible and was therefore disabled in #3035.
However, the solution proposed in #3710 breaks torchvision.datasets.utils.download_file_from_google_drive. The reason is that one should iterate over the content of a Response only once; see the requests documentation on streaming content.
In contrast, we currently construct iterators twice from a streaming Response: first in the quota check, then again when saving the content to disk.
As a result, the second iterator yields nothing, and the files written to disk are therefore empty.
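The single-pass nature of a stream can be illustrated without any network access. The sketch below is only an analogy: it uses an in-memory io.BytesIO in place of a streaming Response, and the helper name chunk_iter is hypothetical. The exact failure mode inside requests may differ, but the underlying point is the same: iterators built from one stream share its read position.

```python
import io


def chunk_iter(stream, chunk_size):
    """Yield successive chunks from a file-like object until it is exhausted.

    Hypothetical helper, standing in for building an iterator over a
    streaming response body.
    """
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


stream = io.BytesIO(b"0123456789")

# First iterator: peek at a single chunk, as a quota check might do.
first = chunk_iter(stream, 4)
peeked = next(first)

# Second iterator over the SAME stream: it resumes where the first one
# stopped, so the peeked chunk is missing from what it yields.
second = chunk_iter(stream, 4)
remaining = list(second)

# Third iterator: the stream is already drained, so nothing is yielded.
third = chunk_iter(stream, 4)
leftover = list(third)
```

Anything read through one iterator is gone for every later iterator, which is why the quota check and the save step cannot each build their own.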
Proposed solution
Since we do not want to issue the same request to Google Drive twice (the response to the second request may differ from the first), I suggest that we
- construct the iterator and extract its first chunk inside download_file_from_google_drive
- pass only the first chunk to _quota_exceeded
- pass the first chunk plus the partially consumed iterator to _save_response_content if the quota check passes
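The three steps above could be sketched roughly as follows. This is a minimal illustration and not the code in the linked branch: the function names quota_exceeded_in, save_chunks and download_stream are hypothetical, the quota marker string is an assumption, and a plain iterator of byte chunks stands in for the streamed response content.

```python
import io
import itertools

# Assumed marker text; the real string checked by torchvision may differ.
QUOTA_MARKER = b"Google Drive - Quota exceeded"


def quota_exceeded_in(first_chunk):
    """Check only the first chunk for the quota message."""
    return isinstance(first_chunk, bytes) and QUOTA_MARKER in first_chunk


def save_chunks(chunks, fh):
    """Write every non-empty chunk to an open binary file handle."""
    for chunk in chunks:
        if chunk:
            fh.write(chunk)


def download_stream(content_iter, fh):
    """Peek at the first chunk, run the quota check, then save everything."""
    # Build the iterator once and pull out its first chunk.
    first_chunk = next(content_iter, None)
    if first_chunk is None:
        raise RuntimeError("The response contained no data.")
    if quota_exceeded_in(first_chunk):
        raise RuntimeError("The download quota of the file is exceeded.")
    # Re-attach the peeked chunk in front of the partially consumed iterator.
    save_chunks(itertools.chain((first_chunk,), content_iter), fh)


# Usage with in-memory stand-ins for the response content and the target file.
ok_sink = io.BytesIO()
download_stream(iter([b"abc", b"def", b"g"]), ok_sink)
saved = ok_sink.getvalue()

try:
    download_stream(iter([b"<html>Google Drive - Quota exceeded</html>"]), io.BytesIO())
    quota_error_raised = False
except RuntimeError:
    quota_error_raised = True
```

Chaining the peeked chunk back onto the same iterator means the content is read exactly once and no second request is needed.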
I have begun working on such a solution here:
https://github.com/ORippler/vision/tree/fix_google_drive_quotacheck