Downloads from Google Drive return empty files / are still broken #4108

@ORippler

🐛 Bug

All downloaded files are empty when downloading from Google Drive via torchvision.datasets.utils.download_file_from_google_drive (or via functions that can resolve to it, e.g. download_url).

To Reproduce

Steps to reproduce the behavior:

  1. Install torchvision (tested with the master branch at 9596668).
  2. Run the code below:
from torchvision.datasets.utils import download_url, download_file_from_google_drive

try:
    download_url(
        "http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz",
        "./caltech101",
        filename="101_ObjectCategories.tar.gz",
        md5="b224c7392d521a49829488ab0f1120d9")
except:
    pass
finally:
    folder = './miniimagenet'
    gdrive_id = '16V_ZlkW4SsnNDtnGmaBRq2OoPmUOc5mY'
    gz_filename = 'mini-imagenet.tar.gz'
    gz_md5 = 'b38f1eb4251fb9459ecc8e7febf9b2eb'
    download_file_from_google_drive(gdrive_id, folder, gz_filename, md5=gz_md5)
  3. Afterwards, both 101_ObjectCategories.tar.gz and mini-imagenet.tar.gz are empty files.

Expected behavior

The download fails explicitly if the Google Drive quota is exceeded, and succeeds otherwise.

Additional context & Error source

Related issues: #3708, #2992
Related PRs: #3710, #3035

The issue stems from the fact that Google Drive enforces a quota on downloads, and that the returned response.status_code cannot be used to detect an exceeded quota reliably (see #2992). As a workaround, we need to check the payload for the corresponding string via _quota_exceeded. Before #3710, this required parsing the whole payload/content, which was infeasible and was therefore disabled in #3035.

However, the solution proposed in #3710 breaks torchvision.datasets.utils.download_file_from_google_drive. The reason is that the content of a streaming Response should only be iterated over once; see the requests documentation on streaming content.

In contrast, we currently construct iterators twice from the same streaming Response: first for the quota check, and a second time when saving the content to disk.

As a result, the second iterator yields no data, and the files written to disk are therefore empty.

Proposed solution

Since we do not want to issue the same request to Google Drive twice (the second response may differ from the first), I suggest that we:

  1. construct the iterator and extract its first chunk inside download_file_from_google_drive
  2. pass only the first chunk to _quota_exceeded
  3. pass the first chunk plus the partially consumed iterator to _save_response_content if the quota check passes
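
The three steps above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from the linked branch: the helper names (peek_first_chunk, quota_exceeded, save_chunks, download) and the QUOTA_MARKER string are hypothetical placeholders, and `chunks` stands in for what would presumably be `response.iter_content(chunk_size)` on the streaming Response.

```python
import itertools
from typing import BinaryIO, Iterable, Iterator, Optional, Tuple

# Placeholder marker (assumption): the exact text Google Drive returns
# when the quota is exceeded is not given here.
QUOTA_MARKER = b"Quota exceeded"


def peek_first_chunk(chunks: Iterable[bytes]) -> Tuple[Optional[bytes], Iterator[bytes]]:
    """Step 1: build the iterator once and pull its first chunk.

    Returns the first chunk (or None if the stream is empty) together with an
    iterator that yields the first chunk again, followed by the remainder.
    """
    iterator = iter(chunks)
    try:
        first = next(iterator)
    except StopIteration:
        return None, iter(())
    # Re-attach the consumed chunk in front of the partially consumed iterator.
    return first, itertools.chain([first], iterator)


def quota_exceeded(first_chunk: Optional[bytes]) -> bool:
    """Step 2: inspect only the first chunk for the quota message."""
    return first_chunk is not None and QUOTA_MARKER in first_chunk


def save_chunks(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Step 3: write every chunk to the sink and return the byte count."""
    written = 0
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            sink.write(chunk)
            written += len(chunk)
    return written


def download(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Check the quota on the first chunk, then save the full content."""
    first, content = peek_first_chunk(chunks)
    if quota_exceeded(first):
        raise RuntimeError("Download quota exceeded; try again later.")
    return save_chunks(content, sink)
```

Because the underlying iterator is created exactly once, the quota check and the write to disk share a single pass over the response, and no second request is needed.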

I have begun working on such a solution here:

https://github.com/ORippler/vision/tree/fix_google_drive_quotacheck

@pmeier
