Resume download #320
Conversation
For reference: sample code that opens a gzip URL and decodes on-the-fly.
Force-pushed 927e462 to 59ac2cf.
zhangguanheng66 left a comment:
This is another example that we should unify the download func across DAPIs. @vincentqb @fmassa @cpuhrsch
pytorch/text#538
Yes, in an ideal world our tests do not depend on the internet. But it's hard to test a function that operates on URLs without URLs. I guess we could use a …
torchaudio/datasets/utils.py (Outdated)
    """Download a file from a url and place it in root.
    def download_url(url, download_folder, hash_value=None, hash_type="sha256"):
        """Execute the correct download operation.
        Depending on the size of the file online and offline, resume the
I'd make resuming an option, similar to how rsync lets you decide whether to simply overwrite or attempt to detect differences.
Yeah, that's reasonable.
torchaudio/datasets/utils.py (Outdated)
    def download_url(url, root, filename=None, md5=None):
        """Download a file from a url and place it in root.
    def download_url(url, download_folder, hash_value=None, hash_type="sha256"):
This makes the assumption that we want to download to a folder. Is it possible to split this out?
Yes, it can be done.
torchaudio/datasets/utils.py (Outdated)
    pbar.update(len(chunk))

    def validate_download_url(filepath, hash_value, hash_type="sha256"):
This works with any file right? So it could just be called "validate" and take a path. Even better, take a buffer or file-like object.
But then I need to fork the open/read statement :) What's the preferred way of doing this?
If you require a file-like object you don't need to check whether it's a path or not. You require the user to call this via

    validate_download_url(open(filepath), hash_value)

instead of

    validate_download_url(filepath, hash_value)

This then also enables them to call validate_download_url(stream_url(URL), hash_value). This could also return the input stream of data so it can be added into a pipeline and throw an error if it fails.
So you're suggesting removing the with open(filepath, "rb") statement?
Yes, and instead accept a file-like object.
Sure, I'm ok with that.
torchaudio/datasets/utils.py (Outdated)
    with open(filepath, mode) as fpointer, urllib.request.urlopen(
        req
    ) as upointer, tqdm(
I know there are tqdm options that prevent things from being printed, but I think this should also warrant a "quiet" or "verbose" flag. Something like this can fill up a log with a lot of lines of unimportant information. This kind of plays into a global logging level flag.
Indeed.
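A sketch of the quiet/verbose idea; tqdm does have a `disable` argument that could serve here, but to stay dependency-free this hypothetical `copy_stream` helper reports progress with plain stderr writes that a `quiet` flag suppresses:

```python
import sys


def _report(received, total, quiet):
    # Print a single carriage-return progress line unless quiet is set.
    if quiet:
        return
    pct = 100.0 * received / total if total else 0.0
    sys.stderr.write("\rdownloaded {}/{} bytes ({:.0f}%)".format(received, total, pct))


def copy_stream(upointer, fpointer, total=None, block_size=32 * 1024, quiet=False):
    """Copy a stream chunk by chunk, optionally reporting progress.

    With quiet=True nothing is printed, so batch jobs do not fill their
    logs with progress lines."""
    received = 0
    while True:
        chunk = upointer.read(block_size)
        if not chunk:
            break
        fpointer.write(chunk)
        received += len(chunk)
        _report(received, total, quiet)
    if not quiet:
        sys.stderr.write("\n")
    return received
```

A per-call flag like this could later defer to a global logging level, as suggested above.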
torchaudio/datasets/utils.py (Outdated)
    if not filename:
        filename = os.path.basename(url)
    fpath = os.path.join(root, filename)
    filepath = os.path.join(download_folder, os.path.basename(url))
Some URLs might not yield a sane filename, but people will still want to download from this. There's an option to attempt to detect the filename that the underlying file resolves to in https://github.com/pytorch/text/blob/master/torchtext/utils.py#L53
Yes, hopefully this PR helps settle on one to share across domains.
Until we define torchdata we'll likely copy-paste functions like this, although download has the potential of living within core pytorch since it's also required by torchhub.
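A hedged sketch of the filename-detection concern above; `filename_from_url` and its `default` fallback are hypothetical names, and unlike the torchtext helper this does not inspect the response's Content-Disposition header, only the URL path:

```python
import os
from urllib.parse import urlparse, unquote


def filename_from_url(url, default="downloaded_file"):
    """Best-effort filename from a URL path.

    Falls back to `default` when the path yields nothing usable, e.g. a
    bare host or a trailing slash. Percent-escapes are decoded so the
    local filename matches the remote one."""
    name = os.path.basename(unquote(urlparse(url).path))
    return name if name else default
```

Detecting the name the server actually resolves to would still require a request, as in the torchtext link above.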
torchaudio/datasets/utils.py (Outdated)
    download_folder (str): Folder to download file.
    filepath (str): File to read.
    hash_value (str): Hash for url.
    hash_type (str): Hash type.
We should enumerate the available options.
Agree.
Bonus points for a multithreaded stream_urls function that accepts a list of urls and yields the results URL by URL (or alternatively as_completed). For this you can create a buffer that you fill up with chunks concurrently and whose size can be specified by the user. My main worry here would be a timeout, since the URL chunks will be streamed sequentially. Presumably you can simply read ahead. Maybe this can be achieved by a generic Buffer function :) You can use concurrent.futures.ThreadPoolExecutor for that. Based on https://python3statement.org/ PyTorch will drop Python 2.7 support in 2020, so it's ok to have features that will not work with Python 2.7 at this point, as long as we still guard against them not being available.
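The multithreaded idea can be sketched as follows; `stream_urls` and the injected `fetch` callable (e.g. a wrapper that drains stream_url into bytes) are hypothetical names, not the PR's API:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def stream_urls(urls, fetch, max_workers=5, ordered=True):
    """Fetch several URLs concurrently, yielding (url, result) pairs.

    `fetch` is any callable taking a URL. With ordered=True, results are
    yielded in input order (later downloads still run in the background);
    with ordered=False they are yielded as_completed."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit everything up front so downloads overlap.
        futures = {executor.submit(fetch, url): url for url in urls}
        if ordered:
            for future in list(futures):  # dict preserves submission order
                yield futures[future], future.result()
        else:
            for future in as_completed(futures):
                yield futures[future], future.result()
```

A read-ahead buffer with a user-specified size, as suggested above, would go between `fetch` and the consumer; per-future timeouts can be added via `future.result(timeout=...)`.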
|
|
||
| def download_url(url, root, filename=None, md5=None): | ||
| """Download a file from a url and place it in root. | ||
| def stream_url(url, start_byte=None, block_size=32 * 1024, progress_bar=True): |
This could be an excellent target for a multithreaded buffer function :D
Indeed, see here :D
If the goal is to get an iterator over multiple files, then we can wrap stream_url. If instead we want to add parallelism at the stream_url level, then we'd need a buffer to make the download itself parallel. This is more specialized in its application though. Thoughts? Btw, to test with multithread:
torchaudio/datasets/utils.py (Outdated)
    )

    def download_url(urls, *args, max_workers=5, **kwargs):
This notation for the function signature, and concurrent.futures, are not supported in Python 2.
We could leave the multithreaded download for the first release with Python 3. Otherwise, I could add some mechanism to offer this functionality only when Python 3 is available and default to a single thread otherwise.
Sounds good
For reference: this is the commit to undo to add multidownload.
Force-pushed 33156c4 to 1369092.
Should we merge this?
Indeed, see here :D
Rebased. Ready when you are :)
torchaudio/datasets/utils.py (Outdated)
    else:
        raise ValueError

    with open(filepath, "rb") as f:
We're assuming a filepath as an input here for validation. I think this would be useful broadly for file-like objects. If you want to make this specific, I'd suggest something like validate_filepath(filepath): validate_file(open(filepath)).
I made validate_file take a file_obj and iterate through it. download_url calls it with with open(...).
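A sketch of that design: a `validate_file` that accepts any file-like object and an expected hex digest. The exact signature here is an assumption for illustration, not necessarily the merged code:

```python
import hashlib


def validate_file(file_obj, hash_value, hash_type="sha256"):
    """Validate a file-like object against an expected hex digest.

    Reads in 1 MiB chunks so large downloads are never held in memory
    whole. Callers with a path use: validate_file(open(path, "rb"), h).
    """
    if hash_type == "sha256":
        hash_func = hashlib.sha256()
    elif hash_type == "md5":
        hash_func = hashlib.md5()
    else:
        raise ValueError("unsupported hash_type: {}".format(hash_type))
    while True:
        chunk = file_obj.read(1024 ** 2)
        if not chunk:
            break
        hash_func.update(chunk)
    return hash_func.hexdigest() == hash_value
```

Because it only needs `.read(chunk_size)`, the same function validates an in-memory buffer or a streamed download.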
I think it's worthwhile standardizing on file-like objects and creating convenience wrapper functions for objects that can be referenced via a file path.
cpuhrsch left a comment:
LGTM
    while True:
        # Read by chunk to avoid filling memory
        chunk = f.read(1024 ** 2)
Does this function actually consume file_obj?
This consumes an object with read(chunk_size) signature, see line 147. Is that what you meant?
Yes, but should it be file_obj.read rather than f.read?
Thanks for pointing that out! Opened #352.
This adds resume to the download function, and a validate function with md5 or sha256, from pytorch/pytorch#24915.
Following this, it was tested using:
Do we have a url we can use to test? I'm leaning toward not having a test like this that would ping a url with every batch of tests.
See