
Conversation

@vincentqb
Contributor

@vincentqb vincentqb commented Oct 29, 2019

This adds resume support to the download function, and a validate function with md5 or sha256, adapted from pytorch/pytorch#24915.

Following this, it was tested using:

    def test_download_url(self):

        url = "http://www.patentsview.org/data/20171226/botanic.tsv.zip"
        hash_url = "94c642405619b20ecaf657b30e84bab787320649e751ed6ac629c0be613ded44"

        download_folder = "."
        download_url(url, download_folder)
        validate_download_url(url, download_folder, hash_url)

Do we have a url we can use to test? I'm leaning toward not having a test like this that would ping a url with every batch of tests.

See

@vincentqb vincentqb changed the title from "resume download, validate with md5 or sha256." to "Resume download" on Oct 29, 2019
@vincentqb
Contributor Author

For reference: sample code that opens a gzip url and decodes it on-the-fly.
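The linked sample isn't reproduced here, but a minimal sketch of the idea (illustrative names, assuming the server actually serves a gzip payload) could look like:

    import urllib.request
    import zlib


    def stream_gzip(url, chunk_size=32 * 1024):
        """Yield decompressed chunks of a gzip-compressed URL without buffering the whole file."""
        # wbits=16+MAX_WBITS makes zlib expect a gzip header and trailer.
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        with urllib.request.urlopen(url) as response:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                yield decompressor.decompress(chunk)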

@vincentqb vincentqb force-pushed the download branch 5 times, most recently from 927e462 to 59ac2cf Compare October 29, 2019 21:00

@zhangguanheng66 zhangguanheng66 left a comment


This is another example showing that we should unify the download function across DAPIs. @vincentqb @fmassa @cpuhrsch
pytorch/text#538

@cpuhrsch
Contributor

Yes, in an ideal world our tests would not depend on the internet. But it's hard to test a function that operates on URLs without URLs. I guess we could use file://-type URLs? That might be a bit harder to test rigorously within the CI.
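For example, a hedged sketch of such a test (assuming this PR's download_url(url, download_folder) signature; the test name and dummy payload are illustrative), which never touches the network:

    import os
    import pathlib
    import tempfile


    def test_download_local_file(self):
        with tempfile.TemporaryDirectory() as src_dir, tempfile.TemporaryDirectory() as dst_dir:
            # Create a local file to stand in for the remote resource.
            src_path = os.path.join(src_dir, "botanic.tsv.zip")
            with open(src_path, "wb") as fileobj:
                fileobj.write(b"dummy payload")

            # urllib also accepts file:// URIs, so download_url can be exercised offline
            # (assuming it passes the URL straight to urllib).
            url = pathlib.Path(src_path).as_uri()
            download_url(url, dst_dir)

            self.assertTrue(os.path.exists(os.path.join(dst_dir, "botanic.tsv.zip")))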

"""Download a file from a url and place it in root.
def download_url(url, download_folder, hash_value=None, hash_type="sha256"):
"""Execute the correct download operation.
Depending on the size of the file online and offline, resume the
Contributor


I'd make resuming an option, similar to how something like rsync allows you to decide whether you want to simply overwrite or attempt to detect differences.

Contributor Author

@vincentqb vincentqb Oct 30, 2019


Yeah, that's reasonable.
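A minimal sketch of that opt-in decision, with a hypothetical resume flag (only the start-offset logic is shown; download_url would then open the target in "ab" mode when resuming and "wb" otherwise):

    import os


    def _resolve_start_byte(filepath, resume):
        """Return the byte offset to resume from, or None to start from scratch."""
        if resume and os.path.exists(filepath):
            # Resume: request only the remaining bytes (e.g. via a Range header).
            return os.path.getsize(filepath)
        # Overwrite: ignore any partial file and download from the beginning.
        return None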


def download_url(url, root, filename=None, md5=None):
"""Download a file from a url and place it in root.
def download_url(url, download_folder, hash_value=None, hash_type="sha256"):
Contributor


This makes the assumption that we want to download to a folder. Is it possible to split this out?

Contributor Author


Yes, it can be done.
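One way to split it, sketched with illustrative names (download_to_path is not part of this PR), is to keep the folder variant as a thin wrapper around a path-based core:

    import os


    def download_to_path(url, filepath, hash_value=None, hash_type="sha256"):
        # Path-based core download; body elided in this sketch.
        ...


    def download_to_folder(url, download_folder, **kwargs):
        # Folder variant only derives the target path and delegates.
        filepath = os.path.join(download_folder, os.path.basename(url))
        return download_to_path(url, filepath, **kwargs)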

pbar.update(len(chunk))


def validate_download_url(filepath, hash_value, hash_type="sha256"):
Contributor

@cpuhrsch cpuhrsch Oct 30, 2019


This works with any file, right? So it could just be called "validate" and take a path. Even better, it could take a buffer or file-like object.

Contributor Author

@vincentqb vincentqb Oct 30, 2019


But then I need to fork the open/read statement :) What's the preferred way of doing this?

Contributor

@cpuhrsch cpuhrsch Oct 30, 2019


If you require a file-like object, you don't need to check whether it's a path or not. You'd require the user to call this via

validate_download_url(open(filepath), hash_value)
instead of
validate_download_url(filepath, hash_value)

This then also enables them to call validate_download_url(stream_url(URL), hash_value).

This could also return the input stream of data so it can be added into a pipeline, and throw an error if validation fails.

Contributor Author


So you're suggesting we remove the with open(filepath, "rb") statement?

Contributor


Yes, and instead accept a file-like object.

Contributor Author


Sure, I'm ok with that.
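A minimal sketch of that version, using hashlib (the name validate_file and the 1 MiB chunk size follow the discussion here, not necessarily the final API):

    import hashlib


    def validate_file(file_obj, hash_value, hash_type="sha256"):
        """Return True if the file-like object hashes to hash_value."""
        if hash_type == "sha256":
            hash_func = hashlib.sha256()
        elif hash_type == "md5":
            hash_func = hashlib.md5()
        else:
            raise ValueError("Unsupported hash type")

        while True:
            # Read by chunk to avoid filling memory with large files.
            chunk = file_obj.read(1024 ** 2)
            if not chunk:
                break
            hash_func.update(chunk)

        return hash_func.hexdigest() == hash_value

Callers with a path would then use validate_file(open(filepath, "rb"), hash_value), and anything exposing read() works the same way.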


    with open(filepath, mode) as fpointer, urllib.request.urlopen(
        req
    ) as upointer, tqdm(
Contributor


I know there are tqdm options that prevent things from being printed, but I think this also warrants a "quiet" or "verbose" flag. Something like this can fill up a log with a lot of lines of unimportant information. This kind of plays into a global logging-level flag.

Contributor Author


Indeed.
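A small sketch of the flag, relying on tqdm's disable option (the helper name and progress_bar parameter are assumptions):

    from tqdm import tqdm


    def _progress_bar(total, progress_bar=True):
        # disable=True turns tqdm into a no-op, so logs are not flooded with progress lines.
        return tqdm(total=total, unit="B", unit_scale=True, disable=not progress_bar)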

if not filename:
filename = os.path.basename(url)
fpath = os.path.join(root, filename)
filepath = os.path.join(download_folder, os.path.basename(url))
Contributor


Some URLs might not yield a sane filename, but people will still want to download from them. There's an option to attempt to detect the filename that the underlying file resolves to in https://github.com/pytorch/text/blob/master/torchtext/utils.py#L53
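A hedged sketch of that fallback (not the torchtext implementation itself): use the basename of the URL when it is sane, otherwise fall back to the URL the request actually resolves to after redirects.

    import os
    import urllib.request


    def _detect_filename(url):
        filename = os.path.basename(url)
        if filename:
            return filename
        with urllib.request.urlopen(url) as response:
            # geturl() returns the final URL after any redirects.
            return os.path.basename(response.geturl()) or "downloaded_file"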

@vincentqb
Contributor Author

This is another example showing that we should unify the download function across DAPIs.
pytorch/text#538

Yes, hopefully this PR helps settle on one to share across domains.

@cpuhrsch
Contributor

Until we define torchdata we'll likely copy-paste functions like this. Although download has the potential of living within core pytorch since it's also required by torchhub.

download_folder (str): Folder to download file.
filepath (str): File to read.
hash_value (str): Hash for url.
hash_type (str): Hash type.
Contributor


We should enumerate the available options

Contributor Author


Agree.

@cpuhrsch
Contributor

cpuhrsch commented Oct 30, 2019

Bonus points for a multithreaded stream_urls function that accepts a list of urls and yields the results URL by URL (or alternatively as_completed). For this you can create a buffer that you fill up with chunks concurrently and whose size can be specified by the user. My main worry here would be a timeout, since the URL chunks will be streamed sequentially. Presumably you can simply read ahead. Maybe this can be achieved by a generic Buffer function :)

You can use concurrent.futures.ThreadPoolExecutor for that. Based on https://python3statement.org/, PyTorch will drop Python 2.7 support in 2020. So it's ok to have features that will not work with Python 2.7 at this point, as long as we still guard against them not being available.


def download_url(url, root, filename=None, md5=None):
"""Download a file from a url and place it in root.
def stream_url(url, start_byte=None, block_size=32 * 1024, progress_bar=True):
Contributor


This could be an excellent target for a multithreaded buffer function :D

Contributor Author

@vincentqb vincentqb Oct 30, 2019


Indeed, see here :D

@vincentqb
Contributor Author

vincentqb commented Oct 30, 2019

Bonus points for a multithreaded stream_urls function that accepts a list of urls and yields the results URL by URL (or alternatively as_completed). For this you can create a buffer that you fill up with chunks concurrently and whose size can be specified by the user. My main worry here would be a timeout, since the URL chunks will be streamed sequentially. Presumably you can simply read ahead. Maybe this can be achieved by a generic Buffer function :)

You can use concurrent.futures.ThreadPoolExecutor for that. Based on https://python3statement.org/, PyTorch will drop Python 2.7 support in 2020. So it's ok to have features that will not work with Python 2.7 at this point, as long as we still guard against them not being available.

If the goal is to get an iterator over multiple files, then we can wrap download_url by simply applying it to a list of urls.

If instead we want to add parallelism at the stream_url level, then we'd also need a buffer to make the download parallel. This is more specialized in its application though. Thoughts?

Btw, to test with multiple threads:

    def test_download_multiple_url(self):

        url = "http://www.patentsview.org/data/20171226/botanic.tsv.zip"
        url = [url] * 6

        download_folder = "."
        fs = download_url(url, download_folder, hash_value=None)

)


def download_url(urls, *args, max_workers=5, **kwargs):
Contributor Author

@vincentqb vincentqb Oct 30, 2019


This notation for the function signature and concurrent.futures are not supported in Python 2.

We could leave the multithreaded download for the first release with Python 3. Otherwise, I could add some mechanism to offer this functionality only when Python 3 is available and default to a single thread otherwise.
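A sketch of that guard, assuming the single-URL download_url from this PR and an illustrative wrapper name; Python 2 simply falls back to a sequential loop:

    import sys


    def download_urls(urls, download_folder, max_workers=5):
        if sys.version_info[0] >= 3:
            from concurrent.futures import ThreadPoolExecutor, as_completed

            with ThreadPoolExecutor(max_workers=max_workers) as executor:
                futures = [executor.submit(download_url, url, download_folder) for url in urls]
                return [future.result() for future in as_completed(futures)]

        # Python 2: concurrent.futures is not in the standard library, download sequentially.
        return [download_url(url, download_folder) for url in urls]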

Contributor


Sounds good

Contributor Author

@vincentqb vincentqb Nov 5, 2019


For reference: this is the commit to revert in order to bring back the multithreaded download.

@vincentqb vincentqb force-pushed the download branch 2 times, most recently from 33156c4 to 1369092 Compare November 5, 2019 22:22
@vincentqb vincentqb mentioned this pull request Nov 5, 2019
@cpuhrsch
Contributor

Should we merge this?

@vincentqb
Contributor Author

Bonus points for a multithreaded stream_urls function that accepts a list of urls and yields the results URL by URL (or alternatively as_completed). For this you can create a buffer that you fill up with chunks concurrently and whose size can be specified by the user. My main worry here would be a timeout, since the URL chunks will be streamed sequentially. Presumably you can simply read ahead. Maybe this can be achieved by a generic Buffer function :)

You can use concurrent.futures.ThreadPoolExecutor for that. Based on https://python3statement.org/, PyTorch will drop Python 2.7 support in 2020. So it's ok to have features that will not work with Python 2.7 at this point, as long as we still guard against them not being available.

Indeed, see here :D

@vincentqb
Contributor Author

Should we merge this?

Rebased. Ready when you are :)

    else:
        raise ValueError

    with open(filepath, "rb") as f:
Contributor


We're assuming a filepath as an input here for validation. I think this would be useful more broadly for file-like objects. If you want to make this specific, I'd suggest something like validate_filepath(filepath): validate_file(open(filepath)).

Contributor Author


I made validate_file take a file_obj and iterate through it. download_url calls it within a with open(...) block.
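Roughly this calling pattern (a sketch reusing the validate_file helper sketched above; the wrapper name and error message are illustrative):

    def _check_download(filepath, hash_value, hash_type="sha256"):
        # Re-open the freshly downloaded file and hand the file object to validate_file.
        with open(filepath, "rb") as file_obj:
            if not validate_file(file_obj, hash_value, hash_type):
                raise RuntimeError("The hash of {} does not match.".format(filepath))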

@cpuhrsch
Contributor

I think it's worthwhile standardizing on file-like objects and creating convenience wrapper functions for objects that can be referenced via a file path.

Contributor

@cpuhrsch cpuhrsch left a comment


LGTM

@vincentqb vincentqb merged commit e0407b5 into pytorch:master Nov 14, 2019

    while True:
        # Read by chunk to avoid filling memory
        chunk = f.read(1024 ** 2)


Does this function actually consume file_obj?

Contributor Author


This consumes an object with a read(chunk_size) signature, see line 147. Is that what you meant?


Yes, but should it be file_obj.read rather than f.read?

Contributor Author

@vincentqb vincentqb Nov 25, 2019


Thanks for pointing that out! Opened #352.

