
Conversation

@vincentqb
Contributor

@vincentqb vincentqb commented Oct 29, 2019

This adds resume support to the download function, and a validate function with md5 or sha256, adapted from pytorch/pytorch#24915.

Following this, it was tested using:

    def test_download_url(self):

        url = "http://www.patentsview.org/data/20171226/botanic.tsv.zip"
        hash_url = "94c642405619b20ecaf657b30e84bab787320649e751ed6ac629c0be613ded44"

        download_folder = "."
        download_url(url, download_folder)
        validate_download_url(url, download_folder, hash_url)

Do we have a url we can use to test? I'm leaning toward not having a test like this that would ping a url with every batch of tests.

See

@vincentqb vincentqb changed the title from "resume download, validate with md5 or sha256." to "Resume download" on Oct 29, 2019
@vincentqb
Contributor Author

For reference: sample code that opens a gzip url and decodes it on-the-fly.
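The linked sample isn't reproduced here, but a minimal sketch of the idea (illustrative names, assuming the server actually serves a gzip payload) could look like:

    import urllib.request
    import zlib


    def stream_gzip(url, chunk_size=32 * 1024):
        """Yield decompressed chunks of a gzip-compressed URL without buffering the whole file."""
        # wbits=16+MAX_WBITS makes zlib expect a gzip header and trailer.
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        with urllib.request.urlopen(url) as response:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                yield decompressor.decompress(chunk)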

@vincentqb vincentqb force-pushed the download branch 5 times, most recently from 927e462 to 59ac2cf Compare October 29, 2019 21:00

@zhangguanheng66 zhangguanheng66 left a comment


This is another example showing that we should unify the download function across DAPIs. @vincentqb @fmassa @cpuhrsch
pytorch/text#538

@cpuhrsch
Contributor

Yes, in an ideal world our tests would not depend on the internet. But it's hard to test a function that operates on URLs without URLs. I guess we could use file://-type URLs? That might be a bit harder to test rigorously within the CI.
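For example, a hedged sketch of such a test (assuming this PR's download_url(url, download_folder) signature; the test name and dummy payload are illustrative), which never touches the network:

    import os
    import pathlib
    import tempfile


    def test_download_local_file(self):
        with tempfile.TemporaryDirectory() as src_dir, tempfile.TemporaryDirectory() as dst_dir:
            # Create a local file to stand in for the remote resource.
            src_path = os.path.join(src_dir, "botanic.tsv.zip")
            with open(src_path, "wb") as fileobj:
                fileobj.write(b"dummy payload")

            # urllib also accepts file:// URIs, so download_url can be exercised offline
            # (assuming it passes the URL straight to urllib).
            url = pathlib.Path(src_path).as_uri()
            download_url(url, dst_dir)

            self.assertTrue(os.path.exists(os.path.join(dst_dir, "botanic.tsv.zip")))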

"""Download a file from a url and place it in root.
def download_url(url, download_folder, hash_value=None, hash_type="sha256"):
"""Execute the correct download operation.
Depending on the size of the file online and offline, resume the
Contributor


I'd make resuming an option, similar to how something like rsync allows you to decide whether you want to simply overwrite or attempt to detect differences.

Contributor Author

@vincentqb vincentqb Oct 30, 2019


Yeah, that's reasonable.
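A minimal sketch of that opt-in decision, with a hypothetical resume flag (only the start-offset logic is shown; download_url would then open the target in "ab" mode when resuming and "wb" otherwise):

    import os


    def _resolve_start_byte(filepath, resume):
        """Return the byte offset to resume from, or None to start from scratch."""
        if resume and os.path.exists(filepath):
            # Resume: request only the remaining bytes (e.g. via a Range header).
            return os.path.getsize(filepath)
        # Overwrite: ignore any partial file and download from the beginning.
        return None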


def download_url(url, root, filename=None, md5=None):
"""Download a file from a url and place it in root.
def download_url(url, download_folder, hash_value=None, hash_type="sha256"):
Contributor


This makes the assumption that we want to download to a folder. Is it possible to split this out?

Contributor Author


Yes, it can be done.
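One way to split it, sketched with illustrative names (download_to_path is not part of this PR), is to keep the folder variant as a thin wrapper around a path-based core:

    import os


    def download_to_path(url, filepath, hash_value=None, hash_type="sha256"):
        # Path-based core download; body elided in this sketch.
        ...


    def download_to_folder(url, download_folder, **kwargs):
        # Folder variant only derives the target path and delegates.
        filepath = os.path.join(download_folder, os.path.basename(url))
        return download_to_path(url, filepath, **kwargs)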

pbar.update(len(chunk))


def validate_download_url(filepath, hash_value, hash_type="sha256"):
Contributor

@cpuhrsch cpuhrsch Oct 30, 2019


This works with any file, right? So it could just be called "validate" and take a path. Even better, it could take a buffer or file-like object.

Contributor Author

@vincentqb vincentqb Oct 30, 2019


But then I need to fork the open/read statement :) What's the preferred way of doing this?

Contributor

@cpuhrsch cpuhrsch Oct 30, 2019


If you require a file-like object, you don't need to check whether it's a path or not. You'd require the user to call this via

validate_download_url(open(filepath), hash_value)
instead of
validate_download_url(filepath, hash_value)

This then also enables them to call validate_download_url(stream_url(URL), hash_value).

This could also return the input stream of data so it can be added into a pipeline, and throw an error if validation fails.

Contributor Author


So you're suggesting we remove the with open(filepath, "rb") statement?

Contributor


Yes, and instead accept a file-like object.

Contributor Author


Sure, I'm ok with that.
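A minimal sketch of that version, using hashlib (the name validate_file and the 1 MiB chunk size follow the discussion here, not necessarily the final API):

    import hashlib


    def validate_file(file_obj, hash_value, hash_type="sha256"):
        """Return True if the file-like object hashes to hash_value."""
        if hash_type == "sha256":
            hash_func = hashlib.sha256()
        elif hash_type == "md5":
            hash_func = hashlib.md5()
        else:
            raise ValueError("Unsupported hash type")

        while True:
            # Read by chunk to avoid filling memory with large files.
            chunk = file_obj.read(1024 ** 2)
            if not chunk:
                break
            hash_func.update(chunk)

        return hash_func.hexdigest() == hash_value

Callers with a path would then use validate_file(open(filepath, "rb"), hash_value), and anything exposing read() works the same way.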


    with open(filepath, mode) as fpointer, urllib.request.urlopen(
        req
    ) as upointer, tqdm(
Contributor


I know there are tqdm options that prevent things from being printed, but I think this also warrants a "quiet" or "verbose" flag. Something like this can fill up a log with a lot of lines of unimportant information. This kind of plays into a global logging-level flag.

Contributor Author


Indeed.
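A small sketch of the flag, relying on tqdm's disable option (the helper name and progress_bar parameter are assumptions):

    from tqdm import tqdm


    def _progress_bar(total, progress_bar=True):
        # disable=True turns tqdm into a no-op, so logs are not flooded with progress lines.
        return tqdm(total=total, unit="B", unit_scale=True, disable=not progress_bar)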

if not filename:
filename = os.path.basename(url)
fpath = os.path.join(root, filename)
filepath = os.path.join(download_folder, os.path.basename(url))
Contributor


Some URLs might not yield a sane filename, but people will still want to download from them. There's an option to attempt to detect the filename that the underlying file resolves to in https://github.com/pytorch/text/blob/master/torchtext/utils.py#L53
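A hedged sketch of that fallback (not the torchtext implementation itself): use the basename of the URL when it is sane, otherwise fall back to the URL the request actually resolves to after redirects.

    import os
    import urllib.request


    def _detect_filename(url):
        filename = os.path.basename(url)
        if filename:
            return filename
        with urllib.request.urlopen(url) as response:
            # geturl() returns the final URL after any redirects.
            return os.path.basename(response.geturl()) or "downloaded_file"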

@vincentqb
Contributor Author

This is another example showing that we should unify the download function across DAPIs.
pytorch/text#538

Yes, hopefully this PR helps settle on one to share across domains.

@cpuhrsch
Contributor

Until we define torchdata we'll likely copy-paste functions like this. Although download has the potential of living within core pytorch since it's also required by torchhub.

download_folder (str): Folder to download file.
filepath (str): File to read.
hash_value (str): Hash for url.
hash_type (str): Hash type.
Contributor


We should enumerate the available options

Contributor Author


Agree.

@cpuhrsch
Contributor

cpuhrsch commented Oct 30, 2019

Bonus points for a multithreaded stream_urls function that accepts a list of urls and yields the results URL by URL (or alternatively as_completed). For this you can create a buffer that you fill up with chunks concurrently and whose size can be specified by the user. My main worry here would be a timeout, since the URL chunks will be streamed sequentially. Presumably you can simply read ahead. Maybe this can be achieved by a generic Buffer function :)

You can use concurrent.futures.ThreadPoolExecutor for that. Based on https://python3statement.org/, PyTorch will drop Python 2.7 support in 2020. So it's ok to have features that will not work with Python 2.7 at this point, as long as we still guard against them not being available.


def download_url(url, root, filename=None, md5=None):
"""Download a file from a url and place it in root.
def stream_url(url, start_byte=None, block_size=32 * 1024, progress_bar=True):
Contributor


This could be an excellent target for a multithreaded buffer function :D

Contributor Author

@vincentqb vincentqb Oct 30, 2019


Indeed, see here :D

@vincentqb
Contributor Author

vincentqb commented Oct 30, 2019

Bonus points for a multithreaded stream_urls function that accepts a list of urls and yields the results URL by URL (or alternatively as_completed). For this you can create a buffer that you fill up with chunks concurrently and whose size can be specified by the user. My main worry here would be a timeout, since the URL chunks will be streamed sequentially. Presumably you can simply read ahead. Maybe this can be achieved by a generic Buffer function :)

You can use concurrent.futures.ThreadPoolExecutor for that. Based on https://python3statement.org/, PyTorch will drop Python 2.7 support in 2020. So it's ok to have features that will not work with Python 2.7 at this point, as long as we still guard against them not being available.

If the goal is to get an iterator over multiple files, then we can wrap download_url by simply applying it to a list of urls.

If instead we want to add parallelism at the stream_url level, then we'd also need a buffer to make the download parallel. This is more specialized in its application though. Thoughts?

Btw, to test with multiple threads:

    def test_download_multiple_url(self):

        url = "http://www.patentsview.org/data/20171226/botanic.tsv.zip"
        url = [url] * 6

        download_folder = "."
        fs = download_url(url, download_folder, hash_value=None)

)


def download_url(urls, *args, max_workers=5, **kwargs):
Contributor Author

@vincentqb vincentqb Oct 30, 2019


This notation for the function signature and concurrent.futures are not supported in Python 2.

We could leave the multithreaded download for the first release with Python 3. Otherwise, I could add some mechanism to offer this functionality only when Python 3 is available and default to a single thread otherwise.
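A sketch of that guard, assuming the single-URL download_url from this PR and an illustrative wrapper name; Python 2 simply falls back to a sequential loop:

    import sys


    def download_urls(urls, download_folder, max_workers=5):
        if sys.version_info[0] >= 3:
            from concurrent.futures import ThreadPoolExecutor, as_completed

            with ThreadPoolExecutor(max_workers=max_workers) as executor:
                futures = [executor.submit(download_url, url, download_folder) for url in urls]
                return [future.result() for future in as_completed(futures)]

        # Python 2: concurrent.futures is not in the standard library, download sequentially.
        return [download_url(url, download_folder) for url in urls]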

Contributor


Sounds good

Contributor Author

@vincentqb vincentqb Nov 5, 2019


For reference: this is the commit to revert in order to bring back the multithreaded download.

@vincentqb vincentqb force-pushed the download branch 2 times, most recently from 33156c4 to 1369092 Compare November 5, 2019 22:22
@vincentqb vincentqb mentioned this pull request Nov 5, 2019
@cpuhrsch
Contributor

Should we merge this?

@vincentqb
Contributor Author

Bonus points for a multithreaded stream_urls function that accepts a list of urls and yields the results URL by URL (or alternatively as_completed). For this you can create a buffer that you fill up with chunks concurrently and whose size can be specified by the user. My main worry here would be a timeout, since the URL chunks will be streamed sequentially. Presumably you can simply read ahead. Maybe this can be achieved by a generic Buffer function :)

You can use concurrent.futures.ThreadPoolExecutor for that. Based on https://python3statement.org/, PyTorch will drop Python 2.7 support in 2020. So it's ok to have features that will not work with Python 2.7 at this point, as long as we still guard against them not being available.

Indeed, see here :D

@vincentqb
Contributor Author

Should we merge this?

Rebased. Ready when you are :)

    else:
        raise ValueError

    with open(filepath, "rb") as f:
Contributor


We're assuming a filepath as an input here for validation. I think this would be useful more broadly for file-like objects. If you want to make this specific, I'd suggest something like validate_filepath(filepath): validate_file(open(filepath)).

Contributor Author


I made validate_file take a file_obj and iterate through it. download_url calls it within a with open(...) block.
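Roughly this calling pattern (a sketch reusing the validate_file helper sketched above; the wrapper name and error message are illustrative):

    def _check_download(filepath, hash_value, hash_type="sha256"):
        # Re-open the freshly downloaded file and hand the file object to validate_file.
        with open(filepath, "rb") as file_obj:
            if not validate_file(file_obj, hash_value, hash_type):
                raise RuntimeError("The hash of {} does not match.".format(filepath))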

@cpuhrsch
Contributor

I think it's worthwhile standardizing on file-like objects and creating convenience wrapper functions for objects that can be referenced via a file path.

Contributor

@cpuhrsch cpuhrsch left a comment


LGTM

@vincentqb vincentqb merged commit e0407b5 into pytorch:master Nov 14, 2019

    while True:
        # Read by chunk to avoid filling memory
        chunk = f.read(1024 ** 2)


Does this function actually consume file_obj?

Contributor Author


This consumes an object with a read(chunk_size) signature, see line 147. Is that what you meant?


Yes, but should it be file_obj.read rather than f.read?

Contributor Author

@vincentqb vincentqb Nov 25, 2019


Thanks for pointing that out! Opened #352.

