Skip to content

Conversation

@vincentqb
Copy link
Contributor

@vincentqb vincentqb commented Oct 30, 2019

We can prefetch the upcoming values of a generator. Since a generator may be strictly sequential, this computes new values in the background but not in parallel unlike here.

See source.

@vincentqb vincentqb mentioned this pull request Oct 30, 2019
@cpuhrsch
Copy link
Contributor

cpuhrsch commented Nov 5, 2019

Two suggestions:

  1. I think a shorter name could be better. Maybe bg_iterator akin to the bg command in bash.
  2. Wrap this into a function such as bg_iterator(iterable, queue_size=1) to make it more consistent with download, iterator etc.

def run(self):
for item in self.generator:
self.queue.put(item)
self.queue.put(None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An iterator might very well return None (such as indicating an empty datapoint). We should use a custom poison bill, something like a very specific class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created a class _End to be the end of queue, and checking for this.

@vincentqb
Copy link
Contributor Author

  1. Wrap this into a function such as bg_iterator(iterable, queue_size=1) to make it more consistent with download, iterator etc.

I've just renamed the class. What would wrapping the class in a function add? The notation is already that of a function:

for i in bg_iterator(range(10)):
  print(i)

@cpuhrsch
Copy link
Contributor

cpuhrsch commented Nov 5, 2019

Mostly consistency. Classes are supposed to use CamelCase and factory functions aren't an uncommon pattern. It also allows us to write simple factories that initialize a complex object. There might be a few more advantages, we should think about it.

@vincentqb
Copy link
Contributor Author

Mostly consistency. Classes are supposed to use CamelCase and factory functions aren't an uncommon pattern. It also allows us to write simple factories that initialize a complex object. There might be a few more advantages, we should think about it.

We should respect the naming convention, so I'd be in favor of BGIterator(generator) just like DiskCache(generator) we introduced.

You have in mind a function like this?

class _BGIterator:
    ...
      
def bg_iterator(*args, **kwargs):
    return _BGIterator(*args, **kwargs)

@cpuhrsch
Copy link
Contributor

cpuhrsch commented Nov 5, 2019

It could be

def bg_iterator(iterable, queue_size):
    return _ThreadedIterator(iterable, num_threads=1, queue_size=queue_size)

def pool_iterator(iterable, num_threads, queue_size):
    return _ThreadedIterator(iterable, num_threads=num_threads, queue_size=queue_size)

@vincentqb
Copy link
Contributor Author

def bg_iterator(iterable, queue_size):
    return _ThreadedIterator(iterable, num_threads=1, queue_size=queue_size)

Ok. Should we then also change the DiskCache?

def diskcache_iterator(dataset, location=".cached"):
    return _DiskCache(dataset, location)
def pool_iterator(iterable, num_threads, queue_size):
    return _ThreadedIterator(iterable, num_threads=num_threads, queue_size=queue_size)

Since an iterator may be sequential, the class simply detaches a thread to run separately. I'd put this out of the scope of this PR.

To benefit from having many workers when the iterator is not sequential, we would then need to add something like concurrent futures in the thread.

@vincentqb vincentqb changed the title [WIP] Background generator Background generator Nov 5, 2019
@cpuhrsch
Copy link
Contributor

cpuhrsch commented Nov 5, 2019

Ideally DiskCache is setup so that it can be combined with bg_iterator.

Maybe this works: disk_cache writes the iterator values to disk as it reads it and bg_iterator allows that process to happen in the background.

@vincentqb
Copy link
Contributor Author

Ideally DiskCache is setup so that it can be combined with bg_iterator.

Maybe this works: disk_cache writes the iterator values to disk as it reads it and bg_iterator allows that process to happen in the background.

yup :)

@vincentqb
Copy link
Contributor Author

Ok. Should we then also change the DiskCache?

def diskcache_iterator(dataset, location=".cached"):
    return _DiskCache(dataset, location)

Done.

@cpuhrsch
Copy link
Contributor

cpuhrsch commented Nov 5, 2019

Yeah I think that's a good idea, if you agree. Ok, so at this point this looks good, I'll accept it :)

@vincentqb vincentqb merged commit 6db2ad1 into pytorch:master Nov 6, 2019
@vincentqb vincentqb deleted the backgroundgen branch November 6, 2019 14:43
@vincentqb
Copy link
Contributor Author

Relates to pytorch/pytorch#41292

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants