Description
Fundamentally I see three types of weighted-choice algorithm:
- Calculate `weight_sum`, take `sample = rng.gen_range(0, weight_sum)`, iterate over elements until the cumulative weight exceeds `sample`, then take the previous item.
- Calculate a CDF of weights (probably just an array of cumulative weights), take `sample` as above, then find the item by binary search; look up the element from the index.
- As follows:
```rust
use rand::distributions::uniform::SampleUniform;
use rand::Rng;

fn choose_weighted<R, F, I, T, X>(mut items: I, weight_fn: F, rng: &mut R) -> Option<T>
where
    R: Rng + ?Sized,
    I: Iterator<Item = T>,
    F: Fn(&T) -> X,
    X: SampleUniform + Default + Copy + ::core::ops::AddAssign<X> + ::core::cmp::PartialOrd<X>,
{
    // The first item is the provisional result; an empty iterator yields None.
    let mut result = items.next()?;
    let mut sum = weight_fn(&result);
    for item in items {
        let weight = weight_fn(&item);
        sum += weight;
        // Replace the current result with probability weight / sum.
        // X::default() is assumed to be the zero of X.
        if rng.gen_range(X::default(), sum) < weight {
            result = item;
        }
    }
    Some(result)
}
```
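For comparison, the first option in the list might look roughly like the sketch below. It is only an illustration: `weight_sum` and `pick_linear` are hypothetical names, weights are fixed to `u32`, and `sample` is assumed to have been drawn by the caller (e.g. via `rng.gen_range(0, weight_sum)`), so the snippet needs no RNG. Here the running total already includes the current item's weight when it is compared, so the current index is the one returned.

```rust
/// Hypothetical helper: total weight of all items.
fn weight_sum(weights: &[u32]) -> u32 {
    weights.iter().sum()
}

/// Linear scan: returns the index of the item whose weight pushes the
/// running total past `sample`. `sample` is assumed to lie in
/// `[0, weight_sum)`; anything else (or an empty slice) gives `None`.
fn pick_linear(weights: &[u32], sample: u32) -> Option<usize> {
    let mut cumulative = 0u32;
    for (index, &weight) in weights.iter().enumerate() {
        cumulative += weight;
        if sample < cumulative {
            return Some(index);
        }
    }
    None
}
```

With weights `[1, 2, 3]`, samples `0`, `1..=2` and `3..=5` map to indices 0, 1 and 2 respectively, so each index is hit in proportion to its weight.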
Where one wants to sample from the same set of weights multiple times, calculating a CDF is the obvious choice since the CDF should require no more memory than the original weights themselves.
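A minimal sketch of the CDF option, assuming `u32` weights and a caller-supplied `sample` in `[0, total)`; `build_cdf` and `pick_cdf` are hypothetical names and not the API proposed for Rand. The cumulative array is built once and can then be reused for any number of samples, each costing one binary search.

```rust
/// Builds the array of cumulative weights (one entry per item).
fn build_cdf(weights: &[u32]) -> Vec<u32> {
    let mut cdf = Vec::with_capacity(weights.len());
    let mut total = 0u32;
    for &weight in weights {
        total += weight;
        cdf.push(total);
    }
    cdf
}

/// Binary search for the first cumulative weight strictly greater than
/// `sample`. Returns `None` if `sample` is not below the total weight.
fn pick_cdf(cdf: &[u32], sample: u32) -> Option<usize> {
    // `partition_point` counts the leading entries that are <= sample.
    let index = cdf.partition_point(|&c| c <= sample);
    if index < cdf.len() {
        Some(index)
    } else {
        None
    }
}
```

Because the comparison is `c <= sample`, an item with zero weight is skipped rather than selected.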
Where one wants to sample a single time from a slice, one of the first two choices makes the most sense. Calculating the total weight requires all the work of calculating the CDF except storing the results, so using the CDF may often be the best option, but this isn't guaranteed.
Where one wants to sample a single time from an iterator, any of the above can be used, but the first two options require either cloning the iterator and iterating twice (not always possible and slightly expensive) or collecting all items into a temporary vector while calculating the sum/CDF, then selecting the required item. In this case the last option may be attractive, though of course sampling the RNG for every item has significant overhead (so it is probably only useful for large elements or when no allocator is available).
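To make the "collect into a temporary vector" variant concrete, here is a rough sketch under some assumptions: weights are `u32`, and the hypothetical `draw` closure stands in for `rng.gen_range(0, total)`, since the sample can only be drawn once the total is known. `choose_collected` is an illustrative name only.

```rust
/// Single pass over the iterator that stores every item alongside its
/// cumulative weight, then selects one item by binary search.
fn choose_collected<I, T, F, D>(items: I, weight_fn: F, draw: D) -> Option<T>
where
    I: Iterator<Item = T>,
    F: Fn(&T) -> u32,
    D: FnOnce(u32) -> u32, // assumed to return a value in [0, total)
{
    let mut collected: Vec<T> = Vec::new();
    let mut cdf: Vec<u32> = Vec::new();
    let mut total = 0u32;
    for item in items {
        total += weight_fn(&item);
        cdf.push(total);
        collected.push(item);
    }
    if total == 0 {
        // No items, or all weights are zero: nothing can be sampled.
        return None;
    }
    let sample = draw(total);
    // Index of the first cumulative weight strictly greater than `sample`.
    let index = cdf.partition_point(|&c| c <= sample);
    collected.into_iter().nth(index)
}
```

This needs two allocations that grow with the number of items, which is the cost the single-pass option avoids.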
Which algorithm(s) should we include in Rand?
The method calculating the CDF will often be preferred, so should be included. Unfortunately it requires an allocator (except when weights are provided via a mutable reference to a slice), but we should probably not worry about this.
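The allocator-free exception could look something like the following sketch, which overwrites the caller's weights with their running totals. The names `accumulate_in_place` and `index_for` are hypothetical, weights are assumed to be `u32`, and `sample` is assumed to be drawn by the caller from `[0, total)`.

```rust
/// Turns a slice of weights into a slice of cumulative weights in place
/// and returns the total. The original weights are lost.
fn accumulate_in_place(weights: &mut [u32]) -> u32 {
    let mut total = 0u32;
    for w in weights.iter_mut() {
        total += *w;
        *w = total;
    }
    total
}

/// Binary search over the in-place CDF; uses only slice methods from
/// `core`, so no allocation is involved.
fn index_for(cdf: &[u32], sample: u32) -> Option<usize> {
    let index = cdf.partition_point(|&c| c <= sample);
    if index < cdf.len() { Some(index) } else { None }
}
```

The trade-off is that the caller gives up the original weights, which is presumably why this would be an opt-in variant rather than the default.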
A convenience method to sample from weighted slices would presumably prefer to use the CDF method normally.
For a method to sample from weighted iterators it is less clear which implementation should be used. Although it will not perform well, the last algorithm (i.e. sample code above) may be a nice choice in that it does not require an allocator.
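As a sanity check that the single-pass sample code selects each item with the intended probability, here is a short derivation. The notation is introduced only for this note: $w_1, \dots, w_n$ are the weights in iteration order and $S_k = \sum_{j \le k} w_j$ is the running sum, assumed positive at every step. Item $k$ becomes the current result at step $k$ with probability $w_k / S_k$ (for $k = 1$ this is 1), and it survives each later step $j$ with probability $1 - w_j / S_j = S_{j-1} / S_j$.

```latex
P(\text{result} = k)
  = \frac{w_k}{S_k} \prod_{j=k+1}^{n} \frac{S_{j-1}}{S_j}
  = \frac{w_k}{S_k} \cdot \frac{S_k}{S_n}
  = \frac{w_k}{S_n}
```

The product telescopes, so every item ends up selected with probability proportional to its weight, at the cost of one RNG sample per item.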
My conclusion: perhaps we should accept #518 in its current form (i.e. `WeightedIndex` distribution using CDF + binary search, and convenience wrappers for slices), plus consider adding the code here to sample from iterators.