
Misleading time complexity #5

@mxmlnkn


I think the time complexity is misleading.

To check N images against each other, a lookup complexity of N log(N) is advertised.

However, this is basically never the case. It would only be true if we descended into exactly one child per node. That does happen when we call BKTree.find( item, 0 ), i.e., when we are only interested in items with distance 0. But as soon as we descend into multiple children per node on average, we are not O(log N) anymore. These websites also wrongly list O(log N) for the lookup complexity, while this website explicitly states that O(log N) does not hold. The O(log N) claim can easily be disproven both theoretically and with benchmarks.
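
To make the branching explicit, here is a minimal sketch of the standard BK-tree ranged lookup. This is the textbook algorithm, not necessarily pybktree's exact implementation, and the file and class names are made up for illustration. The loop over [d - n, d + n] is where the branching factor, and with it the complexity, comes from:

bktree-lookup-sketch.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

class SketchBKTree:
    """Minimal BK-tree: each node is a tuple ( item, { edge_distance: child_node } )."""

    def __init__( self, distance, items = () ):
        self.distance = distance
        self.root = None
        for item in items:
            self.add( item )

    def add( self, item ):
        if self.root is None:
            self.root = ( item, {} )
            return
        node_item, children = self.root
        while True:
            d = self.distance( item, node_item )
            if d not in children:
                children[d] = ( item, {} )
                return
            node_item, children = children[d]

    def find( self, item, n ):
        """Return ( distance, item ) for all stored items within distance n of item."""
        results = []
        candidates = [ self.root ] if self.root is not None else []
        while candidates:
            node_item, children = candidates.pop()
            d = self.distance( item, node_item )
            if d <= n:
                results.append( ( d, node_item ) )
            # The triangle inequality only prunes children whose edge distance lies
            # outside [d - n, d + n]; all other children must be visited. For n = 0
            # that is a single child per node, for larger n it is up to 2n+1 children.
            for edge in range( d - n, d + n + 1 ):
                if edge in children:
                    candidates.append( children[edge] )
        return sorted( results )

For n = 0 this really does visit one child per level and stays logarithmic; for any larger n the number of visited nodes grows much faster, as derived below.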

For the time complexity of the ranged query "find all with distance <= d", consider a tree with N elements and a distance measure that takes roughly D distinct values, e.g., D=32 for 32-bit integers, whose pairwise Hamming distances lie in [0, 32]. Assume further that the tree is pretty well balanced and saturated, i.e., each node has D=32 children, one per possible distance. For that ranged query, we calculate the distance dist_to_current between the query and the current node and then have to descend into all children whose edge distance lies in [dist_to_current - d, dist_to_current + d]. So, as a rough estimate, we need to check min(2d+1, D) children at each node.

If we have to check m children at each node, then in total we will have to visit on the order of m^depth nodes! The number of elements in such a saturated tree is N=D^depth, so the depth is depth=ln(N)/ln(D). Combining this, we get m^[ln(N)/ln(D)] = N^[ln(m)/ln(D)] lookups. So this is clearly a power law with some fractional exponent and not a logarithmic scaling!
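
A quick numerical sanity check of that identity (plain math, nothing pybktree-specific; N, D, and d below are just example values):

import math

N, D, d = 10**6, 32, 4
m = min( 2 * d + 1, D )

depth = math.log( N ) / math.log( D )                    # saturated tree: N = D**depth
via_tree_walk = m ** depth                               # m children checked per level over depth levels
via_power_law = N ** ( math.log( m ) / math.log( D ) )

print( via_tree_walk, via_power_law )                    # both ~6.4e3, i.e., N**0.63 for D=32, d=4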

For a lookup that finds all elements with distance <= d, we have m=min(2d+1,D), resulting in a lookup complexity of N^[ln(min(2d+1,D))/ln(D)]. For D=32 (32-bit integers), we would get these scalings (a short script after the two lists reproduces the numbers):

  • d=0: O(log(N))
  • d=1: O(N^0.32)
  • d=2: O(N^0.46)
  • d=4: O(N^0.63)
  • d=8: O(N^0.82)
  • d=16: O(N)

or for D=64:

  • d=0: O(log(N))
  • d=1: O(N^0.26)
  • d=2: O(N^0.39)
  • d=4: O(N^0.53)
  • d=8: O(N^0.68)
  • d=16: O(N^0.84)
  • d=32: O(N)
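
These exponents are easy to reproduce; here is a tiny script that only assumes the m = min(2d+1, D) estimate from above and prints both tables:

import math

for D, tested_distances in { 32: ( 0, 1, 2, 4, 8, 16 ), 64: ( 0, 1, 2, 4, 8, 16, 32 ) }.items():
    print( f"D={D}:" )
    for d in tested_distances:
        m = min( 2 * d + 1, D )
        if m == 1:
            print( f"  d={d}: O(log(N))" )   # only one child per level -> logarithmic
        else:
            # An exponent of 1.00 means O(N), i.e., the whole tree is visited.
            print( f"  d={d}: O(N^{math.log( m ) / math.log( D ):.2f})" )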

For d >= D/2, we would get O(N) scaling, because under the given assumptions we would have to look at every element in the tree on average. The assumption encoded in m=min(2d+1,D) is that the distance from the query to each visited node is roughly D/2, which should be true on average when looking up random values.
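
The "distance is about D/2 on average" part is easy to verify for random values: each of the 64 bits of two independent uniform 64-bit integers differs with probability 1/2, so the expected Hamming distance is exactly 32. A quick check, using the same np.random.randint pattern as the benchmark below:

import numpy as np

n = 100_000
a = np.random.randint( np.iinfo( np.uint64 ).max, size = n, dtype = np.uint64 )
b = np.random.randint( np.iinfo( np.uint64 ).max, size = n, dtype = np.uint64 )

# Hamming distance = popcount of the XOR. Unpack the 8 bytes of each uint64 into its 64 bits.
distances = np.unpackbits( ( a ^ b ).view( np.uint8 ).reshape( n, 8 ), axis = 1 ).sum( axis = 1 )
print( distances.mean(), distances.std() )   # mean ~32 = D/2, standard deviation ~4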

How to check this? Well, let's benchmark:

benchmark-pybktree.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import time

import matplotlib.pyplot as plt
import numpy as np
import pybktree
import scipy.stats
import tqdm


def benchmark_pybktree( element_counts, repeat_count = 10 ):
    """
    Returns a list of triples:
      - elements
      - tree creation time in seconds
      - lookup time for one element in seconds
    """
    timings = []
    for element_count in tqdm.tqdm( element_counts ):
        timing = [ element_count ]

        runtimes = []
        for i in range( repeat_count ):
            elements = np.random.randint( np.iinfo( np.uint64 ).max, size = element_count, dtype = np.uint64 )
            t0 = time.time()
            tree = pybktree.BKTree( pybktree.hamming_distance, elements )
            t1 = time.time()
            runtimes.append( t1 - t0 )
        timing += [ np.mean( runtimes ), np.std( runtimes ) ]

        for distance in [ 0, 1, 2, 4, 8, 16 ]:
            runtimes = []
            for i in range( repeat_count ):
                elements = np.random.randint( np.iinfo( np.uint64 ).max, size = element_count, dtype = np.uint64 )
                tree = pybktree.BKTree( pybktree.hamming_distance, elements )
                t0 = time.time()
                results = tree.find( item = np.uint64( 0 ), n = distance )
                t1 = time.time()
                runtimes.append( t1 - t0 )
            timing += [ distance, np.mean( runtimes ), np.std( runtimes ) ]

        timings.append( timing )

    return timings

def fit_power_law( x, y ):
    """
    In log-log space, power laws become linear laws: for the generic f(x) = a*x^b,
    log(f(x)) = log(a) + b*log(x) is linear when plotted over log(x).
    Returns the tuple (a,b), i.e., (factor, exponent).
    """
    slope, intercept, rvalue, pvalue, stderr = scipy.stats.linregress( np.log( x ), np.log( y ) )
    return np.exp( intercept ), slope

def plot_results( timings, export_name = None ):
    timings = np.array( timings )

    fig = plt.figure()
    ax = fig.add_subplot( 111,
        xlabel = "Number of Elements",
        ylabel = "Execution Time / s",
        xscale = 'log',
        yscale = 'log',
    )

    element_counts      = timings[:,0]
    creation_times_mean = timings[:,1]
    creation_times_std  = timings[:,2]

    ax.errorbar( element_counts, creation_times_mean, creation_times_std, marker = 'o', label = "Tree creation" )
    a,b = fit_power_law( element_counts, creation_times_mean )
    ax.plot( element_counts, a * element_counts**b, color = '0.5', linestyle = '--' )
    print( f"Fitted creation time to {a} N**{b}" )

    for i in range( ( timings.shape[1] - 3 ) // 3 ):
        lookup_distances  = timings[:, 3 + 3*i+0]
        lookup_times_mean = timings[:, 3 + 3*i+1]
        lookup_times_std  = timings[:, 3 + 3*i+2]

        assert np.all( lookup_distances == lookup_distances[0] )

        ax.errorbar( element_counts, lookup_times_mean, lookup_times_std, marker = 'o',
                     label = "Element lookup distance <= {}".format( lookup_distances[0] ) )
        a,b = fit_power_law( element_counts, lookup_times_mean )
        ax.plot( element_counts, a * element_counts**b, color = '0.5', linestyle = '--' )
        print( f"Fitted lookup for distance <= {lookup_distances[0]} time to {a} N**{b}" )

    ax.plot( element_counts, np.log( element_counts ) / np.log( element_counts )[1] * creation_times_mean[1],
             label = 'O(log N)', zorder = 20, color = 'k')

    ax.legend( loc = 'best' )

    fig.tight_layout()

    if export_name:
        fig.savefig( export_name + '.pdf' )
        fig.savefig( export_name + '.png' )
    else:
        plt.show()

if __name__ == '__main__':
    dataFileName = 'pybktree-scaling.dat'
    if not os.path.exists( dataFileName ):
        timings = benchmark_pybktree( np.unique( ( 10 ** np.linspace( 0, 6, 40 ) ).astype( int ) ) )
        np.savetxt( dataFileName, timings )
    plot_results( np.genfromtxt( dataFileName ), export_name = 'pybktree-scaling' )

Here are the results:

[Figure: pybktree-scaling.png, a log-log plot of tree creation time and lookup time over the number of elements, with the power-law fits shown as dashed lines]

The plot is in log-log scale. In log-log scale, all power laws of the form f(x)=ax^b become straight lines: log(f(x)) = log(a) + b*log(x) plotted over log(x). The dashed lines are fits to these linear functions, and here are the results for the fitted scaling laws:

  • tree creation: 0.97e-6s N^1.14
  • d=0: 1.80e-6s N^0.27
  • d=1: 1.81e-6s N^0.38
  • d=2: 1.51e-6s N^0.52
  • d=4: 1.21e-6s N^0.73
  • d=8: 1.00e-6s N^0.92
  • d=16: 0.98e-6s N^0.99

For some reason, the exponents seem to be systematically underestimated. But it can clearly be seen that all of them except d=0 follow power laws instead of logarithmic scaling. As can be seen in the code, the benchmark ran with 64-bit values, i.e., D=64. So, assuming you have a 64-bit hash and you want to find everything with a distance <= 16, you already enter O(N) territory and the BK-tree becomes useless.

Conclusion: This is less well suited than I thought for lookups among 1M+ files, and even less suited if your threshold distance is quite high, which it is in my case. I'm basically edging close to O(N) because of this.
