
Misleading time complexity #5

@mxmlnkn


I think the time complexity is misleading.

To check N images against each other, a lookup complexity of N log(N) is advertised.

However, this is basically never the case. It would only be true if we descended into exactly one child per node. That does happen when we call BKTree.find( item, 0 ), i.e., when we are only interested in items with distance 0. But as soon as we descend into multiple children per node on average, we are not O(log N) anymore. These websites also wrongly list O(log N) for the lookup complexity, while this website explicitly states that O(log N) does not hold. The O(log N) claim can easily be disproven both theoretically and with benchmarks.
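
To make the branching explicit, here is a minimal sketch of the standard BK-tree ranged lookup. This is the textbook algorithm, not necessarily pybktree's exact implementation, and the file and class names are made up for illustration. The loop over [d - n, d + n] is where the branching factor, and with it the complexity, comes from:

bktree-lookup-sketch.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

class SketchBKTree:
    """Minimal BK-tree: each node is a tuple ( item, { edge_distance: child_node } )."""

    def __init__( self, distance, items = () ):
        self.distance = distance
        self.root = None
        for item in items:
            self.add( item )

    def add( self, item ):
        if self.root is None:
            self.root = ( item, {} )
            return
        node_item, children = self.root
        while True:
            d = self.distance( item, node_item )
            if d not in children:
                children[d] = ( item, {} )
                return
            node_item, children = children[d]

    def find( self, item, n ):
        """Return ( distance, item ) for all stored items within distance n of item."""
        results = []
        candidates = [ self.root ] if self.root is not None else []
        while candidates:
            node_item, children = candidates.pop()
            d = self.distance( item, node_item )
            if d <= n:
                results.append( ( d, node_item ) )
            # The triangle inequality only prunes children whose edge distance lies
            # outside [d - n, d + n]; all other children must be visited. For n = 0
            # that is a single child per node, for larger n it is up to 2n+1 children.
            for edge in range( d - n, d + n + 1 ):
                if edge in children:
                    candidates.append( children[edge] )
        return sorted( results )

For n = 0 this really does visit one child per level and stays logarithmic; for any larger n the number of visited nodes grows much faster, as derived below.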

For the time complexity of the ranged query "find all with distance <= d", consider a tree with N elements and a distance measure that takes roughly D distinct values, e.g., D=32 for 32-bit integers, whose pairwise Hamming distances lie in [0, 32]. Assume further that the tree is pretty well balanced and saturated, i.e., each node has D=32 children, one per possible distance. For that ranged query, we calculate the distance dist_to_current between the query and the current node and then have to descend into all children whose edge distance lies in [dist_to_current - d, dist_to_current + d]. So, as a rough estimate, we need to check min(2d+1, D) children at each node.

If we have to check m children at each node, then in total we will have to visit on the order of m^depth nodes! The number of elements in such a saturated tree is N=D^depth, so the depth is depth=ln(N)/ln(D). Combining this, we get m^[ln(N)/ln(D)] = N^[ln(m)/ln(D)] lookups. So this is clearly a power law with some fractional exponent and not a logarithmic scaling!
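
A quick numerical sanity check of that identity (plain math, nothing pybktree-specific; N, D, and d below are just example values):

import math

N, D, d = 10**6, 32, 4
m = min( 2 * d + 1, D )

depth = math.log( N ) / math.log( D )                    # saturated tree: N = D**depth
via_tree_walk = m ** depth                               # m children checked per level over depth levels
via_power_law = N ** ( math.log( m ) / math.log( D ) )

print( via_tree_walk, via_power_law )                    # both ~6.4e3, i.e., N**0.63 for D=32, d=4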

For a lookup that finds all elements with distance <= d, we have m=min(2d+1,D), resulting in a lookup complexity of N^[ln(min(2d+1,D))/ln(D)]. For D=32 (32-bit integers), we would get these scalings (a short script after the two lists reproduces the numbers):

  • d=0: O(log(N))
  • d=1: O(N^0.32)
  • d=2: O(N^0.46)
  • d=4: O(N^0.63)
  • d=8: O(N^0.82)
  • d=16: O(N)

or for D=64:

  • d=0: O(log(N))
  • d=1: O(N^0.26)
  • d=2: O(N^0.39)
  • d=4: O(N^0.53)
  • d=8: O(N^0.68)
  • d=16: O(N^0.84)
  • d=32: O(N)
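
These exponents are easy to reproduce; here is a tiny script that only assumes the m = min(2d+1, D) estimate from above and prints both tables:

import math

for D, tested_distances in { 32: ( 0, 1, 2, 4, 8, 16 ), 64: ( 0, 1, 2, 4, 8, 16, 32 ) }.items():
    print( f"D={D}:" )
    for d in tested_distances:
        m = min( 2 * d + 1, D )
        if m == 1:
            print( f"  d={d}: O(log(N))" )   # only one child per level -> logarithmic
        else:
            # An exponent of 1.00 means O(N), i.e., the whole tree is visited.
            print( f"  d={d}: O(N^{math.log( m ) / math.log( D ):.2f})" )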

For d >= D/2, we would get O(N) scaling, because under the given assumptions we would have to look at every element in the tree on average. The assumption encoded in m=min(2d+1,D) is that the distance from the query to each visited node is roughly D/2, which should be true on average when looking up random values.
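
The "distance is about D/2 on average" part is easy to verify for random values: each of the 64 bits of two independent uniform 64-bit integers differs with probability 1/2, so the expected Hamming distance is exactly 32. A quick check, using the same np.random.randint pattern as the benchmark below:

import numpy as np

n = 100_000
a = np.random.randint( np.iinfo( np.uint64 ).max, size = n, dtype = np.uint64 )
b = np.random.randint( np.iinfo( np.uint64 ).max, size = n, dtype = np.uint64 )

# Hamming distance = popcount of the XOR. Unpack the 8 bytes of each uint64 into its 64 bits.
distances = np.unpackbits( ( a ^ b ).view( np.uint8 ).reshape( n, 8 ), axis = 1 ).sum( axis = 1 )
print( distances.mean(), distances.std() )   # mean ~32 = D/2, standard deviation ~4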

How to check this? Well, let's benchmark:

benchmark-pybktree.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import time

import matplotlib.pyplot as plt
import numpy as np
import pybktree
import scipy.stats
import tqdm


def benchmark_pybktree( element_counts, repeat_count = 10 ):
    """
    Returns a list of triples:
      - elements
      - tree creation time in seconds
      - lookup time for one element in seconds
    """
    timings = []
    for element_count in tqdm.tqdm( element_counts ):
        timing = [ element_count ]

        runtimes = []
        for i in range( repeat_count ):
            elements = np.random.randint( np.iinfo( np.uint64 ).max, size = element_count, dtype = np.uint64 )
            t0 = time.time()
            tree = pybktree.BKTree( pybktree.hamming_distance, elements )
            t1 = time.time()
            runtimes.append( t1 - t0 )
        timing += [ np.mean( runtimes ), np.std( runtimes ) ]

        for distance in [ 0, 1, 2, 4, 8, 16 ]:
            runtimes = []
            for i in range( repeat_count ):
                elements = np.random.randint( np.iinfo( np.uint64 ).max, size = element_count, dtype = np.uint64 )
                tree = pybktree.BKTree( pybktree.hamming_distance, elements )
                t0 = time.time()
                results = tree.find( item = np.uint64( 0 ), n = distance )
                t1 = time.time()
                runtimes.append( t1 - t0 )
            timing += [ distance, np.mean( runtimes ), np.std( runtimes ) ]

        timings.append( timing )

    return timings

def fit_power_law( x, y ):
    """
    In log-log space, power laws become linear laws: for the generic f(x) = a*x^b,
    log(f(x)) = log(a) + b*log(x) is linear when plotted over log(x).
    Returns the tuple (a,b), i.e., (factor, exponent).
    """
    slope, intercept, rvalue, pvalue, stderr = scipy.stats.linregress( np.log( x ), np.log( y ) )
    return np.exp( intercept ), slope

def plot_results( timings, export_name = None ):
    timings = np.array( timings )

    fig = plt.figure()
    ax = fig.add_subplot( 111,
        xlabel = "Number of Elements",
        ylabel = "Execution Time / s",
        xscale = 'log',
        yscale = 'log',
    )

    element_counts      = timings[:,0]
    creation_times_mean = timings[:,1]
    creation_times_std  = timings[:,2]

    ax.errorbar( element_counts, creation_times_mean, creation_times_std, marker = 'o', label = "Tree creation" )
    a,b = fit_power_law( element_counts, creation_times_mean )
    ax.plot( element_counts, a * element_counts**b, color = '0.5', linestyle = '--' )
    print( f"Fitted creation time to {a} N**{b}" )

    for i in range( ( timings.shape[1] - 3 ) // 3 ):
        lookup_distances  = timings[:, 3 + 3*i+0]
        lookup_times_mean = timings[:, 3 + 3*i+1]
        lookup_times_std  = timings[:, 3 + 3*i+2]

        assert np.all( lookup_distances == lookup_distances[0] )

        ax.errorbar( element_counts, lookup_times_mean, lookup_times_std, marker = 'o',
                     label = "Element lookup distance <= {}".format( lookup_distances[0] ) )
        a,b = fit_power_law( element_counts, lookup_times_mean )
        ax.plot( element_counts, a * element_counts**b, color = '0.5', linestyle = '--' )
        print( f"Fitted lookup for distance <= {lookup_distances[0]} time to {a} N**{b}" )

    ax.plot( element_counts, np.log( element_counts ) / np.log( element_counts )[1] * creation_times_mean[1],
             label = 'O(log N)', zorder = 20, color = 'k')

    ax.legend( loc = 'best' )

    fig.tight_layout()

    if export_name:
        fig.savefig( export_name + '.pdf' )
        fig.savefig( export_name + '.png' )
    else:
        plt.show()

if __name__ == '__main__':
    dataFileName = 'pybktree-scaling.dat'
    if not os.path.exists( dataFileName ):
        timings = benchmark_pybktree( np.unique( ( 10 ** np.linspace( 0, 6, 40 ) ).astype( int ) ) )
        np.savetxt( dataFileName, timings )
    plot_results( np.genfromtxt( dataFileName ), export_name = 'pybktree-scaling' )

Here are the results:

[Figure: pybktree-scaling.png, a log-log plot of tree creation time and lookup time over the number of elements, with the power-law fits shown as dashed lines]

The plot is in log-log scale. In log-log scale, all power laws of the form f(x)=ax^b become straight lines: log(f(x)) = log(a) + b*log(x) plotted over log(x). The dashed lines are fits to these linear functions, and here are the results for the fitted scaling laws:

  • tree creation: 0.97e-6s N^1.14
  • d=0: 1.80e-6s N^0.27
  • d=1: 1.81e-6s N^0.38
  • d=2: 1.51e-6s N^0.52
  • d=4: 1.21e-6s N^0.73
  • d=8: 1.00e-6s N^0.92
  • d=16: 0.98e-6s N^0.99

For some reason, the exponents seem to be systematically underestimated. But it can clearly be seen that all of them except d=0 follow power laws instead of logarithmic scaling. As can be seen in the code, the benchmark ran with 64-bit values, i.e., D=64. So, assuming you have a 64-bit hash and you want to find everything with a distance <= 16, you already enter O(N) territory and the BK-tree becomes useless.

Conclusion: This is less well suited than I thought for lookups among 1M+ files, and even less suited if your threshold distance is quite high, which it is in my case. I'm basically edging close to O(N) because of this.
