
Large (20x) runtime difference between value and Jacobian with simple sparsity pattern #138

@moyner

Description


As mentioned in issue #136, I am observing a larger-than-expected gap between the AD and non-AD runtimes. Here is a MWE with a simple sparsity pattern that isolates the behavior and performs some sanity checks:

using ForwardDiff
using SparsityDetection, SparseArrays
using SparseDiffTools
using BenchmarkTools
using Test

n = 1000*1000
# n = 1000
L = [i for i in 1:n]         # left/right index arrays (not used by g below,
R = [i % n + 1 for i in 1:n] # kept here to document the sparsity pattern)

function g(out, in)
    n = length(in)
    for i in eachindex(in)
        l = i
        r = i % n + 1
        out[i] = in[l] - in[r]
    end
end
inp = rand(n)
out = similar(inp)
# Time case without any AD
println("Calling value version:")
@btime g(out, inp)

pattern = jacobian_sparsity(g, out, inp)
jacA = Float64.(sparse(pattern)) # Output from non-cached version
jacB = Float64.(sparse(pattern)) # Output from JacCache'd version

colors = matrix_colors(jacA)
# Verify that it works without cache
println("Calling sparse AD without cache")
@btime forwarddiff_color_jacobian!(jacA, g, inp, colorvec = colors,
                                           sparsity = pattern)
# Set up cache
cache = ForwardColorJacCache(g, inp, colorvec = colors, 
                                     sparsity = pattern);
# Call with cache
println("Calling sparse AD with cache")
@btime forwarddiff_color_jacobian!(jacB, g, inp, cache)
# Check that they have same pattern & values
@test nnz(jacA-jacB) == 0
# Basic sanity check
@test maximum(colors) == 2

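As a side note, the alternating coloring can be reproduced on a tiny instance of the same cyclic pattern outside the benchmark script (this snippet is illustrative and not part of the original report; variable names mirror the MWE):

```julia
# Build the cyclic difference pattern directly for a small even n and
# verify that it 2-colors with alternating colors.
using SparseArrays, SparseDiffTools

n = 6
rows = repeat(1:n, 2)                      # each row i has two entries...
cols = vcat(1:n, [i % n + 1 for i in 1:n]) # ...in columns i and i % n + 1
pattern = sparse(rows, cols, ones(2n), n, n)
colors = matrix_colors(pattern)
@assert maximum(colors) == 2               # two seed directions suffice
```

For odd n the conflict graph is an odd cycle and a third color would be needed, so the 2-color expectation only holds for even n as used here.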
The function is a cyclic version of diff, the coloring alternates 1,2, and the Jacobian is stored as a general sparse matrix. Filling the Jacobian should ideally cost somewhere around 2-4x the value function call, since with two colors a single AD pass operates on Duals carrying two extra floats each, if we disregard cache effects and the cost of handling the sparse data structure. Output for n = 2, 1000 and 1,000,000:

n = 2
Calling value version:
  25.803 ns (0 allocations: 0 bytes)
Explored path: SparsityDetection.Path(Bool[], 1)
Calling sparse AD without cache
  8.733 μs (83 allocations: 4.38 KiB)
Calling sparse AD with cache
  93.172 ns (0 allocations: 0 bytes)
n = 1000
Calling value version:
  1.550 μs (0 allocations: 0 bytes)
Explored path: SparsityDetection.Path(Bool[], 1)
Calling sparse AD without cache
  213.300 μs (2573 allocations: 223.75 KiB)
Calling sparse AD with cache
  26.000 μs (0 allocations: 0 bytes)
n = 1000000
Calling value version:
  1.579 ms (0 allocations: 0 bytes)
Explored path: SparsityDetection.Path(Bool[], 1)
Calling sparse AD without cache
  283.338 ms (2999579 allocations: 218.15 MiB)
Calling sparse AD with cache
  35.923 ms (0 allocations: 0 bytes)

We see that the cache works: it gives roughly an 8x improvement in runtime and eliminates the allocations while yielding the same output, which is good! The ratio between the AD and value runtimes is 3.72 for n = 2, 17.3 for n = 1000 and 23.3 for n = 1e6, so there may be something here that is not linear with respect to the input size... or I have a bug somewhere in my setup. My versioninfo():

Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, icelake-client)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4

Relevant packages:

  "ForwardDiff"       => v"0.10.17"
  "BenchmarkTools"    => v"0.7.0"
  "SparsityDetection" => v"0.3.4"
  "SparseDiffTools"   => v"1.13.0"
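To make the 2-4x expectation concrete, here is a hand-rolled version of what a two-color compressed forward pass computes for this specific pattern (a hypothetical helper for illustration, not the SparseDiffTools implementation):

```julia
using ForwardDiff: Dual, partials
using SparseArrays: spzeros

# Hand-rolled two-color compressed Jacobian for the cyclic difference
# pattern (row i touches columns i and i % n + 1). Assumes `colors` is a
# valid 2-coloring, i.e. n is even and the colors alternate 1,2.
function manual_color_jacobian(g, x, colors)
    n = length(x)
    # Seed column j along unit direction number colors[j]; one call to g
    # on these Duals evaluates both seed directions simultaneously.
    xd = [Dual(x[j], colors[j] == 1 ? 1.0 : 0.0,
                     colors[j] == 2 ? 1.0 : 0.0) for j in 1:n]
    outd = similar(xd)
    g(outd, xd)
    # Decompress: entry (i, j) of the Jacobian is the partial of outd[i]
    # along the seed direction that column j was assigned.
    J = spzeros(n, n)
    for i in 1:n, j in (i, i % n + 1)
        J[i, j] = partials(outd[i])[colors[j]]
    end
    return J
end
```

A single call to `g` on Duals carrying two partials produces both colors at once, which is where the 2-4x estimate comes from; the decompression loop that scatters partials into the sparse matrix is where overhead beyond that estimate can creep in.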
