
Conversation

KDr2 (Member) commented Jan 7, 2022

No description provided.

KDr2 (Member, Author) commented Jan 7, 2022

Interesting: in the last commit I found that step_in on a tape is faster than running the model directly. Maybe I missed something?

$ julia --project perf/p0.jl
  2.286 μs (38 allocations: 1.95 KiB)
  97.440 ns (1 allocation: 48 bytes)
  440.273 μs (48 allocations: 2.56 KiB)
  969.364 ns (4 allocations: 288 bytes)
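For context, the benchmark has roughly this shape (a minimal sketch using only Base timing; the actual perf/p0.jl uses BenchmarkTools' @btime and a real Turing model, and `f`/`args` below are stand-ins, not the code in the repository):

```julia
# Hedged sketch of a harness in the spirit of perf/p0.jl.
# `f` and `args` are stand-in workloads; the real script benchmarks a
# Turing model directly and via Libtask's tape (CTask / step_in).

f(x) = sum(abs2, x)                  # stand-in workload
args = (collect(1.0:1_000.0),)

f(args...)                           # warm-up run to trigger compilation
t_direct = @elapsed f(args...)       # direct call, like the first timing above

# A Libtask-based run would additionally time the tape path, e.g.
#   t = Libtask.CTask(f, args...)      # tape construction (expensive)
#   Libtask.step_in(t.tf.tape, args)   # stepping the recorded tape (cheap)
println("direct call: ", t_direct, " s")
```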

yebai (Member) commented Jan 7, 2022

It is possible that step_in on the tape executes faster than the original function f(args...), since the tape specialises more aggressively (e.g. removing control flow, caching input/output arguments). Note that the total runtime (of a Turing inference algorithm) also depends on the CTask constructor (see 1, 2):

julia> @btime t = Libtask.CTask(f, args...);
  258.923 ms (766345 allocations: 43.38 MiB)

julia> @btime Libtask.step_in(t.tf.tape, args)
  95.054 ns (1 allocation: 48 bytes)

julia> @btime f(args...)
  2.549 μs (38 allocations: 1.95 KiB)
(2.0, VarInfo (2 variables (μ, σ), dimension 2; logp: -1.2750123006e7))

So it appears that a lot of time is spent on repeatedly constructing CTask. Maybe we can speed this up by reusing tapes?
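The reuse idea can be sketched as plain memoisation (a minimal illustration of the caching concept only; `build_tape` and the cache layout are hypothetical stand-ins, not Libtask's actual API):

```julia
# Sketch of tape reuse: pay the expensive recording cost once per
# function, then serve later constructions from a cache.
# `build_tape` is a hypothetical stand-in, not Libtask's real recorder.
const TAPE_CACHE = IdDict{Function,Any}()

function build_tape(f::Function)
    # In Libtask this is the costly part (IR lowering + recording);
    # here we just wrap the function so the cache hit is observable.
    return (tape_for = f,)
end

cached_tape(f::Function) = get!(() -> build_tape(f), TAPE_CACHE, f)

g(x) = 2x
t1 = cached_tape(g)   # miss: "records" the tape
t2 = cached_tape(g)   # hit: returns the cached object, no re-recording
@assert t1 === t2
```

The real change would also need to key the cache on argument types (a function can be recorded against different signatures), but the amortisation principle is the same.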

KDr2 (Member, Author) commented Jan 9, 2022

Without Cache:

$ julia --project perf/p0.jl
Directly call...
  2.233 μs (38 allocations: 1.95 KiB)
CTask construction...
  410.719 ms (974878 allocations: 59.77 MiB)
Step in a tape...
  90.543 ns (1 allocation: 48 bytes)
Directly call...
  422.273 μs (48 allocations: 2.56 KiB)
CTask construction...
  416.812 ms (974908 allocations: 59.77 MiB)
Step in a tape...
  923.559 ns (4 allocations: 288 bytes)

With IR and Tape Cache:

$ julia --project perf/p0.jl
Directly call...
  2.117 μs (38 allocations: 1.95 KiB)
CTask construction...
  99.222 μs (489 allocations: 22.02 KiB)
Step in a tape...
  87.400 ns (1 allocation: 48 bytes)
Directly call...
  417.133 μs (48 allocations: 2.56 KiB)
CTask construction...
  103.745 μs (495 allocations: 22.48 KiB)
Step in a tape...
  924.314 ns (4 allocations: 288 bytes)
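Back-of-envelope on the two runs (numbers copied from the logs above): the IR/tape cache cuts CTask construction from roughly 411 ms to roughly 99 μs, a speedup on the order of 4000x, while the direct-call and step_in timings are unchanged within noise:

```julia
# Speedup of CTask construction with the IR/tape cache,
# using the first-model timings reported above.
no_cache   = 410.719e-3   # seconds, without cache
with_cache = 99.222e-6    # seconds, with IR and tape cache
speedup = no_cache / with_cache
println("construction speedup ≈ ", round(speedup; digits = 1), "x")
```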

KDr2 (Member, Author) commented Jan 11, 2022

Despite some numeric test failures and a few errors, the unit tests finished in about 2 hours on my machine:

real    130m46.448s
user    126m4.793s
sys     5m25.685s

@KDr2 KDr2 marked this pull request as ready for review January 12, 2022 00:41
Co-authored-by: David Widmann <[email protected]>
@yebai yebai changed the title [WIP] Performance and Benchmarks Performance and Benchmarks Jan 12, 2022
yebai (Member) left a comment

I can confirm that the tests now run correctly - we can rerun the Turing CI once this PR is merged. Fingers crossed!

KDr2 (Member, Author) commented Jan 19, 2022

This PR is ready to merge. @yebai

@KDr2 KDr2 requested a review from yebai January 19, 2022 00:47
@yebai yebai merged commit ccc293c into master Jan 19, 2022
@delete-merged-branch delete-merged-branch bot deleted the perf branch January 19, 2022 09:41