NUTS refactoring towards GPU support #2345

aseyboldt · 2017-06-23T00:54:45Z

So far we have mostly investigated possibilities of moving more of nuts into theano. But it might be better to go the opposite direction, and limit theano to computing logp and gradient. The remaining computations are mostly simple BLAS1 operations, so we should be able to do those quickly on a gpu or cpu on our own.
This was meant as an experiment to see how much slower it would be if we did the leapfrog steps in pure python, but to my surprise I get from 5% to 60% faster sampling.
I'm not sure I understand why that is yet, but my guess is that the shared computations between logp and gradient are responsible. Usually theano should be smart enough to do those only once, but if the leapfrog steps are part of the graph this seems to break. In complicated models the resulting runtime cost can be high.

CC @ColCarroll @hvasbath

ColCarroll

This looks really cool! Left some comments just to keep myself engaged. I will try to do some benchmarking this weekend and see if I get similar speedups, and under what conditions.

ColCarroll · 2017-06-23T01:50:21Z

pymc3/model.py

these lines are left over experiments, yeah?

ups, I'll take those out on the next push.

ColCarroll · 2017-06-23T01:54:28Z

pymc3/step_methods/hmc/quadpotential.py

any reason for this syntax?

oh, does it do automatic shape checking/something clever with reusing allocated memory?

Yes, I'm trying to avoid allocations. Haven't done much profiling yet though.

ColCarroll · 2017-06-23T02:17:52Z

pymc3/step_methods/hmc/quadpotential.py

this is x.dot(velocity), but modestly extra faster, I guess? (just ran a silly benchmark, and it seems like it is ~30% faster)

yes. I thought we might as well just use the blas version directly. This would probably also be what we want to call on a gpu. (Although pygpu doesn't expose level 1 BLAS at the moment...)

junpenglao · 2017-06-23T05:45:05Z

If this works it will be also much easier to move to other backends (PyTorch?).

hvasbath · 2017-06-23T10:12:11Z

I always had the feeling that theano is very slow in some cases. I guess thats the cost of wrapping everything into layers and layers of code. Some in C some in python and then the communication overhead is high. I know too little about nuts and pygpu not at all so I cannot comment on the code but thanks for cc anyways as this touches the other issue. Can you profile how much is being shifted back and forth to the GPU?

aseyboldt · 2017-06-23T13:36:09Z

@hvasbath It's not running on the gpu yet, this is still about figuring out how we could do that in the future.
I don't think calling overhead is much of a problem in most of my models at least. Most of the ops are C anyway. I think the reason for the particular slowdown this PR fixes has more to do with what theano is computing in the first place. There are many cases where the gradient and the value both contain the same expression somewhere. Theano analyses the graph to find nodes that occur more than once, but if this expression is first changed by some other optimization, then that doesn't work anymore. This might be why it ends up recomputing something twice. I'm still not sure about the exact reason, but just from the Apply node counts, it definitely does some things more often before the PR than after. If those things happen to be very expensive in a particular model, than we get a large speedup.

hvasbath · 2017-06-24T08:29:26Z

You could try switching optimizations of? Whats the timing then? If you know there are things used repeatedly in the graph couldnt you then program it explicitely in the way not to recompute it?

ferrine · 2017-06-24T09:47:53Z

Interesting, waiting formenchmarks

aseyboldt · 2017-06-24T19:54:15Z

Edit: There was a bug in my profiling code, I updated the times. The new time went up from 7ms to 14ms. (still much faster than master, but not as much :-(

@ferrine I just tried it on an old model with ~30000 variables I still had around:

Before

N = model.ndim
with model:
    step = pm.step_methods.hmc.base_hmc.BaseHMC(
        scaling=np.ones(N), profile=False, use_single_leapfrog=True)

np.random.seed(42)
q = floatX(np.random.randn(N))
p = floatX(np.random.randn(N))
epsilon = floatX(np.array(0.01))
q_grad = model.dlogp_array(q)
%timeit _ = step.leapfrog(q, p, q_grad, epsilon)

20.8 ms ± 889 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Profiling:

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  82.7%    82.7%       1.327s       1.02e-04s     C    13041     161   theano.tensor.elemwise.Elemwise
  10.6%    93.3%       0.170s       5.51e-05s     C     3078      38   theano.tensor.elemwise.Sum
   5.7%    99.0%       0.092s       2.85e-04s     Py     324       4   theano.sparse.basic.Dot
   0.2%    99.2%       0.003s       2.10e-05s     C      162       2   theano.tensor.basic.Join
   0.2%    99.4%       0.003s       9.77e-06s     C      324       4   theano.tensor.basic.Alloc
   0.1%    99.5%       0.002s       2.79e-05s     C       81       1   theano.tensor.blas_c.CGemv
   0.1%    99.7%       0.002s       1.06e-06s     C     1944      24   theano.tensor.subtensor.Subtensor
   0.1%    99.8%       0.002s       7.30e-07s     C     2592      32   theano.tensor.elemwise.DimShuffle
   0.1%    99.9%       0.001s       1.39e-05s     C       81       1   theano.compile.ops.DeepCopyOp
   0.1%    99.9%       0.001s       8.60e-07s     C     1134      14   theano.tensor.basic.Reshape
   0.0%    99.9%       0.000s       3.44e-07s     C     1134      14   theano.compile.ops.ViewOp
   0.0%   100.0%       0.000s       7.84e-07s     C      405       5   theano.tensor.basic.ScalarFromTensor
   0.0%   100.0%       0.000s       3.05e-07s     C      810      10   theano.compile.ops.Rebroadcast
   0.0%   100.0%       0.000s       2.12e-06s     C       81       1   theano.compile.ops.Shape_i
   0.0%   100.0%       0.000s       1.26e-06s     C       81       1   theano.tensor.basic.AllocEmpty
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

After

N = model.ndim
with model:
    step = pm.step_methods.hmc.base_hmc.BaseHMC(
        scaling=np.ones(N), use_single_leapfrog=True, profile=False)

np.random.seed(42)
q = floatX(np.random.randn(N))
p = floatX(np.random.randn(N))
epsilon = floatX(np.array(0.01))
state = step.integrator.compute_state(q, p)
%timeit step.integrator.step(epsilon, state)

14.4 ms ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Profiling:

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  86.6%    86.6%       9.967s       1.29e-04s     C    77140      95   theano.tensor.elemwise.Elemwise
   8.6%    95.2%       0.991s       5.31e-05s     C    18676      23   theano.tensor.elemwise.Sum
   3.9%    99.0%       0.445s       2.74e-04s     Py    1624       2   theano.sparse.basic.Dot
   0.4%    99.4%       0.047s       4.82e-06s     C     9744      12   theano.tensor.subtensor.IncSubtensor
   0.3%    99.7%       0.036s       1.50e-05s     C     2436       3   theano.tensor.basic.Alloc
   0.1%    99.8%       0.009s       7.26e-07s     C    12180      15   theano.tensor.elemwise.DimShuffle
   0.1%    99.9%       0.009s       8.97e-07s     C     9744      12   theano.tensor.subtensor.Subtensor
   0.1%   100.0%       0.006s       6.67e-07s     C     9744      12   theano.tensor.basic.Reshape
   0.0%   100.0%       0.002s       3.57e-07s     C     5684       7   theano.compile.ops.ViewOp
   0.0%   100.0%       0.002s       3.97e-07s     C     4872       6   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.001s       1.04e-06s     C      812       1   theano.compile.ops.Shape_i
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

junpenglao · 2017-06-24T20:09:56Z

wow. I will try on some of my slow models.

hvasbath · 2017-06-25T07:46:23Z

Did you find out where the speedup comes from? If you compare the numbers in the profiling it is not so clear-at least not for me. The state calculation is included in leapfrog for case1 isnt it? So to be a fair comparison you would need to time that as well? Or is that the point that this is stuff that doesnt need to be recalculated all the time?

aseyboldt · 2017-06-25T09:47:33Z

@hvasbath They both do just one leapfrog step. The compute_state is pretty much the same as q_grad = model.dlogp_array(q).
I still haven't really figured it out. I'm trying to look at the theano graphs of some very simple models. What makes this difficult is that those graphs get very large very quickly. :-)
But have a look at the number of Apply nodes in the profile above (#apply). In the model I do one multiplication with a sparse matrix. So I would expect to see an apply count of 2 for theano.sparse.basic.Dot: One for computing the value, and one because of the gradient. This is the case in the second run. But in master there are 4 of those.

aseyboldt · 2017-06-25T12:06:56Z

I think I have a rough idea now where the problem in master is coming from.
This is the (slightly expanded) part where the leapfrog step is done. (trajectory.py::_theano_single_leapfrog)

    p_new = p + 0.5 * epsilon * q_grad  # half momentum update
    q_new = q + epsilon * H.pot.velocity(p_new)  # full position update
    q_new_grad = H.dlogp(q_new)
    p_new += 0.5 * epsilon * q_new_grad  # half momentum update
    # energy_new = energy(H, q_new, p_new)
    energy_new = H.pot.energy(q_new) - H.logp(q_new)
    v_new = H.pot.velocity(p_new)

In the third line we compute the gradient of the logp at q_new. In the 6th line we compute the logp at q_new. H.logp and H.dlogp are both instanced of theanof.CallableTensor. They use theano.clone to change the input. So we end up with two clones for most of our computations. Theano doesn't seem to merge them again.

aseyboldt · 2017-06-25T13:15:53Z

I messed up the profiling a bit. I updated the comment. The times are now 20ms vs 14ms instead of 20ms vs 7ms.

ferrine · 2017-06-29T18:02:27Z

pymc3/tests/test_model.py

I thought we moved to pytest

Ah, I didn't really think about this but just used the unittest version. pytest supports this though. I guess the alternative would be to use a pytest fixture? So you'd use that?

Yes, I tried pytest and fell in love with it

we've still got unittest stuff sprinkled around. it looks like this would actually work fine if you just deleted the super class, but I don't think it hurts anything, either.

ColCarroll · 2017-07-18T10:01:42Z

Happy to help/review this when needed/ready

aseyboldt · 2017-07-18T10:04:21Z

@ColCarroll great. I'm still working on fixing HamiltonianMC again, but I'll let you know when it's done.

aseyboldt · 2017-07-18T15:48:18Z

@ColCarroll I guess there will still be a test failure somewhere, but I think this is ready for review now.

ColCarroll

looks good to me -- wish we had the benchmark suite set up to check this out from a performance standpoint, but I think numpy is preferable for maintaining the library (and will give better error messages).

ColCarroll · 2017-07-18T16:12:05Z

pymc3/step_methods/arraystep.py

+        array = self._logp_dlogp_func.dict_to_array(point)
+
+        if self.generates_stats:
+            apoint, stats = self.astep(array)


just thinking for the future, this would be easier to reason about if self.astep always returned at least an empty dictionary for stats

When I wrote this I though I'd rather not change the interface for step methods, so that we don't break custom step methods. Now I guess we should probably treat that as private part of the api, so yes, I agree.

ColCarroll · 2017-07-18T16:13:59Z

pymc3/step_methods/hmc/integration.py

+
+    def step(self, epsilon, state, out=None):
+        pot = self._potential
+        axpy = linalg.blas.get_blas_funcs('axpy', dtype=self._dtype)


can you add a comment or link to explain what axpy is/does?

That's in the comments just above the places I use the function.

ColCarroll · 2017-07-18T16:16:12Z

pymc3/step_methods/hmc/nuts.py

            depth is reached.
        early_max_treedepth : int, default=8
-            The maximum tree depth during tuning.
+            The maximum tree depth during the first 200 tuning samples.


technically, min(200, tune_steps)

I really hope people don't use less than 200 tuning steps :-)
I think "during the first 200 tuning samples" would imply that, no?

ColCarroll · 2017-07-18T16:16:59Z

pymc3/step_methods/hmc/quadpotential.py


 class QuadPotential(object):
-    def velocity(self, x):
+    def velocity(self, x, out=None):


this is to use blas functions to change values in-place?

yes. Since we are doing this in python now, avoiding allocations seems to matter.

ColCarroll · 2017-07-18T16:19:15Z

pymc3/tests/test_model.py

we've still got unittest stuff sprinkled around. it looks like this would actually work fine if you just deleted the super class, but I don't think it hurts anything, either.

kyleabeauchamp · 2017-07-18T22:26:55Z

I have a dumb question: do we have a fully-GPU version of vanilla HMC? That could be easier to implement than NUTS, and because of architectural differences might end up being almost as good...

aseyboldt · 2017-07-18T22:30:28Z

@kyleabeauchamp no, we don't have that yet. My plan going forward would be to use the transfer method in theano to allow us to return data on the gpu and avoid the copy, and then use pygpu for the couple of array ops we need to do in nuts/hmc.

junpenglao · 2017-07-19T09:23:32Z

pymc3/tests/test_step.py

+            prob = Beta('prob', alpha=5., beta=3.)
            Binomial('outcome', n=1, p=prob)
-            with warnings.catch_warnings(record=True) as warns:
+            with pytest.warns(None) as warns:


Is this finally going to solve the Travis weird fail? 💯

No clue. But I thought it is worth a shot.

ok, that wasn't it. This failure is a bit of a mystery to me, but it's not related to the PR anyway.

aseyboldt · 2017-07-20T15:09:37Z

I think this is ready to be merged now.
I tested this against a couple of my models, and got a speedups between 20% to 40% for all of them. For very simple models (eg just a single normal variable) it is slower (~20%). I guess we could remedy that by working on the integration a bit more, maybe optionally use numba in there to avoid a couple of unnecessary passes over arrays.

twiecki · 2017-07-20T15:21:24Z

@aseyboldt Very cool! Thanks, also to the reviewers.

aseyboldt changed the title ~~Use~~ NUTS refactoring towards GPU support Jun 23, 2017

aseyboldt added gpu labels Jun 23, 2017

ColCarroll reviewed Jun 23, 2017

View reviewed changes

aseyboldt force-pushed the no-theano-nuts branch 2 times, most recently from fe05701 to f07ea00 Compare June 25, 2017 19:05

aseyboldt mentioned this pull request Jun 28, 2017

[WIP] Riemannian Manifold HMC #2240

Closed

ferrine reviewed Jun 29, 2017

View reviewed changes

aseyboldt mentioned this pull request Jul 16, 2017

Implement mass matrix adaptation #2327

Merged

aseyboldt added 4 commits July 16, 2017 20:59

Add ValueGradFunction to model

983d444

Add names to logp variables

e42cd9e

Run leapfrog steps outside of theano

fac28fa

Slight speedup for leapfrog step

42be392

aseyboldt force-pushed the no-theano-nuts branch 2 times, most recently from 28b954d to 56a26fa Compare July 18, 2017 09:39

Fix bugs after rebase

444308b

aseyboldt force-pushed the no-theano-nuts branch from 56a26fa to 444308b Compare July 18, 2017 10:46

Fix dtype handling in nuts

88cd16a

ColCarroll reviewed Jul 18, 2017

View reviewed changes

aseyboldt force-pushed the no-theano-nuts branch 3 times, most recently from 511a6da to 6076d71 Compare July 19, 2017 07:48

Fix some test failures

56aba98

aseyboldt force-pushed the no-theano-nuts branch from 6076d71 to 56aba98 Compare July 19, 2017 08:21

Fix test failure in NutsCheckTrace

cf785bc

junpenglao reviewed Jul 19, 2017

View reviewed changes

twiecki merged commit 61ba6ee into pymc-devs:master Jul 20, 2017

junpenglao mentioned this pull request Aug 5, 2017

computation of logp_elemwise is slower than before #2480

Closed

NUTS refactoring towards GPU support #2345

NUTS refactoring towards GPU support #2345

Uh oh!

Conversation

aseyboldt commented Jun 23, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ColCarroll left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

junpenglao commented Jun 23, 2017

Uh oh!

hvasbath commented Jun 23, 2017

Uh oh!

aseyboldt commented Jun 23, 2017

Uh oh!

hvasbath commented Jun 24, 2017

Uh oh!

ferrine commented Jun 24, 2017

Uh oh!

aseyboldt commented Jun 24, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Before

After

Uh oh!

junpenglao commented Jun 24, 2017

Uh oh!

hvasbath commented Jun 25, 2017

Uh oh!

aseyboldt commented Jun 25, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aseyboldt commented Jun 25, 2017

Uh oh!

aseyboldt commented Jun 25, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ColCarroll commented Jul 18, 2017

Uh oh!

aseyboldt commented Jul 18, 2017

Uh oh!

aseyboldt commented Jul 18, 2017

Uh oh!

ColCarroll left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

aseyboldt commented Jun 23, 2017 •

edited

Loading

aseyboldt commented Jun 24, 2017 •

edited

Loading

aseyboldt commented Jun 25, 2017 •

edited

Loading