NUTS refactoring towards GPU support #2345
Conversation
This looks really cool! Left some comments just to keep myself engaged. I will try to do some benchmarking this weekend and see if I get similar speedups, and under what conditions.
pymc3/model.py (outdated)
these lines are left over experiments, yeah?
oops, I'll take those out on the next push.
any reason for this syntax?
oh, does it do automatic shape checking/something clever with reusing allocated memory?
Yes, I'm trying to avoid allocations. Haven't done much profiling yet though.
this is `x.dot(velocity)`, but modestly faster, I guess? (just ran a silly benchmark, and it seems like it is ~30% faster)
Yes. I thought we might as well just use the BLAS version directly. This would probably also be what we want to call on a GPU. (Although pygpu doesn't expose level 1 BLAS at the moment...)
If this works, it will also be much easier to move to other backends (PyTorch?).
I always had the feeling that theano is very slow in some cases. I guess that's the cost of wrapping everything into layers and layers of code, some in C, some in Python, so the communication overhead is high. I know too little about NUTS, and pygpu not at all, so I can't comment on the code, but thanks for the cc anyway, as this touches the other issue. Can you profile how much is being shifted back and forth to the GPU?
@hvasbath It's not running on the GPU yet, this is still about figuring out how we could do that in the future.
You could try switching optimizations off? What's the timing then? If you know there are things used repeatedly in the graph, couldn't you program it explicitly so as not to recompute them?
Interesting, waiting for benchmarks.
Edit: There was a bug in my profiling code, I updated the times. The new time went up from 7ms to 14ms (still much faster than master, but not as much :-( ).

@ferrine I just tried it on an old model with ~30000 variables I still had around:

Before:

```python
N = model.ndim
with model:
    step = pm.step_methods.hmc.base_hmc.BaseHMC(
        scaling=np.ones(N), profile=False, use_single_leapfrog=True)
np.random.seed(42)
q = floatX(np.random.randn(N))
p = floatX(np.random.randn(N))
epsilon = floatX(np.array(0.01))
q_grad = model.dlogp_array(q)
%timeit _ = step.leapfrog(q, p, q_grad, epsilon)
```

Profiling: (output omitted)
After:

```python
N = model.ndim
with model:
    step = pm.step_methods.hmc.base_hmc.BaseHMC(
        scaling=np.ones(N), use_single_leapfrog=True, profile=False)
np.random.seed(42)
q = floatX(np.random.randn(N))
p = floatX(np.random.randn(N))
epsilon = floatX(np.array(0.01))
state = step.integrator.compute_state(q, p)
%timeit step.integrator.step(epsilon, state)
```

Profiling: (output omitted)
wow. I will try on some of my slow models.
Did you find out where the speedup comes from? If you compare the numbers in the profiling it is not so clear, at least not to me. The state calculation is included in leapfrog for case 1, isn't it? So to be a fair comparison you would need to time that as well? Or is that the point, that this is stuff that doesn't need to be recalculated all the time?
@hvasbath They both do just one leapfrog step. The compute_state call only has to happen once at the start of a trajectory, not once per leapfrog step.
I think I have a rough idea now where the problem in master is coming from.

```python
p_new = p + 0.5 * epsilon * q_grad           # half momentum update
q_new = q + epsilon * H.pot.velocity(p_new)  # full position update
q_new_grad = H.dlogp(q_new)
p_new += 0.5 * epsilon * q_new_grad          # half momentum update
# energy_new = energy(H, q_new, p_new)
energy_new = H.pot.energy(q_new) - H.logp(q_new)
v_new = H.pot.velocity(p_new)
```

In the third line we compute the gradient of the logp at q_new, but the energy computation then evaluates the logp itself in a separate call, so whatever work logp and its gradient share gets done twice.
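To make that hypothesis concrete, here is a minimal sketch (a toy graph, not the PR's code) of the difference between compiling logp and its gradient separately versus together; in the combined function theano can evaluate the shared subgraph once:

```python
import theano
import theano.tensor as tt

x = tt.dvector('x')
inner = tt.sum(tt.exp(-x ** 2))  # stands in for work shared by logp and dlogp
logp = tt.log(inner)
dlogp = tt.grad(logp, x)

# Two separate functions: each compiled graph evaluates `inner` on its own
f_logp = theano.function([x], logp)
f_dlogp = theano.function([x], dlogp)

# One function returning both: the shared subgraph is computed once per call
f_both = theano.function([x], [logp, dlogp])
```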
I messed up the profiling a bit. I updated the comment. The times are now 20ms vs 14ms instead of 20ms vs 7ms.
(force-pushed fe05701 to f07ea00)
pymc3/tests/test_model.py (outdated)
I thought we moved to pytest
Ah, I didn't really think about this but just used the unittest version. pytest supports this too, though. I guess the alternative would be to use a pytest fixture? Is that what you'd use?
Yes, I tried pytest and fell in love with it
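For context, a minimal sketch of the fixture pattern being discussed (the model-building helper here is a hypothetical stand-in for whatever setUp used to construct):

```python
import pytest

def make_model():
    # hypothetical stand-in for the object setUp used to build
    return {"ndim": 3}

@pytest.fixture
def model():
    # pytest rebuilds this for each test that requests it,
    # replacing unittest's TestCase.setUp machinery
    return make_model()

def test_model_has_dims(model):
    assert model["ndim"] == 3
```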
we've still got unittest stuff sprinkled around. it looks like this would actually work fine if you just deleted the super class, but I don't think it hurts anything, either.
(force-pushed 28b954d to 56a26fa)
Happy to help/review this when needed/ready.
@ColCarroll great. I'm still working on fixing HamiltonianMC again, but I'll let you know when it's done.
(force-pushed 56a26fa to 444308b)
@ColCarroll I guess there will still be a test failure somewhere, but I think this is ready for review now.
looks good to me -- wish we had the benchmark suite set up to check this out from a performance standpoint, but I think numpy is preferable for maintaining the library (and will give better error messages).
```python
array = self._logp_dlogp_func.dict_to_array(point)

if self.generates_stats:
    apoint, stats = self.astep(array)
```
just thinking for the future, this would be easier to reason about if `self.astep` always returned at least an empty dictionary for stats
When I wrote this I thought I'd rather not change the interface for step methods, so that we don't break custom step methods. Now I guess we should probably treat that as a private part of the API, so yes, I agree.
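A minimal sketch (hypothetical class and names) of the convention being suggested, where astep always returns a (point, stats) pair so callers never need to branch on generates_stats:

```python
class ExampleStep:
    generates_stats = True  # would become redundant under this convention

    def astep(self, array):
        apoint = array  # stand-in for the actual update
        stats = {}      # always returned, possibly empty
        return apoint, stats

apoint, stats = ExampleStep().astep([0.0])
```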
```python
def step(self, epsilon, state, out=None):
    pot = self._potential
    axpy = linalg.blas.get_blas_funcs('axpy', dtype=self._dtype)
```
can you add a comment or link to explain what `axpy` is/does?
That's in the comments just above the places I use the function.
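For readers of the thread, a small self-contained demonstration of what axpy does (y ← a·x + y, updating y in place when the dtype and memory layout match):

```python
import numpy as np
from scipy import linalg

axpy = linalg.blas.get_blas_funcs('axpy', dtype='float64')

x = np.ones(3)
y = np.arange(3, dtype='float64')
axpy(x, y, a=0.5)  # y <- 0.5*x + y, without allocating a temporary
print(y)           # [0.5 1.5 2.5]
```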
```diff
     depth is reached.
 early_max_treedepth : int, default=8
-    The maximum tree depth during tuning.
+    The maximum tree depth during the first 200 tuning samples.
```
technically, min(200, tune_steps)
I really hope people don't use less than 200 tuning steps :-)
I think "during the first 200 tuning samples" would imply that, no?
```diff
 class QuadPotential(object):
-    def velocity(self, x):
+    def velocity(self, x, out=None):
```
this is to use `blas` functions to change values in-place?
Yes. Since we are doing this in Python now, avoiding allocations seems to matter.
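A minimal sketch of the out-parameter pattern for the diagonal case (hypothetical class; it relies on numpy ufuncs being able to write results into a preallocated buffer):

```python
import numpy as np

class QuadPotentialDiagSketch:
    def __init__(self, diag):
        self._diag = np.asarray(diag)

    def velocity(self, x, out=None):
        # writes into `out` when given, so the integrator can reuse
        # one buffer per trajectory instead of allocating each step
        return np.multiply(self._diag, x, out=out)

pot = QuadPotentialDiagSketch(np.ones(5))
buf = np.empty(5)
pot.velocity(np.arange(5.0), out=buf)  # result lands in buf
```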
I have a dumb question: do we have a fully-GPU version of vanilla HMC? That could be easier to implement than NUTS, and because of architectural differences might end up being almost as good...
@kyleabeauchamp no, we don't have that yet. My plan going forward would be to use the same split as in this PR: keep the compiled graph limited to logp and gradient, and build the integrator on top of that.
(force-pushed 511a6da to 6076d71, then 6076d71 to 56aba98)
```diff
 prob = Beta('prob', alpha=5., beta=3.)
 Binomial('outcome', n=1, p=prob)
-with warnings.catch_warnings(record=True) as warns:
+with pytest.warns(None) as warns:
```
Is this finally going to solve the weird Travis failure? 💯
No clue. But I thought it was worth a shot.
ok, that wasn't it. This failure is a bit of a mystery to me, but it's not related to the PR anyway.
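For reference, the idiom from the diff above in isolation (note: `pytest.warns(None)` was the accepted way to record arbitrary warnings at the time; later pytest versions deprecate this form):

```python
import warnings
import pytest

def test_records_warnings():
    # records every warning raised in the block, much like
    # warnings.catch_warnings(record=True)
    with pytest.warns(None) as warns:
        warnings.warn("tuning", UserWarning)
    assert len(warns) == 1
```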
I think this is ready to be merged now.
@aseyboldt Very cool! Thanks, also to the reviewers.
So far we have mostly investigated possibilities of moving more of NUTS into theano. But it might be better to go in the opposite direction, and limit theano to computing logp and the gradient. The remaining computations are mostly simple BLAS level-1 operations, so we should be able to do those quickly on a GPU or CPU on our own.
This was meant as an experiment to see how much slower it would be if we did the leapfrog steps in pure Python, but to my surprise I get 5% to 60% faster sampling.
I'm not sure I understand why that is yet, but my guess is that the shared computations between logp and gradient are responsible. Usually theano should be smart enough to do those only once, but if the leapfrog steps are part of the graph this seems to break. In complicated models the resulting runtime cost can be high.
CC @ColCarroll @hvasbath
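As a rough illustration of the approach (a sketch, not this PR's implementation: `logp_dlogp` is a stand-in for the compiled theano function, and the mass matrix is the identity so velocity equals momentum), one leapfrog step written with numpy and a BLAS level-1 call:

```python
import numpy as np
from scipy import linalg

axpy = linalg.blas.get_blas_funcs('axpy', dtype='float64')

def logp_dlogp(q):
    # stand-in for the compiled function returning logp and its gradient
    return -0.5 * q @ q, -q

def leapfrog(q, p, q_grad, epsilon):
    # axpy(x, y, a) computes y <- a*x + y in place, avoiding temporaries
    q, p = q.copy(), p.copy()
    axpy(q_grad, p, a=0.5 * epsilon)  # half momentum update
    axpy(p, q, a=epsilon)             # full position update (velocity == p)
    logp, q_grad = logp_dlogp(q)
    axpy(q_grad, p, a=0.5 * epsilon)  # second half momentum update
    return q, p, q_grad, logp

q, p = np.zeros(3), np.ones(3)
_, grad = logp_dlogp(q)
q, p, grad, logp = leapfrog(q, p, grad, 0.01)
```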