
Conversation

@yebai (Member) commented Dec 14, 2021

No description provided.

@rikhuijzer (Contributor) left a comment

🥳

Also, the julia compat entry can be reverted (#1740) now that Julia 1.7 will be supported again.

@devmotion (Member) commented

The CI setup has to be reverted as well; currently only 1.3 and 1.6 are tested.

@KDr2 (Member) commented Dec 14, 2021

https://github.com/TuringLang/Turing.jl/blob/master/src/inference/AdvancedSMC.jl#L354

produce is called in a function much deeper than I thought: the instruction is Instruction{DynamicPPL.Model}, which calls tilde_observe many times, and only at the very bottom of that call chain does it call produce.

Any good ideas on how to fix this? @yebai

@yebai (Member, Author) commented Dec 14, 2021

We might have to trace into Instruction{DynamicPPL.Model} in that case, I think. Can we implement an option to selectively trace into nested instructions, e.g. trace into Instruction{DynamicPPL.Model} while keeping other same-level instructions primitive?
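
A rough sketch of what such an option could look like (hypothetical; the predicate name trace_into mirrors the overloads that appear later in this thread, and this is not the actual Libtask implementation):

    using DynamicPPL: Model

    # Hypothetical predicate the tracer consults for every callee it records.
    # Returning `true` means "recurse and trace into this call"; the default
    # keeps other same-level instructions primitive.
    trace_into(f) = false       # default: record the call as a primitive instruction
    trace_into(::Model) = true  # but trace into DynamicPPL model evaluation

The tracer would then check trace_into(f) for each recorded call and either emit a primitive Instruction or unroll the callee into its own sub-tape.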

@yebai (Member, Author) commented Dec 14, 2021

@KDr2 can you share an example of the complete call stack that leads to produce?

@KDr2 (Member) commented Dec 22, 2021

I have allowed produce to be called in nested function calls, and delayed the produce to the end of the instruction. There are still (at least) three kinds of errors:

  1. produce is called twice in a function which we don't trace down:

    ┌ Error: TapedTask Error:
    │   exception =
    │    There is a produced value which is not consumed
    │    Stacktrace:
    │      [1] error(s::String)
    │        @ Base ./error.jl:33
    │      [2] produce(val::Float64)
    │        @ Libtask ~/Work/julia/Libtask.jl/src/tapedtask.jl:134
    │      [3] observe
    │        @ ~/Work/julia/Turing.jl/src/inference/AdvancedSMC.jl:370 [inlined]
    │      [4] tilde_observe
    │        @ ~/Work/julia/DynamicPPL.jl/src/context_implementations.jl:138 [inlined]
    │      [5] tilde_observe
    │        @ ~/Work/julia/DynamicPPL.jl/src/context_implementations.jl:136 [inlined]
    │      [6] tilde_observe
    │        @ ~/Work/julia/DynamicPPL.jl/src/context_implementations.jl:131 [inlined]
    │      [7] tilde_observe!
    │        @ ~/Work/julia/DynamicPPL.jl/src/context_implementations.jl:182 [inlined]
    │      [8] tilde_observe!(context::DynamicPPL.SamplingContext{Sampler{PG{(:z1, :z2, :z3, :z4), ....
    │        @ DynamicPPL ~/Work/julia/DynamicPPL.jl/src/context_implementations.jl:169 
    │      [9] MoGtest(__model__::DynamicPPL.Model{typeof(MoGtest), (:D,), (), (), Tuple{Matr ....
    │        @ Main ~/Work/julia/Turing.jl/test/test_utils/models.jl:36

  2. about random data generation (I'm not sure):

[ Info: Testing forwarddiff
gibbs constructor: Error During Test at /data/zhuoql/Work/julia/Turing.jl/test/test_utils/staging.jl:42
  Got exception outside of a @test
  mis-aligned execution traces: # particles = 10 # completed trajectories = 5. Please make sure the number of observations is NOT random.
  Stacktrace:
    [1] error(::String, ::Int64, ::String, ::Int64, ::String)

  3. some numeric errors.

@yebai (Member, Author) commented Dec 22, 2021

produce is called twice in a function which we don't trace down:

@KDr2 I checked carefully, and it appears that we only call produce once. The produce call can be found here. The call stack is roughly MoGtest ==> tilde_observe! ==> tilde_observe ==> observe, where produce is only called in the overloaded version of observe in AdvancedSMC. Am I missing anything?

EDIT: can you check whether another top-level instruction called produce (maybe indirectly via nested calls) but was not added to our traced-function list?

@yebai (Member, Author) commented Dec 22, 2021

A useful debugging trick is to print the call stack while inside produce, so we can easily identify top-level instructions which are not in the traced function list.
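
For example, a throwaway sketch of that trick (purely illustrative; produce_with_backtrace is a hypothetical helper, not part of Libtask):

    using Libtask

    # Print the current call stack before forwarding to the real `produce`, so
    # the top-level instruction that (indirectly) reached `produce` shows up in
    # the log.
    function produce_with_backtrace(val)
        for (i, frame) in enumerate(stacktrace())
            println("[", i, "] ", frame)
        end
        return Libtask.produce(val)
    end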

@KDr2 (Member) commented Dec 23, 2021

BT1:

    produce(val::Float64) at tapedtask.jl:142,
    observe at AdvancedSMC.jl:368 [inlined],
    tilde_observe at context_implementations.jl:138 [inlined],
    tilde_observe at context_implementations.jl:136 [inlined],
    tilde_observe at context_implementations.jl:131 [inlined],
    tilde_observe! at context_implementations.jl:182 [inlined],
    tilde_observe!(...) at context_implementations.jl:169,
    MoGtest() at models.jl:24,
    (::Libtask.Instruction{typeof(MoGtest)})() at tapedfunction.jl:84,
    step_in(t::Libtask.Tape, args::Tuple{}) at tapedtask.jl:66,
    step_in(t::Libtask.Tape, args::Tuple{}) at tapedtask.jl:64,
    step_in(t::Libtask.Tape, args::Tuple{}) at tapedtask.jl:64,
    step_in(t::Libtask.Tape, args::Tuple{}) at tapedtask.jl:64,
    step_in(t::Libtask.Tape, args::Tuple{}) at tapedtask.jl:64,
    step_in(t::Libtask.Tape, args::Tuple{}) at tapedtask.jl:64,
    step_in(t::Libtask.Tape, args::Tuple{}) at tapedtask.jl:64,

BT2:

   [1] error(s::String)      @ Base ./error.jl:33
   [2] produce(val::Float64) @ Libtask ~/Work/julia/Libtask.jl/src/tapedtask.jl:139
   [3] observe               @ ~/Work/julia/Turing.jl/src/inference/AdvancedSMC.jl:368 [inlined]
   [4] tilde_observe         @ ~/Work/julia/DynamicPPL.jl/src/context_implementations.jl:138 [inlined]
   [5] tilde_observe         @ ~/Work/julia/DynamicPPL.jl/src/context_implementations.jl:136 [inlined]
   [6] tilde_observe         @ ~/Work/julia/DynamicPPL.jl/src/context_implementations.jl:131 [inlined]
   [7] tilde_observe!        @ ~/Work/julia/DynamicPPL.jl/src/context_implementations.jl:182 [inlined]
   [8] tilde_observe!(...)   @ DynamicPPL ~/Work/julia/DynamicPPL.jl/src/context_implementations.jl:169
   [9] MoGtest()             @ Main ~/Work/julia/Turing.jl/test/test_utils/models.jl:36
   [10] (::Libtask.Instruction{typeof(MoGtest)})()

It seems they follow the same path to produce, but when the second one reaches produce, the value produced in BT1 has not been consumed yet.

UPDATE:
Oh, I didn't mark MoGtest as a should-trace-into.
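
A hypothetical sketch of that marking, mirroring the trace_into overloads used elsewhere in this PR (the exact mechanism is an assumption):

    # Assumed marking mechanism; MoGtest is the test model from Turing's test utilities.
    Libtask.trace_into(::typeof(MoGtest)) = true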

@yebai (Member, Author) commented Dec 23, 2021

Oh, I didn't mark MoGtest as a should-trace-into.

That explains the current behaviour!

@KDr2 (Member) commented Dec 23, 2021

And now the errors are back to not being able to get the IR code of some functions; below is a non-exhaustive list:

  • gdemo_d
  • MoGtest

@KDr2 (Member) commented Dec 29, 2021

The stack overflow still exists. We didn't see it when we ran the tests last night because the Julia process got stuck on my machine :(

[ Info: [Turing]: progress logging is disabled globally
[ Info: [AdvancedVI]: global PROGRESS is set as false
adr: Error During Test at /data/zhuoql/Work/julia/Turing.jl/test/test_utils/staging.jl:42
  Got exception outside of a @test
  StackOverflowError:
  Stacktrace:
    [1] evaluate!!(::DynamicPPL.Model{typeof(gdemo_d), (), (), (), Tuple{}, Tuple{}, DynamicPPL.DefaultContext}, ::Random._GLOBAL_RNG, ::Random._GLOBAL_RNG, ::Random._GLOBAL_RNG, ::Random._GLOBAL_RNG, ::Random._GLOBAL_RNG, ::Vararg{Any})
      @ DynamicPPL ~/.julia/packages/DynamicPPL/c8MjC/src/model.jl:420
    [2] evaluate!!(::DynamicPPL.Model{typeof(gdemo_d), (), (), (), Tuple{}, Tuple{}, DynamicPPL.DefaultContext}, ::Random._GLOBAL_RNG, ::Random._GLOBAL_RNG, ::Random._GLOBAL_RNG, ::Random._GLOBAL_RNG, ::Random._GLOBAL_RNG, ::Vararg{Any}) (repeats 11429 times)
      @ DynamicPPL ~/.julia/packages/DynamicPPL/c8MjC/src/model.jl:421
    [3] (::DynamicPPL.Model{typeof(gdemo_d), (), (), (), Tuple{}, Tuple{}, DynamicPPL.DefaultContext})(::Random._GLOBAL_RNG, ::Vararg{Any})
      @ DynamicPPL ~/.julia/packages/DynamicPPL/c8MjC/src/model.jl:377
    [4] (::DynamicPPL.Model{typeof(gdemo_d), (), (), (), Tuple{}, Tuple{}, DynamicPPL.DefaultContext})(x::Random._GLOBAL_RNG, y::DynamicPPL.TypedVarInfo{NamedTuple{(:s, :m), Tuple{DynamicPPL.Metadata{Dict{AbstractPPL.VarName{:s, Setfield.IdentityLens}, Int64}, Vector{InverseGamma{Float64}}, Vector{AbstractPPL.VarName{:s, Setfield.IdentityLens}}, Vector{Float64}, Vector{Set{DynamicPPL.Selector}}}, DynamicPPL.Metadata{Dict{AbstractPPL.VarName{:m, Setfield.IdentityLens}, Int64}, Vector{Normal{Float64}}, Vector{AbstractPPL.VarName{:m, Setfield.IdentityLens}}, Vector{Float64}, Vector{Set{DynamicPPL.Selector}}}}}, Float64}, z::SampleFromPrior)
      @ Turing.Inference ~/Work/julia/Turing.jl/src/inference/AdvancedSMC.jl:359
    [5] (::DynamicPPL.Model{typeof(gdemo_d), (), (), (), Tuple{}, Tuple{}, DynamicPPL.DefaultContext})(x::DynamicPPL.TypedVarInfo{NamedTuple{(:s, :m), Tuple{DynamicPPL.Metadata{Dict{AbstractPPL.VarName{:s, Setfield.IdentityLens}, Int64}, Vector{InverseGamma{Float64}}, Vector{AbstractPPL.VarName{:s, Setfield.IdentityLens}}, Vector{Float64}, Vector{Set{DynamicPPL.Selector}}}, DynamicPPL.Metadata{Dict{AbstractPPL.VarName{:m, Setfield.IdentityLens}, Int64}, Vector{Normal{Float64}}, Vector{AbstractPPL.VarName{:m, Setfield.IdentityLens}}, Vector{Float64}, Vector{Set{DynamicPPL.Selector}}}}}, Float64}, y::SampleFromPrior)

@yebai

@yebai (Member, Author) commented Dec 29, 2021

@KDr2 The stack overflow error is caused by these lines:

(m::DynamicPPL.Model)(x)=m(Random.GLOBAL_RNG, x)
(m::DynamicPPL.Model)(x, y)=m(Random.GLOBAL_RNG, x, y)
(m::DynamicPPL.Model)(x, y, z)=m(Random.GLOBAL_RNG, x, y, z)

I guess these new function definitions do not specify argument types, so the downstream calls to DynamicPPL.Model(...) become

[3] (::DynamicPPL.Model{typeof(gdemo_d), (), (), (), Tuple{}, Tuple{}, DynamicPPL.DefaultContext})(::Random._GLOBAL_RNG, ::Vararg{Any})

I fixed these lines by adding argument type information in 0cfb35e and d4eeeb9
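
For illustration, a minimal sketch of what adding the argument types might look like (the exact definitions are in the commits referenced above, so treat this sketch as an assumption). With the first argument untyped, passing Random.GLOBAL_RNG as x matches the same convenience method again, which keeps prepending another GLOBAL_RNG and recurses until the stack overflows; constraining the argument types breaks that cycle:

    using Random
    using DynamicPPL

    # Only a VarInfo (and, optionally, a sampler) triggers the convenience
    # methods now, so a call whose first argument is already an RNG no longer
    # re-enters them.
    (m::DynamicPPL.Model)(x::DynamicPPL.AbstractVarInfo) = m(Random.GLOBAL_RNG, x)
    (m::DynamicPPL.Model)(x::DynamicPPL.AbstractVarInfo, y::DynamicPPL.AbstractSampler) = m(Random.GLOBAL_RNG, x, y)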

EDIT: it seems CI is still using the wrong branch of Libtask.

@FredericWantiez (Member) commented

@yebai changed the calls to Model to:

(m::DynamicPPL.Model)(x::DynamicPPL.AbstractVarInfo)=first(DynamicPPL.evaluate!!(m, Random.GLOBAL_RNG, x))

but that runs into convert errors.

@FredericWantiez (Member) commented

I can get past it if I intercept the instruction we use to capture return values:

function (instr::Instruction{F})() where F
    # Run the instruction's function on the unboxed input values.
    output = instr.fun(map(val, instr.input)...)
    if instr.fun == identity
        # Special-case the identity instruction used for return values:
        # re-box the result instead of mutating the existing output box.
        instr.output = box(output)
    else
        instr.output.val = output
    end
end

Now the model runs but fails at the resampling step; it looks like the copied particles do not resume properly and return nothing too early.

@yebai (Member, Author) commented Dec 29, 2021

@FredericWantiez thanks, can you push your code please?

@KDr2 the remaining issue is likely associated with the new Libtask, can you take a look?

@yebai (Member, Author) commented Dec 29, 2021

@KDr2 we might need to add the function ‘first’ to traceable functions. It might be useful to carefully check again whether we correctly trace every function, now that the code can run.
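
If "traceable" here maps onto the trace_into mechanism shown in the diff below (an assumption), it could be a one-line overload:

    # Hypothetical: let Libtask trace down into `first` as well.
    Libtask.trace_into(::typeof(first)) = true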

Comment on lines +357 to +367
(m::DynamicPPL.Model)(x::DynamicPPL.AbstractVarInfo)=first(DynamicPPL.evaluate!!(m, Random.GLOBAL_RNG, x))
(m::DynamicPPL.Model)(x::DynamicPPL.AbstractVarInfo, y::DynamicPPL.AbstractSampler)=first(DynamicPPL.evaluate!!(m, Random.GLOBAL_RNG, x, y))
(m::DynamicPPL.Model)(x::DynamicPPL.AbstractVarInfo, y::DynamicPPL.AbstractSampler, z::DynamicPPL.AbstractContext)=first(DynamicPPL.evaluate!!(m, Random.GLOBAL_RNG, x, y, z))

# trace down into
Libtask.trace_into(::DynamicPPL.Model) = true
Libtask.trace_into(::typeof(DynamicPPL.evaluate_threadsafe!!)) = true
Libtask.trace_into(::typeof(DynamicPPL.evaluate_threadunsafe!!)) = true
Libtask.trace_into(::typeof(DynamicPPL._evaluate!!)) = true
Libtask.trace_into(::typeof(DynamicPPL.tilde_observe)) = true
Libtask.trace_into(::typeof(DynamicPPL.tilde_observe!!)) = true

A Member left a review comment:

These definitions are quite severe type piracy, in particular the ones for (m::Model)(...). Is the plan to move these definitions to DynamicPPL once tests pass and before this PR is merged? The definitions of trace_into would require DynamicPPL to depend on Libtask, though; is this intended?

@yebai (Member, Author) replied:

There is a cleaner way to implement the new Libtask-Turing integration. However, for now, we are trying to make things work together. This code will be replaced later.

@yebai (Member, Author) commented Dec 30, 2021

@KDr2 I just re-ran the tests. The code is now running, but there is one issue:

mis-aligned execution traces: # particles = 15 # completed trajectories = 8. Please make sure the number of observations is NOT random.

This normally means different ctasks have different numbers of observe statements. Given that all ctasks share the same TracedFunction, this is strange.

Note: it's fine to ignore numerical errors for now; without support for stochastic control flow in Libtask, some models like MoGtest will produce incorrect results.
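
For context, the "number of observations is NOT random" message refers to models whose number of observe statements depends on a random draw; an illustrative (hypothetical) example of such stochastic control flow:

    using Turing

    # Illustrative model: the number of observe statements depends on the
    # sampled value of `n`, so different particles can hit `produce` a
    # different number of times.
    @model function random_obs_count(y)
        n ~ Poisson(3.0)
        for i in 1:Int(n)
            y[i] ~ Normal(0, 1)
        end
    end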

@yebai (Member, Author) commented Dec 30, 2021

We might need to update the following function in order to use the new Libtask copying mechanism:

function Base.copy(trace::AdvancedPS.Trace{<:TracedModel})

@yebai (Member, Author) commented Jan 4, 2022

Replaced by #1757

@yebai closed this Jan 24, 2022
@yebai deleted the hg/new-libtask branch January 24, 2022 21:49