Skip to content

Conversation

@yebai
Copy link
Member

@yebai yebai commented Jul 7, 2023

Alternative to #481 by @YongchaoHuang

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@YongchaoHuang YongchaoHuang requested a review from torfjelde July 7, 2023 17:37
### Yong ##############################################################
# Yong added the below new functions on 2023-07-04, they are doing the some functionalities as Tor's functions. Some redundancy needs to be removed?
using Turing, Distributions, DynamicPPL, MCMCChains, Random, Test
using Distributions, DynamicPPL, MCMCChains, Random, Test
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't depend on DynamicPPL -- self-dependency is not needed. As long as the code is defined from inside DynamicPPL.jl, it will automatically have access to all DynamicPPL functionalities.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, removed now.

@github-actions
Copy link
Contributor

Pull Request Test Coverage Report for Build 5520892964

  • 0 of 64 (0.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-1.9%) to 74.517%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/model_utils.jl 0 64 0.0%
Totals Coverage Status
Change from base Build 5358326778: -1.9%
Covered Lines: 1927
Relevant Lines: 2586

💛 - Coveralls

### Yong ##############################################################
# Yong added the below new functions on 2023-07-04, they are doing the some functionalities as Tor's functions. Some redundancy needs to be removed?
# using Turing, Distributions, DynamicPPL, MCMCChains, Random, Test
using Distributions, Random, Test
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments on this:

  1. There is no need to import Distributions or Random as it is already imported by DynamicPPL (see here and here)
  2. There is no need to import Test package from DynamicPPL as it is only useful for testing. You can import them from runtests.jl

I think many of these issues are from bad programing practices, i.e., scripting instead of setting up a proper workflow and sticking to it consistently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I learned about the best practice this afternoon, particularly where to put the using stuff for docs, src and test folders. Thanks for that.
I thus removed all the imports after checking their presence in their project files.

@torfjelde
Copy link
Member

What is this PR doing that my PR isn't? I ask because right now it's somewhat unclear why another PR was needed.

@YongchaoHuang
Copy link
Contributor

Thanks @torfjelde for checking this. These new functions are doing the same thing as yours, in a lightly different ways? After these tests, would it be possible to merge these functions into yours as multi-dispatched versions?

@codecov
Copy link

codecov bot commented Jul 11, 2023

Codecov Report

Patch coverage has no change and project coverage change: -1.90 ⚠️

Comparison is base (e6dd4ef) 76.40% compared to head (57cd078) 74.51%.

❗ Current head 57cd078 differs from pull request most recent head cf30820. Consider uploading reports for the commit cf30820 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #495      +/-   ##
==========================================
- Coverage   76.40%   74.51%   -1.90%     
==========================================
  Files          21       22       +1     
  Lines        2522     2586      +64     
==========================================
  Hits         1927     1927              
- Misses        595      659      +64     
Impacted Files Coverage Δ
src/DynamicPPL.jl 100.00% <ø> (ø)
src/model_utils.jl 0.00% <0.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@yebai
Copy link
Member Author

yebai commented Jul 11, 2023

@YongchaoHuang can you clarify the main implementation differences, the pros and cons? We will only merge one PR based on merit.

@YongchaoHuang
Copy link
Contributor

YongchaoHuang commented Jul 11, 2023

@YongchaoHuang can you clarify the main implementation differences, the pros and cons? We will only merge one PR based on merit.

Sure. Below is a raw comparison.

Same functionalities:
Both PRs are claiming 3 major functions: varname_in_chain/varnames_in_chain, values_from_chain.

Differences:

  1. Differences in the implementation of varname_in_chain.
    PR495 uses the namesingroup property of a chain object, which is convenient and only 1 function is needed (no input information from model):
varname_in_chain(vn::VarName, chain, chain_idx=1, iteration_idx=1)

It outputs a logical True/False indicator (if vn or any of its leaves are present in chain) and an OrderedDict storing the leaf names belonging to vn.

PR481 has multi-dispatched functions for this:

varname_in_chain(model::Model, vn, chain, chain_idx, iteration_idx)
varname_in_chain(varinfo::AbstractVarInfo, vn, chain, chain_idx, iteration_idx)
varname_in_chain(x, vn, chain, chain_idx, iteration_idx)
varname_in_chain!(model::Model, vn, chain, out, chain_idx, iteration_idx)
varname_in_chain!(vi::AbstractVarInfo, vn_parent, chain, out, chain_idx, iteration_idx)
varname_in_chain!(x, vn_parent, chain, out, chain_idx, iteration_idx)
varname_in_chain!(x::AbstractArray, vn_parent::VarName{sym}, chain, out, chain_idx, iteration_idx)

they output an array out which stores the presence of vn and its leaves.

My understanding is that, when testing if a vn is in the chain, we don't need the information from model or varinfo or x, because the namesingroup property of a chain object can test if vn and its leaves are present in the chain. Also, putting the output array out in the signature may not be a good idea (maybe I did not get its use). Further, in varname_in_chain!(x::AbstractArray, vn_parent::VarName{sym}, chain, out, chain_idx, iteration_idx), this line varname_in_chain!(x::AbstractArray, vn_parent::VarName{sym}, chain, out, chain_idx, iteration_idx) relies on the example array x and varname_leaves to obtain vn and then compose the leaf name vn_parent ∘ AbstractPPL.getlens(vn) for further query - not sure if this way of constructing a name is fragile or not.

  1. Differences in varnames_in_chain.
    In PR495, I defined two signatures with 1 method (as VarInfo(model) can be obtained from model):
 varnames_in_chain(model:::Model, chain)
 varnames_in_chain(varinfo::VarInfo, chain)

It iteratively extracts each vn from model and tests its presence (and its leaves) in the chain using the previously defined varname_in_chain. They output two items: a logical value out if all variable names from model are in chain, an OrderedDict storing the leaf names of each vn from the model that are also present in chain.

PR481 also iteratively tests each the presence of each vn and its leaves in chain, using the previously defined varname_in_chain functions. They output a logical value out if all variables names from model are in chain.

The only confusion for me here is again the presence of the output array out in the signature, as seen here.

  1. Differences in values_in_chain.
    PR495 first defines an extra function vn_values_from_chain (which is not defined in PR481) as follows:
vn_values_from_chain(vn, chain, chain_idx, iteration_idx)

This allows one to extract the sample values of any user-specified vn (if vn or any of its leaves are present in chain) without any model information input.

Then I defined two methods for extracting sample values from chain:

values_from_chain(model::Model, chain, chain_idx, iteration_idx)
values_from_chain(varinfo::VarInfo, chain, chain_idx, iteration_idx)

they both return an OrderedDict which contains (vn_leaves,value) pair from chain. Here we iterate through the variable names in model and use the previously defined vn_value_from_chain to extract each vn and its leaves' values.

further, we may not be happy just extract a single row of sample values for all variables; we may want to specify the index ranges for chain number and iteration number. So I defined and multi-dispatched the following functions:

values_from_chain(model::Model, chain, chain_idx_range, iteration_idx_range)
values_from_chain(varinfo::VarInfo, chain, chain_idx_range::UnitRange, iteration_idx_range::UnitRange)

They return an OrderedDict with each element also an OrderedDict - first level corresponds to different chains, and each sub OrderedDict contains all sample values for all variables in that chain.
This allows the user to extract the sample values of all variables in model within the user-specified chain and iteration ranges.
In designing these functions, I mainly concern about the usability and simplicity of the interfaces, i.e. minimizing the information needed for doing it.

PR481 defines multiple functions for values_from_chain:

values_from_chain(x, vn_parent, chain, chain_idx, iteration_idx)
values_from_chain(x::AbstractArray, vn_parent::VarName{sym}, chain, chain_idx, iteration_idx)
values_from_chain(vi::AbstractVarInfo, vn_parent, chain, chain_idx, iteration_idx)
values_from_chain!(model::DynamicPPL.Model, chain, out, chain_idx, iteration_idx)
values_from_chain!(vi::AbstractVarInfo, chain, out, chain_idx, iteration_idx)

Similar to varnames_in_chain, it composes the leaves of each vn in model by vn_parent ∘ AbstractPPL.getlens(vn).
These functions are quite convenient (PR438 used to use them, although the current version uses direct chain indexing). However, I do think the x::AbstractArray thing can be eliminated from the inputs - we don't need an example array of the sample values, combining with varname_leaves to construct the leaves names; we can instead using namesingroup(chain, Symbol(vn)) to conveniently test and extract any leaves of vn.

PR481 has two functions to extract all values from chain for all variables from model:

value_iterator_from_chain(model::DynamicPPL.Model, chain)
value_iterator_from_chain(vi::AbstractVarInfo, chain)

This implementation is very efficient (nicer than my for loops - though some optimization work can be done in PR495 as well). However, it doesn't allow user-specified ranges of chain number and iteration number, e.g. I want the parameter values at chain 1:2 and iteration 100:200.

Above are a crude comparison of both PRs, I may have missed/misunderstood some details. Please correct me if anything wrong or confusing @torfjelde @yebai

@torfjelde
Copy link
Member

Replies

My understanding is that, when testing if a vn is in the chain, we don't need the information from model or varinfo or x, because the namesingroup property of a chain object can test if vn and its leaves are present in the chain.

A couple of things:

  • namesingroup depends on MCMCChains.jl, which we don't want to do.
  • namesingroup works with strings, which we ideally don't want to do. The main point with my PR is to make easy to go from value in chain to value in varinfo; when the keys are strings, this becomes harder.

My understanding is that, when testing if a vn is in the chain, we don't need the information from model or varinfo or x, because the namesingroup property of a chain object can test if vn and its leaves are present in the chain.

But what if I want to use a chain for model, even though chain came from some model with a superset of the variables in model? Then I'm extracting more than I need from chain.

Also, putting the output array out in the signature may not be a good idea (maybe I did not get its use)

This is a very common pattern in Julia. This is more memory efficient, and we're not going to be differentiating through this code, so mutation is not an issue.

relies on the example array x and varname_leaves to obtain vn and then compose the leaf name vn_parent ∘ AbstractPPL.getlens(vn) for further query - not sure if this way of constructing a name is fragile or not.

This is the most robust way we have doing things, and ,in particular, it gives us VarName keys which continue much more structural information than their string counterparts (which is what we've done in the past, and it's been a source of many awkward bugs).

The only confusion for me here is again the presence of the output array out in the signature

As above, this is very common in Julia.

further, we may not be happy just extract a single row of sample values for all variables; we may want to specify the index ranges for chain number and iteration number. So I defined and multi-dispatched the following

This seems like unnecessary complication to me; it's really just replacing a map over the indices.

They return an OrderedDict with each element also an OrderedDict - first level corresponds to different chains, and each sub OrderedDict contains all sample values for all variables in that chain.

Why change the return-value from the Matrix that I believe is currently returned?

we can instead using namesingroup(chain, Symbol(vn)) to conveniently test and extract any leaves of vn.

As before, we don't want to work with stirngs.

However, it doesn't allow user-specified ranges of chain number and iteration number, e.g. I want the parameter values at chain 1:2 and iteration 100:200.

Same as above, it's just a map away.

Note

All in all, I'd give two pieces of advice here @yongchao:

  1. Before you make these design decisions that deviate from the an existing PR, I'd suggest consulting with the person who initiated that PR to understand if a) if what you want to do is sensible, and b) if you want to change their code, to understand why they made the decisions they made. I realize I've been busy lately, but if you send me a concrete question of "does this make sense?" or "why did you make this design decision?", I'm generally quite responsive. This will save you a lot of time.
  2. If you have to write a two-pager on the motivation and description of the PR, it's maybe a bit too big 😬 And PRs, in particular with changes like these, should have clear motivational examples (either a bug it addresses, or a demonstration of why its utility); there's a lot of text in the above comment, and I really have to read through it all to even get a sense of what the motivation is. A few examples would convey the same information much quicker.

@YongchaoHuang
Copy link
Contributor

  • namesingroup depends on MCMCChains.jl, which we don't want to do.

Many thanks @torfjelde for the explanations, they are really helpful.

The reason why I raises this new PR is, I thought these functions can be done in a slightly simple way, and the user interfaces look more clean I guess (e.g. the use of vi, x array, out, etc in the signature may confuse users, so I use as minimal info to allow the values to be extracted).

Some quick thoughts:

  1. yes namesingroup uses strings in its search, which may not be desirable in this case. And it relies on MCMCChains.jl. It explains the rational behind your solution.
    I got the idea that VarName contains more info than simple strings when it is used in 438. Totally agree that it's more robust - I have not experiences with the bugs caused by simple string comparison though.

  2. Could you give an illustrative example on the following (so that I can more easily get your point):
    "But what if I want to use a chain for model, even though chain came from some model with a superset of the variables in model? Then I'm extracting more than I need from chain."

  3. passing out to input argument. Sorry, that's totally my misunderstanding - by passing out as an input argument and modifying it directly within the function, it avoids creating a new array every time it is called, which can be more efficient in terms of memory usage and performance. So it's essentially modification in place.

  4. iterating over chains and iterations. Yes, using map over indices can do the job. Maybe we can add a dispatched version which does this?

  5. Re: " They return an OrderedDict with each element also an OrderedDict - first level corresponds to different chains, and each sub OrderedDict contains all sample values for all variables in that chain.
    Why change the return-value from the Matrix that I believe is currently returned?"
    When the user queries multiple chains, then each chain is wrapped in an OrderedDict. I think this multiple chain and multiple iteration query has not been implemented in 481 - although they can be just implemented using map as you mentioned.

@yebai
Copy link
Member Author

yebai commented Jul 13, 2023

Closed in favour of #497

@yebai yebai closed this Jul 13, 2023
@yebai yebai deleted the yongchao/extract-model-values-from-chain branch July 13, 2023 19:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants