Add function for covariance matrix #898
Codecov Report
@@ Coverage Diff @@
## main #898 +/- ##
=======================================
Coverage 93.40% 93.40%
=======================================
Files 25 25
Lines 20362 20389 +27
Branches 825 825
=======================================
+ Hits 19019 19045 +26
- Misses 1306 1307 +1
Partials 37 37
|
Great! Very excited! Let's see, a few comments. We've discussed elsewhere that this is actually And, I believe this is the same thing, written as a branch statistic:
Do you agree, that this is the same thing? Regardless of which thing we use, most of the work is in writing the "naive" calculation and the tests. So, this is great. |
This is potentially a minefield - I have spent a lot of time on this and the terminology, notation and descriptions are a huge mess IMO. I see it merely as a metric that tells us how similar/identical two chromosomes are. There are lots of details. I think this is the summary. With a pedigree, we can only make probabilistic statements about this similarity/identity because we never observe chromosomes; further, because a pedigree is always finite, we always miss some ancient relationships and can only report similarity/identity in the IBD sense with respect to some base population where we assume all individuals are unrelated - so we are constantly underestimating kinship and there is the variance of true/realised values around our expectations. With genomic data (some markers or whole sequence) we can observe similarity/identity - kinship is just a function of the proportion of identical alleles (some subtract 2*allele freq, some don't) - this is IBS. Here we can bring back IBD by looking at stretches of IBS similarity/identity, but that is crude as IBD segments have exponential lengths (it is easy to filter out short segments thinking they are very old). With trees, we can either use tree topology, lengths and times to get ~"prior/expected" covariance - I think this is identical to what @gtsambos was doing, no? We can also look at sequence similarity, so this is ~"posterior/realised" covariance, which is the same as the previous paragraph, but we can use the tree sequence to compute this more efficiently. @brieuclehmann @petrelharp @jeromekelleher does the above make sense? |
Agreed that this is a minefield. As you say, talking about "IBD" has the same issues. A big part of the confusion, though, is because sometimes people are talking about the pedigree quantity (analogous to the branch statistic) and sometimes they are talking about the genotype quantity (site statistic), which estimates the pedigree quantity. Here we're going to implement a particular statistic, though, so we do need to figure out (a) what exactly to calculate, and (b) what exactly to call it. As for what to calculate, often a good way to figure this out is to take the quantity that people most commonly compute from genotype data, and implement that. That way, people can use tskit to compute the thing they're used to calculating from their genotype matrices. You're saying people compute this as "proportion of identical alleles", which let's translate to "proportion of segregating sites at which the two target genomes agree", or "number of sites at which the two agree divided by number of segregating sites". Hm, ok, let's see. The branch statistic we want to compute is "total shared area". The corresponding site statistic would be "total number of segregating sites at which the genomes agree", since that's what mutations that fall on the 'shared area' would create. So, the branch statistic that corresponds to "proportion of segregating sites at which the two target genomes agree" would be "total shared area divided by total area". So, how about this for a proposal: we implement
where if And, what to call it? I would like to call it A note: the IBD notion of kinship (i.e., proportion of genome that is IBD back to some particular time, say) is obviously related, and could be obtained from @gtsambos's ibd implementation. I think if we implement that, we could call it |
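For concreteness, here is a minimal sketch of the site quantity quoted above, "number of sites at which the two agree divided by number of segregating sites". It is not the tskit implementation: the function name `site_agreement` and the plain list-of-lists genotype layout are assumptions made only for illustration, and alleles are assumed to be coded 0/1 for haploid genomes.

```python
def site_agreement(genotypes, i, j):
    """Proportion of segregating sites at which genomes i and j agree.

    genotypes is a list of sites; each site is a list with one allele
    (0 or 1) per haploid genome. This layout is a hypothetical choice.
    """
    # A site is segregating if more than one allele is present among the genomes.
    segregating = [site for site in genotypes if len(set(site)) > 1]
    if not segregating:
        raise ValueError("no segregating sites")
    agree = sum(1 for site in segregating if site[i] == site[j])
    return agree / len(segregating)


# Four genomes, four sites; the third site is monomorphic and is ignored.
G = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
print(site_agreement(G, 2, 3))
print(site_agreement(G, 0, 1))
```

Under the duality discussed above, the branch analogue would replace "count of agreeing segregating sites" with "shared branch area" and "count of segregating sites" with "total branch area".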
@petrelharp I agree that the branch statistic does the same thing, and I've checked it with some test examples :). I didn't realise the stats framework was so flexible! Regarding terminology, I agree that kinship would be a good name for this, as this suggests a notion of similarity between individuals. We could have different ts.xxx_kinship() functions to specify exactly which type of kinship we mean, or add it as an argument in ts.kinship(type = "xxx") (or both?). I would favour Regarding which implementation to use, it may just be a question of efficiency, since computing the kinship for large tree sequences could quickly become expensive. |
@petrelharp I like this site & branch duality - I still need to get my head around it though. Do we really get the same answer if I work with site statistic (genotypes) and branch statistic here? They will obviously be very similar (tree is inferred from mutations), but not necessarily the same, will they? I vote for naming this function |
Which sort of "the same" do you mean? You will get different numbers if you do
I don't see this? These aren't exactly the same, because we're getting a single number for each pair, and @gtsambos is getting a list of genomic segments. But I don't think that kinship is a summary of the IBD segments, even: IBD segments tell you how long ago a MRCA is, but for kinship we need to also know the age of the root of the tree. I suppose that given the IBD segments and the age of the root of the tree at each site you could compute kinship, but it'd be a very inefficient way to do it? |
I agree, on all points! I'd vote for |
Ok, this clears bits in my head! |
Yes, but @gtsambos is also summarising the IBD segments between pairs of chromosomes into a single number by summing the length of segments and dividing by the chromosome length. So, not the same procedure, but the end is the same and possibly some intermediate steps too. |
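A rough sketch of the summary described in this comment (sum of IBD segment lengths divided by chromosome length). It is not @gtsambos's implementation: `ibd_fraction`, the `(left, right)` segment layout, and the optional `min_length` cutoff are all hypothetical names, and the segments are assumed not to overlap.

```python
def ibd_fraction(segments, genome_length, min_length=0.0):
    """Fraction of the genome covered by IBD segments for one pair.

    segments is a list of (left, right) intervals, assumed non-overlapping.
    Segments shorter than min_length are ignored (a hypothetical cutoff).
    """
    total = sum(
        right - left
        for left, right in segments
        if right - left >= min_length
    )
    return total / genome_length


pair_segments = [(0, 10), (20, 25), (40, 41)]
print(ibd_fraction(pair_segments, genome_length=100))
print(ibd_fraction(pair_segments, genome_length=100, min_length=2))
```

The second call drops the short segment, which is one way the single number ends up depending on a chosen threshold.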
Ah, right - sorry, I was thinking you meant in the method that's implemented in tskit. But this would give kinship relative to the population at a particular time (if that's how IBD is computed), which depends on the time chosen, and is very different than what we're computing. |
Note: I wrote a bunch of the below before looking back at Speed & Balding - you should probably first read the "SNP-based measures of relatedness" section in that paper. Ok, here's a summary of the issues. The kinship coefficient ("kinship") is pretty universally defined to be the "proportion of IBD"... but that's not an actual definition because IBD is not a singly-defined concept. This was (maybe originally?) called "relatedness" by Wright 1922. There are (at least?) three main use cases:
Estimators for the first use case also always normalize somehow for "background relatedness", as they must. Talking through this a bit: suppose that
Note that
so if we estimate
and so another estimator of
Ok, and how about trait covariance? Let
So, So, here's one proposal:
where
So,
Remaining questions:
I need to digest the summary in Speed & Balding more before feeling sure about this proposal. (Hopefully someone else will, also?) Notes:
References and notes:
|
I will just add some bits for completeness. I will focus on quantitative genetics view. I apologise for the length and a bit of ramble, but I thought it would be better if I spill it all out and connect seemingly different things;) Wright 1922 worked on mating systems in general pedigrees (instead of just simple cases). Using his method of path analysis he came up with:
This was all with pedigrees where we can only make probabilistic IBD statements relative to some founder population where we assume everybody is unrelated (though that is not true! - this matters quite a bit when we move to genomics). Later Emik and Terrill 1949 came up with a simple recursive algorithm to calculate these pedigree-based coefficients from an ordered pedigree (parents before progeny). Henderson 1976 [apparently he had bits of this worked out much earlier, but here he published a linear algebra treatment] put all this into the context of a linear model by linking together the phenotype model pheno_value[i] = mean + add_gen_value[i] + error[i] and the genetic model on a pedigree (of diploid individuals) add_gen_value[i] = 0.5 * add_gen_value[father[i]] + 0.5 * add_gen_value[mother[i]] + mendelian_sampling_value[i]. Mendelian sampling captures deviation from parent average due to recombination, mutation, gene conversion and segregation, possibly also selection. With pedigrees, we do not observe any genes, but we can infer add_gen_values and their mendelian_sampling_value from regressing phenotypes on a pedigree. If we treat additive genetic values as random variables and we do Cov() and Var() algebraic calculations on an ordered pedigree we are effectively running the Emik and Terrill algorithm. For the whole vector of a we get Var(a) = A\sigma^2 - an important caveat here - we are doing covariances here, not correlations - so Henderson strictly called the matrix A the numerator relationship matrix (referring to the numerator of Wright's relationship correlation coefficient). Today this matrix is often called just a relationship matrix, which causes confusion between correlations and covariances. Sometimes it is also called a kinship matrix (see below why this is confusing). To be pedantic, A is not a covariance matrix! A\sigma^2 is the covariance matrix between additive genetic values, while A itself is just a covariance coefficient matrix.
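As an illustration of the recursion on an ordered pedigree, here is a minimal sketch of the standard tabular method for the numerator relationship matrix A. It is written from the textbook recursion rather than from any of the cited papers; the function name and the `sires`/`dams` list layout (parent index, or `None` for an unknown founder parent) are assumptions.

```python
def numerator_relationship_matrix(sires, dams):
    """Tabular method for A on an ordered pedigree of diploid individuals.

    Individuals are numbered 0..n-1 with parents before progeny.
    sires[i] and dams[i] hold parent indices, or None if unknown.
    Unknown parents are treated as unrelated, non-inbred founders.
    """
    n = len(sires)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        s, d = sires[i], dams[i]
        # Off-diagonals: average relationship of j with the parents of i.
        for j in range(i):
            a_js = A[j][s] if s is not None else 0.0
            a_jd = A[j][d] if d is not None else 0.0
            A[i][j] = A[j][i] = 0.5 * (a_js + a_jd)
        # Diagonal: 1 plus the inbreeding coefficient of i.
        inbreeding = 0.5 * A[s][d] if s is not None and d is not None else 0.0
        A[i][i] = 1.0 + inbreeding
    return A


# Two founders (0, 1), two full sibs (2, 3), and their offspring (4).
sires = [None, None, 0, 0, 2]
dams = [None, None, 1, 1, 3]
A = numerator_relationship_matrix(sires, dams)
print(A[2][3])  # full sibs
print(A[4][4])  # offspring of a full-sib mating
```

Halving every entry of A gives the kinship-style coefficients discussed later in this comment.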
An important point for me here is that one first sets up a genetic model (process) and then co-variance calculations fall out automatically (most genomic variants are not following this logic). What is in this A matrix? Diagonal element gives variance of additive genetic values - for a diploid individual this is:
Off-diagonal elements give covariance between additive genetic values of two individuals - for diploids we have
Henderson was working in animal breeding. I am less familiar with the historical development of the human genetics branch, but my understanding is that Cotterman 1940 and Malecot 1948 worked on IBD probabilities (not correlations or covariances), which however gave the same intrinsic formulae as in Wright or Henderson (up to some scaling). I don't know this history but my understanding is that kinship coefficients arose from asking what is the probability that two random alleles from two diploid individuals (one allele per ind) are IBD. This gives:
If we apply the same to one individual we get
This is effectively the same as 1/2 of the above covariance calculations for additive genetic values, so the kinship matrix between individuals K is 1/2 of Henderson's numerator relationship matrix A. A modern summary is in Lange, K. (1997) Mathematical and statistical methods for genetic analysis, see also Thompson 2013. As I wrote above, the literature interchanges the terms kinship, relationship, relatedness, genetic, covariance, ... all the time. Even my version of all of the above might be disputed by someone. OK, what about genomics? There is a gazillion of relationship variants! IBD, IBS, segments, scaling this and that. To my understanding, most of these variants are due to different views or different ways of calculating the fuzzy quantity. Following the IBD probability view from pedigrees, it seems natural that if we can get a better estimate of IBD probabilities we can simply replace these in the kinship matrix K or numerator relationship matrix A. This can for example capture one of "recombination, mutation, gene conversion and segregation, possibly also selection", but also relatedness between founders. There are at least two common ways of estimating IBD probabilities - let's focus on identical DNA segments between two chromosomes - this is giving us information about Pr(genome[i,k] = genome[j,l]). With a set of identical DNA segments between two individuals, we can calculate the kinship coefficient between these two individuals as a sum of segment lengths divided by the genome length as @gtsambos showed us the other day. One catch here is related to the length of segments that are treated as identical. Some aficionados will say that we care about IBD because that's what the field has always been doing and because for some reason we want to focus just on recent relationships we will impose some threshold on segment length, say 1cM.
This can create bias because segment lengths have an exponential distribution, so a short segment can also reflect a recent relationship. The IBD concept was invented as a proxy for IBS because we could not observe DNA back in the day. Now we can. So why shouldn't we take into account single allele sharing? One reason for this was that until recently we mostly worked with SNP arrays with a limited number of markers and the IBD aficionados felt that sharing an allele at one locus gives you no IBD info. If we allow IBD segment length down to a single locus (we are looking at IBS) then the kinship coefficient between two individuals k[i,j] is the proportion of alleles that these two individuals share; for an individual k[i,i] would be 1/2 if it is heterozygous at all loci and 1 if it is homozygous at all loci. Note that here we are not centring anything - all "IBD" probabilities are since some time point in the past, potentially all the way to the root when IBD and IBS become the same thing. I think that working with trees can neatly solve this IBD vs IBS debate because it unifies them - namely, in pedigrees, IBD reflected relationship up to some founding populations. With genomics, we can now measure relationships between these founders and I think that at least conceptually it makes sense to go back in time to the root. See also Powell 2010 on IBD vs IBS. What about the "covariance/model" approach? Let's assume that the additive genetic value of an individual is a linear combination of its genotype x (row vector for p loci with values 0, 1, and 2) and loci effects alpha (column vector):
or for a set of individuals (X is n-by-p matrix)
What is the variance of this genetic model conditional on observed X?
where XX' (n*n matrix for n individuals) is often called genomic numerator relationship matrix or genomic relationship matrix or genomic kinship (argh...). What's in XX'? Each element is an inner product between genotypes of two individuals (or for an individual with itself on diagonal). An alternative to the above view on modelling individual values is to model effects of loci with a penalised multiple regression:
Note that X'X/n (p*p matrix for p loci) is related to the linkage-disequilibrium (LD) (covariance) matrix between loci. But some centre and scale genotypes!? The above Var(add_gen_value | X) shows that we do not need to. The two most common versions of centring and scaling are: let maf[i] = mean(X[, i])/2 be allele frequency in the sample VanRaden 2008 The centring and scaling (by 2 * sum(maf * (1 - maf)) = sum of heterozygosities) logic above was to get a genomic numerator relationship matrix that would be somewhat "similar" to the pedigree numerator relationship matrix - because VanRaden had millions of phenotyped and pedigreed cattle and only ~1000 genotyped bulls so it made sense to tailor genomic info to pedigree info. Note that centring introduces negative values in ZZ', which makes IBD probability aficionados scream! But, negative values do make sense in the context of Wright's inbreeding coefficient in structured populations 1-F_IT = (1-F_IS)(1-F_ST). I believe this also gave an estimate of additive genetic variance \sigma^2_a somewhat similar to what we get from the pedigree. But this argument is riddled with problems because the real parameter in Var(add_gen_value | X) = XX'\sigma^2_alpha is \sigma^2_alpha not \sigma^2_a (VanRaden assumed that \sigma^2_a ~ \sigma^2_alpha * 2 * sum(maf * (1 - maf)), which would make sense if loci were not in any LD). There are several Nature Genetics papers on a better genomic relationship matrix, but none mentions this, sic. Yang 2010 The logic here is that we upscale importance of rare alleles - trying to get to IBD in a crude way (if we share a rare allele then we must share more unobserved alleles around it). Note that this implies a different prior than Var(alpha)=I\sigma^2_alpha. This Yang version is very sensitive to small allele frequencies (division by a tiny value) and often genotype data is filtered on MAF to avoid this problem.
The VanRaden version is stable and the de-facto standard in agriculture (where all this is used on a daily basis!). Of note on centring. Remember the phenotype model:
So, centring in the context of the above model only pushes the mean of genotypes in to intercept. There can be numerical reasons to centre covariates in a model (for example if we run the above model through MCMC). In the context of trees, I think that non-centring means that root's genotypes are reference (set to intercept) and estimates of alphas would then be allele substitution effects between the root's ancestral alleles and mutations. Centring on current population allele frequencies pushes intercept to today's mean. Again, if we would use IBD segments we would not do any subtraction (unless we would cut some ancient sharing). |
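To make the two centring-and-scaling variants mentioned in this comment concrete, here is a small pure-Python sketch. The formulas are the commonly cited ones (VanRaden's first method: ZZ' divided by 2 * sum(maf * (1 - maf)); a Yang-style per-locus scaling by 2 * maf * (1 - maf), averaged over loci), not something taken from this thread. The function names are hypothetical, X is assumed to be an n-by-p list of lists with 0/1/2 genotypes, no locus may be monomorphic, and the Yang-style sketch applies the off-diagonal formula to the diagonal as well, which is a simplification.

```python
def allele_frequencies(X):
    """maf[k] = mean(X[, k]) / 2, as defined in the comment above."""
    n = len(X)
    return [sum(row[k] for row in X) / (2 * n) for k in range(len(X[0]))]


def centre(X, maf):
    """Z = X - 2 * maf, column by column."""
    return [[x - 2 * f for x, f in zip(row, maf)] for row in X]


def grm_vanraden(X):
    """ZZ' scaled by the sum of expected heterozygosities."""
    maf = allele_frequencies(X)
    Z = centre(X, maf)
    scale = 2 * sum(f * (1 - f) for f in maf)
    return [[sum(a * b for a, b in zip(zi, zj)) / scale for zj in Z] for zi in Z]


def grm_yang_like(X):
    """Per-locus scaling, averaged over loci (upweights rare alleles)."""
    maf = allele_frequencies(X)
    Z = centre(X, maf)
    w = [2 * f * (1 - f) for f in maf]  # zero for monomorphic loci
    p = len(maf)
    return [
        [sum(a * b / wk for a, b, wk in zip(zi, zj, w)) / p for zj in Z]
        for zi in Z
    ]


X = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [1, 2, 1],
]
Gv = grm_vanraden(X)
Gy = grm_yang_like(X)
print(Gv[0][1])  # negative off-diagonal after centring
```

In this toy example every locus has frequency 0.5, so the two versions coincide; they differ once frequencies vary across loci, with the per-locus version giving more weight to rare alleles.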
Based on the above "novel" we should indeed be careful how we name @brieuclehmann covariance calculation! It is not kinship, he is calculating covariance/similarity/identity between chromosomes k and l of individuals i and j, Pr(genome[i,k] = genome[j,l]). Possible names for this are: genomic covariance, gametic covariance (following Schaeffer et al 1989), node covariance, chromosome covariance, just covariance? Since the method operates on nodes (=chromosomes) we could just call it covariance. Maybe Kinship is when we sum these values into one value that describes the relationship between two diploid individuals (or more generally polyploid individuals). This could be called What to do with centring? Maybe we can centre the resulting values instead of the input (genotypes)? And with "scaling" to 0-1? |
Hm - is it not kinship just because it is acting on single chromosomes, instead of diploid individuals? If so, how would you define kinship between haploid individuals (say, male bees)? |
Another note: I think that in fact we're going to want to deal with the centred covariance matrix, since we don't observe phenotypes at the roots of the trees, and so our model is only defined up to an overall shift. |
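A small sketch of why centring removes the dependence on the root. One reading of "defined up to an overall shift" is that making the root older adds the same constant to every entry of the covariance matrix; double-centring (subtracting row and column means and adding back the grand mean) then gives the same matrix either way. The function name `double_centre` and the toy matrix are assumptions for illustration only.

```python
def double_centre(C):
    """Return (I - J/n) C (I - J/n) for a square matrix C given as lists."""
    n = len(C)
    row_means = [sum(row) / n for row in C]
    col_means = [sum(C[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [C[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


# Toy branch covariance for three samples, and the same matrix with the
# root placed 1.5 time units further back (adds 1.5 to every entry).
C = [
    [3.0, 2.0, 0.0],
    [2.0, 3.0, 0.0],
    [0.0, 0.0, 3.0],
]
C_older_root = [[c + 1.5 for c in row] for row in C]

centred = double_centre(C)
centred_older_root = double_centre(C_older_root)
print(centred)
```

Both inputs give the same centred matrix, and each of its rows sums to zero.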
Yes, following the established literature I believe kinship coefficient is between diploid individuals, but the authors obviously did not have other ploidy cases in mind. I am happy for either Or |
I see your point. Should we then add What about scaling to [0, 1]? |
just chiming in to say I've had a busy week, but will try to get my head around this soon! |
Thanks for the useful fleshed out explanations above @gregorgorjanc and @petrelharp, it's clear you've both thought about this a lot over the years! I've copied some of it to refer to later 🤓 I would vote against calling this method |
Yeah, this area is confusing, but if we start using different terms for very similar things (the difference is only in the type of information used (IBS vs IBD) and whether we centre or not) then we are just adding to the confusion. But I sense that whatever we name these methods it will not solve the confusion ;( |
I'm not so sure, to me it seems like the biggest difference is that IBD kinship is defined with respect to particular times, whereas this statistic is a summary of relatedness over all times. By IBS vs IBD, do you mean the site vs branch version of this statistic? I'd be a bit careful about calling the implemented site statistic a measure of IBS, as it will only be so under an infinite sites assumption |
Yes, IBD kinship can have an imposed limit on the age (or relatedly on the length of IBD segments that are considered in calculating IBD kinship), but it's up to a user to impose that limit or not. When we do not impose any limit, IBD and IBS kinship should be the same - IBD is just a proxy for IBS after all. As to site vs branch statistics, I am not sure. I understood that the branch statistic will assume neutral infinite sites, while the site statistic will reflect deviations from the assumption(s) because we are looking at actual genotypes (hence identical by state). Or is it the other way around? I love these discussions! I would like to note that we are trying to bring together lots of different concepts here, hence so many comments. |
Whoa, good point! I had not realized this, or seen it anywhere! |
Isn't IBD kinship as a function of time a cumulative function - so it's already integrating; at the root IBD kinship gives the same as @brieuclehmann IBS kinship? I view IBD kinship as F_IS (from 1-F_IT = (1-F_IS)(1-F_ST)) and when time goes to root F_IS equals F_IT. |
It is cumulative: I think it's "proportion of the genome that has coalesced by this time"; but Georgia points out that what we're calculating has another integral in - the difference is that IBD kinship has units of genome length (well, proportion of the genome), whereas Brieuc's calculation has units of time * length. Oh! This is the big difference! With IBD kinship, it doesn't matter if you coalesce earlier or later in the pedigree, as the underlying model doesn't include new mutations and all genetic variation comes from the founders; but for covariance of quantitative traits, coalescing more recently makes for a higher covariance, because genetic variation comes from new mutations. |
BTW: another word for all this is "coancestry". Unfortunately it appears to be a synonym for "kinship", so it's not much helping. |
Another word common in plant breeding is coefficient of parentage (as in COP matrix). For example, see https://cropforge.github.io/iciswiki/articles/t/d/m/TDM_COP2.htm Coefficient of inbreeding is kinship of a diploid individual with itself and a synonym for it is coeff of consanguinity (sometimes this same term is a synonym for kinship) https://www.oxfordreference.com/view/10.1093/oi/authority.20110803095621781 |
Just a comment following a chat with @gregorgorjanc, I think the current implementation assumes all samples are leaves and I'm not certain that it returns the right answer if we have internal samples. I'll add a test to this effect... |
Hmm, this is tricky. I think I would agree with this when looking at the first and third tree (where samples 1,2,3,5 are all in the same clade (started with 5), but in the second tree sample 3 is in a different clade - cutting the tree at 5 would suggest that 3 and 5 are unrelated, but they are potentially quite related (progeny of 8). |
Sorry, I hadn't spotted that 8 is the mrca in the second tree. |
OK, then the difference between taking into account oldest ancestors versus MRCA for each tree ( One argument for taking into account all ancestors would be that the associated variance component from a mixed model (where this relatedness matrix would be used) would not map to a well-defined time-point in the past across all trees (as it does if we use pedigrees with the same depth for all individuals), but ancestors and MRCA are of different ages anyway. |
d964243 to be1929c
I’ve adjusted the tests to have mutations (so that I’m testing against the naive genotype covariance rather than branch covariance). There’s still the question of rescaling: it turns out that for |
This is a heads-up here @brieuclehmann that you might have some merge problems when you update because of #939 It shouldn't affect you much, but you might need to add a few It would be good to get a version of this PR in soon though - the codebase is moving quickly at the moment and PRs get out of date quickly. We can always file some issues to track anything that isn't quite worked out yet. |
be1929c to 9e47224
Is this more-or-less ready to go then @brieuclehmann? We can open an issue to track the You have some lint issues to fix here I think. |
Add hack to divide by 2 - polarised in C implementation
Just a minor tweak to be made to remove Re linting, I think it's because I don't have clang-format installed... |
bf0637d to 72f901c
Looks great, thanks @brieuclehmann! Can you open issues to track any remaining things that need to be resolved please? |
Two minor things:
These can be done in follow-up. We also want more testing in the follow-up. |
OK, great, I'm going to merge this and open an issue to track the points from Peter above. Thanks @brieuclehmann, this is a big step forward! |
An implementation of covariance matrix calculation for sample leaves in a tree sequence. The covariance between a pair of leaves in a single tree can be calculated as the time from their most recent common ancestor back to the root of the tree, i.e. the length of the path they share below the root. The covariance across a tree sequence is a weighted sum of these tree covariances, with weights given by the span of each tree.
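A naive sketch of the definition above, for checking small examples by hand. It does not use tskit and is not the code in this PR: each tree is represented here as a hypothetical `(span, parent, time)` triple, where `parent` maps each non-root node to its parent and `time` maps each node to its age, and every tree is assumed to have a single root above all leaves.

```python
def mrca(parent, u, v):
    """Most recent common ancestor of u and v in one tree."""
    ancestors = set()
    while u is not None:
        ancestors.add(u)
        u = parent.get(u)  # None once we step above the root
    while v not in ancestors:
        v = parent[v]
    return v


def root_of(parent, u):
    """Follow parent pointers from u up to the root."""
    while parent.get(u) is not None:
        u = parent[u]
    return u


def naive_covariance(trees, leaves):
    """Span-weighted sum over trees of (root time - MRCA time) per leaf pair."""
    n = len(leaves)
    cov = [[0.0] * n for _ in range(n)]
    for span, parent, time in trees:
        for a, u in enumerate(leaves):
            root_time = time[root_of(parent, u)]
            for b, v in enumerate(leaves):
                cov[a][b] += span * (root_time - time[mrca(parent, u, v)])
    return cov


# Two trees over leaves 0, 1, 2 with root node 4 at time 3.
times = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0, 4: 3.0, 5: 2.0}
tree_a = (6.0, {0: 3, 1: 3, 3: 4, 2: 4}, times)  # span 6: (0, 1) coalesce at time 1
tree_b = (4.0, {1: 5, 2: 5, 5: 4, 0: 4}, times)  # span 4: (1, 2) coalesce at time 2
cov = naive_covariance([tree_a, tree_b], leaves=[0, 1, 2])
print(cov)
```

For example, leaves 0 and 1 share 3 - 1 = 2 units of branch length in the first tree and none in the second, so their entry is 6 * 2 = 12. The sum is left unnormalised, following the description above.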
The incremental implementation uses similar reasoning as the KC distance implementation #548 . We maintain the covariance between all pairs of leaves as follows. For each edge e, perform an upward and downward traversal. While traversing up toward the root, update the pairs of leaves where one leaf is in the subtree affected by e and one is not. Traversing down from e, update all pairs of leaves where both leaves are in the subtree. Pairs where both leaves are outside of the subtree under e haven't been affected by the insertion/removal of that edge.
Addresses part of #275