
Add function for covariance matrix #898


Merged 9 commits into main on Nov 4, 2020

Conversation

brieuclehmann
Contributor

An implementation of covariance matrix calculation for sample leaves in a tree sequence. The covariance between a pair of leaves in a single tree is the shared branch length from the root down to their most recent common ancestor, i.e. the time from their most recent common ancestor back to the root. The covariance across a tree sequence is a weighted sum of these per-tree covariances, with weights given by the span of each tree.
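
A naive sketch of this per-tree calculation (illustrative only, not the incremental algorithm in this PR; it assumes every sample is a leaf and each tree has a single root):

import numpy as np

def naive_covariance(ts):
    samples = ts.samples()
    n = len(samples)
    cov = np.zeros((n, n))
    for tree in ts.trees():
        root_time = tree.time(tree.root)
        for i, u in enumerate(samples):
            for j, v in enumerate(samples):
                # shared branch length from the pair's MRCA up to the root,
                # weighted by the genomic span of the tree
                cov[i, j] += (root_time - tree.time(tree.mrca(u, v))) * tree.span
    return cov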

The incremental implementation uses similar reasoning as the KC distance implementation #548 . We maintain the covariance between all pairs of leaves as follows. For each edge e, perform an upward and downward traversal. While traversing up toward the root, update the pairs of leaves where one leaf is in the subtree affected by e and one is not. Traversing down from e, update all pairs of leaves where both leaves are in the subtree. Pairs where both leaves are outside of the subtree under e haven't been affected by the insertion/removal of that edge.

Addresses part of #275

@AdminBot-tskit
Collaborator

📖 Docs for this PR can be previewed here

@codecov

codecov bot commented Oct 8, 2020

Codecov Report

Merging #898 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #898   +/-   ##
=======================================
  Coverage   93.40%   93.40%           
=======================================
  Files          25       25           
  Lines       20362    20389   +27     
  Branches      825      825           
=======================================
+ Hits        19019    19045   +26     
- Misses       1306     1307    +1     
  Partials       37       37           
Impacted Files Coverage Δ
c/tskit/trees.c 94.78% <100.00%> (+0.03%) ⬆️
_tskitmodule.c 90.20% <0.00%> (-0.02%) ⬇️
tskit/trees.py 97.25% <0.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c6ad9c...f250616. Read the comment docs.

@petrelharp
Contributor

Great! Very excited! Let's see, a few comments. We've discussed elsewhere that this is actually kinship, since covariance should be the covariance one calculates from the matrix of genotypes, which subtracts the mean from each allele frequency. (That would be nice to do too, though!) And we need to track down the best definition of kinship, and make sure we match it.

And, I believe this is the same thing, written as a branch statistic:

import numpy as np

n = ts.num_samples
def f(x):
    # x[i] gives the number of descendants in sample set i below the branch;
    # since the sample sets are singletons, x[i] * x[j] is 1 if both samples
    # are below the branch and 0 otherwise
    return np.array([x[i] * x[j] for i in range(n) for j in range(n)])

sample_sets = [[u] for u in ts.samples()]
kinship = ts.sample_count_stat(sample_sets, f, output_dim=n * n,
                               polarised=True, mode="branch",
                               span_normalise=False, strict=False).reshape((n, n))

Do you agree that this is the same thing?

Regardless of which thing we use, most of the work is in writing the "naive" calculation and the tests. So, this is great.

@gregorgorjanc
Member

And we need to track down the best definition of kinship

This is potentially a minefield - I have spent a lot of time on this, and the terminology, notation and descriptions are a huge mess IMO. I see it merely as a metric that tells us how similar/identical two chromosomes are. There are lots of details; here is my summary.

With a pedigree, we can only make probabilistic statements about this similarity/identity because we never observe chromosomes; further, because a pedigree is always finite, we always miss some ancient relationships and can only report similarity/identity in the IBD sense with respect to some base population where we assume all individuals are unrelated - so we are constantly underestimating kinship, and there is variance of the true/realised values around our expectations.

With genomic data (some markers or whole sequence) we can observe similarity/identity - kinship is just a function of the proportion of identical alleles (some subtract 2*allele freq, some don't) - this is IBS. Here we can bring back IBD by looking at stretches of IBS similarity/identity, but that is crude, as IBD segment lengths are exponentially distributed (it is easy to filter out short segments thinking they are very old).

With trees, we can use tree topology, branch lengths and times to get a ~"prior/expected" covariance - I think this is identical to what @gtsambos was doing, no? We can also look at sequence similarity, giving a ~"posterior/realised" covariance, which is the same as the previous paragraph, but we can use the tree sequence to compute it more efficiently.

@brieuclehmann @petrelharp @jeromekelleher does the above make sense?

@petrelharp
Contributor

Agreed that this is a minefield. As you say, talking about "IBD" has the same issues. A big part of the confusion, though, is because sometimes people are talking about the pedigree quantity (analogous to the branch statistic) and sometimes they are talking about the genotype quantity (site statistic), which estimates the pedigree quantity.

Here we're going to implement a particular statistic, though, so we do need to figure out (a) what exactly to calculate, and (b) what exactly to call it.

As for what to calculate, often a good way to figure this out is to take the quantity that people most commonly compute from genotype data, and implement that. That way, people can use tskit to compute the thing they're used to calculating from their genotype matrices. You're saying people compute this as "proportion of identical alleles", which let's translate to "proportion of segregating sites at which the two target genomes agree", or "number of sites at which the two agree divided by number of segregating sites".

Hm, ok, let's see. The branch statistic we want to compute is "total shared area". The corresponding site statistic would be "total number of segregating sites at which the genomes agree", since that's what mutations that fall on the 'shared area' would create. So, the branch statistic that corresponds to "proportion of segregating sites at which the two target genomes agree" would be "total shared area divided by total area".

So, how about this for a proposal: we implement

ts.kinship(sample_sets, proportion=True, ...)

where if proportion=False then we just compute the numerator in the statistics above (i.e., number of shared alleles or total shared tree area), and if proportion=True then we divide by the denominator (by segregating_sites, basically).
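
For concreteness, a rough sketch of the two options (reusing the f, sample_sets and n from the earlier snippet; the names and normalisation here are illustrative, not the final API):

# numerator: number of shared alleles per pair (the site-mode analogue of the branch call above)
shared = ts.sample_count_stat(sample_sets, f, output_dim=n * n,
                              polarised=True, mode="site",
                              span_normalise=False, strict=False).reshape((n, n))
# proportion=True: divide by the number of segregating sites
kinship = shared / ts.segregating_sites(span_normalise=False)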

And, what to call it? I would like to call it kinship; is this a reasonably common and expected definition? If so, should the default proportion above be True or False?

A note: the IBD notion of kinship (i.e., proportion of genome that is IBD back to some particular time, say) is obviously related, and could be obtained from @gtsambos's ibd implementation. I think if we implement that, we could call it ts.ibd_kinship( ), say, and not risk confusion.

@brieuclehmann
Contributor Author

@petrelharp I agree that the branch statistic does the same thing, and I've checked it with some test examples :). I didn't realise the stats framework was so flexible!

Regarding terminology, I agree that kinship would be a good name for this, as it suggests a notion of similarity between individuals. We could have different ts.xxx_kinship() functions to specify exactly which type of kinship we mean, or add it as an argument in ts.kinship(type="xxx") (or both?). I would favour proportion=True as I generally prefer unitless quantities (but I don't feel strongly about this).

Regarding which implementation to use, it may just be a question of efficiency, since computing the kinship for large tree sequences could quickly become expensive.

@gregorgorjanc
Member

@petrelharp I like this site & branch duality - I still need to get my head around it though. Do we really get the same answer if I work with site statistic (genotypes) and branch statistic here? They will obviously be very similar (tree is inferred from mutations), but not necessarily the same, will they?

I vote for naming this function kinship, but I feel this is exactly(?) the same as what @gtsambos is doing. Are we duplicating effort or are these two "bits" really doing something different? Or is it just the approach that differs? Obviously cutting the tree at some point (to get the proportion of chromosomes that are IBD back to some particular time) will give

@petrelharp
Contributor

Do we really get the same answer if I work with site statistic (genotypes) and branch statistic here?

Which sort of "the same" do you mean? You will get different numbers if you do mode="site" or mode="branch" (with different units, even). They are the "same" in that the site statistic is expected value of the branch statistic under a neutral, infinite-sites model of mutation, multiplied by the mutation rate - in other words, the site statistic (which is something you can compute from only the genotypes) is an estimator of the branch statistic (which you need the trees for).
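
For concreteness, a sketch of this duality using the summary function from the earlier snippet (f, sample_sets and n as defined there); under a neutral, infinite-sites model with mutation rate mu, the site values estimate mu times the branch values:

branch_vals = ts.sample_count_stat(sample_sets, f, output_dim=n * n,
                                   polarised=True, mode="branch",
                                   span_normalise=False, strict=False)
site_vals = ts.sample_count_stat(sample_sets, f, output_dim=n * n,
                                 polarised=True, mode="site",
                                 span_normalise=False, strict=False)
# same shape, different units: site_vals is roughly mu * branch_vals in expectation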

I vote for naming this function kinship, but I feel this is exactly? the same as what @gtsambos is doing.

I don't see this? These aren't exactly the same, because we're getting a single number for each pair, and @gtsambos is getting a list of genomic segments. But I don't think that kinship is a summary of the IBD segments, even: IBD segments tell you how long ago a MRCA is, but for kinship we need to also know the age of the root of the tree. I suppose that given the IBD segments and the age of the root of the tree at each site you could compute kinship, but it'd be a very inefficient way to do it?

@petrelharp
Contributor

We could have a different ts.xxx_kinship() functions to specify exactly which type of kinship we mean, or add it as an argument in ts.kinship(type = "xxx") (or both?). I would favour proportion = True as I generally prefer unitless quantities (but I don't feel strongly about this).

I agree, on all points! I'd vote for kinship(..., type=xxx) for minor modifications of the same thing, and xxx_kinship() for fundamentally different things (i.e., things computed using totally different code paths). But I think we need to do some careful looking at the literature to see what people do in practice.

@gregorgorjanc
Member

Which sort of "the same" do you mean? You will get different numbers if you do mode="site" or mode="branch" (with different units, even). They are the "same" in that the site statistic is expected value of the branch statistic under a neutral, infinite-sites model of mutation, multiplied by the mutation rate - in other words, the site statistic (which is something you can compute from only the genotypes) is an estimator of the branch statistic (which you need the trees for).

Ok, this clears bits in my head!

@gregorgorjanc
Member

I vote for naming this function kinship, but I feel this is exactly? the same as what @gtsambos is doing.

I don't see this? These aren't exactly the same, because we're getting a single number for each pair, and @gtsambos is getting a list of genomic segments. But I don't think that kinship is a summary of the IBD segments, even: IBD segments tell you how long ago a MRCA is, but for kinship we need to also know the age of the root of the tree. I suppose that given the IBD segments and the age of the root of the tree at each site you could compute kinship, but it'd be a very inefficient way to do it?

Yes, but @gtsambos is also summarising the IBD segments between pairs of chromosomes into a single number by summing the length of segments and dividing by the chromosome length.

So, not the same procedure, but the end is the same and possibly some intermediate steps too.

@petrelharp
Contributor

Yes, but @gtsambos is also summarising the IBD segments between pairs of chromosomes into a single number by summing the length of segments and dividing by the chromosome length.

So, not the same procedure, but the end is the same and possibly some intermediate steps too.

Ah, right - sorry, I was thinking you meant in the method that's implemented in tskit. But this would give kinship relative to the population at a particular time (if that's how IBD is computed), which depends on the time chosen, and is very different than what we're computing.

@petrelharp
Contributor

Note: I wrote a bunch of the below before looking back at Speed & Balding - you should probably first read the "SNP-based measures of relatedness" section in that paper first.

Ok, here's a summary of the issues. The kinship coefficient ("kinship") is pretty universally defined to be the "proportion of IBD"... but that's not an actual definition, because IBD is not a singly-defined concept. This was (maybe originally?) called "relatedness" by Wright 1922. There are (at least?) three main use cases:

  • close relative identification: for instance, two genomes obtained by separate meioses from the same diploid parent should have kinship of 0.5.
  • kin selection: in these studies they'd like the kinship coefficient to be the relatedness coefficient in Hamilton's rule, defined in terms of a threshold above which a gene increases in frequency. I'm not sure exactly what this should be.
  • mixed-model covariance matrix in GWAS: I think this is the covariance of trait deviations to the sample mean, rather than to the root, as we have been talking about.

Estimators for the first use case also always normalize somehow for "background relatedness", as they must. Talking through this a bit: suppose that P is the proportion of segregating alleles shared by two sibling (haploid) genomes, and PP the same quantity for the two parents (who are "unrelated"). Then the expected value of P is

E[P] = (1 + PP) / 2

Note that

(E[P] - PP) / (1 - PP) = 1 / 2

so if we estimate PP by the average value of P across all pairs and plug it in to the previous equation, we get an estimator of the first use case above, (P - PP)/(1 - PP). (This is in fact what is done by some methods; see below.) Note also that if S is the number of shared alleles (what we've started computing here), and SP the shared alleles between the parents, and N is the number of segregating sites, then P = S/N, so also

E[S] = (N + SP)/2

and so another estimator of P would be

(S - mean(S)) / (N - mean(S)) .

Ok, and how about trait covariance? Let X[i] be the trait value of i, and Z the trait value obtained by summing contributions across all the roots, and S[i,j] = cov(X[i] - Z, X[j] - Z), what we've been computing here. Instead, let XP be the mean of the X, so that with J[i,j] = delta(i,j) - 1/n,

  cov( X[i] - XP, X[j] - XP ) = (J S J)[i,j]
                              = S[i,j] - sum_k S[i,k] / n - sum_k S[k,j] / n + sum_{k,l} S[k,l] / n^2
                              =: Sigma[i,j]

So, Sigma[i,j] is the number of shared alleles between i and j, minus the mean number shared by each of i and j, plus the overall shared number across all pairs. In shorthand, this is S[i,j] - mean(S[i]) - mean(S[j]) + mean(S). If n is big, then those three means will all be pretty close to the mean "background relatedness", so this would be very close to S - mean(S) in the previous paragraph. Note also that S[i,i] = N, so that cor( X[i] - XP, X[j] - XP ) is very close to the estimator of P above, (S - mean(S)) / (N - mean(S)). And, I haven't read the details of all the various kinship estimators, perhaps some actually do it this way (do you know, @gregorgorjanc?).

So, here's one proposal:

def kinship(self, sample_sets, subtract_mean=True, rescale=True):

where

  • if subtract_mean=True, then we return S[i,j] - mean(S[i]) - mean(S[j]) + mean(S)
  • if rescale=True, then we divide by sqrt(output[i,i] * output[j,j]) (so getting a correlation matrix instead of a covariance matrix).

So,

  • subtract_mean=False, rescale=False would give the thing we've started calculating here, number of shared alleles
  • subtract_mean=True, rescale=False would give the covariance matrix of centered trait values
  • subtract_mean=True, rescale=True would give the correlation matrix of centered trait values, and also be an estimate of the kinship coefficient, thus agreeing with other uses of the term.
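
A minimal numpy sketch of the post-processing options listed above, assuming S is the n-by-n matrix of shared-allele counts (or shared tree area) already computed for each pair; the function name and arguments just mirror the proposal and are not tskit API:

import numpy as np

def postprocess_kinship(S, subtract_mean=True, rescale=True):
    S = np.asarray(S, dtype=float)
    out = S.copy()
    if subtract_mean:
        # out[i, j] = S[i, j] - mean(S[i]) - mean(S[j]) + mean(S)
        out = S - S.mean(axis=1, keepdims=True) - S.mean(axis=0, keepdims=True) + S.mean()
    if rescale:
        # divide by sqrt(out[i, i] * out[j, j]) to get a correlation-like matrix
        d = np.sqrt(np.diag(out))
        out = out / np.outer(d, d)
    return out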

Remaining questions:

  • Is this right? =) In particular, would the proposed default values actually give you an estimator of a number that'd let you identify sibs, cousins, etc in the usual way?
  • Which other software computes exactly the same thing? I think at least the mixed-model GWAS methods, is this true?
  • should we call this relatedness( ) instead, which seems to mean the same thing?

I need to digest the summary in Speed & Balding more before feeling sure about this proposal. (Hopefully someone else will, also?)

Notes:

  • "proportion of shared genome" means something different if the proportion is by length or by number of segregating sites. In the method-of-moments estimator above, it doesn't matter. We're computing the "number of segregating sites" version, which is more strongly affected by regions of the genome with lots of tall trees. If we computed this using IBD tract lengths, we'd get the other version. I think this is fine.

References and notes:

  • Speed & Balding is a good summary of different methods, and explains how genotype correlation gives an estimate of the kinship coefficient
  • Wikipedia says it's the proportion of IBD alleles.
  • [Thompson 1975](https://pubmed.ncbi.nlm.nih.gov/1052764/) uses "kinship" to refer to "proportion of genes shared in common" and refers to Malecot 1948 for it
  • How to estimate kinship, Goudet, Kay & Weir 2018 says that they follow others and "take the current population as the reference" so that "the kinship of a pair of individuals is compared to the average for all pairs of individuals in the sample, so the average estimated kinship is zero"; their method (Weir & Goudet 2017) estimates relative kinship as (M - \bar M) / (1 - \bar M) where M is the proportion of identical alleles and \bar M is the average of M across all pairs
  • KING, like plink, works with the rough idea that two genomes are either IBD, and thus identical, with probability f, or are like random draws from the allele frequency, with probability (1-f), and f is the kinship coefficient. But then note that their estimates can be negative. Also see the original KING paper.
  • Dou et al 2017 "Kinship coefficient, defined as the probability that two homologous alleles drawn from each of two individuals are identical by descent (IBD)"

@gregorgorjanc
Member

gregorgorjanc commented Oct 10, 2020

I will just add some bits for completeness. I will focus on the quantitative genetics view. I apologise for the length and a bit of rambling, but I thought it would be better to spill it all out and connect seemingly different things ;)

Wright 1922 worked on mating systems in general pedigrees (instead of just simple cases). Using his method of path analysis he came up with:

  • coefficient of relationship (relatedness) [= correlation between additive genetic values of two diploid individuals] and
  • coefficient of inbreeding [= correlation between (additive) genetic values of two genomes (uniting gametes) of a diploid individual].

This was all with pedigrees where we can only make probabilistic IBD statements relative to some founder population where we assume everybody is unrelated (though that is not true! - this matters quite a bit when we move to genomics).

Later Emik and Terrill 1949 came up with a simple recursive algorithm to calculate these pedigree-based coefficients from an ordered pedigree (parents before progeny).

Henderson 1976 [apparently he had bits of this worked out much earlier, but here he published linear algebra treatment] put all this into the context of a linear model by linking together the phenotype model

pheno_value[i] = mean + add_gen_value[i] + error[i]

and the genetic model on a pedigree (of diploid individuals)

add_gen_value[i] = 0.5*add_gen_value[father[i]] + 0.5*add_gen_value[mother[i]] + mendelian_sampling_value[i].

Mendelian sampling captures deviation from parent average due to recombination, mutation, gene conversion and segregation, possibly also selection. With pedigrees, we do not observe any genes, but we can infer add_gen_values and their mendelian_sampling_value from regressing phenotypes on a pedigree.

If we treat additive genetic values as random variables and do Cov() and Var() algebraic calculations on an ordered pedigree, we are effectively running the Emik and Terrill algorithm. For the whole vector a we get Var(a) = A\sigma^2_a - an important caveat here: we are doing covariances, not correlations - so Henderson strictly called the matrix A the numerator relationship matrix (referring to the numerator of Wright's relationship correlation coefficient). Today this matrix is often called just a relationship matrix, which causes confusion between correlations and covariances. Sometimes it is also called a kinship matrix (see below why this is confusing). To be pedantic, A is not a covariance matrix! A\sigma^2_a is the covariance matrix between additive genetic values, while A itself is just a matrix of covariance coefficients.

An important point for me here is that one first sets up a genetic model (process) and then co-variance calculations fall out automatically (most genomic variants are not following this logic).

What is in this A matrix? A diagonal element gives the variance of additive genetic values - for a diploid individual this is:

add_gen_value[i] = add_gen_value[i,1] + add_gen_value[i,2] # there are two genomes in there
Var(add_gen_value[i]) = Var(add_gen_value[i,1] + add_gen_value[i,2])
                      = Var(add_gen_value[i,1]) + Var(add_gen_value[i,2]) + 2Cov(add_gen_value[i,1], add_gen_value[i,2])

assuming that

Var(add_gen_value[i,j]) = 1/2 \sigma^2_a

and that

2Cov(add_gen_value[i,1], add_gen_value[i,2]) = F[i]\sigma^2_a

where F[i] is Wright's coefficient of inbreeding, which is equal to 0.5*Cov(add_gen_value[father[i]], add_gen_value[mother[i]]) / \sigma^2_a

we get 

Var(add_gen_value[i]) = (1 + F[i]) \sigma^2_a

Off-diagonal elements give covariance between additive genetic values of two individuals - for diploids we have

add_gen_value[i] = add_gen_value[i,1] + add_gen_value[i,2] # there are two genomes in there
add_gen_value[j] = add_gen_value[j,1] + add_gen_value[j,2] # there are two genomes in there
Cov(add_gen_value[i], add_gen_value[j]) = Cov(add_gen_value[i,1], add_gen_value[j,1]) +
                                          Cov(add_gen_value[i,1], add_gen_value[j,2]) +
                                          Cov(add_gen_value[i,2], add_gen_value[j,1]) +
                                          Cov(add_gen_value[i,2], add_gen_value[j,2])

Henderson was working in animal breeding. I am less familiar with the historical development of the human genetics branch, but my understanding is that Cotterman 1940 and Malecot 1948 worked on IBD probabilities (not correlations or covariances), which however gave the same underlying formulae as in Wright or Henderson (up to some scaling). I don't know this history, but my understanding is that kinship coefficients arose out of asking what is the probability that two random alleles from two diploid individuals (one allele per individual) are IBD. This gives:

k[i,j]=1/4*Pr(genome[i,1] = genome[j,1]) +
       1/4*Pr(genome[i,1] = genome[j,2]) + 
       1/4*Pr(genome[i,2] = genome[j,1]) + 
       1/4*Pr(genome[i,2] = genome[j,2])

If we apply the same to one individual we get

k[i,i]=1/4*Pr(genome[i,1] = genome[i,1]) +   ==> 1/4
       1/4*Pr(genome[i,1] = genome[i,2]) +
       1/4*Pr(genome[i,2] = genome[i,1]) +
       1/4*Pr(genome[i,2] = genome[i,2])     ==> 1/4
k[i,i]=1/2 + 1/2*Pr(genome[i,1] = genome[i,2])

This is effectively the same as 1/2 of the above covariance calculations for additive genetic values, so kinship matrix between individuals K is 1/2 of Henderson's numerator relationship matrix A. A modern summary is in Lange, K. (1997) Mathematical and statistical methods for genetic analysis, see also Thompson 2013. As I wrote above, literature interchanges terms kinship, relationship, relatedness, genetic, covariance, ... all the time. Even my version of all of the above might be disputed by someone.
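
For illustration, a small sketch of collapsing a per-genome (chromosome-level) matrix into a diploid kinship matrix using the 1/4-weighted sum above; P is assumed to be a 2n-by-2n matrix (e.g. of identity probabilities or per-chromosome covariances) with genomes 2*i and 2*i+1 belonging to diploid individual i - names are illustrative, not tskit API:

import numpy as np

def diploid_kinship(P):
    n = P.shape[0] // 2
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # average over the four genome pairings of individuals i and j
            K[i, j] = P[2*i:2*i+2, 2*j:2*j+2].mean()
    return K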

OK, what about genomics? There are a gazillion relationship variants! IBD, IBS, segments, scaling this and that. To my understanding, most of these variants are due to different views or different ways of calculating the fuzzy quantity.

Following the IBD probability view from pedigrees, it seems natural that if we can get a better estimate of IBD probabilities we can simply replace these in the kinship matrix K or the numerator relationship matrix A. This can, for example, capture one of "recombination, mutation, gene conversion and segregation, possibly also selection", but also relatedness between founders. There are at least two common ways of estimating IBD probabilities - let's focus on identical DNA segments between two chromosomes - these give us information about Pr(genome[i,k] = genome[j,l]). With a set of identical DNA segments between two individuals, we can calculate the kinship coefficient between them as the sum of segment lengths divided by the genome length, as @gtsambos showed us the other day.

One catch here is related to the length of segments that are treated as identical. Some aficionados will say that we care about IBD because that's what the field has always done, and because for some reason we want to focus just on recent relationships we will impose some threshold on segment length, say 1cM. This can create bias because segment lengths have an exponential distribution, so a short segment can also reflect a recent relationship.

The IBD concept was invented as a proxy for IBS because we could not observe DNA back in the day. Now we can. So why shouldn't we take into account single-allele sharing? One reason was that until recently we mostly worked with SNP arrays with a limited number of markers, and the IBD aficionados felt that sharing an allele at one locus gives you no IBD information. If we allow IBD segment length down to a single locus (i.e., we are looking at IBS), then the kinship coefficient between two individuals k[i,j] is the proportion of alleles that these two individuals share; for an individual, k[i,i] would be 1/2 if it is heterozygous at all loci and 1 if it is homozygous at all loci. Note that here we are not centring anything - all "IBD" probabilities are since some time point in the past, potentially all the way back to the root, when IBD and IBS become the same thing.

I think that working with trees can neatly solve this IBD vs IBS debate because it unifies them - namely, in pedigrees, IBD reflected relationship up to some founding populations. With genomics, we can now measure relationships between these founders and I think that at least conceptually it makes sense to go back in time to the root. See also Powell 2010 on IBD vs IBS.

What about the "covariance/model" approach? Let's assume that the additive genetic value of an individual is a linear combination of its genotype x (a row vector for p loci with values 0, 1 and 2) and locus effects alpha (a column vector):

add_gen_value[i] = x alpha

or for a set of individuals (X is n-by-p matrix)

add_gen_value[1:n] = X alpha

What is the variance of this genetic model conditional on observed X?

Var(add_gen_value | X) = Var(X alpha | X)
                       = XVar(alpha)X'
                       = XX'\sigma^2_alpha (assuming Var(alpha)=I\sigma^2_alpha)

where XX' (an n*n matrix for n individuals) is often called the genomic numerator relationship matrix, genomic relationship matrix or genomic kinship matrix (argh...). What's in XX'? Each element is an inner product between the genotypes of two individuals (or of an individual with itself, on the diagonal).

An alternative to the above view on modelling individual values is to model effects of loci with a penalised multiple regression:

pheno_value = 1mean + X alpha + error
error ~ N(0, I\sigma^2_e)
alpha ~ N(0, I\sigma^2_alpha)

Note that X'X/n (p*p matrix for p loci) is related to the linkage-disequilibrium (LD) (covariance) matrix between loci.

But some centre and scale genotypes!? The above Var(add_gen_value | X) shows that we do not need to. The two most common versions of centring and scaling are:

let maf[i] = mean(X[, i])/2 be the allele frequency at locus i in the sample

VanRaden 2008
Z[, i] = X[, i] - 2*maf[i] (we are centring the genotypes here)
ZZ' / (2*sum(maf*(1-maf)))

The centring and scaling (by 2*sum(maf*(1-maf)) = the sum of heterozygosities) logic above was to get a genomic numerator relationship matrix that would be somewhat "similar" to the pedigree numerator relationship matrix - because VanRaden had millions of phenotyped and pedigreed cattle and only ~1000 genotyped bulls, so it made sense to tailor genomic info to pedigree info. Note that centring introduces negative values in ZZ', which makes the IBD probability aficionados scream! But negative values do make sense in the context of Wright's inbreeding coefficient in structured populations, 1-F_IT = (1-F_IS)(1-F_ST). I believe this also gave an estimate of the additive genetic variance \sigma^2_a similar to what we get from the pedigree. But this argument is riddled with problems, because the real parameter in Var(add_gen_value | X) = XX'\sigma^2_alpha is \sigma^2_alpha, not \sigma^2_a (VanRaden assumed that \sigma^2_a ~ \sigma^2_alpha * 2*sum(maf*(1-maf)), which would make sense if loci were not in any LD). There are several Nature Genetics papers on a better genomic relationship matrix, but none mentions this, sic.

Yang 2010
Z[, i] = (X[, i] - 2*maf[i])/k (we are standardizing the genotypes here)
where k can be sd(X[, i]) (pure standardization) or sqrt(2*maf[i]*(1-maf[i])) (assuming Hardy-Weinberg equilibrium)
ZZ'/p

The logic here is that we upscale the importance of rare alleles - trying to get to IBD in a crude way (if we share a rare allele then we must share more unobserved alleles around it). Note that this implies a different prior than Var(alpha)=I\sigma^2_alpha. This Yang version is very sensitive to small allele frequencies (division by a tiny value) and genotype data is often filtered on MAF to avoid this problem. The VanRaden version is stable and the de-facto standard in agriculture (where all this is used on a daily basis!).
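
A minimal numpy sketch of the two constructions above (VanRaden 2008 and Yang et al. 2010), assuming X is an n-by-p matrix of 0/1/2 genotypes with 0 < maf < 1 at every locus; illustrative only, not part of tskit:

import numpy as np

def vanraden_grm(X):
    maf = X.mean(axis=0) / 2                 # allele frequency per locus
    Z = X - 2 * maf                          # centre the genotypes
    return (Z @ Z.T) / (2 * np.sum(maf * (1 - maf)))

def yang_grm(X):
    maf = X.mean(axis=0) / 2
    # standardise with the Hardy-Weinberg scaling; sensitive to rare alleles
    Z = (X - 2 * maf) / np.sqrt(2 * maf * (1 - maf))
    return (Z @ Z.T) / X.shape[1]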

Of note on centring. Remember the phenotype model:

pheno_value[i] = mean + add_gen_value[i] + error[i]
add_gen_value[i] = X[i, ] alpha
pheno_value[i] = mean + X[i, ] alpha + error[i]
               = mean + (Z[i, ] + 2maf) alpha + error[i]
               = (mean + 2maf alpha) + Z[i, ] alpha + error[i]
Var((Z[i, ] + 2maf) alpha | X[i, ]) = Var(Z[i, ] alpha | Z[i, ]) + Var(2maf alpha | 2maf) + 2Cov(Z[i, ] alpha, 2maf alpha | X[i, ])
                                    = (Z[i, ]Z[i, ]' + 2maf 2maf' + 2 Z[i, ] 2maf') \sigma^2_alpha
                                    = X[i, ]X[i, ]' \sigma^2_alpha

So, centring in the context of the above model only pushes the mean of the genotypes into the intercept. There can be numerical reasons to centre covariates in a model (for example, if we run the above model through MCMC). In the context of trees, I think that non-centring means that the root's genotypes are the reference (set to the intercept), and estimates of the alphas would then be allele substitution effects between the root's ancestral alleles and the mutations. Centring on current population allele frequencies pushes the intercept to today's mean. Again, if we used IBD segments we would not do any subtraction (unless we cut off some ancient sharing).

@gregorgorjanc
Member

gregorgorjanc commented Oct 10, 2020

Based on the above "novel", we should indeed be careful how we name @brieuclehmann's covariance calculation! It is not kinship; he is calculating covariance/similarity/identity between chromosomes k and l of individuals i and j, Pr(genome[i,k] = genome[j,l]). Possible names for this are: genomic covariance, gametic covariance (following Schaeffer et al 1989), node covariance, chromosome covariance, just covariance?

Since the method operates on nodes (=chromosomes) we could just call it covariance. Maybe ts.covariance(type="ibs") should give the full-depth (IBS) covariance and ts.covariance(type="ibd", otherargs=xxx) should give @gtsambos's method (summed over all segments of the chromosome), where otherargs=xxx could control how deep we go or how long the segments taken into account must be.

Kinship is when we sum these values into one value that describes the relationship between two diploid individuals (or, more generally, polyploid individuals). Here, ts.kinship(type="ibs") would give the full-depth (IBS) kinship and ts.kinship(type="ibd", otherargs=xxx) the IBD kinship. These two methods would call ts.covariance() and sum over the genome combinations of one individual or of a pair of individuals.

What to do with centring? Maybe we can centre the resulting values instead of the input (genotypes)?

And with "scaling" to 0-1?

@petrelharp
Contributor

It is not kinship, he is calculating covariance/similarity/identity between chromosomes

Hm - is it not kinship just because it is acting on single chromosomes, instead of diploid individuals? If so, how would you define kinship between haploid individuals (say, male bees)?

@petrelharp
Contributor

Another note: I think that in fact we're going to want to deal with the centred covariance matrix, since we don't observe phenotypes at the roots of the trees, and so our model is only defined up to an overall shift.

@gregorgorjanc
Member

It is not kinship, he is calculating covariance/similarity/identity between chromosomes

Hm - is it not kinship just because it is acting on single chromosomes, instead of diploid individuals? If so, how would you define kinship between haploid individuals (say, male bees)?

Yes, following the established literature I believe kinship coefficient is between diploid individuals, but the authors obviously did not have other ploidy cases in mind.

I am happy with either ts.covariance or ts.kinship, but there might be a case for both (to give similarity between chromosomes or between individuals) - and in the case of haploids the two methods are the same?

Or ts.kinship(type="ibs|ibd", nodes=TRUE|FALSE, otherargs=xxx)?

@gregorgorjanc
Member

Another note: I think that in fact we're going to want to deal with the centred covariance matrix, since we don't observe phenotypes at the roots of the trees, and so our model is only defined up to an overall shift.

I see your point.

Should we then add centre argument to ts.kinship with default set to true?

What about scaling to [0, 1]?

@gtsambos
Member

just chiming in to say I've had a busy week, but will try to get my head around this soon!

@gtsambos
Member

Thanks for the useful fleshed out explanations above @gregorgorjanc and @petrelharp, it's clear you've both thought about this a lot over the years! I've copied some of it to refer to later 🤓

I would vote against calling this method kinship simply because it is a different calculation from the 'kinship coefficients' calculated with IBD information (which is what I was showing in my talk), and I'd worry that users would mistake the two. However, they are conceptually very closely related. Say for a given pair of samples, you had the IBD kinships at every time point, ie. IBD kinships as a function of time. Then integrate that function over time. The number you'd get would be this statistic that @brieuclehmann has implemented here! Ie. it's more like an expectation of the kinship rather than the kinship itself. Pretty cool!

@gregorgorjanc
Member

I would vote against calling this method kinship simply because it is a different calculation from the 'kinship coefficients' calculated with IBD information (which is what I was showing in my talk), and I'd worry that users would mistake the two.

Yeah, this area is confusing, but if we start using different terms for very similar things (the difference is only in the type of information used (IBS vs IBD) and whether we centre or not) then we are just adding to the confusion. But I sense that whatever we name these methods it will not solve the confusion ;(

@gtsambos
Member

(the difference is only in the type of information used (IBS vs IBD) and whether we centre or not)

I'm not so sure; to me it seems like the biggest difference is that IBD kinship is defined with respect to particular times, whereas this statistic is a summary of relatedness over all times. By IBS vs IBD, do you mean the site vs branch version of this statistic? I'd be a bit careful about calling the implemented site statistic a measure of IBS, as it will only be so under an infinite-sites assumption.

@gregorgorjanc
Member

gregorgorjanc commented Oct 12, 2020

(the difference is only in the type of information used (IBS vs IBD) and whether we centre or not)

I'm not so sure; to me it seems like the biggest difference is that IBD kinship is defined with respect to particular times, whereas this statistic is a summary of relatedness over all times. By IBS vs IBD, do you mean the site vs branch version of this statistic? I'd be a bit careful about calling the implemented site statistic a measure of IBS, as it will only be so under an infinite-sites assumption.

Yes, IBD kinship can have an imposed limit on the age (or, relatedly, on the length of the IBD segments considered in calculating IBD kinship), but it's up to the user to impose that limit or not. When we do not impose any limit, IBD and IBS kinship should be the same - IBD is just a proxy for IBS after all.

As to site vs branch statistics, I am not sure. I understood that branch statistic will assume neutral infinite sites, while site statistic will reflect deviations from the assumption(s) because we are looking at actual genotypes (hence identical by state). Or is it the other way around?

I love these discussions! I would like to note that we are trying to bring together lots of different concepts here, hence so many comments.

@petrelharp
Contributor

Say for a given pair of samples, you had the IBD kinships at every time point, ie. IBD kinships as a function of time. Then integrate that function over time. The number you'd get would be this statistic that @brieuclehmann has implemented here! Ie. it's more like an expectation of the kinship rather than the kinship itself. Pretty cool!

Whoa, good point! I had not realized this, or seen it anywhere!

@gregorgorjanc
Member

Say for a given pair of samples, you had the IBD kinships at every time point, ie. IBD kinships as a function of time. Then integrate that function over time. The number you'd get would be this statistic that @brieuclehmann has implemented here! Ie. it's more like an expectation of the kinship rather than the kinship itself. Pretty cool!

Whoa, good point! I had not realized this, or seen it anywhere!

Isn't IBD kinship as a function of time a cumulative function - so it's already integrating; at the root, IBD kinship gives the same as @brieuclehmann's IBS kinship? I view IBD kinship as F_IS (from 1-F_IT = (1-F_IS)(1-F_ST)), and when time goes back to the root, F_IS equals F_IT.

@petrelharp
Contributor

petrelharp commented Oct 13, 2020

Isn't IBD kinship as a function of time a cumulative function?

It is cumulative: I think it's "proportion of the genome that has coalesced by this time"; but Georgia points out that what we're calculating has another integral in it - the difference is that IBD kinship has units of genome length (well, proportion of the genome), whereas Brieuc's calculation has units of time * length.

Oh! This is the big difference! With IBD kinship, it doesn't matter if you coalesce earlier or later in the pedigree, as the underlying model doesn't include new mutations and all genetic variation comes from the founders; but for covariance of quantitative traits, coalescing more recently makes for a higher covariance, because genetic variation comes from new mutations.

@petrelharp
Contributor

BTW: another word for all this is "coancestry". Unfortunately it appears to be a synonym for "kinship", so it doesn't help much.

@gregorgorjanc
Member

BTW: another word for all this is "coancestry". Unfortunately it appears to be a synonym for "kinship", so it doesn't help much.

Another word common in plant breeding is coefficient of parentage (as in COP matrix). For example, see https://cropforge.github.io/iciswiki/articles/t/d/m/TDM_COP2.htm

Coefficient of inbreeding is the kinship of a diploid individual with itself, and a synonym for it is coefficient of consanguinity (sometimes this same term is used as a synonym for kinship): https://www.oxfordreference.com/view/10.1093/oi/authority.20110803095621781

@brieuclehmann
Contributor Author

Just a comment following a chat with @gregorgorjanc, I think the current implementation assumes all samples are leaves and I'm not certain that it returns the right answer if we have internal samples. I'll add a test to this effect...

@gregorgorjanc
Member

I think the sample_count_stat version is probably doing the right thing here, so we should just change the naive implementation. We agonised endlessly about this sort of thing when doing the general stats, so I think just following the framework's conventions is probably OK.

Hmm, this is tricky. I think I would agree with this when looking at the first and third trees (where samples 1, 2, 3, 5 are all in the same clade, rooted at 5), but in the second tree sample 3 is in a different clade - cutting the tree at 5 would suggest that 3 and 5 are unrelated, but they are potentially quite related (progeny of 8).

@brieuclehmann
Contributor Author

Sorry, I hadn't spotted that 8 is the mrca in the second tree. sample_count_stat includes the shared branch areas up to 8 in the second tree, but only up to 5 in trees 1 and 3.

@gregorgorjanc
Member

Sorry, I hadn't spotted that 8 is the mrca in the second tree. sample_count_stat includes the shared branch areas up to 8 in the second tree, but only up to 5 in trees 1 and 3.

OK, then the difference between taking into account the oldest ancestors versus the MRCA of each tree (the sample_count_stat way) will be just a scale difference. I vote for the sample_count_stat way since that is the default elsewhere. It might be that this is not a good default, but let's kick that can further down the road.

One argument for taking into account all ancestors would be that the associated variance component from a mixed model (where this relatedness matrix would be used) would not map to a well-defined time point in the past across all trees (as it does if we use pedigrees with the same depth for all individuals), but ancestors and MRCAs are of different ages anyway.

@brieuclehmann
Contributor Author

I’ve adjusted the tests to have mutations (so that I’m testing against the naive genotype covariance rather than branch covariance). There’s still the question of rescaling: it turns out that for polarised = False, we want to divide by 2p (where p is the number of segregating sites), since that’s the total number of alleles. It’s not clear to me if there’s a neat way to do this rescaling within the summary function / sample_count_stat ...

@jeromekelleher
Member

This is a heads-up here @brieuclehmann that you might have some merge problems when you update because of #939

It shouldn't affect you much, but you might need to add a few consts in to make things consistent. I'm happy to do the update if this'll help.

It would be good to get a version of this PR in soon though - the codebase is moving quickly at the moment and PRs get out of date quickly. We can always file some issues to track anything that isn't quite worked out yet.

@jeromekelleher
Member

Is this more-or-less ready to go then, @brieuclehmann? We can open an issue to track the proportion and documentation issues separately.

You have some lint issues to fix here I think.

brieuc-mac added 2 commits November 3, 2020 18:12
Add hack to divide by 2 - polarised in C implementation
@brieuclehmann
Contributor Author

Just a minor tweak to be made to remove polarised from the denominator. Also, that's very much a hack at the moment (see the last commit) but I'm not sure where best to do the division by 2.

Re linting, I think it's because I don't have clang-format installed...

@jeromekelleher
Member

Looks great, thanks @brieuclehmann! Can you open issues to track any remaining things that need to be resolved please?

@petrelharp
Contributor

Two minor things:

  • I think the denominator should always be 2, and should be incorporated into the summary function.
  • I think the C and python methods should be consistently named: one is relatedness, the other genotype_relatedness; I think we decided to do the second thing.

These can be done in follow-up. We also want more testing in the follow-up.

@jeromekelleher
Member

jeromekelleher commented Nov 4, 2020

OK, great, I'm going to merge this and open an issue to track the points from Peter above. Thanks @brieuclehmann, this is a big step forward!
