Skip to content

Commit 22dd1a5

Browse files
gregorgorjancbenjeffery
authored andcommitted
Minor formatting what_is.md, terminology_and_concepts.md, getting_started.md
1 parent a340e52 commit 22dd1a5

File tree

3 files changed

+80
-73
lines changed

3 files changed

+80
-73
lines changed

getting_started.md

Lines changed: 47 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -18,20 +18,21 @@ kernelspec:
1818
# Getting started with {program}`tskit`
1919

2020
You've run some simulations or inference methods, and you now have a
21-
{class}`TreeSequence` object; what now? This tutorial is aimed
21+
{class}`TreeSequence` object; what now? This tutorial is aimed for
2222
users who are new to {program}`tskit` and would like to get some
2323
basic tasks completed. We'll look at five fundamental things you might
24-
need to do:
24+
want to do:
2525
{ref}`process trees<sec_processing_trees>`,
26-
{ref}`sites & mutations<sec_processing_sites_and_mutations>`, and
27-
{ref}`genotypes<sec_processing_genotypes>`,
26+
{ref}`process sites & mutations<sec_processing_sites_and_mutations>`,
27+
{ref}`extract genotypes<sec_processing_genotypes>`,
2828
{ref}`compute statistics<sec_tskit_getting_started_compute_statistics>`, and
2929
{ref}`save or export data<sec_tskit_getting_started_exporting_data>`.
3030
Throughout, we'll also provide pointers to where you can learn more.
3131

32+
3233
:::{note}
3334
The examples in this
34-
tutorial are all written using the {ref}`sec_python_api`, but it's also possible to
35+
tutorial are all written using the {ref}`tskit:sec_python_api`, but it's also possible to
3536
{ref}`use R <sec_tskit_r>`, or access the API in other languages, notably
3637
{ref}`C<sec_c_api>` and [Rust](https://github.com/tskit-dev/tskit-rust).
3738
:::
@@ -43,7 +44,6 @@ twenty diploid individuals. To make it a bit more interesting, we'll simulate th
4344
of a {ref}`selective sweep<msprime:sec_ancestry_models_selective_sweeps>` in the middle
4445
of the chromosome, then throw some neutral mutations onto the resulting tree sequence.
4546

46-
4747
```{code-cell} ipython3
4848
import msprime
4949
@@ -61,7 +61,7 @@ ts = msprime.sim_ancestry(
6161
recombination_rate=1e-8,
6262
random_seed=1234, # only needed for repeatabilty
6363
)
64-
# Optionally add finite-site mutations to the ts using the Jukes & Cantor model, creating SNPs
64+
# Optionally add finite-site mutations to the ts using the default Jukes & Cantor model, creating SNPs
6565
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=4321)
6666
ts
6767
```
@@ -250,13 +250,16 @@ for nmuts, count in enumerate(np.bincount(num_muts)):
250250

251251
(sec_processing_genotypes)=
252252

253-
## Processing genotypes
253+
## Extracting genotypes
254254

255255
At each site, the sample nodes will have a particular allelic state (or be flagged as
256256
{ref}`tskit:sec_data_model_missing_data`). The
257-
{meth}`TreeSequence.variants` method gives access to the
258-
full variation data. For efficiency, the {attr}`~Variant.genotypes`
259-
at a site are returned as a [numpy](https://numpy.org) array of integers:
257+
{meth}`TreeSequence.variants` method gives access to
258+
full variation data - possible allelic states and node genotypes at each site.
259+
Since nodes are by definition haploid, their genotype at a site is their allelic state
260+
at the site. For efficiency, its attribute {attr}`~Variant.genotypes`
261+
at a site returns a [numpy](https://numpy.org) array of integers
262+
representing allelic states:
260263

261264
```{code-cell} ipython3
262265
import numpy as np
@@ -275,20 +278,16 @@ Tree sequences are optimised to look at all samples at one site, then all sample
275278
adjacent site, and so on along the genome. It is much less efficient look at all the
276279
sites for a single sample, then all the sites for the next sample, etc. In other words,
277280
you should generally iterate over sites, not samples. Nevertheless, all the alleles for
278-
a single sample can be obtained via the
279-
{meth}`TreeSequence.haplotypes` method.
281+
a single sample can be obtained via the {meth}`TreeSequence.haplotypes` method.
280282
:::
281283

282-
283284
To find the actual allelic states at a site, you can refer to the
284285
{attr}`~Variant.alleles` provided for each {class}`Variant`:
285286
the genotype value is an index into this list. Here's one way to print them out; for
286287
clarity this example also prints out the IDs of both the sample nodes (i.e. the genomes)
287288
and the diploid {ref}`individuals <sec_nodes_or_individuals>` in which each sample
288289
node resides.
289290

290-
291-
292291
````{code-cell} ipython3
293292
samp_ids = ts.samples()
294293
print(" ID of diploid individual: ", " ".join([f"{ts.node(s).individual:3}" for s in samp_ids]))
@@ -309,7 +308,6 @@ such as indels, leading to allelic states which need not be one of these 4 lette
309308
even be a single letter.
310309
:::
311310

312-
313311
(sec_tskit_getting_started_compute_statistics)=
314312

315313
## Computing statistics
@@ -353,17 +351,17 @@ is the spectrum for a section of the tree sequence between 5 and 5.5Mb, which we
353351
created by deleting the regions outside that interval using
354352
{meth}`TreeSequence.keep_intervals`. Unsurprisingly,
355353
as we noted when looking at the trees, there's a far higher proportion of singletons in
356-
the region of the sweep.
354+
the region of the sweep compared to the entire genome.
357355

358356
(sec_tskit_getting_started_compute_statistics_windowing)=
359357

360358
### Windowing
361359

362360
It is often useful to see how statistics vary in different genomic regions. This is done
363-
by calculating them in {ref}`tskit:sec_stats_windows` along the genome. For this,
361+
by calculating them in {ref}`windows<tskit:sec_stats_windows>` along the genome. For this,
364362
let's look at a single statistic, the genetic {meth}`~TreeSequence.diversity` (π). As a
365-
site statistic this measures the average number of genetic differences between two
366-
randomly chosen samples, whereas as a branch length statistic it measures the average
363+
site statistic, it measures the average number of genetic differences between two
364+
randomly chosen samples, whereas as a branch statistic, it measures the average
367365
branch length between them. We'll plot how the value of π changes using 10kb windows,
368366
plotting the resulting diversity between positions 4 and 6 Mb:
369367

@@ -380,15 +378,15 @@ ax1.set_yscale("log")
380378
ax1.set_ylim(1e-6, 1e-3)
381379
ax2.stairs(ts.diversity(windows=windows, mode="branch"), windows/1_000, baseline=None)
382380
ax2.set_xlabel("Genome position (kb)")
383-
ax2.set_title("Branch-length-based calculation")
381+
ax2.set_title("Branch-based calculation")
384382
ax2.set_xlim(4e3, 6e3)
385383
ax2.set_yscale("log")
386384
plt.show()
387385
```
388386

389387
There's a clear drop in diversity in the region of the selective sweep. And as expected,
390-
the statistic based on branch-lengths gives a less noisy signal.
391-
388+
the statistic based on branch lengths gives a less noisy signal than the statistic based
389+
on randomly occuring mutations at sites.
392390

393391
(sec_tskit_getting_started_exporting_data)=
394392

@@ -434,17 +432,16 @@ print(small_ts.as_nexus(precision=3, include_alignments=False))
434432

435433
### VCF
436434

437-
The standard way of interchanging genetic variation data is the Variant Call Format,
435+
The standard way of interchanging genetic variation data is the Variant Call Format (VCF),
438436
for which tskit has basic support:
439437

440438
```{code-cell} ipython3
441439
import sys
442440
small_ts.write_vcf(sys.stdout)
443441
```
444442

445-
The write_vcf method takes a file object as a parameter; to get it to write out to the
446-
notebook here we ask it to write to stdout.
447-
443+
The {meth}`TreeSequence.write_vcf` method takes a file object as a parameter;
444+
to get it to write out to the notebook here we ask it to write to stdout.
448445

449446
(sec_tskit_getting_started_exporting_data_allel)=
450447

@@ -466,8 +463,8 @@ print(h.n_variants, h.n_haplotypes)
466463
h
467464
```
468465

469-
Sckit.allel has a wide-ranging and efficient suite of tools for working with genotype
470-
data, so should provide anything that's needed. For example, it gives us an
466+
{program}`scikit-allel` has a wide-ranging and efficient suite of tools for working
467+
with genotype data for most of users' needs. For example, it gives us an
471468
another way to compute the pairwise diversity statistic (that we calculated
472469
{ref}`above<sec_tskit_getting_started_compute_statistics_windowing>`
473470
using the native {meth}`TreeSequence.diversity` method):
@@ -477,6 +474,23 @@ ac = h.count_alleles()
477474
allel.mean_pairwise_difference(ac)
478475
```
479476

477+
Comparative call with tskit is:
478+
479+
```{code-cell} ipython3
480+
# Per site (not span-normalised to match scikit-allel)
481+
small_ts.diversity(windows="sites", span_normalise=False)
482+
# Per site (span-normalised to account for variable span of windows defined by sites)
483+
small_ts.diversity(windows="sites")
484+
485+
# Per whole genome
486+
small_ts.diversity(span_normalise=False)
487+
# ... equivalent call to the above
488+
sum(small_ts.diversity(windows="sites", span_normalise=False))
489+
small_ts.diversity()
490+
```
491+
492+
See {ref}`windows<tskit:sec_stats_windows>` and {ref}`windows<tskit:sec_stats_span_normalise>`
493+
for details about the ``span_normalise=False``.
480494

481495
(sec_tskit_getting_started_key_points)=
482496

@@ -496,15 +510,15 @@ in rough order of importance:
496510
* {ref}`sec_terminology_nodes` (i.e. genomes) can belong to
497511
{ref}`individuals<sec_terminology_individuals_and_populations>`. For example,
498512
sampling a diploid individual results in an {class}`Individual` object which
499-
possesses two distinct {ref}`sample nodes<sec_terminology_nodes_samples>`.
513+
links to two distinct {ref}`sample nodes<sec_terminology_nodes_samples>`.
500514
* Key tree sequence methods
501515
* {meth}`~TreeSequence.samples()` returns an array of node IDs specifying the
502516
nodes that are marked as samples
503517
* {meth}`~TreeSequence.node` returns the node object for a given integer node ID
504518
* {meth}`~TreeSequence.trees` iterates over all the trees
505519
* {meth}`~TreeSequence.sites` iterates over all the sites
506-
* {meth}`~TreeSequence.variants` iterates over all the sites with their genotypes
507-
and alleles
520+
* {meth}`~TreeSequence.variants` iterates over all the sites with their list of
521+
unique alleles and sample/node? genotypes
508522
* {meth}`~TreeSequence.simplify()` reduces the number of sample nodes in the tree
509523
sequence to a specified subset
510524
* {meth}`~TreeSequence.keep_intervals()` (or its complement,
@@ -519,4 +533,3 @@ in rough order of importance:
519533
sequence, for example {meth}`~TreeSequence.allele_frequency_spectrum`,
520534
{meth}`~TreeSequence.diversity`, and {meth}`~TreeSequence.Fst`; these can
521535
also be calculated in windows along the genome.
522-

0 commit comments

Comments
 (0)