@@ -18,20 +18,21 @@ kernelspec:
18
18
# Getting started with {program}` tskit `
19
19
20
20
You've run some simulations or inference methods, and you now have a
21
- {class}` TreeSequence ` object; what now? This tutorial is aimed
21
+ {class}` TreeSequence ` object; what now? This tutorial is aimed for
22
22
users who are new to {program}` tskit ` and would like to get some
23
23
basic tasks completed. We'll look at five fundamental things you might
24
- need to do:
24
+ want to do:
25
25
{ref}` process trees<sec_processing_trees> ` ,
26
- {ref}` sites & mutations<sec_processing_sites_and_mutations> ` , and
27
- {ref}` genotypes<sec_processing_genotypes> ` ,
26
+ {ref}` process sites & mutations<sec_processing_sites_and_mutations>` ,
27
+ {ref}` extract genotypes<sec_processing_genotypes>` ,
28
28
{ref}` compute statistics<sec_tskit_getting_started_compute_statistics> ` , and
29
29
{ref}` save or export data<sec_tskit_getting_started_exporting_data> ` .
30
30
Throughout, we'll also provide pointers to where you can learn more.
31
31
32
+
32
33
:::{note}
33
34
The examples in this
34
- tutorial are all written using the {ref}` sec_python_api ` , but it's also possible to
35
+ tutorial are all written using the {ref}` tskit: sec_python_api` , but it's also possible to
35
36
{ref}` use R <sec_tskit_r> ` , or access the API in other languages, notably
36
37
{ref}` C<sec_c_api> ` and [ Rust] ( https://github.com/tskit-dev/tskit-rust ) .
37
38
:::
@@ -43,7 +44,6 @@ twenty diploid individuals. To make it a bit more interesting, we'll simulate th
43
44
of a {ref}` selective sweep<msprime:sec_ancestry_models_selective_sweeps> ` in the middle
44
45
of the chromosome, then throw some neutral mutations onto the resulting tree sequence.
45
46
46
-
47
47
``` {code-cell} ipython3
48
48
import msprime
49
49
@@ -61,7 +61,7 @@ ts = msprime.sim_ancestry(
61
61
recombination_rate=1e-8,
62
62
random_seed=1234, # only needed for repeatabilty
63
63
)
64
- # Optionally add finite-site mutations to the ts using the Jukes & Cantor model, creating SNPs
64
+ # Optionally add finite-site mutations to the ts using the default Jukes & Cantor model, creating SNPs
65
65
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=4321)
66
66
ts
67
67
```
@@ -250,13 +250,16 @@ for nmuts, count in enumerate(np.bincount(num_muts)):
250
250
251
251
(sec_processing_genotypes)=
252
252
253
- ## Processing genotypes
253
+ ## Extracting genotypes
254
254
255
255
At each site, the sample nodes will have a particular allelic state (or be flagged as
256
256
{ref}` tskit:sec_data_model_missing_data ` ). The
257
- {meth}` TreeSequence.variants ` method gives access to the
258
- full variation data. For efficiency, the {attr}` ~Variant.genotypes `
259
- at a site are returned as a [ numpy] ( https://numpy.org ) array of integers:
257
+ {meth}` TreeSequence.variants ` method gives access to
258
+ full variation data - possible allelic states and node genotypes at each site.
259
+ Since nodes are by definition haploid, their genotype at a site is their allelic state
260
+ at the site. For efficiency, its attribute {attr}` ~Variant.genotypes `
261
+ at a site returns a [ numpy] ( https://numpy.org ) array of integers
262
+ representing allelic states:
260
263
261
264
``` {code-cell} ipython3
262
265
import numpy as np
@@ -275,20 +278,16 @@ Tree sequences are optimised to look at all samples at one site, then all sample
275
278
adjacent site, and so on along the genome. It is much less efficient look at all the
276
279
sites for a single sample, then all the sites for the next sample, etc. In other words,
277
280
you should generally iterate over sites, not samples. Nevertheless, all the alleles for
278
- a single sample can be obtained via the
279
- {meth}` TreeSequence.haplotypes ` method.
281
+ a single sample can be obtained via the {meth}` TreeSequence.haplotypes ` method.
280
282
:::
281
283
282
-
283
284
To find the actual allelic states at a site, you can refer to the
284
285
{attr}` ~Variant.alleles ` provided for each {class}` Variant ` :
285
286
the genotype value is an index into this list. Here's one way to print them out; for
286
287
clarity this example also prints out the IDs of both the sample nodes (i.e. the genomes)
287
288
and the diploid {ref}` individuals <sec_nodes_or_individuals> ` in which each sample
288
289
node resides.
289
290
290
-
291
-
292
291
```` {code-cell} ipython3
293
292
samp_ids = ts.samples()
294
293
print(" ID of diploid individual: ", " ".join([f"{ts.node(s).individual:3}" for s in samp_ids]))
@@ -309,7 +308,6 @@ such as indels, leading to allelic states which need not be one of these 4 lette
309
308
even be a single letter.
310
309
:::
311
310
312
-
313
311
(sec_tskit_getting_started_compute_statistics)=
314
312
315
313
## Computing statistics
@@ -353,17 +351,17 @@ is the spectrum for a section of the tree sequence between 5 and 5.5Mb, which we
353
351
created by deleting the regions outside that interval using
354
352
{meth}` TreeSequence.keep_intervals ` . Unsurprisingly,
355
353
as we noted when looking at the trees, there's a far higher proportion of singletons in
356
- the region of the sweep.
354
+ the region of the sweep compared to the entire genome .
357
355
358
356
(sec_tskit_getting_started_compute_statistics_windowing)=
359
357
360
358
### Windowing
361
359
362
360
It is often useful to see how statistics vary in different genomic regions. This is done
363
- by calculating them in {ref}` tskit:sec_stats_windows ` along the genome. For this,
361
+ by calculating them in {ref}` windows< tskit:sec_stats_windows> ` along the genome. For this,
364
362
let's look at a single statistic, the genetic {meth}` ~TreeSequence.diversity ` (π). As a
365
- site statistic this measures the average number of genetic differences between two
366
- randomly chosen samples, whereas as a branch length statistic it measures the average
363
+ site statistic, it measures the average number of genetic differences between two
364
+ randomly chosen samples, whereas as a branch statistic, it measures the average
367
365
branch length between them. We'll plot how the value of π changes using 10kb windows,
368
366
plotting the resulting diversity between positions 4 and 6 Mb:
369
367
@@ -380,15 +378,15 @@ ax1.set_yscale("log")
380
378
ax1.set_ylim(1e-6, 1e-3)
381
379
ax2.stairs(ts.diversity(windows=windows, mode="branch"), windows/1_000, baseline=None)
382
380
ax2.set_xlabel("Genome position (kb)")
383
- ax2.set_title("Branch-length- based calculation")
381
+ ax2.set_title("Branch-based calculation")
384
382
ax2.set_xlim(4e3, 6e3)
385
383
ax2.set_yscale("log")
386
384
plt.show()
387
385
```
388
386
389
387
There's a clear drop in diversity in the region of the selective sweep. And as expected,
390
- the statistic based on branch- lengths gives a less noisy signal.
391
-
388
+ the statistic based on branch lengths gives a less noisy signal than the statistic based
389
+ on randomly occuring mutations at sites.
392
390
393
391
(sec_tskit_getting_started_exporting_data)=
394
392
@@ -434,17 +432,16 @@ print(small_ts.as_nexus(precision=3, include_alignments=False))
434
432
435
433
### VCF
436
434
437
- The standard way of interchanging genetic variation data is the Variant Call Format,
435
+ The standard way of interchanging genetic variation data is the Variant Call Format (VCF) ,
438
436
for which tskit has basic support:
439
437
440
438
``` {code-cell} ipython3
441
439
import sys
442
440
small_ts.write_vcf(sys.stdout)
443
441
```
444
442
445
- The write_vcf method takes a file object as a parameter; to get it to write out to the
446
- notebook here we ask it to write to stdout.
447
-
443
+ The {meth}` TreeSequence.write_vcf ` method takes a file object as a parameter;
444
+ to get it to write out to the notebook here we ask it to write to stdout.
448
445
449
446
(sec_tskit_getting_started_exporting_data_allel)=
450
447
@@ -466,8 +463,8 @@ print(h.n_variants, h.n_haplotypes)
466
463
h
467
464
```
468
465
469
- Sckit. allel has a wide-ranging and efficient suite of tools for working with genotype
470
- data, so should provide anything that's needed . For example, it gives us an
466
+ {program} ` scikit- allel` has a wide-ranging and efficient suite of tools for working
467
+ with genotype data for most of users' needs . For example, it gives us an
471
468
another way to compute the pairwise diversity statistic (that we calculated
472
469
{ref}` above<sec_tskit_getting_started_compute_statistics_windowing> `
473
470
using the native {meth}` TreeSequence.diversity ` method):
@@ -477,6 +474,23 @@ ac = h.count_alleles()
477
474
allel.mean_pairwise_difference(ac)
478
475
```
479
476
477
+ Comparative call with tskit is:
478
+
479
+ ``` {code-cell} ipython3
480
+ # Per site (not span-normalised to match scikit-allel)
481
+ small_ts.diversity(windows="sites", span_normalise=False)
482
+ # Per site (span-normalised to account for variable span of windows defined by sites)
483
+ small_ts.diversity(windows="sites")
484
+
485
+ # Per whole genome
486
+ small_ts.diversity(span_normalise=False)
487
+ # ... equivalent call to the above
488
+ sum(small_ts.diversity(windows="sites", span_normalise=False))
489
+ small_ts.diversity()
490
+ ```
491
+
492
+ See {ref}` windows<tskit:sec_stats_windows> ` and {ref}` windows<tskit:sec_stats_span_normalise> `
493
+ for details about the `` span_normalise=False `` .
480
494
481
495
(sec_tskit_getting_started_key_points)=
482
496
@@ -496,15 +510,15 @@ in rough order of importance:
496
510
* {ref}` sec_terminology_nodes ` (i.e. genomes) can belong to
497
511
{ref}` individuals<sec_terminology_individuals_and_populations> ` . For example,
498
512
sampling a diploid individual results in an {class}` Individual ` object which
499
- possesses two distinct {ref}` sample nodes<sec_terminology_nodes_samples> ` .
513
+ links to two distinct {ref}` sample nodes<sec_terminology_nodes_samples> ` .
500
514
* Key tree sequence methods
501
515
* {meth}` ~TreeSequence.samples() ` returns an array of node IDs specifying the
502
516
nodes that are marked as samples
503
517
* {meth}` ~TreeSequence.node ` returns the node object for a given integer node ID
504
518
* {meth}` ~TreeSequence.trees ` iterates over all the trees
505
519
* {meth}` ~TreeSequence.sites ` iterates over all the sites
506
- * {meth}` ~TreeSequence.variants ` iterates over all the sites with their genotypes
507
- and alleles
520
+ * {meth}` ~TreeSequence.variants ` iterates over all the sites with their list of
521
+ unique alleles and sample/node? genotypes
508
522
* {meth}` ~TreeSequence.simplify() ` reduces the number of sample nodes in the tree
509
523
sequence to a specified subset
510
524
* {meth}` ~TreeSequence.keep_intervals() ` (or its complement,
@@ -519,4 +533,3 @@ in rough order of importance:
519
533
sequence, for example {meth}` ~TreeSequence.allele_frequency_spectrum ` ,
520
534
{meth}` ~TreeSequence.diversity ` , and {meth}` ~TreeSequence.Fst ` ; these can
521
535
also be calculated in windows along the genome.
522
-
0 commit comments