@@ -74,9 +74,9 @@ def create_notebook_data():

A *succinct tree sequence*, or "tree sequence" for short, represents the ancestral
relationships between a set of DNA sequences. Tree sequences are based on fundamental
- biological principles of inheritance, DNA duplication, and recombination; they can be
- created by [evolutionary simulation](https://tskit.dev/software/#simulate) or by
- [inferring genealogies from empirical DNA data](https://tskit.dev/software/#infer).
+ biological principles of inheritance, DNA duplication, mutation, and recombination;
+ they can be created by [evolutionary simulation](https://tskit.dev/software/#simulate)
+ or by [inferring genealogies from empirical DNA data](https://tskit.dev/software/#infer).

:::{margin} Key point
Tree sequences are used to encode and analyse large genetic datasets
@@ -85,8 +85,9 @@ Tree sequences are used to encode and analyse large genetic datasets
Tree sequences provide an efficient way of storing
[genetic variation](https://en.wikipedia.org/wiki/Genetic_variation) data, and can
power analyses of millions of whole [genomes](https://en.wikipedia.org/wiki/Genome).
- Plots (a) and (b) summarize results presented
- [further](plot_storing_everyone) [down](plot_incremental_calculation) this tutorial.
+ Plots (a) and (b) below summarize these aspects
+ (see additional details on [storage](plot_storing_everyone) and
+ [compute](plot_incremental_calculation) further down).

``` {code-cell} ipython3
:"tags": ["remove-input"]
@@ -141,8 +142,8 @@ plt.show()
As the name suggests, the simplest way to think about a tree sequence is that it
describes a sequence of correlated "local trees" --- i.e. genetic trees located at
different points along a [chromosome](https://en.wikipedia.org/wiki/Chromosome).
- Here's a tiny example based on ten genomes, $\mathrm{a}$ to $\mathrm{j}$, spanning
- a short 1000 letter chromosome.
+ Here's a tiny example based on ten haploid genomes, $\mathrm{a}$ to $\mathrm{j}$,
+ spanning a short 1000 letter chromosome.

``` {code-cell} ipython3
:"tags": ["hide-input"]
@@ -173,11 +174,18 @@ the nodes are referred to by {ref}`numerical ID<sec_terminology_nodes>`.
::::

The tickmarks on the X axis and background shading indicate the genomic positions covered
- by the trees. For the first short portion of the chromosome, from the
- start until position 189, the relationships between the ten genomes are shown by
- the first tree. The second tree shows the relationships between positions 189 and 546,
- and the third from position 546 to the end. We can say that the first tree spans 189
- base pairs, the second 357, and the third 454.
+ by the trees. The tickmarks indicate recombination events that explain relationships
+ between the ten genomes. There were two such recombination events, giving us three local trees.
+ For the first short portion of the chromosome, from the start until position 189,
+ the relationships between the ten genomes are shown by the first tree.
+ The second tree shows the relationships between positions 189 and 546.
+ By inspecting the first and the second local tree we can see that genomes $\mathrm{b}-\mathrm{f}$
+ changed their "most recent common ancestor" (MRCA) with genome $\mathrm{a}$ to
+ MRCA with genome $\mathrm{g}$.
+ The third tree shows the relationships between positions 546 and 1000 (the end).
+ By inspecting the second and the third local tree we can see that
+ recombination changed the ancestry of genomes $\mathrm{b}-\mathrm{f}$
+ back to a shared MRCA with genome $\mathrm{a}$.

(sec_what_is_genealogical_network)=

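The breakpoints quoted in this hunk (189 and 546 on a 1000-letter chromosome) determine how much of the chromosome each local tree covers, which is the arithmetic the removed lines spelled out. A minimal sketch in plain Python follows; it is not tskit's API, and the helper name `tree_spans` is hypothetical.

```python
def tree_spans(breakpoints):
    """Return the length of each interval between consecutive breakpoints."""
    return [right - left for left, right in zip(breakpoints, breakpoints[1:])]


# Positions taken from the tutorial text: chromosome start, the two
# recombination breakpoints, and the chromosome end.
breakpoints = [0, 189, 546, 1000]
spans = tree_spans(breakpoints)
print(spans)  # [189, 357, 454]
```

The spans sum to the chromosome length, since the local trees tile the sequence without gaps. In tskit itself, the trees yielded by `ts.trees()` carry this information as `tree.interval` and `tree.span`.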
@@ -187,8 +195,8 @@ In fact, succinct tree sequences don't store each tree separately, but instead a
based on an interconnected *genetic genealogy*, in which
[genetic recombination](https://en.wikipedia.org/wiki/Genetic_recombination) has led
to different regions of the chromosome having different histories. Another way of
- thinking about the tree sequence above is that it describes the full genetic
- *family "tree"* (strictly, "network") of our 10 genomes.
+ thinking about the tree sequence above is that it describes the full genetic ancestry
+ of our 10 genomes.

(sec_what_is_dna_data)=

@@ -355,10 +363,10 @@ tree sequence and the underlying biological processes that produced the genetic
sequences in the first place, such as those pictured in the demography above. For
example, each branch point (or "internal node") in one of our trees can be
imagined as a genome which existed at a specific time in the past, and
- which is a "most recent common ancestor" (MRCA) of the descendant genomes at that
- position on the chromosome. We can mark these extra "ancestral genomes" on our tree
- diagrams, distinguishing them from the *sampled* genomes ($\mathrm{a}$ to $\mathrm{j}$)
- by using circular symbols. We can even colour the nodes by the population that we know
+ which is a MRCA of the descendant genomes at that position on the chromosome.
+ We can mark these extra "ancestral genomes" on our tree diagrams with circular symbols,
+ distinguishing them from the *sampled* genomes ($\mathrm{a}$ to $\mathrm{j}$)
+ marked with square symbols. We can even colour the nodes by the population that we know
(or infer) them to belong to at the time:

``` {code-cell} ipython3
@@ -425,7 +433,7 @@ Most genetic calculations involve iterating over trees, which is highly efficien

For example, statistical measures of genetic variation can be thought of as a calculation
combining the local trees with the mutations on each branch (or, often preferably, the
- length of the branches: see [this summary](https://academic.oup.com/genetics/article/215/3/779/5930459)).
+ length of the branches: see [this summary](https://doi.org/10.1534/genetics.120.303253)).
Because a tree sequence is built on a set of small branch changes along the chromosome,
statistical calculations can often be updated incrementally as we
move along the genome, without having to perform the calculation *de novo* on each tree.
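The incremental idea in the context lines above can be sketched in plain Python. This is an illustration only, not tskit's algorithm or API: the function `incremental_totals` and every branch length below are hypothetical. The sketch keeps a running total branch length and adjusts it by the branches that leave or enter at each tree transition, instead of summing every branch of every tree.

```python
def incremental_totals(first_total, edge_diffs):
    """Yield the total branch length of each local tree in turn.

    first_total: total branch length of the first tree.
    edge_diffs: one (removed, added) pair per transition between adjacent
        trees; each element is a list of the branch lengths that leave or
        enter the tree at that transition.
    """
    total = first_total
    yield total
    for removed, added in edge_diffs:
        # Only the changed branches are touched; the rest of the tree is reused.
        total += sum(added) - sum(removed)
        yield total


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
first_total = 25.0
edge_diffs = [
    ([2.0, 3.0], [1.5, 4.0]),  # transition from tree 1 to tree 2
    ([1.5], [2.5]),            # transition from tree 2 to tree 3
]
totals = list(incremental_totals(first_total, edge_diffs))
print(totals)  # [25.0, 25.5, 26.5]
```

The work per transition is proportional to the number of changed branches rather than to the size of the tree. For real data, tskit reports the per-tree edge changes through `TreeSequence.edge_diffs()`.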