Commit 8fcda03

Merge pull request #19 from psteinb/callout-to-zipf
add callout detailing what Zipf's law is
2 parents 59b786a + 7e8b0b8 commit 8fcda03

1 file changed

+35
-5
lines changed

_episodes/11-snakemake-intro.md

Lines changed: 35 additions & 5 deletions
@@ -134,7 +134,7 @@ $ python plotcount.py isles.dat show
134134

135135
Close the window to exit the plot.
136136

137-
`plotcount.py` can also create the plot as an image file (e.g. a PNG file):
137+
`plotcount.py` can also create the plot as an image file (e.g. a PNG file):
138138

139139
```bash
140140
$ python plotcount.py isles.dat isles.png
@@ -154,6 +154,30 @@ isles 3822 2460 1.55
154154
```
155155
{: .output}
156156

157+
> ## Zipf's Law
158+
>
159+
> Zipf's Law is an [empirical law](https://en.wikipedia.org/wiki/Empirical_law) formulated
160+
> using [mathematical statistics](https://en.wikipedia.org/wiki/Mathematical_statistics)
161+
> that refers to the fact that many types of data studied in the physical and
162+
> social sciences can be approximated with a Zipfian distribution, one of a family
163+
> of related discrete [power law](https://en.wikipedia.org/wiki/Power_law) [probability distributions](https://en.wikipedia.org/wiki/Probability_distribution).
164+
>
165+
> Zipf's law was originally formulated in terms of [quantitative linguistics](https://en.wikipedia.org/wiki/Quantitative_linguistics),
166+
> stating that given some [corpus](https://en.wikipedia.org/wiki/Text_corpus)
167+
> of [natural language](https://en.wikipedia.org/wiki/Natural_language) utterances,
168+
> the frequency of any word is [inversely proportional](https://en.wikipedia.org/wiki/Inversely_proportional)
169+
> to its rank in the [frequency table](https://en.wikipedia.org/wiki/Frequency_table).
170+
> For example, in the [Brown Corpus](https://en.wikipedia.org/wiki/Brown_Corpus)
171+
> of American English text, the word "the" is the most frequently occurring word,
172+
> and by itself accounts for nearly 7% of all word occurrences (69,971 out of
173+
> slightly over 1 million). True to Zipf's Law, the second-place word "of"
174+
> accounts for slightly over 3.5% of words (36,411 occurrences), followed by
175+
> "and" (28,852). Only 135 vocabulary items are needed to account for half
176+
> the [Brown Corpus](https://en.wikipedia.org/wiki/Brown_Corpus).
177+
>
178+
> Source: [Wikipedia](https://en.wikipedia.org/wiki/Zipf%27s_law)
179+
{: .callout}
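
The inverse-proportionality claim in the callout can be checked with a little arithmetic. The following is a minimal sketch, not part of the lesson's scripts: it uses only the three Brown Corpus counts quoted above, and the names `zipf_predicted`, `top_two_ratio`, and `brown_counts` are hypothetical helpers introduced here for illustration. It assumes the idealised form of the law, in which the count at rank `r` equals the top count divided by `r`.

```python
def zipf_predicted(top_count, rank):
    """Count expected at `rank` if frequency were exactly proportional to 1/rank."""
    return top_count / rank


def top_two_ratio(counts):
    """Ratio of the largest count to the second largest; 2.0 under ideal Zipf."""
    first, second = sorted(counts, reverse=True)[:2]
    return first / second


# Counts quoted in the callout above (Brown Corpus); illustrative only.
brown_counts = {"the": 69971, "of": 36411, "and": 28852}

# Rank words from most to least frequent and compare with the ideal 1/rank curve.
ranked = sorted(brown_counts.items(), key=lambda item: item[1], reverse=True)
top_count = ranked[0][1]
for rank, (word, observed) in enumerate(ranked, start=1):
    predicted = zipf_predicted(top_count, rank)
    print(f"{rank} {word:4s} observed={observed} predicted={predicted:.0f}")

print(f"top-two ratio: {top_two_ratio(brown_counts.values()):.2f}")
```

For the Brown Corpus figures the top-two ratio comes out close to, but not exactly, 2. The ratio column in the `isles` output above (3822 / 2460, about 1.55) appears to be the same kind of quantity, which suggests that real texts follow the law only approximately.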
180+
157181
Together these scripts implement a common workflow:
158182

159183
1. Read a data file.
@@ -278,13 +302,19 @@ There are several reasons this tool was chosen:
278302

279303
* It’s free, open-source, and installs in about 5 seconds flat via `pip`.
280304

281-
* Snakemake works cross-platform (Windows, macOS, Linux) and is compatible with all HPC schedulers. More importantly, the same workflow will work and scale appropriately regardless of whether it’s on a laptop or cluster without modification.
305+
* Snakemake works cross-platform (Windows, macOS, Linux) and is compatible with all HPC
306+
schedulers. More importantly, the same workflow will work and scale appropriately
307+
regardless of whether it’s on a laptop or cluster without modification.
282308

283-
* Snakemake uses pure Python syntax. There is no tool-specific language to learn like in GNU Make, Nextflow, WDL, etc. Even if students end up not liking Snakemake, you’ve still taught them how to program in Python at the end of the day.
309+
* Snakemake uses pure Python syntax. There is no tool-specific language to learn like
310+
in GNU Make, Nextflow, WDL, etc. Even if students end up not liking Snakemake, you’ve
311+
still taught them how to program in Python at the end of the day.
284312

285-
* Anything that you can do in Python, you can do with Snakemake (since you can pretty much execute arbitrary Python code anywhere).
313+
* Anything that you can do in Python, you can do with Snakemake (since you can pretty
314+
much execute arbitrary Python code anywhere).
286315

287-
* Snakemake was written to be as similar to GNU Make as possible. Users already familiar with Make will find Snakemake quite easy to use.
316+
* Snakemake was written to be as similar to GNU Make as possible. Users already familiar
317+
with Make will find Snakemake quite easy to use.
288318

289319
* It’s easy. You can (hopefully!) learn Snakemake in an afternoon!
290320
