Writing fasta files from tree sequences #350
Great stuff, thanks @marianne-aspbury. How about we go through this in person tomorrow?
Yep, sounds good, thanks.
Codecov Report
@@ Coverage Diff @@
## master #350 +/- ##
==========================================
+ Coverage 86.48% 87.59% +1.11%
==========================================
Files 20 19 -1
Lines 14024 10275 -3749
Branches 2740 1890 -850
==========================================
- Hits 12128 9000 -3128
+ Misses 977 756 -221
+ Partials 919 519 -400
Looks great, thanks @marianne-aspbury. We're basically there I think.
It's probably worth looking at the CLI here as well. We basically want to duplicate the VCF option in tskit/cli.py and tests/test_cli.py; give me a shout if you have questions.
Added functionality for printing FASTA files on the command line, pretty much copied directly from the VCF CLI code and all its tests.
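For readers following along, here is a rough sketch of what a `fasta` subcommand could look like with `argparse`. It is not the code from this PR or from tskit/cli.py: the `build_parser` helper, the positional `tree_sequence` argument, and the `--wrap` option are all hypothetical names chosen for illustration.

```python
import argparse


def build_parser():
    # Hypothetical parser layout; the real tskit CLI may differ.
    parser = argparse.ArgumentParser(prog="tskit")
    subparsers = parser.add_subparsers(dest="subcommand")
    subparsers.required = True

    # Assumed subcommand name and options, modelled loosely on the idea
    # of mirroring an existing "vcf" subcommand.
    fasta = subparsers.add_parser(
        "fasta", help="Write the haplotypes of a tree sequence as FASTA"
    )
    fasta.add_argument("tree_sequence", help="Path to the input tree sequence file")
    fasta.add_argument(
        "--wrap", "-w", type=int, default=60,
        help="Line width for sequence data (0 disables wrapping)",
    )
    return parser


# Parse an example command line without touching any files.
args = build_parser().parse_args(["fasta", "example.trees", "--wrap", "80"])
print(args.subcommand, args.tree_sequence, args.wrap)
```

Keeping the parser construction in its own function makes it easy to test argument handling separately from the output code.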
Sorry - I've not been following this. Note my reservations in #326 (comment) about making sure that FASTA files (and haplotypes in general) output non-variable sites, at a minimum with a dot, but possibly filling in via a reference genome.
I don't think we're ready to deal with invariant sites yet, @hyanwong - to do that we need to distinguish tree sequences with integer positions and a reference sequence from the current infinite-sites notions. I'm in favor of doing this, but it shouldn't get in the way of this effort.
Agreed. Let's consider what we can do as a stopgap for finite sites as part of #326 (without changing the default behaviour, so that we don't break people's code). We can then decide what the final semantics of the fasta function should be in #353 (tagged as a 0.2.3 release issue so we don't forget about it). Let's get this PR finished up and merged first though.
We've constructed tree sequences with a site at each integer position, but many sites without mutations. Isn't that the same as an invariant site? But I guess this might be thought of as a bit of a hack.
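To make the invariant-site question concrete, here is a minimal sketch of the two options raised above: padding non-variable positions with a dot, or filling them from a reference sequence. It assumes integer site positions and one single-character allele per site. The `expand_haplotype` function and its parameters are hypothetical and are not part of tskit.

```python
def expand_haplotype(length, site_positions, alleles, reference=None, missing="."):
    """Return a full-length sequence for one sample.

    Positions without a site are taken from ``reference`` when it is given,
    otherwise they are filled with the ``missing`` character.
    """
    if reference is not None and len(reference) != length:
        raise ValueError("reference must have the same length as the sequence")
    sequence = list(reference) if reference is not None else [missing] * length
    # Overwrite the variable sites with this sample's alleles.
    for position, allele in zip(site_positions, alleles):
        sequence[position] = allele
    return "".join(sequence)


# Two variable sites at positions 1 and 5 in a sequence of length 8.
print(expand_haplotype(8, [1, 5], "TG"))
print(expand_haplotype(8, [1, 5], "TG", reference="AAAAAAAA"))
```

With this framing, the "site at each integer position" approach corresponds to passing every position in `site_positions`, so no padding is needed.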
Looks great, thanks @marianne-aspbury! Some very minor comments above --- we're ready to merge after these are addressed and the branch is rebased to bring it up to date.
Thanks @jeromekelleher, I've fixed these now.
I've written some code to cover integrating FASTA output from tree sequences into tskit, following on from issue #338, and have included various tests for it.
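As a rough illustration of the kind of output being discussed (not the implementation in this PR), the sketch below writes one FASTA record per haplotype string, wrapping sequence lines at a fixed width. The `write_fasta_records` name, the `tsk_` identifier prefix, and the default wrap width of 60 are assumptions made for the example; the haplotype strings would come from a tree sequence in practice.

```python
import io


def write_fasta_records(haplotypes, output, wrap_width=60, id_prefix="tsk_"):
    """Write each haplotype string as a FASTA record to a text file object."""
    for index, sequence in enumerate(haplotypes):
        # Header line: ">" followed by a per-sample identifier.
        output.write(f">{id_prefix}{index}\n")
        if wrap_width > 0:
            # Break the sequence into fixed-width lines.
            for start in range(0, len(sequence), wrap_width):
                output.write(sequence[start:start + wrap_width] + "\n")
        else:
            # A width of zero means no wrapping.
            output.write(sequence + "\n")


buffer = io.StringIO()
write_fasta_records(["ACGTACGT", "ACGAACGT"], buffer, wrap_width=4)
print(buffer.getvalue(), end="")
```

Writing to a file object rather than a path keeps the function usable from both library code and a command line entry point that prints to standard output.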
Happy to hear feedback on this (e.g. @jeromekelleher, @hyanwong).