Making a phylogeny with your new, beautiful SNP matrix

The way STACKS sets up the phylip files with populations actually makes it difficult to get a full SNP matrix. So we are going to use a custom script to convert vcf format to phylip for our raxml analyses.

curl -LO https://raw.githubusercontent.com/edgardomortiz/vcf2phylip/master/vcf2phylip.py
python vcf2phylip.py -i ./denovo/populations.snps.vcf

We will need another script to remove invariant sites from the phylip file, as RAxML cannot interpret them.

curl -LO https://raw.githubusercontent.com/btmartin721/raxml_ascbias/master/ascbias.py
python ascbias.py -p ./denovo/populations.snps.min4.phy

Great! Now we’re all set to estimate some phylogenies! We’re going to use raxml-ng

EXERCISE: What is the relative path to raxml-ng? You’ll need that in order to run the next command.

/path/to/raxml-ng --all --msa out.phy --model GTR+ASC_LEWIS --tree pars{10} --bs-trees 100 # takes about 3min
/path/to/raxml-ng --msa out.phy --model GTR+ASC_LEWIS # for just one ML tree
# might need to do chmod +x for raxml if a permission error appears
# might also need to provide the full path to raxml in order for it to run
# --all means do a full analysis: estimate best ML tree, then do bootstraps, then use bootstraps to plot support on best tree
# --msa means multiple-sequence alignment
# --model indicates evolutionary model, here we use General Time Reversible plus the Lewis ascertainment bias correction (takes into account the fact that we are only using variable sites)
# --bs-trees indicates the number of bootstraps you'd like to do

Now open out.phy.raxml.support in FigTree to see the bootstrapped tree! Root at Ahah (Ameerega hahneli)

Question: What do you think is behind the huge variation in branch lengths? How might you deal with this?

There are certainly other methods out there to compute trees using SNPs (and others within RAxML too!). One popular alternative is SVDquartets, which samples SNPs and infers branching patterns from subsets of four taxa at a time. This method is highly scalable but only infers branching patterns, not branch lengths. Here’s a recent SVDquartets tutorial.




Appendix

If you’re unable to run raxml, you can find the output files here.
A full run of stacks with these data (approx 1.5GB) can be downloaded here.