Step 1. Demultiplex by barcode in STACKS
In this workflow, we will use STACKS for the assembly.
Demultiplexing your sequencing pools is always the first step in any pipeline, before you even begin the genotyping or assembly process. In STACKS, the files you need for demultiplexing are:
- barcodes + sample names (tab-delimited .txt file)
- process_radtags (shell program from STACKS)
EXERCISE: Let’s build the barcodes file for demultiplexing. The first column will be the unique adapter sequence (column two in this text file). The second column is the individual sample names (first column in the referenced file). Save the file in the epi folder with the name epi_barcodes_final.txt.
Again, that was:
unique adaptor (in R1) | sample name
There are MANY ways to build this file…. how do you want to do it?
Side-note 1
Whenever editing text files, NEVER use what you exported from Excel or Word directly. Always check the file in a simple text editor (TextWrangler, BBEdit, etc.) with “view invisible characters” turned on, to avoid unnecessary headaches from hidden characters or extra spaces/tabs. Those are the biggest waste of time in anything computing…
Side-note 2
grep (Global regular expression print) is one of the most amazing things about text editors. Try this on the epi_barcodes.txt file once downloaded. Cmd+F, make sure 'grep' is checked, then search for
(.*)\t(.*)
and replace with \2\t\1
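If you prefer the command line, here is a minimal shell sketch of the same column swap, assuming the downloaded epi_barcodes.txt is tab-delimited with the sample name in column 1 and the barcode in column 2 (adjust the file name and columns to your actual file):
awk -F'\t' 'BEGIN{OFS="\t"} {print $2, $1}' epi_barcodes.txt > epi_barcodes_final.txt # swap the two columns: barcode first, sample name second
head epi_barcodes_final.txt # sanity check: should show barcode<TAB>sample_name
awk -F'\t' '{print NF}' epi_barcodes_final.txt | sort -u # should print only "2" (every line has exactly two columns)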
Ok! We have our barcode file, and we have our sequences. STACKS has already been installed on your machine, so we just need to make a new folder called demultiplex, and then run the process_radtags program to filter our reads for quality and then demultiplex them. Who remembers why we need to demultiplex?
mkdir demultiplex
The general code we will use for process_radtags, running it from within the epi folder, is the following:
process_radtags -P -p . -b ./epi_barcodes_final.txt -o ./demultiplex -c -q -r -D --inline_null --renz_1 sphI --renz_2 mluCI
This takes about 6 minutes to run, so let’s talk about the parameters used in process_radtags
# -P means reads are paired
# -p means path to directory with our sequence files ('.' means 'here!')
# -b indicates path to barcodes file
# -o is for output directory
# -c tells the program to remove reads with Ns
# -q tells the program to remove low-quality reads
# -r tells the program to "rescue" barcodes and RAD-Tags (correcting small sequencing errors where possible)
# -D tells the program to save the discarded reads
# you would alter these parameters depending how you make your libraries
# --inline_null tells the program that the first barcode is inline and the second is absent (recall we removed the second inline barcode with fastx-toolkit)
# --renz_1 is the first enzyme (cut site present in R1)
# --renz_2 is the second enzyme (cut site present in R2)
What do our demultiplexed files look like…?
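A quick way to check is to peek inside the demultiplex folder; a minimal sketch (exact file names will depend on your samples):
ls demultiplex | head # one .1.fq.gz and one .2.fq.gz file per sample, plus remainder (.rem.) and discard files
gunzip -c demultiplex/*.1.fq.gz | head -8 # first two reads of the forward files (FASTQ = 4 lines per read)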
Step 2. Genotyping in STACKS
Before running the next step, we need to generate a population map file.
EXERCISE: Open a new text file and populate it with the following data, one line per sample:
sample name | sample name
Save this file as epi_popmap.txt. (The population map format is sample name<TAB>population name; here we simply repeat the sample name in the second column.)
Side-note
Again! grep (Global regular expression print) is one of the most amazing things about text editors. Copy the sample names from the epi_barcodes.txt file into a new file. Cmd+F, make sure 'grep' is checked, then search for
(\w*)\n
and replace with \1\t\1\n
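From the command line, a minimal sketch that builds the same file from the barcodes file we just made (assuming epi_barcodes_final.txt is tab-delimited with the sample name in column 2):
cut -f2 epi_barcodes_final.txt | awk 'BEGIN{OFS="\t"} {print $1, $1}' > epi_popmap.txt # duplicate the sample-name column
head epi_popmap.txt # each line should read: sample_name<TAB>sample_name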
Let’s start setting up denovo_map runs. First, move the files with good sequences into a new good folder using the following commands. We’ll also move the unwanted sequences into their own bad folder. The good directory will be part of the input for running denovo_map.pl.
cd demultiplex
mkdir bad
mkdir good
mv *[a-b0-9]\.[1-2].fq.gz good # [0-9] is wildcard code for any one number between 0 and 9; [a-b] means any one letter between a and b; [1-2] means any one number between 1 and 2
mv *q.gz bad
mv *discards bad
cd .. # move up one level
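Before launching denovo_map, it is worth double-checking that the files were sorted correctly; a minimal sketch (counts will depend on your number of samples):
ls demultiplex/good | wc -l # expect two files per sample (.1.fq.gz and .2.fq.gz)
ls demultiplex/bad | wc -l # remainder (.rem.) and discarded reads end up here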
Here is the general code we will use to run denovo_map:
mkdir denovo
denovo_map.pl -T 2 -m 3 -M 2 -n 3 -o ./denovo --samples ./demultiplex/good --popmap epi_popmap.txt --paired # takes ~50min to run
OK, let’s start denovo_map!!! Look at the terminal window as it runs…. what’s happening currently…? While we wait, let’s look more into the STACKS manual, talk about the components of the denovo_map.pl pipeline in this mini-lecture and the challenges inherent to it (both computational and biological), and then come back to our denovo_map.pl code!
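If you want to keep an eye on progress from a second terminal window, here is a minimal sketch (assuming denovo_map.pl writes its log file, denovo_map.log, into the ./denovo output folder):
tail -f denovo/denovo_map.log # follow the log as each sample is processed (Ctrl+C to stop watching)
ls -lh denovo # output files appear here as each stage of the pipeline finishes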
Troubleshooting parameters in stacks (and in any other pipeline!)
One thing that is very important in stacks is troubleshooting parameter settings. This has become such an important issue that many papers have tried to establish guidelines for determining ideal parameter settings for our datasets, including one from the authors of stacks with these useful guidelines for parameter tests. Why is this? Think about orthology and paralogy: how could gene duplication affect our pipeline analysis if we have no reference genome to map our reads to?
Let’s take a look at some parameter tests that we recently did with our semester-long workshop, MingaGenomica2019, where we did troubleshooting of parameters in stacks for one project and in ipyrad for another project.
In general, the parameters that one is most interested in troubleshooting with stacks are:
m — specify a minimum number of identical, raw reads required to create a stack.
M — specify the number of mismatches allowed between reads when processing a single individual (default 2).
n — specify the number of mismatches allowed between reads (among inds.) when building the catalog (default 1).
Note 1: The higher the coverage, the higher the m parameter can be.
Note 2: M should not be 1 (diploid data) but also should not be very high since it will begin to stack paralogs.
Note 3: n will depend on how divergent our individuals/populations are. It should not be zero, since that would allow no mismatches between individuals’ loci when building the catalog (so variable loci would not be merged), but 1 also seems unrealistically low (only a single difference between individuals at any given locus), so in these kinds of datasets we should run permutations that start from 2. If you use n 1 you are likely to oversplit loci among populations that are more divergent.
The same paper that discussed the issues with ref_map.pl mentioned previously also gives some tips for picking ideal parameter settings for stacks… but, in general, my recommendation would be: explore your dataset!!! Some general suggestions from it:
- Setting the value of m is, in essence, choosing how much “error” you will include/exclude from your dataset. This parameter creates a trade-off between including error and excluding actual alleles/polymorphism. Higher values of m increase the average sample coverage but decrease the number of assembled loci; after m=3 the number of loci is more stable.
- Setting the value of M is a trade-off between overmerging (paralogs) and undermerging (splitting) loci within samples. It is VERY dataset-specific, since it depends on the polymorphism in the species and on the amount of error (from both library prep and sequencing).
- Setting the value of n is also critical when it comes to overmerging and undermerging loci. There seems to be an unlimited number of loci that can be merged into the catalog with increasing n!!
- Finally, the authors suggest that a general rule for setting parameters is n=M, n=M-1, or n=M+1, and that M is the main parameter that needs to be explored for each dataset.
However, given that these methods are still very new and that we still don’t know how to “easily” and properly assess error, the more permutations you do with the parameter settings, the more you will understand what your dataset is like, and the better/more “real” your loci/alleles will be. Here are some recommended permutations to run with your dataset:
Permutation | -m | -M | -n | --max_locus_stacks
---|---|---|---|---
a | 3 | 2 | 2 | 3
b | 5 | 2 | 2 | 3
c | 7 | 2 | 2 | 3
d | 3 | 3 | 2 | 3
e | 3 | 4 | 2 | 3
f | 3 | 5 | 2 | 3
g | 3 | 2 | 3 | 3
h | 3 | 2 | 4 | 3
i | 3 | 2 | 5 | 3
j | 3 | 2 | 2 | 4
k | 3 | 2 | 2 | 5
You can evaluate the number of loci, SNPs, and FST values that you get from these parameter permutations to assess the “stability” of genotyping and pick optimal parameter settings.
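As a rough illustration of how such permutations can be scripted, here is a minimal shell sketch that loops over a few values of M (keeping n = M, following the rule above); the folder names and fixed parameters are assumptions to adapt to your own permutation table:
for M in 2 3 4 5; do
  mkdir -p denovo_M${M}
  denovo_map.pl -T 2 -m 3 -M ${M} -n ${M} -o ./denovo_M${M} --samples ./demultiplex/good --popmap epi_popmap.txt --paired # one run per parameter combination
done
# afterwards, compare the number of loci/SNPs reported in each denovo_M*/ output to assess stability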
Appendix
If you’re unable to perform the process_radtags step, download the complete demultiplex folder here.
If you’re unable to perform the denovo_map step, download the complete denovo folder (with populations output) here.
A full run of stacks with these data (approx 1.5GB) can be downloaded here.
Using reference genomes for genotyping with stacks and GBS data
In most cases, having a reference genome is a good thing. However, STACKS is designed for non-model organisms, so in fact their denovo_map algorithms are superior and more self-contained than their ref_map algorithms.
For ref_map.pl you need to run a separate alignment tool prior to running stacks, so the pipeline workflow has one extra step:
process_radtags
GSNAP or bwa
ref_map.pl
“The ref_map.pl program takes as input aligned reads. It does not provide the assembly parameters that denovo_map.pl does and this is because the job of assembling the loci is being taken over by your aligner program (e.g. BWA or GSnap). You must take care that you have good alignments – discarding reads with multiple alignments, making sure that you do not allow too many gaps in your sequences (otherwise loci with repeat elements can easily be collapsed during alignments), and take care not to allow soft-masking in the alignments. This occurs when an aligner can not make a full alignment and instead soft-masks the portion of the read that could not be aligned (pretending that this part of the read does not exist). These factors, if not cared for, can cause spurious SNP calls and problems in the downstream analysis.”
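For illustration, here is a minimal sketch of what that careful alignment filtering could look like with bwa and samtools (the reference and read file names are hypothetical, and filtering soft-clipped reads via the CIGAR string is just one possible approach):
bwa index reference.fa # index the reference genome (hypothetical file name)
bwa mem -t 2 reference.fa sample_01.1.fq.gz sample_01.2.fq.gz > sample_01.sam # align one sample's paired reads
samtools view -h -F 0x900 -q 20 sample_01.sam | awk '$1 ~ /^@/ || $6 !~ /S/' | samtools sort -o sample_01.sorted.bam - # drop secondary/supplementary alignments (flag 0x900), low mapping quality, and soft-clipped reads (CIGAR containing S)
samtools index sample_01.sorted.bam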
A recent paper, Lost in Parameter Space: a roadmap for stacks, shows that building ref_map loci in STACKS is not very efficient and loses too much data, which complicates the pipeline even more! Thus, the pipeline for ref_map.pl using their so-called “integrated” method should be:
process_radtags #demultiplex
denovo_map.pl #initial genotyping
GSNAP or bwa #align raw reads to catalog loci from denovo
integrate_alignments.py #integrate alignment information back into denovo output
populations
So many steps!!