Getting started with raw Illumina data
In this workflow, we will use fastqc to check read quality and fastx-toolkit to trim reads. Our data come from a project looking at ddRAD and 2bRAD data in two frog clades. Here we have sequences from 12 individuals that span three genera (and 20MY) of poison frogs.
Let’s start by downloading the raw Illumina data that will be used in the next lessons.
Open the program Terminal and type the following commands.
cd # go to root of file system
cd ~/Desktop/Genomics_USFQ_2019/
mkdir workshop
cd workshop
curl -L -O https://www.dropbox.com/s/nuxs8c3po3mqksa/epiddrad_5M_R1_.fastq.gz?dl=1
curl -L -O https://www.dropbox.com/s/szf7lud0yhf7rqo/epiddrad_5M_R2_.fastq.gz?dl=1
# curl is a way to transfer files using a url from the command line
# -O option makes curl write to a file rather than stdout
# -L option forces curl to follow a redirect (determined by the web address)
Before we get started, a few words of wisdom:
Good judgment comes from experience
experience comes from bad judgment
go make mistakes!!
Your sequence files (like these) will likely be gzipped and with the file extension .fq.gz or fastq.gz.
ls # list all files in current directory
ls .. # list all files in one directory above
# rename files
mv epiddrad_5M_R1_.fastq.gz?dl=1 epiddrad_5M_R1_.fastq.gz
mv epiddrad_5M_R2_.fastq.gz?dl=1 epiddrad_5M_R2_.fastq.gz
Let’s move these files into a new directory
mkdir epi
mv *.gz epi
ls
cd epi
ls -lht # -l = long format, -h = human readable, -t = order by time
Great! Now let’s look at the files.
head epiddrad_5M_R1_.fastq.gz
Oops, what happened?! This gibberish is because of the format of the file, which is “gz” or “g-zipped”. Let’s unzip
gunzip *.gz
ls -lht
You can see that the size of the files expanded. Now try to peek:
head epiddrad_5M_R1_.fastq
This is the fastq format, which has four lines.
@K00179:78:HJ2KFBBXX:5:1121:25844:7099 1:N:0:TGACCA+AGATCT
GCATGCATGCAGCTTCTTCCTTCACATGATCTTCCACACGGTGCTTAGCTTTTTATAAAGGTGACTTCACGCACCGCGTTCATTGCAGTCAGAAGTGGCTTCAAAGGGAAATGGAAATGAATGACTTTTACTTCTCCTTTTCTCTGGATC
+
AAFFFF7FF7AAJFFJJJJJJJJJJJJJFJJFFJJJF<AFF-FFJJ<JJFJFF-F--7AJJ<J<JFJJ-FFFJF-AFJFJAFJJJJJJJJFA7AFFFFFJ7JAJAJJJ<J<7FF7A-AAJJJ<FJ-77A7FFAA-AAAAFF7FJJF<AFJ
# sequencer name:run ID:flowcell ID:flowcell lane:tile number in flow cell:x-coordinate of cluster :y-coordinate pair member:filtered?:control data:index sequence
# DNA sequence
# separator, can also sometimes (not often) hold extra information about the read
# quality score for each base
Step 0. The first step in any NGS project is to check read quality
# fastqc is already installed, but if you wanted to download it, this is how:
# curl -L -O https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
# unzip fastqc_v0.11.5.zip
# let's take a look at fastqc options
../../programs/FastQC/fastqc -h # this is the relative path to fastqc
../../programs/FastQC/fastqc *.fastq
When the program is finished, take a look at what files are in the directory using ls
.
fastqc produces a nice .html file that can be viewed in any browser.
Since we are trying to get comfortable with the command line, let’s open the file directly.
open epiddrad_5M_R1__fastqc.html
Sequencing quality scores, “Q”, run from 20 to 40. In the fastq file, these are seen as ASCII characters. The values are log-scaled: 20 = 1/100 errors; 30 = 1/1000 errors. Anything below 20 is garbage and anything between 20 and 30 should be reviewed. There appear to be errors in the kmer content, but really these are just showing where the barcodes and restriction enzyme sites are.
Now, let’s take a short pause to look into how ddRAD libraries are constructed in this mini-lecture to understand why our Illumina reads look like this:
Ok, now back to our quality checks, we can see that overall our data are of high quality (high Q scores, no weird tile patterns, no adaptor contamination). But let’s also check out R2
open epiddrad_5M_R2__fastqc.html
Whoops what’s going on here? When Illumina first came out with their HiSeq 4000 machines, the R2 had a lot of error. Let’s try installing a new program to clean up these reads.
curl -LO http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit_0.0.13_binaries_MacOSX.10.5.8_i386.tar.bz2
tar -xjf fastx_toolkit_0.0.13_binaries_MacOSX.10.5.8_i386.tar.bz2
ls # the folder bin appeared!
rm fastx_toolkit_0.0.13_binaries_MacOSX.10.5.8_i386.tar.bz2 # remove the zipped file as we no longer need it
./bin/fastx_trimmer -h
./bin/fastx_trimmer -f 11 -l 100 -i epiddrad_5M_R2_.fastq -o epiddrad_5M_R2_.fastq.trim
# -f first base to keep (we are trimming the degenerate barcode)
# -l last base to keep (to remove the low quality bases)
# -i = infile
# -o = outfile
Oops! What happened?
EXERCISE: Let’s search the error on google.
How can we fix the error?
We just need to tell the program to use a different quality score with-Q33
:./bin/fastx_trimmer -f 11 -l 100 -i epiddrad_5M_R2_.fastq -o epiddrad_5M_R2_.fastq.trim -Q33 # takes ~3min
While we are waiting let’s do a quick exercise!
Open up a new Terminal window and navigate to your epi
folder. Try this command
touch newfile
What did that do? Now try opening the file with the program vi
vi newfile
# this is a text editor within the bash shell
# in order to add characters you must first type i for insert
# try it!
# in order to save the file you first type Esc then :wq which means write, then quit
# if you make a mistake, type :q! which means quit without saving changes
cat newfile
There is a lot to explore with vi, but let’s leave that for another time.
Is our fastx trimming done? Let’s take a look.
head epiddrad_5M_R2_.fastq.trim
head epiddrad_5M_R2_.fastq
# see the difference in length?
OK, let’s clean up our files and then move on to looking at the raw data!!
rm newfile
mkdir fastqc
mv *__* fastqc
mkdir old
mv epiddrad_5M_R2_.fastq old/epiddrad_5M_R2_.fastq.old
mv epiddrad_5M_R2_.fastq.trim epiddrad_5M_R2_.fastq
Looking at your raw data in the linux environment
Let’s practice some shell commands, starting with the cat
program:
cat epiddrad_5M_R1_.fastq
Uh oh… let’s quit before the computer crashes… it’s too much to look at! Ctrl+C
This is the essential computational problem with NGS data that is so hard to get over. You can’t
open a file in its entirety! Here are some alternative ways to view parts of a file at a time.
# print the first 10 lines of a file
head epiddrad_5M_R1_.fastq
# print the first 20 lines of a file
head epiddrad_5M_R1_.fastq -20 # '-20' is an argument for the 'head' program
# print lines 190-200 of a file
head -200 epiddrad_5M_R1_.fastq | tail # 'pipes' the first 200 lines to a program called tail, which prints the last 10 lines
# view the file interactively
less epiddrad_5M_R1_.fastq # can scroll through file with cursor, page with spacebar; quit with 'q'
# NOTE: here we can use less because the file is not in gzipped (remember that required the 'zless' command)
# open up manual for less program
man less # press q to quit
# print only the first 10 lines and only the first 30 characters of each line
head -200 epiddrad_5M_R1_.fastq | cut -c -30
# count the number of lines in the file
wc -l epiddrad_5M_R1_.fastq # (this takes a moment because the file is large)
# print lines with AAAAA in them
grep 'AAAAA' epiddrad_5M_R1_.fastq # ctrl+c to exit
# count lines with AAAAA in them
grep -c 'AAAAA' epiddrad_5M_R1_.fastq
# save lines with AAAAA in them as a separate file
grep 'AAAAA' epiddrad_5M_R1_.fastq > AAAAA # no file extensions necessary!!
# add lines with TTTTT to the AAAAA file: '>' writes or overwrites file; '>>' writes or appends to file
grep 'TTTTT' epiddrad_5M_R1_.fastq >> AAAAA
# print lines with aaaaa in them
grep 'aaaaa' epiddrad_5M_R1_.fastq
# why doesn't this produce any output?
# count number of uniq sequences in file with pattern 'AGAT'
grep 'AGAT' epiddrad_5M_R1_.fastq | sort | uniq | wc -l
# print only the second field of the sequence information line
head epiddrad_5M_R1_.fastq | grep '@' | awk -F':' '{ print $2 }'
# awk is a very useful program for parsing files; here ':' is the delimiter, $2 is the column location
EXERCISE
How would you count the number of reads in your file?
There are lots of answers for this one. One example: in the fastq format, the character '@' occurs once per sequence, so we can just count how many lines contain '@'.grep -c '@' epiddrad_5M_R1_.fastq.gz
Appendix
If you’re unable to perform the fastqc analyses, you can download them here
If you’re unable to perform the fastx_trimmer step, don’t worry, just proceed with the rest of the steps as normal.
A full run of stacks with these data (approx 1.5GB) can be downloaded here.
To learn more about awk basics, a very powerful tool for editing/rewriting text files, you can start here.
There are actually several ways to look at .gz files, such as:
zless epiddrad_t200_R1_.fastq.gz # press 'q' to exit
gzcat epiddrad_t200_R1_.fastq.gz | head # the "|" pipes stdout to the program "head"
gzcat epiddrad_t200_R1_.fastq.gz | head -100 # shows the first 100 lines
The linux environment
File systems
- your computer contains a nested hierarchy of directories
- directory is a folder on your computer which contains files
- keeping track of where you are in the file structure of your computer is an important component of computing
- highest level is the root (denoted:
/
) - forward slashes divide levels in the nested hierarchy of directories, e.g.
/top_level_directory/second_level_directory
- there are several high-level directories that users don’t usually go into where program files are stored
- /usr/bin
- /usr/lib
- every file on your computer has an address; if you are going to do an operation on a file, you need its address
- path: the address to a directory or file on your computer. There are, generally, two types of paths:
- absolute/full path represents the path of a given directory or file beginning at the root directory
- relative path represents the path of a given directory/file relative to the working/current directory
- for example, say you have a file “my_favorite_file.txt” located in the directory
/Users/myname/Desktop/my_directory
.- the full path to this file is
/Users/myname/Desktop/my_directory/my_favorite_file.txt
- the relative path to this file depends on where you are on the computer
- if you are calling this file from Desktop, the relative path would be
my_directory/my_favorite_file.txt
- if you are in
/Users/myname/
, the relative path becomesDesktop/my_directory/my_favorite_file.txt
- the full path to this file is
Remember - Whenever you call the full path, you can reach the file from anywhere on your computer. Relative paths will change based on your current location.
Unix
- commands are small programs
- type the name of a command and hit enter
- Unix searches for the program’s text file and executes it
- programs have preset arguments which change their behavior
- these can be found in the program’s manual
- programs interact with files that are in the directory that you are in
- we use a “shell” to interact with Unix: it exchanges information between user and program through standard streams
- standard input: input to programs
- standard output: information on screen, i.e. what the program outputs
- standard text editor for Unix is nano
- type
nano
to access it - opens a text editor within the shell
- saving, exiting, and other functions are controlled with ctrl + letter keys
- we will be using primarily the Atom text editor instead of nano
- type
Here are some commands you will use all the time, we will review them as we proceed with the lesson
Command | Translation | Examples |
---|---|---|
cd |
change directory | cd /absolute/path/of/the/directory/ Go to the home directory by typing simply cd or cd ~ Go up (back) a directory by typing cd .. |
pwd |
print working directory | pwd |
mkdir |
make directory | mkdir newDirectory creates newDirectory in your current directory Make a directory one level up with mkdir ../newDirectory |
cp |
copy | cp file.txt newfile.txt (and file.txt will still exist!) |
mv |
move | mv file.txt newfile.txt (but file.txt will no longer exist!) |
rm |
remove | rm file.txt removes file.txt rm -r directoryname/ removes the directory and all files within |
ls |
list | ls *.txt lists all .txt files in current directory ls -a lists all files including hidden ones in the current directory ls -l lists all files in current directory including file sizes and timestamps ls -lh does the same but changes file size format to be human-readable ls ../ lists files in the directory above the current one |
man |
manual | man ls opens the manual for command ls (use q to escape page) |
grep |
global regular expression parser |
grep ">" seqs.fasta pulls out all sequence names in a fasta file grep -c ">" seqs.fasta counts the number of those sequences |
cat |
concatenate | cat seqs.fasta prints the contents of seqs.fasta to the screen (ie stdout) |
head |
head | head seqs.fasta prints the first 10 lines of the file head -n 3 seqs.fasta prints first 3 lines |
tail |
tail | tail seqs.fasta prints the last 10 lines of the file tail -n 3 seqs.fasta prints last 3 lines |
wc |
word count | wc filename.txt shows the number of new lines, number of words, and number of characters wc -l filename.txt shows only the number of new lines wc -c filename.txt shows only the number of characters |
sort |
sort | sort filename.txt sorts file and prints output |
uniq |
unique | uniq -u filename.txt shows only unique elements of a list (must use sort command first to cluster repeats) |
Handy dandy shortcuts
Shortcut | Use |
---|---|
Ctrl + C | kills current process |
Ctrl + L (or clear ) |
clears screen |
Ctrl + A | Go to the beginning of the line |
Ctrl + E | Go to the end of the line |
Ctrl + U | Clears the line before the cursor position |
Ctrl + K | Clear the line after the cursor |
* |
wildcard character |
tab | completes word |
Up Arrow | call last command |
. |
current directory |
.. |
one level up |
~ |
home |
> |
redirects stdout to a file, overwriting file if it already exists |
>> |
redirects stdout to a file, appending to the end of file if it already exists |
pipe (| ) |
redirects stdout to become stdin for next command |