Getting started with raw Illumina data

In this workflow, we will use fastqc to check read quality and fastx-toolkit to trim reads. Our data come from a project looking at ddRAD and 2bRAD data in two frog clades. Here we have sequences from 12 individuals that span three genera (and 20MY) of poison frogs.

Let’s start by downloading the raw Illumina data that will be used in the next lessons.

Open the program Terminal and type the following commands.

cd # go to root of file system
cd ~/Desktop/Genomics_USFQ_2019/
mkdir workshop 
cd workshop
curl -L -O https://www.dropbox.com/s/nuxs8c3po3mqksa/epiddrad_5M_R1_.fastq.gz?dl=1
curl -L -O https://www.dropbox.com/s/szf7lud0yhf7rqo/epiddrad_5M_R2_.fastq.gz?dl=1
# curl is a way to transfer files using a url from the command line
# -O option makes curl write to a file rather than stdout
# -L option forces curl to follow a redirect (determined by the web address)

Before we get started, a few words of wisdom:

Good judgment comes from experience
experience comes from bad judgment
go make mistakes!!

Your sequence files (like these) will likely be gzipped and with the file extension .fq.gz or fastq.gz.

ls # list all files in current directory
ls .. # list all files in one directory above
# rename files
mv epiddrad_5M_R1_.fastq.gz?dl=1 epiddrad_5M_R1_.fastq.gz
mv epiddrad_5M_R2_.fastq.gz?dl=1 epiddrad_5M_R2_.fastq.gz

Let’s move these files into a new directory

mkdir epi
mv *.gz epi
ls
cd epi
ls -lht # -l = long format, -h = human readable, -t = order by time

Great! Now let’s look at the files.

head epiddrad_5M_R1_.fastq.gz

Oops, what happened?! This gibberish is because of the format of the file, which is “gz” or “g-zipped”. Let’s unzip

gunzip *.gz 
ls -lht

You can see that the size of the files expanded. Now try to peek:

head epiddrad_5M_R1_.fastq

This is the fastq format, which has four lines.

@K00179:78:HJ2KFBBXX:5:1121:25844:7099 1:N:0:TGACCA+AGATCT
GCATGCATGCAGCTTCTTCCTTCACATGATCTTCCACACGGTGCTTAGCTTTTTATAAAGGTGACTTCACGCACCGCGTTCATTGCAGTCAGAAGTGGCTTCAAAGGGAAATGGAAATGAATGACTTTTACTTCTCCTTTTCTCTGGATC
+
AAFFFF7FF7AAJFFJJJJJJJJJJJJJFJJFFJJJF<AFF-FFJJ<JJFJFF-F--7AJJ<J<JFJJ-FFFJF-AFJFJAFJJJJJJJJFA7AFFFFFJ7JAJAJJJ<J<7FF7A-AAJJJ<FJ-77A7FFAA-AAAAFF7FJJF<AFJ

# sequencer name:run ID:flowcell ID:flowcell lane:tile number in flow cell:x-coordinate of cluster :y-coordinate pair member:filtered?:control data:index sequence
# DNA sequence
# separator, can also sometimes (not often) hold extra information about the read
# quality score for each base

Step 0. The first step in any NGS project is to check read quality

# fastqc is already installed, but if you wanted to download it, this is how: 
# curl -L -O https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
# unzip fastqc_v0.11.5.zip

# let's take a look at fastqc options
../../programs/FastQC/fastqc -h # this is the relative path to fastqc
../../programs/FastQC/fastqc *.fastq

When the program is finished, take a look at what files are in the directory using ls. fastqc produces a nice .html file that can be viewed in any browser.
Since we are trying to get comfortable with the command line, let’s open the file directly.

open epiddrad_5M_R1__fastqc.html

Sequencing quality scores, “Q”, run from 20 to 40. In the fastq file, these are seen as ASCII characters. The values are log-scaled: 20 = 1/100 errors; 30 = 1/1000 errors. Anything below 20 is garbage and anything between 20 and 30 should be reviewed. There appear to be errors in the kmer content, but really these are just showing where the barcodes and restriction enzyme sites are.

Now, let’s take a short pause to look into how ddRAD libraries are constructed in this mini-lecture to understand why our Illumina reads look like this:

Ok, now back to our quality checks, we can see that overall our data are of high quality (high Q scores, no weird tile patterns, no adaptor contamination). But let’s also check out R2

open epiddrad_5M_R2__fastqc.html

Whoops what’s going on here? When Illumina first came out with their HiSeq 4000 machines, the R2 had a lot of error. Let’s try installing a new program to clean up these reads.

curl -LO http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit_0.0.13_binaries_MacOSX.10.5.8_i386.tar.bz2
tar -xjf fastx_toolkit_0.0.13_binaries_MacOSX.10.5.8_i386.tar.bz2
ls # the folder bin appeared!
rm fastx_toolkit_0.0.13_binaries_MacOSX.10.5.8_i386.tar.bz2 # remove the zipped file as we no longer need it 
./bin/fastx_trimmer -h
./bin/fastx_trimmer -f 11 -l 100 -i epiddrad_5M_R2_.fastq -o epiddrad_5M_R2_.fastq.trim 

# -f first base to keep (we are trimming the degenerate barcode)
# -l last base to keep (to remove the low quality bases)
# -i = infile
# -o = outfile

Oops! What happened?

EXERCISE: Let’s search the error on google.

How can we fix the error?

We just need to tell the program to use a different quality score with -Q33:
./bin/fastx_trimmer -f 11 -l 100 -i epiddrad_5M_R2_.fastq -o epiddrad_5M_R2_.fastq.trim -Q33 # takes ~3min

While we are waiting let’s do a quick exercise!

Open up a new Terminal window and navigate to your epi folder. Try this command

touch newfile

What did that do? Now try opening the file with the program vi

vi newfile

# this is a text editor within the bash shell
# in order to add characters you must first type i for insert
# try it!
# in order to save the file you first type Esc then :wq which means write, then quit
# if you make a mistake, type :q! which means quit without saving changes

cat newfile

There is a lot to explore with vi, but let’s leave that for another time.
Is our fastx trimming done? Let’s take a look.

head epiddrad_5M_R2_.fastq.trim
head epiddrad_5M_R2_.fastq
# see the difference in length?

OK, let’s clean up our files and then move on to looking at the raw data!!

rm newfile
mkdir fastqc
mv *__* fastqc 
mkdir old
mv epiddrad_5M_R2_.fastq old/epiddrad_5M_R2_.fastq.old
mv epiddrad_5M_R2_.fastq.trim epiddrad_5M_R2_.fastq

Looking at your raw data in the linux environment

Let’s practice some shell commands, starting with the cat program:

cat epiddrad_5M_R1_.fastq

Uh oh… let’s quit before the computer crashes… it’s too much to look at! Ctrl+C

This is the essential computational problem with NGS data that is so hard to get over. You can’t open a file in its entirety! Here are some alternative ways to view parts of a file at a time.

# print the first 10 lines of a file
head epiddrad_5M_R1_.fastq

# print the first 20 lines of a file
head epiddrad_5M_R1_.fastq -20 # '-20' is an argument for the 'head' program

# print lines 190-200 of a file
head -200 epiddrad_5M_R1_.fastq | tail # 'pipes' the first 200 lines to a program called tail, which prints the last 10 lines

# view the file interactively
less epiddrad_5M_R1_.fastq # can scroll through file with cursor, page with spacebar; quit with 'q'
# NOTE: here we can use less because the file is not in gzipped (remember that required the 'zless' command)

# open up manual for less program
man less # press q to quit

# print only the first 10 lines and only the first 30 characters of each line
head -200 epiddrad_5M_R1_.fastq | cut -c -30 

# count the number of lines in the file
wc -l epiddrad_5M_R1_.fastq # (this takes a moment because the file is large)

# print lines with AAAAA in them
grep 'AAAAA' epiddrad_5M_R1_.fastq # ctrl+c to exit

# count lines with AAAAA in them
grep -c 'AAAAA' epiddrad_5M_R1_.fastq

# save lines with AAAAA in them as a separate file
grep 'AAAAA' epiddrad_5M_R1_.fastq > AAAAA # no file extensions necessary!!

# add lines with TTTTT to the AAAAA file: '>' writes or overwrites file; '>>' writes or appends to file
grep 'TTTTT' epiddrad_5M_R1_.fastq >> AAAAA 

# print lines with aaaaa in them
grep 'aaaaa' epiddrad_5M_R1_.fastq
# why doesn't this produce any output?

# count number of uniq sequences in file with pattern 'AGAT'
grep 'AGAT' epiddrad_5M_R1_.fastq | sort | uniq | wc -l

# print only the second field of the sequence information line
head epiddrad_5M_R1_.fastq | grep '@' | awk -F':' '{ print $2 }' 
# awk is a very useful program for parsing files; here ':' is the delimiter, $2 is the column location

EXERCISE

How would you count the number of reads in your file?

There are lots of answers for this one. One example: in the fastq format, the character '@' occurs once per sequence, so we can just count how many lines contain '@'.
grep -c '@' epiddrad_5M_R1_.fastq.gz

Appendix

If you’re unable to perform the fastqc analyses, you can download them here
If you’re unable to perform the fastx_trimmer step, don’t worry, just proceed with the rest of the steps as normal.
A full run of stacks with these data (approx 1.5GB) can be downloaded here.

To learn more about awk basics, a very powerful tool for editing/rewriting text files, you can start here.

There are actually several ways to look at .gz files, such as:

zless epiddrad_t200_R1_.fastq.gz # press 'q' to exit
gzcat epiddrad_t200_R1_.fastq.gz | head # the "|" pipes stdout to the program "head"
gzcat epiddrad_t200_R1_.fastq.gz | head -100 # shows the first 100 lines

The linux environment

File systems

your computer contains a nested hierarchy of directories
directory is a folder on your computer which contains files
keeping track of where you are in the file structure of your computer is an important component of computing
highest level is the root (denoted: /)
forward slashes divide levels in the nested hierarchy of directories, e.g. /top_level_directory/second_level_directory
there are several high-level directories that users don’t usually go into where program files are stored
- /usr/bin
- /usr/lib
every file on your computer has an address; if you are going to do an operation on a file, you need its address
path: the address to a directory or file on your computer. There are, generally, two types of paths:
- absolute/full path represents the path of a given directory or file beginning at the root directory
- relative path represents the path of a given directory/file relative to the working/current directory
for example, say you have a file “my_favorite_file.txt” located in the directory /Users/myname/Desktop/my_directory.
- the full path to this file is /Users/myname/Desktop/my_directory/my_favorite_file.txt
- the relative path to this file depends on where you are on the computer
- if you are calling this file from Desktop, the relative path would be my_directory/my_favorite_file.txt
- if you are in /Users/myname/, the relative path becomes Desktop/my_directory/my_favorite_file.txt

Remember - Whenever you call the full path, you can reach the file from anywhere on your computer. Relative paths will change based on your current location.

Unix

commands are small programs
- type the name of a command and hit enter
- Unix searches for the program’s text file and executes it
programs have preset arguments which change their behavior
- these can be found in the program’s manual
programs interact with files that are in the directory that you are in
we use a “shell” to interact with Unix: it exchanges information between user and program through standard streams
standard input: input to programs
standard output: information on screen, i.e. what the program outputs
standard text editor for Unix is nano
- type nano to access it
- opens a text editor within the shell
- saving, exiting, and other functions are controlled with ctrl + letter keys
- we will be using primarily the Atom text editor instead of nano

Here are some commands you will use all the time, we will review them as we proceed with the lesson

Command	Translation	Examples
`cd`	change directory	`cd /absolute/path/of/the/directory/` Go to the home directory by typing simply `cd` or `cd ~` Go up (back) a directory by typing `cd ..`
`pwd`	print working directory	`pwd`
`mkdir`	make directory	`mkdir newDirectory` creates newDirectory in your current directory Make a directory one level up with `mkdir ../newDirectory`
`cp`	copy	`cp file.txt newfile.txt` (and file.txt will still exist!)
`mv`	move	`mv file.txt newfile.txt` (but file.txt will no longer exist!)
`rm`	remove	`rm file.txt` removes file.txt `rm -r directoryname/` removes the directory and all files within
`ls`	list	`ls *.txt` lists all .txt files in current directory `ls -a` lists all files including hidden ones in the current directory `ls -l` lists all files in current directory including file sizes and timestamps `ls -lh` does the same but changes file size format to be human-readable `ls ../` lists files in the directory above the current one
`man`	manual	`man ls` opens the manual for command `ls` (use `q` to escape page)
`grep`	global regular expression parser	`grep ">" seqs.fasta` pulls out all sequence names in a fasta file `grep -c ">" seqs.fasta` counts the number of those sequences
`cat`	concatenate	`cat seqs.fasta` prints the contents of seqs.fasta to the screen (ie stdout)
`head`	head	`head seqs.fasta` prints the first 10 lines of the file `head -n 3 seqs.fasta` prints first 3 lines
`tail`	tail	`tail seqs.fasta` prints the last 10 lines of the file `tail -n 3 seqs.fasta` prints last 3 lines
`wc`	word count	`wc filename.txt` shows the number of new lines, number of words, and number of characters `wc -l filename.txt` shows only the number of new lines `wc -c filename.txt` shows only the number of characters
`sort`	sort	`sort filename.txt` sorts file and prints output
`uniq`	unique	`uniq -u filename.txt` shows only unique elements of a list (must use sort command first to cluster repeats)

Handy dandy shortcuts

Shortcut	Use
Ctrl + C	kills current process
Ctrl + L (or `clear`)	clears screen
Ctrl + A	Go to the beginning of the line
Ctrl + E	Go to the end of the line
Ctrl + U	Clears the line before the cursor position
Ctrl + K	Clear the line after the cursor
`*`	wildcard character
tab	completes word
Up Arrow	call last command
`.`	current directory
`..`	one level up
`~`	home
`>`	redirects stdout to a file, overwriting file if it already exists
`>>`	redirects stdout to a file, appending to the end of file if it already exists
pipe (`\|`)	redirects stdout to become stdin for next command

Genomics of Biodiversity @ IBS 2019

This is the course website for the Genomics of Biodiversity workshop at the International Biogeography Society meeting in Quito, Ecuador.