Getting started with raw Illumina data

In this workflow, we will use fastqc to check read quality and fastx-toolkit to trim reads. Our data come from a project looking at ddRAD and 2bRAD data in two frog clades. Here we have sequences from 12 individuals that span three genera (and 20MY) of poison frogs.

Let’s start by downloading the raw Illumina data that will be used in the next lessons.

Open the program Terminal and type the following commands.

cd # go to root of file system
cd ~/Desktop/Genomics_USFQ_2019/
mkdir workshop 
cd workshop
curl -L -O https://www.dropbox.com/s/nuxs8c3po3mqksa/epiddrad_5M_R1_.fastq.gz?dl=1
curl -L -O https://www.dropbox.com/s/szf7lud0yhf7rqo/epiddrad_5M_R2_.fastq.gz?dl=1
# curl is a way to transfer files using a url from the command line
# -O option makes curl write to a file rather than stdout
# -L option forces curl to follow a redirect (determined by the web address)

Before we get started, a few words of wisdom:

Good judgment comes from experience
experience comes from bad judgment
go make mistakes!!


Your sequence files (like these) will likely be gzipped and with the file extension .fq.gz or fastq.gz.

ls # list all files in current directory
ls .. # list all files in one directory above
# rename files
mv epiddrad_5M_R1_.fastq.gz?dl=1 epiddrad_5M_R1_.fastq.gz
mv epiddrad_5M_R2_.fastq.gz?dl=1 epiddrad_5M_R2_.fastq.gz

Let’s move these files into a new directory

mkdir epi
mv *.gz epi
ls
cd epi
ls -lht # -l = long format, -h = human readable, -t = order by time

Great! Now let’s look at the files.

head epiddrad_5M_R1_.fastq.gz

Oops, what happened?! This gibberish is because of the format of the file, which is “gz” or “g-zipped”. Let’s unzip

gunzip *.gz 
ls -lht

You can see that the size of the files expanded. Now try to peek:

head epiddrad_5M_R1_.fastq

This is the fastq format, which has four lines.

@K00179:78:HJ2KFBBXX:5:1121:25844:7099 1:N:0:TGACCA+AGATCT
GCATGCATGCAGCTTCTTCCTTCACATGATCTTCCACACGGTGCTTAGCTTTTTATAAAGGTGACTTCACGCACCGCGTTCATTGCAGTCAGAAGTGGCTTCAAAGGGAAATGGAAATGAATGACTTTTACTTCTCCTTTTCTCTGGATC
+
AAFFFF7FF7AAJFFJJJJJJJJJJJJJFJJFFJJJF<AFF-FFJJ<JJFJFF-F--7AJJ<J<JFJJ-FFFJF-AFJFJAFJJJJJJJJFA7AFFFFFJ7JAJAJJJ<J<7FF7A-AAJJJ<FJ-77A7FFAA-AAAAFF7FJJF<AFJ

# sequencer name:run ID:flowcell ID:flowcell lane:tile number in flow cell:x-coordinate of cluster :y-coordinate pair member:filtered?:control data:index sequence
# DNA sequence
# separator, can also sometimes (not often) hold extra information about the read
# quality score for each base



Step 0. The first step in any NGS project is to check read quality

# fastqc is already installed, but if you wanted to download it, this is how: 
# curl -L -O https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
# unzip fastqc_v0.11.5.zip

# let's take a look at fastqc options
../../programs/FastQC/fastqc -h # this is the relative path to fastqc
../../programs/FastQC/fastqc *.fastq

When the program is finished, take a look at what files are in the directory using ls. fastqc produces a nice .html file that can be viewed in any browser.
Since we are trying to get comfortable with the command line, let’s open the file directly.

open epiddrad_5M_R1__fastqc.html 

Sequencing quality scores, “Q”, run from 20 to 40. In the fastq file, these are seen as ASCII characters. The values are log-scaled: 20 = 1/100 errors; 30 = 1/1000 errors. Anything below 20 is garbage and anything between 20 and 30 should be reviewed. There appear to be errors in the kmer content, but really these are just showing where the barcodes and restriction enzyme sites are.

Now, let’s take a short pause to look into how ddRAD libraries are constructed in this mini-lecture to understand why our Illumina reads look like this:


Ok, now back to our quality checks, we can see that overall our data are of high quality (high Q scores, no weird tile patterns, no adaptor contamination). But let’s also check out R2

open epiddrad_5M_R2__fastqc.html 

Whoops what’s going on here? When Illumina first came out with their HiSeq 4000 machines, the R2 had a lot of error. Let’s try installing a new program to clean up these reads.

curl -LO http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit_0.0.13_binaries_MacOSX.10.5.8_i386.tar.bz2
tar -xjf fastx_toolkit_0.0.13_binaries_MacOSX.10.5.8_i386.tar.bz2
ls # the folder bin appeared!
rm fastx_toolkit_0.0.13_binaries_MacOSX.10.5.8_i386.tar.bz2 # remove the zipped file as we no longer need it 
./bin/fastx_trimmer -h
./bin/fastx_trimmer -f 11 -l 100 -i epiddrad_5M_R2_.fastq -o epiddrad_5M_R2_.fastq.trim 

# -f first base to keep (we are trimming the degenerate barcode)
# -l last base to keep (to remove the low quality bases)
# -i = infile
# -o = outfile

Oops! What happened?

EXERCISE: Let’s search the error on google.

How can we fix the error? We just need to tell the program to use a different quality score with -Q33:
./bin/fastx_trimmer -f 11 -l 100 -i epiddrad_5M_R2_.fastq -o epiddrad_5M_R2_.fastq.trim -Q33 # takes ~3min

While we are waiting let’s do a quick exercise!

Open up a new Terminal window and navigate to your epi folder. Try this command

touch newfile

What did that do? Now try opening the file with the program vi

vi newfile

# this is a text editor within the bash shell
# in order to add characters you must first type i for insert
# try it!
# in order to save the file you first type Esc then :wq which means write, then quit
# if you make a mistake, type :q! which means quit without saving changes

cat newfile

There is a lot to explore with vi, but let’s leave that for another time.
Is our fastx trimming done? Let’s take a look.

head epiddrad_5M_R2_.fastq.trim
head epiddrad_5M_R2_.fastq
# see the difference in length?

OK, let’s clean up our files and then move on to looking at the raw data!!

rm newfile
mkdir fastqc
mv *__* fastqc 
mkdir old
mv epiddrad_5M_R2_.fastq old/epiddrad_5M_R2_.fastq.old
mv epiddrad_5M_R2_.fastq.trim epiddrad_5M_R2_.fastq

Looking at your raw data in the linux environment

Let’s practice some shell commands, starting with the cat program:

cat epiddrad_5M_R1_.fastq

Uh oh… let’s quit before the computer crashes… it’s too much to look at! Ctrl+C

This is the essential computational problem with NGS data that is so hard to get over. You can’t open a file in its entirety! Here are some alternative ways to view parts of a file at a time.

# print the first 10 lines of a file
head epiddrad_5M_R1_.fastq

# print the first 20 lines of a file
head epiddrad_5M_R1_.fastq -20 # '-20' is an argument for the 'head' program

# print lines 190-200 of a file
head -200 epiddrad_5M_R1_.fastq | tail # 'pipes' the first 200 lines to a program called tail, which prints the last 10 lines

# view the file interactively
less epiddrad_5M_R1_.fastq # can scroll through file with cursor, page with spacebar; quit with 'q'
# NOTE: here we can use less because the file is not in gzipped (remember that required the 'zless' command)

# open up manual for less program
man less # press q to quit

# print only the first 10 lines and only the first 30 characters of each line
head -200 epiddrad_5M_R1_.fastq | cut -c -30 

# count the number of lines in the file
wc -l epiddrad_5M_R1_.fastq # (this takes a moment because the file is large)

# print lines with AAAAA in them
grep 'AAAAA' epiddrad_5M_R1_.fastq # ctrl+c to exit

# count lines with AAAAA in them
grep -c 'AAAAA' epiddrad_5M_R1_.fastq

# save lines with AAAAA in them as a separate file
grep 'AAAAA' epiddrad_5M_R1_.fastq > AAAAA # no file extensions necessary!!

# add lines with TTTTT to the AAAAA file: '>' writes or overwrites file; '>>' writes or appends to file
grep 'TTTTT' epiddrad_5M_R1_.fastq >> AAAAA 

# print lines with aaaaa in them
grep 'aaaaa' epiddrad_5M_R1_.fastq
# why doesn't this produce any output?

# count number of uniq sequences in file with pattern 'AGAT'
grep 'AGAT' epiddrad_5M_R1_.fastq | sort | uniq | wc -l

# print only the second field of the sequence information line
head epiddrad_5M_R1_.fastq | grep '@' | awk -F':' '{ print $2 }' 
# awk is a very useful program for parsing files; here ':' is the delimiter, $2 is the column location

EXERCISE

How would you count the number of reads in your file? There are lots of answers for this one. One example: in the fastq format, the character '@' occurs once per sequence, so we can just count how many lines contain '@'.
grep -c '@' epiddrad_5M_R1_.fastq.gz



Appendix

If you’re unable to perform the fastqc analyses, you can download them here
If you’re unable to perform the fastx_trimmer step, don’t worry, just proceed with the rest of the steps as normal.
A full run of stacks with these data (approx 1.5GB) can be downloaded here.

To learn more about awk basics, a very powerful tool for editing/rewriting text files, you can start here.


There are actually several ways to look at .gz files, such as:

zless epiddrad_t200_R1_.fastq.gz # press 'q' to exit
gzcat epiddrad_t200_R1_.fastq.gz | head # the "|" pipes stdout to the program "head"
gzcat epiddrad_t200_R1_.fastq.gz | head -100 # shows the first 100 lines


The linux environment

File systems

Remember - Whenever you call the full path, you can reach the file from anywhere on your computer. Relative paths will change based on your current location.

Unix

Here are some commands you will use all the time, we will review them as we proceed with the lesson

Command Translation Examples
cd change directory cd /absolute/path/of/the/directory/
Go to the home directory by typing simply cd or cd ~
Go up (back) a directory by typing cd ..
pwd print working directory pwd
mkdir make directory mkdir newDirectory creates newDirectory in your current directory
Make a directory one level up with mkdir ../newDirectory
cp copy cp file.txt newfile.txt (and file.txt will still exist!)
mv move mv file.txt newfile.txt (but file.txt will no longer exist!)
rm remove rm file.txt removes file.txt
rm -r directoryname/ removes the directory and all files within
ls list ls *.txt lists all .txt files in current directory
ls -a lists all files including hidden ones in the current directory
ls -l lists all files in current directory including file sizes and timestamps
ls -lh does the same but changes file size format to be human-readable
ls ../ lists files in the directory above the current one
man manual man ls opens the manual for command ls (use q to escape page)
grep global regular
expression parser
grep ">" seqs.fasta pulls out all sequence names in a fasta file
grep -c ">" seqs.fasta counts the number of those sequences
cat concatenate cat seqs.fasta prints the contents of seqs.fasta to the screen (ie stdout)
head head head seqs.fasta prints the first 10 lines of the file
head -n 3 seqs.fasta prints first 3 lines
tail tail tail seqs.fasta prints the last 10 lines of the file
tail -n 3 seqs.fasta prints last 3 lines
wc word count wc filename.txt shows the number of new lines, number of words, and number of characters
wc -l filename.txt shows only the number of new lines
wc -c filename.txt shows only the number of characters
sort sort sort filename.txt sorts file and prints output
uniq unique uniq -u filename.txt shows only unique elements of a list
(must use sort command first to cluster repeats)



Handy dandy shortcuts

Shortcut Use
Ctrl + C kills current process
Ctrl + L
(or clear)
clears screen
Ctrl + A Go to the beginning of the line
Ctrl + E Go to the end of the line
Ctrl + U Clears the line before the cursor position
Ctrl + K Clear the line after the cursor
* wildcard character
tab completes word
Up Arrow call last command
. current directory
.. one level up
~ home
> redirects stdout to a file, overwriting file if it already exists
>> redirects stdout to a file, appending to the end of file if it already exists
pipe (|) redirects stdout to become stdin for next command