quick popgen analyses using adegenet in R
For this lesson, we will be using a couple of different (and pre-filtered) datasets!! The puma dataset is based on a couple of papers that were recently published in biorxiv, and the stream insect dataset is based on a recent paper that came out in Ecology Letters.
First, let’s take a look at our structure file format (and download the .stru file), as well as our .R
file found here.
Now, let’s get started with our R
code by opening R Studio and then starting a new R script (File>New File>R Script).
Before you get started, make sure you are in the workshop directory where you saved the .stru
and the .R
files:
setwd("~/Desktop/Genomics_USFQ_2019/workshop/epi")
Then, we need to load the libraries that will be used:
library("adegenet")
library("ggplot2")
Now, we import our first SNP dataset, Colorado pumas, and name it myFile by writing:
myFile <- import2genind("pumas-CO.stru") # must be saved as .stru - if you have an error, check its extension
Then R
will ask you to type the answers to these questions directly in the R Console
:
- Question: How many genotypes are there? (inds) Answer:134
- Question: How many markers are there? (SNPs) Answer: 12456
- Question: Which column contains the genotype label (‘1’ if absent)? Answer:1
- Question: Which column contains the population factor (‘0’ if absent)? Answer:2
- Question: Which other optional columns should be read (press ‘return’ when done)? Press return
- Question: Which row contains the marker names (‘0’ if absent)? Answer:1
- Question: Are genotypes coded by a single row (y/n)? Answer: n
If all questions were answered correctly, then it should give you the following message:
Converting data from a STRUCTURE .stru file to a genind object...
Once imported, you can look at your transformed (summarized) genind file that the package adegenet
uses, by typing:
myFile
Now, we can run the code necessary for running a Principal Components Analysis (PCA), but first we need to scale our missing data:
X <- scaleGen(myFile, NA.method="zero")
pca1<-dudi.pca(X,cent=FALSE,scale=FALSE,scannf=FALSE,nf=3)
We define our plotting colors (so that pop1=darkgreen and pop2=darkblue), and then we generate our PCA scatterplot and our PC eigenvalues:
myCol <-c("darkgreen","darkblue")
s.class(pca1$li,pop(myFile), col=myCol)
add.scatter.eig(pca1$eig[1:20], 3,1,2)
Now we are interested in finding potential migrant pumas between our populations, so we want to see where individual labels fall within the PCA plot:
s.label(pca1$li)
Hm, that’s not very informative. Let’s try running a Discriminant Analysis of Principal Components (DAPC), which tries to maximize the difference among pre-defined groups based on the Principal Components.
dapc1<-dapc(X,pop(myFile)) The `R Console` will now ask you a couple of questions based on the graphs. First, it will ask how many principal components you want to keep, we will keep ten for our analysis (so, type `10` into the `R Console`). then, it will ask you how many discriminant functions you want to keep, in this case we only hae two groups thus only one DF, so type `1` into the `R Console`, and then we can see the summary and plot of our DAPC:
summary(dapc1)
scatter(dapc1)
Finally, we can visualize the DAPC as a compoplot, which is analogous to STRUCTURE analyses, where each bar is an individual an the colors are its probability of belonging to a specific group, or similarly, probability of admixture.
compoplot(dapc1,posi="bottomright",lab="", ncol=1,xlab="individuals")
Now we can clearly identify and visualize both migrant individuals and admixed individuals as well. But, it’s still impossible to know which individual ID belongs to migrants and to admixed, so we now type:
assignplot(dapc1, subset=1:40)
assignplot(dapc1, subset=41:80)
assignplot(dapc1, subset=81:100)
assignplot(dapc1, subset=101:134)
Now we can clearly identify each migrant individual by name!
Finally, we can run all the same analyses by tweaking the code just a little bit with this other dataset… can you figure out the PCA colors and how many Principal Components and Discriminant Functions to keep?