PCAdapt¶

PCAdapt uses an ordination approach to find sites in a data set that are outliers with respect to background population structure. The PCAdapt manual is available here.

Citation

Privé, F., Luu, K., Vilhjálmsson, B. J., & Blum, M. G. B. (2020). Performing Highly Efficient Genome Scans for Local Adaptation with R Package pcadapt Version 4. Mol Biol Evol, 37(7), 2153-2154. https://doi.org/10.1093/molbev/msaa053

First, let's install PCAdapt and set your working directory.

code

module load R-bundle-Bioconductor/3.15-gimkl-2022a-R-4.2.1
R

code

library("pcadapt")

setwd("~/outlier_analysis/analysis/pcadapt/")

Now let's load in the data - PCAdapt uses bed file types.

code

starlings_bed <- "~/outlier_analysis/data/starling_3populations.bed"
starlings_pcadapt <- read.pcadapt(starlings_bed, type = "bed")

Produce K plot.

code

starlings_pcadapt_kplot <- pcadapt(input = starlings_pcadapt, K = 20)
pdf("pcadapt_starlings_kplot.pdf")
plot(starlings_pcadapt_kplot, option = "screeplot")
dev.off()

A K value of 3 is most appropriate, as this is the value of K after which the curve starts to flatten out more.

code

starlings_pcadapt_pca <- pcadapt(starlings_pcadapt, K = 3)
summary(starlings_pcadapt_pca)

Output

                Length  Class  Mode
scores            117  -none-  numeric
singular.values     3  -none-  numeric
loadings        15021  -none-  numeric
zscores         15021  -none-  numeric
af               5007  -none-  numeric
maf              5007  -none-  numeric
chi2.stat        5007  -none-  numeric
stat             5007  -none-  numeric
gif                 1  -none-  numeric
pvalues          5007  -none-  numeric
pass             4610  -none-  numeric

Investigate axis projections.

code

poplist.names <- c(rep("Lemon", 13),rep("Warrnambool", 13),rep("Nowra", 13))
print(poplist.names)

pdf("pcadapt_starlings_projection1v2.pdf")
plot(starlings_pcadapt_kplot, option = "scores", i = 1, j = 2, pop = poplist.names)
dev.off()

pdf("pcadapt_starlings_projection5v7.pdf")
plot(starlings_pcadapt_kplot, option = "scores", i = 5, j = 7, pop = poplist.names)
dev.off()

Ignore the warning

Warning message:
Use of `df$Pop` is discouraged. Use `Pop` instead.

Investigate Manhattan and Q-Qplot.

Manhattan plots

This is a way to visualize the GWAS (genome-wide association study) p-values (or other statistical values) at each SNP locus along the genome.

Q-Qplots

This is a quick way to check if your residuals are normally distributed. Check out more information here.

code

pdf("pcadapt_starlings_manhattan.pdf")
plot(starlings_pcadapt_pca, option = "manhattan")
dev.off()

pdf("pcadapt_starlings_qqplot.pdf")
plot(starlings_pcadapt_pca, option = "qqplot")
dev.off()

Plot and adjust the p-values.

code

pdf("pcadapt_starlings_pvalues.pdf")
hist(starlings_pcadapt_pca$pvalues, xlab = "p-values", main = NULL, breaks = 50, col = "orange")
dev.off()

code

starlings_pcadapt_padj <- p.adjust(starlings_pcadapt_pca$pvalues,method="bonferroni")
alpha <- 0.1
outliers <- which(starlings_pcadapt_padj < alpha)
length(outliers)

write.table(outliers, file="starlings_pcadapt_outliers.txt")

Output

[1] 3

After this, we will jump out of R and back into the command line by using the following command:

code

q()

Mapping Outliers: PCAdapt¶

Find the SNP ID of the outlier variants.

code

cd $DIR/analysis

The first thing we will do is create a list of SNPs in VCF and then assign line numbers that can be used to find matching line numbers in outliers (SNP IDs are lost in PCadapt & Bayescan, line numbers are used as signifiers).

We create this in the analysis/ directory because we will use it for more than just mapping the outlier SNPs for PCAdapt, we will also need it on day 2 for BayeScan and BayPass.

code

grep -v "^#" $DIR/data/starling_3populations.recode.vcf | cut -f1-3 | awk '{print $0"\t"NR}' > starling_3populations_SNPs.txt

Now let us jump back into the pcadapt/ directory to continue working with our outliers. We select column 2 of the outlier file using the AWK command, which contains the number of outliers.

code

cd $DIR/analysis/pcadapt
awk '{print $2}' starlings_pcadapt_outliers.txt > starlings_pcadapt_outliers_numbers.txt

Make a list of outlier SNPS IDs.

code

awk 'FNR==NR{a[$1];next} (($4) in a)' starlings_pcadapt_outliers_numbers.txt ../starling_3populations_SNPs.txt | cut -f3 > pcadapt_outlierSNPIDs.txt
head pcadapt_outlierSNPIDs.txt

Output

230955:72:-
238881:46:+
286527:46:-