A brief introduction to Genetic Outlier and Association Analysis¶

When we look through a genome to try to find loci that are under divergent selection, two common approaches are outlier analysis and association analysis. Outlier analysis requires just knowledge of the genetics of your samples (plus sample metadata, for example, population groupings) and tries to find loci that behave very differently from the underlying patterns across the genome (with the assumption being that the rest of the genome represents patterns of neutral genetic diversity). Meanwhile association analysis requires some sort of covariate data and tests whether there are any genetic variants statistically associated with this new data (you may have heard the term GWAS). This covariate data can come in the form of phenotype data (e.g., morphology, disease status, physiology measures) or could be spatial (e.g., environmental, climate). Association tests look for sites in the genome where the presence or absence of a variant is highly correlated with the values in the co-variate data, usually through some regression-type analysis.

Throughout this vignette, I will collectively refer to outlier and association analysis as selection analysis. There are a lot of programs that exist currently. You can find a very long (though not exhaustive) list here.

This vignette will start by covering some very simple outlier analyses:

PCAdapt can detect genetic marker outliers without having population¹ designations using a Principle Component Analysis (PCA) approach.
F_ST outlier analysis is an approach that uses pairwise comparisons between two populations and the fixation index metric to assess each genetic marker.

Next, we will conduct some more advanced outlier analyses:

Bayescan looks for differences in allele frequencies between populations to search for outliers.
Baypass elaborates on the bayenv model (another popular association analysis program).

We will cover the pre-processing of program-specific input files, how to run the programs, how to visualise the output, and in some cases we'll need to take extra steps to map the genetic markers of interest back to the SNP data.

Reduced representation versus whole genome sequencing (WGS)

Outlier analysis is often done on reduced representation data. It is important to remember how your genome coverage (the number of genome variant sites / the genome length) will affect your results and interpretation. Often with WGS data, you will see well-resolved 'peaks' with a fairly smooth curve of points leading up to it on either side. From this, we often infer that the highest point is the genetic variant of interest and the sites on either side of the peak exhibit signals of selection because they reside close to, and thus are linked to, the true variant of interest. However, consider that even in WGS data, unless we have every single genetic variant represented (which may not be the case, depending on our variant calling and filtering parameters), it is possible that the genetic variant of interest that we have identified is not the main one, but is simply another neighboring linked SNP to one that is not represented in the data. This problem becomes even more relevant with reduced representation sequencing (RRS), for which the genome coverage may be extremely patchy². Thus with all outlier analysis, but especially so for those using RRS data, remember that your flagged outliers are not exhaustive and may themselves only be liked to the variant that is truly under selection.

Manhattan plot of the association between P-values and chromosomes.

Define your working directory for this project and the VCF file location:¶

code

mkdir ~/outlier_analysis
DIR=~/outlier_analysis
cd $DIR

Our data tree will look like this:

outlier_analysis/
├── analysis
│   ├── bayescan
│   ├── baypass
│   ├── pcadapt
│   ├── summary
│   └── vcftools_fst
├── data
├── programs
└── workshop_material

So let's set up our directories to match this

code

mkdir -p {analysis/{bayescan,baypass,pcadapt,summary,vcftools_fst},data,programs,workshop_material}

Project data¶

The data provided in this workshop contains 5007 SNPs loci for 39 individuals (13 individuals each from 3 different locations). This data has some missingness (i.e., missing SNP calls).

There is also a metadata file containing the individual's unique IDs, assigned populations, and a wingspan measurement for each individual.

Let's grab this data from the project's git repository, place the data files into our data directory, and define the environmental variables VCF and METADATA with the locations of the genetic variant and metadata files, respectively.

code

# Enter data directory
cd $DIR/data

# Download required files to data directory
wget https://raw.githubusercontent.com/GenomicsAotearoa/Outlier_Analysis_Workshop/main/workshop_files/starling_3populations.recode.vcf
wget https://raw.githubusercontent.com/GenomicsAotearoa/Outlier_Analysis_Workshop/main/workshop_files/starling_3populations_metadata.txt

# Set environment variables
VCF=$DIR/data/starling_3populations.recode.vcf
METADATA=$DIR/data/starling_3populations_metadata.txt

Working with your own data

Alternatively, you can also use your own data for this workshop. If so, it is a good idea to thin your SNP dataset down to roughly 5,000 SNPs to ensure compute times are not too long. If you have more than 50 individuals, you may also want to reduce this. If you would like to do this, place your genetic variant and metadata file in the data directory and define VCF and METADATA based on their names.

Across this workshop, we will need the genetic data to be in several different formats. Let's prepare that now. First, we convert the VCF to PLINK and then to BED.

code

# Enter data directory
cd $DIR/data

# Load modules
module purge
module load VCFtools/0.1.15-GCC-9.2.0-Perl-5.30.1
module load PLINK/1.09b6.16 

# Convert VCF to PLINK
vcftools --vcf $VCF --plink --out starling_3populations.plink

# Convert PLINK to BED
plink --file starling_3populations.plink --make-bed --noweb --out starling_3populations

Output

VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
        --vcf /nesi/nobackup/nesi02659/outlier_analysis_workshop/data/starling_3populations.recode.vcf
        --out starling_3populations.plink
        --plink

After filtering, kept 39 out of 39 Individuals
Writing PLINK PED and MAP files ... 

Unrecognized values used for CHROM: starling4 - Replacing with 0.
Done.
After filtering, kept 5007 out of a possible 5007 Sites
Run Time = 0.00 seconds

Output

PLINK v1.90b6.16 64-bit (19 Feb 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to starling_3populations.log.
Options in effect:
  --file starling_3populations.plink
  --make-bed
  --noweb
  --out starling_3populations

Note: --noweb has no effect since no web check is implemented yet.
128824 MB RAM detected; reserving 64412 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (5007 variants, 39 people).
--file: starling_3populations-temporary.bed +
starling_3populations-temporary.bim + starling_3populations-temporary.fam
written.
5007 variants loaded from .bim file.
39 people (0 males, 0 females, 39 ambiguous) loaded from .fam.
Ambiguous sex IDs written to starling_3populations.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 39 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.723331.
5007 variants and 39 people pass filters and QC.
Note: No phenotypes present.
--make-bed to starling_3populations.bed + starling_3populations.bim +
starling_3populations.fam ... done.

I will say population for simplicity throughout this vignette. However, equally we can test for differences between sample sites, subpopulations, and other types of groupings. What counts as one 'group' of organisms will depend on your study system or question. ↩
You may also want to consider linkage blocks. ↩