Population VCF Pipeline

Streamline Population Analyses with VCF-Tools

About PVCF-Pipe

Please Visit the GitHub Page to Download Pipeline Script

PVCF-Pipe streamlines the processing, calculation, and output of population genetics metrics (Pi, Tajima's D, FIS, and observed/expected heterozygosity and homozygosity) generated by VCF-Tools. The pipeline was written because few tools process VCF-Tools output effectively and compile the data for all populations into a single csv file for analysis. It averages each metric across all loci and sites. Pi and Tajima's D are calculated from the unfiltered input VCF (on the assumption that you want to keep all SNPs when calculating diversity); all other metrics are calculated from a VCF filtered by the minor allele frequency and SNP distance parameters supplied by the user. VCF-Tools calculates Hardy-Weinberg equilibrium metrics for each locus, but its output is not formatted for calculating inbreeding coefficients (FIS = (Hexp - Hobs)/Hexp). PVCF-Pipe converts the raw counts into proportions, uses them to calculate FIS for each locus, and averages the per-locus values to give the final output value. Finally, individual-level metrics (from the VCF-Tools --het option) are generated and compiled into a single csv file annotated with the population name for each sample.
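The FIS step described above can be sketched roughly as follows. This is an illustration, not the code in PVCFPipe.py. It assumes the VCF-Tools --hardy output stores observed and expected counts as slash-separated 'HOM1/HET/HOM2' strings, and that loci with zero expected heterozygosity are skipped when averaging; the function names are hypothetical.

```python
def parse_counts(field):
    """Split a 'HOM1/HET/HOM2' string into three floats."""
    hom1, het, hom2 = (float(value) for value in field.split("/"))
    return hom1, het, hom2


def locus_fis(obs_field, exp_field):
    """Return FIS = (Hexp - Hobs) / Hexp for one locus, or None if undefined."""
    obs = parse_counts(obs_field)
    exp = parse_counts(exp_field)
    n_obs, n_exp = sum(obs), sum(exp)
    if n_obs == 0 or n_exp == 0:
        return None
    # Convert raw heterozygote counts into proportions.
    h_obs = obs[1] / n_obs
    h_exp = exp[1] / n_exp
    if h_exp == 0:
        # Monomorphic locus: FIS is undefined (assumption: skip it).
        return None
    return (h_exp - h_obs) / h_exp


def mean_fis(rows):
    """Average per-locus FIS over (observed, expected) string pairs."""
    values = [locus_fis(obs, exp) for obs, exp in rows]
    values = [value for value in values if value is not None]
    return sum(values) / len(values) if values else None
```

For example, a locus with observed counts '3/4/3' and expected counts '2.5/5.0/2.5' has Hobs = 0.4 and Hexp = 0.5, giving FIS = 0.2.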

Latest Updates

Updates (as of August 29, 2024):

  • Release date, see documentation below.

Dependencies

Software

Python 3.8+ with Pandas

VCF-Tools (Tested with version 0.1.16)

Scripts

PVCFPipe.py

File Formatting

PVCF-Pipe requires only two inputs: an uncompressed VCF file containing SNPs for a set of samples representing populations, and a population mapping file relating each sample (usually named after the bam file for the individual used to generate the VCF) to a specific population.

To run PVCF-Pipe properly, make sure the input VCF file is uncompressed and the population mapping text file is tab-delimited (See Example). Double-check that the sample names in the population mapping file match the bam file names for each sample. You can get these names directly from your VCF by running:

vcftools --vcf your_vcf_file.vcf --012

This command outputs a set of files with '.012' in their names. Check the out.012.indv file for the bam file names of the individuals in the VCF file; these names can be used to build the population mapping file.
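The check described above can be automated with a short script. The sketch below is not part of PVCF-Pipe. It assumes the mapping file lists the sample name in the first tab-delimited column and the population in the second (the column order is an assumption), and that out.012.indv holds one sample name per line. The function names are hypothetical.

```python
def parse_indv(lines):
    """Return the sample names listed one per line in an out.012.indv file."""
    return [line.strip() for line in lines if line.strip()]


def parse_population_map(lines):
    """Return {sample: population} from tab-delimited mapping lines."""
    mapping = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {number} is not tab-delimited: {line!r}")
        # Assumed column order: sample name first, population second.
        mapping[fields[0].strip()] = fields[1].strip()
    return mapping


def compare_samples(vcf_samples, population_map):
    """Return (samples missing from the map, map entries absent from the VCF)."""
    vcf_set = set(vcf_samples)
    missing = [name for name in vcf_samples if name not in population_map]
    extra = [name for name in population_map if name not in vcf_set]
    return missing, extra


if __name__ == "__main__":
    with open("out.012.indv") as indv, open("population_map_file.txt") as pop:
        missing, extra = compare_samples(parse_indv(indv), parse_population_map(pop))
    print("In VCF but not in map:", missing)
    print("In map but not in VCF:", extra)
```

Both lists should be empty before running the pipeline.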

Pipeline Description

  • Script Name: PVCFPipe.py

  • Input:

    • An uncompressed VCF file containing SNPs for samples belonging to different populations

    • A population mapping file relating sample IDs (e.g. bam file names, see example) to their respective populations

  • Output:

    • A csv file with the naming scheme 'population_statistics_mafX_thinX.csv' containing average values of Pi, Tajima's D (TjD column), FIS, Expected Heterozygosity, and Observed Heterozygosity for each population. mafX and thinX denote the user-supplied minor allele frequency and SNP thinning filters applied before calculating the metrics; Pi and Tajima's D are the exceptions, as they are calculated from the unfiltered input VCF.

    • A csv file named 'ind_population_stats.csv' containing the individual-level metrics output by the --het option in VCF-Tools (i.e. Observed Homozygosity, Expected Homozygosity, Number of SNP sites, Individual F). The csv file has an additional column denoting the population that each sample belongs to.

    • A folder named 'popstat_files' containing all intermediate and log files produced by VCF-Tools, along with the modified per-population csv files that PVCF-Pipe derives from the VCF-Tools output to calculate averages and format the data for the FIS calculation.

  • Example Use:

python PVCFPipe.py -wd /path/to/inputfiles/ -vcf your_vcf_file.vcf -pop population_map_file.txt -tjd 300 -maf .05 -thin 300

  • Flag Description:

-wd: path to directory containing input VCF and population mapping text file

-vcf: name of uncompressed input VCF file

-pop: name of population mapping text file (See Example)

-tjd: integer representing window size for calculating Tajima's D across loci

-maf: decimal number representing the minor allele frequency filter (e.g. -maf .05)

-thin: integer representing the SNP-thinning filter, i.e. the minimum distance in base pairs between retained SNPs.
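To illustrate how these flags might translate into VCF-Tools calls for a single population, here is a minimal sketch. It is not taken from PVCFPipe.py; the function name, the output prefixes, and the use of --keep with a per-population sample list are assumptions. The VCF-Tools options themselves (--site-pi, --TajimaD, --maf, --thin, --hardy, --het, --keep, --out) are standard. Applying the filters to the --het call is also an assumption.

```python
import shlex


def build_vcftools_commands(vcf, keep_file, prefix, tjd, maf, thin):
    """Return a dict of vcftools argument lists for one population.

    keep_file is a text file listing that population's sample names,
    one per line (hypothetical file, written from the mapping file).
    """
    base = ["vcftools", "--vcf", vcf, "--keep", keep_file]
    filters = ["--maf", str(maf), "--thin", str(thin)]
    return {
        # Diversity metrics use the unfiltered VCF.
        "pi": base + ["--site-pi", "--out", f"{prefix}_pi"],
        "tajima_d": base + ["--TajimaD", str(tjd), "--out", f"{prefix}_tjd"],
        # Remaining metrics use the -maf and -thin filters.
        "hardy": base + filters + ["--hardy", "--out", f"{prefix}_hwe"],
        "het": base + filters + ["--het", "--out", f"{prefix}_het"],
    }


if __name__ == "__main__":
    commands = build_vcftools_commands(
        "your_vcf_file.vcf", "popA_samples.txt", "popA", tjd=300, maf=0.05, thin=300
    )
    for name, args in commands.items():
        # Printed for inspection; subprocess.run(args, check=True) would execute them.
        print(name, "->", shlex.join(args))
```

Building each command as a list rather than a single string avoids shell quoting problems when paths contain spaces.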
