SORTER2-Toolkit | Endemic Bioinformatics

Sorter of Orthologous Regions
for Target-Enrichment Reads
v2.0 Toolkit

GitHub

About SORTER2-Toolkit

Step-by-Step Tutorials

About SORTER2-Toolkit

Please Visit the GitHub Page to Download Pipeline Scripts

SORTER2-Toolkit comprises several python scripts to process paired-end illumina reads for Target-Capture datasets (e.g. Reduced Representation, Probe or Bait-Capture, Target-Enrichment, Hyb-Seq, Amplicon Sequencing, Exome Sequencing, etc..). The pipeline outputs orthologous and phased fasta alignments for non-hybrid organisms as well as alignments including allopolyploid or homoploid hybrid sequences phased into putative progenitor sequence sets, relative to non-hybrid (i.e. 'progenitor') samples. With SORTER2-Processor, users can generate filtered references and get a VCF file amenable for STRUCTURE/ADMIXTURE analyses to help guide the identification of putative progenitor and hybrid samples. In combination with output summary statistics for ortholog copy number (i.e. allopolyploids will have a higher number of sequence copies per ortholog) and other system specific information (i.e. morphology, geography, ecology), SORTER2 provides a robust framework for incorporating hybrid organisms into phylogenetic analyses to identify progenitors for numerous hybrid samples in a single analysis.

SORTER2-Toolkit is an updated version of the previous SORTER pipeline to provide an easier user experience, improve compatibility, as well as provide additional summary statistics to inform hybrid vs non-hybrid hypotheses. Additionally, we have included a new tool (SORTER2_Processor.py) that can: A) Filter output alignments for all stages based on sample representation and number of ortholog clusters per targeted loci (i.e. if a targeted locus resulted in more than one paralogous copies, you can choose how many or these orthologs to retain for downstream analyses), and B) convert the latter filtered orthologous data into consensus references and call SNPs to output a VCF (Variant Call Format) file that can be used for ADMIXTURE or STRUCTURE analyses and help guide the identification of hybrids.

The three overarching goals of the pipeline are to:

1. Build multiple sequence alignments of orthologous sequences for loci generated from paired-end reads. The initial purpose of this pipeline was to reduce paralogous multi-copy sequences and handle heterozygosity by using identity clustering within and among samples, respectively, for the same reference locus. The pipeline generates consensus alleles from heterozygous variants by clustering highly similar contigs (i.e. 99% sequence similarity) associated with the same reference locus and sample to generate consensus sequences presumably representing allelic variation at IUPAC ambiguity sites. The pipeline then maps consensus alleles for all samples to the target references and does a second round of clustering among samples to separate potential paralogs. Given appropriate clustering identity, this can filter and separate paralogs derived from the same locus into separate orthologous sets across all samples, effectively generating additional loci for analysis when paralogs are present. We are in the process of developing a clustering threshold optimizer to assess the best clustering threshold relative to sample representation across clusters.

2. Consensus allele sequences are phased into respective bi-allelic haplotypes for each presumably diploid sample. Phased bi-allelic haplotypes derived from previously determined orthologs are used to build alignments where each sample is represented by two sequences representing heterozygous or homozygous alleles.

3. Infer hybrid haplotypes based on similarity to potential progenitor samples. This generates a multiple sequence alignment where hybrid samples can have several (phased) tips, corresponding to alternative progenitors, depending on the number of hybrid haplotypes present in the sample. This usually works better for allopolyploid hybrids where we expect low levels of inter-homeologous recombination, but the pipeline has been successful in phasing haplotypes from homoploid hybrids (i.e. Parental 'pseudo-haploblocks' which may have variable levels of diploid recombination between progenitors). The level of recombination in homoploid hybrids and the divergence between their progenitors will influence the relative resolution of the hybrids' phylogenetic relationships. Despite recombination, the pipeline usually generates resolved progenitor hybrid hypotheses at shallow phylogenetic scales (i.e. subspecies) even if they have lower support due to recombination based discordance (Mendez-Reneau et al., 2024)

Latest Updates

Updates September 29, 2024:

Modified Stage1A_TrimSPAdes.py to allow flexibility in running Trim Galore and Spades in separate runs for the same project.

Updates (as of August 19, 2024):

Added HaplOMiner.py script allowing users to generate haploid genome sequences for organellar genomes (i.e. plastid or mitochondrial) from TC "By-catch" data, given a genome reference.

Updates (as of release date August 13, 2024):

Added SORTER2_FormatReads.py script to easily format paired end read files into the required file name formatting required for SORTER2
Integrated all previously external scripts to simplify set up
All stages automatically output alignments that have any sequences with identical IDs filtered so that only the longest sequence is kept; multi-copy sequences can be inspected in the unaligned “_allsamples_allcontigs.fasta” files.
Stages 2 and 3 now uses bcftools instead of vcfutils.pl to output phased sequences. This was causing compatibility issues in some systems and is the most recently suggested method.
Stage 3 now outputs two summary statistic csv files for hybrids: 1. “ProgenitorSummary.csv” summarizing how many loci have 0, 1, 2, etc… progenitors for each hybrid sample and 2. “ProgenitorDistributions.csv” showing how many unique progenitors are represented for each sample at each output alignment.
Included the 'filter undifferentiated sequences' -fp (T/F) flag for stage 3, outputting a second set of alignments that only keep hybrid sequences which had two differentiated progenitors represented per locus. This might be useful for allopolyploids where we expect multiple progenitor sequence sets to be represented per locus.
Inclusion of SORTER2_Processor.py allowing conversion of ortholog alignments generated in Stage 1B into consensus references that is be used to map reads and generate a VCF file. This script can also use the defined filters to filter alignments generated by Stages 2 and 3.
Added SORTER2_ProgenitorProcessor.py allowing users to combine phased hybrid sequences based on shared clades when sequence sets map to several closely related species, outputting new summary statistics with the collapsed alignments. The pipeline also allows users to designate and filter outgroup taxa as well as setting a minimum sequence count to keep a progenitor sequence set across alignments to facilitate processing of Stage 3 alignments.

Dependencies

SORTER2-Toolkit was written with the most recent versions (unless otherwise noted) of the following software and is compatible with Python 3.8+. All required software is available via installation in conda envrionments (Except for USEARCH, see below). Please feel free to email us at contact@endemicbio.info if you are having compatibility issues.

Software

Python 3.8+ with BioPython, Pandas, and NumPy libraries
Trim Galore
bwa
SAMTools v1.19+ with htslib v1.19+
BCFTools
MAFFT v 7.450
SPAdes v 3.11.1
Seqtk
Trimal (With FASTQC and cutadapt dependencies)
EMBOSS 6.6 (For SORTER2_Processor.py)

vcfutils.pl script from bcftools (For SORTER2_HaplOMiner.py)

USEARCH, Directions:

Download the binaries from their website, the conda version appears to have bugs that cause the pipeline to fail.
Decompress the .gz file using gzip -d or gunzip
Rename the USEARCH file (e.g. will be something like usearch11.0.667_i86linux32) to 'usearch'
Make sure to add the directory containing the usearch to your path before running SORTER2 (i.e. using export PATH=$PATH:/path/to/usearch_directory/ or changing your .bash_profile to include the usearch directory path)

Scripts

SORTER2_FormatReads.py
SORTER2_Stage1A_TrimSPAdes.py
SORTER2_Stage1B_AssembleOrthologs.py
SORTER2_Stage2_PhaseOrthologs.py
SORTER2_Stage3_PhaseHybrids.py
SORTER2_Processor.py

SORTER2_ProgenitorProcessor.py

SORTER2_HaplOMiner.py

File Formatting

In order to run SORTER2-Toolkit, two key formatting schemes are required for raw paired-end FASTQ read files and locus FASTA references:

1. Paired-end FASTQ reads must be uncompressed (i.e. using gzip -d or gunzip), and for each sample must have the following naming scheme:

ID_Species_(R1/R2).fastq

Where:

'ID' represents a unique identifier (i.e. any combination of letters and or numbers) specific to the sample/individual.)
- Note: do not include 'R1' or 'R2' in the Unique ID string, it messes up the trimming step when moving R1/R2 files, working on a fix.
'Species' is a standardized name for the specific taxon the sample represents so that multiple samples belonging to the same taxon share the same label (if species origin is unclear or may represent a hybrid, make your best guess, this can be changed downstream as you gain insight into your dataset through downstream analyses)
'R1/R2.fastq' are suffixes for the paired end reads of the sample
You can use the script SORTER2_FileFormat.py to easily reformat FASTQ file names as long as the FASTQ files have identical formatting except for having 'R1/R2' somewhere in the file name to designate paired end read files for the same sample. The script takes a .csv file with the following columns:
- Full file name for (only) the R1 FASTQ file | Unique ID for sample | Species name

2. Reference file must be a FASTA file where each locus reference has the following sequence ID format:

>L1_*

>L2_*

>L3_*

etc...

The first part of the sequence ID represents an index for reference loci starting with the letter 'L', followed by the locus number, and is ended by an underscore '_'
Multiple references may be used for the same locus, but they must have the same locus number if they represent variants of the same targeted locus
The asterisk '*' may be any combination of compatible characters (i.e. can't include '>' for FASTA format) that give the reference a specific name. This can be used to differentiate variants of the same targeted locus.

Stage 1A + 1B

Stage 1A: Trim Reads and Assemble Contigs

After uncompressed paired end read files have been properly named (See File Formatting), Stage 1A will run Trim Galore and then assemble contigs using SPAdes, outputting trimmed reads and SPAdes assemblies in a newly made project folder (denoted by the -n flag) with folders ending with ‘_assembly’ for each sample. This sets up the directory structure for all downstream stages of SORTER2.
If you already have spades assemblies for your samples, you can optionally use -spades F to skip this step, but you must put the spades_hybrid_assembly output folder (the pipeline only requires the contigs.fasta to be in the folder) in each respective samples’ “assembly” folder.

Script Name: SORTER2_Stage1A_TrimSPAdes.py
Input: Properly formatted R1/R2 FASTQ files for each sample
Output: Trimmed FASTQ reads with FASTQC quality check and SPAdes contig assemblies for each sample. Each sample has their trimmed FASTQ's and SPAdes assemblies output into their own folder ending with '_assembly'. To save space the pipeline removes raw FASTQ files and only keeps the trimmed reads output by Trim Galore for use in stages 2 and 3. All sample assemblies are output in a project folder that are given a project name with the -n flag; if you are running trim galore and spades separately, simply use the same project name between runs for the -n flag. The project name will be the name of the working directory for all downstream stages of SORTER2.
Example Use:
- python SORTER2_Stage1A_TrimSPAdes.py -n projectname -trim T -spades T

Stage 1B: Assemble Orthologs

Stage 1B takes contigs generated by SPAdes to generate "consensus alleles” for each sample by clustering contigs that are mapped to the same reference at 99% identity in order to collapse heterozygous contig variants into a single sequence. Consensus alleles for all samples are then compiled according to reference locus for a second round of clustering among samples (-cl flag) to separate paralogous locus sets. Sensitivity analysis of a range of identities [i.e. 70-90%] is recommended to assess the relative effect of consensus allele copy number per locus-cluster, but Edgar (2010) recommends at least 75% identity thresholds for retrieving orthologs. Generally speaking, a dataset containing taxa of increasing evolutionary distance will probably require lower identity thresholds (e.g. 70-75%) than more closely related taxa or subspecies (e.g, 80-90%). Higher clustering thresholds are more conservative in calling orthologs, but you may lose sample representation across loci if a higher clustering threshold ends up separating orthologs based on lineages and clades rather than paralogs, if present. If you don't suspect paralogs to be present in your dataset we still recommend a low clustering threshold (i.e. 70%) which helps remove erroneous or contaminated sequences from downstream alignments.
Within the project folder, Stage 1B outputs MAFFT aligned fasta files of consensus alleles for clustered loci in the "diploids" folder with files ending in "duprem_trimmed_deint.fasta". The pipeline automatically removes sequences that had identical sequence ID’s due to multiple sequences prior to alignment and trimming, but multi-copy consensus allele sequences can be inspected in the locus files ending with "allbaits_allcontigs.fasta" within the "diploid_clusters" folder. A csv summary table for all retrieved consensus-alleles per sample per locus prior to among sample clustering can be found in the "diploids" folder, along with consensus allele files corresponding to references and samples. A secondary summary csv file found in the "diploidclusters" shows consensus-allele distributions across samples and loci after clustering.

Script Name: SORTER2_Stage1B_AssembleOrthologs.py
Input: Assembly folders output by Stage1A
Output:
- Trimmed FASTA alignment files with duplicate sequences removed in the 'diploids' folder (files ending with 'duprem_trimmed_deint.fasta' corresponding to clustered orthologs: LX_clX, where LX is the original targeted locus and clX represents the specific cluster of the targeted locus after clustering. These are the final output alignment files that can be used for phylogenetic analyses. These alignments can be filtered by sample representation and major clusters using SORTER2 Processor.
- 'diploids' folder containing all assembled consensus-alleles for each sample (files with sample ID_species_ labels) and all consensus alleles for all samples per unclustered locus (LX_allsamples_allcontigs.fasta files). A csv file summarizing consensus alleles across unclustered loci is output in this folder with the naming scheme: ALLsamples_consensusallele_cslX_csnX_SUMMARY_TABLE.csv, where csl represents contig length threshold and csn represents the number of contigs kept for consensus-allele assembly.
- 'diploid_clusters' containing the raw clustered files output by USEARCH (LX_clX_ files), as well as MAFFT aligned cluster files prior to trimming. Files ending in 'allsamples_allcontigs.fasta' are unaligned clustered loci. A csv file named 'ALLsamples_allclusterbaits_cid.70_SUMMARY_TABLE.csv' summarizes distribution of consensus alleles for all samples across clustered loci, where cid.XX represents the chosen clustering threshold.
- Within each sample assembly folder is a file ending with 'contigmap.fa' containing all SPAdes contigs mapping to reference loci for the given sample. Files ending in '_cons.fa' represent the consensus alleles generated for the sample for each reference locus.
Example Use:

python SORTER2_Stage1B_AssembleOrthologs.py -wd /path/to/projectfolder/ -ref /path/to/fastareference.fa -loci 455 -cl 0.80 -reclust F -csn 20 -csl 300 -al 1000 -indel 0.2 -idformat onlysample

Flag Description:

-wd: path to working directory, should be the path to the project folder generated by Stage1A, needs to end with '/'

-ref: path to fasta reference file

-loci: largest index number for reference loci. The pipeline will output empty files for any loci indices that are not represented up to the largest index number.

-cl: clustering identity threshold to generate orthologs

-reclust (T/F): F will assemble consensus alleles from scratch, using the -csn and -csl flags for assembly. T will use the -csn and -csl values used in a previous Stage1B run to assembly consensus alleles; this option is included to speed up re-runs at different clustering identity thresholds.

-csn: number of SPAdes contigs per reference locus per sample to keep for consensus allele clustering

-csl: minimum sequence length of SPAdes contigs used to generate consensus alleles.

-al: number of MAFFT alignment iterations

-indel: indel trimming threshold for TrimAL, where a value of 0.2 would require atleast 20% of samples to have an indel represented in order to be kept.

-idformat (onlysample/full): This option determines how samples are annotated in final output alignments. 'onlysample' is the recommended option for linking samples across loci for phylogenetic analyses(i.e. Sequence ID labels as: >ID_species); using 'full' will keep sample ID as well as locus cluster annotation in the sequence ID (i.e. >LX_clX_ID_species), and is not amenable for concatenation or linking sample sequences across loci.

Stage 2: Phase Orthologs

Stage 2 will take sample sample-specific consensus-alleles generated in Stage 1B as references to map sample reads to them for phasing putative bi-allelic haplotypes using SAMtools phase, assuming all samples are diploid. Read mapping statistics including average read depth are output as “readstats_fin.csv” file in the main project folder. As in Stage 1B, phased sequences are filtered for duplicate sample ID’s, choosing the longest variant, prior to aligning and trimming. Phased and trimmed alignments are placed in the "diploids_phased" folder, where each sample has two sequences associated with it denoted by “ph0/ph1” representing biallelic haplotypes. Output alignments can be filtered using SORTER 2 processor.
This phased dataset can potentially be used to identify hybrid/allopolyploid samples by assessing phylogenetic placement and support of samples. Phased haplotypes are more likely to pick up a hybrid signal than consensus alleles generated in Stage 1B because homeologs or recombined haplotypes are more accurately resolved and will cause higher discordance in phylogenetic analyses. If you observe phased sample sequences that don't result as sister to each other or to other samples of its expected taxon and are also associated with poor node support values (i.e. subtending other more resolved clades, or having intermediate placement between such clades with low node support) they may indicate a hybrid individual. You need to filter potential hybrids if you want to use Stage 3 to infer hybrid haplotypes, as the inference is highly dependent on correct identification and labeling of potential progenitors (i.e. accidentally including a hybrid as a progenitor could reduce the ability for the pipeline to accurately identify differing hybrid haplotypes).
In combination with phylogenetic analyses, you can use the SORTER 2 Processor to generate a VCF file of SNPs for all samples from orthologs generated in Stage 1B. This VCF file can be input into ADMIXTURE or STRUCTURE software to gain a better sense of how SNPs are structured across samples and may point to potential hybrids in the data (in combination with other previously mentioned metrics).

Script Name: SORTER2_Stage2_PhaseOrthologs.py
Input: Output of ortholog data generated by Stage1B (i.e. you need to run Stage1B to generate ortholog clusters prior to running Stage 2)
Output:
- Trimmed FASTA alignment files with duplicate sequences removed in the 'diploids_phased' folder (files ending with 'duprem_trimmed_final.fasta') containing phased sequences for orthologs determined in Stage 1B.
- A csv file named 'ALLsamples_allcontigs_allbaits_SUMMARY_TABLE_phased.csv' summarizing phased sequence representation for all samples across loci (after clustering) can be found in the 'diploids_phased' folder.
- A csv file in the project folder named 'readstats_fin.csv' summarizing read mapping statistics (to sample specific references) for all samples after phasing sequences.
Example Use:

python SORTER2_Stage2_PhaseOrthologs.py -wd /path/to/projectfolder/ -pq 20 -al 1000 -indel 0.2 -idformat phase

Flag Description:

-wd: path to working directory, should be the path to the project folder generated by Stage1A, needs to end with '/'

-pq: phase quality parameter, SAMtools -q flag for phasing

-al: number of MAFFT alignment iterations

-indel: indel trimming threshold for TrimAL, where a value of 0.2 would require atleast 20% of samples to have an indel represented in order to be kept.

-idformat (phase/onlysample): This option determines how samples are annotated in final output alignments. 'phase' is the recommended option for annotating phased bi-allelic haplotypes for each sample (denoted by _ph0/_ph1 suffixes) linking samples across loci for phylogenetic analyses (i.e. Sequence ID labels as: >ID_species_ph0/ph1). Linking phased sequence between loci are not representative of chromosomal haplotypes due to lack of linkage information between loci/chromosomes in reduced representation datasets, but are still useful in generating haplotypes and alignments with a more accurate representation of sequence variation. Using 'onlysample' will only keep sample ID as in the Stage1B option, so you will have identical sequence names for phased sequences belonging to the same sample (i.e. >ID_species) and will require filtering of duplicate sequence ID's for use in phylogenetic analyses.

Stage 3: Phase Hybrids

This script will phase and assign orthology to sequences in samples suspected to be allopolyploid or hybrid in origin, outputting multiple-sequence alignments with hybrids represented by two or more pairs of phased alleles, depending on how many progenitor species contributed to the hybrid. It uses the putatively orthologous and phased datasets generated from Stage 1B and Stage 2 as progenitor references to assign identities to different genomes or haplotypes found in a hybrid. Thus, this script assumes that the samples included through Stage 1B and Stage 2 will serve as possible progenitor contributors to each hybrid sample. Due to this, inference can be highly impacted by having miss-identified samples or cryptid hybrids included in the analysis.

The script outputs alignment files, where hybrid sequences are annotated with the name of putative progenitors, allowing for analyses linking hybrid haplotypes across loci without a priori knowledge of linkage or genomes. Progenitor sequence sets having a relatively smaller number of sequences may represent a haplotype that was inferred to different species due to ILS or poor sequence quality, so we recommend preliminary phylogenetic analyses to assess if disparate haplotypes should be further combined (i.e. the two sets result in a shared and well supported clade). Based on the latter, we have found that sets belonging to closely related clades or lineages can be combined, assuming they represent variants of the same haplotype.

We have provided the SORTER2 Progenitor Processor to allow users to combine and process such sequence sets from output alignments.

Script Name: SORTER2_Stage3_PhaseHybrids.py
Input:
- Output of ortholog data generated by Stage1B and phased data from Stage 2(i.e. you need to run Stages 1B and Stage 2, generating phased orthologous sequences for putative progenitor species which will serve as references for phasing hybrids)
- A folder within the project directory named "phaseset" (you must create this folder). Assembly folders for putative hybrid samples need to be added to this folder to be processed by Stage 3. You can generate assembly folders for hybrids by using Stage 1A and then add them to the phaseset folder prior to running Stage 3.
Output:
- Trimmed FASTA alignment files with duplicate sequences removed in the 'phaseset/diploidclusters_phased' folder (files ending with 'duprem_trimmed.fasta') containing the phased Stage 2 dataset as well as phased hybrid sequences.
- if the -dp flag is set to 'T', an additional set of alignments ending in 'differentiated.fasta' will have Stage 2 sequences and only include hybrid sequences which had more than 2 unique progenitor species represented per locus (i.e. if a sample only had one progenitor species associated with its sequences in a locus, then those sequences are removed from the alignment). This is a conservative filter for allopolyploids ensuring you are only keeping sequences where more than one homeolog is represented per locus per sample.
- A csv file named 'phaseset_readstats_fin.csv' in the 'phaseset' folder summarizing read mapping statistics for all hybrid samples.
- A csv file named 'progenitor_distrubutions.csv' in the 'phaseset/diploidclusters_phased/' directory summarizing how many unqiue progenitor species are represented for each hybrid sample across all loci
- A csv file named 'progenitor_summary.csv' in the 'phaseset/diploidclusters_phased/' directory summarizing the total amount of loci with X number of unique progenitors represented for each sample summed across all loci.
Example Use:

python SORTER2_Stage3_PhaseHybrids.py -wd /path/to/projectfolder/ -ref /path/to/projectfolder/references.fa -loci 455 -csn 20 -csl 300 -pq 20 -al 1000 -indel 0.2 -fp T

Flag Description:

-wd: path to working directory, should be the path to the project folder generated by Stage1A, needs to end with '/' (should be same path used in Stage 1B)

-ref: path to fasta reference file (should be same reference used in Stage 1B)

-loci: largest index number for reference loci. The pipeline will output empty files for any loci indices that are not represented up to the largest index number. (should be same value used in Stage 1B)

-csn: number of SPAdes contigs per reference locus per sample to keep for consensus allele clustering

-csl: minimum sequence length of SPAdes contigs used to generate consensus alleles.

-pq: phase quality parameter, SAMtools -q flag for phasing

-al: number of MAFFT alignment iterations

-indel: indel trimming threshold for TrimAL, where a value of 0.2 would require atleast 20% of samples to have an indel represented in order to be kept.

-fp (T/F): If set to 'T' will output an additional set of alignments ending with 'differentiated.fasta' that only includes hybrid sequences which had atleast 2 unique progenitor taxa represented per locus per sample.

Hapl-O-Miner : Mine Haploid Organellar Genomes from SORTER 2 TC Data

TC datasets often amplify what is referred to as "By-Catch" data, or off-target sequences that are sequenced, generating additional sequence data that can be useful for additional analyses. In plant systems specifically, the abundance of chloroplasts in leaf tissue usually used for DNA extraction and library prep makes it almost inevitable that some chloroplast genome data will be sequenced, which is some cases is enough to generate full or partial reconstructions of chloroplast genomes. Hapl-O-Miner (Haploid Organelle Genome Miner) leverages this by-catch data to generate such organellar genome sequences from TC datasets based on trimmed reads processed by SORTER2 Stage 1A. To run Hapl-O-Miner, simply process raw paired end reads with SORTER2 Stage 1A and provide the working directory containing assembly folders for samples as well as a reference haploid genome to map reads to. Hapl-O-Miner also takes user defined minimum threshold values of genome reference coverage (-c flag) and read depth (-d flag) to filter the final output alignment.

Script Name: SORTER2_HaplOMiner.py
Input:
- Trimmed paired-end read assembly folders output by Stage1A of SORTER2.
- A reference for the haploid genome (i.e. plastid or mitochondrial)
- Minimum thresholds for reference coverage and read depth to keep a sequence in the final output alignment
Output:
- A folder named 'all_chloroplasts' containing FASTA sequences of sample specific assembled genomes, regardless of coverage or read depth.
- a csv file named 'readstats_cp.csv' containing read statistics, coverage, and read-depth relative to organellar genome reference for each sample.
- Within the 'all_chloroplasts' folder is a sub-folder named 'coverageX_depthX_filtered' containing the filtered sequence set and final alignment containing all filtered genomes (file named 'Genomes_coverageX_depthX_filtered.fasta), where 'coverageX' denotes the user defined minimum coverage threshold and 'depthX' denotes the user defined minimum read-depth threshold.
Example Use:

python SORTER2_HaplOMiner.py -wd /path/to/sample_assemblies/ -cpref /path/to/reference.fasta -c 80 -d 10

Flag Descriptions:

-wd: path to directory containing '_assembly' sample folders with trimmed paired-end reads generated by Stage1A.

-cpref: path to organellar genome reference FASTA file

-c: Minimum reference coverage percentage to keep sample sequence in final alignment

-d: Minimum read-depth required to keep sample sequence in final alignment.

SORTER 2 Processor

This processing pipeline was developed to facilitate filtering of alignments output by stages 1-3 of SORTER 2-Toolkit by sample representation and major ortholog clusters. This allows users to select the top ortholog cluster (i.e. the ortholog with the most samples represented) across loci for analysis. Alternatively users can select the top X largest ortholog clusters to assess how incroporating additional paralogs into the alignment dataset affects analyses. Alignment filtering for stages 2 and 3 will be based off of the sample representation and ortholog clusters determined in Stage 1B in order to facilitate analysis comparisons across datasets.

Next, SORTER 2 Processor takes the alignments based on selected filtering parameters and generates consensus references. These consensus references are then used to call SNPs and generate a VCF file for all samples set up in the current Stage 1B run (i.e. The filtering of alignments and resulting VCF file are based on the samples last processed by Stage 1B, as represented by the sample assembly folders found in the project folder). This VCF file can be input into ADMIXTURE or STRUCTURE analyses to help guide and complement hypotheses about what samples represent 'pure' progenitors vs hybrid samples.

We recommend that users first run all in-group samples being studied including putative hybrids, with Stages 1A + 1B and use SORTER2 Processor to confirm hybrid/progenitor hypotheses with ADMIXTURE/STRUCTURE analyses (as well as phylogenetic analyses and other data) to have high confidence in the set of progenitor samples being used for phasing hybrids in Stage 3. Out-group samples for phylogenomic projects that are only used to root the phylogeny can increase computation time and admixture complexity; we recommend running Stage 1B with focal in-groups only by removing out-group sample assembly folders from the project directory (i.e. the output directory from Stage 1A), prior to running Progenitor Processor to facilitate the generation and analysis of VCF file. We have found that, in many cases, datasets can include mislabeled samples (i.e. samples corresponding to the wrong taxon due to mislabeling or contamination) and or samples that represent un-identified or cryptic hybrids. Including such samples as progenitors for Stage 3 phasing can affect the accuracy of hybrid phasing if hybrid haplotypes are erroneously associated with a wrongly labeled progenitor species, and or cryptic hybrids with conflicting genetic signals.

Script Name: SORTER2_Processor.py
Input: Output of ortholog data generated by Stage1B (i.e. you need to run Stage1B to generate ortholog clusters prior to running). Alternatively, users can also use the script to filter alignments output by Stages 2 or 3 to correspond to the filtered Stage 1B alignments.
Output:
- Filtered FASTA alignments based on filtering parameters that can be found in folders corresponding to each stage: 'diploid_clusters' (for Stage 1), 'diploids_phased' (for Stage 2), and or 'phaseset/diploidclusters_phased/' (for Stage 3)
- Consensus references and a VCF file with called SNPs for all samples based on the alignment filtering parameters output in the 'SORTER2Processor_VCF' folder found in the project directory.
Example Use:

python SORTER2_Processor.py -wd /path/to/projectfolder/ -rep 0.50 -mc 1 -al T -st2 T -st3 F -dovcf T

Flag Description:

-wd: path to working directory, should be the path to the project folder generated by Stage1A needs to end with '/'

-rep: sample representation required to keep alignment (i.e. 0.50 to keep alignments having atleast 50% of samples represented)

-mc: number of major ortholog clusters to keep (i.e. a value of 2 will keep the two ortholog clusters per targeted locus having the most samples represented)

-al (T/F): option to keep alignment files from stage1B, if set to F then it will only generate consensus references to generate a vcf

-st2 (T/F): option to filter alignments output by Stage 2 based on filtered alignments for Stage 1

-st3 (T/F): option to filter alignments output by Stage 3 based on filtered alignments for Stage 1

-dovcf (T/F): option to make a vcf file from Stage 1 consensus references based on filters, otherwise it will just filter alignments if those options were chosen.

Progenitor Processor

Stage 3 of SORTER2 outputs phased sequence sets for hybrid samples corresponding to putative progenitor taxa, but this often results in disparate sequence sets associated with several closely related taxa belonging to a shared clade that need to be combined in order to generate the proper haplotype sets. Additionally, smaller sequence sets associated with outgroup taxa can also arise and require filtering prior to analysis. Progenitor Processor allows users to designate outgroup taxa to be excluded from sequence sets and define progenitor clades that should be combined into a single sequence set to facilitate the processing of alignments output by Stage 3. The script also allows users to exclude progenitor sequence sets having less than X sequences across all input alignments (-min flag).This script allows users to re-filter differentiated sequences (using -dif T) to filter hybrid sequences so that only sequence sets that have at-least two disparate progenitor sequences are kept per locus. The script removes duplicate sequence ID's if combining sequence sets caused duplicate sequence ID's to be present.

Script Name: SORTER2_ProgenitorProcessor.py
Input:
- Set of fasta alignments with hybrid sequence sets output by Stage 3 of SORTER 2.
- A csv file with the following columns (headers must be included):
  - hybrid | progenitor | clade
  - hybrid column contains specific hybrid taxon names to be checked for collapsing
  - progenitor column contains set of progenitor taxa associated with hybrid
  - clade contains the clade set for the given progenitor that should be combined into a single set
- A text file containing the names of outgroup taxa (one per line) that should be excluded from progenitor sets
Output:
- FASTA alignments ending with 'cladecollapse.fasta' or 'cladecollapse_differentiated.fasta' that have progenitor sequences for hybrids renamed based on clade designations defined in the input csv file as well as additional outgroup and differentiated sequence set filters chosen by the user.
- A csv file named 'progenitor_distrubutions.csv' summarizing how many unqiue progenitor species are represented for each hybrid sample across all processed loci alignments
- A csv file named 'progenitor_summary.csv' summarizing the total amount of loci with X number of unique progenitors represented for each sample summed across all loci alignments.
Example Use:

python SORTER2_ProgenitorProcessor.py -wd /path/to/stage3fastafolder/ -map csv_mapfile.csv -og outgroup_exclude_textfile.txt -min 20 -dif T

Flag Description:

-wd: path to directory with fasta alignments to process, as well as input csv and text files.

-map: csv mapping file designating hybrids taxa to process and what progenitor taxa to combine into the same progenitor sequence set

-og: text file containing taxon names to exclude from sequence sets

-min: minimum number of phased sequence pairs required to keep a progenitor sequence set per hybrid sample, after combining. (i.e. a value of 20 would require atleast 20 phased sequence pairs, for a progenitor sequence set to be kept after combining based on shared clade)

-dif (T/F): filter hybrid sequences so that only sequences with at least two disparate progenitors are represented per locus per sample. Hybrid sample sequences having only one progenitor represented per locus are filtered from that alignment, but all other sequences are kept, given filtering parameters.

Step-by-Step Tutorials

SORTER2 Tutorial: Stages 1A + 1B

This tutorial goes over setting up dependencies, file formatting, and output files generated by stages 1A and 1B of SORTER2

SORTER2 Processor Tutorial: Identifying Hybrid vs. Non-Hybrid Samples

This tutorial reviews how to use the SORTER2 Processor script to test hypotheses about hybrid vs non-hybrid samples from stage 1B output

EndemicBio.Informatics

Sorter of Orthologous Regions
for Target-Enrichment Reads
v2.0 Toolkit

About SORTER2-Toolkit

Latest Updates

Dependencies

File Formatting

Stage 1A + 1B