
SORTER2 Processor Tutorial: Identifying Hybrid vs. Non-Hybrid Samples

Table of Contents
Tutorial Description
Introduction
Running SORTER2Processor.py
Running ADMIXTURE with AdmixturePipeline
Running CLUMPAK and Determining Best K
Comparing Phylogeny with CLUMPAK Results
Calculating Raw Consensus Allele Copies per Sample
Synthesizing Results
Tutorial Description
This tutorial will go over the following steps:
Use the SORTER2_Processor.py script to:
Filter alignments generated by stage 1B
Generate consensus references from stage 1B alignments
Use the consensus references to call SNPs and generate a VCF file
Use ADMIXTURE/CLUMPAK to get K proportion estimates
Determine optimal K value using Evanno's K
Combine K group proportions with consensus-allele copy number, phylogeny, and taxonomy to confirm hybrid vs non-hybrid samples for phasing in Stages 2+3 of SORTER2
To follow along with this tutorial we will be using the following software:
AdmixturePipeline to facilitate running admixture replicates and filter SNPs
Dependencies: Admixture, PLINK, VCFTools
CLUMPAK via the online portal at: clumpak.tau.ac.il
TriFusion to concatenate alignments output by stage 1B
IQTree to generate a Maximum Likelihood phylogeny
FigTree to visualize phylogenetic tree
Sublime Text to process CLUMPAK output
Excel or Google Sheets to compile data and plot admixture/clumpak output
Powerpoint or Google Slides to visualize results
To simplify set up, we have provided a .yml anaconda environment containing Admixture, PLINK, VCFTools, and IQTree dependencies that can be installed using:
conda env create -f SORTER2Processor.yml
or the software can be installed manually in a new conda environment.
You will need to download and set up the AdmixturePipeline.py scripts on your local machine or HPC. You can add the location of these scripts to your PATH in your bash_profile, or simply place the scripts in the SORTER2Processor_VCF folder where we will be performing admixture analyses.
Introduction
The SORTER2 processor script facilitates filtering of alignments by sample representation and major ortholog cluster for stages 1B, 2, and 3 of SORTER2. Additionally, the script makes consensus references from stage 1B alignments to generate a VCF file for samples processed by stage 1B that will be used to run ADMIXTURE (or STRUCTURE) analyses. This was incorporated to test and confirm hypotheses about hybrid vs. non-hybrid samples in target capture datasets prior to phasing of hybrids using stage 3 of SORTER2. Unless you have a high degree of confidence about which samples comprise hybrid vs. non-hybrid taxa, we suggest performing the following analyses in case there are misidentified or cryptic hybrids in your dataset. Mislabeled samples can affect the accuracy of hybrid phasing with stage 3. These analyses provide a secondary confirmation of hypothesized genotypes and lineages for datasets containing samples from non-model systems, systems that include sampling from novel geographic locations, or samples with unusual morphology that could represent novel lineages or cryptic hybrids.
In this tutorial we will use a dataset of Polypodium s.s. samples consisting of diploid taxa as well as tetraploid and hexaploid hybrids with the goal of confirming hybrid polyploids from non-hybrid diploid samples.
Running SORTER2Processor.py
To begin we ran stage 1A+B on all of the samples in order to generate SORTER2 ‘assembly’ folders for each sample (See tutorial for stages 1A + 1B). Once we have the stage 1B 'assembly' folder for our samples we are ready to set up SORTER2_Processor.py in order to generate consensus-references and a VCF file. We'll use the script to apply filters limiting the consensus references to alignments that had >50% sample representation for the ortholog that had the most samples represented, so that we are only keeping one ortholog per targeted locus (i.e. if a targeted locus resulted in multiple paralogous copies, we are only keeping the ortholog having the highest number of samples). These consensus-references will serve as the references to call SNPs for our VCF file.
To do this, set up the following flags for the SORTER2_Processor.py script:
-wd: path to the working directory; this should be the path to the project folder generated by Stage 1A and needs to end with '/'
-rep: sample representation required to keep alignment (i.e. 0.50 to keep alignments having at least 50% of samples represented)
-mc: number of major ortholog clusters to keep (i.e. a value of 1 will keep the ortholog cluster per targeted locus having the most samples represented)
-al (T/F): option to keep alignment files from stage1B, if set to F then it will only generate consensus references to generate a vcf
-st2 (T/F): option to filter alignments output by Stage 2 based on filtered alignments for Stage 1
-st3 (T/F): option to filter alignments output by Stage 3 based on filtered alignments for Stage 1
-dovcf (T/F): option to make a vcf file from Stage 1 consensus references based on filters, otherwise it will just filter alignments if those options were chosen.
To generate a VCF file from consensus references (derived from stage 1B) with >50% sample representation that only comprises the ortholog with the most samples our python command should look like:
python SORTER2_Processor.py -wd /path/to/projectfolder/ -rep 0.50 -mc 1 -al T -st2 F -st3 F -dovcf T
Set the -al flag to T to keep the filtered consensus-reference alignments, which we will use to generate a phylogeny that contextualizes the Admixture results. We set the -st2 and -st3 flags to F since we have not generated stage 2 or stage 3 alignments in this example. If you want to filter stage 2 and stage 3 alignments based on sample representation and major orthologs from stage 1B without generating a vcf file, set the -st2 and -st3 flags to T and -dovcf to F.
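As a convenience, the flag settings above can be assembled programmatically. The following is only a minimal sketch: the helper name build_processor_command and its keyword arguments are hypothetical and not part of SORTER2; only the flags themselves come from this tutorial.

```python
import shlex


def build_processor_command(project_dir, rep=0.50, major_clusters=1,
                            keep_alignments=True, filter_stage2=False,
                            filter_stage3=False, make_vcf=True):
    """Return the SORTER2_Processor.py argument list for the given options.

    Hypothetical helper: it only formats the flags described in this tutorial.
    """
    if not 0 < rep <= 1:
        raise ValueError("-rep is a proportion, e.g. 0.50 for 50%")
    # The -wd path needs to end with '/', so add it if it is missing.
    if not project_dir.endswith("/"):
        project_dir += "/"

    def tf(flag):
        # The script expects T/F rather than Python booleans.
        return "T" if flag else "F"

    return [
        "python", "SORTER2_Processor.py",
        "-wd", project_dir,
        "-rep", f"{rep:.2f}",
        "-mc", str(major_clusters),
        "-al", tf(keep_alignments),
        "-st2", tf(filter_stage2),
        "-st3", tf(filter_stage3),
        "-dovcf", tf(make_vcf),
    ]


# Print the command used in this tutorial so it can be checked before running.
print(shlex.join(build_processor_command("/path/to/projectfolder")))
```

The returned list can be passed to subprocess.run() if you prefer to launch the script from Python rather than from the shell.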
SORTER2 Processor Output
Once SORTER2_Processor.py has finished running you should see a new folder in your SORTER2 project folder called ‘SORTER2Processor_VCF’ that contains bam files for each sample resulting from mapping reads to our consensus references, the resulting VCF file with SNP data for our samples, as well as read statistics for all samples which are summarized in the ‘VCF_readstats.csv’ file:

The output VCF file has a naming scheme as follows: xLoci_xrep_xMajorClusters_allsnps.vcf, where:
xLoci denotes the number of total loci represented in our consensus references
xrep denotes the sample representation threshold used to keep a locus (i.e. 50% in this tutorial example)
xMajorClusters denotes the number of major ortholog clusters to be kept (i.e. 1 in this tutorial example)
We will use this VCF file as input for admixture analysis which we will use in combination with consensus allele copy number, phylogeny, and taxonomic hypothesis to inform our final assessment of hybrid vs non-hybrid samples.
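If you are handling several runs with different filters, the naming scheme described above can be read back from the file name. This is a small sketch that assumes the file name follows the xLoci_xrep_xMajorClusters_allsnps.vcf pattern exactly; the function name parse_vcf_name is hypothetical.

```python
import re

# Pattern for names such as 348Loci_50rep_1MajorClusters_allsnps.vcf
VCF_NAME = re.compile(r"^(\d+)Loci_(\d+)rep_(\d+)MajorClusters_allsnps\.vcf$")


def parse_vcf_name(filename):
    """Split a SORTER2 Processor VCF file name into its three filter values."""
    match = VCF_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected VCF file name: {filename}")
    loci, rep, clusters = (int(group) for group in match.groups())
    return {"loci": loci, "rep_percent": rep, "major_clusters": clusters}


print(parse_vcf_name("348Loci_50rep_1MajorClusters_allsnps.vcf"))
```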
Another output from our run are the filtered alignments and resulting consensus references generated after applying our filters. These alignments are output into a set of folders found within the ‘diploidclusters’ folder:

The first output folder is named ‘50rep’ (denoting our >50% sample representation filter); this folder contains alignments with >50% sample representations. Within this folder there should be another folder named ‘MajorClusters_1’ (denoting that we only kept one major cluster per targeted locus) containing the final alignments and consensus references based on both sample representation and major ortholog filters. We will use the alignment files (files ending with ‘_al.fasta’) in this folder to generate a phylogeny we will use in concert with our admixture results and consensus allele copy number in order to make our final determination of hybrid vs. non-hybrid samples.
Running ADMIXTURE with AdmixturePipeline
In this example we will run admixture analyses with the AdmixturePipeline, followed by CLUMPAK, to determine the major genetic groups in our data (i.e. referred to as “K groups” in ADMIXTURE and STRUCTURE analyses) and provide evidence for hybrid samples, if they consist of two or more genetic groups reflecting the divergent lineages contributing to the hybrid. AdmixturePipeline facilitates running replicates of admixture analyses across a range of different K values and incorporates different filters to the SNPs in our VCF file. In this example, we are testing K values between 1 and 12, based on our expectations of the number of genetic groups represented in the data. From prior knowledge of major clades and species in our dataset, we hypothesize that the most likely K is between 3 and 5. However, we test up to K=12 to observe that likelihood and cross-validation (CV) error stabilize at higher Ks. This broader range ensures we identify the best-supported K while verifying that higher K values do not explain further variation found in the data.
Once you have activated the anaconda environment with proper software (or otherwise installed and set paths to required software) we are ready to run our ADMIXTURE replicates using AdmixturePipeline.py. Make sure you have set up AdmixturePipeline locally and set paths to its scripts before running, or you may simply import all the latter scripts into the SORTER2Processor_VCF folder where we will be performing our ADMIXTURE runs. To run the pipeline we will be using the following settings:
-m population mapping file = bampopmap.txt (see below)
-v input VCF file = 348Loci_50rep_1MajorClusters_allsnps.vcf
-k minimum K value = 1
-K maximum K value = 12
-c Number of ADMIXTURE cross validation replicates = 30
-R Number replicates for each K = 30
-a Minor allele frequency SNP filter = 0.05
-t minimum distance between SNPs = 100
The -m flag requires a tab-delimited data frame containing one column with each bam file that was used to generate the final VCF file and another with the corresponding population (in our case, the species) for each sample. This file can easily be generated on a UNIX system using the ls command; navigate to the resulting SORTER2Processor_VCF folder and type:
ls *.bam > popmap.txt
This will generate a new text file called ‘popmap.txt’ listing all files ending with the .bam extension. We will use this file to generate the population mapping file 'bampopmap.txt' by adding a tab after each bam file and then typing the corresponding species for each sample (See Example). We will be using the VCF file output by SORTER2Processor as the input for the -v flag and use the -a and -t flags to further filter our SNPs. The -a flag sets a minor allele frequency (MAF) threshold that we will set to 0.05 (or 5%). The MAF threshold removes SNPs where the minor allele represents less than 5% of the samples in order to reduce the number of low-frequency or singleton SNPs. We set the -t flag to 100 bp, the minimum required distance between SNPs, to limit the effect of linkage. SNPs that are closer to each other have a higher likelihood of being correlated, so we want to limit the potential for pseudo-replicates.
The -k and -K flags are set to 1 and 12, respectively, to reflect the range of K values that will be inferred. Based on our previous understanding of clades and species, we expect a range of 4-10 K groups reflecting clades and/or species. How well K values will be resolved relative to species or clades will depend on how variable each species is and how well our sampling captures SNPs that are variable between samples of the same species and/or clade. For example, we might expect a situation where we resolve 4 major K groups corresponding to variable sampling of a single species, two sister species, or a clade comprising several species, each resulting in their own K group depending on how well each was sampled. This is also why incorporating a phylogeny of the sampling to complement the analysis helps contextualize the K groups into further subgroups that may not have been resolved using ADMIXTURE-based methods.
We will set the -c and -R flags to 30 in order to generate replicates for cross-validation and K value replicates. The cross-validation replicates will be used to assess optimal K values represented in the data. For each K value being tested, the corresponding K value replicates generate a distribution of K group proportions for each sample that can vary between runs using different random seeds. The replicates for each K are then summarized with CLUMPAK software using a similarity clustering approach to represent the dominant signal(s) across replicates.
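The manual step of adding a tab and a species name after each bam file can also be scripted if you already have the species assignments in hand. The sketch below is an assumption-laden illustration: the function name build_popmap, the bam file names, and the species labels in the example are hypothetical placeholders, and you would supply your own mapping.

```python
def build_popmap(bam_names, species_by_bam):
    """Return the text of a tab-delimited population map (bam file, species).

    bam_names keeps the order of the bam list (e.g. the lines of popmap.txt);
    species_by_bam is a dictionary you provide for your own samples.
    """
    lines = []
    for bam in bam_names:
        if bam not in species_by_bam:
            raise KeyError(f"no species given for {bam}")
        lines.append(f"{bam}\t{species_by_bam[bam]}")
    return "\n".join(lines) + "\n"


# Hypothetical file names and labels, for illustration only.
example_bams = ["sample01.bam", "sample02.bam"]
example_species = {"sample01.bam": "P_vulgare", "sample02.bam": "P_sibiricum"}
print(build_popmap(example_bams, example_species), end="")
```

Writing the returned text to 'bampopmap.txt' gives the file expected by the -m flag.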
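To make the two SNP filters set with the -a and -t flags more concrete, here is a minimal sketch of what a minor allele frequency calculation and a minimum-distance thinning step look like. This illustrates the concepts only and is not how AdmixturePipeline or VCFtools implement them; the function names and the toy data are hypothetical, and genotypes are assumed to be coded as alternate-allele counts for diploid calls.

```python
def minor_allele_frequency(genotypes):
    """MAF for one SNP; genotypes are alt-allele counts (0, 1, 2) or None if missing."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_freq, 1 - alt_freq)


def thin_positions(positions, min_distance=100):
    """Greedy thinning within one contig: keep a SNP only if it lies at least
    min_distance bp after the previously kept SNP."""
    kept = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= min_distance:
            kept.append(pos)
    return kept


# Toy data: one heterozygote among four samples, and five SNP positions.
print(minor_allele_frequency([0, 0, 0, 1]))
print(thin_positions([10, 50, 120, 130, 260]))
```

A SNP would pass the -a 0.05 filter only if its MAF is at least 0.05, and the thinning step keeps a subset of positions that are at least 100 bp apart.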
Make sure to load the required anaconda environment or software and then we can run the pipeline using:
python AdmixturePipeline.py -m bampopmap.txt -v 348Loci_50rep_1MajorClusters_allsnps.vcf -k 1 -K 12 -c 30 -R 30 -a 0.05 -t 100
After the pipeline has run, our SORTER2Processor_VCF folder should contain several files corresponding to the K replicate runs as well as a results.zip file containing data from all of our replicates which we will use as input for CLUMPAK. Before doing this, let’s gather the CV and log-likelihood replicate data into text files that will help us determine optimal K values for visualization. This data is stored in admixture output files ending with .stdout, but gathering this data manually is tedious, so we wrote a python script called getlog.py that gathers this data for you and is available via the github repository. To run the script all you need to do is give the script the path to the SORTER2Processor_VCF folder using the -wd flag. For simplicity let’s place the getlog.py script in the SORTER2Processor_VCF folder and run it as below:
python getlog.py -wd /path/to/SORTER2Processor_VCF/
This will result in two files named log_likelihood_results.txt and cv_errors.txt containing the log likelihood and cv data which we will use later to determine the optimal K value.
Running CLUMPAK and Determining Best K
The developers of CLUMPAK have provided a convenient online portal that we will use to summarize the dominant signals shown across our replicate admixture runs. We will use the results.zip that was output at the end of our previous Admixture analyses, along with the bampopmap.txt file we generated, as input for the CLUMPAK analysis:

Select ADMIXTURE for the format of the results file and enter your email in order to be notified when the CLUMPAK analysis has finished with the final results. After submitting you will be sent to a page showing the CLUMPAK progress and final results, allowing you to download the results in a zip file once it has finished, as well as summarizing the results with images for each K:

Download the zip file containing our CLUMPAK results, which we will use in the next sections. First, however, let’s determine the optimal K value by reviewing our CV replicate data and then calculating Evanno’s best K based on our log likelihood data. For our CV data, let’s open the cv_errors.txt file we got from the getlog.py output in Excel or Google Sheets to sort the data by K value:

After our data is sorted by K value, we can visualize the results by plotting the replicates: select both columns, go to Insert>Chart, and select 'Scatter Chart':

These results show that our CV values level off around K = 5 with subsequent K values showing similar error magnitudes. However, to refine our assessment, we turn to Evanno's ΔK method, a more conservative and widely accepted approach for determining the optimal K. To do this we will use the BestK module on the clumpak website: https://clumpak.tau.ac.il/bestK.html
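If you would rather summarize the CV replicates numerically than by eye, the mean CV error per K can be computed with a few lines of Python. This sketch assumes cv_errors.txt holds two whitespace- or tab-delimited columns (K value, then CV error) with an optional header row; check your own file before relying on it. The function name and the example values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_cv_by_k(lines):
    """Return {K: mean CV error} from lines of 'K<TAB>CV error' text."""
    by_k = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        try:
            k, cv = int(fields[0]), float(fields[1])
        except ValueError:
            continue  # skip the header row
        by_k[k].append(cv)
    return {k: mean(values) for k, values in sorted(by_k.items())}


# Made-up values, only to show the expected layout.
example = ["K_Value\tCV_Error", "1\t0.80", "1\t0.60", "2\t0.50"]
print(mean_cv_by_k(example))
```

To run it on real output, pass the open file object, for example mean_cv_by_k(open("cv_errors.txt")).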
Before submitting our log-likelihood data we need to process the log_likelihood_results.txt output to remove the headers (i.e. the K_Value and Log_Likelihood row) as well as the log-likelihood values for K = 1. This will give us a tab delimited file I have called evannoKloglike.txt (Example) containing log-likelihoods for K 2-12 which we will use as input into the CLUMPAK module:

Make sure you select the log probability table file option prior to submitting. Calculating Evanno’s ΔK should run quickly, resulting in a chart showing the difference in log-likelihood changes across K runs; where the biggest difference between K’s log likelihood values reflects the optimal K value:

In this case, the greatest ΔK value occurs at K=3, with a secondary peak at K=5. Evanno's ΔK method is preferred because it identifies the most significant hierarchical structure in the data, focusing on the rate of change in likelihood values rather than absolute CV error magnitudes, which can be less reliable for distinguishing fine-scale structure.
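For readers who want to see what the BestK module is computing, the commonly used form of Evanno's statistic is ΔK = |L(K+1) - 2L(K) + L(K-1)| / sd(L(K)), where L(K) is the mean log-likelihood across replicates at K and sd is the standard deviation across those replicates. The sketch below implements that form under stated assumptions: the function name is hypothetical, the input is a dictionary of replicate log-likelihoods per K, and the numbers are invented. It is not the CLUMPAK implementation, so small differences from the website output are possible.

```python
from statistics import mean, stdev


def evanno_delta_k(loglik_by_k):
    """Return {K: delta K} for every K that has both neighbouring K values.

    loglik_by_k maps each K to the list of replicate log-likelihoods
    (at least two replicates per K are needed for the standard deviation).
    """
    means = {k: mean(values) for k, values in loglik_by_k.items()}
    delta = {}
    for k in sorted(means):
        if k - 1 not in means or k + 1 not in means:
            continue  # delta K is undefined at the ends of the range
        second_difference = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        sd = stdev(loglik_by_k[k])
        if sd == 0:
            continue  # avoid dividing by zero when replicates are identical
        delta[k] = second_difference / sd
    return delta


# Invented replicate log-likelihoods for K = 1..4.
example = {1: [-100, -102], 2: [-60, -62], 3: [-50, -52], 4: [-49, -51]}
delta_k = evanno_delta_k(example)
print(delta_k)
print("largest delta K at K =", max(delta_k, key=delta_k.get))
```

Because ΔK needs the K values on either side, it cannot be calculated for the smallest or largest K tested.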
By combining the insights from both methods, we prioritize K=3 as the most likely value for the major population structure. Additionally, we will also examine K=4 as it captures finer sub-structuring hinted at by the secondary ΔK peak and the plateau in CV errors at higher K values. This dual approach provides a conservative interpretation of our results while allowing us to explore additional sub-structuring at higher K values.
Comparing Phylogeny with CLUMPAK Results
CLUMPAK reveals the proportion of ancestry from different genetic groups (K groups) for each individual that is useful for understanding gene flow, admixture, and the degree of genetic mixing between populations. Phylogeny reflects evolutionary relationships based on shared ancestry, focusing on lineage splits and the hierarchical arrangement of taxa over time. Comparing both results can reveal whether K groups (from ADMIXTURE) align with clades (from the phylogeny) or if there is evidence of admixture between clades that phylogenies do not capture. Concordance between ADMIXTURE clusters and phylogenetic clades strengthens confidence in both analyses while discrepancies may point to interesting biological phenomena such as hybridization, incomplete lineage sorting, or misclassification.
Making a Maximum Likelihood Phylogeny with IQTREE2
To generate a phylogeny of our samples we first need to generate a concatenated super-matrix of our filtered ortholog alignments with a corresponding partition file using TriFusion software (https://odiogosilva.github.io/TriFusion/) which simplifies this process.
Simply take the filtered alignment files that we previously generated with SORTER2 Processor (i.e. the alignment files ending with ‘duprem_al.fasta’ in the 1_MajorClusters output folder) and import them into TriFusion by dragging them into the software window and selecting ‘Alignment’ as the input file-type. After importing the alignment files go to the ‘Process’ tab and select ‘Concatenation’. We’ll select the PHYLIP format for our super-matrix output that also automatically generates a partition file for our alignment matrix:

Finally, set the output file name by hitting the ‘Select…’ button next to it and name the output file. For easy reference, let's name the output relative to our reference naming scheme as 50repMajorClusters1_348loci_stage1_allsamples.phy to denote the filter settings and number of loci, as well as the fact that it is derived from stage 1 of SORTER2 and contains all of our in-group samples. This will result in a phylip file as above, as well as a partition file with the same naming scheme but having ‘_part.File’ as the file extension. We can then use these files as input for iqtree2 to generate a maximum likelihood tree with 1000 bootstrap replicates:
iqtree2 -s 50repMajorClusters1_348loci_stage1_allsamples.phy -p 50repMajorClusters1_348loci_stage1_allsamples_part.File -m MFP -B 1000 --safe -T AUTO
The -m MFP option denotes assessing optimal substitution models for each partition, -s denotes the concatenated phylip alignments, -p denotes the partition file, and -B denotes bootstrap replicates (See http://www.iqtree.org/doc/ for in-depth documentation of all iqtree options). After running iqtree and obtaining a phylogeny of our data (the file with the .contree extension) we need to properly format the tree so that we can visualize it with our ADMIXTURE/CLUMPAK results.
Use FigTree software to open the .contree file containing our phylogeny. Before opening the tree, FigTree will ask you to set a label for node values, which represent bootstrap support; this can be kept as the default ‘label’ option. Prior to outputting our tree for comparison to admixture results we want to apply some transformations on the tree in order to simplify our visualization. First, we will root the tree with our outgroup, in this case the sample labeled ‘WD06b_Pleurosoriopsism’, by selecting the branch leading up to this sample and then clicking the ‘Reroot’ button at the top of the software window:

Next we select the ‘Trees’ tab on the left to organize the tips so that they are ordered by increasing branch length and then transform the branches into a cladogram so that all tips are aligned equally. Also select the ‘Node Labels’ tab and select the 'label' item from the drop down menu in order to show our bootstrap values:

Formatting our tree in this way will facilitate plotting our ADMIXTURE and CLUMPAK results next to each tip with their corresponding ADMIXTURE K proportions. Let's export our reformatted tree as a PDF file by going to File>Export PDF…
Processing K Proportion Data from CLUMPAK Output
Now we need to organize our CLUMPAK results for K = 3 and K = 4 so that they match the order of the samples in our phylogeny, allowing us to visualize both sets of results side by side. Open the zip file that was output by our CLUMPAK run, resulting in a series of files and folders related to our analysis:

To get the raw K proportion matrices for K = 3 and 4 navigate to their corresponding folders as shown above, then navigate to MajorCluster>CLUMPP.files folders which will contain a series of files representing our K proportion matrices:

We are interested in the files named ‘ClumpIndFile.output’ for each K value that contain the K proportions for our samples based on each K value. Let’s open these files with SublimeText in order to retrieve the data, resulting in the following:

We are only interested in the proportion columns following the colons; in order to isolate these values we will use Ctrl + F (or Command + F for mac) to find all colons in the file by typing ‘:’ in the search prompt and then selecting ‘Find All’:

This will select all the colons found in the file and will allow us to collectively select all of the data to the right of them. Once the colons are highlighted we can isolate the K proportions by first hitting the right arrow key to deselect the colons, placing our cursors next to the first column of our K proportions. Next, hold Shift and use the right arrow key to highlight the proportions:

Let's copy this data and paste it into a new empty text file (shortcuts: ctrl/command+C to copy and ctrl/command+V to paste). In the current format the columns are delimited by a space, but to facilitate exporting this data into Excel/Google Sheets let’s replace the spaces with tabs (alternatively you can import space-delimited data and then have the software detect columns based on the delimiter). To do this use ctrl/command+F as before to select all of the spaces in the file and then press the tab key to replace each space with a tab:

Do this for both the K=3 and K=4 files and either save these as .tsv files which can be imported into Excel/Google Sheets, or simply select all of the data and paste it directly into an empty sheet to automatically place the data into their own columns. In order to see what samples correspond to each K proportion row, copy+paste the popmap.txt list we previously made into the sheet as its own column. The K proportions follow the order of the bam files in the popmap.txt list, which is numerical and alphabetical:

I have included K proportions for the K=3 and K=4 results in the same data frame separated by an empty column to make sure they aren’t confused when plotting.
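As an alternative to the Sublime Text selection steps described in this section, the proportions to the right of each colon can be pulled out with a short script. This is a sketch, not part of CLUMPAK or SORTER2: it assumes each sample line in ‘ClumpIndFile.output’ contains a single colon followed by space-delimited proportions, and the function names and example lines are hypothetical.

```python
def extract_k_proportions(lines):
    """Return one list of proportion strings per line that contains a colon."""
    rows = []
    for line in lines:
        if ":" not in line:
            continue  # ignore anything that is not a sample row
        right_of_colon = line.split(":", 1)[1]
        rows.append(right_of_colon.split())
    return rows


def to_tsv(rows):
    """Join the extracted proportions into tab-delimited text."""
    return "\n".join("\t".join(row) for row in rows) + "\n"


# Invented lines that mimic the layout: sample fields, a colon, then proportions.
example = [
    "  1   1  (0)    1 :  0.9000 0.0500 0.0500",
    "  2   2  (0)    1 :  0.3300 0.3400 0.3300",
]
print(to_tsv(extract_k_proportions(example)), end="")
```

The tab-delimited text can be saved as a .tsv file or pasted directly into a sheet, in the same sample order as the original file.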
Calculating Raw Consensus Allele Copies per Sample
Before putting ADMIXTURE and phylogenetic results together, the last metric we need to include in our data frame is the consensus-allele copy number for each sample. To calculate this we need to open the summary csv file output by stage 1B, found in the ‘diploids’ folder and named ‘ALLsamples_consensusallele_csl300_csn20_SUMMARY_TABLE.csv’.
Your own summary csv name might differ slightly depending on what values you chose for the -csl and -csn flags for assembling consensus alleles in stage 1B. This file summarizes how many consensus alleles each sample has for each targeted locus. We are interested in using this data to calculate the total consensus alleles for each sample. If allopolyploids are in the dataset, they should have a higher average consensus-allele count than diploids based on having more than one diploid sub-genome. To calculate the total consensus-allele count for each sample, open the summary csv file in Excel or Google Sheets and then use the SUM() function for each column to calculate the total consensus-allele count for each sample:

Create a new row that will contain our total count and then use the SUM() function to calculate the sum of values for the first sample column. The cell can then be highlighted and dragged to the right in order to calculate sums for the rest of the sample columns. Luckily, the order from our summary csv file is also numeric and alphabetical, matching the order of our admixture K proportions; we just need to transpose the data from a row into a column. Google Sheets has a built-in paste option that does this: right click (or Ctrl+click) and choose ‘Paste Special > Transposed’, resulting in a new column for the sum of consensus-alleles for each sample:

Taking a first look at our results, there seems to be a fairly consistent pattern for the tetraploid hybrid taxon, P. vulgare, having relatively higher consensus allele copies than the other taxa expected to be diploid. Similarly, the hexaploid P. interjectum reveals even higher consensus allele copies. We also have found a handful of samples belonging to P. californicum that have unusually high consensus allele copy numbers compared to the other samples which likely represent the closely related tetraploid taxon P. calirhiza that can be difficult to differentiate from P. californicum. Finally, there are also a couple of samples with unusually low number of consensus-alleles that we will remove from downstream analyses due to poor sequencing quality.
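The per-sample totals computed with SUM() above can also be obtained with a short script, which returns them already arranged one sample per entry. This sketch assumes the summary csv has a header row of sample names, one row per targeted locus, and the locus name in the first column; verify this against your own file. The function name and the toy table are hypothetical.

```python
import csv
import io


def total_alleles_per_sample(csv_text):
    """Sum each sample column of a locus-by-sample consensus-allele table."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    samples = header[1:]  # first column is assumed to hold the locus name
    totals = dict.fromkeys(samples, 0)
    for row in reader:
        for sample, cell in zip(samples, row[1:]):
            if cell.strip():  # treat empty cells as zero
                totals[sample] += int(float(cell))
    return totals


# Toy table with two loci and two samples.
example = "locus,sampleA,sampleB\nL1,2,4\nL2,1,3\n"
print(total_alleles_per_sample(example))
```

For a real file, read its contents first, for example with open(path).read(), and pass the text to the function.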
Synthesizing Results
The last thing we need to do in order to compare all of our results is to reorder the rows so that they match the order of the tips in our transformed phylogeny, allowing us to visualize admixture data next to the tips on the phylogeny. Currently the only option to do this is to manually re-order samples based on the order seen in our phylogeny (I am working on a script that can automate this given a dataframe and a phylogeny).
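Until such a script is available, the reordering can be approximated with a few lines of Python. This is only a rough sketch and not the script mentioned above: it assumes you have a Newick string whose tips appear in the same top-to-bottom order as the displayed tree, that tip labels are unquoted and match the sample names in the data frame exactly, and that each row is a dictionary with a 'sample' key. All names here are hypothetical.

```python
import re


def newick_tip_order(newick):
    """Return tip labels in the order they appear in a Newick string.

    Tips are the labels that directly follow '(' or ','; internal node
    labels such as bootstrap values follow ')' and are therefore skipped.
    """
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


def add_phylo_order(rows, tips, name_key="sample"):
    """Sort rows to match the tip order and add a 1-based PhyloOrder index."""
    rank = {name: i for i, name in enumerate(tips, start=1)}
    missing = [row[name_key] for row in rows if row[name_key] not in rank]
    if missing:
        raise KeyError(f"samples not found in tree: {missing}")
    ordered = sorted(rows, key=lambda row: rank[row[name_key]])
    return [dict(row, PhyloOrder=rank[row[name_key]]) for row in ordered]


# Toy tree and rows for illustration.
tips = newick_tip_order("((B:1,A:2)95:1,C:3);")
rows = [{"sample": "A"}, {"sample": "B"}, {"sample": "C"}]
for row in add_phylo_order(rows, tips):
    print(row)
```

The resulting PhyloOrder values can be pasted into the sheet as a new column and used with the sort functionality.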

After reorganizing our data-frame so that samples match the tip order found in the phylogeny I created a new column called ‘PhyloOrder’ to index the phylogenetic order, this way we can switch between the raw numerical and alphabetical order and the phylogenetic order, if needed, by using the ‘sort sheet’ functionality. We are now ready to visualize Admixture K proportions and place them next to our phylogeny. To do this start with the K=3 proportions by highlighting them and then selecting ‘Insert > Chart’ and choosing a 100% stacked column chart. This can be repeated for K=4, giving us two admixture proportion charts:

Copy paste these charts onto a new PowerPoint/Google Slides file along with the PDF of the phylogeny we previously generated:

Let’s clean up the admixture charts by cropping them to remove the legend and x axis label: right click (or Ctrl+click) on the charts and select ‘Crop’. Then we can rotate the charts so that they line up with the phylogeny. Finally, let's label the clades/species reflected on the phylogeny, along with their average consensus allele count:

Admixture K proportions generally correspond with clades represented on the phylogeny. Across both K = 3 and K = 4 we see a blue K group corresponding to P. sibiricum, a yellow K group corresponding to P. fauriei, and a red K group corresponding to both P. glycyrrhiza and P. californicum. Moving up to K = 4 we see a new green K group corresponding to taxa in the C + S clades.
Our expected allopolyploids show admixture patterns reflective of their subgenomes; the hexaploid P. interjectum has three relatively equally split K groups and two tetraploid clades corresponding to P. vulgare in Asia and Europe reflect two relatively evenly split K groups. These results indicate that P. vulgare in Asia and Europe share P. sibiricum as one of the progenitors but have different K groups representing the second progenitor, indicating that P. vulgare has different parental origins across Europe and Asia.
Average consensus allele counts correspond to higher average values for the tetraploids (658.75-778.76) and hexaploids (998) relative to our diploid samples (~500). We do see K proportion variation for some diploids; P. californicum, P. amorphum, and P. appalachianum reveal a small proportion of the green C+S clade K group at K = 4, and S + C clades show a mixture of the three K groups at K = 3, but they are not equally proportioned as we would expect for polyploid hybrids. This variation most likely reflects shared ancestral variation based on processes like Incomplete Lineage Sorting (ILS), where unsorted ancestral variation corresponds to the different K groups. C+S clades are resolved by being subsumed in a new K group at K = 4. In the case of P. californicum, P. amorphum, and P. appalachianum the lower consensus allele count suggests these are diploids. The latter may still reflect some form of homoploid hybridization or introgression, but the fact that the proportions are consistent across samples suggests this is a consistent pattern in the variation of the species. Previous phylogenetic analyses of these species suggest rapid radiations leading to ILS patterns which would generate this type of variability in K groups.
Taken together with our previous taxonomic hypotheses, these results confirm our allopolyploid vs. diploid hypotheses. Using these results we can remove the hypothesized hybrids' ‘assembly’ folders from the SORTER2 project directory to re-run stage 1B and stage 2 for our non-hybrid diploid taxa. Then stage 3 can be run with the latter progenitor references in order to phase the allopolyploids we identified.