This tutorial provides guidelines for implementing and using the adjusted Bayes Factor (BFadj) to detect and mitigate the impact of sampling bias in discrete phylogeographic inference using CTMC modeling (Gámbaro et al., submitted).

Introduction

Discrete phylogeographic analysis integrates geographic information as discrete characters and infers lineage transition events among discrete locations. A discrete phylogeographic analysis is typically followed by a Bayes factor (BF) test to assess the statistical support for inferred transition links by comparing their posterior and prior expectations. The standard BF (BFstd) test does not consider the relative abundance of the involved trait states, which can be problematic when sampling is unbalanced among discrete locations. Although several strategies have been proposed to address sampling bias in discrete phylogeographic analyses that employ continuous-time Markov chain (CTMC) modeling, they might require additional epidemiological information. In this tutorial, we provide practical guidance on how to calculate and apply the adjusted Bayes factor (BFadj), a complementary metric to BFstd that aims to account for the relative abundance of samples across locations. The BFadj helps users evaluate whether observed patterns are likely due to the overrepresentation of genomes from certain locations, without relying on additional external data.

To undertake this tutorial, you will need to download four software packages in a format that is compatible with your computer system (all are available for Mac OS X, Windows and Linux/UNIX operating systems):

BEAST - this package contains the BEAST program, BEAUti and a couple of utility programs. At the time of writing, the current version is v10.5.0-beta5. BEAST releases are available for download from https://github.com/beast-dev/beast-mcmc/releases.
BEAGLE - this is a high-performance library that can perform the core calculations at the heart of most Bayesian and Maximum Likelihood phylogenetics packages. It can make use of highly-parallel processors such as those in graphics cards (GPUs) found in many PCs. Binary installers and installation instructions can be found at https://github.com/beagle-dev/beagle-lib.
Tracer - this program is used to explore the output of BEAST (and other Bayesian MCMC programs). It graphically and quantitatively summarizes the empirical distributions of continuous parameters and provides diagnostic information. At the time of writing, the current version is v1.7.2. It is available for download from https://github.com/beast-dev/tracer/.
FigTree - this is an application for displaying and printing molecular phylogenies, in particular those obtained using BEAST. At the time of writing, the current version is v1.4.4. It is available for download from https://github.com/rambaut/figtree/.

Conducting the tip-state-swap discrete phylogeographic analysis

To calculate the BFadj, we need to conduct two types of discrete phylogeographic analyses: (i) a “standard” discrete phylogeographic analysis and (ii) a “tip-state-swap” discrete phylogeographic analysis in which the location states at the tips are randomly permuted during the run. For a step-by-step tutorial on how to conduct a “standard” discrete phylogeographic analysis, please refer to the following resource: Phylogeographic diffusion in discrete space.
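As a rough illustration of how the two analyses could be combined, the Python sketch below computes a Bayes factor as a ratio of odds. BFstd divides the posterior odds of a transition link by its prior odds. For BFadj, the sketch assumes that the prior odds are replaced by the posterior odds of the same link in the tip-state-swap analysis; this is our reading of the approach rather than a formula stated in this tutorial, so please check it against Gámbaro et al. (submitted). All function names and numerical values are hypothetical.

```python
def odds(p: float) -> float:
    """Convert a probability into odds, p / (1 - p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return p / (1.0 - p)


def bayes_factor(posterior_p: float, reference_p: float) -> float:
    """Ratio of posterior odds to reference odds for one transition link."""
    return odds(posterior_p) / odds(reference_p)


# Hypothetical values for a single transition link (not from a real analysis).
p_standard = 0.90  # posterior inclusion frequency in the standard analysis
q_prior = 0.10     # prior inclusion probability of the link
p_swap = 0.50      # posterior inclusion frequency in the tip-state-swap analysis

# Standard BF: posterior odds over prior odds.
bf_std = bayes_factor(p_standard, q_prior)

# ASSUMPTION: the adjusted BF uses the tip-state-swap posterior as reference.
bf_adj = bayes_factor(p_standard, p_swap)

print(f"BFstd = {bf_std:.1f}, BFadj = {bf_adj:.1f}")
```

A link that is present in every posterior sample has a frequency of exactly 1, for which the odds are undefined; the sketch raises an error in that case rather than choosing a correction.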

To perform the tip-state-swap discrete phylogeographic analysis, at least 100 evenly sampled post-burnin trees from the corresponding standard discrete phylogeographic analysis are needed for use as an empirical tree distribution. Because the analysis relies on empirical trees, the resulting XML file does not include certain XML blocks such as alignment, patterns and treeModel, nor the transition kernels acting on the tree or substitution model parameters. Instead, you will need to specify an empirical tree distribution and include a tip-state-swap operator block, which randomizes tip states during the MCMC run.
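The selection of evenly sampled post-burnin trees can be sketched in Python as follows. The function only computes which tree positions to keep; it does not read or write tree files. The function name, the 10% burn-in default and the example tree count are assumptions made for illustration.

```python
def evenly_spaced_indices(n_trees: int, burnin_fraction: float = 0.1,
                          n_keep: int = 100) -> list:
    """Return 0-based positions of n_keep evenly spaced post-burnin trees."""
    if not 0.0 <= burnin_fraction < 1.0:
        raise ValueError("burnin_fraction must be in [0, 1)")
    first = int(n_trees * burnin_fraction)  # first tree after the burn-in
    n_post = n_trees - first                # number of post-burnin trees
    if n_post < n_keep:
        raise ValueError("not enough post-burnin trees to keep n_keep of them")
    step = n_post / n_keep                  # spacing between kept trees (>= 1)
    return [first + int(i * step) for i in range(n_keep)]


# Hypothetical example: a posterior tree file containing 1001 trees.
indices = evenly_spaced_indices(n_trees=1001, burnin_fraction=0.1, n_keep=100)
print(len(indices), indices[:3], indices[-1])
```

Because the spacing is at least one tree, the returned positions are always distinct and in increasing order.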

📥 Download a template XML file here

📥 Example dataset available here

Add empirical tree distribution and operator block
  <!-- Defining empirical tree distribution-->                            
  <empiricalTreeDistributionModel id="treeModel" fileName="template.trees">
    <taxa idref="taxa"/>
  </empiricalTreeDistributionModel>
	
  <!-- Add empirical tree distribution operator-->
  <empiricalTreeDistributionOperator weight="3">  
    <empiricalTreeDistributionModel idref="treeModel"/>  
  </empiricalTreeDistributionOperator>
Add tip-state-swap operator (or transition kernel)
  <!-- Add tipStateSwapOperator -->  
  <tipStateSwapOperator weight="" uniformRandomization="true">
    <ancestralTreeLikelihood idref="regions.treeLikelihood"/>
  </tipStateSwapOperator>

A key setting for the tip-state-swap discrete phylogeographic analysis is the relative weight of the tip-state-swap operator, which dictates the expected number of permutation events along an MCMC chain of a given length. We recommend choosing a weight such that at least 95% of tip states are expected to have been swapped between consecutively sampled posterior trees. To reach this threshold, the weight of the tip-state-swap transition kernel can be adjusted manually, accounting for (i) the number of taxa, (ii) the length of the MCMC chain, (iii) the number of posterior samples and (iv) the sum of all operator weights, as described in Gámbaro et al. (submitted).

Alternatively, you can use the tip-state-swap operator calculator below:

tip-state-swap operator calculator
Inputs: number of tips, MCMC chain length, number of posterior samples, sum of all operator weights.
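The Python sketch below shows one way such a weight could be approximated from the same four inputs. It is not the formula from Gámbaro et al. (submitted), and it rests on simplifying assumptions that are ours: each tip-state-swap proposal exchanges the states of two tips chosen uniformly at random, every proposal is counted as a swap, and the sum of operator weights excludes the tip-state-swap operator itself. All names and example values are hypothetical.

```python
import math


def expected_swapped_fraction(n_tips, chain_length, n_samples,
                              other_weights, swap_weight):
    """Expected fraction of tips touched between two consecutive samples."""
    steps_between_samples = chain_length / n_samples
    # Share of MCMC steps that go to the tip-state-swap operator.
    swap_share = swap_weight / (swap_weight + other_weights)
    n_swaps = steps_between_samples * swap_share
    # ASSUMPTION: each swap touches 2 of the n_tips tips, chosen uniformly.
    return 1.0 - (1.0 - 2.0 / n_tips) ** n_swaps


def swap_operator_weight(n_tips, chain_length, n_samples,
                         other_weights, target=0.95):
    """Smallest swap weight reaching the target fraction under this model."""
    if n_tips < 3:
        raise ValueError("this sketch needs at least 3 tips")
    steps_between_samples = chain_length / n_samples
    # Number of swaps needed so that a tip is untouched with prob. 1 - target.
    swaps_needed = math.log(1.0 - target) / math.log(1.0 - 2.0 / n_tips)
    share_needed = swaps_needed / steps_between_samples
    if share_needed >= 1.0:
        raise ValueError("target unreachable: lengthen the chain or "
                         "reduce the number of posterior samples")
    # Solve w / (w + other_weights) = share_needed for w.
    return share_needed * other_weights / (1.0 - share_needed)


# Hypothetical settings, not taken from the template XML file.
weight = swap_operator_weight(n_tips=500, chain_length=10_000_000,
                              n_samples=1000, other_weights=100.0)
print(f"suggested tip-state-swap weight: {weight:.2f}")
```

Since rejected proposals are ignored here, the result is best read as a lower bound; rounding the weight up leaves some margin.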

R scripts and example files are available in the following GitHub repository.

References
