Constructing Markov-modulated substitution model XML files for use with BEAST.

Markov-modulated Models Tutorial

A Markov-modulated model (MMM) is an extension of the standard nucleotide substitution model that allows the substitution process at a site to change over time. Such an MMM is constructed by using a combination of multiple continuous-time Markov chain (CTMC) models such as the HKY and GTR models. In this tutorial, we show how to construct MMMs by using combinations of 2 and 3 CTMC models of the same type, i.e. always HKY or always GTR, which will effectively yield an 8x8 and a 12x12 dimensional MMM.

We note that the increased dimensionality of MMMs will lead to increased computation times for the corresponding BEAST analysis (Suchard et al., 2018). However, this can largely be mitigated by running the analysis on a powerful GPU using the BEAGLE library (Ayres et al., 2019); we assess the computational performance of these MMMs on different hardware in our manuscript (Baele et al., 2019). Despite the increased computational demands of MMM analyses, these models can be compared to one another and to popular nucleotide substitution models using (log) marginal likelihood estimation.

We refer to Gascuel and Guindon (2007) and our own manuscript (Baele et al., 2019) for a detailed overview of MMMs in phylogenetics. We also provide additional useful reading material in the references section of this tutorial.

Prerequisites

This tutorial requires that you have a standard BEAST XML, as for example generated by BEAUti using an HKY or GTR model, at the ready. We will focus here on constructing fairly standard MMMs, in which the relative rate multipliers are all set to 1, as per Equation (7) in our manuscript (Baele et al., 2019).

Adjusting the data type of the analysis

When creating a BEAST XML through BEAUti, one or more <alignment> XML elements are created that contain the genetic data of your alignment. We here assume that only a single such XML element is created, but the same principles in this tutorial apply when creating multiple <alignment> blocks, for example in an alignment containing multiple genes. For nucleotide data, this XML element will look like this:

<alignment id="alignment" dataType="nucleotide">
...
</alignment>

When constructing an 8x8 MMM composed of two CTMC models, this XML element needs to be adjusted as follows:

<alignment id="alignment" dataType="hiddenNucleotide2">
...
</alignment>

Similarly, when constructing a 12x12 MMM composed of three CTMC models, the XML element needs to be adjusted as follows:

<alignment id="alignment" dataType="hiddenNucleotide3">
...
</alignment>

Specifying the Markov-modulated model

When constructing an XML using a standard GTR model, the following (classic) XML specification is generated:

<gtrModel id="gtr">
    <frequencies>
        <frequencyModel dataType="nucleotide" compress="true">
            <frequencies>
                <parameter id="frequencies" value="0.25 0.25 0.25 0.25"/>
            </frequencies>
        </frequencyModel>
    </frequencies>
    <rateAC>
        <parameter id="ac" value="1.0" lower="0.0"/>
    </rateAC>
    <rateAG>
        <parameter id="ag" value="1.0" lower="0.0"/>
    </rateAG>
    <rateAT>
        <parameter id="at" value="1.0" lower="0.0"/>
    </rateAT>
    <rateCG>
        <parameter id="cg" value="1.0" lower="0.0"/>
    </rateCG>
    <rateGT>
        <parameter id="gt" value="1.0" lower="0.0"/>
    </rateGT>
</gtrModel>

All that is required for an 8x8 MMM is two such XML blocks, with unique IDs for the models and the parameters within those models, i.e. exactly like when a unique CTMC model is used for each codon position. We here show two such GTR models being declared, which easily extends to when three GTR models would be needed (i.e. for constructing a 12x12 MMM):

<gtrModel id="gtr.1">
    <frequencies>
        <frequencyModel dataType="nucleotide" compress="true">
            <alignment idref="alignment"/>
            <frequencies>
                <parameter id="gtr.frequencies1" dimension="4"/>
            </frequencies>
        </frequencyModel>
    </frequencies>
    <rateAC>
        <parameter id="ac.1" value="1.0" lower="0.0"/>
    </rateAC>
    <rateAG>
        <parameter id="ag.1" value="1.0" lower="0.0"/>
    </rateAG>
    <rateAT>
        <parameter id="at.1" value="1.0" lower="0.0"/>
    </rateAT>
    <rateCG>
        <parameter id="cg.1" value="1.0" lower="0.0"/>
    </rateCG>
    <rateGT>
        <parameter id="gt.1" value="1.0" lower="0.0"/>
    </rateGT>
</gtrModel>

and

<gtrModel id="gtr.2">
    <frequencies>
        <frequencyModel dataType="nucleotide" compress="true">
            <alignment idref="alignment"/>
            <frequencies>
                <parameter id="gtr.frequencies2" dimension="4"/>
            </frequencies>
        </frequencyModel>
    </frequencies>
    <rateAC>
        <parameter id="ac.2" value="1.0" lower="0.0"/>
    </rateAC>
    <rateAG>
        <parameter id="ag.2" value="1.0" lower="0.0"/>
    </rateAG>
    <rateAT>
        <parameter id="at.2" value="1.0" lower="0.0"/>
    </rateAT>
    <rateCG>
        <parameter id="cg.2" value="1.0" lower="0.0"/>
    </rateCG>
    <rateGT>
        <parameter id="gt.2" value="1.0" lower="0.0"/>
    </rateGT>
</gtrModel>

We note that there is no need to provide each of the GTR models with their own <siteModel> at this point. Rather, we use these two GTR models to construct a single MMM as follows:

<markovModulatedSubstitutionModel id="mm" dataType="hiddenNucleotide2">
    <switchingRates>										
        <parameter id="switchingRates" value="1" dimension="2" lower="0.0"/>
    </switchingRates>						
    <gtrModel idref="gtr.1"/>
    <gtrModel idref="gtr.2"/>
</markovModulatedSubstitutionModel>

Note that the datatype attribute in this MMM XML specification needs to exactly match the one defined in the <alignment> XML element. The MMM specification requires defining the switching rates between the CTMC models, i.e. the two GTR models in the case of this tutorial. Both symmetric and asymmetric switching is available by setting the appropriate dimension of the switching rates parameter. If dimension equals the maximum number of off-diagonal elements in the switching rate matrix (see Equation (5) in our manuscript; Baele et al., 2019), an asymmetric switching process will be used; if it equals half that number, a symmetric switching process will be assumed. We now define a <siteModel> XML element for the MMM, exactly like for standard nucleotide substitution models:

<siteModel id="siteModel">
    <substitutionModel>
        <markovModulatedSubstitutionModel idref="mm"/>
    </substitutionModel>
    <gammaShape gammaCategories="4">
        <parameter id="alpha" value="0.5" lower="0.0"/>
    </gammaShape>
</siteModel>

Finally, we need to provide both the <siteModel> and the <markovModulatedSubstitutionModel> to the <treeLikelihood> XML element:

<treeLikelihood id="treeLikelihood" useAmbiguities="true">
    <patterns idref="patterns"/>
    <treeModel idref="treeModel"/>
    <siteModel idref="siteModel"/>
    <discretizedBranchRates idref="branchRates"/>
    <markovModulatedSubstitutionModel idref="mm"/>
</treeLikelihood>

Providing additional transition kernels

Given the MMM defined above, we need to include the following transition kernels in the <operators> block in the BEAST XML file:

<operators id="operators" optimizationSchedule="default">
    <scaleOperator scaleFactor="0.75" weight="2">
        <parameter idref="ac.1"/>
    </scaleOperator>
    <scaleOperator scaleFactor="0.75" weight="2">
        <parameter idref="ag.1"/>
    </scaleOperator>
    <scaleOperator scaleFactor="0.75" weight="2">
        <parameter idref="at.1"/>
    </scaleOperator>
    <scaleOperator scaleFactor="0.75" weight="2">
        <parameter idref="cg.1"/>
    </scaleOperator>
    <scaleOperator scaleFactor="0.75" weight="2">
        <parameter idref="gt.1"/>
    </scaleOperator>
    
    <scaleOperator scaleFactor="0.75" weight="2">
        <parameter idref="ac.2"/>
    </scaleOperator>
    <scaleOperator scaleFactor="0.75" weight="2">
        <parameter idref="ag.2"/>
    </scaleOperator>
    <scaleOperator scaleFactor="0.75" weight="2">
        <parameter idref="at.2"/>
    </scaleOperator>
    <scaleOperator scaleFactor="0.75" weight="2">
        <parameter idref="cg.2"/>
    </scaleOperator>
    <scaleOperator scaleFactor="0.75" weight="2">
        <parameter idref="gt.2"/>
    </scaleOperator>
    
    <deltaExchange delta="0.01" weight="2">
        <parameter idref="gtr.frequencies1"/>
    </deltaExchange>
    <deltaExchange delta="0.01" weight="2">
        <parameter idref="gtr.frequencies2"/>
    </deltaExchange>
    
    <scaleOperator scaleFactor="0.7" weight="2">
        <parameter idref="switchingRates"/>
    </scaleOperator>
    
    <scaleOperator scaleFactor="0.75" weight="2">
        <parameter idref="alpha"/>
    </scaleOperator>
...
</operators>

Many other transition kernels will be required to complete the estimation process, such as for estimating the tree topology and the branch lengths, but these can simply be copied from a standard XML file.

Setting all necessary priors

The priors corresponding to the MMM specification are also copies of the priors generated by BEAUti for standard nucleotide substitution models. An additional prior on the switching rates is required to complete the prior specification.

<prior id="prior">			
    <gammaPrior shape="0.05" scale="10.0" offset="0.0">
        <parameter idref="ac.1"/>
    </gammaPrior>
    <gammaPrior shape="0.05" scale="20.0" offset="0.0">
        <parameter idref="ag.1"/>
    </gammaPrior>
    <gammaPrior shape="0.05" scale="10.0" offset="0.0">
        <parameter idref="at.1"/>
    </gammaPrior>
    <gammaPrior shape="0.05" scale="10.0" offset="0.0">
        <parameter idref="cg.1"/>
    </gammaPrior>
    <gammaPrior shape="0.05" scale="10.0" offset="0.0">
        <parameter idref="gt.1"/>
    </gammaPrior>
    
    <gammaPrior shape="0.05" scale="10.0" offset="0.0">
        <parameter idref="ac.2"/>
    </gammaPrior>
    <gammaPrior shape="0.05" scale="20.0" offset="0.0">
        <parameter idref="ag.2"/>
    </gammaPrior>
    <gammaPrior shape="0.05" scale="10.0" offset="0.0">
        <parameter idref="at.2"/>
    </gammaPrior>
    <gammaPrior shape="0.05" scale="10.0" offset="0.0">
        <parameter idref="cg.2"/>
    </gammaPrior>
    <gammaPrior shape="0.05" scale="10.0" offset="0.0">
        <parameter idref="gt.2"/>
    </gammaPrior>

    <uniformPrior lower="0.0" upper="1.0">
        <parameter idref="gtr.frequencies1"/>
    </uniformPrior>
    <uniformPrior lower="0.0" upper="1.0">
        <parameter idref="gtr.frequencies2"/>
    </uniformPrior>

    <exponentialPrior offset="0.0" mean="1.0">
        <parameter idref="switchingRates"/>
    </exponentialPrior>
                
    <exponentialPrior mean="0.5" offset="0.0">
        <parameter idref="alpha"/>
    </exponentialPrior>
...
</prior>

Again, the XML snippet shown here doesn’t include all the prior necessary to run a proper analysis, as other model components such as the coalescent and clock models (for example) also require proper prior specification.

Don’t forget to log all the additional parameters

Finally, the only thing left is to make sure the parameter estimate are being written to file and/or to screen, which is easily achieved by including them in the relevant <log> XML elements:

<log ...>
    <parameter idref="ac.1"/>
    <parameter idref="ag.1"/>
    <parameter idref="at.1"/>
    <parameter idref="cg.1"/>
    <parameter idref="gt.1"/>
    
    <parameter idref="ac.2"/>
    <parameter idref="ag.2"/>
    <parameter idref="at.2"/>
    <parameter idref="cg.2"/>
    <parameter idref="gt.2"/>
			
    <parameter idref="gtr.frequencies1"/>
    <parameter idref="gtr.frequencies2"/>
 			
    <parameter idref="switchingRates"/>
            
    <parameter idref="alpha"/>
...
</log>

Model reduction / simplification

Given the (potentially much) increased parameter space corresponding to the MMM specification, these analyses may require a substantial number of iterations in BEAST. While the time the analysis takes can be shortened by using a powerful GPU using the BEAGLE library, the added parameters may not (all) be supported by the signal in the underlying data set, in which case their posterior distributions may not differ substantially from their prior distributions. We have here presented a full version of an 8x8 MMM, but simplifications to the model can easily be made to reduce the number of parameters. For example, an XML with a single set of GTR substitution model parameters may provide higher ESS values and higher (log) marginal likelihoods than a full parameter-rich MMM. Note that both models will still need to be defined in the XML however.

References

O. Gascuel and S. Guindon (2007) Modelling the variability of evolutionary processes. Reconstructing Evolution: new mathematical and computational advances 2:65-99.

G. Baele, M. S. Gill, P. Lemey and M. A. Suchard (2019) Markov-modulated continuous-time Markov chains to identify site- and branch-specific evolutionary variation. Submitted.

N. Galtier (2001) Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol. Biol. Evol. 18:866–873.

N. Galtier and A. Jean-Marie (2004) Markov-modulated markov chains and the covarion process of molecular evolution. Journal of Computational Biology 11:727–733.

S. Guindon, A. G. Rodrigo, K. A. Dyer, and J. P. Huelsenbeck (2004) Modeling the site-specific variation of selection patterns along lineages. Proceedings of the National Academy of Sciences of the United States of America 101:12957–12962.

C. Tuffley and M. Steel (1998) Modeling the covarion hypothesis of nucleotide substitution. Mathematical Biosciences 147:63–91.

M. A. Suchard, P. Lemey, G. Baele, D. L. Ayres, A. J. Drummond, and A. Rambaut (2018) Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol. 4:vey016.

D. L. Ayres, M. P. Cummings, G. Baele, A. E. Darling, P. O. Lewis, D. L. Swofford, J. P. Huelsenbeck, P. Lemey, A. Rambaut, and M. A. Suchard (2019) BEAGLE 3: Improved performance, scaling, and usability for a high-performance computing library for statistical phylogenetics. Syst. Biol. (in press).

Tags: tutorial