Data partitions are the basic unit of data in BEAST. They are a collection of sequence data (DNA, amino acid or other), discrete traits, or continuous traits for each of a set of taxa that are connected by a tree and are assumed to have a shared evolutionary process. Each data partition has a tree, a site model, and a molecular clock model. Data partitions can share these models (their models are linked) or have their own models (their models are unlinked). You can link and unlink these model components in the
Partitions panel of BEAUti — see Linking and unlinking models, below. Even if partitions have unlinked model components, individual parameters of the models can be linked either directly — see Linking parameters — or as a hierarchical phylogenetic model (HPM) — see Hierarchical phylogenetic models, below and this tutorial.
Importing data partitions
To load a sequence alignment, select the
Import Data... option from the
File menu. You can also click the
+ button at the bottom left of the window or just drag-and-drop the file into the main window. The data should be in either NEXUS or FASTA format. BEAUti will also import the data from a BEAST XML file (including all the data partitions and the dates of the tips, but not the models and settings).
Importing partitions from a NEXUS file
NEXUS files can define multiple partitions using the
charset command. For example, the example file in the BEAST package:
/examples/Data/H1N1_HA.nex (or download from here) divides the sequences into two partitions (note the sequences have been truncated here for brevity):
#NEXUS
BEGIN DATA;
    DIMENSIONS NTAX=25 NCHAR=1695;
    FORMAT MISSING=? GAP=- DATATYPE=DNA;
    MATRIX
    hSCar1918   ATGGAGGCAAGACTACTGGTCTTGTTATGTGCATTTGCA...
    hCambr39    ATGAAGGCAAGACTACTGGTCCTGTTATGTACACTTGCA...
    hCHR83      ATGAAAGCAAAACTACTGGTCCTGTTATGTGCACTCTCA...
    hFortMon47  ATGAAAGCAAAACTACTGATCCTGTTATGTGCACTTACA...
    hKiev79     ATGAAAGCAAAACTACTGGTCCTGTTATGTGCACTTTCA...
    hLenin54    ATGAAAGCAAAACTACTGGTCCTGTTATGTGCACTTTCA...
    hMongol85   ATGAAAGCAAAACTACTGGTCCTGTTATGTGCACTTTCA...
    hMongol91   -----------CCTAGTGGTCCTGTTATGTGCATTTACA...
    hNWS33      ATGAAGGCAAAACTACTGGTCCTGTTATGTGCACTTGCA...
    hPR34       ATGAAGGCAAACCTACTGGTCCTGTTATGTGCACTTGCA...
    hScot94     ATGAAAGAAAAACTACTGGTCCTGTTATGTGCACTTTCA...
    hSuita89    ATGAAAGCAAAACTACTAGTCCTGTTATGTGCATTTACA...
    hUSSR77     ATGAAAGCAAAACTACTGGTCCTGTTATGTGCACTTGCA...
    sEhime80    ATGAAGGCAATACTATTAGTCTTGCTATGTACATTTGCA...
    sIllino63   ATGAAGGCAATACTATTAGTCTTGTTATGTGCATTTGCA...
    sIowa30     ATGAAGGCAATACTATTAGTCTTGTTATGTGCATTTGCA...
    sNebrask92  ATGAAGGCAATACCATTAGTCTTGCTATATACATTTACA...
    sNewJers76  ATGAAGGCAATACTATTAGTCTTGTTATGTACATTTGCA...
    sStHya91    ATGAAAGCAATACTATTAGTCTTGCTATATACATTTACA...
    sWiscons61  ATGAAGGCAATACTATTAGTCTTGTTATGTGCATTTGCA...
    sWiscons98  ATGAAGGCAATACTATTAGTCTTGCTATATACATTCACA...
    aDuckA76    ATGGAAGCAAAACTATTTGTACTATTCTGTACATTCACT...
    aDuckB77    ATGGAAGTAAAACTATTTGTATTATTCTGCACATTCACT...
    aItaly87    ATGGAAGCAAAACTGTTTGTATTATTCTGTACATTCACT...
    aMallard85  ATGGAAGTAAAACTATTTGTACTATTCTGTACATTCACT...
    ;
END;

begin assumptions;
    charset HA1 = 151-920;
    charset HA2 = 1-150,921-1695;
end;
This alignment comprises haemagglutinin (HA) gene sequences from a collection of influenza viruses (the letter at the beginning of the labels denotes the host; a = avian, s = swine and h = human). The file contains two
charset commands which define two domains of the protein: HA1, the globular domain, and HA2, the stem region. Note that the HA1 domain lies within HA2, so the latter is defined as two regions which are concatenated.
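To make the charset definitions concrete, the following is a minimal Python sketch of how a range list such as `1-150,921-1695` maps onto alignment columns. The function names `parse_charset` and `extract_partition` are hypothetical and are not part of BEAST or BEAUti; the sketch only handles plain ranges and single positions, not other NEXUS charset syntax.

```python
def parse_charset(spec):
    """Turn a range list such as '1-150,921-1695' into 0-based column indices."""
    columns = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            # NEXUS positions are 1-based and inclusive at both ends.
            columns.extend(range(int(start) - 1, int(end)))
        else:
            columns.append(int(part) - 1)
    return columns


def extract_partition(sequence, spec):
    """Concatenate the columns named by a charset into one partition sequence."""
    return "".join(sequence[i] for i in parse_charset(spec))


ha1 = parse_charset("151-920")
ha2 = parse_charset("1-150,921-1695")
print(len(ha1), len(ha2))  # 770 925
```

Together the two charsets cover all 1695 columns exactly once, which is why HA2 has to be written as two ranges on either side of HA1.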
When imported into BEAUti this results in two partitions in the Partitions panel.
The two partitions by default share the same substitution model, clock model and tree. You can see that the
Clock model and
Partition tree columns of the table are defined as default.
Importing partitions from multiple files
The other way to load multiple partitions is simply to import them from different alignment files (NEXUS, FASTA or XML). The partitions will appear in the table named after the file they were loaded from. In most cases you should load partitions with exactly the same set of taxa (the names should be identical). If the partitions’ sets of taxa differ in any way, BEAUti will give them different clock models and partition trees.
The first thing to do is to specify the dates of sampling of the sequences using the
Tips panel. The year in which each virus was isolated is given at the end of its label. To extract the dates select the
Use tip dates box and click the
Parse Dates button and in the dialog box choose the
Defined just by its order,
Order: last, and
Parse as a number options. See this page for more details of how to use this dialog box to parse dates from tip labels.
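As a rough illustration of what "defined by its order, last, parsed as a number" means for these labels, here is a small Python sketch. The helper `parse_tip_date` is hypothetical and is not how BEAUti is implemented; it simply returns the last numeric field of a label as written and makes no attempt to convert two-digit years.

```python
import re


def parse_tip_date(label):
    """Return the last run of digits (optionally with a decimal part) in a label."""
    fields = re.findall(r"\d+(?:\.\d+)?", label)
    # Take the last numeric field, mirroring the 'Order: last' choice.
    return float(fields[-1]) if fields else None


for label in ["hSCar1918", "hCambr39", "aDuckA76"]:
    print(label, parse_tip_date(label))
```

Note that most labels in this alignment carry two-digit years while `hSCar1918` carries four digits, so the parsed values need to be checked before they are used as sampling dates.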
Site models specify the evolutionary process that describes how the data was generated on the tree. For example, for DNA sequences this could be an HKY model with discrete gamma rate heterogeneity amongst sites. If you switch to the
Sites panel using the tabs at the top, you can specify the details of this model.
On the left hand side is a table listing the site models currently in use. At the moment it only has the
default which is being shared by both the data partitions:
The model settings on the right are for this site model. The default is the HKY model which has a parameter,
kappa, specifying the ratio of transition and transversion rates.
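The role of kappa can be seen by writing out the HKY rate matrix. The sketch below uses the standard textbook form of the model (rate from base i to base j proportional to the frequency of j, multiplied by kappa for transitions); the function name and the dictionary layout are illustrative choices, and the matrix is left unnormalised.

```python
BASES = "ACGT"
# Transitions are purine<->purine (A<->G) and pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky_rate_matrix(kappa, freqs):
    """Build an unnormalised HKY rate matrix as a dict of dicts."""
    q = {}
    for i in BASES:
        q[i] = {}
        for j in BASES:
            if i != j:
                scale = kappa if (i, j) in TRANSITIONS else 1.0
                q[i][j] = scale * freqs[j]
        # The diagonal makes each row sum to zero.
        q[i][i] = -sum(q[i].values())
    return q


equal = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
q = hky_rate_matrix(2.0, equal)
print(q["A"]["G"] / q["A"]["C"])  # 2.0
```

With equal base frequencies the ratio of a transition rate to a transversion rate is exactly kappa.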
Molecular clock models
Switch to the
Clocks panel to set the clock model settings. In the same way as before, the current clock models are given in the table on the left and the settings for them on the right. Select the
Uncorrelated relaxed clock as the clock model.
The lognormal relaxed clock has two parameters, ucld.mean and
ucld.stdev, which are the mean and standard deviation of the lognormal distribution describing the rates on the branches of the tree. See this page for more details about the various clock models and their parameters.
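To give a feel for what an uncorrelated lognormal clock implies, the following Python sketch draws an independent rate for each branch from a lognormal distribution. It assumes that the mean is given on the real scale and the standard deviation on the log scale; that parameterisation, the function name `draw_branch_rates` and the example values are all assumptions made for illustration, and BEAST's own implementation may differ.

```python
import math
import random


def draw_branch_rates(mean, stdev, n_branches, seed=None):
    """Draw one rate per branch from a lognormal with the given real-space mean."""
    rng = random.Random(seed)
    # Choose the log-scale location so the real-space mean equals `mean`.
    mu = math.log(mean) - 0.5 * stdev ** 2
    return [rng.lognormvariate(mu, stdev) for _ in range(n_branches)]


rates = draw_branch_rates(mean=0.002, stdev=0.5, n_branches=48, seed=1)
print(min(rates), max(rates))
```

A standard deviation close to zero makes every branch rate nearly equal to the mean, which approaches a strict clock; larger values allow more rate variation among branches.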
In the Trees panel we again see the available trees on the left and the settings on the right. In this case the settings for the tree are actually the choice of tree prior:
The different tree priors available are described in more detail here. For this example, leave it at the default,
Coalescent: Constant Size.
Linking and unlinking models
We will now look at linking and unlinking the components of the model to construct different scenarios for the evolution of these sequences.
To start with, all the partitions share the same site model, molecular clock model and tree. This says that all the sites in both partitions are evolving in the same way, at the same rate and on the same tree (Figure 1).
Unlinking site models
As it stands, the specified model is equivalent to importing all the data as a single partition. To relax this model, we can return to the
Partitions panel. Select both partitions in the table and click the
Unlink Subst. Models button. The result is that each partition gets its own site model (named after the partition):
The settings for the two new site models will be copied from the existing ‘default’ one but now there will be a separate
kappa parameter for each. In the
Priors panel you will now see one kappa parameter for each partition.
You will also see a new parameter,
allMus, which is a vector of relative rates, one for each partition. These are constrained to have a mean of 1.0 and are multiplied by the overall molecular clock rate to get the rate for each partition. In this way the rate of evolution can vary between data partitions, but because they share the same clock model the relative rate on each branch of the tree is the same. A general diagram of the scenario is shown in Figure 2.
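The arithmetic behind the relative rates can be sketched in a few lines of Python. This version rescales the relative rates to a plain unweighted mean of 1.0 before multiplying by the clock rate; whether the software weights the mean (for example by partition length) is not stated here, so the unweighted mean is an assumption, as are the function name and the example values.

```python
def partition_rates(clock_rate, relative_rates):
    """Scale relative rates to mean 1.0, then multiply by the shared clock rate."""
    mean = sum(relative_rates) / len(relative_rates)
    normalised = [r / mean for r in relative_rates]
    return [clock_rate * r for r in normalised]


# Two partitions sharing one clock, the second evolving three times faster.
rates = partition_rates(0.002, [0.5, 1.5])
print(rates)
```

Because both partitions are scaled from the same clock rate, the ratio between their rates is the same on every branch of the tree.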
Unlinking clock models
Select both partitions in the data partition table and click the
Unlink Clock Models button. The result is that each partition gets its own clock model (again, named after the partition):
Returning to the
Priors panel, you will notice some changes:
The allMus parameter has been removed and replaced with a ucld.mean and
ucld.stdev for each partition. This is because each partition now has its own molecular clock with its own independent rate, so it no longer needs a relative rate. Each partition also has its own distribution from which the relative rates for each branch are drawn (again, independently for each partition). A general diagram is shown in Figure 3.
Data partitions comprising protein-coding nucleotide sequences can be subdivided into codon position partitions. This is done in the Sites panel: select the
HA1 partition and then choose from the
Partition into codon positions menu. Select
3 partitions: positions 1, 2, 3. Do the same for the
HA2 partition. In the
Priors panel you will see that each of the 3 codon positions for each of the partitions now has its own
kappa parameter (
HA1.CP1.kappa is the kappa for the HA1 partition, codon position 1, etc.):
You will also see that each partition now has its own
allMu vector parameter. These are the set of relative rates for the codon positions for each partition. This is the equivalent of compiling each of the first, second, and third codon positions for each gene region into separate partitions but is generally more convenient. The codon positions of a particular partition share the same clock model and substitution model, but this is probably the appropriate assumption.
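The manual equivalent of codon partitioning can be sketched as follows. This Python snippet assumes the sequence starts at codon position 1 and contains no frame shifts; the function name and the `CP1`/`CP2`/`CP3` keys are chosen for illustration only.

```python
def split_codon_positions(sequence):
    """Split an in-frame coding sequence into its three codon positions."""
    return {
        "CP1": sequence[0::3],  # first base of every codon
        "CP2": sequence[1::3],  # second base of every codon
        "CP3": sequence[2::3],  # third base of every codon
    }


parts = split_codon_positions("ATGGAGGCA")  # codons ATG, GAG, GCA
print(parts)  # {'CP1': 'AGG', 'CP2': 'TAC', 'CP3': 'GGA'}
```

Each of the three resulting strings would then be treated as its own partition with its own kappa and relative rate.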
At this stage we have a very general, parameter-rich model and we may wish to constrain it by linking particular parameters. Linking ties two or more independent parameters together so that they behave as a single entity that always has the same value. This allows you to jointly estimate the parameter across different partitions even though other aspects of their models are independent. For example, we can link the overall rate of evolution for our two partitions. In the
Priors panel select the two clock rate parameters (HA1.ucld.mean and
HA2.ucld.mean) and click the
Link parameters together button. A dialog box will appear showing the parameters being linked, the name of the new ‘controlling’ parameter and the prior on this parameter:
Name the new parameter but leave the other options at their defaults, click
OK and you will see that the table has been updated so the linked parameters are in grey and the new controlling parameter is at the bottom:
You could also link the HA1.ucld.stdev and
HA2.ucld.stdev parameters similarly. Linking both the ucld.mean and ucld.stdev parameters of the clock models for the two partitions says that the rates of evolution for each branch are drawn from exactly the same distribution for each partition, but which branches are fast or slow can differ between partitions.
Or you could link HA1.CP1.kappa with HA2.CP1.kappa, HA1.CP2.kappa with
HA2.CP2.kappa, etc. This would say that codon position 1 in ‘HA1’ has the same underlying ratio of transitions to transversions as codon position 1 in ‘HA2’, even if the rates differ between them.
Hierarchical phylogenetic models
You can also link parameters together in a Hierarchical phylogenetic model (HPM). This takes the approach that the individual parameters are not exactly the same but rather are drawn from a common underlying distribution (say a normal distribution) and the parameters of that distribution (the mean and stdev for the normal) are then estimated. For more information about HPMs see this tutorial.
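The difference from direct linking can be illustrated with the prior term an HPM contributes. The Python sketch below evaluates the log density of several partition-level parameters under a shared normal distribution, as in the example given above; the function names are hypothetical, priors on the mean and stdev themselves are omitted, and this is not BEAST's implementation.

```python
import math


def normal_log_density(x, mean, stdev):
    """Log density of a normal distribution at x."""
    return (-0.5 * math.log(2 * math.pi * stdev ** 2)
            - (x - mean) ** 2 / (2 * stdev ** 2))


def hierarchical_log_prior(values, mean, stdev):
    """Sum the log densities of partition-level values under one shared normal."""
    return sum(normal_log_density(v, mean, stdev) for v in values)


# Hypothetical kappa values for two partitions under a shared distribution.
similar = hierarchical_log_prior([2.0, 2.1], mean=2.05, stdev=0.1)
different = hierarchical_log_prior([1.0, 3.0], mean=2.05, stdev=0.1)
print(similar > different)  # True
```

Values that sit close together score better under a tight shared distribution, so the partitions inform one another without being forced to take exactly the same value.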
The ultimate, most general, model is to also unlink the trees between partitions (using the
Unlink Trees button in the
Partitions panel). Doing this results in each partition having not only its own substitution model, molecular clock model and parameters but also a completely different tree. Just doing this essentially performs independent analyses in a single BEAST run (probably not the most efficient way of doing this).
However, there are a number of situations for which unlinking the trees may be appropriate. If the different partitions are independent, unlinked loci from the same individuals, then they can be used in a multi-locus coalescent analysis to jointly estimate the demographic history.
If the data partitions are samples taken from different populations (e.g., viruses from separate outbreaks or from different patients) then they will, by definition, have different trees, but parameter linking or hierarchical modelling could jointly estimate parameters which they do have in common (such as the rate of evolution).