Published online 21 August 2023. NAR Genomics and Bioinformatics, 2023, Vol. 5, No. 3
https://doi.org/10.1093/nargab/lqad077
Machine learning based lineage tree reconstruction improved with knowledge of higher level relationships between cells and genomic barcodes
Alisa Prusokiene1, Augustinas Prusokas2 and Renata Retkute3,*
1 School of Natural and Environmental Sciences, Newcastle University, Newcastle upon Tyne NE1 7RU, UK
2 Independent researcher, London, SW7 2BX, UK
3 Department of Plant Sciences, University of Cambridge, Downing Street, Cambridge CB2 3EA, UK
Received January 24, 2023; Revised June 26, 2023; Editorial Decision August 08, 2023; Accepted August 11, 2023
* To whom correspondence should be addressed. Tel: +44 1223 333900; Email: rr614@cam.ac.uk

© The Author(s) 2023. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT
Tracking cells as they divide and progress through differentiation is a fundamental step in understanding many biological processes, such as the development of organisms and progression of diseases. In this study, we investigate a machine learning approach to reconstruct lineage trees in experimental systems based on mutating synthetic genomic barcodes. We refine previously proposed methodology by embedding information of higher level relationships between cells and single-cell barcode values into a feature space. We test performance of the algorithm on shallow trees (up to 100 cells) and deep trees (up to 10 000 cells). Our proposed algorithm can improve tree reconstruction accuracy in comparison to reconstructions based on a maximum parsimony method, but this comes at a higher computational time requirement.

INTRODUCTION
Single-cell lineage tracing, reconstructing the relationship between individual dividing cells in a tissue or organism, has the potential to improve our understanding of many biological processes, including the major transitions in evolution (1), development from the founding zygote to a complex organism (2), stem cell properties in tissue (3), cancer cell differentiation (4) or metastasis (5), and pathways of tumour evolution (6).

The first full development lineage tree was traced for the embryonic cells of the nematode Caenorhabditis elegans by sketching the events of cell division and development histories observed directly through light microscopy (7). Recently, platforms based on cell barcoding were proposed for tracking the lineage of individual cells at a high resolution. An integrase-based synthetic barcode system, intMEMOIR, uses the serine integrase Bxb1 to perform irreversible random modification of DNA recording arrays that can be read out using fluorescence in situ hybridization imaging (8). Fluorescent reporter assays allow the rapid characterization of these recording units (9). This experimental system is coupled with a time-lapse movie of the cells as they divide to provide a ground truth lineage tree (8). Another technique, substitution mutation-aided lineage tracing (SMALT) system, used a 3-kb readout sequence with 16 iSceI binding motifs to map single-cell resolution cell phylogenies during organ development (10). Additional strategies to achieve lineage tracing at single-cell resolution have been developed in the past few years: integration barcodes (designed as short DNA fragments placed in an expressed locus) and polylox barcodes (comprising a DNA cassette with multiple loxP sites in alternating orientations) (11). These, and future, technical advancements have to be matched by progress in relevant computational methods (12).

Various statistical methods for reconstructing gene trees and species trees (13,14) have been developed over the last few decades, but these methods have not been widely used on time and individual-cell resolved datasets. Mutations induced by cell barcoding methods are irreversible, which is different from the somatic mutations accumulated during mitotic cell division (8). Previously, we have developed a machine learning (ML) approach for cell lineage reconstruction, with results submitted to the Allen Institute Cell Lineage Reconstruction DREAM Challenge (15). This was the best-performing algorithm for the reconstruction of in vitro cell lineages of trees with <100 cells. It reached higher accuracy than other methods, such as distance-based method DCLEAR (16), maximum parsimony based method Cassiopeia-integer linear programming (17) and Cassiopeia-Greedy (17). Our proposed framework was based on embedding single-cell data into a feature space in order to train an ML algorithm to predict the probability that cells are siblings.

In this study, we expanded this method by training an ML model to predict higher level relationships between cells. We refined our methodology to be applied to any number of
possible edit states. To evaluate the performance of the algorithm, we used two sets of in silico data: (i) shallow trees (up to 100 cells), simulated using parametrization based on the in vitro dataset for mouse embryonic stem cells (15); and (ii) deep trees (up to 10 000 cells), simulated using parametrization based on the in vitro dataset for fly organ development (10). Further, we reconstructed trees for the in vitro dataset for mouse embryonic stem cells (15). We measured the accuracy of lineage reconstruction using the normalized Robinson–Foulds (RF) score. For trees with up to 100 cells, we calculated four metrics: the normalized RF score, the triplet (TRP) score, the quartet (QRT) score and the clustering information (CLI) score. We compared our reconstruction with a maximum parsimony method. We found that our proposed algorithm has an advantage over the maximum parsimony method in terms of reconstruction quality for both shallow trees and deep trees.

MATERIALS AND METHODS
The in vitro datasets
Mouse embryonic stem cell colonies. The data were obtained as part of the Allen Institute Cell Lineage Reconstruction Challenge through Synapse ID syn20692755 (15). In the experiment, the recording array (barcode) consisted of L = 10 recording units (8). Each recording unit was in one of three states: ground state (represented as ‘1’), a deletion (represented as ‘0’) or an inversion (represented as ‘2’) of the DNA sequence. Each colony was started from an individual cell and colony growth was observed for 48 h. The experimental data comprise (i) an array of intMEMOIR readouts as a text file (also called matrix) and (ii) the ground truth lineage for the colony. Tree-like data structures were provided as a Newick file (18). The ground truth cell lineage trees were obtained from video-microscopy data (8).

Fly organ development. A SMALT system used a barcode with 16 iSceI binding motifs present with equal distance throughout the sequence (10). The recording array consisted of L = 2943 recording units. Each recording unit was in one of two states: ground state (represented as ‘0’) or mutated state (represented as ‘1’). Binary sequence data were available for two specimens: one had 5002 cells and the other 5420 cells. The datasets correspond to fly organ development from embryo to late-third instar larvae (10).

The in silico datasets
Stochastic simulation of cell division and barcode editing. Accumulation of stochastic mutations during cell division was simulated similarly to (8,19). Each cell colony starts from an individual cell carrying an unedited array with a length of L units. The model assumes that cells divide synchronously and at a constant rate. The initial cell then undergoes a series of cell divisions (Figure 1A). After d divisions, the colony consists of N cells, where $N = 2^d$.

After every cell division, each unedited target can mutate with a given probability μ to one of several possible edited states s ∈ S. The state is chosen according to a probability $\alpha_s$. The process of recording is irreversible, and once a recording unit is edited, it can no longer change (Figure 1B).

The probability that a recording unit has not been edited after d divisions is equal to $P_{\text{not edited}}(d, \mu, \alpha) = (1 - \mu)^d$. The ground truth lineage could provide necessary information on what is an expected lower number of divisions each cell has undergone in a colony, for example by counting the number of ancestors for phylogenetic nodes.

The probability of an edited unit going to state s is denoted by $\alpha_s$, such that $\sum_{s=1}^{|S|} \alpha_s = 1$, where S is a set of all possible edits and |S| denotes the size of set S. In the case of the mouse embryonic stem cell colonies dataset, |S| = 2 and S = {0, 2}, and in the case of the fly organ development dataset, |S| = 1 and S = {1}. The probability that a recording unit is in state s at division d, given that it was in the unedited state at division d − 1, is equal to $P_{\text{edited}}(d, \mu, \alpha) = (1 - \mu)^{d-1} \mu \alpha_s$. As the state values of lineage tree internal nodes are not available, we do not know which division incurred the editing. Therefore, the probability of observing the recording unit in state s after d generations becomes $P_{\text{edited}}(d, \mu, \alpha) = \sum_{k=1}^{d} (1 - \mu)^{k-1} \mu \alpha_s$.

The total likelihood is equal to

$$L(\mu, \alpha) = \prod_{c=1}^{C} \prod_{i=1}^{n_c} \prod_{l=1}^{L} P_{o_{c,i,l}}(d_{c,i}, \mu, \alpha), \qquad (1)$$

where C is the total number of colonies, $n_c$ is the number of cells in a colony c, $d_{c,i}$ is the number of divisions for cell i in colony c, $o_{c,i,l}$ is the observed state for unit l in cell i and colony c, and $o_{c,i,l} \in \{\text{not edited}, \text{edited}\}$.

For parameter estimation, we used the R package AMISEpi, which has an implementation of adaptive multiple importance sampling for Bayesian analysis (20). We set the priors to be uniformly distributed: μ ∼ U[0, 1] and $\alpha_i$ ∼ U[0, 1].

Simulation of shallow trees (up to 100 cells). We used parameters (mutation rate and probability of mutations) estimated from the in vitro mouse embryonic stem cell dataset (15). We used data on all colonies for parameter estimation.

Simulation of deep trees (up to 10 000 cells). We used parameters (mutation rate) estimated from the in vitro fly organ development dataset (10). We have used both datasets to estimate the mutation rate per division/target. We assumed that there were 13 divisions, which was closest to the number of cells available for the dataset ($2^{13} = 8192$).

Lineage tree reconstruction using the ML approach (AMbeRland-TR)
Feature engineering. For illustration purposes, we assume that the barcode units can be in one of three possible states: ground state, a deletion or an inversion (Figure 1B). We are proposing the following classes of features based on pairwise comparison of barcodes: number of units that have not been edited (F1); number of units that have the same edit (F4 and F6); number of units that have a single edit (F2 and F3); and number of units that have different edits (F5) (Figure 1C).

This approach can be extended to any number of possible edited states by including all possible pairwise combinations
of the extended record set {not mutated, S} as predictors for the ML model. This should produce (|S| + 1)! predictors.

We have used the R package phangorn (21) to extract information on relatedness between cells from the ground truth cell lineage trees. For each level of the tree, this produced two lists of cell pairs for each colony: cells that share an ancestor at level t and cells that do not share an ancestor at level t. Two examples of embedding single-cell data into a feature space for ML training are shown in Figure 1D. The upper row corresponds to level 1 (sibling cells) and the lower row corresponds to level 2 (cousin cells). State indicates whether the relationship is true or false.

Figure 1. The ML approach for lineage tree reconstruction. (A) Ground truth lineage. Here, we have a colony with nine cells; each cell corresponds to a tip of the lineage tree (yellow circles). All cell pairs have a common node (blue circles), showing their level of relatedness. (B) Corresponding barcode states: rows correspond to cells and columns to recorder units. A unit can be at the ground state (encoded as ‘1’), deletion (encoded as ‘0’) or inversion (encoded as ‘2’). (C) Feature construction is based on pairwise comparison of barcodes: number of units that have not been edited (F1); number of units that have the same edit (F4 and F6); number of units that have single edit (F2 and F3); and number of units that have different edits (F5). (D) Two examples of embedding single-cell data into a feature space for ML training. The upper row corresponds to level 1 (sibling cells) and the lower row corresponds to level 2 (cousin cells). State indicates whether relationship is true or false.

ML training, prediction and interpretation. We used a gradient boosting machine (GBM) to implement the outlined ML approach. All calculations were performed in R using package gbm (22). The following options were used to train the GBM model: distribution = ‘bernoulli’; n.trees = 1000; interaction.depth = 10; n.minobsinnode = 5; cv.folds = 5; and train.fraction = 0.5.

We use the relative importance of features, partial dependence plots and individual conditional expectation plots for model interpretation. Relative importance is based on the number of times a predictor is selected when training the model (23). Higher values of relative importance indicate larger influence on the response. Individual conditional expectation curves visualize the partial relationship between the predicted response and a feature for individual datasets (24). Partial dependence plots show the relationship averaged over all observations, which makes it easier to extract expected trends (24).

Clustering. We applied a custom hierarchical clustering method for building a cell lineage tree from predicted probabilities. Clustering begins at the lowest tree level, where all clusters contain an individual cell. Each possible cell pair is then ranked according to the predicted probability that they share an ancestor at this level. At consecutively increasing levels, pairwise comparison is performed between each lower level cluster, where the calculated probability is the maximum between any elements of the two clusters. Cluster pairs are ordered again according to this probability and are assumed to have the same parent node if its value is above the estimated threshold for this level. This process is repeated until one or two clusters are left. We assume only binary trees.

Lineage tree reconstruction using a maximum parsimony tree reconstruction method
Maximum parsimony based methods try to find the minimum number of changes necessary to describe the data for a given tree. We performed a maximum parsimony reconstruction using the R package phangorn (21). An initial tree was required to start the maximum parsimony tree search. This was done using the Hamming distance between barcodes [package DescTools (25)] and unweighted pair group method with arithmetic mean clustering (26). We set the method as ‘fitch’ and the minimum number of iterations in the ratchet as 100.
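
Both reconstruction routes begin with a unit-by-unit comparison of two barcodes: the Hamming distance used for the initial parsimony tree counts the units that differ, whereas the features of the ML approach count each combination of states separately. A minimal Python sketch of that tabulation for the three-state case follows. The function name, the dictionary keys and the example barcodes are illustrative assumptions rather than part of the published R implementation, and the keys avoid the labels F1 to F6 because the text does not say which of the two single-edit features carries which number.

```python
from collections import Counter

# State encoding of the intMEMOIR readouts: 1 = ground, 0 = deletion, 2 = inversion.
GROUND, DELETION, INVERSION = 1, 0, 2


def pairwise_features(barcode_a, barcode_b):
    """Count, over recording units, each unordered pair of states seen in two cells."""
    if len(barcode_a) != len(barcode_b):
        raise ValueError("barcodes must have the same number of recording units")
    # Sorting each pair makes (0, 1) and (1, 0) fall into the same bin.
    counts = Counter(tuple(sorted(pair)) for pair in zip(barcode_a, barcode_b))
    return {
        "both_ground": counts[(GROUND, GROUND)],
        "ground_vs_deletion": counts[(DELETION, GROUND)],
        "ground_vs_inversion": counts[(GROUND, INVERSION)],
        "both_deletion": counts[(DELETION, DELETION)],
        "deletion_vs_inversion": counts[(DELETION, INVERSION)],
        "both_inversion": counts[(INVERSION, INVERSION)],
    }


# Hypothetical readouts for two cells with L = 10 recording units.
cell_a = [1, 0, 2, 1, 1, 0, 2, 2, 1, 0]
cell_b = [1, 0, 2, 0, 1, 2, 2, 1, 1, 0]
features = pairwise_features(cell_a, cell_b)

# The Hamming distance collapses the three mismatch features into one number.
hamming_distance = sum(a != b for a, b in zip(cell_a, cell_b))
```

In this sketch the six counts always add up to the number of recording units, and the three mismatch counts add up to the Hamming distance, which is one way to see that the feature vector carries strictly more information than the distance alone.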
Scores
We assessed the accuracy of lineage tree reconstruction using four metrics: normalized RF score, TRP score, QRT score and CLI score. All scores have values between 0 and 1, with smaller values indicating larger similarity between ground truth lineage tree and reconstructed lineage tree.

Normalized RF score. The RF distance counts the number of splits that are unique to one of the two trees (27). An RF distance of 0 indicates that all splits in both trees are the same. We used the RF.dist function from the R package phangorn to compute the normalized RF score (21).
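
To make the split counting concrete, the following Python sketch computes a normalized RF score for two small trees written as nested tuples. It is an illustration under stated assumptions, not a substitute for phangorn: the nested-tuple representation, the function names and the eight-cell example are hypothetical, trees are compared as unrooted bipartitions, and the normalization divides by the total number of non-trivial splits in both trees.

```python
def splits(tree):
    """Return the non-trivial bipartitions of a nested-tuple tree, one canonical side each."""
    found = []

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.append(members)
        return members

    all_leaves = leaves(tree)
    anchor = min(all_leaves)  # every split is stored as the side without this leaf
    result = set()
    for side in found:
        if anchor in side:
            side = all_leaves - side
        # Skip trivial splits that separate nothing or a single leaf.
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result


def normalized_rf(tree_a, tree_b):
    """Number of splits unique to one tree, divided by the number of splits in both."""
    a, b = splits(tree_a), splits(tree_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0


# Hypothetical eight-cell lineage and a reconstruction with cells B and C exchanged.
truth = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))
swapped = ((("A", "C"), ("B", "D")), (("E", "F"), ("G", "H")))
score = normalized_rf(truth, swapped)  # 4 unique splits out of 10
```

Here the two trees share three of their five non-trivial splits, so four splits are unique and the score is 0.4.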
TRP score. The TRP distance counts the number of subtrees of three taxa that are different in the two trees (28). The TRP distance was calculated using the tqDist algorithm (29) implemented in the R package Quartet (30). The TRP score was calculated by dividing the TRP distance by the total number of triplets shared between the two trees, i.e. $\binom{N}{3}$ (29).

QRT score. The QRT distance enumerates all subsets of leaves of size 4 and counts how often the topologies induced by the four leaves agree in the two trees (31). QRT divergence was calculated using the tqDist algorithm (32,33) implemented in the R package Quartet (30). The QRT score was calculated as one minus QRT divergence.

CLI score. CLI distance is a generalized RF metric based on the information content of the largest split (34). To compute the CLI score, we use the function ClusteringInfoDistance with the option normalize = TRUE from the R package TreeDist (35).

Set-up for lineage tree reconstruction
In silico shallow trees. To determine the effect of varying the number of possible states and the number of recording units on the accuracy of lineage reconstruction, we performed simulations with the number of states varied between 2 and 6 and a recorder carrying either 10 or 20 units. For each configuration, we simulated 1000 lineage trees that were used for ML training, and an additional 100 lineage trees that were used for the testing of lineage reconstruction methods. For each simulation, we sampled the depth of lineage from the empirical distribution associated with the mouse embryonic stem cell dataset and editing rate from the fitted posterior distribution.

Mouse embryonic stem cell colonies. We used the same partition of the data as in (15), i.e. array readout data from 76 colonies along with the corresponding ground truth lineages as the training set and array readout data from 30 cell colonies as a testing set for accuracy evaluation.

In silico deep trees. We simulated division and target editing for 10 and 12 divisions, which produced colonies with 1024 and 4096 cells, respectively. For trees with 1024 cells, we trained an ML model using data obtained from 10 individual trees, but a single simulated tree was used to train the ML model for the case with 4096 cells.

RESULTS
Comparison of tree reconstruction accuracy scores
First, we investigated performance of scores under conditions where a lineage tree is reconstructed erroneously. Figure 2 shows an example of how scores compare for a tree with eight cells and a few possible reconstructions. In case (i), cells A and C (which are cousins) have been assigned incorrectly. This resulted in the normalized RF score equal to 0.4, but had a small effect on the TRP score (0.07). In case (ii), we assign cells A and E incorrectly, which increases all scores, including the TRP score, to have values between 0.53 and 0.6. If two pairs of cells were assigned incorrectly [case (iii)], the normalized RF score is reduced by a factor of 3 and the QRT score by a factor of 2, but the value of the TRP score stayed the same as in case (ii). Next, we assumed that the number of divisions was estimated incorrectly for a pair of cells (G, H) [case (iv)]. Although this results in a different structure of the tree, only the TRP score detected the discrepancy, giving a value of 0.29, with all other metrics at value 0. So, if the quality of reconstruction were judged solely by the normalized RF score, QRT score or CLI score, this would erroneously suggest that case (iv) is a perfectly reconstructed tree. If a further two pairs of sibling cells are swapped [case (v)], this increases scores by 0.2–0.23 points. Finally, if all cells are reconstructed as sibling cells [case (vi)], the QRT score would be equal to 0.5, indicating that half of all possible quartets were reconstructed correctly.

Figure 2. Comparison between scores for a colony with eight cells. (A) Ground truth lineage tree and six possible reconstructed tree topologies (i–vi). (B) Corresponding scores for each reconstruction from panel (A): normalized RF score, TRP score, QRT score and CLI score.
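
The triplet comparison behind the TRP score can be illustrated with a brute-force Python sketch that enumerates every set of three cells. It is not the tqDist algorithm, which is far more efficient. The nested-tuple trees, the function names and the eight-cell example are hypothetical, and trees are treated as rooted.

```python
from itertools import combinations
from math import comb


def leaf_paths(tree):
    """Map each leaf to the sequence of child indices leading to it from the root."""
    paths = {}

    def walk(node, path):
        if isinstance(node, tuple):
            for index, child in enumerate(node):
                walk(child, path + (index,))
        else:
            paths[node] = path

    walk(tree, ())
    return paths


def shared_depth(path_a, path_b):
    """Depth of the most recent common ancestor of two leaves."""
    depth = 0
    for step_a, step_b in zip(path_a, path_b):
        if step_a != step_b:
            break
        depth += 1
    return depth


def triplet_topology(paths, a, b, c):
    """Return the pair that groups together within the triplet, or None if unresolved."""
    depths = {
        (a, b): shared_depth(paths[a], paths[b]),
        (a, c): shared_depth(paths[a], paths[c]),
        (b, c): shared_depth(paths[b], paths[c]),
    }
    deepest = max(depths.values())
    closest = [pair for pair, depth in depths.items() if depth == deepest]
    return closest[0] if len(closest) == 1 else None


def triplet_score(tree_a, tree_b):
    """Fraction of the C(N, 3) triplets whose topology differs between the two trees."""
    paths_a, paths_b = leaf_paths(tree_a), leaf_paths(tree_b)
    names = sorted(paths_a)
    differing = sum(
        triplet_topology(paths_a, *trio) != triplet_topology(paths_b, *trio)
        for trio in combinations(names, 3)
    )
    return differing / comb(len(names), 3)


# Hypothetical eight-cell lineage and a reconstruction with cells B and C exchanged.
truth = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))
swapped = ((("A", "C"), ("B", "D")), (("E", "F"), ("G", "H")))
score = triplet_score(truth, swapped)  # 4 of 56 triplets differ
```

In this example only the four triplets drawn from cells A to D change topology, so the score is 4/56, or about 0.07, which shows how a local swap between cousins can move the triplet score only slightly.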
Estimating mutation rates
The mouse embryonic stem cell dataset. The resulting colonies had from 4 to 39 cells; in total, there were 1453 individual cells in the dataset. The distribution of the number of divisions in the ground truth lineage trees is shown in Figure 3A.

Posterior distributions for fitted editing rate and the probability that an edited unit has a state ‘2’ are shown in Figure 3B and C, respectively. We estimated the mean of marginal posterior distributions to be μ = 0.15 and α = 0.48. For simulation of cell division and record editing, we assume that all states have the same probability, i.e. $\alpha_s = 1/|S|$.

The fly organ development dataset. Figure 3D shows the posterior distribution of mutation rate. We estimated the mean of marginal posterior distributions to be 0.0005. This is in agreement with (10), where it was concluded that ∼0.8–1.3 mutations were recorded on the readout sequence per cell generation.

Lineage reconstruction of in silico dataset: shallow trees
Results of simulations are shown in Figure 4. Our ML-based algorithm outperformed the maximum parsimony based method in all four metrics. Comparing mean performance, we found a 43–50% improvement in the normalized RF score, a 19–32% improvement in the QRT score and a 36–45% improvement in the CLI score.

We found that increasing the number of target units had a stronger effect on lineage reconstruction accuracy than increasing the number of states. For a recording array using 20 targets instead of 10 targets, there was a 62% improvement in performance for the QRT score, a 56% improvement in performance for both the normalized RF score and CLI score, and a 28% improvement in performance for the TRP score. When the number of editing states was increased from 2 to 5, there was a 17% improvement in the normalized RF score for a recording array using 10 targets. There was a 68% improvement in the normalized RF score for a recording array using 20 targets. There was a 30% improvement in the QRT score for a recording array using 10 targets and a 49% improvement in the QRT score for a recording array using 20 targets.

Lineage tree reconstruction of the mouse embryonic stem cell dataset
A reconstruction was computed from the test dataset consisting of 30 cell colonies using only the intMEMOIR array readout (8). Figure 5A shows a pairwise comparison between the two methods. Most of the points are above the diagonal, indicating that our ML-based algorithm outperformed the maximum parsimony based method for all four metrics. Comparing mean performance, we found a 17–23% improvement in the normalized RF score, a 7–37% improvement in the TRP score, a 7–11% improvement in the QRT score and a 12–15% improvement in the CLI score (Figure 5B).

Partial dependence plots and individual conditional expectation plots can be used to analyse the relationship between features and the response. It can be seen from partial dependence plots (red line) in Figure 6 (upper row) that the probability of cells being siblings decreases when the value of feature ‘F3’ increases up to 3. This has a biological meaning: the number of pairwise barcodes where one cell stayed in the ground state and the other had undergone editing cannot be high if the cells are siblings, and the highest probability is when there are no such barcodes. The same relationship is true for the feature ‘F5’: the probability of cells being siblings decreases with increasing numbers of pairwise barcodes where one cell had undergone deletion and the other inversion. There is a linear relationship between the probability of cells being siblings and the number of pairwise barcodes where both cells have undergone deletions (feature ‘F4’). These dependences for ‘F3’ and ‘F4’ become weaker when cells get further apart on the lineage tree (lower rows). The conditional expectation plots (grey lines) demonstrate that at the individual level, there are more complex relationships between features.

Lineage reconstruction of in silico dataset: deep trees
The performance of the AMbeRland-TR algorithm was consistent between reconstruction of trees with 1024 cells and 4096 cells, and in both cases outperformed the reconstruction by the maximum parsimony method (Figure 7A). On average, there was a 50% improvement in performance for the normalized RF score in comparison to the maximum parsimony method. Improvement in reconstruction quality comes at higher computational time requirements. It takes 54 min on average to reconstruct a tree with 4096 tips using the AMbeRland-TR algorithm, which is two times longer than the time required for the maximum parsimony method (Figure 7B).

Computational requirements
The computational cost of the AMbeRland-TR algorithm has the following components: extracting features from barcode data; extracting relationships between cells from the ground truth trees; training the ML model for each tree level; and reconstructing trees for testing data. We have combined all procedures into two tasks: (i) training data preparation and ML model training; and (ii) testing data preparation and tree reconstruction. For the purpose of evaluating time requirements, we assume that training and testing data have only a single cell colony. For deep trees, as shown in the previous section, it is enough to train an ML model on a single tree. For shallow trees, the training dataset should contain enough trees to accommodate the variety of barcode combinations, but this should not be a burden as training data preparation for shallow trees is computationally fast. Overall, the time required for training and testing tasks was similar over the range of cell numbers we investigated (Figure 8). For example, for a tree with 10 000 cells, it would take ∼24 h to train the models and 24 h to reconstruct a tree. All calculations were performed on a MacBook Pro with a 2.4-GHz 8-core processor.
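
Of the components listed above, tree reconstruction follows the level-by-level clustering described in Materials and Methods. The sketch below is one possible Python reading of that description, not the authors' R code: `merge_level`, `reconstruct`, the fixed thresholds and the toy probability functions are hypothetical stand-ins for the per-level GBM predictions and the estimated thresholds, and the rule that each cluster is merged at most once per level is an assumption made to keep the tree binary.

```python
from itertools import combinations


def leaves(cluster):
    """Flatten a nested-tuple cluster into a list of cell names."""
    if isinstance(cluster, tuple):
        return [name for child in cluster for name in leaves(child)]
    return [cluster]


def merge_level(clusters, pair_prob, threshold):
    """Join cluster pairs whose best cell-to-cell probability exceeds the threshold."""
    scored = []
    for i, j in combinations(range(len(clusters)), 2):
        # Linkage rule from the text: maximum over any elements of the two clusters.
        score = max(pair_prob(a, b) for a in leaves(clusters[i]) for b in leaves(clusters[j]))
        scored.append((score, i, j))
    scored.sort(key=lambda item: -item[0])  # highest probability first
    used, merged = set(), []
    for score, i, j in scored:
        if score > threshold and i not in used and j not in used:
            merged.append((clusters[i], clusters[j]))
            used.update((i, j))
    # Clusters that found no partner are carried to the next level unchanged.
    return merged + [c for k, c in enumerate(clusters) if k not in used]


def reconstruct(cells, prob_by_level, thresholds):
    """Apply one merge pass per tree level until one or two clusters remain."""
    clusters = list(cells)
    for pair_prob, threshold in zip(prob_by_level, thresholds):
        if len(clusters) <= 2:
            break
        clusters = merge_level(clusters, pair_prob, threshold)
    return tuple(clusters) if len(clusters) > 1 else clusters[0]


# Toy probabilities standing in for model predictions at level 1 and level 2.
siblings = {frozenset("AB"), frozenset("CD"), frozenset("EF"), frozenset("GH")}
group = {name: index // 4 for index, name in enumerate("ABCDEFGH")}


def level1(a, b):
    return 0.9 if frozenset((a, b)) in siblings else 0.1


def level2(a, b):
    return 0.8 if group[a] == group[b] else 0.1


tree = reconstruct("ABCDEFGH", [level1, level2], thresholds=[0.5, 0.5])
```

In this sketch every pair of clusters is scored at each level, so the work grows quadratically with the number of clusters, which gives some intuition for why reconstruction time rises steeply for deep trees.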
[Figure 3, panels A–D. Horizontal axes: number of divisions (A), μ (B), α (C) and log10 μ (D); vertical axes: frequency or density.]
Figure 3. Parameter estimation. (A) Distribution of the number of divisions in the mouse embryonic stem cell experiment dataset. (B) Posterior distribution of editing rate (μ) in the mouse embryonic stem cell experiment dataset. (C) Posterior distribution of the probability that an edited unit has the state ‘2’ (α) in the mouse embryonic stem cell experiment dataset. (D) Posterior distribution of editing rate (μ) in the fly organ development dataset.
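
The posteriors in Figure 3 rest on the likelihood in equation (1). The Python sketch below evaluates that likelihood for a toy colony. It only illustrates the formula and does not reproduce the adaptive multiple importance sampling performed with AMISEpi. The function names, the data layout (a list of colonies, each a list of `(divisions, barcode)` pairs) and the numerical values of μ and α are assumptions chosen for the example.

```python
import math


def unit_probability(state, divisions, mu, alpha, ground=1):
    """Probability of observing one recording unit in `state` after `divisions` divisions."""
    if state == ground:
        # The unit escaped editing at every division.
        return (1.0 - mu) ** divisions
    # Edited at some division k after staying unedited for k - 1 divisions.
    return alpha[state] * sum((1.0 - mu) ** (k - 1) * mu for k in range(1, divisions + 1))


def log_likelihood(colonies, mu, alpha, ground=1):
    """Logarithm of the product over colonies, cells and recording units."""
    total = 0.0
    for colony in colonies:
        for divisions, barcode in colony:
            for state in barcode:
                total += math.log(unit_probability(state, divisions, mu, alpha, ground))
    return total


# Toy data: one colony, two cells after 3 divisions, 4 recording units each.
colonies = [[(3, [1, 0, 2, 1]), (3, [1, 1, 2, 0])]]
alpha = {0: 0.52, 2: 0.48}  # illustrative values that sum to 1

value = log_likelihood(colonies, mu=0.15, alpha=alpha)

# The sum over k is geometric, so it equals 1 - (1 - mu)^d.
closed_form = alpha[2] * (1.0 - (1.0 - 0.15) ** 3)
```

Because the geometric sum collapses to $1 - (1 - \mu)^d$, the probabilities of the ground state and of all edited states add up to one for any d, which is a quick consistency check on an implementation.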
[Figure 4. Panel columns: Targets: 10 and Targets: 20; panel rows: RF, TRP, QRT and CLI; horizontal axis: number of states (2 to 6); vertical axis: score; methods: AMbeRland-TR and MaxParsymony.]
Figure 4. Lineage reconstruction accuracy for the in silico dataset, with L = 10 or L = 20 recording units, and the number of possible edits ranging from 2 to 6. Each bar summarizes the score of 100 lineage tree reconstruction tests. We denote our ML-based approach as ‘AMbeRland-TR’ and lineage reconstruction using the maximum parsimony approach as ‘MaxParsymony’. Accuracy metrics are normalized RF score, TRP score, QRT score and CLI score.
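
The in silico colonies summarized in Figure 4 come from the stochastic simulation of division and barcode editing described in Materials and Methods. A minimal Python sketch of such a simulation is given below. The function name, the seed handling and the choice to edit each daughter cell independently after every division are assumptions made for illustration, and the parameter values are examples rather than the fitted estimates.

```python
import random


def simulate_colony(divisions, units, mu, alpha, ground=1, seed=0):
    """Simulate synchronous divisions with irreversible editing of recording units.

    Returns the barcodes of the 2**divisions final cells. Siblings are adjacent,
    so the parent of cell i in the previous generation has index i // 2.
    """
    rng = random.Random(seed)
    states, weights = zip(*sorted(alpha.items()))
    cells = [[ground] * units]  # the founder carries an unedited array
    for _ in range(divisions):
        next_generation = []
        for parent in cells:
            for _ in range(2):  # each cell gives rise to two daughters
                child = [
                    # Only units still in the ground state can be edited.
                    rng.choices(states, weights)[0]
                    if unit == ground and rng.random() < mu
                    else unit
                    for unit in parent
                ]
                next_generation.append(child)
        cells = next_generation
    return cells


# Example: 4 divisions, 10 recording units, two equally likely edited states.
colony = simulate_colony(divisions=4, units=10, mu=0.15, alpha={0: 0.5, 2: 0.5})
```

Keeping siblings next to each other in the returned list means the ground truth lineage is implied by the cell index, so no separate tree structure is needed for small experiments of this kind.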
Figure 5. Lineage tree reconstruction accuracy for the mouse embryonic stem cell dataset: (A) pairwise comparisons and (B) distribution of scores. We denote our ML-based approach as ‘AMbeRland-TR’ and lineage reconstruction using the maximum parsimony approach as ‘MaxParsymony’. Accuracy metrics are normalized RF score, TRP score, QRT score and CLI score.

Figure 6. Interpretation of the model for the mouse embryonic stem cell dataset. (A) Relative importance of features. (B) Average partial dependence (red line) and individual conditional expectation (grey line) of cell relatedness probability on features. Rows correspond to levels of lineage tree hierarchy, with the lowest level (siblings) on the top.

DISCUSSION
Machine learning is becoming an important tool that has considerable potential in biology (36), genetics and genomics, including the annotation of sequence elements and epigenetic, proteomic and metabolomic data (37), or solving problems arising in population and evolutionary genetics (38). So far, it has been underutilized for lineage tree reconstruction. In this study, we introduced the framework for constructing features and training an ML algorithm for experimental systems based on mutating synthetic genomic barcodes. We explored how the barcoding configuration influenced the performance of the algorithm by using a collection of simulated and biological datasets.

There is no universal method to quantify topological similarities between lineage trees (39). The most widely used
score is the RF score, but it is insensitive to discrepancies at higher levels of lineage tree structures [Figure 2 (iv)]. Assigning the correct pathway that cells follow during organism or cancer development is necessary in order to understand tissue-type differentiation. The QRT score systematically showed lower values, indicating higher similarity between ground truth and reconstruction, even when reconstructed lineage trees lacked any structure [Figure 2 (vi)]. Therefore, calculating a combination of accuracy scores is necessary in assessing the quality of lineage tree reconstruction.

Figure 7. Lineage reconstruction for deep trees. (A) Normalized RF score. (B) Time requirements for the algorithm.

Our in silico lineage reconstruction experiments showed that the ML-based approach is able to take advantage of high complexity of relationships between processes
governing cell division and barcode editing. Under ideal conditions, i.e. no noise in observations and a large training dataset, it outperformed other statistical approaches by 19–62% for scores evaluating various aspects of lineage reconstruction (Figure 4). The experiments also showed that when engineering the barcode system, it is more desirable to increase the number of recording units on barcode arrays than to increase the number of possible mutation states. However, increasing the number of edits from two states (i.e. single possible mutation) to three states (two possible mutations) could improve lineage reconstruction accuracy by at least 50% on average.

[Figure 8, panels (A) and (B): log10(Time) versus number of cells (0 to 15 000), with reference levels at 1 second, 1 minute, 1 hour and 24 hours.]
Figure 8. Time requirements for the algorithm. (A) Training data preparation and ML model training. (B) Testing data preparation and tree reconstruction.

The two in vitro datasets represent different lineage tracing configurations: number of states (3 versus 2) and number of recording units (10 versus 2943). The ML approach performed better than the Hamming distance approach on both datasets. The difference was not as striking as for the simulated data. Possible explanations include the following: much smaller training dataset, larger heterogeneity between colonies or the presence of noise from experimental readouts.

From the model fitted to the mouse embryonic stem cell dataset, we can get insight into the functional relationship between experimentally observed barcode values and cell relatedness. For all levels of lineage tree hierarchy, we found that the features had a relative influence in the range from

of pairwise barcodes where one cell was in the ground state and the other was inverted. This feature had lower influence for higher relatedness levels.

Our lineage reconstruction of the intMEMOIR dataset submitted to the Allen Institute Cell Lineage Reconstruction DREAM Challenge was the best-performing algorithm (mean RF score of 0.53 and TRP score of 0.52) (15). It was the best ranking method when benchmarked against a Bayesian phylogenetic framework (40). By training the ML model to predict higher level relationships between cells, we have been able to further improve performance. We achieved a mean RF score of 0.31 and a mean TRP score of 0.41 (Figure 5).

The area of lineage tracing is expanding fast, with new tools being developed at the experimental and computational levels. Jointly profiling DNA methylation, chromatin accessibility, gene expression and lineage information in single cells was made possible by developing an inducible lineage tracing mouse model with extremely large lineage barcode diversity (41). A computational pipeline that predicts cell lineages over several cell divisions from transcriptomic data alone was devised by leveraging genes displaying conserved expression levels over cell divisions (42). Another important direction is to understand cell fate transitions during development. An internal cellular clock could be recovered by integrating single-cell transcriptomics with lineage tracing (43,44). The ML approach has the potential to integrate barcode recordings with additional information, such as population dynamic parameters (40), single-cell gene expression (45), proteomics (46), microsatellite mutations (47) or clonal correlations (48).

A key limitation of our proposed approach is that it requires the ground truth dataset, i.e. recorded barcodes and tracked lineage histories. Except in the mouse embryonic stem cell experiment (15), such data are not available. One solution would be to train ML on simulated data. Having a global database with data on different lineage tracing configurations and results could be a starting point for accumulating knowledge as required for simulations. Under time and budget restrictions, having ground truth data on a single lineage tree makes it possible to train ML models, as our analysis on deep trees indicates. Future work should
11% to 24%; i.e. none of the six predictors had zero influence explore these possibilities and evaluate how to practically
(Figure 6A). However, the ranking of features varied with go about the process of training and improving ML models
the relationship level. When predicting whether cells are sib- for reconstructing whole organ or even whole body lineage
lings (level 1), the highest relative influence was the number trees.
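
To make the comparison between the Hamming distance approach and the pairwise barcode predictors discussed above more concrete, the following minimal Python sketch computes both quantities for two cells. It is an illustration under stated assumptions, not the analysis code used in this study: the state encoding ("0", "1", "2"), the function names and the example barcodes are hypothetical, and counting unordered state pairs is only one natural way in which three states would yield six pairwise predictors.

```python
from collections import Counter
from itertools import combinations_with_replacement

# Hypothetical encoding of three barcode states, chosen only for this sketch;
# which symbol means ground, inverted or deleted is an assumption.
STATES = ("0", "1", "2")


def hamming_distance(a: str, b: str) -> int:
    """Number of recording units at which two barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same number of recording units")
    return sum(x != y for x, y in zip(a, b))


def pair_state_counts(a: str, b: str) -> dict:
    """Count, over all recording units, each unordered pair of states.

    With three states there are six unordered pairs, so this returns six
    counts that could serve as predictors for a pair of cells (assumption).
    """
    if len(a) != len(b):
        raise ValueError("barcodes must have the same number of recording units")
    observed = Counter(tuple(sorted(pair)) for pair in zip(a, b))
    return {
        pair: observed.get(pair, 0)
        for pair in combinations_with_replacement(STATES, 2)
    }


# Two made-up barcodes with 10 recording units each.
cell_a = "0012011200"
cell_b = "0102011220"
print(hamming_distance(cell_a, cell_b))
print(pair_state_counts(cell_a, cell_b))
```

The Hamming distance collapses every mismatch into a single number, whereas the pair counts keep track of which states disagree. A supervised model trained on such counts can therefore weight, for example, a ground/inverted mismatch differently from other mismatches, which is consistent with the feature ranking varying across relationship levels.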
DATA AVAILABILITY

Analysis code used in this study can be accessed at the following URL: https://github.com/rretkute/AMbeRlandTR (permanent DOI: 10.5281/zenodo.8227619).

ACKNOWLEDGEMENTS

We thank the anonymous reviewers for valuable comments that improved the quality of the paper.

FUNDING

No external funding.

Conflict of interest statement. None declared.

REFERENCES

1. Kapli,P., Yang,Z. and Telford,M.J. (2020) Phylogenetic tree building in the genomic age. Nat. Rev. Genet., 21, 428–444.
2. McKenna,A. and Gagnon,J.A. (2019) Recording development with single cell dynamic lineage tracing. Development, 146, dev169730.
3. Kretzschmar,K. and Watt,F.M. (2012) Lineage tracing. Cell, 148, 33–45.
4. Ceto,S., Sekiguchi,K.J., Takashima,Y., Nimmerjahn,A. and Tuszynski,M.J. (2020) Neural stem cell grafts form extensive synaptic networks that integrate with host circuits after spinal cord injury. Cell Stem Cell, 27, 430–440.
5. Quinn,J.J., Jones,M.G., Okimoto,R.A., Nanjo,S., Chan,M.M., Yosef,N., Bivona,T.G. and Weissman,J.S. (2021) Single-cell lineages reveal the rates, routes, and drivers of metastasis in cancer xenografts. Science, 371, eabc1944.
6. Yang,D., Jones,M.G., Naranjo,S., Rideout,W.M., Min,K.H., Ho,R., Wu,W., Replogle,J.M., Page,J.L., Quinn,J.J. et al. (2022) Lineage tracing reveals the phylodynamics, plasticity, and paths of tumor evolution. Cell, 185, 1905–1923.
7. Sulston,J.E., Schierenberg,E., White,J.G. and Thomson,J.N. (1983) The embryonic cell lineage of the nematode Caenorhabditis elegans. Dev. Biol., 100, 64–119.
8. Chow,K.H.K., Budde,M.W., Granados,A.A., Cabrera,M., Yoon,S., Cho,S., Huang,T.H., Koulena,N., Frieda,K.L., Cai,L. et al. (2021) Imaging cell lineage with a synthetic digital recording system. Science, 372, eabb3099.
9. Frieda,K.L., Linton,J.M., Hormoz,S., Choi,J., Chow,K.-H.K., Singer,Z.S., Budde,M.W., Elowitz,M.B. and Cai,L. (2017) Synthetic recording and in situ readout of lineage information in single cells. Nature, 541, 107–111.
10. Liu,K., Deng,S., Ye,C., Yao,Z., Wang,J., Gong,H., Liu,L. and He,X. (2021) Mapping single-cell-resolution cell phylogeny reveals cell population dynamics during organ development. Nat. Methods, 18, 1506–1514.
11. Chen,C., Liao,Y. and Peng,G. (2022) Connecting past and present: single-cell lineage tracing. Protein Cell, 13, 790–807.
12. Stadler,T., Pybus,O.G. and Stumpf,M.P.H. (2021) Phylodynamics for cell biologists. Science, 371, 6526.
13. Paradis,E. (2012) In: Analysis of Phylogenetics and Evolution with R. Springer, NY.
14. Felsenstein,J. (2004) In: Inferring Phylogenies. Sinauer Associates, Sunderland, MA.
15. Gong,W., Granados,A.A., Hu,J., Jones,M.G., Raz,O., Salvador-Martinez,I., Zhang,H., Chow,K.K., Kwak,I.Y., Retkute,R. et al. (2021) Benchmarked approaches for reconstruction of in vitro cell lineages and in silico models of C. elegans and M. musculus developmental trees. Cell Syst., 18, 810–826.
16. Gong,W., Kim,H.J., Garry,D.J. and Kwak,I.J. (2022) Single cell lineage reconstruction using distance-based algorithms and the R package, DCLEAR. BMC Bioinformatics, 23, 103.
17. Jones,M.G., Khodaverdian,A., Quinn,J.J., Chan,M.M., Hussmann,J.A., Wang,R., Xu,C., Weissman,J.S. and Yosef,N. (2020) Inference of single-cell phylogenies from lineage tracing data using Cassiopeia. Genome Biol., 21, 92.
18. Cardona,G., Rossello,F. and Valiente,G. (2008) Extended Newick: it is time for a standard representation of phylogenetic networks. BMC Bioinformatics, 9, 532.
19. Salvador-Martinez,I., Grillo,M., Averof,M. and Telford,M.J. (2019) Is it possible to reconstruct an accurate cell lineage using CRISPR recorders? eLife, 8, e40292.
20. Retkute,R., Touloupou,P., Basanez,K.G., Hollingsworth,D. and Spencer,S.E.F. (2021) Integrating geostatistical maps and infectious disease transmission models using adaptive multiple importance sampling. Ann. Appl. Stat., 15, 1980–1998.
21. Schliep,K.P. (2010) phangorn: phylogenetic analysis in R. Bioinformatics, 27, 592–593.
22. Greenwell,B., Boehmke,B., Cunningham,J. and GBM Developers (2020) gbm: generalized boosted regression models. R package version 2.1.8.
23. Friedman,J.H., Hastie,T. and Tibshirani,R. (2000) Additive logistic regression: a statistical view of boosting. Ann. Stat., 28, 337–407.
24. Goldstein,A., Kapelner,A., Bleich,J. and Pitkin,E. (2015) Peeking inside the black box: visualizing statistical learning with plots of individual conditional expectation. J. Comput. Graph. Stat., 24, 44–65.
25. Doran,H.C. (2010) MiscPsycho: an R package for miscellaneous psychometric analyses.
26. Gronau,I. and Moran,S. (2007) Optimal implementations of UPGMA and other common clustering algorithms. Inf. Process. Lett., 104, 205–210.
27. Robinson,D.F. and Foulds,L.R. (1981) Comparison of phylogenetic trees. Math. Biosci., 53, 131–147.
28. Critchlow,D.E., Pearl,D.K. and Qian,C. (1996) The triples distance for rooted bifurcating phylogenetic trees. Syst. Biol., 45, 323–334.
29. Brodal,G.S., Fagerberg,R., Mailund,T., Pedersen,C.N.S. and Sand,A. (2013) Efficient algorithms for computing the triplet and quartet distance between trees of arbitrary degree. In: SODA ’13: Proceedings of the Twenty-Fourth Annual ACM–SIAM Symposium on Discrete Algorithms. pp. 1814–1832.
30. Smith,M.R. (2019) Bayesian and parsimony approaches reconstruct informative trees from simulated morphological datasets. Biol. Lett., 15, 20180632.
31. Estabrook,G.F., McMorris,F.R. and Meacham,C.A. (1985) Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst. Biol., 34, 193–200.
32. Sand,A., Holt,M.K., Johansen,J., Brodal,G.S., Mailund,T. and Pedersen,C.N.S. (2014) tqDist: a library for computing the quartet and triplet distances between binary or general trees. Bioinformatics, 30, 2079–2080.
33. Smith,M.R. (2019) Quartet: comparison of phylogenetic trees using quartet and split measures. R package version 1.2.2.
34. Smith,M.R. (2020) Information theoretic generalized Robinson–Foulds metrics for comparing phylogenetic trees. Bioinformatics, 36, 5007–5013.
35. Smith,M.R. (2020) TreeDist: distances between phylogenetic trees. R package version 2.2.0. Comprehensive R Archive Network.
36. Greener,J.G., Kandathil,S.M., Moffat,L. and Jones,D.T. (2022) A guide to machine learning for biologists. Nat. Rev. Mol. Cell Biol., 23, 40–55.
37. Libbrecht,M. and Noble,W. (2015) Machine learning applications in genetics and genomics. Nat. Rev. Genet., 16, 321–332.
38. Schrider,D.R. and Kern,A.D. (2018) Supervised machine learning for population genetics: a new paradigm. Trends Genet., 34, 301–312.
39. Kim,J., Rosenberg,N.A. and Palacios,J.A. (2020) Distance metrics for ranked evolutionary trees. Proc. Natl Acad. Sci. U.S.A., 117, 28876–28886.
40. Seidel,S. and Stadler,T. (2022) TiDeTree: a Bayesian phylogenetic framework to estimate single-cell trees and population dynamic parameters from genetic lineage tracing data. Proc. R. Soc. B, 289, 20221844.
41. Li,L., Bowling,S., Yu,Q., McGeary,S.E., Alcedo,K., Lemke,B., Ferreira,M., Klein,A.M., Wang,S.H. and Camargo,F.D. (2023) A mouse model with high clonal barcode diversity for joint lineage, transcriptomic, and epigenomic profiling in single cells. bioRxiv doi: https://doi.org/10.1101/2023.01.29.526062, 31 January 2023, preprint: not peer reviewed.
42. Eisele,A.S., Tarbier,M., Dormann,A.A., Pelechano,V. and Suter,D.M. (2022) Barcode-free prediction of cell lineages from scRNA-seq datasets. bioRxiv doi: https://doi.org/10.1101/2022.09.20.508646, 20 September 2022, preprint: not peer reviewed.
43. Wang,S.W., Herriges,M.J., Hurley,K., Kotton,D.N. and Klein,A.M. (2022) CoSpar identifies early cell fate biases from single-cell transcriptomic and lineage information. Nat. Biotechnol., 40, 1066–1074.
44. Wang,K., Hou,L., Lu,Z., Wang,X., Zi,Z., Zhai,W., He,X., Curtis,C., Zhou,D. and Hu,Z. (2022) Cell division history encodes directional information of fate transitions. bioRxiv doi: https://doi.org/10.1101/2022.10.06.511094, 07 October 2022, preprint: not peer reviewed.
45. Giecold,G., Marco,E., Garcia,S.P., Trippa,L. and Yuan,G.C. (2016) Robust lineage reconstruction from high-dimensional single-cell data. Nucleic Acids Res., 44, e122.
46. Pan,X., Li,H. and Zhang,X. (2022) TedSim: temporal dynamics simulation of single-cell RNA sequencing data and cell division history. Nucleic Acids Res., 50, 4272–4288.
47. Chapal-Ilani,N., Maruvka,Y.E., Spiro,A., Reizel,Y., Adar,R., Shlush,L.I. and Shapiro,E. (2013) Comparing algorithms that reconstruct cell lineage trees utilizing information on microsatellite mutations. PLoS Comput. Biol., 9, e1003297.
48. Weinreb,C. and Klein,A.M. (2020) Lineage reconstruction from clonal correlations. Proc. Natl Acad. Sci. U.S.A., 117, 17041–17048.

© The Author(s) 2023. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
