https://doi.org/10.1093/nargab/lqad077
1 School of Natural and Environmental Sciences, Newcastle University, Newcastle upon Tyne NE1 7RU, UK, 2 Independent researcher, London, SW7 2BX, UK and 3 Department of Plant Sciences, University of Cambridge, Downing Street, Cambridge CB2 3EA, UK
* To whom correspondence should be addressed. Tel: +44 1223 333900; Email: rr614@cam.ac.uk
© The Author(s) 2023. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
2 NAR Genomics and Bioinformatics, 2023, Vol. 5, No. 3
possible edit states. To evaluate the performance of the algorithm, we used two sets of in silico data: (i) shallow trees (up to 100 cells), simulated using a parametrization based on the in vitro dataset for mouse embryonic stem cells (15); and (ii) deep trees (up to 10 000 cells), simulated using a parametrization based on the in vitro dataset for fly organ development (10). Further, we reconstructed trees for the in vitro dataset for mouse embryonic stem cells (15). We measured the accuracy of lineage reconstruction using the normalized Robinson–Foulds (RF) score. For trees with up to 100 cells, we calculated four metrics: the normalized RF score, the triplet (TRP) score, the quartet (QRT) score and the clustering information (CLI) score. We compared our reconstruc-

tained as part of the Allen Institute Cell Lineage Reconstruction Challenge through Synapse ID syn20692755 (15). In the experiment, the recording array (barcode) consisted of L = 10 recording units (8). Each recording unit was in one of three states: ground state (represented as ‘1’), a deletion (represented as ‘0’) or an inversion (represented as ‘2’) of the DNA sequence. Each colony was started from an individual cell and colony growth was observed for 48 h. The experimental data comprise (i) an array of intMEMOIR readouts as a text file (also called matrix) and (ii) the ground truth lineage for the colony. Tree-like data structures were provided as a Newick file (18). The ground truth cell lineage trees were obtained from video-microscopy data (8).

Fly organ development. A SMALT system used a barcode with 16 iSceI binding motifs present with equal distance throughout the sequence (10). The recording array consisted of L = 2943 recording units. Each recording unit was in one of two states: ground state (represented as ‘0’) or mutated state (represented as ‘1’). Binary sequence data were available for two specimens: one had 5002 cells and the other 5420 cells. The datasets correspond to fly organ development from embryo to late-third instar larvae (10).

The in silico datasets

Stochastic simulation of cell division and barcode editing. Accumulation of stochastic mutations during cell division was simulated similarly to (8,19). Each cell colony starts from an individual cell carrying an unedited array with a length of L units. The model assumes that cells divide synchronously and at a constant rate. The initial cell then undergoes a series of cell divisions (Figure 1A). After d divisions, the colony consists of N cells, where $N = 2^d$.

After every cell division, each unedited target can mutate with a given probability μ to one of several possible edited states s ∈ S. The state is chosen according to a probability $\alpha_s$. The process of recording is irreversible, and once a recording unit is edited, it can no longer change (Figure 1B).

The probability that a recording unit has not been edited after d divisions is equal to $P_{\text{not edited}}(d, \mu, \alpha) = (1 - \mu)^d$. The ground truth lineage can provide the expected minimum number of divisions each cell has undergone in a colony, for example by counting the number of ancestors for phylogenetic nodes.

The probability of an edited unit going to state s is denoted by $\alpha_s$, such that $\sum_{s=1}^{|S|} \alpha_s = 1$, where S is the set of all possible edits and |S| denotes the size of set S. In the case of the mouse embryonic stem cell colonies dataset, |S| = 2 and S = {0, 2}, and in the case of the fly organ development dataset, |S| = 1 and S = {1}. The probability

$$L(\mu, \alpha) = \prod_{c=1}^{C} \prod_{i=1}^{n_c} \prod_{l=1}^{L} P_{o_{c,i,l}}(d_{c,i}, \mu, \alpha), \qquad (1)$$

where C is the total number of colonies, $n_c$ is the number of cells in a colony c, $d_{c,i}$ is the number of divisions for cell i in colony c, $o_{c,i,l}$ is the observed state for unit l in cell i and colony c, and $o_{c,i,l} \in \{\text{not edited}, \text{edited}\}$.

For parameter estimation, we used the R package AMISEpi, which has an implementation of adaptive multiple importance sampling for Bayesian analysis (20). We set the priors to be uniformly distributed: μ ∼ U[0, 1] and $\alpha_i$ ∼ U[0, 1].

Simulation of shallow trees (up to 100 cells). We used parameters (mutation rate and probability of mutations) estimated from the in vitro mouse embryonic stem cell dataset (15). We used data on all colonies for parameter estimation.

Simulation of deep trees (up to 10 000 cells). We used parameters (mutation rate) estimated from the in vitro fly organ development dataset (10). We used both datasets to estimate the mutation rate per division/target. We assumed that there were 13 divisions, which was closest to the number of cells available for the dataset ($2^{13} = 8192$).

Lineage tree reconstruction using the ML approach (AMbeRland-TR)

Feature engineering. For illustration purposes, we assume that the barcode units can be in one of three possible states: ground state, a deletion or an inversion (Figure 1B). We propose the following classes of features based on pairwise comparison of barcodes: number of units that have not been edited (F1); number of units that have the same edit (F4 and F6); number of units that have a single edit (F2 and F3); and number of units that have different edits (F5) (Figure 1C).

This approach can be extended to any number of possible edited states by including all possible pairwise combinations
of the extended record set {not mutated, S} as predictors for the ML model. This should produce (|S| + 1)! predictors.

We used the R package phangorn (21) to extract information on relatedness between cells from the ground truth cell lineage trees. For each level of the tree, this produced two lists of cell pairs for each colony: cells that share an ancestor at level t and cells that do not share an ancestor at level t. Two examples of embedding single-cell data into a feature space for ML training are shown in Figure 1D. The upper row corresponds to level 1 (sibling cells) and the lower row corresponds to level 2 (cousin cells). State indicates whether the relationship is true or false.

ML training, prediction and interpretation. We used a gradient boosting machine (GBM) to implement the outlined ML approach. All calculations were performed in R using the package gbm (22). The following options were used to train the GBM model: distribution = ‘bernoulli’; n.trees = 1000; interaction.depth = 10; n.minobsinnode = 5; cv.folds = 5; and train.fraction = 0.5.

We use the relative importance of features, partial dependence plots and individual conditional expectation plots for model interpretation. Relative importance is based on the number of times a predictor is selected when training the model (23). Higher values of relative importance indicate larger influence on the response. Individual conditional expectation curves visualize the partial relationship between the predicted response and a feature for individual datasets (24). Partial dependence plots show the relationship averaged over all observations, which makes it easier to extract expected trends (24).

Clustering. We applied a custom hierarchical clustering method for building a cell lineage tree from predicted probabilities. Clustering begins at the lowest tree level, where all clusters contain an individual cell. Each possible cell pair is then ranked according to the predicted probability that they share an ancestor at this level. At consecutively increasing levels, pairwise comparison is performed between each pair of lower level clusters, where the calculated probability is the maximum between any elements of the two clusters. Cluster pairs are ordered again according to this probability and are assumed to have the same parent node if its value is above the estimated threshold for this level. This process is repeated until one or two clusters are left. We assume only binary trees.

Lineage tree reconstruction using a maximum parsimony tree reconstruction method

Maximum parsimony based methods try to find the minimum number of changes necessary to describe the data for a given tree. We performed a maximum parsimony reconstruction using the R package phangorn (21). Initial tree
Scores
We assessed the accuracy of lineage tree reconstruction using four metrics: normalized RF score, TRP score, QRT score and CLI score. All scores have values between 0 and 1, with smaller values indicating larger similarity between the ground truth and the reconstructed tree.
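As a rough illustration of how a normalized RF score behaves, the sketch below compares two rooted trees written as nested tuples of cell labels. It counts the clades found in only one of the two trees and divides by the total number of clades, so 0 means identical topologies and 1 means no shared clade. This is a rooted, clade-based variant chosen for brevity; the function names and the nested-tuple encoding are assumptions made for the example and are not the phangorn routines used in the paper.

```python
def clades(tree):
    """Collect the non-trivial clades (leaf sets of internal nodes) of a nested-tuple tree."""
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):          # a leaf label
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)                       # record the clade under this internal node
        return members

    root = leaves(tree)
    found.discard(root)                          # the root clade is shared by every tree
    return found


def normalized_rf(tree_a, tree_b):
    """Symmetric difference of clade sets, scaled to the range 0 to 1."""
    a, b = clades(tree_a), clades(tree_b)
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return len(a ^ b) / total


# Hypothetical four-cell colony: ground truth and two candidate reconstructions.
truth = (("c1", "c2"), ("c3", "c4"))
same_topology = (("c2", "c1"), ("c4", "c3"))
wrong_siblings = (("c1", "c3"), ("c2", "c4"))

print(normalized_rf(truth, same_topology))
print(normalized_rf(truth, wrong_siblings))
```

Swapping the order of siblings leaves the score unchanged, because only the membership of each clade matters.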
Estimating mutation rates

The mouse embryonic stem cell dataset. The resulting colonies had from 4 to 39 cells; in total, there were 1453 individual cells in the dataset. The distribution of the number of divisions in the ground truth lineage trees is shown in Figure 3A.

Posterior distributions for fitted editing rate and the probability that an edited unit has a state ‘2’ are shown in Figure 3B and C, respectively. We estimated the mean of marginal posterior distributions to be μ = 0.15 and α = 0.48. For simulation of cell division and record editing, we assume that all states have the same probability, i.e. $\alpha_s$ =

Partial dependence plots and individual conditional expectation plots can be used to analyse the relationship between features and the response. It can be seen from partial dependence plots (red line) in Figure 6 (upper row) that the probability of cells being siblings decreases when the value of feature ‘F3’ increases up to 3. This has a biological meaning: the number of pairwise barcodes where one cell stayed in the ground state and the other had undergone editing cannot be high if the cells are siblings, and the highest probability is when there are no such barcodes. The same relationship is true for the feature ‘F5’: the probability of cells being siblings decreases with increasing numbers of pairwise barcodes where one cell had undergone deletion
Lineage reconstruction of in silico dataset: shallow trees

Results of simulations are shown in Figure 4. Our ML-based algorithm outperformed the maximum parsimony based method in all four metrics. Comparing mean performance, we found a 43–50% improvement in the normalized RF score, a 19–32% improvement in the QRT score and a 36–45% improvement in the CLI score.

We found that increasing the number of target units had a stronger effect on lineage reconstruction accuracy than increasing the number of states. For a recording array using 20 targets instead of 10 targets, there was a 62% improvement in performance for the QRT score, a 56% improvement in performance for both the normalized RF score and CLI score, and a 28% improvement in performance for the TRP score. When the number of editing states was increased from 2 to 5, there was a 17% improvement in the normalized RF score for a recording array using 10 targets. There was a 68% improvement in the normalized RF score for a recording array using 20 targets. There was a 30% improvement in the QRT score for a recording array using 10 targets and a 49% improvement in the QRT score for a recording array using 20 targets.

Lineage tree reconstruction of the mouse embryonic stem cell dataset

A reconstruction was computed from the test dataset consisting of 30 cell colonies using only the intMEMOIR array readout (8). Figure 5A shows a pairwise comparison between the two methods. Most of the points are above the diagonal, indicating that our ML-based algorithm outperformed the maximum parsimony based method for all four metrics. Comparing mean performance, we found a 17–23% improvement in the normalized RF score, a 7–37% improvement in the TRP score, a 7–11% improvement in the QRT score and a 12–15% improvement in the CLI score (Figure 5B).

Lineage reconstruction of in silico dataset: deep trees

The performance of the AMbeRland-TR algorithm was consistent between reconstruction of trees with 1024 cells and 4096 cells, and in both cases outperformed the reconstruction by the maximum parsimony method (Figure 7A). On average, there was a 50% improvement in performance for the normalized RF score in comparison to the maximum parsimony method. Improvement in reconstruction quality comes at the cost of higher computational time requirements. It takes 54 min on average to reconstruct a tree with 4096 tips using the AMbeRland-TR algorithm, which is two times longer than the time required for the maximum parsimony method (Figure 7B).

Computational requirements

The computational cost of the AMbeRland-TR algorithm has the following components: extracting features from barcode data; extracting relationships between cells from the ground truth trees; training the ML model for each tree level; and reconstructing trees for testing data. We have combined all procedures into two tasks: (i) training data preparation and ML model training; and (ii) testing data preparation and tree reconstruction. For the purpose of evaluating time requirements, we assume that training and testing data have only a single cell colony. For deep trees, as shown in the previous section, it is enough to train an ML model on a single tree. For shallow trees, the training dataset should contain enough trees to accommodate the variety of barcode combinations, but this should not be a burden as training data preparation for shallow trees is computationally fast. Overall, the time required for training and testing tasks was similar across the range of cell numbers we investigated (Figure 8). For example, for a tree with 10 000 cells, it would take ∼24 h to train the models and 24 h to reconstruct a tree. All calculations were performed on a MacBook Pro with 2.4-GHz 8-core processor.
[Figure 3 plot, panels A–D. Horizontal axes: number of divisions (A), μ (B), α (C), log10 μ (D). Vertical axes: density or frequency.]
Figure 3. Parameter estimation. (A) Distribution of the number of divisions in the mouse embryonic stem cell experiment dataset. (B) Posterior distribution
of editing rate (μ) in the mouse embryonic stem cell experiment dataset. (C) Posterior distribution of the probability that an edited unit has the state ‘2’
(α) in the mouse embryonic stem cell experiment dataset. (D) Posterior distribution of editing rate (μ) in the fly organ development dataset.
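The likelihood in Equation (1), on which these posteriors are based, can be sketched in a few lines for the three-state mouse encoding. The sketch assumes that a unit stays unedited with probability (1 − μ)^d, and that an edited unit is an inversion (‘2’) with probability α and a deletion (‘0’) with probability 1 − α. That per-state split, the function name `log_likelihood` and the toy colonies are assumptions made for illustration. The code evaluates the log-likelihood only and does not reproduce the adaptive multiple importance sampling performed by AMISEpi.

```python
import math


def log_likelihood(colonies, mu, alpha):
    """Log of Equation (1) for barcodes written with '1' (ground), '0' (deletion), '2' (inversion).

    colonies: list of colonies; each colony is a list of (divisions, barcode) pairs.
    mu:       per-division editing probability, 0 < mu < 1.
    alpha:    probability that an edited unit is in state '2', 0 < alpha < 1.
    """
    total = 0.0
    for colony in colonies:
        for divisions, barcode in colony:
            p_unedited = (1.0 - mu) ** divisions
            p_state = {
                "1": p_unedited,                           # still in the ground state
                "2": (1.0 - p_unedited) * alpha,           # edited, inversion (assumed split)
                "0": (1.0 - p_unedited) * (1.0 - alpha),   # edited, deletion (assumed split)
            }
            for unit in barcode:
                total += math.log(p_state[unit])
    return total


# Hypothetical data: two small colonies, each cell given as (number of divisions, barcode).
toy_colonies = [
    [(2, "1101121111"), (2, "1101111211")],
    [(3, "1021111011"), (3, "1021121111"), (2, "1111111011")],
]

# Coarse grid scan as a stand-in for a proper posterior computation under uniform priors.
grid = [i / 20 for i in range(1, 20)]
best = max(((m, a) for m in grid for a in grid),
           key=lambda pair: log_likelihood(toy_colonies, *pair))
print(best)
```

With uniform priors on μ and α, the posterior is proportional to this likelihood, so the grid point with the highest value approximates the posterior mode for the toy data.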
[Figure 4 plot. Panel columns: Targets: 10 and Targets: 20. Panel rows: RF, TRP, QRT, CLI. Horizontal axis: Number of states (2–6). Vertical axis: Score.]
Figure 4. Lineage reconstruction accuracy for the in silico dataset, with L = 10 or L = 20 recording units, and the number of possible edits ranging
from 2 to 6. Each bar summarizes the score of 100 lineage tree reconstruction tests. We denote our ML-based approach as ‘AMbeRland-TR’ and lineage
reconstruction using the maximum parsimony approach as ‘MaxParsymony’. Accuracy metrics are normalized RF score, TRP score, QRT score and CLI
score.
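An in silico colony of the kind summarized in Figure 4 could be generated roughly as follows. The sketch follows the stochastic simulation described in the methods: divisions are synchronous, each unedited unit is edited with probability μ after every division, the edited state is drawn with probabilities α_s, and edits are irreversible. The function name, the use of ‘1’ as the ground state label (the mouse encoding) and the seed argument are assumptions made for the example; this is not the authors' simulation code.

```python
import random


def simulate_colony(n_divisions, n_units, mu, alphas, ground=1, seed=None):
    """Simulate barcodes for a colony after synchronous divisions.

    alphas maps each edited state to its probability alpha_s (values should sum to 1).
    Returns the barcodes of the final generation; cells 2k and 2k+1 are siblings.
    """
    rng = random.Random(seed)
    edit_states = list(alphas)
    weights = [alphas[state] for state in edit_states]
    cells = [[ground] * n_units]                  # founder cell with an unedited array
    for _ in range(n_divisions):
        next_generation = []
        for parent in cells:
            for _daughter in range(2):            # every cell divides into two daughters
                child = list(parent)              # daughters inherit the parental barcode
                for unit, state in enumerate(child):
                    # Only unedited units can change; edited units are never revisited.
                    if state == ground and rng.random() < mu:
                        child[unit] = rng.choices(edit_states, weights)[0]
                next_generation.append(child)
        cells = next_generation
    return cells


# Example: 3 divisions (8 cells), 10 recording units, two equally likely edits.
colony = simulate_colony(3, 10, mu=0.15, alphas={0: 0.5, 2: 0.5}, seed=1)
for barcode in colony:
    print("".join(str(state) for state in barcode))
```

Because the final generation is returned in division order, the ground truth lineage is implicit in the list positions, which is convenient for scoring a reconstruction against it.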
Figure 6. Interpretation of the model for the mouse embryonic stem cell dataset. (A) Relative importance of features. (B) Average partial dependence (red
line) and individual conditional expectation (grey line) of cell relatedness probability on features. Rows correspond to levels of lineage tree hierarchy, with
the lowest level (siblings) on the top.
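The two kinds of curve in panel B can be computed for any fitted model with a few lines of code. In the sketch below, an individual conditional expectation (ICE) curve is obtained by varying one feature over a grid while holding the other features of a single observation fixed, and the partial dependence curve is the pointwise average of the ICE curves. The toy `predict` function and the feature values are hypothetical stand-ins for the fitted GBM and the real pairwise features.

```python
def ice_curves(predict, rows, feature, grid):
    """One curve per observation: predictions as `feature` sweeps over `grid`."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = dict(row)          # copy so the original observation is untouched
            modified[feature] = value
            curve.append(predict(modified))
        curves.append(curve)
    return curves


def partial_dependence(curves):
    """Average the ICE curves pointwise across observations."""
    n = len(curves)
    return [sum(column) / n for column in zip(*curves)]


# Hypothetical model: relatedness probability rises with F1 and falls with F3.
def predict(row):
    return (row["F1"] / 10) / (1 + row["F3"])


rows = [{"F1": 10, "F3": 0}, {"F1": 5, "F3": 1}]
grid = [0, 1, 3]

curves = ice_curves(predict, rows, "F3", grid)   # the grey lines
average = partial_dependence(curves)             # the red line
print(curves)
print(average)
```

Plotting each ICE curve alongside the average shows whether a trend in the partial dependence holds for individual cell pairs or only on average.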
score is the RF score, but it is very insensitive to discrepancies at higher levels of lineage tree structures [Figure 2 (iv)]. Assigning the correct pathway that cells undergo during organism or cancer development is necessary in order to understand tissue-type differentiation. The QRT score systematically showed lower values, indicating higher similarity between ground truth and reconstruction, even when reconstructed lineage trees lacked any structure [Figure 2 (vi)]. Therefore, calculating a combination of accuracy scores is necessary in assessing the quality of lineage tree reconstruction.

Our in silico lineage reconstruction experiments showed that the ML-based approach is able to take advantage of the high complexity of relationships between processes governing cell division and barcode editing. Under ideal conditions, i.e. no noise in observations and a large training dataset, it outperformed other statistical approaches by 19–62% for scores evaluating various aspects of lineage reconstruction (Figure 4). The experiments also showed that when engineering the barcode system, it is more desirable to increase the number of recording units on barcode arrays than to increase the number of possible mutation states. However, increasing the number of edits from two states (i.e. a single possible mutation) to three states (two possible mutations) could improve lineage reconstruction accuracy by at least 50% on average.

Figure 7. Lineage reconstruction for deep trees. (A) Normalized RF score. (B) Time requirements for the algorithm.

Figure 8. Time requirements for the algorithm. (A) Training data preparation and ML model training. (B) Testing data preparation and tree reconstruction. [Plot axes: log10(Time) against number of cells.]

The two in vitro datasets represent different lineage tracing configurations: number of states (3 versus 2) and number of recording units (10 versus 2943). The ML approach performed better than the Hamming distance approach on both datasets. The difference was not as striking as for the simulated data. Possible explanations include the following: a much smaller training dataset, larger heterogeneity between colonies or the presence of noise from experimental readouts.

From the model fitted to the mouse embryonic stem cell dataset, we can get insight into the functional relationship between experimentally observed barcode values and cell relatedness. For all levels of lineage tree hierarchy, we found that the features had a relative influence in the range from 11% to 24%; i.e. none of the six predictors had zero influence (Figure 6A). However, the ranking of features varied with the relationship level. When predicting whether cells are siblings (level 1), the highest relative influence was the number

submitted to the Allen Institute Cell Lineage Reconstruction DREAM Challenge was the best-performing algorithm (mean RF score of 0.53 and TRP score of 0.52) (15). It was the best ranking method when benchmarked against a Bayesian phylogenetic framework (40). By training the ML model to predict higher level relationships between cells, we have been able to further improve performance. We achieved a mean RF score of 0.31 and a mean TRP score of 0.41 (Figure 5).

The area of lineage tracing is expanding fast, with new tools being developed at the experimental and computational levels. Jointly profiling DNA methylation, chromatin accessibility, gene expression and lineage information in single cells was made possible by developing an inducible lineage tracing mouse model with extremely large lineage barcode diversity (41). A computational pipeline that predicts cell lineages over several cell divisions from transcriptomic data alone was devised by leveraging genes displaying conserved expression levels over cell divisions (42). Another important direction is to understand cell fate transitions during development. An internal cellular clock could be recovered by integrating single-cell transcriptomics with lineage tracing (43,44). The ML approach has the potential to integrate barcode recordings with additional information, such as population dynamic parameters (40), single-cell gene expression (45), proteomics (46), microsatellite mutations (47) or clonal correlations (48).

A key limitation of our proposed approach is that it requires the ground truth dataset, i.e. recorded barcodes and tracked lineage histories. Except in the mouse embryonic stem cell experiment (15), such data are not available. One solution would be to train ML on simulated data. Having a global database with data on different lineage tracing configurations and results could be a starting point for accumulating the knowledge required for simulations. Under time and budget restrictions, having ground truth data on a single lineage tree makes it possible to train ML models, as our analysis on deep trees indicates. Future work should explore these possibilities and evaluate how to practically go about the process of training and improving ML models for reconstructing whole organ or even whole body lineage trees.
DATA AVAILABILITY

Analysis code used in this study can be accessed at the following URL: https://github.com/rretkute/AMbeRlandTR (permanent DOI: 10.5281/zenodo.8227619).

ACKNOWLEDGEMENTS

We thank the anonymous reviewers for valuable comments that improved the quality of the paper.

FUNDING

18. Cardona,G., Rossello,F. and Valiente,G. (2008) Extended Newick: it is time for a standard representation of phylogenetic networks. BMC Bioinformatics, 9, 532.
19. Salvador-Martinez,I., Grillo,M., Averof,M. and Telford,M.J. (2019) Is it possible to reconstruct an accurate cell lineage using CRISPR recorders? eLife, 8, e40292.
20. Retkute,R., Touloupou,P., Basanez,K.G., Hollingsworth,D. and Spencer,S.E.F. (2021) Integrating geostatistical maps and infectious disease transmission models using adaptive multiple importance sampling. Ann. Appl. Stat., 15, 1980–1998.
21. Schliep,K.P. (2010) phangorn: phylogenetic analysis in R. Bioinformatics, 27, 592–593.
22. Greenwell,B., Boehmke,B., Cunningham,J. and GBM Developers (2020) gbm: generalized boosted regression models. R package version 2.1.8.

scRNA-seq datasets. bioRxiv doi: https://doi.org/10.1101/2022.09.20.508646, 20 September 2022, preprint: not peer reviewed.
43. Wang,S.W., Herriges,M.J., Hurley,K., Kotton,D.N. and Klein,A.M. (2022) CoSpar identifies early cell fate biases from single-cell transcriptomic and lineage information. Nat. Biotechnol., 40, 1066–1074.
44. Wang,K., Hou,L., Lu,Z., Wang,X., Zi,Z., Zhai,W., He,X., Curtis,C., Zhou,D. and Hu,Z. (2022) Cell division history encodes directional information of fate transitions. bioRxiv doi: https://doi.org/10.1101/2022.10.06.511094, 07 October 2022, preprint: not peer reviewed.
45. Giecold,G., Marco,E., Garcia,S.P., Trippa,L. and Yuan,G.C. (2016) Robust lineage reconstruction from high-dimensional single-cell data. Nucleic Acids Res., 44, e122.
46. Pan,X., Li,H. and Zhang,X. (2022) TedSim: temporal dynamics simulation of single-cell RNA sequencing data and cell division history. Nucleic Acids Res., 50, 4272–4288.
47. Chapal-Ilani,N., Maruvka,Y.E., Spiro,A., Reizel,Y., Adar,R., Shlush,L.I. and Shapiro,E. (2013) Comparing algorithms that reconstruct cell lineage trees utilizing information on microsatellite mutations. PLoS Comput. Biol., 9, e1003297.
48. Weinreb,C. and Klein,A.M. (2020) Lineage reconstruction from clonal correlations. Proc. Natl Acad. Sci. U.S.A., 117, 17041–17048.