Introduction

Single-cell RNA-seq (scRNA-seq) has proved to be a powerful tool for probing cell states [15], defining cell types [69], and describing cell lineages [1013]. These applications of scRNA-seq all rely on two computational steps: quantification of gene or transcript abundances in each cell and clustering of the data in the resulting abundance × cell expression matrix [14, 15]. There are a number of challenges in both of these steps that are specific to scRNA-seq analysis. While methods for transcript/gene abundance estimation from bulk RNA-seq have been extensively tested and benchmarked [16], the wide variety of assay types in scRNA-seq [1725] have required a plethora of customized solutions [2, 6, 7, 9, 1113, 24, 2637] that are difficult to compare to each other. Furthermore, the quantification methods used all rely on read alignment to transcriptomes or genomes, a time-consuming step that will not scale well with the increasing numbers of reads predicted for scRNA-seq [15, 38]. Clustering based on scRNA-seq expression matrices can also require domain-specific information, e.g., temporal information [33] or functional constraints [37], so that in some cases hand curation of clusters is performed after unsupervised clustering [7].

In [39] a method of collapsing bulk read alignments into “equivalence classes” of reads was introduced for the purpose of estimating alternative splicing isoform frequencies from bulk RNA-seq data. Each equivalence class consists of all the reads that are compatible with the same set of transcripts (See Fig. 1 for an example). The collapsing of reads into equivalence classes was initially introduced to allow for significant speedup of the E-step in the expectation-maximization (EM) algorithm used in some RNA-seq quantification programs [40, 41], as the read counts in the equivalence classes, or transcript-compatibility counts (TCCs), correspond to the sufficient statistics for a standard RNA-seq model [42]. In other words, the use of transcript-compatibility counts was an intermediate computation step towards quantifying transcript abundances. In this paper we instead consider the direct use of such counts for the comparison and clustering of scRNA-seq cells. Figure 2 shows an outline of a method we have developed for clustering and analyzing scRNA-seq data; the key idea is to base clustering not on the quantification of transcripts or genes but on the transcript-compatibility counts for each cell. We note that equivalence classes have also been used in [43, 44] to define similarity scores between de novo assembled transcripts.

Fig. 1
figure 1

Equivalence class and transcript-compatibility counts. This figure gives an example of how reads are collapsed into equivalence classes. Each read is mapped to one or more transcripts in the reference transcriptome; these are transcripts that the read is compatible with, i.e., the transcripts that the read could possibly have come from. For example, read 1 is compatible with transcripts t1 and t3, read 2 is compatible with transcripts t1 and t2, and so on. An equivalence class is a group of reads that is compatible with the same set of transcripts. For example, reads 4,5,6,7,8 are all compatible with t1, t2, and t3, and they form an equivalence class. Since the reads in an equivalence class are all compatible with the same set of transcripts, we simply represent an equivalence class by that set of transcripts. For example, the equivalence class consisting of reads 4,5,6,7,8 is represented by {t1,t2,t3}. Aggregating the number of reads in each equivalence class yields the corresponding transcript-compatibility counts. Note that in order to estimate the transcript abundances from the transcript-compatibility counts, a read-generation model is needed to resolve the multi-mapped reads

Fig. 2
figure 2

Overview of the method. This figure illustrates our transcript-compatibility count (TCC) clustering method in a very simple, yet instructive example and highlights its major differences with respect to the conventional single-cell clustering approach. Here, we consider an scRNA-seq example with K cells (only the reads coming from Cell1 and Cell2 are shown here) and a reference transcriptome consisting of three transcripts, t 1, t 2, and t 3. Conventional approach: Single cells are clustered based on their transcript or gene abundances (here we only focus on transcripts for concreteness). This widely adopted pipeline involves computing a (#transcripts × #cells) expression matrix by first aligning each cell’s reads to the reference. The corresponding alignment information is next to each read, which for the purpose of illustration only contains the mapped positions (the aligned reads of Cell1 are also annotated directly on the transcripts). While reads 1 and 5 are uniquely mapped to transcripts 1 and 3, reads 2, 3, and 4 are mapped to multiple transcripts (multi-mapped reads). The quantification step must therefore take into account a specific read-generating model and handle multi-mapped reads accordingly. Our proposed method: Single cells are clustered based on their transcript-compatibility counts. Our method assigns the reads of each cell to equivalence classes via the process of pseudoalignment and simply counts the number of reads that fall in each class to construct a (# eq.classes × #cells) matrix of transcript-compatibility counts. Then, the method proceeds by directly using the transcript-compatibility counts for downstream processing and single-cell clustering. The underlying idea here is that even though equivalence classes may not have an explicit biological interpretation, their read counts can collectively provide us with a distinct signature of each cell’s gene expression; transcript-compatibility counts can be thought of as feature vectors, and cells can be identified by their differential expression over these features. Compared to the conventional approach, our method does not attempt to resolve multi-mapped reads (no need for an assay-specific read-generating model) and only requires transcript compatibility information for each read (no need for exact read alignment)

To better understand the relevance of transcript-compatibility counts, consider their relationship to the “gene-level” counts used in many RNA-seq analyses. In the same way that “genes” represent groupings of transcripts [45], equivalence classes as introduced by [39] are also groups of transcripts. However, while the former is a biologically motivated construction, the latter is technical, consisting of groupings that capture the extent of ambiguous multiple mappings among reads. The lack of direct biological interpretation of equivalence classes makes transcript-compatibility counts less intuitive. However, as we will show, there are two significant advantages to working with them: (1) unlike transcript or gene-level quantifications, transcript-compatibility counts can be computed without a read-generating model, and hence a single clustering pipeline based on transcript-compatibility counts can be used across a wide range of scRNA-seq assays; (2) transcript-compatibility counts can be computed by pseudoalignment, a process that does not require read alignment and can be done extremely efficiently [41].

To demonstrate both the general applicability of our method as well as its accuracy, we reanalyzed data from the topics of two recently published scRNA-seq papers: the pseudotemporal ordering of primary human myoblasts [12] and the cell classification in the mouse cortex and hippocampus [7]. We show that not only are we able to recapitulate the analyses of the papers two orders of magnitude faster than previously possible, but we also provide a refinement of the published results, suggesting that our approach is both fast and accurate. The speedup of our method makes single-cell RNA-seq analysis interactive for the first time: sensitivity of results to parameters and annotations can be easily explored and analyses can be easily reproduced by individuals without access to significant computing resources. Furthermore, the efficiency of our methods will take on increasing significance as single-cell RNA sequencing scales to experiments with hundreds of thousands of cells and improved technologies make the acquisition of single-cell data easier and faster (see, for example, [46]). We also illustrate the advantages of the broad applicability of our approach via its suitability to a multitude of assays. Existing pipelines must be tailored to specific assays, making it difficult to perform meta-analyses and to compare results across experiments.

Results

Transcript-compatibility counts from pseudoalignments

To demonstrate the effectiveness of transcript-compatibility counts for scRNA-seq analysis, we first examined how efficiently they can be computed. While transcript-compatibility counts can be extracted from read alignments (e.g., in SAM/BAM format), they do not require the full information contained in alignments. Instead, we examined the speedup possible with pseudoalignment [41], which obtains for each read the set of transcripts it is compatible with and therefore can be directly used to obtain transcript-compatibility counts.

Figure 3 shows the speed of obtaining transcript-compatibility counts via pseudoalignment in comparison to the time required to quantify RNA-seq data with other approaches. The key result relevant for single-cell analysis is the scalability of pseudoalignment for obtaining transcript-compatability counts (Fig. 3 and Additional file 1: Figure S1). The fixed extra cost for aligning (rather than pseudoaligning) reads for each cell is small, but when extrapolated to hundreds of thousands of cells it becomes a significant (computational) cost.

Fig. 3
figure 3

Runtime comparison of alignment methods. a The time required to process 3005 cells from mouse brain cell dataset [7] in core hours is shown here. The time taken for read alignment with Bowtie and HISAT is much larger than the time taken for kallisto pseudoalignment (which is used by our method to obtain the transcript-compatibility counts). kallisto pseudoalignment and HISAT were run on 32 cores. Bowtie and Word Count were each timed on 1 core with 10 randomly selected cells. The bars shown here are estimates obtained by multiplying these times by 300.5. Because we do not account for the overhead associated with parallelizing, the Bowtie and Word Count estimates are lower bounds on their runtimes in practice. After preprocessing the UMIs, each of the 5,914,602,849 single-end reads in the dataset were less than 50 bp long. b The time required to process 271 cells of the dataset of [12] in core hours is shown here. As before, the time taken for read alignment with HISAT is significantly larger than the time taken for kallisto pseudoalignment. Both methods were run on 32 cores. The dataset has 814,344,693 paired-end reads, and each mate in a pair is 100 bp long

Pseudotime for differentiating human myoblasts

The recently published Monocle software [12] that builds on the Cufflinks program [47] is rapidly becoming a standard tool for scRNA-seq analysis. We therefore sought to compare our approach to Monocle, and in order to do so began with a reanalysis of the data in [12]. Figure 4 shows the temporal ordering of differentiating primary human myoblasts using transcript-compatibility counts clustering based on the Jensen-Shannon metric and the affinity propagation algorithm (see Methods). We note that unlike Cufflinks, which consists of an explicit model of RNA-seq suitable for the data in [12] but not necessarily for other assays, our transcript-compatibility counts make no assumption about the nature of the data. Furthermore, while the reanalysis appears to match that of [12], affinity propagation with different parameters provided a more refined clustering, possibly capturing seven stages of myoblast differentiation (see also Additional file 1: Figure S2).

Fig. 4
figure 4

Temporal ordering of differentiating primary human myoblasts using transcript-compatibility counts. a A minimum spanning tree (MST) was drawn through the 271 cells using Jensen-Shannon distances computed between 1,101,805-dimensional vectors of TCCs of cells in the dataset. Following the longest path does not show a clear cell differentiation pattern. b Affinity propagation clustering generated seven clusters (after collapsing spurious clusters with less than five cells into their nearest neighboring cluster), and an MST was drawn through the centroids of the clusters. Using the labels from Trapnell et al. [12], the longest path shows a differentiation pattern from proliferating cell (red) to differentiating myoblast (blue). The MST also shows how some proliferating cells alternatively differentiate into interstitial mesenchymal cells (green). c The cells were then clustered into three groups based on their transcript-compatibility counts, and the MST from b was relabeled using these new cell types. d The expression levels of the genes MYOG, CDK1, and PDGFRA were analyzed for the three TCC clusters. MYOG, CDK1, and PDGFRA show greater expression for centroids from clusters with greater proportions of TCC cell types 1, 2, and 3, respectively. For each gene, a histogram over each centroid shows how expression level evolves with the differentiation process. CDK1, MYOG, and PDGFRA being markers for proliferating cells, differentiating myoblasts, and interstitial mesenchymal cells indicates that the clustering and centroid-ordering based on TCC captures intermediate steps of the human myoblast differentiation trajectory

A central idea in the pseudo-temporal ordering of cells relies upon the construction of a minimum spanning tree (MST) over the pairwise distances of their corresponding gene expression vectors [48]. This attempts to capture the trajectory of a hypothetical cell that gradually “moves” through different cellular states or differentiation stages in a high-dimensional gene expression space. Our results show that the same concept can be applied to transcript-compatibility counts. A key step in Monocle is to first reduce the dimensionality of the data by independent component analysis (ICA) and then compute the MST based on Euclidean distances on the plane. Here we take a different approach (see Fig. 4 ad) and compute the MST on “cluster centers” in high dimensions (see Methods). Both approaches aim to battle the biological and technical noise that is inevitably introduced in scRNA-seq experiments. Even though we could have used Monocle directly on transcript-compatibility counts, the design and comparison of specialized tools is beyond the scope of this paper.

Figure 4 d validates the three primary clusters and the pseudo-temporal ordering obtained by our method based on three key myoblast differentiation markers, MYOG, CDK1, and PDGFRA (see Additional file 1: Figure S3 for an additional set of genes taken from [12]). Interestingly, the expression of these genes gradually evolves over the pseudo-temporally ordered clusters, capturing both the underlying differentiation trajectory of proliferating cells to myoblasts and the corresponding branching towards mesenchymal cells, as was observed in [12].

Finally, we should point out that although the three primary clusters of [12] are evident in our results, they are not identical. This naturally raises the question of whether clustering on (high-dimensional) transcript-compatibility counts could possibly lead to cell misclassification. Our results show that this is not the case. In Fig. 5 we investigated one cell that seemed to have been severely misclassified by our method as a differentiating myoblast while it was identified as a proliferating cell by Monocle. However, an analysis of the expression levels of 12 marker genes obtained from [12] shows that this cell displays more similarity to differentiating myoblasts than proliferating cells. Overall our results seem to suggest that transcript-compatibility counts, being directly obtained from sequenced reads, might constitute a less noisy representation of the “transcriptomic state” of a cell compared to the one obtained by quantifying its gene expression.

Fig. 5
figure 5

Clustering primary human myoblasts based on transcript-compatibility counts. a The transcript-compatibility counts matrix for 271 primary human myoblasts from [12] is visualized using a diffusion map. Three clusters obtained using affinity propagation are shown along with the distribution of these cells across the four cell-collection timepoints (0, 24, 48, and 72 hours). b The diffusion map obtained using transcript compatibility counts is relabeled using the cells reported by [12]. Clusters 1,2,3 generated by the transcript compatibility based method map to proliferating cells, differentiating myoblasts, and interstitial cells, respectively. According to Trapnell et al.’s labels, the transcript compatibility based method seems to have severely misclassified cell T48_CT_G10 (SRR1033183) as a differentiating myoblast. c Comparing the expressions of 12 differentiating genes in T48_CT_G10 with those of the average proliferating cell and the average differentiating myoblast, 8 out of the 12 genes show expressions similar to what one would expect from a differentiating myoblast. MYOG seems to show an FPKM of 14, which while more than the mean expression of proliferating cells (around 0.28) is much less than the mean expression of differentiating myoblasts (around 61.33). We note that this cell has the highest expression of MYOG among all cells labeled by Trapnell et al. as proliferating cell (and the second highest cell has expression around 5.4). However there are 88 differentiating myoblasts with MYOG expression less than 15 FPKM. Hence it is reasonable to think that this MYOG expression is more typical of differentiating myoblasts than proliferating cells. Only genes CDK1 and CCMB2 show expressions close to what one would expect from a proliferating cell. Even though CDK1 is a highly specific marker for proliferating cells, the above gene profile indicates that classifying cell T48_CT_G10 as a differentiating myoblast seems reasonable

Cell classification in the mouse cortex and hippocampus

The reanalysis of [12] shows that clustering of transcript-compatibility counts can be useful on a single dataset, but we believe that the true power of our approach lies in its broad applicability to multiple single-cell assays. In contrast to the standard quantification pipeline, obtaining transcript-compatibility counts does not require a read-generating model; our method can be directly applied to a wide range of scRNA-seq datasets, and transcript-compatibility counts can be used to analyze sequenced reads without any assay-specific information. To make this point, we reanalyzed a recent large scRNA-seq experiment published earlier this year [7] that uses an assay based on unique molecular identifiers (UMIs). In contrast to [12], where paired-end reads were sampled from fragments covering the entire length of the transcripts, [7] used single-end reads that were only obtained from the 3’-end of the transcripts.

Zeisel et al. [7] examined a very diverse population of 3005 cells obtained from the cortical and hippocampal regions of the mouse brain. In order to analyze this complex dataset, the authors developed a state-of-the-art hierarchical biclustering method called BackSPIN (based on SPIN [30]) and were able to identify 47 distinct subpopulations of cells within nine major brain cell types. This fine-grained analysis also revealed a previously unknown post-mitotic oligodendrocyte subclass, referred to as Oligo1 in [7].

Figure 6 shows the clusters obtained by applying our method to the above dataset and compares our method’s clustering accuracy to various quantification-based methods. In order to systematically assess the clustering accuracy, we iteratively subsampled cells from two different cell types at random and evaluated the ability of each method to distinguish between these types. Since the development of specialized clustering algorithms is orthogonal to our paper, we compared based on the same clustering algorithm throughout (see Methods). Our results indicate that transcript-compatibility counts can be more accurate than standard model-based RNA-seq quantification tools (such as eXpress) that try to estimate the underlying read-generating model from the data. Our transcript-compatibility counts based method is in fact able to achieve similar accuracy with the assay-specific quantification approach used in [7] (which explicitly takes into account the significant 3’-end bias in this dataset). Clustering transcript abundance quantifications output by kallisto results in lower accuracy due to the mismatch between kallisto’s read-generating model and this dataset, further emphasizing the importance of using transcript-compatibility counts, which are computed without using any such model.

Fig. 6
figure 6

Clustering mouse brain cells based on transcript-compatibility counts. a The TCC distribution and gene expression matrices for the 3005 mouse brain cells are visualized using t-SNE (based on Jensen-Shannon distances between TCC distributions and gene expression distributions of cells, respectively) and colored with the cell type determined by Zeisel et al. [7]. We note that the transcript-compatibility based t-SNE also visually maintains the cluster structure of the nine major clusters, even though it can be computed two orders of magnitude faster than the gene expression matrix. b Cells from each of two cell types determined by Zeisel et al. were randomly selected, and then the clustering accuracy of multiple methods was tested. The clustering accuracy was measured as the error rate of the clustering. First, we note that the 3’-end bias in this dataset significantly affects the accuracy of kallisto and eXpress that have been chosen here as representative methods for model-based quantification (see Methods). For both methods clustering was performed on gene expression profiles obtained by summing the corresponding transcript abundances. For each point in the eXpress and kallisto curves, we took the minimum of the error rates obtained with bias modeling turned on and off. By avoiding estimation of the read model, transcript-compatibility based methods were indeed more accurate. We see that transcript-compatibility based clustering achieves similar accuracy to the gene-level UMI counting method implemented by the authors for this dataset without explicitly accounting for PCR biases. Refining transcript-compatibility counting to correct for PCR biases (by counting only the distinct UMIs of reads in each equivalence class) leads to a marginal improvement of our method. c Running affinity propagation on the TCC distribution matrix (using negative Jensen-Shannon distance as similarity metric) produced a cluster of 28 cells, 24 of which were labeled Oligo1. Zeisel et al. [7] classified 45 of the 3005 cells as this new class of cells. The bar plot compares the mean expression of selected oligodendrocyte marker genes in the TCC cluster to their mean expression in Zeisel et al.’s Oligo1. As reported in [7], Oligo1 cells are characterized by their distinct expression of genes such as Itpr2, Rnf122, Idh1, and Gpr17. The similarity of the bars seems to suggest that clustering on TCC can capture this fine-grained information. Note that although single-cell clustering was entirely performed based on transcript-compatibility counts, the gene expression data used to evaluate this figure were obtained from Zeisel et al

Quite remarkably, our method (via affinity propagation on all cells) was further able to recover the Oligo1 cluster of cells, showing that transcript-compatibility counts can indeed capture distinct cell signatures without actually quantifying their gene expression (Fig. 6, Methods). Overall, in our experiments we observed that unsupervised clustering of transcript-compatibility counts typically yielded more than 47 clusters, which was also the case in [7]. Some of our clusters were very small, probably capturing outlier cells, while others seemed to be further splitting the 47 cell subtypes identified in [7].

To further investigate this, we focused on another oligodendrocyte subpopulation, referred to as Oligo3 in [7]. As reported in [7], Oligo3 cells were almost exclusively observed in the somatosensory cortex and were identified by the authors as being in an intermediate stage of maturation — in between premyelinating and myelinating oligodendrocytes. Even though the Oligo3 cells appear to be well clustered together, as visualized by t-distributed stochastic neighbor embedding (t-SNE) (Fig. 7 a), affinity propagation on transcript-compatibility counts with various parameters consistently separated them into two subclusters. Our results in Fig. 7 b seem to suggest that a subpopulation of Oligo3 cells (captured by one of our subclusters) expresses an unusual signature of endothelial/vascular genes on top of the expected myelin-related genes. Interestingly, similar findings have been reported recently in [37], suggesting a possible (experimental) contamination of several oligodendrocyte cells in the dataset at hand.

Fig. 7
figure 7

Analysis on transcript-compatibility counts refines the classification of mouse brain cells. a Runs of affinity propagation with different propagation and damping parameters were carried out on the TCC matrix for the 3005 mouse brain cells of [7]. The Oligo3 subclass discovered by Zeisel et al. was consistently split into two subclasses A and B. b Cells in the Oligo3 B class showed greater expression of endothelial/vascular genes and lower expression of myelinating oligodendrocyte genes. The opposite was true for the cells in Oligo3 A. This result may corroborate the potential contamination of oligodendrocytes in the Zeisel et al. dataset that has recently been reported in Fan et al. [37]

Discussion

In this paper we introduced a novel method that uses transcript-compatibility counts — instead of gene expression profiles — as distinct cell signatures for clustering single-cell data. Note, however, that the main focus of our method is not about how to cluster (i.e., the particular choice of clustering algorithms) but rather what to cluster on. To emphasize this point, we considered simple, “off-the-shelf” clustering methods that directly use the corresponding TCCs as their input. Interestingly, while these methods may not be able to recover accurate clusters when applied to gene expression vectors (see Additional file 1: Figure S4 or ([7], Figure S3), for example), our results showed that TCCs maintain all the necessary information to recover the analyses of [12] and [7].

Even though clustering alone can reveal important information about a single-cell RNA-seq experiment, further biological interpretation of the results (marker gene identification or differential expression) requires some form of quantification of expression profiles within and in between clusters. So it is natural for one to think that eventually the quantification bottleneck will still manifest itself in single-cell analysis. A key observation, however, is that given an accurate clustering of the cells, each and every individual cell’s gene expression profile is no longer needed; one can extract an accurate statistical representation of the gene expression within each cluster — without having to quantify all cells separately. In particular, one can quantify the aggregate gene expression in each cluster (cluster centers) by pooling single-cell TCCs together and further estimate the corresponding gene variability by subsampling and quantifying only a few cells per cluster. For example, in Additional file 1: Figure S5 we used kallisto to quantify subsampled cells and the corresponding cluster centers (after clustering on TCCs) for Trapnell et al.’s dataset and generated results that are very similar to the ones obtained in Fig. 4 (where the corresponding gene expression profiles were obtained from [12]). Our method can therefore be used to effectively reverse the quantification and clustering steps in the conventional pipeline and potentially provide further end-to-end processing gains, depending on the needs/goals of each scRNA-seq experiment. Overall, we believe that clustering before quantifying is a promising future direction for scRNA-seq analysis which may lead to more robust and accurate quantification algorithms.

Conclusions

The extraordinary developments in single-cell RNA-seq technology over the past few years have demonstrated that “single-cell resolution” is not just a gimmick but an unprecedented tool for probing transcriptomes that can reveal the inner workings of developmental programs and their resulting tissues. However, the computational challenges of scRNA-seq analysis, already very high due to the large number of cells to analyze, have been further exacerbated by the smorgasbord of assays, each of which introduces unique technical challenges.

The new method we have proposed and evaluated in this paper, namely analysis of scRNA-seq based on transcript-compatibility counts, offers an efficient, accurate, and broadly applicable solution for extracting information from scRNA-seq experiments. In the same way that single-cell analysis can be viewed as the ultimate resolution for transcriptomics, transcript-compatibility counts are the most direct way to “count” reads. While we have focused on clustering of cells in this paper, we believe that transcript-compatibility counts may have applications in many other sequencing-based assays, and that further development of methods based on such counts offers a fruitful avenue of exploration.

The ability to obtain transcript-compatibility counts by pseudoalignment is a benefit that has its own implications and applications. For example, the speed of pseudoalignment facilitated quick experimentation with our method, and in assessing our accuracy on different datasets, one discovery was that much less sampling than is currently performed is necessary to cluster cells. In the reanalysis of [7], we found that the main results, namely the clustering of cells and identification of cell types, were achievable with only 1 % of the data (see Fig. 8 and Additional file 1: Figure S6). This observation has significant implications for scRNA-seq as it suggests that, for clustering of cells, low-coverage sequencing may be sufficient, thus allowing for larger experiments with more cells. Moreover, this low-coverage clustering performance can be achieved using our method, which is not tailored to the specific scRNA-seq assay.

Fig. 8
figure 8

Coverage requirements for clustering based on transcript-compatibility counts. As an intermediate between raw reads and quantified transcript abundances, transcript-compatibility counts intuitively have more information than the reconstructed transcripts and less noise than the raw reads. a As the read coverage of a cell of the dataset of [7] decreases from approximately 627k mapped reads, different methods have varying robustness to the loss of coverage. Each method was evaluated on its ability to cluster 200 randomly selected neurons mixed with 200 randomly selected non-neurons into the two cell types (the clustering of [7] being considered as the ground truth). Among methods which do not explicitly account for PCR bias, TCC based clustering performed much better than kallisto and eXpress and was quite close in performance to the UMI counting method of [7]. For both kallisto and eXpress, clustering was performed on gene expression profiles obtained by summing the corresponding transcript abundances. For each point in the eXpress and kallisto curves, we took the minimum of the error rates obtained with bias modeling turned on and off. By counting the number of unique UMIs rather than reads (TCC with UMI in the plot), transcript-compatibility based clustering was adapted to account for PCR bias, resulting in similar performance to that of gene-level UMI counting used in [7]. b Even at significantly decreased coverage depths, our method maintains clusters corresponding to the nine major cell types identified by Zeisel et al. The transcript-compatibility distribution matrices at varying coverage depths are visualized using t-SNE. c At various coverage depths, transcript-compatibility counting with UMIs disagrees slightly with the cells the authors labeled as Oligo1 (cyan cell IDs). As the coverage decreases, transcript-compatibility based affinity propagation still identifies a cluster that captures the vast majority of Oligo1 cells in the 3005-cell population

Methods

The code used to generate the results presented in this paper is available online on GitHub [49]. The Mus musculus transcriptome assembly used was GRCm38. The Homo sapiens transcriptome assembly used was GRCh38. The reference genome used for HISAT was build 10 of the mouse genome (mm10) from the UCSC genome browser. The genome annotation used in the analysis is the release 83 of Ensembl.

Computation of transcript-compatibility counts

In our implementation of the method, we use kallisto to compute transcript-compatibility counts via pseudoalignment (avoiding the quantification step that is usually performed when running kallisto altogether). In particular, we utilized the “pseudo” option of the kallisto RNA-seq program, which computes equivalence classes of reads after pseudoalignment. We used kallisto version 0.42.3 with k set to kallisto’s default value of 31. Even though kallisto pseudoalignment is a natural approach to obtain transcript-compatibility counts, one can in principle extract the same information from exact read alignments. To explore this alternative, we used HISAT (with the no-spliced-alignment option enabled) to align reads on the mouse transcriptome (GRCm38) in the case of Zeisel’s dataset and the human transcriptome (GRCh38) in the case of Trapnell’s dataset. Then, we generated the corresponding “alignment-based TCCs” by directly counting the number of multi-mapped reads aligned to each set of transcripts, and evaluated their performance in Additional file 1: Figure S7.

Transcript-compatibility counts based on UMI information

The dataset of [7] has reads with unique molecular identifiers (UMIs). UMIs are typically used in scRNA-seq to correct for PCR bias; biological copies of a transcript (distinct molecules) can be identified based on their UMIs. This information can be utilized in generating the transcript-compatibility counts from equivalence classes. Instead of counting all the reads in each equivalence class, we only count the reads with distinct UMIs. Transcript-compatibility counts with UMIs are shown in Figs. 6 b and 8 a (represented as “TCC with UMI” in the figures).

Clustering methodology

On obtaining the transcript-compatibility counts for each cell, we normalize by the total number of mapped reads to obtain a probability distribution called the transcript-compatibility count distribution or TCC distribution. We then compute the square root of the Jensen-Shannon divergence [50] between the TCC distributions for each pair of cells. As a distance metric which satisfies the triangle [51] inequality, the square root of the Jensen-Shannon divergence is a natural choice for computing pairwise distances between two probability distributions. However, the results obtained here are not contingent on using the square root of Jensen-Shannon divergences as the measure of distances, and quite similar results are obtained when we use other distances between probability distributions, such as the 1 distance to compute the pairwise distance matrix (see Additional file 1: Figure S6). The 1 distance (which is just twice the total-variation distance) in fact seems to perform better than the Jensen-Shannon distance for low coverage (Additional file 1: Figure S6b). In contrast, Euclidean distance ( 2 distance) seems to perform much worse (see Additional file 1: Figure S6). The fact that Euclidean distance is not a good distance metric to measure distances between probability distributions is widely documented (see, for instance, [52]).

All clustering carried out in this paper was done using off-the-shelf clustering methods.

We used spectral clustering using the pairwise distance matrices when we knew the number of clusters in the data. This includes Figs. 6 b, 8 a, and Additional file 1: Figure S6b with two clusters for the pairwise distance matrix (from TCC distributions) obtained for the data from [7].

The clustering method used when the number of clusters is not known is affinity propagation [53]. This is an unsupervised clustering algorithm based on message passing, which needs a pairwise similarity matrix as input. The pairwise similarity matrix is computed as the negative of the pairwise distance matrix that was computed.

To evaluate the clustering accuracy of our method in Fig. 6 b, we performed binary classification tests using the labels reported in [7] as the ground truth. In particular, we randomly subsampled two different types of cells and evaluated the ability of each pipeline to separate them into two clusters via spectral clustering. We performed these binary classification tests between (1) the subclasses Oligo1 (45 cells) and Oligo4 (106 cells), (2) the cell types Astrocytes (198 cells) and Interneurons (290 cells), and (3) the more general cell types neurons (1628 cells) and non-neurons (1377 cells). The error rates for each test were obtained by randomly sampling 22, 99, and 200 cells from each of the two labels respectively, averaged over 10 Monte Carlo iterations.

For clustering the dataset of [7], we used affinity propagation with the default parameters which set the preference value equal to the median of the similarity scores and the damping parameter equal to 0.5. On doing this, we obtained 89 clusters. Of the 89 clusters obtained, cluster number 22 had the largest match with the set of cells the authors labeled as Oligo1 (which was the new type of cells discovered in [7]). A total of 24 out of the 28 cells in the cluster were labeled Oligo1 by [7]. There were a total of 45 cells labeled Oligo1 in [7] out of the total of 3005 considered. This is investigated in Fig. 6 c.

Also, affinity propagation with different parameters seems to split the class labeled Oligo3 in [7] into two classes. This is investigated in Fig. 7, where the two classes considered were classes obtained with parameters set as before.

For clustering the dataset of [12], we used affinity propagation with preference parameter set to 1.3 and damping parameter set to 0.95 to obtain three clusters in Fig. 4. To obtain eight clusters on the dataset of [12], we used affinity propagation with preference parameter set to 0.6 and damping parameter set to 0.95 to obtain eight clusters. Then after collapsing any cluster with less than five cells into the cluster closest to it, we obtain the seven clusters investigated in Fig. 5. More details regarding our parameter choices for affinity propagation in this dataset are provided in Additional file 1: Figure S8.

Partial order on clusters

On the [12] dataset, for the seven clusters obtained, we first find the centroid TCC distribution of each cluster as the mean TCC distribution of all cells in the cluster. Then, we compute the pairwise Jensen-Shannon distances between the centroid TCC distributions (cluster centers). We then run a minimum weight spanning tree on the complete graph between the cluster centers with weights given by the computed pairwise distances. This gives us a partial order on the clusters, which is investigated in Fig. 4 and Additional file 1: Figure S2.

Quantification

In this paper, we used kallisto and eXpress as representative methods for model-based quantification as they demonstrate similar accuracy to other available quantification tools [41]. In particular, for Zeisel et al.’s dataset (Figs. 6 b and 8 a) we used these tools as a “negative control” to demonstrate the importance of the read-generating model when using quantified transcript abundances to obtain gene expression profiles and cluster single-cell data. Although the significant mismatch between the assumed model (i.e., full transcript length coverage) and the 3’-end bias in this dataset is expected to affect the quantification accuracy for each individual cell, our results show that such a mismatch further impacts the accuracy of clustering and cell-type classification. To evaluate the cell-type classification performance of kallisto and eXpress, we took the minimum of the error rates obtained with bias modeling turned on and off.

For the Trapnell et al. dataset, we used kallisto to quantify transcript abundances and obtain the gene expression profiles within the clusters obtained from TCCs. Note that the read-generating model in this dataset is similar to the standard RNA-seq model that kallisto uses for quantification. More specifically, in Additional file 1: Figure S5a we quantified the corresponding cluster centers by running kallisto’s EM algorithm on the pooled TCCs of each cluster. Using kallisto in this setting resembles bulk RNA-seq quantification applied to the pooled reads coming from each individual cluster (instead of the entire population of cells). In Additional file 1: Figure S5b we further quantified randomly subsampled cells (20 cells per cluster) to obtain an accurate estimate of the gene expression variability within each cluster.

Visualization of cells and clusters

We used t-SNE [54] to visualize the cells and clusters in Figs. 6 a, 8 b, and Additional file 1: Figure S6a.

The left panel of Figs. 4 d, 5 a, b and Additional file 1: Figure S3a was created using an implementation [55] of the diffusion map algorithm of [56].

Availability of data and materials

The code used to generate the results presented in this paper is available online on GitHub [49]. All sequencing reads for the Zeisel et al. dataset [7] are available through Gene Expression Omnibus [GEO:GSE60361] and for the Trapnell et al. dataset [12] through [GEO:GSE52529]. The method is publically available on GitHub (https://github.com/govinda-kamath/clustering_on_transcript_compatibility_counts) under the MIT license.

Ethics

No ethics approval was required for this study.