SingleBench

SingleBench is an R framework for optimising exploratory workflows for cytometry data. It introduces a unified way of building and running workflows with one or both of two modules:

  • projection: transformation of inputs by one or more denoising/embedding (dimension reduction) steps
  • clustering

It makes it easy to test the stability of a workflow, run parameter sweeps for tuning, and compare the effects of different denoising techniques. Clustering is automatically scored using interpretable evaluation metrics.

Clustering evaluation details 📊

Evaluation of clustering is difficult, with notable disagreements between metrics (see 10.1038/nmeth.3583). In an elegant design, 10.1002/cyto.a.23030 compared the performance of clustering algorithms on labelled cytometry data by matching clusters to cell populations (labels) automatically.

To do this, each cluster was matched to a label using a bijective (one-to-one) matching scheme, implemented using the Hungarian method for maximising the total F1-score of the matches.

However, this leaves some labels unmatched if the clustering 'underclusters' relative to the number of annotated populations, or some clusters unmatched if it 'overclusters'. These are then omitted from the overall score, which is misleading if the clustering produces a higher-resolution partitioning of the data that reveals previously unknown sub-populations.

To address this, we use 3 cluster-label matching schemes, all maximising total F1 across matches:

  • bijective (one-to-one) scheme: 1 cluster to 1 population
  • fixed-cluster scheme: 1 cluster possibly matched to multiple populations
  • fixed-label scheme: 1 population possibly matched to multiple clusters

For exploratory modelling, the fixed-label scheme is generally appropriate. SingleBench uses 'similarity heatmaps' to display the results of all 3 matching schemes, to understand and interpret the clustering results better (see PlotSimilarityHeatmap below).
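
For example, a minimal sketch of the bijective scheme (not SingleBench's internal implementation) using the Hungarian solver from the clue package:

library(clue)

# Pairwise F1 score between each cluster and each label
f1_matrix <- function(clusters, labels) {
  cl <- sort(unique(clusters))
  lb <- sort(unique(labels))
  m <- outer(cl, lb, Vectorize(function(i, j) {
    tp <- sum(clusters == i & labels == j)  # true positives
    if (tp == 0) return(0)
    precision <- tp / sum(clusters == i)
    recall    <- tp / sum(labels == j)
    2 * precision * recall / (precision + recall)
  }))
  dimnames(m) <- list(as.character(cl), as.character(lb))
  m
}

# Hungarian method maximising total F1 across one-to-one matches
# (solve_LSAP requires no more rows than columns: here, clusters <= labels)
bijective_match <- function(clusters, labels) {
  clue::solve_LSAP(f1_matrix(clusters, labels), maximum = TRUE)
}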

The following clustering evaluation metrics are computed by SingleBench:

  • Davies-Bouldin index
  • adjusted Rand index (ARI)
  • F1, precision and recall per cluster-label match and on average

Some reasons to use SingleBench:

Interpretability of clustering for discovery of new cell types 👩‍🔬

Automated clustering is used extensively for discovery of cell types in cytometry, scRNA-seq and other single-cell data. Leveraging annotations of known cell populations, SingleBench gives a comprehensive picture of how labels match up with clusters, using multiple matching schemes. This not only evaluates the quality of the clustering, but also helps discover new subpopulations or mistakes in manual annotation.


Saving time and RAM ⏳

SingleBench implements a lightweight workflow management system: intermediate results are saved using rhdf5 and recycled. This frees up memory, decreases running time and prevents data loss. When running a workflow, parallel processing via doParallel can be used for further speed-up.

Additionally, algorithms relying on k-nearest-neighbour graphs (k-NNGs) can use a recycled pre-computed k-NNG.


Extensibility 🧩

Denoising, embedding and clustering algorithms in R/Python are integrated via wrappers, which have some minimal formal requirements. New wrappers can easily be added, with instructions and examples provided.


Early detection of errors ‼️

When designing a workflow, SingleBench runs validity checks to detect potential errors before execution. These sanity checks help detect mistakes more easily and save time.

Installing

Install from R using devtools:

devtools::install_github('davnovak/SingleBench')

Note

Individual denoising, embedding or clustering methods might need to be installed separately.
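
For example, FlowSOM (used in the pipeline below) is distributed via Bioconductor and can be installed with BiocManager:

if (!requireNamespace('BiocManager', quietly = TRUE))
  install.packages('BiocManager')
BiocManager::install('FlowSOM')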

Interfacing with Python

If you plan to call Python tools from SingleBench, you will need to install reticulate and make sure the required Python modules are available in an environment that reticulate can use. Such an environment can be loaded with reticulate::use_condaenv or reticulate::use_virtualenv; alternatively, the use of a particular Python binary can be forced by calling Sys.setenv(RETICULATE_PYTHON=path_to_python_binary) from within R before loading reticulate.
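
For example, to force a particular Python binary before reticulate is loaded (the path below is a placeholder for your own environment):

Sys.setenv(RETICULATE_PYTHON = '/path/to/envs/singlebench/bin/python')  # placeholder path
library(reticulate)

# Alternatively, activate an existing conda or virtual environment:
# reticulate::use_condaenv('singlebench', required = TRUE)
# reticulate::use_virtualenv('singlebench', required = TRUE)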

If you need to switch between multiple Anaconda environments from within the same R session, try RCondaRun.

Usage

Session info for reproducibility of Usage section
R version 4.3.3 (2024-02-29)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.1.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Brussels
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reticulate_1.37.0 SingleBench_1.0  

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       cli_3.6.4         rlang_1.1.5       png_0.1-8         purrr_1.0.4       generics_0.1.3    jsonlite_1.9.0   
 [8] glue_1.8.0        colorspace_2.1-1  rprojroot_2.0.4   scales_1.3.0      grid_4.3.3        munsell_0.5.1     tibble_3.2.1     
[15] foreach_1.5.2     lifecycle_1.0.4   compiler_4.3.3    dplyr_1.1.4       codetools_0.2-20  Rcpp_1.0.14       pkgconfig_2.0.3  
[22] here_1.0.1        rstudioapi_0.17.1 lattice_0.22-6    R6_2.6.1          tidyselect_1.2.1  pillar_1.10.1     magrittr_2.0.3   
[29] Matrix_1.6-5      withr_3.0.2       tools_4.3.3       gtable_0.3.6      iterators_1.0.14  ggplot2_3.5.1 

When loaded, method wrappers for individual tools are exported to the global namespace. On start-up, available tools and missing packages are listed.

library(SingleBench)

The wrappers can be reloaded at any time using LoadWrappers().


In the example below, we create a pipeline, plug it into a benchmark instance, evaluate the benchmark and inspect results.

Building a pipeline

A pipeline is composed of one or more subpipelines: combinations of methods and their input parameters. Each subpipeline can have one or both of two modules: projection and clustering.

To create a subpipeline for smoothing (k-NN-based denoising) followed by FlowSOM clustering:

pipeline <- list()
pipeline[[1]] <- 
  Subpipeline(
    projection =
      Module(Fix('smooth', k = 100, n_iter = 1), n_param = 'lam'),
    clustering =
      Module(Fix('FlowSOM', grid_width = 10, grid_height = 10), n_param = 'n_clusters')
  )

Here, Fix assigns fixed values of input parameters. For smooth, k (neighbourhood size) is fixed at 100 and n_iter (number of iterations) at 1. Both grid_width and grid_height (SOM grid dimensions) are fixed at 10 for FlowSOM.

Any Module can specify one optional variable numeric parameter: the n-parameter. Here, lam (denoising strength coefficient) of smooth and n_clusters (number of resulting metaclusters) of FlowSOM are designated as variable.

Note

The projection module can also be a list of multiple Module objects, applied in succession.

Note that n-parameters can be specified for neither, one or both of the modules. To specify n-parameter values:

n_params <- list()
n_params[[1]] <- list(
  projection = rep(c(NA, 0.5, 1.0), each = 2),
  clustering = c(30, 40)
)

If both modules have an n-parameter, the vectors of values for each will be aligned automatically, as long as the length of one is a multiple of the length of the other. Setting the projection n-parameter to NA skips the projection step.
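
With the values above, the shorter clustering vector is recycled against the six projection values. Our reading of the resulting alignment (an assumption based on the description above, not taken from the package documentation):

# The 6 aligned runs (projection lam x clustering n_clusters):
#   lam:        NA  NA  0.5  0.5  1.0  1.0
#   n_clusters: 30  40  30   40   30   40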

Note

If a projection n-parameter value is repeated, the projection is only computed once and recycled when needed.

To create a second subpipeline with the same projection step, use CloneFrom:

pipeline[[2]] <-
  Subpipeline(
    projection = CloneFrom(1),
    clustering = Module(Fix('kmeans'), n_param = 'n_clusters')
  )

n_params[[2]] <- list(
  projection = rep(c(NA, .5, 1.), each = 2),
  clustering = c(30, 40)
)

Here, the second subpipeline evaluates the same projection steps as the first, followed by k-means clustering.

Running a benchmark

A Benchmark object links a pipeline to input data, given as a SummarizedExperiment object, a vector of paths to FCS files, or a flowSet object, with some form of labels per cell.

In the example below, we

  • download a CyTOF mouse bone marrow dataset (Samusik_01) using HDCytoData
  • only keep 'type' markers as encoded in the input's colData
  • asinh-transform the expression values, with co-factor 5
  • treat 'unassigned' as the label for cells without annotation
  • repeat each run 5 times with different random seeds, to account for stochasticity
  • set 'benchmark.h5' as the HDF5 file name for writing data and results

library(HDCytoData)
d <- Samusik_01_SE()
b <- Benchmark(
  input              = d,
  transform          = 'asinh',
  transform_cofactor = 5,
  input_marker_types = 'type',
  unassigned_labels  = 'unassigned',
  pipeline           = pipeline,
  n_params           = n_params,
  stability_repeat   = 5,
  h5_path            = 'benchmark.h5'
)

Before evaluating the benchmark:

  • use print(b) to get a read-out of settings and check for mistakes
  • for parallel processing, check the number of available CPU cores using parallel::detectCores() and decide how many to use

Then run:

Evaluate(b, verbose = TRUE) # specify `n_cores` for parallel processing

With verbose = TRUE, progress messages are shown during evaluation.
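
For example, a common convention is to leave one core free (a sketch, not a setting prescribed by SingleBench):

n_cores <- max(1, parallel::detectCores() - 1)
Evaluate(b, n_cores = n_cores, verbose = TRUE)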

2-d embedding for visualisation (OPTIONAL)

A 2-dimensional embedding can help visualise the data and results. Some dimensionality reduction (DR) tool wrappers are provided by default for this purpose. To generate a layout and bind it to the data:

AddLayout(b, method = Fix('UMAP', latent_dim = 2))

Extracting results

A read-out summarising results of an evaluated benchmark can be printed using print(b). Dedicated functions are provided for extracting data and results in numeric form.

Extractor functions
  • GetExpressionMatrix for single-cell expression values
  • GetProjection for output of the projection module
  • GetAnnotation for vector of labels per cell
  • GetPopulationSizes for cell counts per labelled population
  • GetRMSDPerPopulation for population variability
  • GetLayout for 2-dimensional layout of data (if generated)
  • GetClustering for vector of cluster assignments per cell

For the clustering step, additional getter functions are available:

  • GetClusterSizes for cell counts per cluster
  • GetRMSDPerCluster for cluster variability

Each of these is documented: for instance, run ?GetRMSDPerPopulation for guidance.
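
A brief usage sketch of some extractors (the index arguments to GetClustering are an assumption on our part, mirroring the plotting functions shown below):

expr   <- GetExpressionMatrix(b)  # single-cell expression values
labels <- GetAnnotation(b)        # manual labels per cell
sizes  <- GetPopulationSizes(b)   # cell counts per labelled population

# Cluster assignments for subpipeline 1, 6th n-parameter value, 1st run
# (index arguments assumed to mirror PlotSimilarityHeatmap below)
cl <- GetClustering(b, idx.subpipeline = 1, idx.n_param = 6, idx.run = 1)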


Interpreting clustering results

A range of informative plots is available to aid interpretation.

  • Importantly, PlotSimilarityHeatmap displays cluster-label matches for a chosen iteration of clustering, comparing results from different matching schemes.
PlotSimilarityHeatmap(b, idx.subpipeline = 1, idx.n_param = 6, idx.run = 1)

Annotated similarity heatmap

Then to inspect the expression profile of a cluster/population of interest, use WhatIs:

WhatIs(b, idx.subpipeline = 1, idx.n_param = 6, idx.run = 1, cluster = 15)

This shows the distribution of signal per marker within the subset of interest, and how it compares to the background distributions (all cells). Spearman correlations between marker signals are also shown, giving an overview of single-cell-level patterns in marker expression that identify this compartment:

Output of WhatIs without specified markers of interest

These patterns can then be inspected further via a biaxial expression plot of two markers of interest at a time, using the same function with the markers additionally specified:

WhatIs(b, idx.subpipeline = 1, idx.n_param = 6, idx.run = 1, cluster = 15,
       marker1 = 'B220', marker2 = 'CD16_32', pointsize_fg = 1.5)

Output of WhatIs with specified markers of interest

  • Alternatively, use PlotClusterHeatmap or PlotPopulationHeatmap to look at all clusters/populations.
  • PlotCompositionMap shows the composition of a cluster in terms of labelled populations.
  • To visualise the clusters or annotated populations using a previously generated 2-dimensional layout, use PlotClustersOverlay or PlotLabelsOverlay.
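
For instance, using the UMAP layout generated by AddLayout earlier (a sketch; we assume the overlay functions accept the same index arguments as the other plotting calls):

PlotLabelsOverlay(b)  # annotated populations on the 2-d layout
PlotClustersOverlay(b, idx.subpipeline = 1, idx.n_param = 6, idx.run = 1)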

For hyperparameter tuning, the resulting scores per n-parameter value can be compared in a line plot or heatmap (if two n-parameters were varied) using PlotNParameterMap_Clustering:

PlotNParameterMap_Clustering(b, idx.subpipeline = 1, nrow = 1)

n-parameter map

Extending

Usage of projection/clustering tools is specified by method wrappers: functions with a prescribed signature that allow SingleBench to call them. The recipes for wrappers are R scripts located in inst/extdata/, in the wrappers_projection and wrappers_clustering directories. Wrappers are generated using the WrapTool function. For instructions on wrapping new tools, consult the function documentation (?WrapTool) and the existing wrappers that are already provided.
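
Purely as an illustration of the general shape, a new clustering wrapper might look roughly like the sketch below. The argument names ('name', 'type', 'run') are assumptions on our part, not the documented WrapTool interface; treat ?WrapTool and the bundled recipes as authoritative.

# Hypothetical sketch only: argument names below are assumed, not documented
WrapTool(
  name = 'kmeans_example',                    # hypothetical wrapper name
  type = 'clustering',                        # assumed module-type flag
  run  = function(input, n_clusters, ...) {   # assumed run signature
    stats::kmeans(input, centers = n_clusters)$cluster
  }
)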
