SCimilarity revolutionizes single-cell data analysis with rapid comparisons between tissues

Unlocking the secrets of cell similarity: how SCimilarity transforms single-cell data into insights into disease, development and tissue biology.

SCimilarity search engine. Image credit: SCimilarity

In a recent study published in the journal Nature, researchers in Canada and the US developed Single-Cell Similarity (SCimilarity), a framework for fast, interpretable searches of single-cell or single-nucleus RNA sequencing (sc/snRNA-seq) data. The framework enables the discovery of similar cell states in the human cell atlas.

Background

Over 100 million cells have been profiled using sc/snRNA-seq across different states, providing unprecedented opportunities to link cell states across development, tissues and diseases. However, large-scale analyses remain limited by challenges in dataset harmonization, the definition of shared representations, and the lack of robust similarity measures or scalable search methods.

Current approaches often fail to generalize across datasets and cannot efficiently query massive atlases for similar cell profiles. Further research is needed to develop foundation models that enable accurate, scalable and interpretable searches, unlocking the full potential of single-cell atlases to advance biological discovery.

About the study

scRNA-seq has profiled millions of individual cells across different tissues, conditions and diseases, offering transformative opportunities to link cell states across contexts.

However, effective comparisons between datasets remain limited due to challenges in harmonizing disparate data, defining common representations, and developing accurate metrics to quantify cellular similarity.

Existing models preserve dataset-specific information but often fail to generalize or to search large atlases efficiently for comparable cell states.

Metric learning, a technique successfully applied in areas such as image processing, offers a promising solution. By embedding cell profiles in a shared low-dimensional space, it becomes possible to identify biologically similar cells across large datasets. Such representations could enable scalable, interpretable searches for cells in different contexts, facilitating comparisons between datasets and biological discovery.
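
To make the idea of metric learning concrete, here is a minimal sketch of a triplet loss, one common metric-learning objective. The article does not say which loss SCimilarity uses, so the loss itself, the function name triplet_loss, the margin value and the toy 2-D embeddings are all assumptions made for illustration. The loss is zero when a cell sits closer to a cell of the same type than to a cell of a different type by at least the margin, and grows when that ordering is violated.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on Euclidean distances between embeddings.

    Illustrative only: the loss and margin are assumptions, not the
    published SCimilarity objective.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # distance to same-type cell
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # distance to other-type cell
    return np.maximum(d_pos - d_neg + margin, 0.0)

# Toy 2-D embeddings (hypothetical values, not real expression data).
anchor = np.array([0.0, 0.0])
same_type = np.array([0.1, 0.0])
other_type = np.array([3.0, 4.0])

# Embedding already respects the desired ordering, so the loss is zero.
well_separated = triplet_loss(anchor, same_type, other_type)

# Swapping the roles violates the ordering, so the loss is large (about 5.9).
violating = triplet_loss(anchor, other_type, same_type)
```

Training an encoder to minimize a loss of this kind pulls similar cells together and pushes dissimilar cells apart in the shared space.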

Study results

SCimilarity showed generalization across different single-cell profiling platforms. Although primarily trained on 10x Genomics Chromium data, it effectively embedded and annotated cell profiles from multiple platforms, including scRNA-seq and snRNA-seq datasets.

For example, human peripheral blood mononuclear cell (PBMC) samples profiled across seven platforms showed consistent annotation accuracy, except for rare cell types such as conventional dendritic cells (cDCs) and plasmacytoid dendritic cells (pDCs).

While minor differences in embedding distance were observed, especially for non-10x platforms such as Switching Mechanism At 5′ End of RNA Template Sequencing (SMART-Seq2), SCimilarity maintained high performance, demonstrating its adaptability to different data sources.

A key advantage of SCimilarity is its ability to integrate datasets without explicit batch correction. By quantifying representational confidence for individual cells, the model identifies outliers and assesses its generalization to new data. For example, low-confidence annotations were associated with poorly represented tissues in the training data, such as stomach and bladder. This capability enabled the construction of an atlas spanning 30 human tissues and facilitated pan-tissue comparisons.
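
The article does not describe how representational confidence is computed. As a rough sketch of the general idea, the example below uses a simple proxy: the mean distance from a query cell to its nearest neighbors in a reference embedding, with distant cells flagged as poorly represented. The function name mean_knn_distance, the grid-shaped reference, the value of k and the threshold are all hypothetical.

```python
import numpy as np

def mean_knn_distance(queries, reference, k=3):
    """Mean Euclidean distance from each query to its k nearest reference points.

    Used here as an assumed stand-in for a confidence score:
    larger values mean the query is less well represented.
    """
    diffs = queries[:, None, :] - reference[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)      # shape: (n_queries, n_reference)
    nearest = np.sort(dists, axis=1)[:, :k]     # k smallest distances per query
    return nearest.mean(axis=1)

# Hypothetical reference embedding: a 5 x 5 grid of points in 2-D.
reference = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float)

# One query inside the reference region, one far outside it.
queries = np.array([[2.2, 2.1], [10.0, 10.0]])

scores = mean_knn_distance(queries, reference)
threshold = 2.0                                 # hypothetical cutoff
low_confidence = scores > threshold             # flags only the distant query
```

A score of this kind can be computed per cell, which is what allows outliers to be flagged without first correcting for batch effects.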

The model also excelled in annotating cell types through its embedding-based similarity measure. SCimilarity annotated individual cells independently, bypassing the need for clustering and retrieving the most similar cells efficiently. It achieved competitive accuracy with existing methods such as single-cell annotation with Variational Inference (scANVI) and CellTypist, even matching fine-grained annotations supported by protein markers. For example, SCimilarity correctly annotated 86.5% of cells in healthy kidney samples compared to author-provided labels, performing on par with tissue-specific models.
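
Annotating each cell independently from its nearest neighbors in embedding space can be sketched as follows. This is a generic k-nearest-neighbor majority vote written for illustration, not the SCimilarity API. The function name annotate_cell, the toy embeddings and the cell type labels are hypothetical.

```python
from collections import Counter

import numpy as np

def annotate_cell(query, ref_embeddings, ref_labels, k=3):
    """Label one cell by majority vote among its k nearest reference cells."""
    dists = np.linalg.norm(ref_embeddings - query, axis=1)
    nearest = np.argsort(dists)[:k]             # indices of the k closest cells
    votes = Counter(ref_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled reference cells in a 2-D embedding.
ref_embeddings = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],         # one group of similar cells
    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1],         # a second, distant group
])
ref_labels = ["T cell"] * 3 + ["macrophage"] * 3

# Each query is annotated on its own; no clustering step is involved.
queries = np.array([[0.1, 0.1], [5.1, 5.0]])
predicted = [annotate_cell(q, ref_embeddings, ref_labels) for q in queries]
```

Because every query is handled separately, the same routine works for a single cell or for millions, given a sufficiently fast neighbor search.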

SCimilarity’s interpretability was validated using Integrated Gradients, which identified critical gene contributions to cell type annotations. These gene attributions were in good agreement with known markers of major cell types, such as surfactant genes that distinguish pulmonary alveolar type 2 (AT2) cells. This demonstrates the ability of SCimilarity to capture biologically meaningful features without prior knowledge of cell type-specific signatures.
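
Integrated Gradients attributes a model's output to its input features by averaging gradients along a straight path from a baseline input to the actual input, then scaling by the difference between the two. The sketch below applies it to a toy logistic model over three "genes". The model, its weights, the all-zero baseline and the number of steps are assumptions for illustration and are unrelated to the trained SCimilarity network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w, b):
    """Toy classifier score: sigmoid of a weighted sum of gene expression."""
    return sigmoid(x @ w + b)

def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to the input x."""
    p = model(x, w, b)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann approximation of Integrated Gradients."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([model_grad(baseline + a * (x - baseline), w, b)
                      for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Hypothetical weights: gene 0 supports the class, gene 1 opposes it,
# gene 2 is ignored by the model.
w = np.array([2.0, -1.0, 0.0])
b = -0.5
x = np.array([1.5, 0.2, 3.0])
baseline = np.zeros(3)

attributions = integrated_gradients(x, baseline, w, b)

# Completeness check: attributions sum to the change in model output.
output_change = model(x, w, b) - model(baseline, w, b)
```

In this toy case the gene the model ignores receives zero attribution even though its expression is the highest, which illustrates why the attributions can be compared against known marker genes.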

The query functions of the model were tested with fibrosis-associated macrophages (FMΦs) and myofibroblasts in interstitial lung disease (ILD). SCimilarity identified FMΦ-like cells across ILD, cancer and other fibrotic disease datasets, revealing shared cellular states. Notably, it revealed FMΦs in rare contexts, such as pancreatic ductal adenocarcinoma (PDAC), suggesting their broader relevance in fibrosis.

To further explore its utility, SCimilarity searched for FMΦ-like cells in vitro. Surprisingly, it identified cells cultured in a 3D hydrogel system as transcriptionally similar to FMΦs. Experimental validation confirmed SCimilarity’s prediction, demonstrating its potential to identify new experimental conditions and model disease-relevant cell states in vitro.

Conclusions

To summarize, SCimilarity advances single-cell analysis by enabling scalable and efficient searches across diverse scRNA-seq and snRNA-seq datasets.

Built on metric learning, it supports both annotation and search of cell profiles, leveraging full expression profiles to reduce bias from curated gene signatures. SCimilarity excels at identifying transcriptionally similar cells, facilitating the discovery of shared cell states, such as FMΦs and myofibroblasts, across diseases.

Its ability to generalize to unseen datasets and its open-source availability make it a foundational tool for exploring the human cell atlas, supporting diverse biological investigations, and revealing insights into human biology and disease mechanisms.
