A new model to continuously learn cell identities in single-cell data

Single-cell RNA sequencing is increasingly used to map the cellular heterogeneity of complex tissues. For example, more than 170 studies have together profiled more than 44 million cells across different brain regions, ages and species. Integrating these datasets into a comprehensive atlas can greatly enhance our understanding of cellular diversity. Such an atlas can be used, for instance, to transfer cell annotations to new unlabeled datasets, facilitating reproducible analysis. However, this approach faces two major challenges. First, identifying relationships between cell populations across datasets is difficult. How does a neuron from study A match a neuron from study B? These identities are assigned by different researchers, and the process is highly subjective. Even more problematic, there is no universal convention for naming cells, and researchers label their data at different levels of detail. For example, one study may label cells simply as neurons, while another distinguishes between excitatory and inhibitory neurons. Second, as more single-cell data becomes available, the atlas needs to be updated, which renders previous analyses based on an older version obsolete.

scHPL can construct cell trees from multiple datasets and use the constructed tree to automatically label new single-cell datasets

In a recent Nature Communications paper from Brainscapes, Michielsen et al. developed a machine learning method called scHPL (single-cell Hierarchical Progressive Learning). Given a set of labeled datasets, scHPL uses classifiers to combine their annotations into a single cellular hierarchy. This hierarchy can be updated progressively as new labeled datasets become available. For a new unlabeled dataset, scHPL uses the previously learned hierarchy to assign cell labels automatically. scHPL was applied to multiple immune and brain single-cell RNA sequencing datasets and was particularly useful for matching cell populations that could not be matched by their assigned labels alone. For example, when combining two datasets from the mouse primary visual cortex, scHPL assigned ‘L6b VISp Col8a1 Rprm’ cells as a subpopulation of ‘L6a Sla’. In the original study, this population was manually assigned as a subpopulation of ‘L6b Rgs12’ cells, yet the expression of marker genes supports the match identified by scHPL. All in all, scHPL is a powerful tool for combining cell annotations from multiple reference atlases, which is particularly relevant to efforts to build a comprehensive atlas for which new data is continuously generated.
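To make the idea of matching populations across datasets more concrete, here is a minimal toy sketch in Python. It is not the scHPL implementation or its API. It assumes a simple nearest-centroid classifier and a majority vote in place of the classifiers and matching rules used in the paper, and every function name and data value below is hypothetical. A classifier built on each dataset predicts the cells of the other; populations that map onto each other one-to-one are treated as equal, and several populations that map onto the same label are treated as its subpopulations.

```python
from collections import Counter, defaultdict
from statistics import fmean


def centroids(cells, labels):
    """Mean expression vector per label (a stand-in for a trained classifier)."""
    groups = defaultdict(list)
    for cell, label in zip(cells, labels):
        groups[label].append(cell)
    return {lab: [fmean(col) for col in zip(*rows)] for lab, rows in groups.items()}


def predict(model, cell):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    return min(model, key=lambda lab: sum((x - c) ** 2 for x, c in zip(cell, model[lab])))


def majority_map(model, cells, labels):
    """For each population, the label that `model` predicts most often for its cells."""
    votes = defaultdict(Counter)
    for cell, label in zip(cells, labels):
        votes[label][predict(model, cell)] += 1
    return {lab: count.most_common(1)[0][0] for lab, count in votes.items()}


def match_populations(cells_a, labels_a, cells_b, labels_b):
    """Cross-predict two labeled datasets; return (child, parent) edges and equal pairs."""
    a_to_b = majority_map(centroids(cells_b, labels_b), cells_a, labels_a)
    b_to_a = majority_map(centroids(cells_a, labels_a), cells_b, labels_b)

    mapped_to_a = defaultdict(list)  # label in A -> populations of B predicted as it
    for b, a in b_to_a.items():
        mapped_to_a[a].append(b)
    mapped_to_b = defaultdict(list)  # label in B -> populations of A predicted as it
    for a, b in a_to_b.items():
        mapped_to_b[b].append(a)

    edges, equal = [], []
    # Several populations mapping onto one label: treat them as its subpopulations.
    for parent, children in list(mapped_to_a.items()) + list(mapped_to_b.items()):
        if len(children) > 1:
            edges += [(child, parent) for child in sorted(children)]
    # Mutual one-to-one mapping: treat the two labels as the same population.
    for a, bs in mapped_to_a.items():
        if len(bs) == 1 and a_to_b.get(a) == bs[0] and len(mapped_to_b[bs[0]]) == 1:
            equal.append((a, bs[0]))
    return edges, equal


# Hypothetical two-gene toy data: study A labels coarsely, study B more finely.
cells_a = [[5.0, 0.0], [5.2, 0.1], [0.0, 5.0], [0.1, 5.1], [-5.0, -5.0], [-5.1, -4.9]]
labels_a = ["Neuron", "Neuron", "Neuron", "Neuron", "Astrocyte", "Astrocyte"]
cells_b = [[5.1, 0.0], [4.9, 0.2], [0.0, 4.9], [0.2, 5.0], [-4.9, -5.0], [-5.0, -5.2]]
labels_b = ["Excitatory", "Excitatory", "Inhibitory", "Inhibitory", "Astro", "Astro"]

edges, equal = match_populations(cells_a, labels_a, cells_b, labels_b)
print("subpopulation edges:", edges)
print("equal populations:", equal)
```

On this toy input, the sketch places 'Excitatory' and 'Inhibitory' under 'Neuron' and pairs 'Astrocyte' with 'Astro', even though none of the label names match. A real analysis would also need to handle cells that fit no known population, which this sketch ignores.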