In this example workflow, we demonstrate a single cell classifier we recently developed in our preprint.

For illustration, we’ve chosen a T cell dataset that we recently published to get started. The TPM expression matrix can be downloaded here.

Library

suppressMessages(library(ggplot2))
suppressMessages(library(tidyverse))
suppressMessages(library(scibet))
suppressMessages(library(viridis))
suppressMessages(library(ggsci))

Load the data

path_da <- "~/test.rds.gz"
expr <- readr::read_rds(path = path_da) 

For expression matrix (TPM), rows should be cells and the last column should be "label".

expr[1:10, 1:10]

E(ntropy)-test for supervised gene selection

Based on our unified model, we developed E-test for supervised gene selection. This step is implemented with SelectGene function. Our use of E-test involves an assumption that there is no heterogeneity within each population and hence 𝑆 could be directly calculated by feeding its corresponding 𝐸 into the 𝑆-𝐸 formula.

etest_gene <- SelectGene(expr, k = 50)
etest_gene
##  [1] "CXCL13"   "CCR7"     "FGFBP2"   "SELL"     "CCL4"     "GZMH"    
##  [7] "GZMB"     "IFNG"     "CCL3"     "FCGR3A"   "GPR15"    "RGS1"    
## [13] "GZMK"     "CX3CR1"   "KLRC2"    "CXCR6"    "GZMA"     "RGS2"    
## [19] "MAL"      "TMIGD2"   "LEF1"     "KLRC1"    "PLEK"     "HAVCR2"  
## [25] "KLRF1"    "GNLY"     "S100B"    "CD160"    "NR4A2"    "KLRG1"   
## [31] "ITGAE"    "NR4A1"    "FOS"      "FCER1G"   "LDLRAP1"  "NKG7"    
## [37] "HLA-DRB5" "VCAM1"    "CCL5"     "CST7"     "PDCD1"    "FCRL6"   
## [43] "C1orf162" "CD82"     "TNFAIP3"  "GPR183"   "LAT2"     "CD69"    
## [49] "S1PR5"    "PLAC8"

To verify these genes, we can examine their expression patterns across different cell types with Marker_heatmap.

Marker_heatmap(expr, etest_gene)

SciBet: Single Cell Identifier Based on Entropy Test

For accurate, fast, and robust single cell identification. We developed SciBet using a multinomial-distribution model. This step is implemented with SciBet function.

1. For reference set, rows should be cells, column should be genes and the last column should be "label" (TPM). 2. For query set, rows should be cells and column should be genes (TPM).

tibble(
  ID = 1:nrow(expr),
  label = expr$label
) %>%
  dplyr::sample_frac(0.7) %>%
  dplyr::pull(ID) -> ID

train_set <- expr[ID,]      #construct reference set
test_set <- expr[-ID,]      #construct query set

prd <- SciBet(train_set, test_set[,-ncol(test_set)])

We can evaluate how well our predicted cell type annotations match the reference with Confusion_heatmap. For this dataset, we find that there is a high agreement in cell type identification.

Confusion_heatmap(test_set$label, prd)

False positive control

Due to the incomplete nature of reference scRNA-seq data collection, cell types excluded from the reference dataset may be falsely predicted to be a known cell type. By applying a null dataset as background, SciBet controls the potential false positives while maintaining high prediction accuracy for cells with types covered by the reference dataset (positive cells).

For illustration, we’ve chosen a recent melanoma dataset with immunde cells as “positive cells” and the other cells (CAF, maligant cells and endothelial cells) as “negative cells”.

For the purposes of this example, these three datasets are used to get started.

null <- readr::read_rds('~/null.rds.gz')
reference <- readr::read_rds('~/reference.rds.gz')
query <- readr::read_rds('~/query.rds.gz')

For query set, “negative cells” account for more than 60%.

ori_label <- query$label
table(ori_label)
## ori_label
##     B.cell Macrophage   Neg.cell         NK      T.CD4      T.CD8 
##        245        126       2228         28        257        528

The confidence score of each query cell is calculated with the function conf_score.

query <- query[,-ncol(query)]
c_score <- conf_score(ref = reference, query = query, null_expr = null, gene_num = 500)

The visualization of above result could be implemented with the function Confusion_heatmap_negctrl.

tibble(
  ori = ori_label,
  prd = SciBet(reference, query),
  c_score = c_score
) -> res

Confusion_heatmap_negctrl(res, cutoff = 0.45)

Integrated dataset of human immune cells

In Fig.3g and Fig.3h of our manuscript, we made a qualitative exploration of the feasibility to visulize the UMAP embeddings using genes selected by E-test by integrating ​multiple datasets from different publications without batch-removal. Because we could not aquaire a published avaliable human immune cell atlas then in 2019, we made a mini-atals of human immune cells which could be used as the SciBet reference for immune cells.
The dataset used in Fig.3g/h was re-processed including integration, normalization, clustering, embedding and re-annotation, because the datasets from different publications could not be directly analyzed due to the inconsistancy in normalization methods, gene expression unit, controlled vocabulary for cell type annotation and so on. The pre-processed and re-annotated dataset can be downloaded here.
We applogize for the reference errors in Supplementary Table 6 of our paper, and the detailed information of this "mini-atlas" can be seen in the following table.