In this example workflow, we demonstrate a single cell classifier we recently developed in our preprint.
For illustration, we’ve chosen a T cell dataset that we recently published to get started. The TPM expression matrix can be downloaded here.
suppressMessages(library(ggplot2))
suppressMessages(library(tidyverse))
suppressMessages(library(scibet))
suppressMessages(library(viridis))
suppressMessages(library(ggsci))
path_da <- "~/test.rds.gz"
expr <- readr::read_rds(path = path_da)
For expression matrix (TPM), rows should be cells and the last column should be
"label"
.
expr[1:10, 1:10]
Based on our unified model, we developed E-test for supervised gene selection. This
step is implemented with SelectGene
function. Our use of E-test involves an assumption that there
is no heterogeneity within each population and hence 𝑆 could be directly calculated by feeding its
corresponding 𝐸 into the 𝑆-𝐸 formula.
etest_gene <- SelectGene(expr, k = 50)
etest_gene
## [1] "CXCL13" "CCR7" "FGFBP2" "SELL" "CCL4" "GZMH"
## [7] "GZMB" "IFNG" "CCL3" "FCGR3A" "GPR15" "RGS1"
## [13] "GZMK" "CX3CR1" "KLRC2" "CXCR6" "GZMA" "RGS2"
## [19] "MAL" "TMIGD2" "LEF1" "KLRC1" "PLEK" "HAVCR2"
## [25] "KLRF1" "GNLY" "S100B" "CD160" "NR4A2" "KLRG1"
## [31] "ITGAE" "NR4A1" "FOS" "FCER1G" "LDLRAP1" "NKG7"
## [37] "HLA-DRB5" "VCAM1" "CCL5" "CST7" "PDCD1" "FCRL6"
## [43] "C1orf162" "CD82" "TNFAIP3" "GPR183" "LAT2" "CD69"
## [49] "S1PR5" "PLAC8"
To verify these genes, we can examine their expression patterns across different
cell types with Marker_heatmap
.
Marker_heatmap(expr, etest_gene)
For accurate, fast, and robust single cell identification. We developed SciBet using
a multinomial-distribution model. This step is implemented with SciBet
function.
1. For reference set, rows should be cells, column should be genes and the last
column should be "label"
(TPM).
2. For query set, rows should be cells and column should be genes (TPM).
tibble(
ID = 1:nrow(expr),
label = expr$label
) %>%
dplyr::sample_frac(0.7) %>%
dplyr::pull(ID) -> ID
train_set <- expr[ID,] #construct reference set
test_set <- expr[-ID,] #construct query set
prd <- SciBet(train_set, test_set[,-ncol(test_set)])
We can evaluate how well our predicted cell type annotations match the reference
with Confusion_heatmap
. For this dataset, we find that there is a high agreement in cell type
identification.
Confusion_heatmap(test_set$label, prd)
Due to the incomplete nature of reference scRNA-seq data collection, cell types excluded from the reference dataset may be falsely predicted to be a known cell type. By applying a null dataset as background, SciBet controls the potential false positives while maintaining high prediction accuracy for cells with types covered by the reference dataset (positive cells).
For illustration, we’ve chosen a recent melanoma dataset with immunde cells as “positive cells” and the other cells (CAF, maligant cells and endothelial cells) as “negative cells”.
For the purposes of this example, these three datasets are used to get started.
null <- readr::read_rds('~/null.rds.gz')
reference <- readr::read_rds('~/reference.rds.gz')
query <- readr::read_rds('~/query.rds.gz')
For query set, “negative cells” account for more than 60%.
ori_label <- query$label
table(ori_label)
## ori_label
## B.cell Macrophage Neg.cell NK T.CD4 T.CD8
## 245 126 2228 28 257 528
The confidence score of each query cell is calculated with the function
conf_score
.
query <- query[,-ncol(query)]
c_score <- conf_score(ref = reference, query = query, null_expr = null, gene_num = 500)
The visualization of above result could be implemented with the function
Confusion_heatmap_negctrl
.
tibble(
ori = ori_label,
prd = SciBet(reference, query),
c_score = c_score
) -> res
Confusion_heatmap_negctrl(res, cutoff = 0.45)
In Fig.3g and Fig.3h of our manuscript, we made a qualitative
exploration of the feasibility to visulize the UMAP embeddings using genes selected by E-test by
integrating multiple datasets from
different publications without batch-removal. Because we could not aquaire a published avaliable human immune
cell atlas then in 2019, we made a mini-atals of human immune cells which could be used as the SciBet reference for immune cells.
The dataset used in Fig.3g/h was re-processed including integration, normalization, clustering, embedding and
re-annotation, because the datasets from different publications could not be directly analyzed due to the
inconsistancy in normalization methods, gene expression unit, controlled vocabulary for cell type annotation
and so on. The pre-processed and re-annotated dataset can be downloaded here.
We applogize for the reference errors in Supplementary Table 6 of our paper, and the detailed information of this "mini-atlas" can be seen in the following table.