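Analysis strategies for cell annotation in 10X single-cell (and 10X spatial transcriptomics) data

Hello everyone. Starting today we are getting on track and sharing analysis content for 10X single-cell and 10X spatial transcriptomics data. Today's topic is analysis strategies for cell annotation, based on the article "Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods", published in Nature Protocols in June 2021 (impact factor around 10). As usual, we first go through the paper and then look at the example code.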

Abstract (our summary)

We recommend a three-step workflow including automatic cell annotation (wherever possible), manual cell annotation and verification. (That is: automatic annotation, presumably with software, then manual annotation, then verification.)

Frequently encountered challenges are discussed, as well as strategies to address them. (Hopefully this will help us.)

Guiding principles and specific recommendations for software tools and resources that can be used for each step are covered, and an R notebook is included to help run the recommended workflow. (So the authors have already written an example; of course, we need some background to use it.)

Introduction

(1) Cell annotation. A single-cell analysis yields an annotated map of cells, and the annotations must be interpretable and must support biological findings.

To interpret this map biologically, it is necessary to determine which cell types or cell states are represented by clusters or other patterns observed in the data. These interpretations can then be labeled on the map, which helps place them in a conceptual framework useful for better understanding tissue biology.

A quick review of the strengths and weaknesses of three dimensionality-reduction methods

t-SNE: preserves local groups of similar cells while equalizing the density of cells within each group; global relationships between cell types are not maintained, and thus cluster-to-cluster relationships cannot be inferred and may be misleading.
UMAP: UMAP is typically regarded as better for visualizing global relationships and gradients than t-SNE (this should be familiar to everyone).
PCA: PCA can be useful for visualizing cell gradients and states; it is a linear dimensionality reduction.

(2) The general steps of cell annotation are automatic annotation, manual annotation and verification. Most readers will already be familiar with this part.

First, automatic annotation uses a predefined set of ‘marker genes’ or reference single-cell data to identify and label individual cells or cell clusters by matching their gene expression patterns (signatures) to those of known cell types. (This is the basis of essentially every cell annotation workflow, and it is also difficult.)

A second major step is manual annotation, which involves studying genes and gene functions specific to each cell cluster or pattern to verify automatic cell annotations and identify novel cell types and states. (Novel cell types in particular are harder to confirm.)

Finally, verification can confirm the identity and function of select cell types using independent methods, such as new validation experiments. Next, let's go through the steps one by one.

Step 1: automatic cell annotation. There are two approaches, (1) marker genes and (2) a reference dataset, and each has pros and cons.

A major challenge with automatic cell annotation is that many cell types do not have well-characterized gene expression signatures, resulting in incomplete or inaccurate labeling for some cells. Automated methods typically work better for major cell types and may not be able to effectively distinguish subtypes. (Labels are more reliable for cell types with many cells.) Automatic cell annotation helps to quickly identify known cell types and to flag unknown cell types for further exploration.

Comparison of the caveats and recommendations for different approaches to cell annotation

Stage of analysis	Aspect of analysis	Potential caveats	Recommendation
Automatic cell annotation	All automatic methods	Fast, but not effective for poorly characterized cells	Use manual annotation for poorly characterized cells
	Annotating clusters	May miss important differences between cells	Use automatic annotation of clusters to get a general idea of cell type and then refine labels manually. In addition, use multiple cluster-based methods and compare results
	Annotating individual cells	Ideal, but requires high reads per cell	Experiments with low reads per cell require cluster-based annotation
	Marker-based annotation methods	Marker genes not easily accessible for all cell types; may result in conflicting or absent cell labels	Requires expert knowledge to curate more extensive marker lists
	Reference-based annotation methods	Perform poorly with incomplete or poorly matched reference data, which may result in conflicting or absent cell labels	Use well-matched reference data or marker-based methods if such data are unavailable
		Often requires batch correction, which may reduce the accuracy of results	Analyze the reference data for strong biological signals. Use a good experimental protocol that will prevail over batch effects
		Mistakes in reference data get carried over to results	Analyze reference data for potential errors before using
	Comparing results from different automatic annotation methods	Results may not agree with each other	Compare confidence scores of respective labels and consider label agreement (majority rule); resolve conflicts using manual annotation
			Consider the possibility of cell subtypes, new cell types or gradients and cell states
Expert manual cell annotation	All manual methods	Slow, labor-intensive	Whenever possible, begin with automatic annotation to determine general cell labels
		Subjective	Work with an expert; consider multiple cell-type conclusions
	Marker-based annotation	Cell types not distinguishable by a single marker	Use multiple markers for each cell type
		Known markers not distinguishing cell types	Curate larger lists of markers from the literature, additional experiments or experts
		Conflicting marker gene sets between sources	Select a marker gene set that best represents the biological signal being looked for in the data (e.g., if looking for cell subtypes, use more extensive gene sets than what is used for general cell-type annotation)

Let's look at the first approach, marker gene-based annotation.

To be successful, the marker gene or gene set (a collection of marker genes) should be specifically and consistently expressed in a given cell, cluster or class of cells. A large and comprehensive marker gene list is required.

(1) To label individual cells (note: this is annotation of single cells), one of the most reliable marker-based annotation tools is Semi-supervised Category Identification and Assignment (SCINA). It is a semi-supervised approach in which we supply the marker gene list ourselves; see the original SCINA paper for the underlying method.

(2) AUCell is another good marker-based labeling method that classifies individual cells or clusters. (For background on AUCell, see the earlier post on how the R package AUCell is used in single-cell analysis.) AUCell ranks the genes in each cell by decreasing expression value, and cells are labeled according to their most active (highly expressed) marker gene sets. (That is the principle, in brief.)
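
To make the ranking idea concrete, here is a minimal base-R sketch. It is not AUCell's implementation: instead of an area under the recovery curve it uses a cruder stand-in, the fraction of a marker set found among the top-ranked genes of each cell. The expression values, the marker sets and the helper `top_fraction` are all made up for illustration.

```r
# Toy expression matrix: 6 genes x 3 cells (values are made up for illustration)
expr <- matrix(
  c(9, 8, 7, 1, 0, 2,   # cell1: T-cell-like
    0, 1, 2, 9, 8, 7,   # cell2: B-cell-like
    5, 4, 6, 3, 7, 2),  # cell3: mixed
  nrow = 6,
  dimnames = list(c("CD3D", "CD3E", "TRAC", "CD79A", "MS4A1", "IGKC"),
                  c("cell1", "cell2", "cell3"))
)

# Hypothetical marker sets
marker_sets <- list(T_cell = c("CD3D", "CD3E", "TRAC"),
                    B_cell = c("CD79A", "MS4A1", "IGKC"))

# Rank the genes of one cell by decreasing expression and return the fraction
# of the marker set that falls among the top_n genes
top_fraction <- function(cell_expr, markers, top_n = 3) {
  ranked_genes <- names(sort(cell_expr, decreasing = TRUE))
  mean(markers %in% head(ranked_genes, top_n))
}

# Score every cell against every marker set (rows = cells, columns = sets)
scores <- sapply(marker_sets, function(m) apply(expr, 2, top_fraction, markers = m))

# Label each cell with its highest-scoring marker set
labels <- colnames(scores)[max.col(scores, ties.method = "first")]
names(labels) <- rownames(scores)
print(scores)
print(labels)
```

Because the score depends only on the ranking within each cell, it is insensitive to differences in sequencing depth between cells, which is the property that makes rank-based scoring attractive for single-cell data.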

(3) Gene set variation analysis (GSVA) has been benchmarked to be fast and reliable. GSVA works similarly to AUCell: given a database of marker gene sets, it identifies sets that are enriched in the gene expression profile of a cluster. (Using GSVA for cell annotation is currently the approach taken by most companies.)

The biggest problem with marker-based annotation: a disadvantage of these tools is that markers are not easily accessible for all cell types.
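
As a rough illustration of cluster-level marker enrichment, the sketch below tests whether the genes up-regulated in a cluster overlap a marker set more than expected by chance. It deliberately swaps in a plain hypergeometric test (`phyper`) rather than GSVA or GSEA, and the gene universe, marker sets and cluster genes are hypothetical.

```r
# Hypothetical universe of measured genes and marker sets
universe <- paste0("gene", 1:100)
marker_sets <- list(
  T_cell = paste0("gene", 1:10),
  B_cell = paste0("gene", 11:20)
)
# Genes up-regulated in the cluster of interest (made up for illustration)
cluster_up <- paste0("gene", c(1:8, 50, 60))

# One-sided hypergeometric test: is the overlap larger than expected by chance?
enrichment_p <- function(cluster_genes, markers, universe) {
  overlap <- length(intersect(cluster_genes, markers))
  phyper(overlap - 1,
         m = length(markers),                     # marker genes in the universe
         n = length(universe) - length(markers),  # non-marker genes
         k = length(cluster_genes),               # number of cluster genes drawn
         lower.tail = FALSE)
}

p_values <- sapply(marker_sets, enrichment_p,
                   cluster_genes = cluster_up, universe = universe)
best_label <- names(which.min(p_values))
print(p_values)
print(best_label)
```

A cluster with no significant enrichment for any set is a candidate for manual annotation rather than a forced label.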

Table of software tools for cell annotation

Tool	Type	Language	Resolution	Approach	Allows ‘None’	Notes
SingleCellNet	Reference based	R	Single cells	Relative-expression gene pairs + random forest	Yes, but rarely does so even when it should	10–100× slower than other methods; high accuracy
scmap-cluster	Reference based	R	Single cells	Consistent correlations	Yes	Fastest method available; balances false positives and false negatives; includes web interface for use with a large pre-built reference or custom reference set
scmap-cell	Reference based	R	Single cells	Approximate nearest neighbors	Yes	Assigns individual cells to nearest neighbor cells in reference; allows mapping of cell trajectories; fast and scalable
SingleR	Reference based	R	Single cells	Hierarchical clustering and Spearman correlations	No	Includes a large marker reference; does not scale to data sets of ≥10,000 cells; includes web interface with marker database
Scikit-learn	Reference based	Python	Multiple possible	k-nearest neighbors, support vector machine, random forest, nearest mean classifier and linear discriminant analysis	(Optional)	Expertise required for correct design and appropriate training of classifier while avoiding overtraining
AUCell	Marker based	R	Single cells	Area under the curve to estimate marker gene set enrichment	Yes	Because of low detection rates at the level of single cells, it requires many markers for every cell type
SCINA	Marker based	R	Single cells	Expectation maximization, Gaussian mixture model	(Optional)	Simultaneously clusters and annotates cells; robust to the inclusion of incorrect marker genes
GSEA/GSVA	Marker based	R/Java	Clusters of cells	Enrichment test	Yes	Marker gene lists must be reformatted in GMT format. Markers must all be differentially expressed in the same direction in the cluster
Harmony	Integration	R	Single cells	Iterative clustering and adjustment	Yes	Integrates only lower-dimensional projection of the data; seamlessly integrated into Seurat pipeline; may overcorrect data
Seurat-canonical correlation analysis	Integration	R	Single cells	MNN anchors + canonical correlation analysis	Yes	Accuracy depends on the accuracy of MNN anchors, which are automatically-identified corresponding cells across data sets
mnnCorrect	Integration	R	Single cells	MNN pairs + singular value decomposition	Yes	Accuracy depends on the accuracy of MNN pairs (cells matched between data sets).
Linked inference of genomic experimental relationships (LIGER)	Integration	R	Single cells	Non-negative matrix factorization	Yes	Allows interpretation of data set–specific and shared factors of variation

Next, the second part: reference-based automatic cell annotation

This approach is possible only if high-quality and relevant annotated reference single-cell data are available. Several public single-cell databases now provide reference datasets. These atlases typically contain hundreds of thousands of cells and dozens of different annotated cell types.

What these methods have in common is that they need a completely annotated reference; once the reference is incomplete or is missing cell types, accuracy drops markedly.

In principle, any integration method can also be used for cell annotation, so here is a brief review of the characteristics of common integration methods.

Harmony iteratively merges data sets represented by top PCs, which are then used to cluster cells. Each cell is iteratively adjusted on the basis of an estimated correction vector to shift it closer to the center of its cluster until convergence. MNN approaches, such as mnnCorrect/FastMNN or Seurat v3, identify the most similar cells (MNNs), called ‘anchors’, across data sets that are used to estimate and correct the cell type–specific batch effects. LIGER identifies shared (common biology) and unique (biological or technical) factors between data sets using non-negative matrix factorization. LIGER is recommended when specific cell types appear to be present in some of the data sets and missing in others. Integration methods can suffer from overcorrection, where different cell types are merged, or undercorrection, when resulting clusters contain cells from only one input data set. Multiple integration methods may need to be evaluated to find a balance that best represents the data. (The batch-correction method also has to be chosen case by case; overcorrection is currently a common problem.)

Refining automatic annotations

Benchmarking studies show variable performance of automatic annotation tools, depending on the data set and distinctiveness of the gene expression profiles of the cell types to be annotated. (Results are not consistent across tools either.)

For instance, distinguishing T cells from B cells is relatively straightforward, but automatic tools sometimes cannot accurately distinguish CD8+ cytotoxic T cells from natural killer cells.

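When applying multiple cell annotation methods to a data set, cells or clusters will acquire multiple, sometimes conflicting, cell-type labels. (This is also the biggest problem.) If the labels agree, the call is easy. If they conflict, each tool provides a confidence score that can be used to identify a single high-scoring label. However, scores from different tools are not directly comparable. Instead, one can go by how often each label is assigned: the more frequently a label appears, the more likely it is the correct cell type. If none of this works, manual annotation is the only option.

Conflicting annotations may well indicate that the cluster contains subtypes; however, if the subtypes cannot be clearly defined, a more general cell-type label may be more appropriate. For example, if a cluster is annotated as regulatory T cells, naive T cells and helper T cells by different methods, it may be most appropriate to assign the general label of ‘T cells’.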
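
The fallback to a more general label can be written down as a small rule. The sketch below is only an illustration: the `parent_of` lookup table and the helper `resolve_labels` are hypothetical, and in practice the hierarchy would come from a curated cell ontology or from expert knowledge.

```r
# Hypothetical mapping from fine labels to their parent cell type
parent_of <- c("regulatory T cells" = "T cells",
               "naive T cells"      = "T cells",
               "helper T cells"     = "T cells",
               "naive B cells"      = "B cells",
               "NK cells"           = "NK cells")

# Resolve the labels that different tools gave to one cluster
resolve_labels <- function(labels, parent_of) {
  # All tools agree: keep the fine label
  if (length(unique(labels)) == 1) return(labels[1])
  # Tools disagree but all labels share one parent: use the general label
  parents <- unique(unname(parent_of[labels]))
  if (length(parents) == 1 && !is.na(parents)) return(parents)
  # Otherwise the cluster needs manual annotation
  "unresolved"
}

print(resolve_labels(c("regulatory T cells", "naive T cells", "helper T cells"), parent_of))
print(resolve_labels(c("naive T cells", "naive B cells"), parent_of))
```

Clusters that come back as "unresolved" are the ones discussed next: possible intermediate states, gradients or novel cell types.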

If the conflicting annotations are not subtypes of the same cell type, then the cluster may represent an intermediate cell state or gene-expression gradient. (An intermediate state or an expression gradient is well worth digging into.)

Finally, a cluster may have a novel cell identity that is absent from the reference data. This often results in widely varying results from automatic annotation methods or insufficient confidence for any tool to assign any label. In such situations, manual annotation must be performed. (Novel cell types require manual annotation.)

Step 2: expert manual cell annotation

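Manual annotation is currently the most reliable approach (the gold-standard method), but it is slow and labor intensive and can be subjective. In practice, manual annotation mostly comes down to selecting markers by hand, which is tedious.

All selected genes must be checked and visualized (for example, with dot plots and heatmaps).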
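
A dot plot summarizes two numbers per gene and cluster: the percentage of cells expressing the gene (dot size) and the mean expression (dot colour). The base-R sketch below computes both on made-up data, so that the numbers behind a dot plot can be checked directly; plotting packages would normally do this for you.

```r
# Toy data: 2 genes x 6 cells with hypothetical cluster assignments
expr <- matrix(c(5, 4, 6, 0, 0, 1,    # CD3E: high in cluster A
                 0, 0, 1, 7, 5, 6),   # CD79A: high in cluster B
               nrow = 2, byrow = TRUE,
               dimnames = list(c("CD3E", "CD79A"), paste0("cell", 1:6)))
cluster <- c("A", "A", "A", "B", "B", "B")

# Percent of cells in each cluster that express each gene (dot size)
pct_expressing <- t(apply(expr > 0, 1, function(x) tapply(x, cluster, mean))) * 100

# Mean expression of each gene in each cluster (dot colour)
mean_expression <- t(apply(expr, 1, function(x) tapply(x, cluster, mean)))

print(round(pct_expressing, 1))
print(round(mean_expression, 2))
```

A good marker is high on both measures in one cluster and low on both elsewhere; a gene that is detected in most cells of every cluster is not informative even if its mean differs slightly.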

Challenges in this approach are that well-known markers are often too few in number to completely annotate an scRNA-seq data set, and some well-known markers may not be as specific within an scRNA-seq data set as expected.

Master transcription factors that drive cell fate often make better gene expression markers than cell-surface proteins that are commonly used to classify cell populations (transcription factor genes discriminate better), because transcript levels and protein levels do not correlate well.

A cell type is usually identified by multiple marker genes, especially when defining subclusters.

The ideal primary source for cell-defining genes is a single-cell atlas from a relevant organism, organ and disease context. (Marker genes are specific to tissue, organ and disease context.)

In some instances a cluster may not express markers of any known cell type; conversely, it may express markers of more than one cell type. (This points to low-quality cells, a novel cell type, or a cluster that contains subclusters.)

Once cell-type information from known markers is exhausted, cells that have not been confidently annotated must be manually examined, cluster by cluster. (When marker genes no longer help, further manual examination is needed; candidate marker genes have to be found by differential expression analysis with dedicated software.)
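
As a minimal sketch of how candidate markers for one cluster can be found, the code below runs a one-vs-rest Wilcoxon rank-sum test per gene in base R. The data are simulated and the gene names are placeholders; dedicated single-cell tools add filtering, effect-size thresholds and faster implementations on top of this idea.

```r
set.seed(42)
# Simulated log-expression: 3 genes x 40 cells; the first 20 cells form the cluster of interest
n <- 20
expr <- rbind(
  geneA = c(rnorm(n, mean = 3), rnorm(n, mean = 0)),  # higher in the cluster
  geneB = c(rnorm(n, mean = 0), rnorm(n, mean = 0)),  # no real difference
  geneC = c(rnorm(n, mean = 0), rnorm(n, mean = 3))   # lower in the cluster
)
in_cluster <- rep(c(TRUE, FALSE), each = n)

# One-vs-rest comparison for every gene
de <- data.frame(
  gene = rownames(expr),
  mean_diff = apply(expr, 1, function(x) mean(x[in_cluster]) - mean(x[!in_cluster])),
  p_value = apply(expr, 1, function(x) wilcox.test(x[in_cluster], x[!in_cluster])$p.value)
)
# Correct for testing many genes
de$p_adj <- p.adjust(de$p_value, method = "BH")

# Candidate markers: significantly higher in the cluster than in the rest
candidates <- de$gene[de$p_adj < 0.05 & de$mean_diff > 0]
print(de)
print(candidates)
```

The candidates are only a starting point: each gene still has to be looked up and interpreted, as the next paragraph describes.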

All marker genes are then manually researched to find functional information that may help identify the cell type of the cluster with which they are associated (for example, via pathway enrichment).

Some cells may be challenging to annotate, including novel cell types, which can be described on the basis of the function of genes they express.

Annotating cell states and gradients (for novel cell types)

When analyzing and characterizing novel cell types, it is important to determine whether they represent a stable cell type or contain multiple cell states. (A stable type, or several cell states?)

The definitions of cell type and cell state are not yet standardized, but a stable cell type can be expected to show homogeneous gene expression across its cluster and to cluster together.

In contrast, cell gradients appear as a spread-out string of cells and cell states. Expression gradients indicate continuous differences that are present in the cell population, which could represent states like the cell cycle, immune activation, spatial patterning or transient developmental stages. Identifying meaningful cell states and removing batch effects are both important.

Annotating intermediate stages of cell development is usually difficult because these regions rarely express unique marker genes. It is often easier to label the ends of a gradient and then characterize intermediate stages using the order of specific genes that mark these ends as increasing or decreasing across the gradient.
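
The idea of anchoring a gradient at its two ends can be sketched in a few lines of base R. The two marker genes and their values are hypothetical, and the "position" used here (end-marker signal minus start-marker signal) is a simple stand-in for a proper pseudotime method.

```r
# Toy data: 5 cells along a hypothetical differentiation gradient
expr <- rbind(
  start_marker = c(cell1 = 9, cell2 = 1, cell3 = 5, cell4 = 7, cell5 = 3),
  end_marker   = c(cell1 = 0, cell2 = 8, cell3 = 4, cell4 = 2, cell5 = 6)
)

# Position along the gradient: end-marker signal minus start-marker signal
position <- expr["end_marker", ] - expr["start_marker", ]
cell_order <- names(sort(position))

# Along this ordering the start marker should fall and the end marker should rise
ordered_expr <- expr[, cell_order]
print(cell_order)
print(ordered_expr)
```

Cells in the middle of the ordering can then be described as intermediate between the two labelled ends, even though they have no unique marker of their own.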

Extracting the cells in the gradient and performing principal component analysis (PCA) on them is often a useful visualization for gradients, because it preserves the large-scale distances between cells. (Some trajectory-analysis tools use this strategy.)

There is currently no method that can automatically annotate intermediate states; the hierarchy among cells can only be identified manually, making use of known structure and cell-type transitions relevant to the particular experiment.

Similarly, homogeneous or similar cell states or cell types are often difficult to annotate because they share many of the same marker genes. (In that case, the cells need to be subclustered and re-analysed.)

Very fine distinctions between highly similar cell types may not be visible transcriptionally and may be visible only in other genomic layers, such as chromatin state (assay for transposase-accessible chromatin using sequencing (ATAC-seq)) and DNA methylation. (Identifying cell types with multi-omics data is also important.)

Step 3: annotation verification

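Verification needs support from other analyses, including multi-omics, scATAC-seq, CNV analysis and so on.

Finally, here is a summary of software for cell annotation.

[Figure: summary of software tools for cell annotation]

Now let's look at the example code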

1. Reference-based automatic annotation

Create the Reference

The first step in performing reference-based annotation is to select an annotated dataset to use as the reference. Here we will use one of the references created by the authors of SingleR and show how it can be used with other tools such as scmap.

Other reference datasets can be found in GEO (https://www.ncbi.nlm.nih.gov/geo/) or at a link provided by the authors of the reference dataset. However, to use a dataset as a reference you will need both the single-cell RNA sequencing data and the cell-type annotations. GEO does not require authors to provide the cell-type annotations of their data, so you may need to contact the authors directly to get the annotations for some datasets.

# Load SingleCellExperiment so that colData(), rowData(), assays() and metadata() are available
library(SingleCellExperiment)
# Set a random seed to ensure result reproducibility
set.seed(9742)
# Download the SingleR reference data for immune cells and save it as the variable "ref"
# The variable is an object of class "SummarizedExperiment"
# This will take a while
ref <- celldex::DatabaseImmuneCellExpressionData()

Next we need to reformat the data to ensure it is compatible with the tool we are using. We will be demonstrating scmap, which uses data formatted as a ‘SingleCellExperiment object’, and assumes by default that gene names are found in a column named ‘feature_symbol’ while the cell-type labels are in a column named ‘cell_type1’. In addition, scmap requires that you normalize and log-transform the reference data; this has already been done for the SingleR reference data so we skip those steps here.

# Assign cell-type labels in a column named "cell_type1"
colData(ref)$cell_type1 <- colData(ref)$label.fine
# Assign gene names in a column called "feature_symbol"
rowData(ref)$feature_symbol <- rownames(ref)

# Convert the data into a SingleCellExperiment object
ref_sce <- SingleCellExperiment::SingleCellExperiment(assays=list(logcounts=Matrix::Matrix(assays(ref)$logcounts)), 
            colData=colData(ref), rowData=rowData(ref))

Our reference data is now ready to be used, so let's process it to build the index that we will map our unlabeled data to. First, we select the genes to use, which will be those deemed most informative by scmap after fitting a linear model to the relationship between gene expression and gene dropout rate. The most informative genes are those whose dropout rate is higher than the model predicts for their expression level.

# Create scmap-cluster reference by first selecting the most informative features
ref_sce <- scmap::selectFeatures(ref_sce, suppress_plot=FALSE)

Your object does not contain counts() slot. Dropouts were calculated using logcounts() slot...
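
To show what dropout-based feature selection does, here is a base-R sketch on simulated counts. It assumes that genes are ranked by how far their log dropout rate lies above a straight line fitted against mean log expression; this mirrors the description of scmap's approach only loosely and is not its actual code. All data and names are made up.

```r
set.seed(7)
n_genes <- 200
n_cells <- 100
# Simulated counts: Poisson noise around gene-specific means
gene_means <- rexp(n_genes, rate = 0.5)
counts <- matrix(rpois(n_genes * n_cells, lambda = rep(gene_means, n_cells)),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
# Make the first 10 genes "on/off": silent in half of the cells, boosted in the rest
counts[1:10, 1:50] <- 0
counts[1:10, 51:100] <- counts[1:10, 51:100] + 20

# Per-gene dropout rate (percent of cells with zero counts) and mean log expression
dropout_pct <- rowMeans(counts == 0) * 100
mean_log_expr <- rowMeans(log2(counts + 1))
# Keep genes that are neither always nor never detected
keep <- dropout_pct > 0 & dropout_pct < 100

# Fit log2(dropout %) against mean log expression and rank genes by residual
fit <- lm(log2(dropout_pct[keep]) ~ mean_log_expr[keep])
res <- residuals(fit)
names(res) <- names(dropout_pct)[keep]
selected <- names(sort(res, decreasing = TRUE))[1:10]
print(selected)
```

Genes that are strongly expressed in some cells and absent in others sit far above the fitted line, which is why this kind of selection tends to pick cell-type-specific genes.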

# Inspect the first 50 genes selected by scmap
rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)][1:50]

# You can check and see how many genes were chosen by checking the length of the
# vector of gene names
length(rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)])

[1] 500

Now we can see the genes that scmap has chosen to use. If there are key marker genes missing we can make sure they are included like this:

# Create a list of key markers that you want to use
my_key_markers = c("TRAC", "TRBC1", "TRBC2", "TRDC", "TRGC1", "TRGC2", "IGKC")
# Ensure markers are in the list of features used by scmap
rowData(ref_sce)$scmap_features[rownames(ref_sce) %in% my_key_markers] <- TRUE
# You can check and see if this added any genes by checking the length 
# of the vector of gene names again
length(rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)])

[1] 502

And we can remove genes that we think might be technical artefacts, such as mitochondrial RNAs, like this:

# Create a list of mitochondrial genes from the dataset (genes that begin with "MT-")
mt_genes <- rownames(ref_sce)[grep("^MT-", rownames(ref_sce))]
# Remove these genes from the features used by scmap
rowData(ref_sce)$scmap_features[rownames(ref_sce) %in% mt_genes] <- FALSE
# Check how many genes this is
length(rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)])

[1] 495

# Extract the features and assign them to a new variable, "scmap_feature_genes"
scmap_feature_genes <- rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)]
# Note that the number of genes/features is identical to what we just checked
length(scmap_feature_genes)

[1] 495

Now we build the reference profiles used in scmap-cluster, for cluster-based cell-type annotation. These profiles can be accessed and plotted from inside the SingleCellExperiment object as follows:

# Create reference profiles;
# Once reference profiles are generated the original data are 
# not needed for scmap-cluster
ref_sce <- scmap::indexCluster(ref_sce)
# Visualize interesting features as a heatmap
# Reformat the data so that they can be used as input to ggplot2
cormat <- reshape2::melt(as.matrix(metadata(ref_sce)$scmap_cluster_index))
# Plot the data
ggplot2::ggplot(cormat, ggplot2::aes(x = Var2, y = Var1, fill = value)) +
  ggplot2::geom_tile() +
  ggplot2::scale_fill_gradient2(low = "blue", high = "darkred",
                                name = "Expression value") +
  ggplot2::theme_minimal() +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 1,
                                            size = 18, hjust = 1),
                 axis.text.y = ggplot2::element_text(size = 15),
                 axis.title.x = ggplot2::element_blank(),
                 axis.title.y = ggplot2::element_blank())

# Store expression information as a variable
scmap_cluster_reference <- metadata(ref_sce)$scmap_cluster_index

From here on out, scmap only needs this set of reference profiles. So if working with a very large reference, one could save this index separately to your computer and reload it when annotating new datasets. But since that is not the case here, we will simply save this index to a variable for now.
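
The idea behind such a cluster index can be sketched in base R: one median expression profile per reference cell type, and a query cell is assigned to the profile it correlates with best, or left unassigned below a threshold. This is a simplification with made-up data. It uses only Spearman correlation, whereas scmap itself combines several similarity measures, and the helper `assign_label` is hypothetical.

```r
# Toy reference: 4 genes x 6 cells with known labels (values are made up)
ref_expr <- matrix(
  c(8, 7, 9, 1, 0, 1,
    7, 9, 8, 0, 1, 0,
    0, 1, 0, 9, 8, 7,
    1, 0, 1, 8, 9, 9),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("CD3E", "TRAC", "CD79A", "MS4A1"), paste0("ref", 1:6))
)
ref_labels <- c("T", "T", "T", "B", "B", "B")

# One profile per cell type: the median expression of each gene
profiles <- sapply(split(seq_along(ref_labels), ref_labels),
                   function(idx) apply(ref_expr[, idx, drop = FALSE], 1, median))

# Assign a query cell to the most correlated profile, or "unassigned" below the threshold
assign_label <- function(query_cell, profiles, threshold = 0.7) {
  similarity <- apply(profiles, 2, function(p) cor(query_cell, p, method = "spearman"))
  if (max(similarity) < threshold) return("unassigned")
  names(which.max(similarity))
}

query_t   <- c(CD3E = 6, TRAC = 5, CD79A = 0, MS4A1 = 1)  # looks like a T cell
query_odd <- c(CD3E = 3, TRAC = 0, CD79A = 3, MS4A1 = 0)  # matches neither profile
print(profiles)
print(assign_label(query_t, profiles))
print(assign_label(query_odd, profiles))
```

Once the profiles are built, the individual reference cells are no longer needed, which is why the index is small enough to save and reuse.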

We will also demonstrate scmap-cell to annotate individual cells of our dataset, so we will create that index as well. As before, one would first normalize and log-transform the reference data and select the genes to use. As we have already done that, we need only run the command to build the scmap-cell index. There are two parameters we can set, M and k. Increasing M and k will give more accurate mapping but will also increase the size of the index and the time needed to map cells. Here we use the defaults (you may see a warning message about the defaults that are being used):

# Update the previous reference to also contain the scmap-cell reference
ref_sce <- scmap::indexCell(ref_sce)

Parameter M was not provided, will use M = n_features / 10 (if n_features <= 1000), where n_features is the number of selected features, and M = 100 otherwise.
Parameter k was not provided, will use k = sqrt(number_of_cells)

# Extract the scmap index from the reference and store as a variable
scmap_cell_reference <- metadata(ref_sce)$scmap_cell_index
# Extract the associated cell IDs from the reference and save as a variable
scmap_cell_metadata <- colData(ref_sce)

scmap-cell assigns cells in one dataset to their “nearest neighbours” in the reference dataset. In this case, the “nearest neighbours” are the cells in the reference dataset most similar to the cells in the query dataset.

One can use any rule they like to transfer information, such as cell-type or pseudotime, from these nearest neighbours to the query data. Thus we need to store the associated metadata (cell type ID) for the reference as well (see above). Now we don’t need to use our original reference dataset anymore.
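
One possible transfer rule is sketched below in base R: find the k most similar reference cells by cosine similarity and transfer a label only if enough of them agree. The data, the helper names and the `min_agree` cutoff are assumptions for illustration; scmap-cell uses an approximate nearest-neighbour search rather than this exhaustive comparison.

```r
# Toy reference (genes x cells) with labels, and one query cell (values are made up)
ref_expr <- cbind(r1 = c(9, 8, 0, 1), r2 = c(8, 9, 1, 0), r3 = c(7, 9, 0, 0),
                  r4 = c(0, 1, 9, 8), r5 = c(1, 0, 8, 9))
ref_labels <- c(r1 = "T", r2 = "T", r3 = "T", r4 = "B", r5 = "B")
query_cell <- c(8, 7, 1, 0)

# Cosine similarity between two expression vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Transfer a label only if enough of the k nearest reference cells agree
transfer_label <- function(query_cell, ref_expr, ref_labels, k = 3, min_agree = 2 / 3) {
  similarity <- apply(ref_expr, 2, cosine, b = query_cell)
  neighbours <- names(sort(similarity, decreasing = TRUE))[seq_len(k)]
  votes <- table(ref_labels[neighbours])
  if (max(votes) / k < min_agree) return("unassigned")
  names(votes)[which.max(votes)]
}

print(transfer_label(query_cell, ref_expr, ref_labels))         # 3 nearest neighbours
print(transfer_label(query_cell, ref_expr, ref_labels, k = 5))  # all 5 reference cells
```

The same neighbour list could carry other metadata, such as pseudotime, by averaging over the neighbours instead of voting.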

Assign cells from the query dataset to the reference.

The query dataset we will be using is provided by 10X genomics.

download.file("https://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz",
              "pbmc3k_filtered_gene_bc_matrices.tar.gz")

trying URL 'https://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz'
Content type 'application/x-tar' length 7621991 bytes (7.3 MB)
==================================================
downloaded 7.3 MB

untar("pbmc3k_filtered_gene_bc_matrices.tar.gz")

Now we need to load our unlabeled dataset into R. Normal preprocessing including QC filtering, normalizing and log-transforming the data must be done prior to annotating. In addition, scmap is based on the SingleCellExperiment object, so if our data is stored as a Seurat object we must convert it to SingleCellExperiment as shown below.

# This portion of the tutorial is assuming the raw 10X data is in the
# following folder in your directory:
data <- Seurat::Read10X("filtered_gene_bc_matrices/hg19/")
# Make SingleCellExperiment from the raw matrix
query_sce <- SingleCellExperiment::SingleCellExperiment(assays=list(counts=data))

# Make SingleCellExperiment from Seurat object
query_seur <- Seurat::CreateSeuratObject(data)

Feature names cannot have underscores ('_'), replacing with dashes ('-')

query_sce <- Seurat::as.SingleCellExperiment(query_seur)

# normalize the data using the scater package
query_sce <- scater::logNormCounts(query_sce)

# add feature_symbol column (i.e. the gene symbols)
rowData(query_sce)$feature_symbol <- rownames(query_sce)

Now you should have an entry in assays(query_sce) called logcounts with the log-normalized matrix. We are now ready to annotate our data with scmap. Let’s start with scmap-cluster:

# Run scmapCluster
scmap_cluster_res <- scmap::scmapCluster(projection=query_sce, 
                index_list=list(immune1 = scmap_cluster_reference), 
                threshold=0.1)

# plot the results of our annotation
par(mar=c(13, 4, 1, 0))
barplot(table(scmap_cluster_res$combined_labs), las=2)

# Store this annotation information within the query object
colData(query_sce)$scmap_cluster <- scmap_cluster_res$combined_labs

# Make a UMAP of the cells, labeled with the cell-type annotations from scmapCluster
query_sce <- scater::runUMAP(query_sce)
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="scmap_cluster")

Alternatively we could use scmap-cell, to find the 10 nearest neighbours to each cell (i.e. the 10 most similar cells to each query cell), then pick the annotation that is most common among the neighbours, like this:

# Determine the 10 nearest neighbours from the reference dataset for each
# cell in the query dataset using scmapCell
nearest_neighbours <- scmap::scmapCell(projection=query_sce, 
    index_list = list(immune1 = scmap_cell_reference), 
    w=10)

# Get metadata (cell type IDs) for the neighbours of each cell in the query dataset
mode_label <- function(neighbours, metadata=scmap_cell_metadata$cell_type1) {
    freq <- table(metadata[neighbours])
    label <- names(freq)[which(freq == max(freq))]
    if (length(label) > 1) {return("ambiguous")}
    return(label)
}

# Apply these labels to the query cells
scmap_cell_labs <- apply(nearest_neighbours$immune1$cells, 2, mode_label)

# Add the labels to the query object
colData(query_sce)$scmap_cell <- scmap_cell_labs

# Create a bar plot of how many cells in the query dataset were assigned
# a specific label
par(mar=c(10, 4, 0, 0))
barplot(table(scmap_cell_labs), las=2)

# Make a UMAP and add the new cell-type annotations
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="scmap_cell")

Another option compatible with the SingleCellExperiment Object is SingleR. As before, we need a reference and a query dataset. In the case of SingleR, we need the entirety of the reference dataset, rather than generating a compressed reference index as we did with scmap. In addition, running just this small example demonstrates the difference in run time between the methods (SingleR takes a fair bit of time).

# Run SingleR on the query data and the reference to acquire
# cell-type predictions for the cells in the query dataset
predictions <- SingleR::SingleR(test=query_sce, ref=ref, labels=ref$label.fine)

# You'll notice that some of the cells didn't get assigned a cell identity
# We can count the number here:
sum(is.na(predictions$pruned.labels))

[1] 23

# Change NAs to "ambiguous"
predictions$pruned.labels[which(is.na(predictions$pruned.labels))] <- "ambiguous"
# Add singleR labels to query_sce
colData(query_sce)$singleR <- predictions$pruned.labels

# Create a bar plot of number of cells per assigned cell ID
par(mar=c(13, 4, 2, 0))
barplot(table(predictions$pruned.labels), las=2)

# Make a UMAP and add the cell-type annotations
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="singleR")

Integration as a form of annotation

Another option is to integrate our query data with our reference data. Then we simply transfer the labels from the annotated reference to the neighbouring query cells in the integrated dataset. Clustering the integrated data is a common approach to transferring labels. We demonstrate how this could be done with Harmony below. But the approach would be the same for any integration tool.

Note: the SingleR reference is not single cells, but averages across many cells. Thus we convert and downsample the reference to a single cell object for demonstration purposes. For a real experiment, one would use the original single cells as the reference when integrating datasets.
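
What downsampling to a target UMI depth means can be shown with a generic binomial-thinning sketch in base R. This illustrates the concept only and is not necessarily how Seurat::SampleUMI is implemented; the counts and the helper `downsample_counts` are made up.

```r
set.seed(123)
# Toy counts: 5 genes x 3 cells (made-up values); cell totals are 1000, 200 and 50
counts <- cbind(cellA = c(400, 300, 200, 100, 0),
                cellB = c(100, 50, 30, 20, 0),
                cellC = c(20, 10, 10, 5, 5))
target_umi <- 100

# Thin each count with a binomial draw so that the expected total matches the target;
# cells already at or below the target are left unchanged (no upsampling)
downsample_counts <- function(counts, target) {
  totals <- colSums(counts)
  keep_prob <- pmin(1, target / totals)
  thinned <- sapply(seq_len(ncol(counts)), function(j) {
    rbinom(nrow(counts), size = counts[, j], prob = keep_prob[j])
  })
  dimnames(thinned) <- dimnames(counts)
  thinned
}

down <- downsample_counts(counts, target_umi)
print(colSums(counts))
print(colSums(down))
```

Bringing the reference and the query to a similar depth keeps sequencing depth from becoming the dominant difference between the two data sets before integration.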

set.seed(2891)
# Convert reference and query datasets to Seurat Objects

# Add a "counts" slot to the reference SingleCellExperiment object so we can convert it to a Seurat Object
assays(ref_sce)[["counts"]] <- round(2^assays(ref_sce)[["logcounts"]]) -1
colnames(ref_sce) <- paste("cell", 1:ncol(ref_sce))

# Subset both objects so both the reference and query datasets have the same genes
# First subset the reference
ref_seur <- Seurat::as.Seurat(ref_sce[rownames(ref_sce) %in% rownames(query_sce),])
ref_seur@active.ident <- factor(rep("reference", ncol(ref_seur)))
# Now subset the query
query_seur <- Seurat::as.Seurat(query_sce[rownames(query_sce) %in% rownames(ref_sce),])
query_seur@active.ident <- factor(rep("query", ncol(query_seur)))

# Downsample the reference to be similar to query in terms of total UMIs
totalUMI <- median(query_seur@meta.data$nCount_RNA)
ref_seur@assays$RNA@counts <- Seurat::SampleUMI(ref_seur@assays$RNA@counts,
                                                max.umi=totalUMI, upsample=FALSE)

# Merge the datasets together into a single Seurat object
merged_seur <- merge(ref_seur, query_seur)
merged_seur@meta.data$source <- merged_seur@active.ident

# Normalize the combined data
merged_seur <- Seurat::NormalizeData(merged_seur)

# Rather than choosing new variable features, we will choose
# the genes that had been previously important by scmap for consistency
Seurat::VariableFeatures(merged_seur) <- scmap_feature_genes

# Scale the data and run dimensionality reduction on the combined data
merged_seur <- Seurat::ScaleData(merged_seur)

merged_seur <- Seurat::RunPCA(merged_seur)

merged_seur <- Seurat::RunUMAP(merged_seur, dims=1:15)

Seurat::DimPlot(merged_seur, reduction="umap") + ggplot2::ggtitle("Before Integration")

# Run Harmony to remove batch effects
merged_seur <- harmony::RunHarmony(merged_seur, "source", dims.use=1:15)

merged_seur <- Seurat::RunUMAP(merged_seur, dims=1:15, reduction="harmony")

# Plot the data
Seurat::DimPlot(merged_seur, reduction="umap") + ggplot2::ggtitle("After Integration")

Now that the data is integrated we will cluster the data and look at the annotations of the reference cells present in each cluster. As with all clustering, this may require manual tuning of the resolution parameters to get the best labels.

# Cluster the integrated dataset
merged_seur <- Seurat::FindNeighbors(merged_seur, reduction="harmony", dims=1:15)

merged_seur <- Seurat::FindClusters(merged_seur, resolution=0.5)

# Plot the data
Seurat::DimPlot(merged_seur, reduction="umap") + ggplot2::ggtitle("After Integration")

# Create a table of cluster labels based on integrated data
table(merged_seur@meta.data$label.fine, 
        merged_seur@active.ident)


                                     0   1   2   3   4   5   6   7   8   9
  B cells, naive                     0   0   0 106   0   0   0   0   0   0
  Monocytes, CD14+                   0   0 106   0   0   0   0   0   0   0
  Monocytes, CD16+                   0   0   0   0   0   0 105   0   0   0
  NK cells                           0 105   0   0   0   0   0   0   0   0
  T cells, CD4+, memory TREG         0   0   0   0   0   0   0 104   0   0
  T cells, CD4+, naive               4   0   0   0  95   4   0   0   0   0
  T cells, CD4+, naive TREG          1   0   0   0 102   0   0   1   0   0
  T cells, CD4+, naive, stimulated   0   0   0   0   0   0   0   0 102   0
  T cells, CD4+, TFH                96   0   0   0   5   0   0   3   0   0
  T cells, CD4+, Th1               103   0   0   0   1   0   0   0   0   0
  T cells, CD4+, Th1_17            104   0   0   0   0   0   0   0   0   0
  T cells, CD4+, Th17               99   0   0   0   0   0   0   5   0   0
  T cells, CD4+, Th2                96   0   0   0   6   0   0   2   0   0
  T cells, CD8+, naive               0   0   0   0   1 103   0   0   0   0
  T cells, CD8+, naive, stimulated   0   0   0   0   0   1   0   0   0 101

Here we have a table of the reference annotations (rows) per cluster (columns). We can label the clusters manually from this table, or we could create a rule to label them algorithmically. Since there are only 10 clusters (0 to 9), we assign the labels manually. Cluster numbering can change between runs, so the mapping must be read off the table produced by your own run; the mapping here follows the table shown above.

cluster_labs <- c("0"="ambiguous",  # mixture of CD4+ TFH, Th1, Th1_17, Th17 and Th2 cells
    "1"="NK cells", 
    "2"="Monocytes, CD14+", 
    "3"="B cells, naive",
    "4"="T cells, CD4+, naive TREG",  # almost evenly split with "T cells, CD4+, naive"
    "5"="T cells, CD8+, naive",
    "6"="Monocytes, CD16+",
    "7"="T cells, CD4+, memory TREG",
    "8"="T cells, CD4+, naive, stimulated",
    "9"="T cells, CD8+, naive, stimulated")

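# Assign each cluster's label to all cells in that cluster, including the query cells.
# The cluster column is a factor, so convert it to character to look labels up by
# cluster name rather than by the factor's integer code.
merged_seur@meta.data$annotation <- unname(cluster_labs[as.character(merged_seur@meta.data$RNA_snn_res.0.5)])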

# Add the results to the SingleCellExperiment Object and plot
query_sce$Harmony_lab <- merged_seur@meta.data$annotation[merged_seur@meta.data$source =="query"]
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="Harmony_lab")

image.png
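
The rule-based alternative mentioned above (labelling clusters algorithmically from the table) can be sketched in base R. This is only a minimal sketch under assumptions: the helper name label_clusters, the 0.8 purity threshold and the toy counts are all made up for illustration and are not part of the original notebook.

```r
# Toy contingency table: reference labels (rows) by cluster (columns).
# The counts are invented for illustration only.
tab <- as.table(matrix(
  c(96,   0,  5,
     0, 105,  0,
     4,   0, 95,
    99,   0,  0),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    label   = c("TFH", "NK cells", "CD4+ naive", "Th17"),
    cluster = c("0", "1", "2")
  )
))

# Hypothetical helper: give each cluster its dominant reference label, or
# "ambiguous" when that label covers less than `min_purity` of the reference
# cells in the cluster (or when the cluster has no reference cells at all).
label_clusters <- function(tab, min_purity = 0.8) {
  purity <- apply(tab, 2, function(x) if (sum(x) == 0) 0 else max(x) / sum(x))
  top    <- apply(tab, 2, function(x) names(x)[which.max(x)])
  ifelse(purity >= min_purity, top, "ambiguous")
}

rule_labs <- label_clusters(tab)
print(rule_labs)
```

With a real run, the input would be the table() result from the step above, and the returned named vector could be used in place of a hand-written cluster_labs. The threshold is a judgment call: a lower value labels more clusters but accepts more mixed ones.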

2. Refining / Consensus annotations

Once we have run several tools, we can use the consensus of the labels to get a more robust annotation. In this case we will simply use the most common label across tools to assign the final automatically annotated label.

annotation_columns <- c("scmap_cluster", "scmap_cell", "singleR", "Harmony_lab")

# Optional: check how consistent the labelling was across tools
# head(colData(query_sce)[,annotation_columns])

get_consensus_label <- function(labels){
    labels <- labels[labels != "ambiguous"]
    if (length(labels) == 0) {return("ambiguous")}
    freq <- table(labels)
    label <- names(freq)[which(freq == max(freq))]
    if (length(label) > 1) {return("ambiguous")}
    return(label)
}

colData(query_sce)$consensus_lab <- apply(colData(query_sce)[,annotation_columns], 1, get_consensus_label)
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="consensus_lab")

image.png
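
To see how the consensus rule behaves, here is a small self-contained sketch that reuses the same function on a toy table of labels. The three example cells and their labels are invented; only the function logic comes from the code above.

```r
# Same consensus rule as above: drop "ambiguous" votes, then take the most
# common label; a tie between labels is itself reported as "ambiguous".
get_consensus_label <- function(labels) {
  labels <- labels[labels != "ambiguous"]
  if (length(labels) == 0) return("ambiguous")
  freq <- table(labels)
  label <- names(freq)[freq == max(freq)]
  if (length(label) > 1) return("ambiguous")
  label
}

# Toy labels for three cells from four tools (values are illustrative only)
toy <- data.frame(
  scmap_cluster = c("NK cells", "B cells", "ambiguous"),
  scmap_cell    = c("NK cells", "T cells", "ambiguous"),
  singleR       = c("NK cells", "B cells", "ambiguous"),
  Harmony_lab   = c("ambiguous", "T cells", "ambiguous"),
  stringsAsFactors = FALSE
)

# Cell 1: three tools agree; cell 2: a 2-vs-2 tie; cell 3: no usable vote
consensus <- apply(toy, 1, get_consensus_label)
print(consensus)
```

Only the first cell gets a label here. The tie and the all-ambiguous cell both stay "ambiguous", which is the conservative behaviour one usually wants before manual annotation.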

3. Marker-based automatic annotation

An alternative way to annotate your query scRNA-seq dataset is to use marker-based annotation tools. SCINA is a semi-supervised annotation tool that takes signature genes and an expression matrix and predicts labels based on prior knowledge of cell-type-specific markers. Marker lists are usually provided in GMT format. The PBMC gene sets used below were gathered by Diaz-Mejia JJ et al.

download.file("https://zenodo.org/record/3369934/files/pbmc_22_10x.tar.bz2",
              "pbmc_22_10x.tar.bz2")

untar("pbmc_22_10x.tar.bz2")

The extracted data will be located in the following file: ./MY_PAPER/SUPPLEMENTARY_DATA/pbmc_22_10x/pbmc_22_10x_cell_type_signature_gene_sets.gmt

The results from this annotation tool are not used in the consensus step above, because the marker gene lists do not match the cell types identified in the reference dataset. The two come from different sources and do not characterize exactly the same set of cells. If you want to combine marker-based and reference-based annotations when determining consensus labels automatically, you have to make sure that all identified cell subtypes are the same and that their names are spelt identically, so that R recognizes them as the same label.
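
As background for the import step, a GMT file is plain text with one gene set per line: the set name, a description field, then the genes, all separated by tabs. The sketch below parses such a file in base R. The helper name read_gmt_simple and the two tiny gene sets are assumptions for illustration; the notebook itself uses msigdb::read.gmt.

```r
# Write a tiny GMT file to a temporary location.
# Gene set names and genes are illustrative, not taken from the Zenodo archive.
gmt_path <- tempfile(fileext = ".gmt")
writeLines(c(
  "B_cells\tna\tCD79A\tMS4A1\tCD19",
  "NK_cells\tna\tNKG7\tGNLY"
), gmt_path)

# Hypothetical helper: parse GMT lines into a named list of gene vectors.
# Field 1 is the set name, field 2 a description, the rest are genes.
read_gmt_simple <- function(path) {
  fields <- strsplit(readLines(path), "\t", fixed = TRUE)
  sets <- lapply(fields, function(f) f[-(1:2)])
  names(sets) <- vapply(fields, `[`, character(1), 1)
  sets
}

signatures <- read_gmt_simple(gmt_path)
str(signatures)
```

The result is a named list of character vectors, one per cell type, which is the general shape that marker-based tools such as SCINA take as signatures.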

# Import the marker genes as a GMT file and store as a variable
markers <- msigdb::read.gmt('./MY_PAPER/SUPPLEMENTARY_DATA/pbmc_22_10x/pbmc_22_10x_cell_type_signature_gene_sets.gmt')
# Convert the expression data from Seurat object into a matrix data structure
exprMatrix <- as.matrix(Seurat::GetAssayData(query_seur))
# Run SCINA on the query data using the marker genes to identify cell types
# Specifying rm_overlap = FALSE allows the same marker gene to specify multiple cell types which
# may be useful if identifying cell subtypes or other similar types of cells
# Specifying allow_unknown = TRUE allows cells to be labeled as "unknown" instead of being
# assigned a low-confidence label
predictions.scina <- SCINA::SCINA(exp = exprMatrix, signatures = markers$genesets,
                          rm_overlap = FALSE, allow_unknown = TRUE)
# Add SCINA annotation information to each cell in the SingleCellExperiment object
colData(query_sce)$SCINA <- predictions.scina$cell_labels

# Make a UMAP and add the SCINA cell-type annotations
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="SCINA") +
  ggplot2::theme(legend.position = "bottom",
                 legend.text = ggplot2::element_text(size = 4))

image.png
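
If you do want to feed marker-based labels into the consensus step, the names first have to be mapped onto the reference vocabulary, as explained above. A minimal sketch of such a mapping follows. The lookup table, the example labels and the helper name harmonize_labels are all hypothetical; a real mapping has to be curated by hand for your marker list and reference.

```r
# Hypothetical lookup from marker-based names to reference-style names
name_map <- c(
  "b_cells"        = "B cells, naive",
  "natural_killer" = "NK cells"
)

# Example marker-based labels (illustrative values only)
scina_labels <- c("b_cells", "unknown", "natural_killer", "dendritic")

# Translate labels through the lookup; anything without a counterpart in the
# reference vocabulary (including "unknown") becomes the fallback label.
harmonize_labels <- function(labels, name_map, fallback = "ambiguous") {
  out <- unname(name_map[labels])
  out[is.na(out)] <- fallback
  out
}

harmonized <- harmonize_labels(scina_labels, name_map)
print(harmonized)
```

Using "ambiguous" as the fallback is a deliberate choice here: the consensus function shown earlier drops "ambiguous" votes, so unmapped marker-based labels simply abstain instead of creating spurious disagreements.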

4. Manual annotation

Retrieving marker genes

If you do not have an extensive list of markers per cell type, or a good-quality reference dataset, it is useful to extract the top marker genes from each cluster of your query data. We can easily do this in Seurat, with the data formatted as a Seurat object (which we created earlier and stored as the variable query_seur). First, the data must be normalized, the variable genes must be determined, and the data must be scaled.

query_seur <- Seurat::NormalizeData(query_seur) # Normalize the data

query_seur <- Seurat::FindVariableFeatures(query_seur) # Determine the variable features of the dataset

query_seur <- Seurat::ScaleData(query_seur) # Scale the data based on the variable features
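
To make the normalize-and-scale steps less of a black box, here is a base R sketch of the arithmetic on a toy matrix. It assumes Seurat's default log-normalization (counts divided by the cell total, multiplied by 10000, then log1p) and plain per-gene z-scoring. The toy counts are invented, and the sketch leaves out details of the real functions, such as the clipping of extreme scaled values.

```r
# Toy count matrix: genes in rows, cells in columns (values invented)
counts <- matrix(
  c(0, 2, 8,
    5, 5, 0,
    5, 3, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("cell1", "cell2", "cell3"))
)

# Log-normalization: divide each cell (column) by its total count,
# multiply by a scale factor, then take log(1 + x)
scale_factor <- 10000
lognorm <- log1p(sweep(counts, 2, colSums(counts), "/") * scale_factor)

# Scaling: z-score each gene (row) across cells so that every gene has
# mean 0 and standard deviation 1
scaled <- t(scale(t(lognorm)))

print(round(lognorm, 2))
print(round(scaled, 2))
```

Normalization removes differences in sequencing depth between cells, while scaling puts genes on a comparable footing so that highly expressed genes do not dominate PCA.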

Next, different types of dimensionality reduction must be performed on the data so that the cells can be grouped together in 2D space.

query_seur <- Seurat::RunPCA(query_seur)

query_seur <- Seurat::RunTSNE(query_seur)
# RunUMAP has already been performed on the data, so the following line of code
# does not need to be run in this case:
#query_seur <- Seurat::RunUMAP(query_seur, dims = 1:50)

From this object, we can cluster the data at a chosen resolution that can be modified later on if desired.

# Determine the "nearest neighbours" of each cell
query_seur <- Seurat::FindNeighbors(query_seur, dims = 1:50)

Computing nearest neighbor graph
Computing SNN

# Cluster the cells
query_seur <- Seurat::FindClusters(query_seur, resolution = 0.5)

Seurat::DimPlot(query_seur, reduction = "UMAP")

image.png

Now let’s extract the top marker genes, and see which ones correspond with each cluster. This can be done using the FindAllMarkers function within Seurat.

markers_seur <- Seurat::FindAllMarkers(query_seur, only.pos = TRUE)

library(dplyr)

# Retrieve the top 5 marker genes per cluster, ranked by average log fold change.
# The column is named avg_log2FC in Seurat v4 and later, avg_logFC in older versions.
fc_col <- grep("^avg_log", colnames(markers_seur), value = TRUE)
top5 <- markers_seur %>% group_by(cluster) %>%
  dplyr::slice_max(.data[[fc_col]], n = 5)
# Create the dot plot
Seurat::DotPlot(query_seur, features = unique(top5$gene)) +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 1,
                                            size = 8, hjust = 1)) +
  Seurat::NoLegend()

image.png
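
For readers who want to see what the top-N selection does without dplyr, the same idea can be sketched in base R on a toy marker table. The helper name top_n_per_cluster, the genes and the fold-change values are assumptions for illustration. Unlike slice_max, this sketch does not keep tied genes beyond n.

```r
# Toy marker table shaped like FindAllMarkers output (values invented)
markers_toy <- data.frame(
  cluster    = c("0", "0", "0", "1", "1", "1"),
  gene       = c("CD3E", "IL7R", "LDHB", "NKG7", "GNLY", "GZMB"),
  avg_log2FC = c(1.2, 2.0, 0.8, 3.1, 2.7, 1.5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the n genes with the largest fold change per cluster
top_n_per_cluster <- function(df, n = 2, fc_col = "avg_log2FC") {
  # Sort by cluster, then by decreasing fold change within each cluster
  ordered <- df[order(df$cluster, -df[[fc_col]]), ]
  # Rank rows within each cluster (1 = largest fold change)
  rank_in_cluster <- ave(seq_len(nrow(ordered)), ordered$cluster, FUN = seq_along)
  ordered[rank_in_cluster <= n, ]
}

top2 <- top_n_per_cluster(markers_toy)
print(top2)
```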

# Create the heatmap
Seurat::DoHeatmap(query_seur, features = unique(top5$gene)) +
  Seurat::NoLegend() +
  ggplot2::theme(axis.text.y = ggplot2::element_text(size = 8))

The following features were omitted as they were not found in the scale.data slot for the RNA assay: NOSIP, CD3E, CD3D, IL7R, LDHB

image.png

Pathway analysis

Pathway analysis can also be done for each cluster to determine significantly up- and downregulated pathways based on known gene function. An easy way to do this is to feed our current Seurat object into cerebroApp. cerebroApp requires that marker genes be fetched again through its own getMarkerGenes function before performing simple pathway analysis.

# First get marker genes through cerebro
query_seur <- cerebroApp::getMarkerGenes(query_seur,
                                         groups = c('seurat_clusters'),
                                         assay = "RNA",
                                         organism = "hg")

# Get enriched pathways through cerebro
query_seur <- cerebroApp::getEnrichedPathways(query_seur,
                                              databases = c("GO_Biological_Process_2018",
                                                            "GO_Cellular_Component_2018",
                                                            "GO_Molecular_Function_2018",
                                                            "KEGG_2016",
                                                            "WikiPathways_2016",
                                                            "Reactome_2016",
                                                            "Panther_2016",
                                                            "Human_Gene_Atlas",
                                                            "Mouse_Gene_Atlas"),
                                              adj_p_cutoff = 0.05,
                                              max_terms = 100,
                                              URL_API = "http://amp.pharm.mssm.edu/Enrichr/enrich")

image.png
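
The idea behind this kind of pathway enrichment can be illustrated with a generic over-representation test: given a cluster's marker genes, is the overlap with a pathway larger than chance would predict? The sketch below uses a hypergeometric test in base R. It is not a description of how cerebroApp or Enrichr compute their scores, and the gene names, set sizes and the helper name enrich_p are invented for illustration.

```r
# Toy setup (all values invented): a universe of 20000 genes, a pathway of
# 100 genes, and 50 cluster markers of which 10 fall in the pathway
universe_size   <- 20000
pathway         <- paste0("G", 1:100)
cluster_markers <- c(paste0("G", 1:10), paste0("X", 1:40))

# Hypothetical helper: probability of seeing at least the observed overlap
# when drawing length(markers) genes at random from the universe
enrich_p <- function(markers, pathway, universe_size) {
  overlap <- length(intersect(markers, pathway))
  phyper(overlap - 1,
         length(pathway),                  # "successes" in the universe
         universe_size - length(pathway),  # "failures" in the universe
         length(markers),                  # number of genes drawn
         lower.tail = FALSE)
}

p_value <- enrich_p(cluster_markers, pathway, universe_size)
print(p_value)

# When many pathways are tested, adjust the p-values before applying a
# cutoff such as 0.05, for example with the Benjamini-Hochberg method
p_adjusted <- p.adjust(c(p_value, 0.03, 0.20), method = "BH")
print(p_adjusted)
```

The expected overlap by chance here is only 0.25 genes, so an overlap of 10 gives a very small p-value. The adjustment step corresponds in spirit to the adj_p_cutoff argument used above.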

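Life is good, and even better with you.

Last edited: 2021.06.28 18:50:38

Reposting is not permitted. If you wish to repost, please contact the author by Jianshu message or in the comments.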

