SCS【13】单细胞转录组之识别细胞对“基因集”的响应 (AUCell)

AUC允许在单细胞RNA数据集中识别具有活性基因集(如gene signature,gene module)的细胞。AUCell使用曲线下面积来计算输入基因集的一个关键子集是否在每个细胞的表达基因中富集。AUCell和GSVA的算法是一回事，都是排序。AUCell借鉴了ssGSVA的算法，但是在排序的时候，有些处理不太一样。在SCENIC转录因子预测的分析过程中，已经封装了AUCell。AUCell的工作流程分为三步，分别通过三个函数完成:

AUCell_buildRankings : Build the rank;
AUCell_calcAUC : Calculate the Area Under the Curve (AUC);
AUCell_exploreThresholds : Set the assignment thresholds;

软件包安装

if (!requireNamespace("BiocManager", quietly=TRUE))
  install.packages("BiocManager")
# To support paralell execution:
BiocManager::install(c("doMC", "doRNG","doSNOW"))
# For the main example:
BiocManager::install(c("mixtools", "SummarizedExperiment"))
# For the examples in the follow-up section of the tutorial:
BiocManager::install(c("DT", "plotly", "NMF", "d3heatmap", "shiny", "rbokeh",
                       "dynamicTreeCut","R2HTML","Rtsne", "zoo"))

数据读取

AUCell所需要的输入数据分为两部分，矩阵数据和基因集数据。

矩阵数据我们通过下载GEO上的数据集GSE60361下载，之后整理可以的矩阵：

library(AUCell)
dir.create("AUCell_tutorial")
setwd("AUCell_tutorial")
if (!requireNamespace("GEOquery", quietly = TRUE)) BiocManager::install("GEOquery")
library(GEOquery)
# gse <- getGEO('GSE60361') # does not work, the matrix is in a suppl file
geoFile <- "GSE60361_C1-3005-Expression.txt.gz"
# download.file('https://ftp.ncbi.nlm.nih.gov/geo/series/GSE60nnn/GSE60361/suppl/GSE60361_C1-3005-Expression.txt.gz',
# destfile = geoFile)

library(data.table)
exprMatrix <- fread(geoFile, sep = "\t")
geneNames <- unname(unlist(exprMatrix[, 1, with = FALSE]))
exprMatrix <- as.matrix(exprMatrix[, -1, with = FALSE])
rownames(exprMatrix) <- geneNames
exprMatrix <- exprMatrix[unique(rownames(exprMatrix)), ]
dim(exprMatrix)
exprMatrix[1:5, 1:4]

# Remove file(s) downloaded:
file.remove(geoFile)

# Convert to sparse
library(Matrix)
exprMatrix <- as(exprMatrix, "dgCMatrix")

# Save for future use
mouseBrainExprMatrix <- exprMatrix
save(mouseBrainExprMatrix, file = "exprMatrix_MouseBrain.RData")

基因集是AUCell软件包自带的直接调取利用getGmt读取即可：

library(AUCell)
library(GSEABase)
gmtFile <- paste(file.path(system.file("examples", package = "AUCell")), "geneSignatures.gmt",
    sep = "/")
geneSets <- getGmt(gmtFile)
geneSets <- subsetGeneSets(geneSets, rownames(exprMatrix))
cbind(nGenes(geneSets))
geneSets <- setGeneSetNames(geneSets, newNames = paste(names(geneSets), " (", nGenes(geneSets),
    "g)", sep = ""))
# Random
set.seed(321)
extraGeneSets <- c(GeneSet(sample(rownames(exprMatrix), 50), setName = "Random (50g)"),
    GeneSet(sample(rownames(exprMatrix), 500), setName = "Random (500g)"))

countsPerGene <- apply(exprMatrix, 1, function(x) sum(x > 0))
# Housekeeping-like
extraGeneSets <- c(extraGeneSets, GeneSet(sample(names(countsPerGene)[which(countsPerGene >
    quantile(countsPerGene, probs = 0.95))], 100), setName = "HK-like (100g)"))

geneSets <- GeneSetCollection(c(geneSets, extraGeneSets))
names(geneSets)

例子操作

计算gene signatures分值

为每个细胞建立基因表达排序，然后进行计算基因标记的富集(AUC)：

cells_AUC <- AUCell_run(exprMatrix, geneSets)
save(cells_AUC, file = "cells_AUC.RData")

cells_rankings <- AUCell_buildRankings(exprMatrix, plotStats = FALSE, splitByBlocks = TRUE)
cells_AUC <- AUCell_calcAUC(geneSets, cells_rankings)
cells_rankings

确定具有给定基因特征或活性基因集的细胞

AUC表示在signature中表达的基因的比例，以及与细胞内其他基因的相对表达值。我们可以使用此属性根据基因集的表达来探索数据集中存在的细胞种群。

set.seed(123)
par(mfrow = c(3, 3))
cells_assignment <- AUCell_exploreThresholds(cells_AUC, plotHist = TRUE, assign = TRUE)

绘制特定基因集的AUC直方图，并设置新的阈值:

cells_assignment$Oligodendrocyte_Cahoy$aucThr$thresholds
cells_assignment$Oligodendrocyte_Cahoy$aucThr$selected
oligodencrocytesAssigned <- cells_assignment$Oligodendrocyte_Cahoy$assignment
length(oligodencrocytesAssigned)
head(oligodencrocytesAssigned)
geneSetName <- rownames(cells_AUC)[grep("Oligodendrocyte_Cahoy", rownames(cells_AUC))]
AUCell_plotHist(cells_AUC[geneSetName, ], aucThr = 0.25)
abline(v = 0.25)

为这个新的阈值分配细胞:

newSelectedCells <- names(which(getAUC(cells_AUC)[geneSetName, ] > 0.08))
length(newSelectedCells)
head(newSelectedCells)

探索细胞分配(表格和热图)，细胞赋值存储在$assignment中。提取所有基因集的细胞，并将其转化为一个表格，并绘制为直方图:

cellsAssigned <- lapply(cells_assignment, function(x) x$assignment)
assignmentTable <- reshape2::melt(cellsAssigned, value.name = "cell")
colnames(assignmentTable)[2] <- "geneSet"
head(assignmentTable)
assignmentMat <- table(assignmentTable[, "geneSet"], assignmentTable[, "cell"])
assignmentMat[, 1:2]
set.seed(123)
miniAssigMat <- assignmentMat[, sample(1:ncol(assignmentMat), 100)]
library(NMF)
aheatmap(miniAssigMat, scale = "none", color = "black", legend = FALSE)
library(DT)
datatable(assignmentTable, options = list(pageLength = 10), filter = "top")

根据signatrue评分探索细胞/簇

AUC评分还可以用于探索以前聚类分析的输出(反之亦然)。

使用获得的不同signature的AUC为t-SNE上色，该t-SNE基于之前运行的整个表达矩阵(即不是5000个随机基因)。

load(paste(file.path(system.file("examples", package = "AUCell")), "cellsTsne.RData",
    sep = "/"))
cellsTsne <- cellsTsne$Y
plot(cellsTsne, pch = 16, cex = 0.3)

这个t-SNE是用下面的代码创建的(需要一段时间来运行):

sumByGene <- apply(mouseBrainExprMatrix, 1, sum)
exprMatSubset <- mouseBrainExprMatrix[which(sumByGene > 0), ]
logMatrix <- log2(exprMatSubset + 1)
logMatrix <- as.matrix(logMatrix)

library(Rtsne)
set.seed(123)
cellsTsne <- Rtsne(t(logMatrix))  ##运行时间长
rownames(cellsTsne$Y) <- colnames(logMatrix)
colnames(cellsTsne$Y) <- c("tsne1", "tsne2")
save(cellsTsne, file = "cellsTsne.RData")
load("cellsTsne.RData")

t-SNE可以根据AUC评分进行着色。为了突出显示根据特征更可能属于该细胞类型的细胞簇，我们将把细胞分为通过分配阈值的细胞(用粉色-红色表示)和没有通过分配阈值的细胞(用黑色-蓝色表示)。

当然，在解释时也应该记住这些签名的来源(例如，这些签名是从大体积RNA-seq分析中获得的)。此外，只在5000个基因上运行本教程可能会带来一些额外的噪音。

对于本例，我们使用了自动分配的阈值:

selectedThresholds <- getThresholdSelected(cells_assignment)
par(mfrow = c(3, 3))  # Splits the plot into two rows and three columns
for (geneSetName in names(selectedThresholds)) {
    nBreaks <- 5  # Number of levels in the color palettes
    # Color palette for the cells that do not pass the threshold
    colorPal_Neg <- (grDevices::colorRampPalette(c("black", "blue", "skyblue")))(nBreaks)
    # Color palette for the cells that pass the threshold
    colorPal_Pos <- (grDevices::colorRampPalette(c("pink", "magenta", "red")))(nBreaks)

    # Split cells according to their AUC value for the gene set
    passThreshold <- getAUC(cells_AUC)[geneSetName, ] > selectedThresholds[geneSetName]
    if (sum(passThreshold) > 0) {
        aucSplit <- split(getAUC(cells_AUC)[geneSetName, ], passThreshold)

        # Assign cell color
        cellColor <- c(setNames(colorPal_Neg[cut(aucSplit[[1]], breaks = nBreaks)],
            names(aucSplit[[1]])), setNames(colorPal_Pos[cut(aucSplit[[2]], breaks = nBreaks)],
            names(aucSplit[[2]])))

        # Plot
        plot(cellsTsne$Y, main = geneSetName, sub = "Pink/red cells pass the threshold",
            col = cellColor[rownames(cellsTsne$Y)], pch = 16)
    }
}

这种图也可以用来探索分配阈值。例如:

selectedThresholds[2] <- 0.25
par(mfrow = c(2, 3))
AUCell_plotTSNE(tSNE = cellsTsne$Y, exprMat = exprMatrix, cellsAUC = cells_AUC[1:2,
    ], thresholds = selectedThresholds)

与均值比较

评估基因特征表达最直观的方法之一是计算平均表达。然而，计算原始计数上的基因签名的平均值将导致读取次数更多的细胞(无论是技术上还是生物学上的原因)拥有更高的大多数签名的平均值。为了减少这种影响，需要某种类型的库大小规范化。由于AUCell单独评估每个单元上的签名检测，它自动补偿库的大小，并使其应用程序更容易，独立于数据单位和应用的其他规范化(例如原始计数、TPM、UMI计数)。

## Comparison with mean
logMat <- log2(exprMatrix + 2)
meanByGs <- t(sapply(geneSets, function(gs) colMeans(logMat[geneIds(gs), ])))
rownames(meanByGs) <- names(geneSets)
colorPal <- (grDevices::colorRampPalette(c("black", "red")))(nBreaks)
par(mfrow = c(3, 3))
for (geneSetName in names(geneSets)) {
    cellColor <- setNames(colorPal[cut(meanByGs[geneSetName, ], breaks = nBreaks)],
        names(meanByGs[geneSetName, ]))
    plot(cellsTsne$Y, main = geneSetName, axes = FALSE, xlab = "", ylab = "", sub = "Expression mean",
        col = cellColor[rownames(cellsTsne$Y)], pch = 16)
}

绘制tSNE降维的AUC：

AUCell_plotTSNE(tSNE = cellsTsne$Y, exprMat = exprMatrix, plots = "AUC", cellsAUC = cells_AUC[1:9,
    ], thresholds = selectedThresholds)

检测到的基因数量

nGenesPerCell <- apply(exprMatrix, 2, function(x) sum(x > 0))
colorPal <- grDevices::colorRampPalette(c("darkgreen", "yellow", "red"))
cellColorNgenes <- setNames(adjustcolor(colorPal(10), alpha.f = 0.8)[as.numeric(cut(nGenesPerCell,
    breaks = 10, right = FALSE, include.lowest = TRUE))], names(nGenesPerCell))
plot(cellsTsne$Y, axes = FALSE, xlab = "", ylab = "", sub = "Number of detected genes",
    col = cellColorNgenes[rownames(cellsTsne$Y)], pch = 16)

How reliable is AUCell? (Confusion matrix)

当使用AUCell识别单元格类型时，所识别单元格的可靠性将取决于使用的签名，以及数据集中单元格类型之间的差异。在任何情况下，AUCell对噪声都相当稳健，无论是在数据上还是在基因集上。

这是用AUCell(本例)识别的细胞与最初发表的细胞类型的比较:

# 'Real' cell type (e.g. provided in the publication)
cellLabels <- paste(file.path(system.file("examples", package = "AUCell")), "mouseBrain_cellLabels.tsv",
    sep = "/")
cellLabels <- read.table(cellLabels, row.names = 1, header = TRUE, sep = "\t")
# Confusion matrix:
cellTypeNames <- unique(cellLabels[, "level1class"])
confMatrix <- t(sapply(cells_assignment[c(1, 4, 2, 5, 3, 6:9)], function(x) table(cellLabels[x$assignment,
    ])[cellTypeNames]))
colnames(confMatrix) <- cellTypeNames
confMatrix[which(is.na(confMatrix), arr.ind = TRUE)] <- 0
confMatrix

我们这期主要介绍单细胞转录组数据中识别细胞对“基因集”的响应 (AUCell)。目前单细胞测序的费用也在降低，单细胞系列可算是目前的测序神器。

References:

Aibar et al. (2017) SCENIC: single-cell regulatory network inference and clustering. Nature Methods. doi: 10.1038/nmeth.4463
Cahoy, J.D., et al. (2008). A Transcriptome Database for Astrocytes, Neurons, and Oligodendrocytes: A New Resource for Understanding Brain Development and Function. J. Neurosci. 28, 264–278. doi: 10.1523/JNEUROSCI.4178-07.2008
Lein, E.S., et al. (2007). Genome-wide atlas of gene expression in the adult mouse brain. Nature 445, 168–176. doi: 10.1038/nature05453
Lavin, Y., et al. (2014) Tissue-Resident Macrophage Enhancer Landscapes Are Shaped by the Local Microenvironment. Cell 159, 1312–1326. doi: 10.1016/j.cell.2014.11.018

本文使用文章同步助手同步

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 158,847评论 4赞 362
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 67,208评论 1赞 292
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 108,587评论 0赞 243
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 43,942评论 0赞 205
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 52,332评论 3赞 287
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 40,587评论 1赞 218
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 31,853评论 2赞 312
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 30,568评论 0赞 198
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 34,273评论 1赞 242
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 30,542评论 2赞 246
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 32,033评论 1赞 260
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 28,373评论 2赞 253
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 33,031评论 3赞 236
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 26,073评论 0赞 8
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 26,830评论 0赞 195
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 35,628评论 2赞 274
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 35,537评论 2赞 269