bioconductor-DropletUtils
使用教程：Utilities for handling droplet-based single-cell RNA-seq data
对于基于液滴(droplet-based)的单细胞测序，通常只保留包含且只包含一个细胞的液滴生成的数据。R包DropletUtils针对10X Genomics平台，根据观察到的每个液滴的表达谱与周围溶液的表达谱来区分空液滴（empty droplets，只含溶液中RNA）和含细胞的液滴。
DropletUtils主要功能如下：

读入10X Genomics平台的UMI count matrix
读入CellRanger生成的molecule information file (molecule_info.h5)
降采样：downsampling the UMI count matrix or the raw reads
识别空液滴和doublets
去除Illumina 4000测序仪的barcode swapping效应

R包安装、载入

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DropletUtils")
library(DropletUtils)

读入10X Genomics数据

读入UMI count matrix

CellRanger生成的稀疏矩阵数据一般在/outs/filtered_gene_bc_matrices/目录，包含barcodes.tsv.gz、features.tsv.gz和matrix.mtx.gz（CellRanger version 3），或者包含barcodes.tsv、genes.tsv、matrix.mtx（CellRanger version 2）。
本教程使用模拟数据：

# To generate the files.
example(write10xCounts, echo=FALSE) 
dir.name <- tmpdir
list.files(dir.name)
## [1] "barcodes.tsv" "genes.tsv"    "matrix.mtx"

read10xCounts函数读取CellRanger输出，并返回SingleCellExperiment对象

sce <- read10xCounts(dir.name)
sce
## class: SingleCellExperiment 
## dim: 100 10 
## metadata(1): Samples
## assays(1): counts
## rownames(100): ENSG00001 ENSG00002 ... ENSG000099 ENSG0000100
## rowData names(2): ID Symbol
## colnames: NULL
## colData names(2): Sample Barcode
## reducedDimNames(0):
## altExpNames(0):

载入的表达矩阵是稀疏矩阵，是R包Matrix的 dgCMatrix对象，该类型对象只储存非零的counts，节省内存空间，广泛应用于有很多dropouts的单细胞测序数据。

class(counts(sce))
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"

也可同时导入多个样本的测序数据，创建一个character向量传入read10xCounts函数即可，返回的是一个单一的SingleCellExperiment对象，其中多个样本的表达矩阵按列整合，然而此种情况要求不同样本表达矩阵的基因一致。

读入molecule information file

CellRanger生成的molecule information file (molecule_info.h5) 文件一般包含如下信息：UMI序列，barcode序列，RNA的reads数。
本教程使用示例文件：

set.seed(1000)
mol.info.file <- DropletUtils:::sim10xMolInfo(tempfile())
mol.info.file
## [1] "/tmp/Rtmpg4CWdT/file3a62437f43e8.1.h5"

read10xMolInfo函数读入R：

mol.info <- read10xMolInfo(mol.info.file)
mol.info
## $data
## DataFrame with 9532 rows and 5 columns
##             cell       umi gem_group      gene     reads
##      <character> <integer> <integer> <integer> <integer>
## 1           TGTT     80506         1        18        12
## 2           CAAT    722585         1        20        17
## 3           AGGG    233634         1         4        17
## 4           TCCC    516870         1        10        13
## 5           ATAG    887407         1         6         8
## ...          ...       ...       ...       ...       ...
## 9528        TACT   1043995         1         9        18
## 9529        GCTG    907401         1        20         7
## 9530        ATTA    255710         1        13         7
## 9531        GCAC    672962         1        20        12
## 9532        TGAA    482852         1         1        10
## 
## $genes
##  [1] "ENSG1"  "ENSG2"  "ENSG3"  "ENSG4"  "ENSG5"  "ENSG6"  "ENSG7"  "ENSG8" 
##  [9] "ENSG9"  "ENSG10" "ENSG11" "ENSG12" "ENSG13" "ENSG14" "ENSG15" "ENSG16"
## [17] "ENSG17" "ENSG18" "ENSG19" "ENSG20"

molecule information file文件中的信息有利于质控环节，比如检查测序饱和度时需要read counts数。
read10xMolInfo函数会自动猜测barcode序列长度，多数情况下猜测是准确的，用户也可用barcode.length参数传入已知的barcode序列长度。

Downsampling across batches

downsampling the UMI count matrix

对于测序深度相差较大的多批次测序，将测序深度最大的批次降采样(downsample)，使其匹配测序深度最小的批次的覆盖率，有利于避免技术噪声导致下游分析按批次聚类。downsampleMatrix函数可实现此功能。

set.seed(100)
new.counts <- downsampleMatrix(counts(sce), prop=0.5)
library(Matrix)
colSums(counts(sce))
##  [1] 508 524 490 518 464 468 484 544 519 473
colSums(new.counts)
##  [1] 254 262 245 259 232 234 242 272 260 237

以上代码使每个细胞的total count减半。
参数prop由用户自定义，取决于实验批次数量以及哪个批次覆盖率最低。参数prop可以是向量，，包含细胞特异的proportions。

downsampling the raw reads

对reads进行downsample更恰当，因为考虑了每个细胞的测序深度差异。通过对包含read counts信息的molecule information file应用downsampleReads函数实现此功能。

set.seed(100)
no.sampling <- downsampleReads(mol.info.file, prop=1)
sum(no.sampling)
## [1] 9532
with.sampling <- downsampleReads(mol.info.file, prop=0.5)
sum(with.sampling)
## [1] 9502

以上代码将reads 降采样至原始覆盖率的50%。downsampleReads函数返回的是UMI counts矩阵，假如测序饱和度很高的话，最后的total count 并不会下降太多。用户如果希望降采样后total count差不多的话，应该使用上一节的downsampleMatrix函数。

Computing barcode ranks

对液滴法测序数据的一个有用的diagnostic 是barcode rank plot图，y轴是每个barcode的(log-)total UMI count，x轴是(log-)rank ，这实际上是一个坐标轴对数变换的转置经验累积密度图，以查看barcodes的total counts分布。
模拟表达矩阵：

set.seed(0)
my.counts <- DropletUtils:::simCounts()

计算barcode ranks，画图

br.out <- barcodeRanks(my.counts)

# Making a plot.
plot(br.out$rank, br.out$total, log="xy", xlab="Rank", ylab="Total")
o <- order(br.out$rank)
lines(br.out$rank[o], br.out$fitted[o], col="red")

abline(h=metadata(br.out)$knee, col="dodgerblue", lty=2)
abline(h=metadata(br.out)$inflection, col="forestgreen", lty=2)
legend("bottomleft", lty=2, col=c("dodgerblue", "forestgreen"), 
    legend=c("knee", "inflection"))

barcode rank plot

曲线的knee和inflection points （拐点）标志着 total count分布的转变，体现含有少量RNA的空液滴和含有大量RNA的细胞液滴之间的区别。不过下面的代码将更严格地区分二者。

检测空液滴(empty droplets)

空液滴可能包含来自周围溶液的RNA，因此counts不为零。emptyDrops函数用以区分空液滴和含有细胞的液滴，原理为检验每个barcode的表达谱和周围溶液的表达谱有无显著偏差。

set.seed(100)
e.out <- emptyDrops(my.counts)
e.out
## DataFrame with 11100 rows and 5 columns
##           Total   LogProb    PValue   Limited        FDR
##       <integer> <numeric> <numeric> <logical>  <numeric>
## 1             2        NA        NA        NA         NA
## 2             9        NA        NA        NA         NA
## 3            20        NA        NA        NA         NA
## 4            20        NA        NA        NA         NA
## 5             1        NA        NA        NA         NA
## ...         ...       ...       ...       ...        ...
## 11096       215  -246.428 9.999e-05      TRUE 0.00014427
## 11097       201  -250.234 9.999e-05      TRUE 0.00014427
## 11098       247  -275.905 9.999e-05      TRUE 0.00014427
## 11099       191  -228.763 9.999e-05      TRUE 0.00014427
## 11100       198  -233.043 9.999e-05      TRUE 0.00014427

FDR低于特定阈值(比如0.01)的液滴代表与周围溶液的表达谱有显著偏差，这些液滴可看作是含有细胞的液滴。并且，为了避免去除内含细胞表达谱与周围溶液表达谱很相似的液滴，counts较大的液滴被自动设置p-value为零以保留。

is.cell <- e.out$FDR <= 0.01
sum(is.cell, na.rm=TRUE)
## [1] 902

p-values用置换检验计算，因此需设置种子数（set.seed）。计算结果e.out的Limited一列为逻辑值，表示是否可以通过增加置换数目来降低p值。如果有些条目的FDR高于设定阈值而Limited==TRUE，则表明应该提高emptyDrops函数的niters参数。

table(Limited=e.out$Limited, Significant=is.cell)
##        Significant
## Limited FALSE TRUE
##   FALSE   398  802
##   TRUE      0  100

画诊断图：the total count against the negative log-probability
含有细胞的液滴应该negative log-probabilities较大，或total counts 较大（total counts阈值参考上一节barcodeRanks函数得到的拐点）
下图基于模拟数据，因此较夸张：

plot(e.out$Total, -e.out$LogProb, col=ifelse(is.cell, "red", "black"),
    xlab="Total UMI count", ylab="-Log Probability")

the total count against the negative log-probability(红色为含细胞液滴，黑色为空液滴)

Demultiplexing hashed libraries

hashedDrops()函数用于demultiplex 单细胞Cell hashing实验。
首先模拟hash tag oligo (HTO) count matrix，同时加入doublets 和empty droplets。

set.seed(10000)

# Simulating empty droplets:
nbarcodes <- 1000
nhto <- 10
y <- matrix(rpois(nbarcodes*nhto, 20), nrow=nhto)

# Simulating cells:
ncells <- 100
true.sample <- sample(nhto, ncells, replace=TRUE)
y[cbind(true.sample, seq_len(ncells))] <- 1000

# Simulating doublets:
ndoub <- ncells/10
next.sample <- (true.sample[1:ndoub]  + 1) %% nrow(y)
next.sample[next.sample==0] <- nrow(y)
y[cbind(next.sample, seq_len(ndoub))] <- 500

首先从HTO count matrix识别含细胞的液滴，emptyDrops()函数需加lower=参数来匹配HTO文库的测序深度。

hto.calls <- emptyDrops(y, lower=500)
has.cell <- hto.calls$FDR <= 0.001
summary(has.cell)
##    Mode    TRUE    NA's 
## logical     100     900

每个barcode文库根据含量最高的HTO分配样本来源，分配的置信度由含量最高和次高的HTO数之间的log-fold change(下面表格demux的LogFC列)来定量。hashedDrops()函数会自动矫正周围溶液的HTO水平差异。

demux <- hashedDrops(y[,which(has.cell)], 
    ambient=metadata(hto.calls)$ambient)
demux
## DataFrame with 100 rows and 7 columns
##         Total      Best    Second     LogFC    LogFC2   Doublet Confident
##     <numeric> <integer> <integer> <numeric> <numeric> <logical> <logical>
## 1        1657         4         5  0.999462   4.60496      TRUE     FALSE
## 2        1635         8         9  0.999492   4.84165      TRUE     FALSE
## 3        1669         6         7  0.999473   4.45073      TRUE     FALSE
## 4        1674         6         7  0.999491   4.49983      TRUE     FALSE
## 5        1645         3         4  1.000292   4.74602      TRUE     FALSE
## ...       ...       ...       ...       ...       ...       ...       ...
## 96       1167         3         1   5.31708  0.427468     FALSE      TRUE
## 97       1158         3         1   5.26081  0.526363     FALSE      TRUE
## 98       1179         4         9   5.00121  0.604380     FALSE      TRUE
## 99       1187         2         5   5.37410  0.196833     FALSE      TRUE
## 100      1177         5         8   5.15739  0.464633     FALSE      TRUE

然后便可判断每个细胞的样本来源。R包作者提供了Confident接口，表明哪些液滴是确切的singlets，识别依据为(i) 液滴不是doublets ， (ii)含量最高和次高的HTO数之间的log-fold change不是很小。（本人理解：上面表格demux的Confident列是由Doublet列和LogFC列来定义的，Confident值为TRUE表明液滴是确切的singlets，Confident列为TRUE的行的Best列表示每个细胞的样本来源，即demux$Best[demux$Confident]）

table(demux$Best[demux$Confident])
## 
##  1  2  3  4  5  6  7  8  9 10 
## 10 15  9  7 12  8  6  6 10  6

还根据含量次高的HTO数与周围溶液的HTO水平之间的log-fold change(上面表格demux的LogFC2列)来识别doublets ，LogFC2值越大，说明每个液滴含量次高的HTO越不可能来源于周围溶液的HTO污染，更加证实了doublet的存在。

colors <- ifelse(demux$Confident, "black",
    ifelse(demux$Doublet, "red", "grey"))
plot(demux$LogFC, demux$LogFC2, col=colors,
    xlab="Log-fold change between best and second HTO",
    ylab="Log-fold change between second HTO and ambient")

黑色为确切的singlet，红色为doublet，灰色为不是确切的singlet但也不是doublet

去除swapping效应

去除样本间barcode swapping效应

用Illumina 4000测序仪混样测序时常发生barcode swapping的现象。一个样本的分子被另一个样本的barcode错误标记，导致在demultiplex过程中被错误地分配。幸运的是液滴法单细胞测序可以消除这种效应，因为不可能存在多个RNA分子的细胞barcode和UMI序列的组合完全一致的情况，因此多个样本中细胞barcode和UMI序列的组合完全一致的RNA分子很可能来源于barcode swapping。
swappedDrops函数根据同一run中10X混合测样的molecule information file识别并去除barcode swapping的分子，生成“cleaned” UMI count matrices。
首先模拟10X混合测样的molecule information file：

set.seed(1000)
mult.mol.info <- DropletUtils:::sim10xMolInfo(tempfile(), nsamples=3)
mult.mol.info
## [1] "/tmp/Rtmpg4CWdT/file3a627557651c.1.h5"
## [2] "/tmp/Rtmpg4CWdT/file3a627557651c.2.h5"
## [3] "/tmp/Rtmpg4CWdT/file3a627557651c.3.h5"

然后用swappedDrops函数去除barcode swapping效应：

s.out <- swappedDrops(mult.mol.info, min.frac=0.9)
length(s.out$cleaned)
## [1] 3
class(s.out$cleaned[[1]])
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"

细胞间的嵌合reads

嵌合分子是在文库准备过程中产生的，在此过程中，一个cDNA分子的不完全PCR产物通过共享序列(如3 'protocols的poly-A尾巴)杂交到另一个分子上进行延伸。这就产生了一种扩增子，其中UMI和细胞barcode来自一个转录本，而基因序列来自另一个转录分子，相当于基因间的reads交换。
chimericDrops()函数应用于molecule information file获得cleaned count matrix，去除嵌合reads效应，去除同一细胞中有相同的UMI的不同转录本，只保留reads数较高的那个。

out <- chimericDrops(mult.mol.info[1])
class(out)
## [1] "list"

其他用法

与read10xCounts函数相对，DropletUtils包的write10xCounts函数可反向将UMI counts稀疏矩阵转换成CellRanger输出文件（3个文件或HDF5格式）。
参数和使用查看?write10xCounts ，其中version参数表示要输出成CellRanger version 3.0还是version 2格式。
例如：反向将Seurat对象中的表达矩阵转换成barcodes.tsv.gz、features.tsv.gz和matrix.mtx.gz三个文件。

library(DropletUtils)    # 加载 R 包
# 从seurat 对象导出 barcodes.tsv.gz、features.tsv.gz 和 matrix.mtx.gz 至 output 文件夹
write10xCounts("output/",seurat_obj[["RNA"]]@counts, version = "3")    
R.utils::gunzip("output/features.tsv.gz")   # 解压gz文件

R包DropletUtils使用