困扰的batch effect

一、什么是批次效应

批次效应（batch effect），表示样品在不同批次中处理和测量产生的与试验期间记录的任何生物变异无关的技术差异。批次效应是高通量试验中常见的变异来源，受日期、环境、处理组、实验人员、试剂、平台等一些非生物因素的影响。
合并分析不同批次的数据时，平常的标准化方法不足以调整批次之间的差异。如果批次效应比较严重，这些差异就会干扰实验结果，我们就不能够判断得到的差异表达的基因是来源于想要研究的因素，还是和批次相关。
批次效应不能被消除，只有尽可能的降低。校正批次效应的目的是，减少批次之间的差异，尽量让多个批次的数据重新组合在一起，这样下游分析就可以只考虑生物学差异因素。

Batch effects are sub-groups of measurements that have qualitatively different behaviour across conditions and are unrelated to the biological or scientific variables in a study. For example, batch effects may occur if a subset of experiments was run on Monday and another set on Tuesday, if two technicians were responsible for different subsets of the experiments or if two different lots of reagents, chips or instruments were used.

Normalization is a data analysis technique that adjusts global properties of measurements for individual samples so that they can be more appropriately compared. Including a normalization step is now standard in data analysis of gene expression experiments. But normalization does not remove batch effects, which affect specific subsets of genes and may affect different genes in different ways. [1]

二、处理方法

目前已经有多种处理批次差异的方法。[2]

批次效应处理方法

三、哪种方法比较好

一项研究比较了6种去除批次效应的方法，其中包括ComBat方法（parametric prior method，ComBat_p和non-parametric method，ComBat_n）、代理变量法（Surrogate variable analysis，SVA）、基于比值的方法（Geometric ratio-based method，Ratio_G）、平均中心方法（Mean-centering，PAMR）和距离加权判别（Distance-weighted discrimination，DWD）方法，综合多种指标认为ComBat在精确性、准确性和整体性能方面（precision, accuracy and overall performance）总体优于其他方法。

A number of approaches have been used or developed for identifying and removing batch effects from microarray data, of which we have chosen six for evaluation.
We systematically evaluated six of these programs using multiple measures of precision, accuracy and overall performance. ComBat, an Empirical Bayes method, outperformed the other five programs by most metrics.
Its parametric and non-parametric algorithms both worked well in both kinds of data sets, controlling the variation attributable to batch effects, increasing the correlation among replicates, and producing the largest AUC in our assessment of overall performance. [3]

四、ComBat方法的算法

模型的假设是基于位置和尺度（Location and scale，L/S）的调整。L/S调整可以定义为一系列广泛的调整，其中为数据在批次内的位置（均值）和/或规模（方差）假设了一个模型，然后调整批次以满足假设模型的规范。因此，L/S批次调整假设批次效应可以通过标准化批次之间的均值和方差来建模。这些调整可以从简单的基因范围的均值和方差标准化，到复杂的基因间线性或非线性调整。

模型：Yijg = αg + Xβg + γig + δigεijg
Yijg表示来自批次i的样品j的基因g的表达值。其中αg是基因g的平均表达值，X是样本条件的设计矩阵，βg是对应于X的回归系数向量。误差项εijg服从期望值为0和方差为σg的正态分布N(0，σg )，γig和δig表示批次i中基因g加法和乘法的批次效应。

算法分为3步： [4]

算法

Location and scale (L/S) adjustments can be defined as a wide family of adjustments in which one assumes a model for the location (mean) and/or scale (variance) of the data within batches and then adjusts the batches to meet assumed model specifications. Therefore, L/S batch adjustments assume that the batch effects can be modeled out by standardizing means and variances across batches. These adjustments can range from simple gene-wise mean and variance standardization to complex linear or non-linear adjustments across the genes. [5]

五、推荐的解决方法

（1）根据分析目的确定批次效应处理方法：差异表达分析，在模型中添加批次因素；可视化，先对原数据进行校正，再使用校正后的数据进行分析。
（2）已知的批次，removeBatchEffect或ComBat；未知的批次，sva。
（3）removeBatchEffect和ComBat、sva输入的数据需要进行转换，例如取对数（rlog或logCPM）。
（4）read counts数据可使用ComBat-Seq或svaseq。

六、差异表达分析的批次校正

很多人以为去除批次效应是要改变你的表达矩阵，新的表达矩阵然后去走差异分析流程，其实大部分的差异分析流程包里面，人家内置好了考虑你的批次效应这样的混杂因素的函数用法设计。例如：
构建DESeq2对象时的设计公式：design = ~ batch + conditions

DESeq2

因为我们把批次这个因素，写在了DESeq2里面，所以它在帮我们做差异分析的时候，其实就考虑了，得到的差异分析结果，就是去除了批次效应的。

如果要合并不同批次的数据进行差异表达分析，建议直接把批次信息加入到构建模型里面。但是这种方法并没有改变原数据。如果你确实一定要亲眼看看批次效应到底是如何影响这个表达矩阵的，就需要对表达矩阵做另外的处理，例如removeBatchEffect或ComBat。但是处理之后会改变counts矩阵，之后就没办法走DESeq2差异分析流程啦，仅仅是为了拿到去除批次效应前后对比的表达矩阵而已。

PCA

七、使用limma的removeBatchEffect处理批次效应

removeBatchEffect这个函数用于进行聚类或无监督分析之前，移除与杂交时间或其他技术变异相关的批次效应。它是针对芯片设计的，因此不要直接使用read counts，数据需要经过一定的标准化操作，如log转化。

removeBatchEffect只用于衔接聚类、PCA等可视化展示，不要在线性建模之前使用。因为用矫正后的数据进行差异表达分析，有两个缺陷：一是批次因素和分组因素可能重叠，所以直接对原数据矫正批次可能会抵消一部分真实生物学因素；二是低估了误差。所以，如果想做差异表达分析，但数据中又有已知的批次问题，则最好把批次效应纳入线性模型中。

This function is useful for removing batch effects, associated with hybridization time or other technical variables, prior to clustering or unsupervised analysis such as PCA, MDS or heatmaps. The design matrix is used to describe comparisons between the samples, for example treatment effects, which should not be removed. The function (in effect) fits a linear model to the data, including both batches and regular treatments, then removes the component due to the batch effects.
This function is intended for use with clustering or PCA, not for use prior to linear modelling. If linear modelling is intended, it is better to include the batch effect as part of the linear model.

removeBatchEffect用法

removeBatchEffect

八、使用SVA的ComBat处理批次效应

sva这个R包可以处理已知的和未知的批次效应，sva函数可以通过构建高维数据集的代理变量，移除批次效应和其它所有不需要的变异。如果是芯片数据用sva，高通量测序数据使用svaseq。ComBat函数可以处理已知的批次效应。

sva has functionality to estimate and remove artifacts from high dimensional data the sva function can be used to estimate artifacts from microarray data. the svaseq function can be used to estimate artifacts from count-based RNA-sequencing (and other sequencing) data. The ComBat function can be used to remove known batch effecs from microarray data. The fsva function can be used to remove batch effects for prediction problems.

ComBat allows users to adjust for batch effects in datasets where the batch covariate is known, using methodology described in Johnson et al. 2007. It uses either parametric or non-parametric
empirical Bayes frameworks for adjusting data for batch effects. Users are returned an expression matrix that has been corrected for batch effects. The input data are assumed to be cleaned and normalized before batch effect removal.

ComBat用法

ComBat

九、使用sva处理未知的批次

SVA具有移除批次效应和高通量测序中其它不需要的变异的功能。使用sva识别和构建高维数据集代理变量(surrogate variables)，代理变量是由高维数据直接构建的协变量(covariates)，可用于后续分析，以调整未知、未建模或潜在的噪声源。
sva函数的输出本身就是代理变量。它们可以包含在全模型矩阵和空模型矩阵中，然后与数据矩阵一起传递给SVA包中的f.pvalue函数，以计算参数F检验p值，从而调整代理变量。

根据经验，当存在大量已知或未知的潜在混杂因素时，代理变量调整（sva）可能更合适。当一个或多个生物学分组已知是异质的，并且有已知的批次变量时，直接调整（ComBat）可能更合适。

sva考虑了两种类型的变量：调整变量和感兴趣的变量。例如，感兴趣变量(variables of interest)可能是癌症组与对照组；调整变量(adjustment variables)可能是病人的年龄、性别、测序时间。
建立两个模型矩阵：全模型(full model)和空模型(null model)。空模型包括所有调整变量，而不包含感兴趣的变量；全模型包括所有调整变量和感兴趣的变量。我们将试图分析感兴趣的变量与基因表达之间的关联，调整调整变量。模型矩阵可以使用model.matrix创建。sva的目标是消除所有不需要的变异来源，同时通过mod中包含的主要变量来检测对比。

注意：在我们最初的工作中，使用识别函数来测量在近似对称和连续尺度上的数据。对于通常表示为counts的测序数据，更合适的模型可能涉及使用适度的对数函数。例如，我们首先用log(counts+1)转换基因表达数据。

The goal of sva is to remove all unwanted sources of variation while protecting the contrasts due to the primary variables specified in the function call. This leads to the identification of features that are consistently different between groups, removing all common sources of latent variation.
As a rule of thumb, when there are a large number of known or unknown potential confounders, surrogate variable adjustment may be more appropriate. Alternatively, when one or more biological groups is known to be heterogeneous, and there are known batch variables, direct adjustment may be more appropriate. [6]

用法：
（1）使用sva得到代理变量

##准备数据
data(bladderdata)
pheno = pData(bladderEset)
head(pheno) #分组信息
edata = exprs(bladderEset)
head(edata) #表达量信息

##建立两个模型
pheno$cancer
mod = model.matrix(~as.factor(cancer), data=pheno)
mod #全模型包括所有调整变量和感兴趣的变量,这里只有感兴趣的变量
mod0 = model.matrix(~1,data=pheno)
mod0 #空模型只有调整变量,这里没有调整变量,所以只有一个截距intercept

##进行SVA
n.sv = num.sv(edata,mod,method="leek") #估计潜在因素的数量
n.sv
svobj = sva(edata,mod,mod0,n.sv=n.sv) #应用SVA函数来估计代理变量
str(svobj) #包含4项的列表
head(svobj$sv)

#SV是一个矩阵,其列对应于估计的代理变量,它们可以用于下游分析
#pprob.gam是每个基因与一个或多个潜在变量相关的后验概率
#pprob.b是每个基因与感兴趣的变量相关的后验概率
#n.sv是SVA估计的代理变量数

（2）使用f.pvalue函数调整代理变量（calculate parametric F-test P-values adjusted for surrogate variables）

#f.pvalue可以用来计算数据矩阵每一行(基因)的参数F检验p值
#F检验比较了模型mod和mod0

#没有调整代理变量的原始矩阵
pValues = f.pvalue(edata,mod,mod0)
qValues = p.adjust(pValues,method="BH") #adjust them for multiple testing
head(qValues)
sum(qValues < 0.05)/length(qValues) #68.2%的基因都是差异表达的,This number seems artificially high

#Now we can perform the same analysis,but adjusting for surrogate variables. 
#首先在空模型和全模型中包含代理变量
#原因是我们想要调整代理变量,所以我们把它们作为调整变量,必须包含在这两个模型中
modSv = cbind(mod,svobj$sv)
mod0Sv = cbind(mod0,svobj$sv)
pValuesSv = f.pvalue(edata,modSv,mod0Sv)
qValuesSv = p.adjust(pValuesSv,method="BH")
head(qValuesSv)
sum(qValuesSv < 0.05)/length(qValuesSv)#66.5%的基因是差异表达的

（3）sva可以与差异表达分析程序一起使用，如limma、DESeq2。

##sva+DESeq2
dds <- DESeqDataSetFromMatrix(countData = countdata,
                              colData = colData,
                              design = ~ Batch + Group)
dds <- dds[rowSums(counts(dds))>1,]
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized=TRUE)
dat <- dat[rowMeans(dat) > 1,]
head(dat)

mod <- model.matrix(~ Group, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))

n.sv = num.sv(dat,mod,method="leek")
n.sv
svseq <- svaseq(dat, mod, mod0, n.sv=2)
svseq$sv

par(mfrow=c(2,1),mar=c(3,5,3,1))
stripchart(svseq$sv[,1] ~ dds$Group,vertical=TRUE,main="SV1")
abline(h=0)
stripchart(svseq$sv[,2] ~ dds$Group,vertical=TRUE,main="SV2")
abline(h=0)

ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + Group
colData(ddssva)
ddssva <- DESeq(ddssva)

References
[1] Chen C, Grennan K, Badner J, Zhang D, Gershon E, Jin L, et al. Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods[J]. PloS One. 2011;6(2):e17238.
[2] 李飒,赵毅强.基因表达数据批次效应去除方法的研究进展[J].南京农业大学学报,2019,42(03):389-397.
[3] Chen C, Grennan K, Badner J, Zhang D, Gershon E, Jin L, et al. Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods[J]. PloS One. 2011;6(2):e17238.
[4]陈天成,侯艳,李康.基因组学数据整合中的批次效应移除算法[J].中国卫生统计,2016,33(03):527-529+533.
[5] Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods[J]. Biostatistics. 2007;8(1):118-27.
[6] Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments[J]. Bioinformatics. 2012;28(6):882-3.