Pathogenicity analysis of nsSNPs (Part 2): existing tools and how they work

SIFT

PolyPhen-2

CADD

DANN

MetaSVM

dbNSFP database: integrated results from multiple nsSNP prediction tools

1. SIFT

Algorithm description:

For a given protein sequence, SIFT compiles a dataset of functionally related protein sequences by searching a protein database using the PSI-BLAST algorithm. It then builds an alignment from the homologous sequences with the query sequence.

In the second step of the algorithm, SIFT scans each position in the alignment and calculates the probabilities for all possible 20 amino acids at that position. These probabilities are normalized by the probability of the most frequent amino acid and are recorded in a scaled probability matrix.

SIFT predicts a substitution to affect protein function if the scaled probability, also termed the SIFT score, lies below a certain threshold value. Generally, a highly conserved position is intolerant to most substitutions, whereas a poorly conserved position can tolerate most substitutions.
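The scaling and thresholding steps described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it uses raw residue counts from a single alignment column (no sequence weighting and no pseudocounts, both of which the real SIFT applies), the function names are made up for this example, and the 0.05 cutoff is the one listed for SIFT in the dbNSFP/ANNOVAR table later in this post.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def scaled_probabilities(column):
    """Return {amino acid: probability / probability of the most frequent one}
    for a single alignment column given as a string of residues."""
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column contains no standard amino acids")
    probs = {aa: counts.get(aa, 0) / total for aa in AMINO_ACIDS}
    top = max(probs.values())
    return {aa: p / top for aa, p in probs.items()}

def sift_like_call(column, mutant, threshold=0.05):
    """Call a substitution from its scaled probability (assumed cutoff 0.05)."""
    score = scaled_probabilities(column)[mutant]
    label = "affects function" if score <= threshold else "tolerated"
    return score, label

# Hypothetical column: mostly leucine, with one isoleucine and one valine.
column = "LLLLLLLLIV"
for mutant in ("I", "W"):
    print(mutant, sift_like_call(column, mutant))
```

In this toy column, isoleucine is observed and stays above the cutoff, while tryptophan is never observed and gets a scaled probability of 0, which is exactly the situation that pseudocounts are meant to soften.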

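Formula for the conservation of a given position:

How P_ca is computed, giving a pseudocount-corrected PSSM:

N_c: the number of homologous sequences actually observed
g_ca: the observed frequency of amino acid a at position c of the query sequence
B_c: the pseudocount, i.e. the number of homologous sequences that were not observed
f_ca: the pseudocount frequency of amino acid a at position c of the query sequence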
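The formula images did not survive in this copy of the post. Based on the four symbols defined above and on my recollection of reference (2) (Ng & Henikoff, Genome Res. 2001), the two formulas most likely have the following form; treat this as a reconstruction to be checked against the paper rather than a verbatim quote.

```latex
% Conservation (information content) at position c, summed over the 20 amino acids a
R_c = \log_2 20 + \sum_{a} P_{ca} \log_2 P_{ca}

% Pseudocount-corrected probability of amino acid a at position c
P_{ca} = \frac{N_c}{N_c + B_c}\, g_{ca} + \frac{B_c}{N_c + B_c}\, f_{ca}
```

Read this way, P_ca is a weighted average: the observed frequency g_ca is weighted by the number of sequences seen, and the pseudocount frequency f_ca by the number of sequences assumed to be missing.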

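How are the pseudocount terms B_c and f_ca determined?

For positions with a diverse amino acid composition, SIFT tends to use a larger pseudocount, because the more diverse a position is, the more homologous sequences are likely to have been missed.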
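To see what a larger pseudocount does in practice, here is a small sketch. It assumes the usual weighted-average form P_ca = (N_c * g_ca + B_c * f_ca) / (N_c + B_c); the function name and all numbers are hypothetical, and B_c is passed in directly because the exact rule SIFT uses to choose it is not given in this post.

```python
def corrected_probability(g_ca, f_ca, n_c, b_c):
    """Mix the observed frequency g_ca with the pseudocount frequency f_ca,
    weighting them by n_c observed and b_c unobserved sequences."""
    if n_c + b_c <= 0:
        raise ValueError("n_c + b_c must be positive")
    return (n_c * g_ca + b_c * f_ca) / (n_c + b_c)

# An amino acid never observed at this position (g_ca = 0) among 20 sequences,
# with an assumed pseudocount frequency of 0.05.
for b_c in (1, 5, 20):
    print(b_c, corrected_probability(g_ca=0.0, f_ca=0.05, n_c=20, b_c=b_c))
```

As B_c grows, the corrected probability moves away from the observed value of 0 and toward f_ca, so an unobserved amino acid is penalized less at positions where many homologs are assumed to be missing.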

Form of the resulting PSSM:

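2. PolyPhen-2

PolyPhen combines sequence and structural information. Its main assumption is that some amino acid changes affect protein folding, interaction interfaces, or stability, and that a protein whose structure changes is more likely to change in function as well. It therefore integrates both sequence features and 3D structural features.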

Sequence-based features

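(1) Using existing protein annotation databases (such as UniProtKB/Swiss-Prot), it determines whether a substitution falls within an annotated special site or region.

Special sites include:

DISULFID, CROSSLNK bond or BINDING, ACT_SITE, LIPID, METAL, SITE, MOD_RES, CARBOHYD, NON_STD site

Special regions include:

TRANSMEM, INTRAMEM, COMPBIAS, REPEAT, COILED, SIGNAL, PROPEP

(2) It also computes the difference between the PSIC scores before and after the substitution.

PSIC scores are computed much like a PSSM: BLAST is used to search the UniRef100 database for sequences highly homologous to the query sequence, these sequences are aligned, and a profile matrix is derived from the multiple sequence alignment. Each row of the matrix corresponds to one amino acid position and each column to one amino acid type, like this:

Each element of this matrix (the profile score) is computed as:

Here i is the row index and j is the column index; PSIC_i,j is the PSIC score in row i, column j; P(aa=A_j | posi=i) is the probability that amino acid A_j occurs at position i of the query sequence; and P(aa=A_j) is the probability that amino acid A_j occurs at any position.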
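The formula image is missing from this copy. From the symbol definitions given here, and from the PolyPhen-2 description of a profile score as a log-likelihood ratio, the element is most plausibly the following; this is a reconstruction, not a quote.

```latex
% Profile score: log ratio of the site-specific probability of amino acid A_j
% at position i to its background probability at any position
\mathrm{PSIC}_{i,j} = \ln \frac{P(aa = A_j \mid posi = i)}{P(aa = A_j)}
```

A positive score means A_j is seen at position i more often than expected from its background frequency, and a negative score means it is seen less often.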

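If a nonsynonymous substitution from A_m to A_n occurs at amino acid position i of the query sequence, then

If ΔPSIC is a large positive number, the substitution has a low probability of occurring at this position, so it is likely to be a deleterious mutation.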
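A worked example may help. The sketch below assumes that ΔPSIC is the wild-type score minus the mutant score, ΔPSIC = PSIC_i,m - PSIC_i,n, which matches the statement that a large positive value points to a deleterious change. Each score is taken as the natural log of the site probability over the background probability. The probabilities are made up, and real PSIC additionally weights sequences by similarity, which is ignored here.

```python
import math

def psic(p_at_site, p_background):
    """Log ratio of site-specific to background probability."""
    return math.log(p_at_site / p_background)

def delta_psic(site_probs, background, wild_type, mutant):
    """Assumed definition: PSIC of the wild-type residue minus PSIC of the mutant."""
    return (psic(site_probs[wild_type], background[wild_type])
            - psic(site_probs[mutant], background[mutant]))

# Hypothetical position: leucine is strongly preferred, proline is almost never seen.
site_probs = {"L": 0.80, "P": 0.01}
background = {"L": 0.10, "P": 0.05}

print(delta_psic(site_probs, background, wild_type="L", mutant="P"))
```

Here leucine is enriched at the site and proline is depleted, so replacing L with P gives a clearly positive ΔPSIC, while the reverse substitution would give the same magnitude with a negative sign.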

Structural features

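First, find the 3D structure of the protein. If no structure is available, but one exists for another protein with a similar sequence, homology modeling can be used to predict the 3D structure.

Structural parameters for the site are then computed from this 3D structure. PolyPhen-2 uses the DSSP database to obtain the following structural parameters: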

Secondary structure (according to the DSSP nomenclature)

Solvent accessible surface area (absolute value in Å²)

Phi-psi dihedral angles

The prediction algorithm is a naive Bayes classifier.

There are two training sets:

HumDiv

compiled from all damaging alleles with known effects on the molecular function causing human Mendelian diseases, present in the UniProtKB database, together with differences between human proteins and their closely related mammalian homologs, assumed to be non-damaging

HumVar

consisted of all human disease-causing mutations from UniProtKB, together with common human nsSNPs (MAF>1%) without annotated involvement in disease, which were treated as non-damaging.

Training on these two datasets yields two different prediction models, each suited to a different kind of nsSNP:

HVAR: should be used for diagnostics of Mendelian diseases, which requires distinguishing mutations with drastic effects from all the remaining human variation, including abundant mildly deleterious alleles. The authors recommend calling "probably damaging" if the score is between 0.909 and 1, "possibly damaging" if the score is between 0.447 and 0.908, and "benign" if the score is between 0 and 0.446.

HDIV: should be used when evaluating rare alleles at loci potentially involved in complex phenotypes, dense mapping of regions identified by genome-wide association studies, and analysis of natural selection from sequence data. The authors recommend calling "probably damaging" if the score is between 0.957 and 1, "possibly damaging" if the score is between 0.453 and 0.956, and "benign" if the score is between 0 and 0.452.

For ordinary variant interpretation, the HVAR score is usually the one to look at.
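The recommended cutoffs are easy to apply programmatically. The helper below is a minimal sketch that only encodes the thresholds quoted above; the function and dictionary names are made up for this example and are not part of PolyPhen-2 itself.

```python
# (probably damaging cutoff, possibly damaging cutoff) for each model
PP2_THRESHOLDS = {
    "HDIV": (0.957, 0.453),
    "HVAR": (0.909, 0.447),
}

def classify_pp2(score, model="HVAR"):
    """Map a PolyPhen-2 score in [0, 1] to the recommended category."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("PolyPhen-2 scores are expected to lie in [0, 1]")
    probably, possibly = PP2_THRESHOLDS[model]
    if score >= probably:
        return "probably damaging"
    if score >= possibly:
        return "possibly damaging"
    return "benign"

# The same score can fall into different categories under the two models.
print(classify_pp2(0.93, "HVAR"))
print(classify_pp2(0.93, "HDIV"))
```

Because the two models use different cutoffs, it is worth recording which model a category came from whenever the labels are stored without the underlying score.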

3. CADD

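CADD: Combined Annotation Dependent Depletion

CADD was developed because, before it, most assessments of whether an SNV is deleterious or tolerated relied on a single type of evidence, whereas CADD integrates many different features.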

While many variant annotation and scoring tools are around, most annotations tend to exploit a single information type (e.g. conservation) and/or are restricted in scope (e.g. to missense changes). Thus, a broadly applicable metric that objectively weights and integrates diverse information is needed. Combined Annotation Dependent Depletion (CADD) is a framework that integrates multiple annotations into one metric by contrasting variants that survived natural selection with simulated mutations.

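CADD introduced its own scoring scheme to measure how deleterious a variant is. For a set of variants, CADD builds a model that combines multiple factors, such as allelic polymorphism and variant pathogenicity, evaluates each variant, and outputs a score known as the C-score. The score produced directly by the statistical model is called the RawScore; the higher the value, the more likely the variant is deleterious.

For different sets of variants, for example the 1000G and ESP call sets, the models differ because the underlying factors differ, so RawScores cannot be compared directly across models. This is why scaled C-scores were introduced: the RawScores are sorted in descending order, and the scaled C-score is computed as -10*log10(rank/total). Because this formula resembles the definition of the Phred quality score, scaled C-scores are also called PHRED scores.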
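The scaling formula can be tried out directly. The sketch below ranks a list of raw scores from highest to lowest and applies -10*log10(rank/total). The function names are made up, ties are not handled specially, and ranking is done only within the list passed in, which is a simplification of how the published PHRED values are derived.

```python
import math

def scaled_c_score(rank, total):
    """Phred-like scaling of a 1-based rank (rank 1 = highest raw score)."""
    if not 1 <= rank <= total:
        raise ValueError("rank must satisfy 1 <= rank <= total")
    return -10 * math.log10(rank / total)

def scale_raw_scores(raw_scores):
    """Return one scaled score per raw score, keeping the input order."""
    order = sorted(range(len(raw_scores)), key=lambda i: raw_scores[i], reverse=True)
    scaled = [0.0] * len(raw_scores)
    for rank, idx in enumerate(order, start=1):
        scaled[idx] = scaled_c_score(rank, len(raw_scores))
    return scaled

# A variant in the top 10%, 1% and 0.1% of raw scores.
for rank in (100_000, 10_000, 1_000):
    print(rank, scaled_c_score(rank, 1_000_000))
```

This makes the common cutoffs easy to interpret: a scaled score of 10 corresponds to the top 10% of raw scores, 20 to the top 1%, and 30 to the top 0.1%.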

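When screening for potentially pathogenic variants, it is common to filter on the PHRED score. The officially suggested cutoffs of 10, 15, or 20 are all acceptable, but it is better to assess pathogenicity by combining C-scores with other experimental evidence than to rely on a numeric filter alone.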

4. DANN

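DANN uses a neural network to assess how deleterious a variant is.

DANN can be seen as an improved version of CADD: it changes the prediction algorithm and performs somewhat better than CADD.

The core of CADD is a support vector machine (SVM), a widely used machine learning algorithm. It performs well on features with linear relationships but comparatively poorly on features with nonlinear relationships. DANN uses a neural network, which captures nonlinear relationships among features more easily, so it performs somewhat better than CADD.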

Bioinformatics. 2015 Mar 1; 31(5): 761–763.

In both figures, the AUC of DANN is higher than that of the SVM, which indicates that DANN does outperform CADD.
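For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen deleterious variant receives a higher score than a randomly chosen benign one. The sketch below computes it by comparing all positive/negative pairs; the function name, labels, and scores are made up for illustration and are not data from the DANN paper.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    tied pairs count as half."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels (1 = deleterious, 0 = benign) and scores from two hypothetical predictors.
labels = [1, 1, 0, 0]
print(auc(labels, [0.9, 0.8, 0.3, 0.1]))
print(auc(labels, [0.9, 0.3, 0.8, 0.1]))
```

The first predictor ranks every deleterious variant above every benign one, while the second misorders one of the four pairs, so its AUC is lower.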

5. MetaSVM

It works in three steps:

(1) Perform imputation for whole-exome variants and fill out missing scores for SIFT, PolyPhen, MutationAssessor and so on.

(2) Normalize all scores to the 0-1 range.

(3) Use a radial SVM model to train a prediction model using all available scores and some population genetics parameters, and then apply the model on whole-exome variants.

Put simply, it takes the prediction scores of tools such as SIFT, PolyPhen, and MutationAssessor and trains an SVM model on them to make the final prediction.
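Steps (1) and (2) can be sketched with plain Python. Note the assumptions: missing scores are filled with the column mean, which is a stand-in and not necessarily the imputation method MetaSVM actually uses; the function names and the score matrix are hypothetical; and step (3), training the radial-kernel SVM, is left out.

```python
def impute_missing(rows):
    """Replace None in each column with the mean of that column's observed values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

def min_max_normalize(rows):
    """Rescale every column to the 0-1 range (constant columns map to 0.0)."""
    n_cols = len(rows[0])
    lows = [min(row[j] for row in rows) for j in range(n_cols)]
    highs = [max(row[j] for row in rows) for j in range(n_cols)]
    return [
        [(v - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
         for j, v in enumerate(row)]
        for row in rows
    ]

# Hypothetical scores per variant: (SIFT, PolyPhen, MutationAssessor); None = missing.
scores = [
    [0.0, 1.0, 3.5],
    [0.4, None, 1.0],
    [1.0, 0.2, None],
]
features = min_max_normalize(impute_missing(scores))
for row in features:
    print(row)
```

The resulting rows are what would be passed, together with labels, to an RBF-kernel SVM (for example scikit-learn's `SVC(kernel="rbf")`) in step (3). Since the component tools do not all point in the same direction (a low SIFT score is damaging, a high PolyPhen score is damaging), the classifier has to learn those directions from the training labels.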

6. dbNSFP database: integrated results from multiple nsSNP prediction tools

URL: https://sites.google.com/site/jpopgen/dbNSFP

It integrates scores from 20 functional prediction algorithms for nsSNPs and 6 conservation measures.

Functional prediction:

SIFT, Polyphen2-HDIV, Polyphen2-HVAR, LRT, MutationTaster2, MutationAssessor, FATHMM, MetaSVM, MetaLR, CADD, VEST3, PROVEAN, FATHMM-MKL coding, fitCons, DANN, GenoCanyon, Eigen coding, Eigen-PC, M-CAP, REVEL, MutPred

Conservation measures:

PhyloP x 2, phastCons x 2, GERP++ and SiPhy

Score (dbtype)	# variants in LJB23 build hg19	Categorical Prediction
SIFT (sift)	77593284	D: Deleterious (sift<=0.05); T: Tolerated (sift>0.05)
PolyPhen 2 HDIV (pp2_hdiv)	72533732	D: Probably damaging (pp2_hdiv>=0.957); P: Possibly damaging (0.453<=pp2_hdiv<=0.956); B: Benign (pp2_hdiv<=0.452)
PolyPhen 2 HVAR (pp2_hvar)	72533732	D: Probably damaging (pp2_hvar>=0.909); P: Possibly damaging (0.447<=pp2_hvar<=0.908); B: Benign (pp2_hvar<=0.446)
LRT (lrt)	68069321	D: Deleterious; N: Neutral; U: Unknown
MutationTaster (mt)	88473874	"A" ("disease_causing_automatic"); "D" ("disease_causing"); "N" ("polymorphism"); "P" ("polymorphism_automatic")
MutationAssessor (ma)	74631375	H: high; M: medium; L: low; N: neutral. H/M means functional and L/N means non-functional
FATHMM (fathmm)	70274896	D: Deleterious; T: Tolerated
MetaSVM (metasvm)	82098217	D: Deleterious; T: Tolerated
MetaLR (metalr)	82098217	D: Deleterious; T: Tolerated
GERP++ (gerp++)	89076718	higher scores are more deleterious
PhyloP (phylop)	89553090	higher scores are more deleterious
SiPhy (siphy)	88269630	higher scores are more deleterious

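The dbNSFP data have been integrated into ANNOVAR; the latest version at the time of writing is dbnsfp33a.

# Download the annotation database
$ annotate_variation.pl -downdb -webfrom annovar -buildver hg19 dbnsfp33a humandb/

# Get all dbNSFP annotations at once
$ table_annovar.pl ex1.avinput humandb/ -protocol dbnsfp33a -operation f -buildver hg19 -nastring .

# Get a single dbNSFP annotation. The file for that specific annotation must first be downloaded from the official ANNOVAR server; SIFT is used as the example here
$ annotate_variation.pl -filter -dbtype ljb23_sift -buildver hg19 -out ex1 example/ex1.avinput humandb/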

References:

(1) Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm, Nature Protocols 4, 1073-1081 (2009)

(2) Predicting Deleterious Amino Acid Substitutions, Genome Res. 2001 May; 11(5): 863-874.

(3) PolyPhen-2 official website

(4) CADD official website

(5) [Jianshu] Introduction to the CADD database

(6) [Jianshu] DANN: assessing the deleteriousness of variants with a neural network

(7) Quang D, Chen Y, Xie X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics. 2015;31(5):761-3.

(8) dbNSFP official website

(9) ANNOVAR documentation

Last edited: 2019.07.03 23:08:59
