GEM peak caller

GEM(Genome wide Event finding and Motif discovery)

2.GPS和GEM两种算法

GPS:

input：ChIP-Seq read data

output：binding event calling

GEM：

input：ChIP-Seq read data、基因组序列

output：de novo motif discovery、binding event calling

加入以下参数可调用GEM算法：

--genome/(--k) or (--k_min and --k_max) or (--seed)

3.Read distributions

GEM/GPS 需要输入read distribution文件，用户可把默认的read distribution（https://groups.csail.mit.edu/cgs/gem/download/Read_Distribution_default.txt）文件作为起始文件。经过一轮预测后，GEM会重新计算read distribution。

--smooth

--mrc

4.Input and output

（1）input

GEM以ChIP-Seq的比对结果和基因组序列作为input，返回可能的结合位点以及motif。比对结果可以是SAM或者BED格式。

如果要进行de novo motif discovery，就要提供基因组序列。详见我踩过的坑1

（2）output

GEM会返回结合位点及motif信息。因为read distribution re-estimation，GEM会返回多轮的event prediction和read distribution文件。

output files

软件默认的read distribution文件可能会因为噪点太大而导致发现的结合位点太少。继续运行GEM可能会使结果更糟糕。因此，GEM会保存每一轮的结果文件，这样可以通过检查spatial distributions去选择最好的那一轮的结果。GEM会将各轮结果输出到一张图上（xxx_All_Read_Distributions.png),方便比较。理想结果是靠后的曲线应该更平滑且类似于round 0的结果。如果是这样，那么用户可以使用这些轮的结果。如果不是这种情况，那么我们建议使用round 0的预测结果。曲线不平缓可能是因为结合位点太少导致的。

（1）GEM event text file（GEM_event.txt）

包括以下内容：

* Location: 结合位点的基因坐标

* IP binding strength：结合到该位点的IP reads数目

* Control binding strength: 相应区域结合的control reads数目

Fold: fold enrichment(IP/control)

Expected binding strength:the number of IP read counts expected in the binding region given its local context(defined by parameter W2 or W3),this is used as the Lambda parameter for the Poission test.通过泊松分布检验的最低IP reads数目

Q_-lg10：-log10(q-value),取二项检验和泊松检验中较大的p_value，进行多次检验校正后得到的q-value。

P_lg10：-log10(p-value),（有control存在时）二项检验假设下IP reads和ctrl reads算得的p-value

P_poiss:-log10(p-value), (不考虑control data)，泊松检验假设下IP reads和 expected reads算得的p-value

IPvsEMP:Shape deviation, the KL divergence of IP reads from the empirical read distribution (log10(KL)),this is used to filter predicted events given the --sd cutoff(default=-0.4)

Noise: the fraction of the event read count estimated to be noise.(划成noise的标准是什么)

KmerGroup: 该结合位点的k-mers，只显示最显著的一个k-mer，n/n values are the total number of sequence hits of the k-mer group in the positive and negative training sequences(by default total 5000 of each),respectively

KG_hgp:log10(hypergeometric p-value) log10(hypergeometric p-value), the significance of enrichment of this k-mer group in the positive vs negative training sequences (by default total 5000 of each), it is the hypergeometric p-value computed using the pos/neg hit counts and total counts

Strand: the sequence strand that contains the k-mer group match, the orientation of the motif is determined during the GEM motif discovery, '*' represents that no k-mer is found to associated with this event

(2)K-mer set memory motifs(KSM.txt)

Header line1，dataset name

Header line2，KSM version number

Header line3，eg."#5000/5000",number of positive/negative sequences that were used for learing the motif

Header line4，eg."#3.01", the KSM motif score cutoff, optimized to give best motif enrichment in the training sequences.

我的文件从Headerline3开始

KSM文件各列含义：

k-mer/r.c. k-mer sequence及其reverse complement

cluster:cluster ID of k-mer set(Cluster 0 是primary motif也就是最显著的motif)

Offset:该k-mer相对于seed-kmer的offset

PosCt: 包含该k-mer的postive sequence的数量

wPosCt：weighted PosCt, 用relative sequence weighting来计算。在GEM中，是log（read count）

NegCt:包含该k-mer的negative sequences数目

HgpLg10:HyperGeometric P-value（log10）

PosHits:The IDs of positive sequences that contain this k-mer in bits (base85 encoding)

NegHits:The IDs of negative sequences that contain this k-mer in bits (base85 encoding)

3.PFM file of PWM motifs(PFM.txt)

4.HTML file summarizing the GEM event and motif results

网页解析

（1）motifs in PWM format

每一个motif都对应着一个k-mer set motif（KSM）

Optimal PWM score/Maximum PWM :the score that gives most significant p-value(hgp) when scanning the positive and negative sequences.

Hit:Number of positive sequences containing a match to this PWM/Number of negative sequences containing a match to this PWM

hpg: Hypergeometric p-value of this PWM computed from the pos/neg hit counts given the number of total positive/negative sequences.

the motif spatial distribution plots show the relative positions of each motif PWM match relative to the primary motif PWM match if they are both present win the same positive sequences.

the primary PWM (anchoring motif) is at position 0. The relative position of PWM of interest is plotted.

The color is blue when the two motifs are in the same orientation(as display in the motif logos).The color is red when the motifs are in the opposite orientation.

The number pair represents the position and the counts of secondary motif at that position.For example, in the second plot,(-7,150) means when the primary motif(Sox2) is at position 0, then there are 150 instances of secondary motif(Oct4,reverse compliment orientation) at the -7 position.

Total number of coocurrence of both motif instances.

Thus the second plot says, there are 722 cases of cooccurrence of Sox2 and Oct4 motifs, and 150 of them are detected to have a motif offset of -7.This is consistent with the fact that Sox2 and Oct4 usually bind as heterodimer.

Note: For the motif spatial distribution plot,GEM reports all the motif instances. There could be multiple motif instance on both strands. And the PWM hit is counting the sequences(at most one hit per sequence).Thus the total count in the motif spatial distribution plot may be larger than PWM hit.

motif PWM and Motif spatial distribution

5.GEM event text files(significant,insignificant and filtered)

insignificant events: those do not pass the statistical test

filtered events: those would pass the statistical test using the read count, but have a low IP/Ctrl ratio, or the distribution of reads are quite different from the empirical distribution.

三、Command-line options

!!!All the parameters in GEM are specified in double dashes(--)

1. required parameters

--d [path] : the path to the read distribution model file 用软件提供的read distribution文件

--exptX [path]: the path to the aligned reads file for experiment(IP).X is condition name.In multi-condition alignment mode,X is used to specify different conditions.

2.Optional parameters

--ctrlX [path]: the path to the aligned reads file for control.X should match the condition name in --exptX

--g [path] : the path to a genome information file(genome.chrom.size.file)

两个点：a. tab-delimited b.染色体名需与aligned reads file 以及基因组序列中的保持一致

--f[BED|SAM|BOWTIE|ELAND|NOVO]:SAM options allows SAM or BAM file input(default = BED)

--s [n]:The size of uniquely mappable genome.It depends on the genome and the read length. A good estimate is genome size*0.8.If it is not supplied, it will be estimated using the genome information

--genome [path_to_folder]: the path to genome sequence folder, which contains Fasta files of individual chromosomes.

--k [n]: the width of the k-mers

--k_min [n] --k_max [n]: minimum and maximum value of k

--seed [k-mer]:the seed k-mer to jump start k-mer set discovery. Exact k-mer sequence only.The width of the seed k-mer will be used to set k.

--k_seqs [n]:the number of top ranking events to get sequences for motif discovery(default=5000)

--pp_nmotifs [n]:the max number of top ranking motifs to set the motif-based positional prior(default=1)

--k_win [n]:the sequence window size around the binding event for motif discovery(default=61bp)

--strand_type [n]:Double-strand or single-strand binding event calling and motif discovery.0 for double-strand,1 for single-strand(default =0)

--nd [n]:noise distribution model,0 for no noise model,1 for uniform noise model,2 for smoothed control data specified by --ctrlX(default=1)

--min [n]: minimum number of events called by GPS/GEM to learn a read distribution and continue the next round(default=500).Note that with less number of binding events, the read distribution might be more noisy.You can try increase --smooth parameter.

--fold [value]:Fold(IP/Control) cutoff to filter predicted events(default=3)

--icr [value]:IP/Control Ratio.By default, this ratio is estimated from the data using non-specific binding regions.It is important to set this value explicitly for synthetic dataset.

--out [name]:Output folder and file name prefix

--q [value]:significance level for q-value,specified as -log10(q-value).For example, to enforce a q-value threshold of 0.001,set this value to 3(default =2.i.e.q-value=0.01)

--a [value]:minimum alpha value for sparse prior(default is estimated by mean whole genome read coverage)

--af [value]:a constant to scale alpha value with read count(default=3).A smaller af value will give a larger alpha value.

--sd [value]:Shape deviation cutoff to filter predicted events(default = -0.4)

--smooth [n]:the width(bp) to smooth the read distribution.It it is set to -1,there will be no smoothing(default=30)

--subs [region strings]: subset of genome regions to be analyzed.The string can be in the format of "chr:start-end" or "chr", or both.For example,"1:003-1004 2 X".

--ex [region file]:file that contains subset of genome regions to be excluded.Each line contains a region in "chr:start-end" format

--t [n]:Number of threads to run GEM in parallel.It is suggested to be equal to or less than the physical CPU number on the computer

--top [n]: Number of ranked GEM events to be used for re-estimating the read distribution(default=2000).note that GEM only re-estimate when there are more than 500 significant events called.

--w2 [n]:Size of sliding window to estimate lambda parameter for Possion distribution when there is no ctrl data(default=5000, must be larger than 1000)

--w3 [n]:size of sliding window to estimate lambda parameter for Possion distribution when there is no ctrl data(default=10000,must be larger than w2)

这俩参数啥区别啊

3.Optional flags：

--poisson_control :compute Possion p-value of binding event using control data (default is to use ChIP data)

--k_neg_dinu_shuffle: Use di-nucleotide shuffled sequences as negative sequences for motif finding (default is to use flanking sequences)

--bp:use Branch-seq data specific settings

--pp_pwm: Use PWM motif to set the motif-based positional prior (default is to use the KSM motif model)

--bf: Depreciated .Base filtering is done by Poisson filtering by default.

--fa: GEM will use a fixed user-specified alpha value for all the regions

--help: Print help information and exit

--multi:Depreciated. To run GEM in multi-condition mode, you only need to specify data for conditions X and Y using --exptX and --exptY, etc.

--nf: Do not filter predicted events (default is to perform event filtering by shape and fold)

--nrf: Do not filter duplicate reads (i.e. potential PCR duplicates) (default is to apply filtering considering the read counts at its neighboring bases)

--relax: Use a more relaxed cutoff to include more weak binding calls. This is designed to provide more weak sites for analysis such as IDR analysis (default is more stringent cutoff)

--outNP: Output binding events in ENCODE NarrowPeak format (default is NO narrowPeak file output)

--outBED: Output binding events in BED format for UCSC Genome Browser (default is NO BED file output)

--outJASPAR: Output motif PFM in JASPAR format

--outMEME: Output motif PFM in MEME format

--outHOMER: Output motif PFM in HOMER format

--sl: Sort GEM output by location (default is sorted by P-value)

--print_bound_seqs: Output the "bound sequences" column in the GEM_event.txt file

我踩过的坑：

1. --genome and --g

--genome 指定基因组文件们所在的目录。所有的基因组文件必须以 chr*.fasta（.fa）的方式来命名。并且染色体名字要与genome.chrom.size中的名字相匹配。！！同时要跟bam文件中的染色体名字相匹配。

2.--exptX --ctrlX

X 一定要一致，否则输出结果中ctrl一栏是NA

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 160,108评论 4赞 364
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 67,699评论 1赞 296
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 109,812评论 0赞 244
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 44,236评论 0赞 213
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 52,583评论 3赞 288
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 40,739评论 1赞 222
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 31,957评论 2赞 315
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 30,704评论 0赞 204
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 34,447评论 1赞 246
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 30,643评论 2赞 249
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 32,133评论 1赞 261
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 28,486评论 3赞 256
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 33,151评论 3赞 238
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 26,108评论 0赞 8
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 26,889评论 0赞 197
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 35,782评论 2赞 277
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 35,681评论 2赞 272

GEM peak caller

推荐阅读更多精彩内容