Gfold说明书(备份)

本文资料来源

https://web.archive.org/web/20161125133249/http://compbio.tongji.edu.cn:80/~fengjx/GFOLD/gfold.html

也可以参考软件安装目录doc文件夹里的gfold.html

原网页链接(已挂)

http://compbio.tongji.edu.cn/~fengjx/GFOLD/gfold.html

repo地址

https://bitbucket.org/feeldead/gfold/overview

NAME

gfold - Generalized fold change for ranking differentially expressed genes from RNA-seq data.

GFOLD is especially useful when no replicate is available. GFOLD generalizes the fold change by considering the posterior distribution of log fold change, such that each gene is assigned a reliable fold change. It overcomes the shortcoming of p-value that measures the significance of whether a gene is differentially expressed under different conditions instead of measuring relative expression changes, which are more interesting in many studies. It also overcomes the shortcoming of fold change that suffers from the fact that the fold change of genes with low read count are not so reliable as that of genes with high read count, even these two genes show the same fold change.


CITATION

Feng J, Meyer CA, Wang Q, Liu JS, Liu XS, Zhang Y. GFOLD: a generalized fold change for ranking differentially expressed genes from RNA-seq data. Bioinformatics 2012


SYNOPSIS

  • gfold
  • JOBS
  • OPTIONS

EXAMPLES

Example 1: Count reads and rank genes

In the following example, hg19Ref.gtf is the ucsc knownGene table for hg19; sample1.sam and sample2.sam are the mapped reads in SAM format.

gfold count -ann hg19Ref.gtf -tag sample1.sam -o sample1.read_cnt
gfold count -ann hg19Ref.gtf -tag sample2.sam -o sample2.read_cnt
gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff

example 2: Count reads

This example utilizes samtools to produce mapped reads in SAM format from BAM format.

samtools view sample1.bam | gfold count -ann hg19Ref.gtf -tag stdin -o sample1.read_cnt

Example 3: Identify differentially expressed genes without replicates

Suppose there are two samples: sample1 and sample2 with corresponding read count file being sample1.read_cnt sample2.read_cnt. This example finds differentially expressed genes using default parameters on two samples

gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff

Example 4: Identify differentially expressed genes with replicates

This example finds differentially expressed genes using default parameters on two group of samples.

gfold diff -s1 sample1,sample2,sample3 -s2 sample4,sample5,sample6 -suf .read_cnt -o 123VS456.diff

Example 5: Identify differentially expressed genes with replicates only in one condition

This example finds differentially expressed genes using default parameters on two group of samples. Only the first group contains replicates. In this case, the variance estimated based on the first group will be used as the variance of the second group.

gfold diff -s1 sample1,sample2 -s2 sample3 -suf .read_cnt -o 12VS3.diff**

JOBS

  • -h

    Print help information

  • count

    Given the gene annotation in GTF/GPF/BED format and mapped short reads in SAM/BED format, count the number of reads mapped to each gene. Because of possible overlapping of multiple genes, a read could be mapped to the overlaped region of multiple genes. In this case, a read is counted multiple times with each time for each gene. Furthermore, if a gene is on multiple chromosomes or different strands of the same chromosome, only exons on one strand of one chromosome (the one appear first in the annotation file) will be assigned to this gene. Exons not on this strand of the chromosome will be discarded.

  • diff

    For each gene, calculate GFOLD value and other statistics. diff accepts the output of count as the input. Please refer to the output format of count for more information about the input format. If you are not satisfied with the strategy adopted by count, you can generate gene read counts by yourself. In the input for diff 'GeneSymbol' and 'Read Count' are critical. If you fake 'Gene exon length', the RPKM calculated will not be valid. You can fake other columns to make gfold run. Further more the order of gene names should be the same for different samples. The third column of the output of count (gene length) only influences the RPKM in the output of diff. If it is missing, RPKM will not be generated by diff. diff does not use the forth column of the output of count.


OPTIONS

  • -ann <file>

    Gene annotation file in GTF/GPF/BED format. Note that the knownGene table downloaded from UCSC is in GPF format. For job countonly.

  • -annf <GTF/GPF/BED>

    The format of gene annotation file. Default GTF (Gene Transfer Format). For job count only. For GTF format, in the last column, 'gene_id' and 'transcript_id' are required. 'gene_name' and other IDs are optional. For BED format, please provide format which contains 6 columns with the ID at 4'th column. For GPF (Gene Prediction Format), in short, from UCSC Table Browser, the 'knownGene' table with all fields is in GPF and the 'refGene' table without the first column is in GPF format. Note that for either 'knownGene' or 'refGene' table, the downloaded file would contain a header which should be removed before calling GFOLD. More specifically, a file in GPF format contains 12 columns separated by TABs (adapted from UCSC):nameName of genechromReference sequence chromosome or scaffoldstrand+ or - for strandtxStartTranscription start positiontxEndTranscription end positioncdsStartCoding region startcdsEndCoding region endexonCountNumber of exonsexonStartsExon start positions separated by commasexonEndsExon end positions separated by commasproteinIDUniProt display ID for Known Genes, UniProt accession or RefSeq protein ID for UCSC GenesalignIDUnique identifier for each (known gene, alignment position) pair

  • -tag <file>

    Short reads in SAM format. 'stdin' stands for standard input stream. For job count only.

  • -tagf <SAM/BED>

    The format of short reads. Default SAM. For job count only.

  • -s <T/F>

    Whether is the sequencing data strand specific? T stands for strand specific. Default F. If you are not clear about this, using default parameter should be OK even for the strand specific case. For job count only.

  • -acc <T/F>

    When no replicate is available, whether to use accurate method to calculate GFOLD value. T stands for accurate which depends on sequencing depth and slower, F stands for MCMC. Default T. For job diff only.

  • -o <file>

    The file for output for all jobs. For job diff, two output files will be generated. The first is <file> and the second is <file>.ext

  • -s1 <file>

    The prefix for gene read count of the 1st group output by count. Multiple prefixes are separated by commas. For job diff only. If you have gene read count generated by other ways instead of job count, make sure that the format are the same for all files. Each file contains two columns corresponding to gene names and read counts separated by a TAB. All files are sorted by gene names and have the same number of lines.

  • -s2 <file>

    The prefix for gene read count of the 2st group output by count. Multiple prefixes are separated by commas. For job diff only.

  • -d <file>

    Gene description file. The first two columns of the file will be used. The first column should be gene descriptions; The second column should be gene IDs, e.g. the first column of the output of Count. The gene description file can be downloaded from: http://uswest.ensembl.org/biomart/martview/3e5a3ce70eea4f8d0061a0eb03ef11ad . Please select the correct ID type during downloading such that the IDs match the gene IDs appearing in -ann

  • -suf <string>

    The suffix for gene read count file specified by -s1 and -s2. For job diff only.

  • -sc <num>

    The significant cutoff for fold change. Default 0.01. For job diff only.

  • -bi <num>

    For MCMC, the iterations for burn-in phase. Default 1000. For job diff only.

  • -si <num>

    For MCMC, the iterations for sampling phase. Default 1000. For job diff only.

  • -r <num>

    The maximum number of selected pairs for calculating empirical FDR. Default 20. For job diff only.

  • -v <num>

    Verbos level. A larger value gives more information of the running process. Default 2.

  • -norm <Count/DESeq/NO>

    The way to do normalization. 'Count' stands for normalization by total number of mapped reads. 'DESeq' stands for the normalization proposed by DESeq. 'NO' stands for no normalization. You can also specifiy a list of normalization constant separated by commas. E.g. 1.2,2.1,1.0,2.0. Note that the number of constants should be the same as the total number of samples (group1 and group2) and the order should be for -s1 followed by for -s2. GFOLD using normalization constants not by directly multiplication (scaling up) nor division (scaling down). The normalization constants will be built into the model. In the model, division or multiplication has no difference. Default 'DESeq'.


OUTPUT FORMAT

All fields in a output file are separated by TABs.

  • For JOB count:

    The output file contains 5 columns:

    1. GeneSymbol:For BED file, this is the 4'th column. For GPF file, this is the first column. For GTF format, this corresponds to 'gene_id' if it exists, 'NA' otherwise.
    2. GeneName:For BED file, this is always 'NA'. For GPF file, this is the 12'th column. For GTF format, this corresponds to 'gene_name' if it exists, 'NA' otherwise.
    3. Read Count:The number of reads mapped to this gene.
    4. Gene exon length:The length sum of all the exons of this gene.
    5. RPKM:The expression level of this gene in RPKM.
  • For JOB diff:

    The first output file contains 7 columns:

    1. GeneSymbol:
      Gene symbols. The order of gene symbol is the same as what appearing in the read count file.

    2. GeneName:
      Gene name. The order of gene name is the same as what appearing in the read count file.

    3. GFOLD:
      GFOLD value for every gene. The GFOLD value could be considered as a reliable log2 fold change. It is positive/negative if the gene is up/down regulated. The main usefulness of GFOLD is to provide a biological meanlingful ranking of the genes. The GFOLD value is zero if the gene doesn't show differential expression. If the log2 fold change is treated as a random variable, a positive GFOLD value x means that the probability of the log2 fold change (2nd/1st) being larger than x is (1 - the parameter specified by -sc); A negative GFOLD value x means that the probability of the log2 fold change (2st/1nd) being smaller than x is (1 - the parameter specified by -sc). If this file is sorted by this column in descending order then genes ranked at the top are differentially up-regulated and genes ranked at the bottom are differentially down-regulated. Note that a gene with GFOLD value 0 should never be considered differentially expressed. However, it doesn't mean that all genes with non-negative GFOLD value are differentially expressed. For taking top differentially expressed genes, the user is responsible for selecting the cutoff.

    4. E-FDR:
      Empirical FDR based on replicates. It is always 1 when no replicates are available.

    5. log2fdc:
      log2 fold change. If no replicate is available, and -acc is T, log2 fold change is based on read counts and normalization constants. Otherwise, log2 fold change is based on the sampled expression level from the posterior distribution.

    6. 1st RPKM:
      The RPKM for the first condition. It is available only if gene length is available. If multiple replicates are available, the RPKM is calculated simply by summing over replicates. Because RPKM is acturally using sequencing depth as the normalization constant, log2 fold change based on RPKM could be different from the log2fdc field.

    7. 2nd RPKM:
      The RPKM for the second condition. It is available only if gene length is available. Please refer to 1stRPKM for more information.

The second output file (.ext) contains the normalized read counts

  • GeneSymbol:Gene symbols. The order of gene symbol is the same as what appearing in the read count file.
  • Normalized Read Counts:Multiple columns (the same number of input samples) contains the read count normalized by the calculated normalization constants. The are only muturally comparable and the absolute value is meaningless.
  • Gene Description:The last column contains the associated gene description if the description file -d is provided.
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 159,569评论 4 363
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 67,499评论 1 294
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 109,271评论 0 244
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 44,087评论 0 209
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 52,474评论 3 287
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 40,670评论 1 222
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 31,911评论 2 313
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 30,636评论 0 202
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 34,397评论 1 246
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 30,607评论 2 246
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 32,093评论 1 261
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 28,418评论 2 254
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 33,074评论 3 237
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 26,092评论 0 8
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 26,865评论 0 196
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 35,726评论 2 276
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 35,627评论 2 270