Common file conversion commands in bioinformatics

Reference link (placed first)

Supported file conversion types

  • text file to wig
  • bed to wig
  • wig to bed
  • wig to bigwig
  • bed to bigbed
  • BAM to bedGraph for UCSC genome browser
  • bam to bigwig
  • bed to gff
  • Split bed file by chromosome
  • gff to gtf
  • gtf to bed
  • blat to gff
  • ...

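Because the original page can only be reached through a VPN, the whole page is copied and pasted here (hopefully this does not infringe any copyright!)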

Convert text file to wig

 Sample command:
   txt2wig.pl foo.txt trackName(one word) > foo.wig

Convert bed to wig

    Sample command:
    bed2wig.pl inputBed sampleName(one word) probeWidth > outputWig
    Note: It assumes that the probe width in all records is constant.
          If probe width is not constant, you can use bedGraph format.
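          To convert bed to bedGraph format, set the track type to bedGraph and keep the chrom, start, end and value columns. The original note also says to subtract 1 from the chromosome end position; UCSC documents both BED and bedGraph as zero-based, half-open, so check whether that adjustment applies to your data.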
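
As a rough illustration of the bedGraph route mentioned in the note, the sketch below keeps the chromosome, start, end and score columns of a BED file and prepends a `track type=bedGraph` line. It assumes the signal sits in BED column 5; the function name `bed_to_bedgraph` and the file names are hypothetical and are not part of the BaRC scripts. Coordinates are passed through unchanged (UCSC documents both BED and bedGraph as zero-based, half-open); if your input needs the end-position adjustment described in the note, change `$3` to `$3 - 1`.

```shell
# Hypothetical helper: BED (score assumed in column 5) -> bedGraph on stdout.
bed_to_bedgraph() {
    bed_file=$1
    track_name=$2
    printf 'track type=bedGraph name="%s"\n' "$track_name"
    # Skip track/browser/comment lines; keep chrom, start, end, score.
    awk 'BEGIN { FS = OFS = "\t" }
         /^(track|browser|#)/ { next }
         NF >= 5 { print $1, $2, $3, $5 }' "$bed_file"
}

# Usage (file names are placeholders):
#   bed_to_bedgraph foo.bed sampleName > foo.bedGraph
```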

Convert wig to bed

  Sample command with variableStep wig format:
   wig2bed.pl inputWig sampleName(one word) > outputBed
  
   Sample command with fixedStep wig format:
   wig2bed_fixedStep.pl inputWig > outputBed

Convert wig to bigwig

  Sample commands:
  Get chromosome lengths
   fetchChromSizes  hg18 > chrSize.txt
  Convert wig to big wig:  
   wigToBigWig foo.wig chrSize.txt foo.bw

Convert bed to bigbed

 Sample commands:
  Get chromosome lengths
   fetchChromSizes  hg18 > chrSize.txt
  Convert bed to big bed:  
   bedToBigBed foo.bed chrSize.txt foo.bb

Convert BAM to bedGraph for UCSC genome browser

  To view BAM files in the UCSC Genome Browser, both foo.sorted.bam and foo.sorted.bam.bai have to be on an HTTP or FTP server. One way to get around this is to convert the BAM files into bedGraph files, which should be small enough to be uploaded directly.
   genomeCoverageBed -split -bg -ibam sorted.bam -g hg19.genome    
   where hg19.genome file is tab delimited and structured as follows:
       <chromName><TAB><chromSize>
       chr1    249250621
   One can use the UCSC Genome Browser's MySQL database to extract chromosome sizes. For example, H. sapiens:
       mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo" > hg19.genome
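
The query output normally starts with a column header line (`chrom	size`), which does not fit the two-column layout shown above. A minimal sketch for keeping only well-formed `<chromName><TAB><chromSize>` lines follows; the function name `clean_chrom_sizes` is hypothetical.

```shell
# Hypothetical helper: print only lines that look like "<chromName><TAB><chromSize>".
clean_chrom_sizes() {
    awk 'BEGIN { FS = OFS = "\t" }
         NF == 2 && $2 ~ /^[0-9]+$/ { print $1, $2 }' "$1"
}

# Usage (file names are placeholders):
#   clean_chrom_sizes hg19.genome > hg19.genome.clean
```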

Convert bam to bigwig

  • Method 1: Single-base resolution across the genome
  Step1: convert bam to bedGraph format:
genomeCoverageBed -split -bg -ibam accepted_hits.bam -g /nfs/genomes/mouse_gp_jul_07/anno/mm9.size > accepted_hits.bedGraph

  Step2: convert bedGraph to bigwig format:
bedGraphToBigWig  accepted_hits.bedGraph /nfs/genomes/mouse_gp_jul_07/anno/mm9.size accepted_hits.bw
    where mm9.size file is tab delimited and structured as follows:
        <chromName><TAB><chromSize>
  • Method 2: Resolution of desired window size (after creating windows across desired regions or genome)
coverageBed -a Human.hg19.1000.500.bed -b Sample_1.sorted.bam | cut -f1-4 > Sample_1.1000.500.coverage.bedgraph
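
Method 2 presupposes a BED file of windows. The name `Human.hg19.1000.500.bed` suggests 1000-bp windows placed every 500 bp, though that reading is an assumption. `bedtools makewindows -g <sizes> -w 1000 -s 500` is one way to produce such a file; the awk sketch below builds similar windows from a chromosome size file without extra dependencies. The function name `make_windows` is hypothetical.

```shell
# Hypothetical helper: sliding windows (BED, zero-based half-open) from a chrom size file.
# Arguments: sizes_file window_size step_size
make_windows() {
    sizes_file=$1
    window=$2
    step=$3
    awk -v w="$window" -v s="$step" 'BEGIN { FS = OFS = "\t" }
        $2 ~ /^[0-9]+$/ {                  # ignore header or malformed lines
            for (start = 0; start < $2; start += s) {
                end = start + w
                if (end > $2) end = $2     # clip the last windows at the chromosome end
                print $1, start, end
            }
        }' "$sizes_file"
}

# Usage (file names are placeholders):
#   make_windows hg19.genome 1000 500 > Human.hg19.1000.500.bed
```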

Update/fix UCSC GTF file

  • GTF files from the UCSC Table Browser use RefSeq ids (NM*) for both gene_id and transcript_id, which may not be compatible with some programs (e.g. counting by gene with HTSeq)
  • Some Refseq gtf files (such as for the hg19, hg18, mm9, and dm3 assemblies) are in /nfs/genomes/, under gtf/ in each species folder. If you would like to create additional files, here are the steps:
  Step 1: Use UCSC Table Browser to download RefSeq id and gene symbol.
    Use "Genes and Gene Prediction Tracks" for group, "RefSeq Genes" for track and "refGene" for table.  Choose  "selected fields from primary and related tables" for output format and click "get output".  In the next page select "name" and "name2" for the fields.  
    output format should be : NM_017940       NBPF1
  Step 2: Download a gtf file from the UCSC Table Browser
    This uses refseq ID as gene_id and transcript_id, so we need to replace it with the gene symbol.
    sample command:  
      /nfs/BaRC_Public/BaRC_code/Perl/fix_gtf_refSeq_ensembl.pl hg19.refgene.gtf refseq2symbol > hg19.refgene.tmp.gtf && mv hg19.refgene.tmp.gtf hg19.refgene.gtf
  Step 3: About 50-70 genes in the gtf file from UCSC are incorrect; they include exons with a start coordinate that is larger than the end coordinate.  
    Software such as cufflinks fails to deal with this situation and ignores these exons. 
    Since this only affects the last 1-3 bases of a transcript, a temporary solution is to remove these records.
      sample command: awk -F"\t" '{ if($4<=$5) print $0 }' hg19.refgene.gtf > hg19.refgene_new.gtf
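
Before dropping records it can help to see how many are affected. The sketch below counts GTF lines whose start (column 4) is larger than the end (column 5); the function name `count_inverted_records` is hypothetical.

```shell
# Hypothetical helper: count GTF records with start > end.
count_inverted_records() {
    awk -F'\t' '$0 !~ /^#/ && NF >= 5 && $4 > $5 { n++ }
                END { print n + 0 }' "$1"
}

# Usage (file name is a placeholder):
#   count_inverted_records hg19.refgene.gtf
```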

Convert bed to gff

  • Note that bed and gff use slightly different coordinate conventions
  • Use /nfs/BaRC_Public/BaRC_code/Perl/bed2gff/bed2gff.pl
    USAGE: bed2gff.pl bedFile > gffFile
    Ex: bed2gff.pl foo.bed WIBR exon > foo.gff
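
To make the coordinate difference concrete: BED starts are zero-based and ends are exclusive, while GFF starts are one-based and ends are inclusive, so only the start needs `+ 1`. The awk sketch below illustrates that conversion; it is not the BaRC `bed2gff.pl` script, and the function name `bed_to_gff_sketch` and the choice to put the BED name in the last GFF column are assumptions.

```shell
# Hypothetical helper: BED -> GFF-style lines on stdout.
# Arguments: bed_file source_label feature_label
bed_to_gff_sketch() {
    bed_file=$1
    src_label=$2
    feat_label=$3
    awk -v src="$src_label" -v feat="$feat_label" 'BEGIN { FS = OFS = "\t" }
        /^(track|browser|#)/ { next }
        NF >= 3 {
            name   = (NF >= 4) ? $4 : "."
            score  = (NF >= 5) ? $5 : "."
            strand = (NF >= 6) ? $6 : "."
            # BED start is zero-based, GFF start is one-based; the end stays the same.
            print $1, src, feat, $2 + 1, $3, score, strand, ".", name
        }' "$bed_file"
}

# Usage (file names are placeholders):
#   bed_to_gff_sketch foo.bed WIBR exon > foo.gff
```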

Split bed file by chromosome

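  • Sometimes it is easier to work with the regions of one chromosome at a time
  • Output files will be named like "Sample_1.chr1.bed". Because the command appends (>>), remove old output files before rerunning.
 awk '{ out = "Sample_1." $1 ".bed"; if (out != prev) { close(prev); prev = out } print >> out }' Sample_1_all_chrs.bed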

Convert gff to gtf

  • Use gffread: try 'gffread -h' to see the program's many options
gffread My_transcripts_genes.gff3 -T -E -o My_transcripts_genes.gtf

Convert gtf to bed

  • convert gtf to genePred
  gtfToGenePred my.gtf my.genePred
  • convert genePred to bed:
   awk -f genePredToBed my.genePred > my.bed
  • genePredToBed is an awk script by Katrina Learned, downloaded from the UCSC Genome Browser discussion list
#!/usr/bin/awk -f

#
# Convert genePred file to a bed file (on stdout)
#
BEGIN {
     FS="\t";
     OFS="\t";
}
{
     name=$1
     chrom=$2
     strand=$3
     start=$4
     end=$5
     cdsStart=$6
     cdsEnd=$7
     blkCnt=$8

     delete starts
     split($9, starts, ",");
     delete ends
     split($10, ends, ",");
     blkStarts=""
     blkSizes=""
     for (i = 1; i <= blkCnt; i++) {
         blkSizes = blkSizes (ends[i]-starts[i]) ",";
         blkStarts = blkStarts (starts[i]-start) ",";
     }

     print chrom, start, end, name, 1000, strand, cdsStart, cdsEnd, 0, blkCnt, blkSizes, blkStarts
}

Convert blat to gff

  • Use /nfs/BaRC_Public/BaRC_code/Perl/blat2gff/blat2gff.pl
 Convert BLAT output file (PSL format) into GFF format (v1.1 14 Dec 2010)
   blat2gff.pl blatFile dataSource(ex:WIBR) > gffFile

Create wiggle files for visualizing paired-end data mapping to the + and - strands

  • split reads by the strand they map to
# input:    accepted_hits.bam
# output:   accepted_hits_negStrand.bam: mapped to negative strand
#       accepted_hits_posStrand.bam: mapped to positive strand

bsub "samtools view -f 16 -b accepted_hits.bam >| accepted_hits_negStrand.bam"
bsub "samtools view -F 16 -b accepted_hits.bam >| accepted_hits_posStrand.bam"
  • split reads by pair
# input:    accepted_hits_posStrand.bam or accepted_hits_negStrand.bam
# output:   1st pair: *_1stPair.bam
#           2nd pair: *_2ndPair.bam
bsub "samtools view -b -f 0x0040 accepted_hits_posStrand.bam > accepted_hits_posStrand_1stPair.bam"
bsub "samtools view -b -F 0x0040 accepted_hits_posStrand.bam > accepted_hits_posStrand_2ndPair.bam"
bsub "samtools view -b -f 0x0040 accepted_hits_negStrand.bam > accepted_hits_negStrand_1stPair.bam"
bsub "samtools view -b -F 0x0040 accepted_hits_negStrand.bam > accepted_hits_negStrand_2ndPair.bam"
  • convert from bam to bedgraph format
# input:    bam format: accepted_hits_*Strand_*Pair.bam
#           /nfs/genomes/mouse_gp_jul_07/anno/mm9.size: length of each chromosome, format like 
#                                   chr1    197195432
# output:   bedgraph format: accepted_hits_*Strand_*Pair.bedgraph
bsub "genomeCoverageBed -split -bg -ibam accepted_hits_posStrand_1stPair.bam -g mm9.size >| accepted_hits_posStrand_1stPair.bedgraph"
bsub "genomeCoverageBed -split -bg -ibam accepted_hits_posStrand_2ndPair.bam -g mm9.size >| accepted_hits_posStrand_2ndPair.bedgraph"
bsub "genomeCoverageBed -split -bg -ibam accepted_hits_negStrand_1stPair.bam -g mm9.size >| accepted_hits_negStrand_1stPair.bedgraph"
bsub "genomeCoverageBed -split -bg -ibam accepted_hits_negStrand_2ndPair.bam -g mm9.size >| accepted_hits_negStrand_2ndPair.bedgraph"
  • join the reads sharing the same strand
# This step is for fr-firststrand libraries (such as dUTP), where the read orientations are
    1+-,1-+,2++,2--

    read1 mapped to '+' strand indicates parental gene on '-' strand
    read1 mapped to '-' strand indicates parental gene on '+' strand
    read2 mapped to '+' strand indicates parental gene on '+' strand
    read2 mapped to '-' strand indicates parental gene on '-' strand
 
# input:    bedgraph file from the same strand
# output:   merged bedgraph: pos.bedgraph or neg.bedgraph
unionBedGraphs -i accepted_hits_posStrand_2ndPair.bedgraph accepted_hits_negStrand_1stPair.bedgraph |awk '{ print $1"\t"$2"\t"$3"\t"$4+$5 }' >|pos.bedgraph
unionBedGraphs -i accepted_hits_posStrand_1stPair.bedgraph accepted_hits_negStrand_2ndPair.bedgraph |awk '{ print $1"\t"$2"\t"$3"\t-"$4+$5 }' >|neg.bedgraph
# optional: check the library strandedness with RSeQC's infer_experiment.py
infer_experiment.py -r mm9.refseq.bed12 -i accepted_hits.bam
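
The awk step in the merge commands relies on unionBedGraphs writing one value column per input file, so columns 4 and 5 hold the coverage of the two tracks being joined. A small sketch of the same sum with an explicit sign follows; the function name `sum_union_columns` is hypothetical, and it reads the unionBedGraphs output from standard input.

```shell
# Hypothetical helper: sum the two value columns of unionBedGraphs output.
# Argument: 1 for the + strand track, -1 for the - strand track.
sum_union_columns() {
    sign=$1
    awk -v sign="$sign" 'BEGIN { FS = OFS = "\t" }
        NF >= 5 { print $1, $2, $3, sign * ($4 + $5) }'
}

# Usage (file names follow the commands above):
#   unionBedGraphs -i accepted_hits_posStrand_1stPair.bedgraph accepted_hits_negStrand_2ndPair.bedgraph | sum_union_columns -1 > neg.bedgraph
```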
  • convert bedgraph to bigwig
# get rid of header lines of mm9.size: the header line with "chrom   size" is removed
# input:    mm9.size: length of each chromosome
# output:   mm9.size_noHeader
tail -n +2 mm9.size > mm9.size_noHeader
# convert bedgraph to bigwig
# input:    bedgraph file: neg.bedgraph or pos.bedgraph
#           mm9.size_noHeader: length of each chromosome
# output:   bigwig format: neg.bw or pos.bw
#           neg.bw or pos.bw can be visualized with IGV/UCSC genome browser
bsub bedGraphToBigWig neg.bedgraph mm9.size_noHeader neg.bw
bsub bedGraphToBigWig pos.bedgraph mm9.size_noHeader pos.bw