基于参考基因组的基因组组装和注释

将基因组组装到染色体水平无非就是两种方式：

独立组装（de novo）；
基于参考基因组的组装（reference-guided），即基于近缘或同一物种的基因组染色体同源性进行组装。

独立组装中的contigs或scaffolds一般用遗传图谱、BioNano光学图谱或HiC技术进行染色体水平的挂，现在比较常用是HiC技术。基于参考基因组的组装一般是将contigs或scaffolds与近缘或同一物种的基因组进行比对，从而提升至染色体水平，比较常用的是RagTag工具包，是RaGOO的升级版。有的时候，我们没有相关图谱数据，只能采用此方法将基因组提升至染色体水平。

RagTag可以进行错误组装校正、scaffold组装和修补、scaffold合并等。之后，可以用Liftoff进行基因注释。

RagTag

Liftoff

RagTag简介

RagTag的流程一共四步：correct，scaffold，patch，merge。

image-20210710151440177

软件安装

conda install -c bioconda ragtag

correct

校正是使用参考基因组来鉴定和校正contigs中的组装错误，该步骤不会将序列减少或增加，仅仅是将序列在错误组装的位置进行打断。

用法

usage: ragtag.py correct <reference.fa> <query.fa>

Homology-based assembly correction: Correct sequences in 'query.fa' by comparing them to sequences in 'reference.fa'>

positional arguments:
  <reference.fa>        reference fasta file (uncompressed or bgzipped)
  <query.fa>            query fasta file (uncompressed or bgzipped)

optional arguments:
  -h, --help            show this help message and exit

correction options:
  -f INT                minimum unique alignment length [1000]
  --remove-small        remove unique alignments shorter than -f
  -q INT                minimum mapq (NA for Nucmer alignments) [10]
  -d INT                maximum alignment merge distance [100000]
  -b INT                minimum break distance from contig ends [5000]
  -e <exclude.txt>      list of reference headers to ignore [null]
  -j <skip.txt>         list of query headers to leave uncorrected [null]
  --inter               only break misassemblies between reference sequences
  --intra               only break misassemblies within reference sequences
  --gff <features.gff>  don't break sequences within gff intervals [null]

input/output options:
  -o PATH               output directory [./ragtag_output]
  -w                    overwrite intermediate files
  -u                    add suffix to unaltered sequence headers

mapping options:
  -t INT                number of minimap2/unimap threads [1]
  --aligner PATH        whole genome aligner executable ('nucmer', 'unimap' or 'minimap2') [minimap2]
  --mm2-params STR      space delimited minimap2 whole genome alignment parameters ['-x asm5']
  --unimap-params STR   space delimited unimap parameters ['-x asm5']
  --nucmer-params STR   space delimted nucmer whole genome alignment parameters ['--maxmatch -l 100 -c 500']

validation options:
  --read-aligner PATH   read aligner executable (only 'minimap2' is allowed) [minimap2]
  -R <reads.fasta>      validation reads (uncompressed or gzipped) [null]
  -F <reads.fofn>       same as '-R', but a list of files [null]
  -T STR                read type. 'sr', 'ont' and 'corr' accepted for Illumina, nanopore and error corrected long-reads, respectively
                        [null]
  -v INT                coverage validation window size [10000]
  --max-cov INT         break sequences at regions at or above this coverage level [AUTO]
  --min-cov INT         break sequences at regions at or below this coverage level [AUTO]

scaffold

该步骤是将相邻的contigs序列用100个N连起来，序列的位置和方向需要根据与参考基因组的比对结果确定。

用法

usage: ragtag.py scaffold <reference.fa> <query.fa>

Homology-based assembly scaffolding: Order and orient sequences in 'query.fa' by comparing them to sequences in 'reference.fa'>

positional arguments:
  <reference.fa>       reference fasta file (uncompressed or bgzipped)
  <query.fa>           query fasta file (uncompressed or bgzipped)

optional arguments:
  -h, --help           show this help message and exit

scaffolding options:
  -e <exclude.txt>     list of reference sequences to ignore [null]
  -j <skip.txt>        list of query sequences to leave unplaced [null]
  -J <hard-skip.txt>   list of query headers to leave unplaced and exclude from 'chr0' ('-C') [null]
  -f INT               minimum unique alignment length [1000]
  --remove-small       remove unique alignments shorter than '-f'
  -q INT               minimum mapq (NA for Nucmer alignments) [10]
  -d INT               maximum alignment merge distance [100000]
  -i FLOAT             minimum grouping confidence score [0.2]
  -a FLOAT             minimum location confidence score [0.0]
  -s FLOAT             minimum orientation confidence score [0.0]
  -C                   concatenate unplaced contigs and make 'chr0'
  -r                   infer gap sizes. if not, all gaps are 100 bp
  -g INT               minimum inferred gap size [100]
  -m INT               maximum inferred gap size [100000]

input/output options:
  -o PATH              output directory [./ragtag_output]
  -w                   overwrite intermediate files
  -u                   add suffix to unplaced sequence headers

mapping options:
  -t INT               number of minimap2/unimap threads [1]
  --aligner PATH       aligner executable ('nucmer', 'unimap' or 'minimap2') [minimap2]
  --mm2-params STR     space delimited minimap2 parameters ['-x asm5']
  --unimap-params STR  space delimited unimap parameters ['-x asm5']
  --nucmer-params STR  space delimted nucmer parameters ['--maxmatch -l 100 -c 500']

patch

该步骤是用contigs序列对上一步得到的scaffold序列进行gap填补。该步骤比较耗时，如果急需使用基因组进行后续分析，可以省略该步骤。

用法

usage: ragtag.py patch <target.fa> <query.fa>

Homology-based continuous assembly scaffolding and gap-filling: Make continuous joins and fill gaps in 'target.fa' using sequences from 'query.fa'

positional arguments:
  <target.fa>          target fasta file (uncompressed or bgzipped)
  <query.fa>           query fasta file (uncompressed or bgzipped)

optional arguments:
  -h, --help           show this help message and exit

patching:
  -e <exclude.txt>     list of target sequences to ignore [null]
  -j <skip.txt>        list of query sequences to ignore [null]
  -f INT               minimum unique alignment length [1000]
  --remove-small       remove unique alignments shorter than '-f'
  -q INT               minimum mapq (NA for Nucmer alignments) [10]
  -d INT               maximum alignment merge distance [100000]
  -s INT               minimum merged alignment length [50000]
  -i FLOAT             maximum merged alignment distance from sequence terminus. fraction of the sequence length if < 1 [0.05]
  --fill-only          only fill existing target gaps. do not join target sequences
  --join-only          only join and patch target sequences. do not fill existing gaps

input/output options:
  -o PATH              output directory [./ragtag_output]
  -w                   overwrite intermediate files
  -u                   add suffix to unplaced sequence headers

mapping options:
  -t INT               number of minimap2/unimap threads [1]
  --aligner PATH       aligner executable ('nucmer' (recommended), 'unimap' or 'minimap2') [nucmer]
  --mm2-params STR     space delimited minimap2 parameters ['-x asm5']
  --unimap-params STR  space delimited unimap parameters ['-x asm5']
  --nucmer-params STR  space delimted nucmer parameters ['--maxmatch -l 100 -c 500']

merge

在scaffolding过程中，可能会根据不同参数或图谱数据产生多个版本的基因组组装结果，该步骤可以将多个结果根据权重进行最终组装结果的生成。

如果有HiC数据，还可以加入HiC数据生成比较好的组装结果。

用法

usage: ragtag.py merge <asm.fa> <scf1.agp> <scf2.agp> [...]

Scaffold merging: derive a consensus scaffolding solution by reconciling distinct scaffoldings of 'asm.fa'

positional arguments:
  <asm.fasta>           assembly fasta file (uncompressed or bgzipped)
  <scf1.agp> <scf2.agp> [...]
                        scaffolding AGP files

optional arguments:
  -h, --help            show this help message and exit

merging options:
  -f FILE               CSV list of (AGP file,weight) [null]
  -j <skip.txt>         list of query headers to leave unplaced [null]
  -l INT                minimum assembly sequence length [100000]
  -e FLOAT              minimum edge weight. NA if using Hi-C [0.0]
  --gap-func STR        function for merging gap lengths {'min', 'max', or 'mean'} [min]

input/output options:
  -o PATH               output directory [./ragtag_output]
  -w                    overwrite intermediate files
  -u                    add suffix to unplaced sequence headers

Hi-C options:
  -b FILE               Hi-C alignments in BAM format, sorted by read name [null]
  -r STR                CSV list of restriction enzymes/sites or 'DNase' [GATC]
  -p FLOAT              portion of the sequence termini to consider for links [1.0]
  --list-enzymes        list all available restriction enzymes/sites

Liftoff 是一个可以准确根据同一物种或近缘物种基因组进行基因注释映射的工具。该工具仅需两个基因组序列和参考基因组的基因注释文件即可进行基因注释。Liftoff使用minimap2将参考基因组的基因序列与目标基因组比对，这样的好处是，即使两个基因组间存在许多结构上的差异，也可将基因结构鉴定出来。对于每一个鉴定的基因，Liftoff找到外显子区的比对，并使序列的一致性最大，同时保留转录本和基因结构。如果两个基因错误地比对到同一个位点，Liftoff会确定最好基因结构。此外，Liftoff还可以找到在目标基因组中存在而在参考基因组中不存在的基因拷贝。

安装

conda install -c bioconda liftoff

用法

usage: liftoff [-h] (-g GFF | -db DB) [-o FILE] [-u FILE] [-exclude_partial]
               [-dir DIR] [-mm2_options =STR] [-a A] [-s S] [-d D] [-flank F]
               [-V] [-p P] [-m PATH] [-f TYPES] [-infer_genes]
               [-infer_transcripts] [-chroms TXT] [-unplaced TXT] [-copies]
               [-sc SC] [-overlap O] [-mismatch M] [-gap_open GO]
               [-gap_extend GE]
               target reference

Lift features from one genome assembly to another

Required input (sequences):
  target              target fasta genome to lift genes to
  reference           reference fasta genome to lift genes from

Required input (annotation):
  -g GFF              annotation file to lift over in GFF or GTF format
  -db DB              name of feature database; if not specified, the -g
                      argument must be provided and a database will be built
                      automatically

Output:
  -o FILE             write output to FILE in GFF3 format; by default, output
                      is written to terminal (stdout)
  -u FILE             write unmapped features to FILE; default is
                      "unmapped_features.txt"
  -exclude_partial    write partial mappings below -s and -a threshold to
                      unmapped_features.txt; if true partial/low sequence
                      identity mappings will be included in the gff file with
                      partial_mapping=True, low_identity=True in comments
  -dir DIR            name of directory to save intermediate fasta and SAM
                      files; default is "intermediate_files"

Alignments:
  -mm2_options =STR   space delimited minimap2 parameters. By default ="-a
                      --end-bonus 5 --eqx -N 50 -p 0.5"
  -a A                designate a feature mapped only if it aligns with
                      coverage ≥A; by default A=0.5
  -s S                designate a feature mapped only if its child features
                      (usually exons/CDS) align with sequence identity ≥S; by
                      default S=0.5
  -d D                distance scaling factor; alignment nodes separated by
                      more than a factor of D in the target genome will not be
                      connected in the graph; by default D=2.0
  -flank F            amount of flanking sequence to align as a fraction
                      [0.0-1.0] of gene length. This can improve gene
                      alignment where gene structure differs between target
                      and reference; by default F=0.0

Miscellaneous settings:
  -h, --help          show this help message and exit
  -V, --version       show program version
  -p P                use p parallel processes to accelerate alignment; by
                      default p=1
  -m PATH             Minimap2 path
  -f TYPES            list of feature types to lift over
  -infer_genes        use if annotation file only includes transcripts,
                      exon/CDS features
  -infer_transcripts  use if annotation file only includes exon/CDS features
                      and does not include transcripts/mRNA
  -chroms TXT         comma seperated file with corresponding chromosomes in
                      the reference,target sequences
  -unplaced TXT       text file with name(s) of unplaced sequences to map
                      genes from after genes from chromosomes in chroms.txt
                      are mapped; default is "unplaced_seq_names.txt"
  -copies             look for extra gene copies in the target genome
  -sc SC              with -copies, minimum sequence identity in exons/CDS for
                      which a gene is considered a copy; must be greater than
                      -s; default is 1.0
  -overlap O          maximum fraction [0.0-1.0] of overlap allowed by 2
                      features; by default O=0.1
  -mismatch M         mismatch penalty in exons when finding best mapping; by
                      default M=2
  -gap_open GO        gap open penalty in exons when finding best mapping; by
                      default GO=2
  -gap_extend GE      gap extend penalty in exons when finding best mapping;
                      by default GE=1

参数的详细说明可以参考这里。

下面是我最近用此方法组装一个基因组并进行注释的流程。

基因组组装与注释分析实战

contigs校正

ragtag.py correct westar.fasta ZYCR1-genome.clean.fa -t 30 -u

输出文件：

ragtag.correct.fasta: 校正的contigs序列；
ragtag.correct.agp：包含contigs序列的断点坐标的文件。

生成scaffold

ragtag.py scaffold westar.fasta ragtag_output/ragtag.correct.fasta -t 30 -u -C

输出文件：

ragtag.scaffold.agp：contigs序列的方向和顺序（AGP格式）；
ragtag.scaffold.fasta：scaffold序列；
ragtag.scaffold.stats：scaffolding过程中的一些统计信息。

基因注释

liftoff -g westar.gff3 -o ZYCR1.gene.gff3 -p 28 -chroms chr_pairs.txt ZYCR1.genome.asm.final.fasta westar.fa

为了提高基因映射的准确性，这里仅允许相同染色体间的基因进行映射，因此需要指定两个基因组间的染色体对应关系，即-chroms参数，其格式如下：

chrA01,chrA01
chrA02,chrA02
chrA03,chrA03
chrA04,chrA04
chrA05,chrA05
chrA06,chrA06
chrA07,chrA07
chrA08,chrA08
chrA09,chrA09
chrA10,chrA10
chrC01,chrC01
chrC02,chrC02
chrC03,chrC03
chrC04,chrC04
chrC05,chrC05
chrC06,chrC06
chrC07,chrC07
chrC08,chrC08
chrC09,chrC09
Uncluster,Uncluster

基因重命名

这里可以使用一个脚本GeneIDRename进行：

perl GeneIDRename -g ZYCR1.gene.gff3 -o ZYCR1.gene.rename.gff3 -c 6

提取mRNA和protein序列

gffread ZYCR1.gene.rename.gff3 -g ZYCR1.genome.asm.final.fasta -x ZYCR1.cds.fa -y ZYCR1.pep.fa

参考资料

https://github.com/malonge/RagTag
https://github.com/malonge/RaGOO
https://github.com/agshumate/Liftoff
Alonge, M., Soyk, S., Ramakrishnan, S. et al. RaGOO: fast and accurate reference-guided scaffolding of draft genomes. Genome Biol 20, 224 (2019). https://doi.org/10.1186/s13059-019-1829-6
Alaina Shumate, Steven L Salzberg, Liftoff: accurate mapping of gene annotations, Bioinformatics, 2021;, btaa1016, https://doi.org/10.1093/bioinformatics/btaa1016

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 162,825评论 4赞 377
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 68,887评论 2赞 308
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 112,425评论 0赞 255
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 44,801评论 0赞 224
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 53,252评论 3赞 299
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 41,089评论 1赞 226
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 32,216评论 2赞 322
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 31,005评论 0赞 215
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 34,747评论 1赞 250
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 30,883评论 2赞 255
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 32,354评论 1赞 265
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 28,694评论 3赞 265
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 33,406评论 3赞 246
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 26,222评论 0赞 9
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 26,996评论 0赞 201
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 36,242评论 2赞 287
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 36,017评论 2赞 281

基于参考基因组的基因组组装和注释

基于参考基因组的基因组组装和注释

RagTag简介

软件安装

correct

用法

scaffold

用法

patch

用法

merge

用法

Liftoff简介

安装

用法

基因组组装与注释分析实战

contigs校正

生成scaffold

基因注释

基因重命名

提取mRNA和protein序列

参考资料

推荐阅读更多精彩内容