bioRxiv Bioinformatics Highlights, January 2020

montreal 生信人 2020.2.8

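Since May 2018, 生信人 has published a monthly "bioRxiv Bioinformatics Highlights" column. At roughly 10 papers per issue, we have introduced 200 preprints so far. We started the column just to give it a try: partly to bring good bioRxiv papers to bioinformatics practitioners as quickly as possible, and partly to publicize and promote preprints.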

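Readership has grown from about 1,500 views for the first issue to about 2,500 recently, and we are glad to see that bioRxiv seems to be gaining attention and recognition from more and more people. Even so, compared with papers formally published in well-known journals, media coverage of preprint results has remained rare. That changed completely at the end of last month: bioRxiv made the headlines of major news outlets several times! In truth, it is less that preprints have "defeated" traditional peer-reviewed papers than that the impact of the novel coronavirus outbreak is so large. The situation highlights the unmatched advantage of preprints in responding to emergencies, although in our view, during the response to this outbreak, even preprints have at times seemed not quite fast enough at delivering research results.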

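A few days ago one topic was especially hot. The lab of Bishwajit Kundu at the Indian Institute of Technology claimed that certain segments of the novel coronavirus spike protein may have arisen from insertions of HIV proteins [1]. The paper immediately caused an uproar: within just one week, 27 readers had left comments beneath it, and it had been retweeted 363 times on Twitter. Facing criticism from all sides, the authors withdrew the paper under pressure within two days. As the Chinese saying goes, what makes you can also break you (成也萧何败也萧何): because preprints lack peer review, they have ridden the wave of attention around the outbreak, but they have also let through highly controversial and possibly seriously flawed results like this one. Using preprint results to guide clinical practice calls for particular caution. For this reason, bioRxiv recently added a new reminder to its website.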

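Seen from another angle, it was precisely because it appeared as a preprint that this paper's potential errors were kept out of a formal publication. After all, most papers need the approval of only two or three reviewers to pass peer review and be published, whereas this paper by the Indian researchers arguably received VIP treatment: the entire internet served as its reviewers.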

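We would also like to remind readers that there are many other excellent preprint servers besides bioRxiv, such as medRxiv, a medical preprint server built around the team behind bioRxiv. In this issue we have specially included two of its latest papers. Let's take a look.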

1. Researchers at the University of Warwick (UK) develop a fast search tool for bacterial genomes

BlastFrost: Fast querying of 100,000s of bacterial genomes in Bifrost graphs

BlastFrost is a highly efficient method for querying 100,000s of genome assemblies. It builds on Bifrost, a recently developed dynamic data structure for compacted and colored de Bruijn graphs from bacterial genomes. BlastFrost queries a Bifrost data structure for sequences of interest, and extracts local subgraphs, thereby enabling the efficient identification of the presence or absence of individual genes or single nucleotide sequence variants. Here we describe the algorithms and implementation of BlastFrost. We also present two exemplar practical applications. In the first, we determined the presence of the individual genes within the SPI-2 Salmonella pathogenicity island within a collection of 926 representative genomes in minutes. In the second application, we determined the existence of known single nucleotide polymorphisms associated with fluoroquinolone resistance in the genes gyrA, gyrB and parE among 190,209 Salmonella genomes. BlastFrost is available for download at https://github.com/nluhmann/BlastFrost.
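To make the idea of querying a colored k-mer structure for gene presence or absence more concrete, here is a minimal Python sketch. It is not BlastFrost or the Bifrost API: it replaces the compacted, colored de Bruijn graph with a plain dictionary from each k-mer to the set of genomes ("colors") containing it, and the function names, toy sequences, and `min_fraction` threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_colored_index(genomes, k):
    """Map every k-mer to the set of genome names ("colors") that contain it."""
    index = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def query_presence(index, genomes, query, k, min_fraction=1.0):
    """Return the genomes that contain at least min_fraction of the query's k-mers."""
    kmers = [query[i:i + k] for i in range(len(query) - k + 1)]
    if not kmers:
        return set()
    hits = defaultdict(int)
    for kmer in kmers:
        # .get avoids creating empty entries for k-mers absent from the index.
        for name in index.get(kmer, ()):
            hits[name] += 1
    return {name for name in genomes if hits[name] / len(kmers) >= min_fraction}


# Toy data (hypothetical): g1 and g2 share a region, g2 differs at its last base.
genomes = {
    "g1": "ACGTACGTGGA",
    "g2": "TTTTACGTGGC",
    "g3": "CCCCCCCCCCC",
}
index = build_colored_index(genomes, k=4)

exact = query_presence(index, genomes, "ACGTGGA", k=4)
relaxed = query_presence(index, genomes, "ACGTGGA", k=4, min_fraction=0.75)
print(sorted(exact), sorted(relaxed))
```

Requiring every k-mer to be present behaves like an exact-match query, while lowering `min_fraction` tolerates a few mismatching k-mers, which is one simple way to think about queries that must survive single nucleotide variants.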

2. Snel lab at Utrecht University (Netherlands): the evolutionary history of eukaryotic kinases

The first eukaryotic kinome tree illuminates the dynamic history of present-day kinases

Eukaryotic Protein Kinases (ePKs) are essential for eukaryotic cell signalling. Several phylogenetic trees of the ePK repertoire of single eukaryotes have been published, including the human kinome tree. However, a eukaryote-wide kinome tree was missing due to the large number of kinases in eukaryotes. Using a pipeline that overcomes this problem, we present here the first eukaryotic kinome tree. The tree reveals that the Last Eukaryotic Common Ancestor (LECA) possessed at least 92 ePKs, much more than previously thought. The retention of these LECA ePKs in present-day species is highly variable. Fourteen human kinases with unresolved placement in the human kinome tree were found to originate from three known ePK superfamilies. Further analysis of ePK superfamilies shows that they exhibit markedly diverse evolutionary dynamics between the LECA and present-day eukaryotes. The eukaryotic kinome tree thus unveils the evolutionary history of ePKs, but the tree also enables the transfer of functional information between related kinases.

3. University of Georgia uses PacBio + Nanopore + Bionano to produce gapless assemblies of maize chromosomes

Gapless assembly of maize chromosomes using long read technologies

Creating gapless telomere-to-telomere assemblies of complex genomes is one of the ultimate challenges in genomics. We used long read technologies and an optical map based approach to produce a maize genome assembly composed of only 63 contigs. The B73-Ab10 genome includes gapless assemblies of chromosome 3 (236 Mb) and chromosome 9 (162 Mb), multiple highly repetitive centromeres and heterochromatic knobs, and 53 Mb of the Ab10 meiotic drive haplotype.

4. Robert Edgar releases URMAP, claimed to be an order of magnitude faster than BWA and Bowtie2

URMAP, an ultra-fast read mapper

Mapping of reads to reference sequences is an essential step in a wide range of biological studies. The large size of datasets generated with next-generation sequencing technologies motivates the development of fast mapping software. Here, I describe URMAP, a new read mapping algorithm. URMAP is an order of magnitude faster than BWA and Bowtie2 with comparable accuracy on a benchmark test using simulated paired 150nt reads of a well-studied human genome. Software is freely available at https://drive5.com/urmap.

5. EvoGeneX, a tool for comparing gene expression across species

Modeling gene expression evolution with EvoGeneX uncovers differences in evolution of species, organs and sexes

To solve this challenge, we introduce EvoGeneX, a computationally efficient method to uncover the mode of gene expression evolution based on the Ornstein-Uhlenbeck process. Importantly, EvoGeneX in addition to modelling expression variations between species, models within species variation. To estimate the within species variation, EvoGeneX formally incorporates the data from biological replicates as a part of the mathematical model. We show that by modelling the within species diversity EvoGeneX significantly outperforms the currently available computational method. In addition, to facilitate comparative analysis of gene expression evolution, we introduce a new approach to measure the dynamics of evolutionary divergence of a group of genes. We used EvoGeneX to analyse the evolution of expression across different organs, species and sexes of the Drosophila genus. Our analysis revealed differences in the evolutionary dynamics of male and female gonads, and uncovered examples of adaptive evolution of genes expressed in the head and in the thorax.
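For readers unfamiliar with the Ornstein-Uhlenbeck process mentioned in the abstract, the sketch below simulates a single trait (for example, an expression level) along one lineage. It is only an illustration of the process itself, not of EvoGeneX, which fits the model on a phylogeny and incorporates replicates. The symbols `theta` (optimum), `alpha` (strength of the pull toward the optimum), and `sigma` (magnitude of random drift) follow common textbook notation and are assumptions here, not names taken from the preprint.

```python
import math
import random


def simulate_ou(x0, theta, alpha, sigma, dt, steps, rng=None):
    """Euler-Maruyama discretisation of dX = alpha * (theta - X) dt + sigma dW."""
    rng = rng or random.Random(0)  # fixed seed by default so runs are repeatable
    x = x0
    path = [x]
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        # Deterministic pull toward theta plus a random perturbation scaled by sqrt(dt).
        x += alpha * (theta - x) * dt + sigma * math.sqrt(dt) * noise
        path.append(x)
    return path


# With sigma = 0 the trait simply relaxes toward the optimum theta.
pulled = simulate_ou(x0=0.0, theta=5.0, alpha=1.0, sigma=0.0, dt=0.1, steps=100)

# With alpha = 0 there is no optimum and the process reduces to Brownian motion.
drifting = simulate_ou(x0=0.0, theta=5.0, alpha=0.0, sigma=1.0, dt=0.1, steps=100)

print(round(pulled[-1], 3), round(drifting[-1], 3))
```

The contrast between the two runs is the intuition behind using this family of models to separate constrained or adaptive expression evolution (a pull toward an optimum) from neutral drift.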

6. UK researchers find that human intergenic RNA mainly derives from nascent transcripts of known genes

Intergenic RNA mainly derives from nascent transcripts of known genes

Eukaryotic genomes undergo pervasive transcription, leading to the production of many types of stable and unstable RNAs. Transcription is not restricted to regions with annotated gene features but includes almost any genomic context. Currently, the source and function of most RNAs originating from intergenic regions in the human genome remains unclear. We hypothesised that many intergenic RNA can be ascribed to the presence of as-yet unannotated genes or the ‘fuzzy’ transcription of known genes that extends beyond the annotated boundaries. To elucidate the contributions of these two sources, we assembled a dataset of >2.5 billion publicly available RNA-seq reads across 5 human cell lines and multiple cellular compartments to annotate transcriptional units in the human genome. About 80% of transcripts from unannotated intergenic regions can be attributed to the fuzzy transcription of existing genes; the remaining transcripts originate mainly from putative long non-coding RNA loci that are rarely spliced. We validated the transcriptional activity of these intergenic RNA using independent measurements, including transcriptional start sites, chromatin signatures, and genomic occupancies of RNA polymerase II in various phosphorylation states. We also analysed the nuclear localisation and sensitivities of intergenic transcripts to nucleases to illustrate that they tend to be rapidly degraded either ‘on-chromatin’ by XRN2 or ‘off-chromatin’ by the exosome.

7. 42 cannabis genomes reveal copy number variation in cannabinoid synthesis genes

Sequence and annotation of 42 cannabis genomes reveals extensive copy number variation in cannabinoid synthesis and pathogen resistance genes

Cannabis is a diverse and polymorphic species. To better understand cannabinoid synthesis inheritance and its impact on pathogen resistance, we shotgun sequenced and assembled a Cannabis trio (sibling pair and their offspring) utilizing long read single molecule sequencing. This resulted in the most contiguous Cannabis sativa assemblies to date. These reference assemblies were further annotated with full-length male and female mRNA sequencing (Iso-Seq) to help inform isoform complexity, gene model predictions and identification of the Y chromosome. To further annotate the genetic diversity in the species, 40 male, female, and monoecious cannabis and hemp varietals were evaluated for copy number variation (CNV) and RNA expression. This identified multiple CNVs governing cannabinoid expression and 82 genes associated with resistance to Golovinomyces chicoracearum, the causal agent of powdery mildew in cannabis. Results indicated that breeding for plants with low tetrahydrocannabinolic acid (THCA) concentrations may result in deletion of pathogen resistance genes. Low THCA cultivars also have a polymorphism every 51 bases while dispensary grade high THCA cannabis exhibited a variant every 73 bases. A refined genetic map of the variation in cannabis can guide more stable and directed breeding efforts for desired chemotypes and pathogen-resistant cultivars.

8. Steven Salzberg (Johns Hopkins University) reports more than 2 million contaminated sequences in GenBank

Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank

Metagenomic sequencing allows researchers to investigate organisms sampled from their native environments by sequencing their DNA directly, and then quantifying the abundance and taxonomic composition of the organisms thus captured. However, these types of analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here we describe Conterminator, an efficient method to detect and remove incorrectly labelled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination in 114,035 sequences and 2767 species in the NCBI Reference Sequence Database (RefSeq), 2,161,746 sequences and 6795 species in the GenBank database, and 14,132 protein sequences in the NR non-redundant protein database. Conterminator uncovers contamination in sequences spanning the whole range from draft genomes to “complete” model organism genomes. Our method, which scales linearly with input size, was able to process 3.3 terabytes of genomic sequence data in 12 days on a single 32-core compute node. We believe that Conterminator can become an important tool to ensure the quality of reference databases with particular importance for downstream metagenomic analyses. Source code (GPLv3): https://github.com/martin-steinegger/conterminator

9. [medRxiv] How effective was the Wuhan lockdown? A joint report from researchers at Beijing Normal University and the University of Oxford

Early evaluation of the Wuhan City travel restrictions in response to the 2019 novel coronavirus outbreak

An ongoing outbreak of a novel coronavirus (2019-nCoV) was first reported in China and has spread worldwide. On January 23rd 2020 China shut down transit in and out of Wuhan, a major transport hub and conurbation of 11 million inhabitants, to contain the outbreak. By combining epidemiological and human mobility data we find that the travel ban slowed the dispersal of nCoV from Wuhan to other cities in China by 2.91 days (95% CI: 2.54-3.29). This delay provided time to establish and reinforce other control measures that are essential to halt the epidemic. The ongoing dissemination of 2019-nCoV provides an opportunity to examine how travel restrictions impede the spatial dispersal of an emerging infectious disease.

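10. [medRxiv] US researchers argue that current airport screening contributes little to containing the international spread of the novel coronavirus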

Estimated effectiveness of traveller screening to prevent international spread of 2019 novel coronavirus (2019-nCoV)

Traveller screening is being used to limit further global spread of 2019 novel coronavirus (nCoV) following its recent emergence. Here, we analyze the expected impact of different travel screening programs given remaining uncertainty around the values of key nCoV life history and epidemiological parameters. Even under best-case assumptions, we estimate that screening will miss around half of infected travellers. Breaking down the factors leading to screening successes and failures, we find that most cases missed by screening are fundamentally undetectable, because they have not yet developed symptoms and are unaware they were exposed. These findings emphasize the need for measures to track travellers who become ill after being missed by a travel screening program. We make our model available for interactive use so stakeholders can explore scenarios of interest using the most up-to-date information. We hope these findings contribute to evidence-based policy to combat the spread of nCoV, and to prospective planning to mitigate future emerging.
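The reasoning in this abstract, that some infected travellers are undetectable no matter how good the screening is, can be illustrated with a back-of-the-envelope calculation. The sketch below is not the authors' model: it assumes two independent screening routes (a symptom check and an exposure questionnaire), and every parameter name and value is a hypothetical placeholder chosen for illustration, not an estimate from the preprint.

```python
def fraction_detected(p_symptomatic, p_aware, symptom_sensitivity, report_rate):
    """Probability that an infected traveller is caught by at least one screening route.

    Assumes (hypothetically) that showing symptoms and being aware of exposure
    are independent events.
    """
    caught_by_symptoms = p_symptomatic * symptom_sensitivity
    caught_by_questionnaire = p_aware * report_rate
    return 1 - (1 - caught_by_symptoms) * (1 - caught_by_questionnaire)


# Illustrative placeholder values only.
p_symptomatic = 0.4  # share of infected travellers already showing symptoms
p_aware = 0.2        # share who know they were exposed

realistic = fraction_detected(p_symptomatic, p_aware,
                              symptom_sensitivity=0.7, report_rate=0.25)
best_case = fraction_detected(p_symptomatic, p_aware,
                              symptom_sensitivity=1.0, report_rate=1.0)

# Travellers with no symptoms and no known exposure cannot be caught by either route.
undetectable = (1 - p_symptomatic) * (1 - p_aware)

print(round(realistic, 3), round(best_case, 3), round(undetectable, 3))
```

Even with perfect instruments, the detected fraction is capped at one minus the undetectable share, which is why follow-up of travellers after arrival matters in this kind of argument.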

References

1. Pradhan et al., Uncanny similarity of unique inserts in the 2019-nCoV spike protein to HIV-1 gp120 and Gag. bioRxiv, 2020.

Originally published on the 生信人 WeChat official account.

Last edited: 2020.03.22 16:55:55
