GWAS走起~（2 fastqc、fastp）

Step four：

对从公司送回来的原始数据raw data预处理，至于问什么要进行这一步，通俗点讲就像是炒菜，直接从地里拔出来的青菜是不能直接用的，而正规的程序，至少要洗洗切切；而洗菜得用盆，切菜得用刀，所以要处理这些raw data也得要找点工具：首先得有个切菜的桌子放菜，而我们要放数据而且是大数据，一般的电脑是肯定不行的，最好是台服务器，而且要台扩大内存的服务器，后期我会讲一讲自己理解的该怎那么配一台能做生信的电脑；其次，就是菜刀了，刀的种类有很多，这里的工具（软件）也是多种多样：fastqc、fastp、bwa、samtools、vcftools、gatk、picard、plink~~~~~这里的是我了解的一小部分，据统计这样的工具现在有数千种，但是步骤就这些，它们的功能也都是大同小异，同一功能软件的差异我以为是侧重点有所不同，这个需要细品，但是大众化的也就我列举的这几个，反正对于我这样的小白白是够了，我也不想去细品了。

我认为那到菜的第一件事就是看看这菜新不新鲜、能不能吃，数据也是这样，看看这个数据合不合格、能不能用。下面就要用到我们的第一把刀了——fastqc，用它看一下“这菜”怎么样？

哎幺幺！忘了和大家说一个前提，做生信的大多数软件最好用Linux系统，其实这个系统也很简单，就和你用惯了Android，现在得用iOS了，多少你的适应一下。其实啊，这里用到的命令也就几条，想了解，后面我也和你说一下。这其实都不难，如果非要说点难点，我觉着就是得了解Linux下每个软件参数怎么用。

噢噢噢噢噢！

没别的意思就是想提提提神，下面就让我们一起了解一下fastqc。

Fastqc是一款基于JAVA的软件，它可以很快的对raw

data 进行数据质量评估，也就是“看菜”，读书人叫他“QC”（quality control）,至于问什么这么叫我觉着它就是想让像我们这样的小白白第一次看的时候看不懂。

用fastqc可以了解以下信息

测序数据的基本信息：也就是basic statistics

每个碱基的质量值：只要大多数数据》20就差不多

每条reads序列的质量值：横轴是quality，纵轴是reads的数目，可以大体上了解那一部分质量比较差

每条序列的ATCG组成：A=T,G=C，满足这个碱基互补配对原则

每条序列N的含量：一般不出现什么东西，有东西出现就是有问题了

GC含量分布：一般是两条线，一条蓝的事理论值，一条红的不均匀的线，一般跟小狗啃的样，崎岖不平，但是模样应该和理论值差不多，如果偏差太大就是样本应该是被污染了

每条序列的长度分布：根据图可以大体了解哪一块长度的reads比较多

序列中duplication程度：重复序列是正常的，在提dna时，样品本来就是有重复的序列，因为，我们不是提的单细胞基因，还有在测序是要进行桥式pcr扩增，用以各碱基扩大信号。但是如果duplication过大就说明了有bias的存在。

其实，还有许多信息，但是，我们这里了解这些就够了。而且如果有问题fastqc会用叉号提示你的。

知道了fastqc可以做什么，能得到什么，我们下面就了解一下怎么用它吧！

首先下载它，安装它，然后用它（这部分我补上的）。在我看来Linux的软件的玩法都差不多，知道它的大体用法，了解一下它的参数基本就可以操作了。好！盘它！

首先敲一下fastqc –h看装好了吗！

FastQC - A high throughputsequence QC analysis tool

SYNOPSIS

fastqc seqfile1 seqfile2 .. seqfileN

fastqc [-o output dir] [--(no)extract] [-ffastq|bam|sam]

[-c contaminant file] seqfile1 ..seqfileN

DESCRIPTION

FastQC reads a set of sequence files andproduces from each one a quality

control report consisting of a number ofdifferent modules, each one of

which will help to identify a differentpotential type of problem in your

data.

If no files to process are specified on thecommand line then the program

will start as an interactive graphicalapplication. If files are provided

on the command line then the program willrun with no user interaction

required. In this mode it is suitable for inclusion into a standardised

analysis pipeline.

The options for the program as as follows:

-h --help Print this help file and exit

-v --version Print the version of the program and exit

-o --outdir Create all output files in the specifiedoutput directory.

Please note that thisdirectory must exist as the program

will not create it. If this option is not set then the

output file for eachsequence file is created in the same

directory as the sequencefile which was processed.

--casava Files come from raw casava output.Files in the same sample

group (differing only bythe group number) will be analysed

as a set rather thanindividually. Sequences with the filter

flag set in the header willbe excluded from the analysis.

Files must have the same names given tothem by casava

(including being gzippedand ending with .gz) otherwise they

won't be grouped togethercorrectly.

--nano Files come from nanopore sequencesand are in fast5 format. In

this mode you can pass indirectories to process and the program

will take in all fast5files within those directories and produce

a single output file fromthe sequences found in all files.

--nofilter If running with --casava then don'tremove read flagged by

casava as poor quality whenperforming the QC analysis.

--extract If set then the zipped output file willbe uncompressed in

the same directory after ithas been created. By default

this option will be set iffastqc is run in non-interactive

mode.

-j --java Provides the full path to the javabinary you want to use to

launch fastqc. If notsupplied then java is assumed to be in

your path.

--noextract Do not uncompress the output file aftercreating it. You

should set this option ifyou do not wish to uncompress

the output when running innon-interactive mode.

--nogroup Disable grouping of bases for reads>50bp. All reports will

show data for every base inthe read. WARNING: Using this

option will cause fastqc tocrash and burn if you use it on

really long reads, and your plots may end upa ridiculous size.

You have been warned!

--min_length Sets an artificial lower limit on thelength of the sequence

to be shown in the report. As long as you set this to a value

greater or equal to yourlongest read length then this will be

the sequence length used tocreate your read groups. This can

be useful for makingdirectly comaparable statistics from

datasets with somewhatvariable read lengths.

-f --format Bypasses the normal sequence file formatdetection and

forces the program to usethe specified format. Valid

formats arebam,sam,bam_mapped,sam_mapped and fastq

-t --threads Specifies the number of files which can beprocessed

simultaneously. Each thread will be allocated 250MB of

memory so you shouldn't run morethreads than your

available memory will copewith, and not more than

6 threads on a 32 bitmachine

-c Specifies a non-default file whichcontains the list of

--contaminants contaminants to screen overrepresentedsequences against.

The file must contain setsof named contaminants in the

formname[tab]sequence. Lines prefixed with ahash will

be ignored.

-a Specifies a non-default filewhich contains the list of

--adapters adapter sequences which will be explicitysearched against

the library. The file mustcontain sets of named adapters

in the formname[tab]sequence. Lines prefixed with ahash

will be ignored.

-l Specifies a non-default filewhich contains a set of criteria

--limits which will be used to determine thewarn/error limits for the

various modules. This file can also be used to selectively

remove some modules fromthe output all together. The format

needs to mirror the defaultlimits.txt file found in the

Configuration folder.

-k --kmers Specifies the length of Kmer to look forin the Kmer content

module. Specified Kmer lengthmust be between 2 and 10. Default

length is 7 if notspecified.

-q --quiet Supress all progress messages on stdoutand only report errors.

-d --dir Selects a directory to be used for temporaryfiles written when

generating report images.Defaults to system temp directory if

not specified.

BUGS

Any bugs in fastqc should be reportedeither to simon.andrews@babraham.ac.uk

or inwww.bioinformatics.babraham.ac.uk/bugzilla/

然后，你如果得到这些内容，它基本就是好了！如果other，就是有问题，自己查一下解决一下，不过也可以私信，一起探讨一下。

fastq-o ./ ../reads/example1.*

-o：用来指定输出文件的所在目录，注意是不能自动新建目录的。

-f：用来强制指定输入文件格式，默认会自动检测。

-c：用来指定一个contaminant文件，fastqc会把overrepresented sequences往这个contaminant文件里搜索。

contaminant文件的格式是"Name\tSequences"，#开头的行是注释。

加上 -q 会进入沉默模式，（这个就没什么必要）

在我看来了解-o 、-f就可以操作了

好！找个文件，尝试一下

[hai@localhost~]$ cd ~/proj1/fastqc/

[hai@localhost fastqc]$ fastqc -f fastq -o ./../reads/example1.*

Started analysis of example1.L.fq

Approx 5% complete for example1.L.fq

Approx 10% complete for example1.L.fq

Approx 15% complete for example1.L.fq

Approx 20% complete for example1.L.fq

Approx 25% complete for example1.L.fq

Approx 30% complete for example1.L.fq

Approx 35% complete for example1.L.fq

Approx 40% complete for example1.L.fq

Approx 45% complete for example1.L.fq

Approx 50% complete for example1.L.fq

Approx 55% complete for example1.L.fq

Approx 60% complete for example1.L.fq

Approx 65% complete for example1.L.fq

Approx 70% complete for example1.L.fq

Approx 75% complete for example1.L.fq

Approx 80% complete for example1.L.fq

Approx 85% complete for example1.L.fq

Approx 90% complete for example1.L.fq

Approx 95% complete for example1.L.fq

Analysis complete for example1.L.fq

Started analysis of example1.R.fq

Approx 5% complete for example1.R.fq

Approx 10% complete for example1.R.fq

Approx 15% complete for example1.R.fq

Approx 20% complete for example1.R.fq

Approx 25% complete for example1.R.fq

Approx 30% complete for example1.R.fq

Approx 35% complete for example1.R.fq

Approx 40% complete for example1.R.fq

Approx 45% complete for example1.R.fq

Approx 50% complete for example1.R.fq

Approx 55% complete for example1.R.fq

Approx 60% complete for example1.R.fq

Approx 65% complete for example1.R.fq

Approx 70% complete for example1.R.fq

Approx 75% complete for example1.R.fq

Approx 80% complete for example1.R.fq

Approx 85% complete for example1.R.fq

Approx 90% complete for example1.R.fq

Approx 95% complete for example1.R.fq

Analysis complete for example1.R.fq

欧了，点开文件查看一下！

Fastqc基本就是这些，看那个红叉叉就是有问题的，有问题，下面就要解决问题了。

如果再跟做饭联系起来，fastqc的作用就相当于看看这刚拔出来的菜什么地方脏，下面就是要洗洗切切了，下一个工具是fastp，说到这个工具，我不禁感叹世界是多么美好啊。

Fastp：fastp开发为具有有用的质量控制和数据过滤功能的超快速FASTQ预处理器。只需扫描FASTQ数据，它即可执行质量控制，适配器修整，质量过滤，每次读取质量修剪和许多其他操作。该工具是用C ++开发的，具有多线程支持。根据我们的评估，fastp比其他FASTQ预处理工具（如Trimmomatic或Cutadapt）快2至5倍，尽管执行的操作要比类似工具多得多。

具体优势：

对数据自动进行全方位质控，生成人性化的报告。过滤功能（低质量，太短，太多N……）。对每一个序列的头部或尾部，计算滑动窗内的质量均值，并将均值较低的子序列进行切除（类似Trimmomatic的做法，但是快非常多）。全局剪裁（在头/尾部，不影响去重），对于Illumina下机数据往往最后一到两个cycle需要这样处理。去除接头污染。厉害的是，你不用输入接头序列，因为算法会自动识别接头序列并进行剪裁。对于双端测序（PE）的数据，软件会自动查找每一对read的重叠区域，并对该重叠区域中不匹配的碱基对进行校正。去除尾部的polyG。对于Illumina

NextSeq/NovaSeq的测序数据，因为是两色法发光，polyG是常有的事，所以该特性对该两类测序平台默认打开。对于PE数据中的overlap区间中不一致的碱基对，依据质量值进行校正可以对带分子标签（UMI）的数据进行预处理，不管UMI在插入片段还是在index上，都可以轻松处理。可以将输出进行分拆，而且支持两种模式，分别是指定分拆的个数，或者分拆后每个文件的行数。

好了，说了这么多开盘吧！

参数：

-i, --in1 R1文件输入；

-I, --in2 R2文件输入；

-o, --out1 R1文件处理后的输出；

-O, --out2 R2文件处理后的输出；

-h, --html 设置输出html格式的质控结果文件名，不设置则默认html文件名为fastp.html

-j, --json 设置输出html格式的质控结果文件名，不设置则默认json文件名为fastp.json

其实它还有一些参数，这个根据需要自己查一下稍微改一下就好了，对于小白白来说，一切都是最好的安排，就一切都默认吧！

SE：

fastp -i in.fq -o out.fq

PE：

fastp -i in.R1.fq -o out.R1.fq -I in.R2.fq -O out.R2.fq

#注意大小写

fastp对于输入和输出都支持gzip压缩，只要文件名的末尾带有.gz，就会被认为是gzip压缩文件，会启用gzip对输入输出进行压缩和解压处理。

fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz

除此之外，fastp还会给出一个人性化的报告。为啥人性化自己看吧。

经过处理后,再用fastqc看一下是否都绿了.

Fstqc –f fastq –o ./ ../data/reads/example.*

哈哈哈，全绿了，红叉叉也没有了。

Ok，下一步,序列比对！

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 159,117评论 4赞 362
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 67,328评论 1赞 293
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 108,839评论 0赞 243
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 44,007评论 0赞 206
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 52,384评论 3赞 287
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 40,629评论 1赞 219
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 31,880评论 2赞 313
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 30,593评论 0赞 198
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 34,313评论 1赞 243
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 30,575评论 2赞 246
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 32,066评论 1赞 260
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 28,392评论 2赞 253
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 33,052评论 3赞 236
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 26,082评论 0赞 8
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 26,844评论 0赞 195
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 35,662评论 2赞 274
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 35,575评论 2赞 270

GWAS走起~（2 fastqc、fastp）

推荐阅读更多精彩内容