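# Preparation

### Register your GEO information

On the submitter page, use the second option, CONTACT, to register your information. When you are done, click SAVE. PREVIEW lets you review what you entered.

Uploading data to GEO requires three kinds of files. The official wording is easy to follow, so it is quoted directly: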

GEOarchive has three required components:

1. metadata spreadsheet,
2. processed data files,
3. raw data files.

Details about each component are described below.

# Upload the data in three easy steps

EXAMPLE 1

## Compute the average insert size (the script was written for Python 2.7)

``````
$ head -10000 mapped.sam | python mean_size.py
220 35
$ samtools view mapped.bam | head -10000 | python mean_size.py
220 35
``````
``````python
#! /usr/local/bin/python2.7
"""
mean_size.py
Created by Tim Stuart

Reads SAM records from stdin and prints the mean and standard deviation
of the positive insert sizes (column 9, TLEN).
"""

from __future__ import print_function  # print() then works on Python 2.7 and 3

import sys

import numpy as np


def get_data(inp):
    """Collect positive insert sizes from SAM lines, skipping header lines."""
    lengths = []
    for line in inp:
        if line.startswith('@'):
            continue
        fields = line.split()
        length = int(fields[8])
        if length > 0:
            lengths.append(length)
    return lengths


def reject_outliers(data, m=2.):
    """
    rejects outliers more than m (default 2)
    standard deviations from the median
    """
    median = np.median(data)
    std = np.std(data)
    # Rebuild the list in place: removing items while iterating skips elements.
    data[:] = [item for item in data if abs(item - median) <= m * std]


def calc_size(data):
    mn = int(np.mean(data))
    std = int(np.std(data))
    return mn, std


if __name__ == "__main__":
    lengths = get_data(sys.stdin)
    reject_outliers(lengths)
    mn, std = calc_size(lengths)
    print(mn, std)
``````

``````
# Skip header lines (starting with @) and use only positive insert sizes (column 9)
awk '!/^@/ { if ($9 > 0) { N+=1; S+=$9; S2+=$9*$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((S2-M*M*N)/(N-1))}' sample.sam
# Filter the data, using insert size < 2000 as an example cutoff
awk '!/^@/ { if ($9 > 0) {if ($9 < 2000){ N+=1; S+=$9; S2+=$9*$9 }}} END { M=S/N; print "n="N", mean="M", stdev="sqrt ((S2-M*M*N)/(N-1))}' sample.sam
``````
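
The awk commands use the one-pass form of the sample standard deviation, sqrt((S2 - N*M^2)/(N-1)), where S2 is the sum of squared insert sizes and M is the mean. As a quick sanity check, the Python sketch below applies the same accumulators to a few made-up insert sizes (hypothetical values, not real data) and compares the result with `statistics.stdev`. The function name `one_pass_stats` is my own.

```python
import math
import statistics


def one_pass_stats(sizes, cutoff=None):
    """Mirror the awk accumulators N, S and S2 over positive insert sizes."""
    n = s = s2 = 0
    for x in sizes:
        if x <= 0:
            continue  # same test as `$9 > 0` in the awk command
        if cutoff is not None and x >= cutoff:
            continue  # same test as `$9 < 2000`
        n += 1
        s += x
        s2 += x * x
    mean = s / n
    stdev = math.sqrt((s2 - mean * mean * n) / (n - 1))
    return n, mean, stdev


# Hypothetical TLEN values; zeros, negatives and the 5000 outlier are skipped.
sizes = [210, -210, 250, 0, 190, 5000, 230]
n, mean, stdev = one_pass_stats(sizes, cutoff=2000)
kept = [x for x in sizes if 0 < x < 2000]
print(n, mean, round(stdev, 3), round(statistics.stdev(kept), 3))
```

Both standard deviations should agree. Note that the formula needs at least two retained values, otherwise the division by N-1 fails.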

# Uploading the data

FTP information

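``````
# Updated 2019-03-28
# On the command line
lftp ftp://geo:33%259uyj_fCh%3FM16H@ftp-private.ncbi.nlm.nih.gov

# You are now in the home directory of the GEO FTP server
mirror -R /home/your_directory/

# ls should now show your_directory (including its subdirectories) under the home directory
``````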
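
The password part of the lftp URL is percent-encoded (`%25` stands for `%`, `%3F` for `?`), because those characters are reserved inside a URL. If you ever need to build such a URL for different credentials, a minimal Python sketch could look like this. The user name, password and host below are placeholders I made up, not real GEO credentials, and `build_ftp_url` is a hypothetical helper.

```python
from urllib.parse import quote, unquote


def build_ftp_url(user, password, host):
    """Percent-encode user and password so they are safe inside a URL."""
    return "ftp://{}:{}@{}".format(
        quote(user, safe=""), quote(password, safe=""), host
    )


# Placeholder credentials for illustration only.
url = build_ftp_url("someuser", "my%pass?word", "ftp.example.org")
print(url)

# Decoding reverses the encoding and recovers the original password.
decoded = unquote("my%25pass%3Fword")
print(decoded)
```

The encoded form is what goes into the `lftp ftp://user:password@host` command line; the decoded form is what you would type at an interactive password prompt.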

# Write to GEO

Here is my own template:

``````
Dear GEO officer,

Thank you for kindly hosting such a great public data resource.

I have successfully transferred our lab's high-throughput sequencing data to the NCBI GEO FTP server as instructed.
The files are listed in the metadata_spreadsheet Excel file, which also includes their md5 checksums.
We hope you can assist us in uploading our data so that it can be shared with others.

Here is the information you may need for further processing:
2. Names of the directory and files deposited: /******
3. Public release date: not public until we send another e-mail to confirm; we need a private access link to share our data.

The files comprise three parts:
2. processed_data_files & md5.txt in /*******
3. raw_data_files & md5.txt in /*******

If there is any problem with the format or content, please do not hesitate to contact me.

Thanks!

Best,
NAME
``````
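
The letter refers to `md5.txt` files holding checksums for the uploaded data. One way to produce such a file is sketched below in Python. The function names, the `md5.txt` layout (checksum, two spaces, file name, as written by `md5sum`) and the directory argument are my assumptions, so check them against what GEO currently asks for.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_table(directory, out_name="md5.txt"):
    """Write one 'checksum  filename' line per regular file in directory."""
    lines = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories and the checksum file itself.
        if os.path.isfile(path) and name != out_name:
            lines.append("{}  {}".format(md5_of_file(path), name))
    with open(os.path.join(directory, out_name), "w") as out:
        out.write("".join(line + "\n" for line in lines))
    return lines
```

A call such as `write_md5_table("raw_data_files")` (a hypothetical directory name) would write `raw_data_files/md5.txt`. Reading in chunks keeps memory use low for large FASTQ or BAM files.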