# Part 2: The Titanic

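Using a Titanic passenger dataset obtained from Kaggle, this post analyzes which factors gave passengers a higher survival rate.

• 1. Does sex affect the survival rate?
• 2. Does age affect the survival rate?
• 3. Does passenger class affect the survival rate?
• 4. The combined effect of sex and passenger class on the survival rate
• 5. The combined effect of sex and age on the survival rate
• 6. The combined effect of age and passenger class on the survival rate
Here the passenger's sex, age, and class are the three independent variables, and the survival rate is the dependent variable.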

``````from __future__ import division  # __future__ imports go first so the cell also runs as a plain script
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
# Import the required modules first
# Show plots inline in the IPython notebook
%matplotlib inline
``````

``````path='/Users/zhongyaode/Desktop/udacity—data/'
``````
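The notebook export does not show the cell that creates `df1`. Presumably it reads a CSV file found under `path`. A minimal sketch of that step, assuming a hypothetical file name `titanic-data.csv` (the real name is not given in the source) and a made-up helper `load_passengers`:

```python
import io
import pandas as pd

def load_passengers(source):
    """Read the passenger table from a CSV path or a file-like object."""
    return pd.read_csv(source)

# Hypothetical usage; the file name is an assumption, not taken from the notebook:
# df1 = load_passengers(path + 'titanic-data.csv')

# Tiny in-memory stand-in so the sketch runs without the real file
sample_csv = io.StringIO(
    u"PassengerId,Survived,Pclass,Sex,Age\n"
    u"1,0,3,male,22\n"
    u"2,1,1,female,38\n"
)
sample_df = load_passengers(sample_csv)
```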

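• PassengerId: passenger ID
• Survived: whether the passenger survived; 1 (Rescued) means survived, 0 (not saved) means did not survive
• Pclass: passenger class; "1" is Upper, "2" is Middle, "3" is Lower
• Name: passenger name
• Sex: sex
• Age: age
• SibSp: number of siblings or spouses the passenger had aboard
• Parch: number of parents or children the passenger had aboard
• Ticket: ticket information
• Fare: ticket fare
• Cabin: cabin number (missing for most passengers)
• Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)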

``````df1.head()
``````

``````df1.info()
``````
``````    <class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
``````

``````df1.describe()
``````

Embarked has two missing values, which are filled with the mode, 'S'. Because so few values are missing, this has little effect on the results of the analysis.

``````df1['Embarked']=df1['Embarked'].fillna('S')
``````
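Rather than hard-coding 'S', the mode can be computed from the column itself. A small sketch on made-up values (the toy Series below is an assumption for illustration, not the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df1['Embarked'] with two missing entries
embarked = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S', np.nan])

# Series.mode() ignores NaN by default and returns the most frequent value(s)
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)
```

If several values tie for most frequent, `mode()` returns all of them and `[0]` picks the first, so the choice should be checked by eye in that case.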

``````age_mean=df1['Age'].mean()
df1['Age']=df1['Age'].fillna(age_mean)
``````

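The Cabin column has many missing values (only 204 of 891 entries are non-null) and is not needed for this analysis, so it is dropped.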

``````df=df1.copy()
del df['Cabin']
``````
• Data exploration

``````survives_passenger_df=df[df['Survived']==1]
``````

``````# Group passengers by `name` and count the passengers in each group
def group_passenger_count(data,name):
    # number of passengers in each group after grouping by `name`
    return data.groupby(name)['PassengerId'].count()
``````

``````def group_passenger_survived_rate(xx):
    # number of passengers in each group after grouping by xx
    group_all=group_passenger_count(df,xx)
    # number of survivors in each group
    group_survived_value=group_passenger_count(survives_passenger_df,xx)
    # survival rate of each group
    return group_survived_value/group_all
``````

``````def print_pie(group_data,title):
    # pie chart with percentage labels
    group_data.plot.pie(title=title,figsize=(6,6),autopct='%.2f%%',
                        startangle=90,legend=True)
``````

``````def print_bar(data,title):
    # bar chart of rates, each bar annotated as a percentage
    bar=data.plot.bar(title=title)
    for p in bar.patches:
        bar.annotate('%.1f%%'%(p.get_height()*100),
                     (p.get_x()*1.005,p.get_height()*1.005))
``````

``````def print_bar_count(data,title):
    # bar chart of counts, each bar annotated with its count
    bar=data.plot.bar(title=title)
    for p in bar.patches:
        bar.annotate('%.f'%(p.get_height()),
                     (p.get_x()*1.005,p.get_height()*1.005))
``````

``````# # Effect of sex on the survival rate
# df_sex1=df['Sex'][df['Survived']==1]
# df_sex0=df['Sex'][df['Survived']==0]
# plt.hist([df_sex1,df_sex0],
#         stacked=True,
#         label=['Rescued','not saved'])
# plt.xticks([-1,0,1,2],[-1,'F','M',2])
# plt.legend()
# plt.title('Sex_Survived')
``````
``````by_Survived=df.groupby('Sex')['Sex'].count()
by_Survived
``````
``````Sex
female    314
male      577
Name: Sex, dtype: int64
``````
``````# Pie chart of the sex ratio among all passengers
by_Sex=df.groupby('Sex')['Sex'].count()
plt.pie(by_Sex,labels=['female','male'],autopct='%.2f%%')
``````
output_22_1.png

``````
by_Survived_Sex=df[df['Survived']==1]
by_Survived_sex_rate=by_Survived_Sex.groupby('Sex')['Sex'].count()
plt.pie(by_Survived_sex_rate,labels=['female','male'],autopct='%.2f%%')
``````
output_24_1.png

• Conclusion: women had a higher chance of survival than men

``````df.groupby('Sex')['Survived'].mean()
``````
``````    Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

``````

``````print_bar(df.groupby('Sex')['Survived'].mean(),'Sex_survived')
# the bar chart can also be drawn directly:
#df.groupby('Sex')['Survived'].mean().plot(kind='bar')
``````
output_29_0.png
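• The survival rate for women is p_female = 74%
``````# The overall survival rate is 0.3838, so the null hypothesis is that women also survive with probability p=0.3838; N is set to the total number of passengers (891)
# standard error: SD=sqrt(p*(1-p)/N)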
import math
p=0.3838
sd=math.sqrt(p*(1-p)/891)
print 'sd:',sd
``````
``````sd: 0.0162920029545
``````


``````# At the 95% confidence level (alpha = 0.05) the two-sided critical value is 1.96
# margin of error: m=sd*1.96
m=sd*1.96
print 'm:',m
``````
``````m: 0.0319323257908
``````
``````# 95% confidence interval around p under the null hypothesis
ci=(p-m,p+m)
print 'ci',ci
``````
``````ci (0.3518676742091846, 0.41573232579081537)
``````

p_female lies outside ci, so the survival rate for women differs significantly from the overall rate.

• Conclusion: sex affects the survival rate
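The interval above is computed with N = 891, the total number of passengers. As a cross-check, the same normal-approximation formula can be applied with N set to the number of women (314, from the counts shown earlier). This is a sketch of that variant, not the notebook's own calculation; the function name `proportion_interval` is made up for illustration.

```python
import math

def proportion_interval(p, n, z=1.96):
    """Normal-approximation interval p +/- z*sqrt(p*(1-p)/n)."""
    margin = z * math.sqrt(p * (1 - p) / n)
    return (p - margin, p + margin)

p_all = 0.3838        # overall survival rate, from the text
n_female = 314        # number of women, from the groupby output
p_female = 0.742038   # observed survival rate for women

low, high = proportion_interval(p_all, n_female)
outside = not (low <= p_female <= high)
```

With the smaller N the interval widens to roughly 0.33 to 0.44, and p_female still falls well outside it, so the conclusion does not depend on which N is used.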

``````age_survived=df['Age'][df['Survived']==1]
age_not_survived=df['Age'][df['Survived']==0]
plt.hist([age_survived,age_not_survived],
         stacked=True,
         label=['Rescued','not saved'])
plt.legend()
plt.title('Age_Survived')
``````
output_41_1.png

``````def describe_value(data,label):
    print 'All passengers: '+label
    print 'max:',df[data].max()
    print 'min:',df[data].min()
    print 'mean:',df[data].mean()
    print ' '
    print 'Survivors: '+label
    print 'max:',survives_passenger_df[data].max()
    print 'min:',survives_passenger_df[data].min()
    print 'mean:',survives_passenger_df[data].mean()
``````
``````describe_value('Age','age')
``````
``````All passengers: age

``````

``````# Split age into even groups of 10 years each
bins=np.arange(0,90,10)
df['Age_group']=pd.cut(df['Age'],bins)
# number of survivors and non-survivors in each age group
by_age_count=df.groupby(['Age_group','Survived'])['Survived'].count()
# survival rate of each age group
by_age_rate=df.groupby('Age_group')['Survived'].mean()
``````
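By default `pd.cut` builds intervals that are open on the left and closed on the right, so with these bins an age of exactly 10 lands in (0, 10], and an age of exactly 0 would fall outside every bin. A small sketch on made-up ages:

```python
import numpy as np
import pandas as pd

bins = np.arange(0, 90, 10)                    # edges 0, 10, ..., 80
ages = pd.Series([0.42, 10, 10.5, 29.7, 80])   # made-up ages

groups = pd.cut(ages, bins)
# position of each age's interval among the 8 bins (0 is the first bin)
codes = groups.cat.codes.tolist()
```

Here 0.42 and 10 share the first bin, 10.5 moves to the second, and 80 lands in the last one.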
``````ci
``````
``````(0.3518676742091846, 0.41573232579081537)
``````
``````by_age_rate.plot.bar(title='Survived rate by age')
plt.ylabel('Survival rate')
plt.axhline(y=0.415,color='r',linestyle='--')
plt.axhline(y=0.351,color='g',linestyle='--')
``````
output_48_1.png

``````# Visualize the number of survivors and non-survivors in each age group
by_age_count.unstack().plot(kind='bar',stacked=True)
plt.title('Survived count by age')
plt.ylabel('Passenger count')
``````
output_50_1.png

``````print_bar_count(df.groupby(['Age_group'])['Survived'].count(),'Age_count')
plt.axhline(y=15,color='r',linestyle='--')
``````
output_52_1.png

``````print_bar(df.groupby('Age_group')['Survived'].mean(),'Age_group')
plt.axhline(y=0.415,color='r',linestyle='--')
plt.axhline(y=0.351,color='g',linestyle='--')
``````
output_53_1.png

ci =(0.3518676742091846, 0.41573232579081537)

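• Conclusion: the 0-10 and 30-40 age groups have survival rates above the overall average, while the 20-30 and 60-70 groups are below it

• The 0-10 group has the highest survival rate. Filling the missing ages with the mean may shrink the differences between age groups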
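The shrinking effect mentioned above can be seen on a toy example: filling gaps with the mean leaves the mean unchanged but lowers the standard deviation. The ages below are made up for illustration, and the sketch uses the `statistics` module from Python 3 rather than the notebook's Python 2.7.

```python
import statistics

known_ages = [4.0, 22.0, 30.0, 38.0, 61.0]   # made-up observed ages
n_missing = 3                                 # made-up number of gaps

mean_age = statistics.mean(known_ages)
filled_ages = known_ages + [mean_age] * n_missing

sd_before = statistics.pstdev(known_ages)
sd_after = statistics.pstdev(filled_ages)
```

In the real data 177 of the 891 ages (891 minus the 714 non-null values) are filled this way, and all of them end up in the same 10-year group.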

``````print_bar_count(df.groupby(['Pclass'])['Survived'].count(),'Pclass_count')
``````
output_57_0.png

``````print_pie(group_passenger_count(df,'Pclass'),'All Passenger Pclass')
``````
output_58_0.png
• Third class accounts for more than half of all passengers

``````#survives_passenger_df.groupby('Pclass')['Survived'].count().plot(kind='bar')
#by_age_count.unstack().plot(kind='bar',stacked=True)
#b=survives_passenger_df.groupby('Pclass')['Survived'].count()
#plt.xticks([0,1,2],['Upper_rate','Middle_rate','lower_rate'])
#plt.legend()
print_bar_count(survives_passenger_df.groupby('Pclass')['Survived'].count(),'rate_Pclass')
``````
output_60_0.png
``````# Pie chart of passenger class among survivors
print_pie(group_passenger_count(survives_passenger_df,'Pclass'),'Survived Passenger Pclass')
``````
output_61_0.png

• Conclusion: passenger class has a large effect on the survival rate

``````pclass_survived=df['Pclass'][df['Survived']==1]
pclass_not_survived=df['Pclass'][df['Survived']==0]
plt.hist([pclass_survived,pclass_not_survived],
         stacked=True,
         label=['Rescued','not saved'])
plt.xticks([1,2,3],['Upper','Middle','lower'])
plt.legend()
plt.title('Pclass_Survived')
``````

``````df.groupby(['Pclass','Survived'])['Survived'].count().unstack().plot(kind='bar',stacked=True)
plt.axhline(y=15,color='r',linestyle='--')
``````
output_64_1.png

``````print_bar(group_passenger_survived_rate('Pclass'),'Pclass_Survived')
plt.axhline(y=0.415,color='r',linestyle='--')
plt.axhline(y=0.351,color='g',linestyle='--')
``````
output_65_1.png

• Conclusion: the survival rate of class "1" > class "2" > class "3"

• Class "1" has the highest survival rate


# Combined effect of sex and passenger class on the survival rate

``````# Count passengers grouped by class and sex
print_bar_count(df.groupby(['Pclass','Sex'])['Survived'].count().unstack(),'Pclass_Sex_count')
``````
output_69_0.png
``````# Count passengers grouped by sex and class
print_bar_count(df.groupby(['Sex','Pclass'])['Survived'].count(),'Sex_Pclass_count')
#df.groupby(['Sex','Pclass'])['Survived'].count().plot(kind='bar')
``````

``````print_bar(group_passenger_survived_rate(['Pclass','Sex']).unstack(),'Sex_Pclass_Survived')
plt.ylabel('rate_probability')
plt.axhline(y=0.415,color='r',linestyle='--')
plt.axhline(y=0.351,color='g',linestyle='--')
``````
output_71_1.png

• Conclusion: sex has a larger effect on the survival rate than passenger class
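One rough way to back up a comparison like the one above is to look at how much the survival rate changes along each axis of the class-by-sex table. The sketch below uses made-up rows, so its numbers are placeholders rather than results from the Titanic data, and the averaged-gap measure is only an informal heuristic.

```python
import pandas as pd

# Made-up passengers: two per class/sex cell
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Sex':      ['female', 'female', 'male', 'male'] * 2,
    'Survived': [1, 1, 0, 1, 1, 1, 0, 0],
})

# rows: class, columns: sex, cells: survival rate
rates = toy.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack()

# average gap between the sexes within a class, and between classes within a sex
gap_by_sex = (rates['female'] - rates['male']).mean()
gap_by_class = (rates.loc[1] - rates.loc[3]).mean()
```

On the real data the same two lines can be run on `df` in place of `toy`, with `rates.loc[1] - rates.loc[3]` comparing first and third class.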

``````# Count passengers grouped by age and sex
print_bar_count(df.groupby(['Age_group','Sex'])['Survived'].count().unstack(),'Sex_Age_count')
plt.axhline(y=15,color='r',linestyle='--')
``````
output_75_1.png

``````Sex_Age_rate=df.groupby(['Age_group','Sex'])['Survived'].mean()
print_bar(Sex_Age_rate.unstack(),'Sex_Age_rate')
plt.axhline(y=0.415,color='r',linestyle='--')
plt.axhline(y=0.351,color='g',linestyle='--')
# counts are annotated with print_bar_count rather than as percentages
Sex_Age_count=df.groupby(['Age_group','Sex'])['Survived'].count()
print_bar_count(Sex_Age_count.unstack(),'Sex_Age_count')
plt.axhline(y=15,color='r',linestyle='--')
``````
output_77_1.png
output_77_2.png

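• Conclusions:
1. Sex has a larger effect than age (this may be related to filling the missing ages with the mean age)
2. Groups with a survival rate above the average: women aged 0-40 and men aged 0-10
3. Groups with a survival rate below the average: men aged 10-60
##### Combined effect of age and passenger class on the survival rate
``````# Count passengers in each age-by-class group
print_bar_count(df.groupby(['Age_group','Pclass'])['Survived'].count().unstack(),'Age_Pclass_count')
# draw a red dashed line at y=15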
plt.axhline(y=15,color='r',linestyle='--')
``````
output_81_1.png

``````ci
``````
``````(0.3518676742091846, 0.41573232579081537)
``````
``````# Survival rate of each age-by-class group
print_bar(df.groupby(['Age_group','Pclass'])['Survived'].mean().unstack(),'Age_Pclass_mean')
# draw the bounds of the 95% interval (red: upper, green: lower)
plt.axhline(y=0.415,color='r',linestyle='--')
plt.axhline(y=0.351,color='g',linestyle='--')
``````
output_84_1.png

``````# Keep only the groups with more than 15 passengers
age_Pclass=df.groupby(['Age_group','Pclass'])[['Survived']].count()
Age_Pclass=age_Pclass.loc[age_Pclass['Survived']>15]
# passenger count of each remaining group
print_bar_count(Age_Pclass.unstack(),'Age_Pclass')
``````
output_86_0.png
``````ci  # groups too small to be statistically meaningful: (1,[0,10]),(1,[60,80]),(2,[50,70]),(3,[50,80])
``````
``````(0.3518676742091846, 0.41573232579081537)
``````
``````# Age-by-class groups with a survival rate above 0.415
# at a significance level of 0.05, rates above 0.415 are treated as significant
Pclass_Age_group_rate=df.groupby(['Age_group','Pclass'])[['Survived']].mean()
bar_Pclass_Age_group=Pclass_Age_group_rate.loc[Pclass_Age_group_rate['Survived']>0.415]
print_bar(bar_Pclass_Age_group.unstack(),'rate_Age_group_Pclass')
# draw the bounds of the 95% interval (red: upper, green: lower)
plt.axhline(y=0.415,color='r',linestyle='--')
plt.axhline(y=0.351,color='g',linestyle='--')
``````
output_88_1.png

``````# Age-by-class groups with a survival rate below 0.351
Pclass_Age_group_rate=df.groupby(['Age_group','Pclass'])[['Survived']].mean()
bar_Pclass_Age_group=Pclass_Age_group_rate.loc[Pclass_Age_group_rate['Survived']<0.351]
print_bar(bar_Pclass_Age_group.unstack(),'rate_Age_group_Pclass')
``````
output_90_0.png
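The last two plots filter on the survival rate only, so groups with very few passengers can still appear in them. One possible way to keep only groups that are both large enough and outside the interval is to aggregate the mean and the count together and filter on both. A sketch on made-up rows; `min_count` plays the role of the notebook's threshold of 15 but is lowered here to fit the toy data.

```python
import pandas as pd

# Made-up passengers: class 2 has a single row and should be dropped
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 1, 0, 0, 0, 1],
})
ci = (0.3518676742091846, 0.41573232579081537)  # interval from the text
min_count = 3                                   # toy threshold (15 in the notebook)

# one row per group with its survival rate and its size
summary = toy.groupby('Pclass')['Survived'].agg(['mean', 'count'])
big_enough = summary['count'] >= min_count
outside_ci = (summary['mean'] < ci[0]) | (summary['mean'] > ci[1])
significant = summary[big_enough & outside_ci]
```

On the real data the grouping key would be `['Age_group','Pclass']`, and `significant['mean'].unstack()` could then be passed to `print_bar`.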

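``````# How can the last two plots be limited to the statistically meaningful groups? Also, the rule that a group
# needs 15 passengers to be meaningful is a number I picked off the top of my head;
# given the 38.38% rate, how should the minimum statistically meaningful group size be calculated?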
``````
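One common rule of thumb (not something derived in this notebook) is that the normal approximation to a proportion is reasonable when both n*p and n*(1-p) are at least 5; some textbooks ask for 10. Under that assumption, the minimum group size for p = 0.3838 can be sketched as follows; the function name is made up for illustration.

```python
import math

def min_group_size(p, min_expected=5):
    """Smallest n with n*p >= min_expected and n*(1-p) >= min_expected."""
    return int(math.ceil(min_expected / min(p, 1 - p)))

p = 0.3838
n_rule_5 = min_group_size(p)        # 14
n_rule_10 = min_group_size(p, 10)   # 27
```

With the threshold of 5 this gives 14, close to the value of 15 picked by eye; the stricter threshold of 10 would call for 27 passengers per group.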
``````#
# vvvv=df.groupby('Sex')['Sex'].count()
# plt.pie(vvvv,labels=['sf','df'],autopct='%.2f%%')  # the sex pie chart renders here
# vvvv=df.groupby('Sex')['Sex'].count()
# plt.pie(vvvv,labels=['sf','df'],autopct='%.2f%%')  # why does the class pie chart not render here?
``````

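### Conclusion

##### Limitations of the analysis
• The analysis does not statistically assess how likely these results are to arise by chance, so it is unclear whether they reflect real differences or noise
• The Age field has missing values. Because age is continuous, they were filled with the mean age of all passengers, which shrinks the differences between ages and may also affect the results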
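The first limitation could be addressed with a formal test. As a sketch, a chi-square test of independence on the sex-by-survival table can be computed by hand. The counts are reconstructed from the figures above (314 women at a rate of 0.742038 is 233 survivors, 577 men at 0.188908 is 109 survivors). The helper name is made up; `scipy.stats.chi2_contingency` would be the usual library route, and it applies a continuity correction to 2x2 tables by default, so its statistic comes out slightly smaller.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return float(num) / den

# rows: female, male; columns: survived, did not survive
female_survived, female_died = 233, 314 - 233
male_survived, male_died = 109, 577 - 109

chi2 = chi_square_2x2(female_survived, female_died, male_survived, male_died)
# For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))
```

The statistic is about 263 with a p-value far below 0.05, which supports the earlier conclusion that sex and survival are not independent.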
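##### Other factors that may affect the survival rate
• Other factors may also affect the survival rate, such as a passenger's occupation, physical fitness, or will to survive, but the data does not include them

#### Correlation of the results

• The data does not come from a controlled experiment, so no causal relationship between the variables can be claimed; it can only be said that they are correlated

Reference: http://www.cnblogs.com/msdynax/p/6099814.html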