# Kaggle泰坦尼克生存预测之随机森林学习

• 预览数据
• 数据初分析
• 弥补缺失值
• 建立模型并预测
• 随机森林算法的理解

#### 预览数据

``````library(readr) # File read / write
library(ggplot2) # Data visualization
library(ggthemes) # Data visualization
library(scales) # Data visualization
library(plyr)# Data manipulation
library(stringr) # String manipulation
library(InformationValue) # IV / WOE calculation
library(MLmetrics) # Mache learning metrics.e.g. Recall, Precision,Accuracy, AUC
library(rpart) # Decision tree utils
library(randomForest) # Random Forest
library(dplyr) # Data manipulation
library(e1071) # SVM
library(Amelia) # Missing value utils
library(party) # Conditional inference trees
library(class) # KNN.
``````

``````#通过getwd()函数查看当下工作路劲，把文件放到当下工作路径中
data<- bind_rows(train, test)  #组合
train.row<- 1:nrow(train)
test.row<- (1 + nrow(train)):(nrow(train) + nrow(test))
``````

1. 这里是把训练数据和测试数据合并了，原因是之后的操作过程中要对数据的格式做转换，还会对分类变量增加新的level(比如：分类变量性别，我们的操作就是，原有的基础之上，增加一个字段 为男的列和一个字段为女的列)为了避免重复操作和出错，所以就干脆一起操作了。
2. bind_rows()用法：
When row-binding, columns are matched by name, and any missing columns with be filled with NA,简单的来说就是按照列来组合两份数据，少列的自动填充为NA值(大家查资料的时候，能用谷歌就用谷歌吧，并在英文状态下搜索)

``````> str(data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   1309 obs. of  12 variables:
\$ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
\$ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
\$ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
\$ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
\$ Sex        : chr  "male" "female" "female" "female" ...
\$ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
\$ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
\$ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
\$ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
\$ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
\$ Cabin      : chr  NA "C85" NA "C123" ...
\$ Embarked   : chr  "S" "C" "S" "S" ...
>

> str(test)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   418 obs. of  11 variables:
\$ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...

> str(train)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   891 obs. of  12 variables:
\$ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
``````

• PassengerId 整型变量，标识乘客的ID，递增变量，对预测无帮助
• Survived 整型变量，标识该乘客是否幸存。0表示遇难，1表示幸存。将其转换为factor变量比较方便处理
• Pclass 整型变量，标识乘客的社会-经济状态，1代表Upper，2代表Middle，3代表Lower
• Name 字符型变量，除包含姓和名以外，还包含Mr.
Mrs. Dr.这样的具有西方文化特点的信息
• Sex 字符型变量，标识乘客性别，适合转换为factor类型变量
• Age 整型变量，标识乘客年龄，有缺失值
• SibSp 整型变量，代表兄弟姐妹及配偶的个数。其中Sib代表Sibling也即兄弟姐妹，Sp代表Spouse也即配偶
• Parch 整型变量，代表父母或子女的个数。其中Par代表Parent也即父母，Ch代表Child也即子女
• Ticket 字符型变量，代表乘客的船票号 Fare 数值型，代表乘客的船票价
• Cabin 字符型，代表乘客所在的舱位，有缺失值
• Embarked 字符型，代表乘客登船口岸，适合转换为factor型变量

#### 数据初分析

``````> table(train\$Survived)
0   1
549 342
> prop.table((table(train\$Survived))) #prop.table()是计算表中值的百分比，这里就是我们的得到的列联表
0         1
0.6161616 0.3838384
``````

• 乘客社会等级与幸存率的影响

``````data\$Survived <- factor(data\$Survived) #变成因子型变量
ggplot(data = data[1:nrow(train),], mapping = aes(x = Pclass, y = ..count.., fill=Survived)) +
geom_bar(stat = "count", position='dodge') +
xlab('Pclass') +
ylab('Count') +
ggtitle('How Pclass impact survivor') +
scale_fill_manual(values=c("#FF0000", "#00FF00")) +
geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), , vjust=-0.5) +
theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")
``````

``````> WOETable(X=factor(data\$Pclass[1:nrow(train)]), Y=data\$Survived[1:nrow(train)])
CAT GOODS BADS TOTAL     PCT_G     PCT_B        WOE         IV
1   1   136   80   216 0.3976608 0.1457195  1.0039160 0.25292792
2   2    87   97   184 0.2543860 0.1766849  0.3644848 0.02832087
3   3   119  372   491 0.3479532 0.6775956 -0.6664827 0.21970095
> IV(X=factor(data\$Pclass[1:nrow(train)]), Y=data\$Survived[1:nrow(train)])
[1] 0.5009497
attr(,"howgood")
[1] "Highly Predictive"
``````

• 性别对于幸存率的影响

``````data\$Sex<- as.factor(data\$Sex)
ggplot(data
= data[1:nrow(train),], mapping = aes(x = Sex, y = ..count.., fill=Survived)) +
geom_bar(stat = 'count', position='dodge') +
xlab('Sex') +
ylab('Count') +
ggtitle('How Sex impact survivo') +
geom_text(stat = "count", aes(label
= ..count..), position=position_dodge(width=1), , vjust=-0.5) +
theme(plot.title = element_text(hjust = 0.5),
legend.position="bottom")
``````

``````> WOETable(X=data\$Sex[1:nrow(train)], Y=data\$Survived[1:nrow(train)])
CAT GOODS BADS TOTAL     PCT_G    PCT_B        WOE        IV
1 female   233   81   314 0.6812865 0.147541  1.5298770 0.8165651
2   male   109  468   577 0.3187135 0.852459 -0.9838327 0.5251163
> IV(X=data\$Sex[1:nrow(train)], Y=data\$Survived[1:nrow(train)])
[1] 1.341681
attr(,"howgood")
[1] "Highly Predictive"
``````

Sex的IV为1.34且”Highly Predictive”，也可暂将Sex作为特征变量。

• 年龄对于幸存率的影响

``````> ggplot(data = data[(!is.na(data\$Age)) & row(data[, 'Age']) <= 891, ], aes(x = Age, color=Survived)) +
+     geom_line(aes(label=..count..), stat = 'bin', binwidth=5)  +
+     labs(title = "How Age impact survivor", x = "Age", y = "Count", fill = "Survived")
Warning: Ignoring unknown aesthetics: label
``````

• 不同Title的乘客幸存率不同

``````data\$Title <- sapply(data\$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
data\$Title <- sub(' ', '', data\$Title)
data\$Title[data\$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
data\$Title[data\$Title %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'
data\$Title <- factor(data\$Title)
``````

``````ggplot(data = data[1:nrow(train),], mapping = aes(x = Title, y = ..count.., fill=Survived)) +
geom_bar(stat = "count", position='stack') +
xlab('Title') +
ylab('Count') +
ggtitle('How Title impact survivor') +
scale_fill_discrete(name="Survived", breaks=c(0, 1), labels=c("Perish", "Survived")) +
geom_text(stat = "count", aes(label = ..count..), position=position_stack(vjust = 0.5)) +
theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")
``````

``````> WOETable(X=data\$Title[1:nrow(train)], Y=data\$Survived[1:nrow(train)])
CAT GOODS BADS TOTAL       PCT_G       PCT_B         WOE           IV
1     Col     1    1     2 0.002873563 0.001808318  0.46315552 4.933741e-04
2      Dr     3    4     7 0.008620690 0.007233273  0.17547345 2.434548e-04
3    Lady     2    1     3 0.005747126 0.001808318  1.15630270 4.554455e-03
4  Master    23   17    40 0.066091954 0.030741410  0.76543639 2.705859e-02
5    Miss   127   55   182 0.364942529 0.099457505  1.30000942 3.451330e-01
6    Mlle     3    3     3 0.008620690 0.005424955  0.46315552 1.480122e-03
7      Mr    81  436   517 0.232758621 0.788426763 -1.22003757 6.779360e-01
8     Mrs    99   26   125 0.284482759 0.047016275  1.80017883 4.274821e-01
9      Ms     1    1     1 0.002873563 0.001808318  0.46315552 4.933741e-04
10    Rev     6    6     6 0.017241379 0.010849910  0.46315552 2.960244e-03
11    Sir     2    3     5 0.005747126 0.005424955  0.05769041 1.858622e-05
> IV(X=data\$Title[1:nrow(train)], Y=data\$Survived[1:nrow(train)])
[1] 1.487853
attr(,"howgood")
[1] "Highly Predictive"
``````

• 乘客票号对生存率的影响

``````ticket.count <- aggregate(data\$Ticket, by =list(data\$Ticket), function(x) sum(!is.na(x)))
``````

``````data\$TicketCount <-
apply(data, 1, function(x) ticket.count[which(ticket.count[, 1] == x['Ticket']), 2])
data\$TicketCount <-
factor(sapply(data\$TicketCount, function(x) ifelse(x > 1, 'Share', 'Unique')))
ggplot(data = data[1:nrow(train),], mapping =
aes(x = TicketCount, y = ..count.., fill=Survived)) +
geom_bar(stat = 'count', position='dodge') +
xlab('TicketCount') +
ylab('Count') +
ggtitle('How TicketCount impact survivor') +
geom_text(stat = "count",
aes(label = ..count..), position=position_dodge(width=1), , vjust=-0.5) +
theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")
``````

``````> WOETable(X=data\$TicketCount[1:nrow(train)], Y=data\$Survived[1:nrow(train)])
CAT GOODS BADS TOTAL    PCT_G     PCT_B        WOE        IV
1  Share   212  198   410 0.619883 0.3606557  0.5416069 0.1403993
2 Unique   130  351   481 0.380117 0.6393443 -0.5199641 0.1347889
> IV(X=data\$TicketCount[1:nrow(train)], Y=data\$Survived[1:nrow(train)])
[1] 0.2751882
attr(,"howgood")
[1] "Highly Predictive"
``````

• 支出船票费对幸存率的影响

``````> ggplot(data = data[(!is.na(data\$Fare)) & row(data[, 'Fare']) <= 891, ], aes(x = Fare, color=Survived)) +
+     geom_line(aes(label=..count..), stat = 'bin', binwidth=10)  +
+     labs(title = "How Fare impact survivor", x = "Fare", y = "Count", fill = "Survived")
Warning: Ignoring unknown aesthetics: label
``````

• 配偶及兄弟姐妹数的个数对幸存率的影响
``````ggplot(data = data[1:nrow(train),], mapping = aes(x = SibSp, y = ..count.., fill=Survived)) +
geom_bar(stat = 'count', position='dodge') +
labs(title = "How SibSp impact survivor", x = "Sibsp", y = "Count", fill = "Survived") +
geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), , vjust=-0.5) +
theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")
``````

``````> WOETable(X=as.factor(data\$SibSp[1:nrow(train)]), Y=data\$Survived[1:nrow(train)])
CAT GOODS BADS TOTAL       PCT_G       PCT_B        WOE          IV
1   0   210  398   608 0.593220339 0.724954463 -0.2005429 0.026418349
2   1   112   97   209 0.316384181 0.176684882  0.5825894 0.081387334
3   2    13   15    28 0.036723164 0.027322404  0.2957007 0.002779811
4   3     4   12    16 0.011299435 0.021857923 -0.6598108 0.006966604
5   4     3   15    18 0.008474576 0.027322404 -1.1706364 0.022063953
6   5     5    5     5 0.014124294 0.009107468  0.4388015 0.002201391
7   8     7    7     7 0.019774011 0.012750455  0.4388015 0.003081947
> IV(X=as.factor(data\$SibSp[1:nrow(train)]),Y=data\$Survived[1:nrow(train)])
[1] 0.1448994
attr(,"howgood")
[1] "Highly Predictive"
``````

IV为0.1448994，且”Highly Predictive”。可以让SibSp这一个变量当作特征变量。

• 父母与子女数对生存率的影响
``````ggplot(data = data[1:nrow(train),], mapping = aes(x = Parch, y = ..count.., fill=Survived)) +
geom_bar(stat = 'count', position='dodge') +
labs(title = "How Parch impact survivor", x = "Parch", y = "Count", fill = "Survived") +
geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), , vjust=-0.5) +
theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")
ggplot(data = data[1:nrow(train),], mapping = aes(x = Parch, y = ..count.., fill=Survived)) +
geom_bar(stat = 'count', position='dodge') +
labs(title = "How Parch impact survivor", x = "Parch", y = "Count", fill = "Survived") +
geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), , vjust=-0.5) +
theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")
``````

``````> WOETable(X=as.factor(data\$Parch[1:nrow(train)]), Y=data\$Survived[1:nrow(train)])
CAT GOODS BADS TOTAL       PCT_G       PCT_B        WOE          IV
1   0   233  445   678 0.671469741 0.810564663 -0.1882622 0.026186312
2   1    65   53   118 0.187319885 0.096539162  0.6628690 0.060175728
3   2    40   40    80 0.115273775 0.072859745  0.4587737 0.019458440
4   3     3    2     5 0.008645533 0.003642987  0.8642388 0.004323394
5   4     4    4     4 0.011527378 0.007285974  0.4587737 0.001945844
6   5     1    4     5 0.002881844 0.007285974 -0.9275207 0.004084922
7   6     1    1     1 0.002881844 0.001821494  0.4587737 0.000486461
> IV(X=as.factor(data\$Parch[1:nrow(train)]),Y=data\$Survived[1:nrow(train)])
[1] 0.1166611
attr(,"howgood")
[1] "Highly Predictive"
``````

IV为0.1166611，且”Highly Predictive”，也可把Parch作为一个特征变量

• FamilySize(家庭人员规模)对幸存率的影响

SibSp与Parch都说明，当乘客无亲人时，幸存率较低，乘客有少数亲人时，幸存率高于50%，而当亲人数过高时，幸存率反而降低。在这里，可以考虑将SibSp与Parch相加，生成新的变量，FamilySize。

``````data\$FamilySize <- data\$SibSp + data\$Parch + 1
ggplot(data = data[1:nrow(train),], mapping = aes(x = FamilySize, y = ..count.., fill=Survived)) +
geom_bar(stat = 'count', position='dodge') +
xlab('FamilySize') +
ylab('Count') +
ggtitle('How FamilySize impact survivor') +
geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), , vjust=-0.5) +
theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")
``````

``````> WOETable(X=as.factor(data\$FamilySize[1:nrow(train)]), Y=data\$Survived[1:nrow(train)])
CAT GOODS BADS TOTAL       PCT_G      PCT_B        WOE           IV
1   1   163  374   537 0.459154930 0.68123862 -0.3945249 0.0876175539
2   2    89   72   161 0.250704225 0.13114754  0.6479509 0.0774668616
3   3    59   43   102 0.166197183 0.07832423  0.7523180 0.0661084057
4   4    21    8    29 0.059154930 0.01457195  1.4010615 0.0624634998
5   5     3   12    15 0.008450704 0.02185792 -0.9503137 0.0127410643
6   6     3   19    22 0.008450704 0.03460838 -1.4098460 0.0368782940
7   7     4    8    12 0.011267606 0.01457195 -0.2571665 0.0008497665
8   8     6    6     6 0.016901408 0.01092896  0.4359807 0.0026038712
9  11     7    7     7 0.019718310 0.01275046  0.4359807 0.0030378497
Warning message:
In grid.draw.grob(x\$children[[i]], recording = FALSE) :
reached elapsed time limit
> IV(X=as.factor(data\$FamilySize[1:nrow(train)]), Y=data\$Survived[1:nrow(train)])
[1] 0.3497672
attr(,"howgood")
[1] "Highly Predictive"
``````

IV为0.3497672，且“Highly Predictive”，因此，可将这个派生变量FamilySize作为特征变量。

• 不同仓位对幸存率的影响

``````ggplot(data[1:nrow(train), ], mapping = aes(x = as.factor(sapply(data\$Cabin[1:nrow(train)], function(x) str_sub(x, start = 1, end = 1))), y = ..count.., fill = Survived)) +
geom_bar(stat = 'count', position='dodge') +
xlab('Cabin') +
ylab('Count') +
ggtitle('How Cabin impact survivor') +
geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), , vjust=-0.5) +
theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")
``````

``````> data\$Cabin <- sapply(data\$Cabin, function(x) str_sub(x, start = 1, end = 1))
> WOETable(X=as.factor(data\$Cabin[1:nrow(train)]), Y=data\$Survived[1:nrow(train)])
CAT GOODS BADS TOTAL       PCT_G       PCT_B        WOE          IV
1   A     7    8    15 0.020408163 0.014571949  0.3368366 0.001965851
2   B    35   12    47 0.102040816 0.021857923  1.5408094 0.123546555
3   C    35   24    59 0.102040816 0.043715847  0.8476622 0.049439873
4   D    25    8    33 0.072886297 0.014571949  1.6098023 0.093874571
5   E    24    8    32 0.069970845 0.014571949  1.5689803 0.086919776
6   F     8    5    13 0.023323615 0.009107468  0.9403716 0.013368461
7   G     2    2     4 0.005830904 0.003642987  0.4703680 0.001029126
8   T     1    1     1 0.002915452 0.001821494  0.4703680 0.000514563
9   X   206  481   687 0.600583090 0.876138434 -0.3776231 0.104056065
> IV(X=as.factor(data\$Cabin[1:nrow(train)]), Y=data\$Survived[1:nrow(train)])
[1] 0.4747148
attr(,"howgood")
[1] "Highly Predictive"
``````

Cabin的IV为0.1866526，且“Highly Predictive”，可以作为特征变量

• Embarked对乘客幸存率的影响

Embarked变量代表登船码头，现通过统计不同码头登船的乘客幸存率来判断Embarked是否可用于预测乘客幸存情况。

``````ggplot(data[1:nrow(train), ], mapping = aes(x = Embarked, y = ..count.., fill = Survived)) +
geom_bar(stat = 'count', position='dodge') +
xlab('Embarked') +
ylab('Count') +
ggtitle('How Embarked impact survivor') +
geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), , vjust=-0.5) +
theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")
``````

``````> WOETable(X=as.factor(data\$Embarked[1:nrow(train)]), Y=data\$Survived[1:nrow(train)])
CAT GOODS BADS TOTAL     PCT_G     PCT_B         WOE           IV
1   C    93   75   168 0.2719298 0.1366120  0.68839908 9.315265e-02
2   Q    30   47    77 0.0877193 0.0856102  0.02433748 5.133014e-05
3   S   219  427   646 0.6403509 0.7777778 -0.19442458 2.671917e-02
> IV(X=as.factor(data\$Embarked[1:nrow(train)]), Y=data\$Survived[1:nrow(train)])
[1] 0.1199231
attr(,"howgood")
[1] "Highly Predictive"
``````

#### 弥补缺失值

``````> attach(data)
> missing <- list(Pclass=nrow(data[is.na(Pclass), ]))
> missing\$Name <- nrow(data[is.na(Name), ])
> missing\$Sex <- nrow(data[is.na(Sex), ])
> missing\$Age <- nrow(data[is.na(Age), ])
> missing\$SibSp <- nrow(data[is.na(SibSp), ])
> missing\$Parch <- nrow(data[is.na(Parch), ])
> missing\$Ticket <- nrow(data[is.na(Ticket), ])
> missing\$Fare <- nrow(data[is.na(Fare), ])
> missing\$Cabin <- nrow(data[is.na(Cabin), ])
> missing\$Embarked <- nrow(data[is.na(Embarked), ])
> for (name in names(missing)) {
+     if (missing[[name]][1] > 0) {
+         print(paste('', name, ' miss ', missing[[name]][1], ' values', sep = ''))
+     }
+ }
[1] "Age miss 263 values"
[1] "Fare miss 1 values"
[1] "Cabin miss 1014 values"
[1] "Embarked miss 2 values"
> detach(data)
``````
• 预测乘客年龄

``````age.model <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize, data=data[!is.na(data\$Age), ], method='anova')
data\$Age[is.na(data\$Age)] <- predict(age.model, data[is.na(data\$Age), ])
``````
• 中位数填补一个缺失的Fare值
``````  data\$Fare[is.na(data\$Fare)] <- median(data\$Fare, na.rm=TRUE)
``````
• 将缺失的Cabin设置为默认值

``````  data\$Cabin<- as.factor(sapply(data\$Cabin, function(x)ifelse(is.na(x), 'X', str_sub(x, start = 1, end = 1))))
``````
• 填补缺失的Embarked值
这里Embarked值就只少了两个值，直接用众数来代替了
``````data\$Embarked[c(62,830)] = "S"
data\$Embarked <- factor(data\$Embarked)
``````

1. 对于不起重要预测作用的变量选择直接删除列
2. 不需要十分精确的估算的直接替换
3. 最近邻插补——kNN
4. 分类回归大树——rpart
rpart()建立分类回归树模型，对于分类变量，添加参数：method = “class”
对于数值变量，添加参数：method = “anova”

#### 建立模型并预测

• 训练模型
``````set.seed(415)
model <-
cforest(Survived ~ Pclass + Title + Sex + Age + SibSp + Parch + FamilySize +
TicketCount + Fare + Cabin + Embarked, data = data[train.row, ],
controls=cforest_unbiased(ntree=2000, mtry=3))
``````
• 交叉验证

``````cv.summarize <- function(data.true, data.predict) {
print(paste('Recall:', Recall(data.true, data.predict)))
print(paste('Precision:', Precision(data.true, data.predict)))
print(paste('Accuracy:', Accuracy(data.predict, data.true)))
print(paste('AUC:', AUC(data.predict, data.true)))
}
set.seed(415)
cv.test.sample <- sample(1:nrow(train), as.integer(0.3 * nrow(train)), replace = TRUE)
cv.test <- data[cv.test.sample,]
cv.prediction <- predict(model, cv.test, OOB=TRUE, type = "response")
cv.summarize(cv.test\$Survived, cv.prediction)

[1] "Recall: 0.953757225433526"
[1] "Precision: 0.841836734693878"
[1] "Accuracy: 0.853932584269663"
[1] "AUC: 0.811984995695486"
``````
• 预测
``````predict.result <- predict(model, data[(1+nrow(train)):(nrow(data)), ], OOB=TRUE, type = "response")
output <- data.frame(PassengerId = test\$PassengerId, Survived = predict.result)
write.csv(output, file = "cit1.csv", row.names = FALSE)
``````
• 上传到kaggle成绩

• 调优增加派生特征

``````data\$Surname <- sapply(data\$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][1]})
data\$FamilyID <- paste(as.character(data\$FamilySize), data\$Surname, sep="")
data\$FamilyID[data\$FamilySize <= 2] <- 'Small'
# Delete erroneous family IDs
famIDs <- data.frame(table(data\$FamilyID))
famIDs <- famIDs[famIDs\$Freq <= 2,]
data\$FamilyID[data\$FamilyID %in% famIDs\$Var1] <- 'Small'
# Convert to a factor
data\$FamilyID <- factor(data\$FamilyID)
``````

1. 部分西方国家中人名的重复度较高，而姓氏重复度较低，姓氏具有一定辨识度
2. 部分国家的姓氏具有一定的身份识别作用
3. 姓氏相同的乘客，可能是一家人（这一点也基于西方国家姓氏重复度较低这一特点），而一家人同时幸存或遇难的可能性较高
4. 考虑到只出现一次的姓氏不可能同时出现在训练集和测试集中，不具辨识度和预测作用，因此将只出现一次的姓氏均命名为’Small’。
``````set.seed(415)
model <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + Fare + Embarked + Title + FamilySize + FamilyID + TicketCount, data = data[train.row, ], controls=cforest_unbiased(ntree=2000, mtry=3))
predict.result <- predict(model, data[test.row, ], OOB=TRUE, type = "response")
submit <- data.frame(PassengerId = test\$PassengerId, Survived = predict.result)
write.csv(submit, file = "cit4.csv", row.names = FALSE)
``````
• 优化后上传到kaggle成绩

#### 关于随机森林的理解：

• 谈随机森林之前，先来谈一下决策树。

• 现在要开始说随机森林了

1）数据随机选取：从原始数据集中选取数据组成不同的子数据集，利用这些数据集构建子决策树，观察子决策树的分类结果，随机森林的分类结果属于子决策树的分类结果指向多的那个。如图，假设随机森林中有3棵子决策树，2棵子树的分类结果是A类，1棵子树的分类结果是B类，那么随机森林的分类结果就是A类。

2）待选特征的随机选取：随机森林中的子树的每一个分裂过程并未用到所有的待选特征，而是从所有的待选特征中随机选取一定的特征，之后再在随机选取的特征中选取最优的特征。