# Hands-On Machine Learning, Part 1: An End-to-End Machine Learning Project

### 1. Get the Data

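• Load the data with Pandas, which returns a Pandas `DataFrame` object containing all the data (`housing_path` is the directory that holds `housing.csv`).
```python
import os

import pandas as pd

csv_path = os.path.join(housing_path, "housing.csv")
housing = pd.read_csv(csv_path)
```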
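• Use the DataFrame's `head()` method to view the first five rows of the dataset: `housing.head()`
• Use the `describe()` method to show a summary of the numerical attributes: `housing.describe()`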
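To make the two inspection calls concrete, here is a minimal sketch on a tiny made-up DataFrame. The column names mirror the housing data, but the values are invented for illustration only.

```python
import pandas as pd

# Toy stand-in for the housing DataFrame (values are made up).
toy = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 4.0, 3.7],
    "total_bedrooms": [129.0, 1106.0, None, 235.0, 280.0, 213.0],
})

first_rows = toy.head()     # first five rows by default
summary = toy.describe()    # count, mean, std, min, quartiles, max per numeric column

print(first_rows)
print(summary)
```

Note that the `count` row of `describe()` ignores missing values, so it is a quick way to spot columns with gaps.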
• Create a test set (stratified sampling based on income category):
```python
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# "income_cat" is a categorical column derived from median_income; it must already exist.
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
```
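The split above relies on an `income_cat` column that is not created in these notes. The following is a minimal sketch of one way to build it, assuming `pd.cut` with illustrative bin edges and a synthetic `median_income` column in place of the real data. It then checks that the test set keeps roughly the same category proportions as the full data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(42)
# Synthetic stand-in for the housing data (assumption: 1,000 made-up incomes).
demo = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=1000)})

# Bucket income into five categories; the bin edges are illustrative.
demo["income_cat"] = pd.cut(
    demo["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(demo, demo["income_cat"]):
    demo_train = demo.loc[train_index]
    demo_test = demo.loc[test_index]

# Category proportions in the test set should track the full data closely.
overall = demo["income_cat"].value_counts(normalize=True).sort_index()
in_test = demo_test["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "test": in_test}))
```

Once the split is done, the helper column can be dropped from both sets so that it does not leak into training.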

### 2. Discover and Visualize the Data to Gain Insights

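• Visualizing geographical data:
```python
import matplotlib.pyplot as plt

# Marker size reflects district population; color reflects the median house value.
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, label="population",
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
)
plt.legend()
```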

• Looking for correlations
Use the `corr()` method to compute the standard correlation coefficient (also called Pearson's r) between every pair of attributes:
```python
# On pandas 2.0+, call housing.corr(numeric_only=True), because ocean_proximity is a text column.
>>> corr_matrix = housing.corr()
>>> corr_matrix["median_house_value"].sort_values(ascending=False)  # correlation of each attribute with the median house value
median_house_value    1.000000
median_income         0.687170
total_rooms           0.135231
housing_median_age    0.114220
households            0.064702
total_bedrooms        0.047865
population           -0.026699
longitude            -0.047279
latitude             -0.142826
Name: median_house_value, dtype: float64
```
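As a rough illustration of what `corr()` computes, the sketch below evaluates Pearson's r by hand on synthetic data and compares it with the pandas result. The variables `x` and `y` are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)   # strongly, linearly related to x

# Pearson's r = cov(x, y) / (std(x) * std(y))
r_manual = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())
r_pandas = pd.DataFrame({"x": x, "y": y}).corr().loc["x", "y"]

print(r_manual, r_pandas)
```

The coefficient only captures linear relationships, so a value near zero does not rule out a nonlinear dependency.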
• Experimenting with attribute combinations
```python
>>> housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
>>> housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
>>> housing["population_per_household"]=housing["population"]/housing["households"]
>>> corr_matrix = housing.corr()
>>> corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value          1.000000
median_income               0.687170
rooms_per_household         0.199343
total_rooms                 0.135231
housing_median_age          0.114220
households                  0.064702
total_bedrooms              0.047865
population_per_household   -0.021984
population                 -0.026699
longitude                  -0.047279
latitude                   -0.142826
bedrooms_per_room          -0.260070
Name: median_house_value, dtype: float64
# The new bedrooms_per_room attribute is more strongly correlated with the median house value
# than the total number of rooms or bedrooms.
```

### 3. Data Preprocessing

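• Handling missing values
```python
from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.Imputer, removed in scikit-learn 0.22

imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)  # copy of the data without the text attribute ocean_proximity
imputer.fit(housing_num)
X = imputer.transform(housing_num)
```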
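A small sketch of what the imputer learns and returns, using a made-up two-column frame. The `toy_` names are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric data with one gap per column (values are made up).
toy_num = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 500.0],
    "population": [800.0, 1200.0, np.nan, 1000.0],
})

toy_imputer = SimpleImputer(strategy="median")
filled = toy_imputer.fit_transform(toy_num)   # NumPy array with the NaNs replaced

print(toy_imputer.statistics_)   # per-column medians learned during fit

# Wrap the result back into a DataFrame to keep column names and index.
toy_filled = pd.DataFrame(filled, columns=toy_num.columns, index=toy_num.index)
print(toy_filled)
```

Fitting on the training set and reusing the same imputer on the test set keeps the replacement values consistent between the two.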
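• Handling text and categorical attributes (using one-hot encoding)
```python
# CategoricalEncoder was never released; scikit-learn 0.20+ offers the same feature as OneHotEncoder.
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_reshaped = housing_cat.values.reshape(-1, 1)  # the encoder expects a 2D array
housing_cat_1hot = cat_encoder.fit_transform(housing_cat_reshaped)  # SciPy sparse matrix
```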
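To see what one-hot encoding produces, here is a minimal sketch on four made-up category values. The `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column, shaped (n_samples, 1).
toy_cat = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]])

toy_encoder = OneHotEncoder()
toy_1hot = toy_encoder.fit_transform(toy_cat)   # sparse matrix by default

print(toy_encoder.categories_)   # categories found, in sorted order
print(toy_1hot.toarray())        # dense view: one column per category
```

Each row has exactly one 1, in the column of its category. The sparse representation saves memory when there are many categories.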
• Feature scaling
There are two common ways to get all attributes onto the same scale: min-max scaling (normalization) and standardization.
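The two approaches can be sketched with scikit-learn's `MinMaxScaler` and `StandardScaler`. The single column below is made up and includes one outlier to show how differently the two behave.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # one column with an outlier

min_max = MinMaxScaler().fit_transform(values)      # (x - min) / (max - min), range [0, 1]
standard = StandardScaler().fit_transform(values)   # (x - mean) / std, zero mean and unit variance

print(min_max.ravel())
print(standard.ravel())
```

With min-max scaling the outlier squeezes the other values close to 0. Standardization has no fixed output range and is less affected by outliers.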
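• Transformation pipelines
```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# DataFrameSelector is not part of scikit-learn; this minimal definition is assumed here.
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values  # selected columns as a NumPy array

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    # Dense one-hot output; sparse_output needs scikit-learn >= 1.2 (older versions use sparse=False).
    ('cat_encoder', OneHotEncoder(sparse_output=False)),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
```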

```python
housing_prepared = full_pipeline.fit_transform(housing)
```
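Current scikit-learn releases also provide `ColumnTransformer`, which routes columns to transformers without a custom selector class. The following is a sketch of that alternative, not the approach used in these notes. It runs on a tiny made-up frame, and all `toy_` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the housing DataFrame (values are made up).
toy = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR BAY"],
})
toy_num_attribs = ["median_income", "total_bedrooms"]
toy_cat_attribs = ["ocean_proximity"]

toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# ColumnTransformer selects the columns itself, so no selector step is needed.
toy_full_pipeline = ColumnTransformer([
    ("num", toy_num_pipeline, toy_num_attribs),
    ("cat", OneHotEncoder(), toy_cat_attribs),
])

toy_prepared = toy_full_pipeline.fit_transform(toy)
print(toy_prepared.shape)   # 2 scaled numeric columns + 2 one-hot columns
```

The transformer takes the DataFrame directly and concatenates the outputs in the order the steps are listed.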

### 4. Select and Train a Model

• Linear regression model
```python
from sklearn.linear_model import LinearRegression

# housing_labels holds the training targets (the median_house_value column).
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
```
• Decision tree model
```python
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
```
• Random forest model
```python
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
```

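• Evaluating with K-fold cross-validation
```python
import numpy as np
from sklearn.model_selection import cross_val_score

# The scorer follows a "higher is better" convention, so the MSE comes back negated.
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
```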
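The same cross-validation call can be used to compare the three models side by side. The sketch below does this on synthetic data from `make_regression`, which stands in for `housing_prepared` and `housing_labels`. The numbers it prints say nothing about the housing dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption: stands in for the prepared housing data).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(n_estimators=20, random_state=42),
}

rmse_by_model = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo,
                             scoring="neg_mean_squared_error", cv=5)
    rmse_by_model[name] = np.sqrt(-scores)   # one RMSE per fold
    print(f"{name}: mean={rmse_by_model[name].mean():.2f}, "
          f"std={rmse_by_model[name].std():.2f}")
```

Looking at the standard deviation across folds as well as the mean gives a sense of how precise each estimate is.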

### 5. Fine-Tune Your Model

• Grid search
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    # 3 x 4 = 12 combinations
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then 2 x 3 = 6 combinations with bootstrap disabled
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor()

# 18 combinations x 5 folds = 90 training rounds
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')

grid_search.fit(housing_prepared, housing_labels)
```
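After fitting, the search object exposes the best combination, the refitted model, and the score of every combination tried. Here is a minimal sketch of reading those results, using synthetic data and a deliberately tiny grid so that it runs quickly. The `demo_` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption: stands in for the prepared housing data).
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)

demo_grid = [{"n_estimators": [3, 10], "max_features": [2, 4]}]   # deliberately tiny
demo_search = GridSearchCV(RandomForestRegressor(random_state=0), demo_grid,
                           cv=3, scoring="neg_mean_squared_error")
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)             # best combination found
best_model = demo_search.best_estimator_    # refit on all the data (refit=True by default)

# RMSE of every combination tried
results = demo_search.cv_results_
for mean_score, params in zip(results["mean_test_score"], results["params"]):
    print(np.sqrt(-mean_score), params)
```

If the best value sits at the edge of a range in the grid, it may be worth searching again with a wider range.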
• Randomized search
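Randomized search evaluates a fixed number of randomly sampled combinations instead of every point on a grid. The notes give no code for it, so the following is only a sketch using scikit-learn's `RandomizedSearchCV` on synthetic data. The sampling ranges and `n_iter` value are illustrative assumptions.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (assumption: stands in for the prepared housing data).
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)

# Distributions to sample from (the ranges are illustrative assumptions).
param_distribs = {
    "n_estimators": randint(low=5, high=30),   # integers in [5, 30)
    "max_features": randint(low=1, high=8),    # integers in [1, 8)
}

rnd_search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                                param_distributions=param_distribs,
                                n_iter=5, cv=3,
                                scoring="neg_mean_squared_error", random_state=0)
rnd_search.fit(X_demo, y_demo)

print(rnd_search.best_params_)
```

The `n_iter` argument sets the computing budget directly, which is convenient when the hyperparameter space is too large for an exhaustive grid.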