Machine Learning with Python (6): scikit-learn in Practice


Data Loading


```python
import numpy as np
import urllib.request
# URL of the Pima Indians Diabetes dataset
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
raw_data = urllib.request.urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
# separate the data from the target attributes
X = dataset[:, 0:8]
y = dataset[:, 8]
print("size:", dataset.size)
```

X holds the feature vectors and y the target variable.

Data Standardization

```python
from sklearn import preprocessing
# standardize the data attributes
standardized_X = preprocessing.scale(X)
# normalize the data attributes
normalized_X = preprocessing.normalize(X)
```
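The two calls above do different things, which is easy to confuse: `scale` standardizes each *column* (attribute) to zero mean and unit variance, while `normalize` rescales each *row* (sample) to unit norm. A minimal sketch on a toy matrix (standing in for the Pima feature matrix X) makes the difference visible:

```python
import numpy as np
from sklearn import preprocessing

# Toy matrix standing in for the Pima feature matrix X
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# scale(): each column ends up with zero mean and unit variance
standardized = preprocessing.scale(X)
print(standardized.mean(axis=0))  # ~[0, 0]
print(standardized.std(axis=0))   # ~[1, 1]

# normalize(): each row ends up with unit L2 norm
normalized = preprocessing.normalize(X)
print(np.linalg.norm(normalized, axis=1))  # ~[1, 1, 1]
```

For distance-based models such as kNN and SVM, per-column standardization is usually the one you want.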

Feature Selection

```python
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X, y)
# display the relative importance of each attribute
print(model.feature_importances_)
```

output:

```
[ 0.11193263  0.26076795  0.10153987  0.08278266  0.07190955  0.12292174
  0.11527441  0.13287119]
```

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(X, y)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)
```

output

```
[ True False False False False  True  True False]
[1 2 3 5 6 1 1 4]
```
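`support_` is a boolean mask over the columns in dataset order, and `ranking_` assigns rank 1 to every kept attribute. Mapping the mask back to the Pima attribute names makes the selection readable (the short names below follow the common UCI convention):

```python
import numpy as np

# Attribute names of the Pima Indians Diabetes dataset, in column order
names = np.array(["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age"])

# Boolean mask as printed by rfe.support_ above
support = np.array([True, False, False, False, False, True, True, False])

# The three attributes RFE retained
print(names[support])  # ['preg' 'mass' 'pedi']
```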

Building the Models

Logistic Regression

```python
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
```

output

```
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
             precision    recall  f1-score   support

        0.0       0.79      0.90      0.84       500
        1.0       0.74      0.55      0.63       268

avg / total       0.77      0.77      0.77       768

[[448  52]
 [121 147]]
```

`Accuracy` is the proportion of samples the classifier labels correctly out of all samples in the test set: accuracy = (TP + TN) / (TP + TN + FP + FN).

`Precision` is the proportion of correctly retrieved items (TP) among all items actually retrieved (TP + FP): precision = TP / (TP + FP).

`Recall` is the proportion of correctly retrieved items (TP) among all items that should have been retrieved (TP + FN): recall = TP / (TP + FN).

`F1-score` is the harmonic mean of precision and recall: F1 = 2 · precision · recall / (precision + recall).
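These definitions can be checked directly against the confusion matrix printed for the logistic regression model above, where rows are actual classes and columns are predicted classes:

```python
# Confusion matrix of the logistic regression model above:
# [[448  52]    actual 0: 448 predicted 0, 52 predicted 1
#  [121 147]]   actual 1: 121 predicted 0, 147 predicted 1
tn, fp = 448, 52
fn, tp = 121, 147

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of those predicted positive, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# → 0.77 0.74 0.55 0.63 (matching the class-1.0 row of the report)
```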

Naive Bayes

```python
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
```

output

```
GaussianNB(priors=None)
             precision    recall  f1-score   support

        0.0       0.80      0.84      0.82       500
        1.0       0.68      0.62      0.64       268

avg / total       0.76      0.76      0.76       768

[[421  79]
 [103 165]]
```

k-Nearest Neighbors

kNN (k-nearest neighbors) is often used as a component of a more complex classification algorithm; for example, its prediction can serve as an input feature for another model. Sometimes a simple kNN model on well-chosen features performs remarkably well. And with its parameters (mainly the distance metric) set properly, the algorithm often achieves excellent quality in regression problems as well.

```python
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
```

output

```
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
             precision    recall  f1-score   support

        0.0       0.83      0.88      0.85       500
        1.0       0.75      0.65      0.70       268

avg / total       0.80      0.80      0.80       768

[[442  58]
 [ 93 175]]
```
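Since the choice of `n_neighbors` (and the distance metric) matters so much for kNN, it is worth comparing a few values with cross-validation rather than accepting the default of 5. A minimal sketch, using synthetic data as a stand-in for the Pima dataset (the article's X and y would work the same way):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Pima data: 8 features, binary target
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Cross-validated accuracy for a few values of n_neighbors
for k in (1, 5, 15):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, round(scores.mean(), 3))
```

The grid-search tools shown at the end of this article automate exactly this kind of sweep.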

Decision Tree

```python
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
```

output

```
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
           max_features=None, max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]
```
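The perfect scores above are misleading: we are predicting on the same data the model was trained on, and an unpruned decision tree simply memorizes the training set. Holding out a test set gives an honest estimate. A minimal sketch, again using synthetic data as a stand-in for the Pima dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Pima data
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Training accuracy is perfect; held-out accuracy is the honest number
print(model.score(X_train, y_train))  # 1.0 — the tree memorizes the training set
print(model.score(X_test, y_test))    # noticeably lower
```

The same caveat applies to every `fit`-then-`predict`-on-`X` example in this article.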

Support Vector Machine

SVM (support vector machine) is one of the most popular machine learning algorithms, used mainly for classification problems. Like logistic regression, SVM can perform multi-class classification with the help of the one-vs-rest strategy.

```python
from sklearn import metrics
from sklearn.svm import SVC
# fit a SVM model to the data
model = SVC()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
```

output

```
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]
```
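As with the decision tree, the perfect report here reflects evaluation on the training data, not real performance. An RBF-kernel SVM is also sensitive to feature scale, so standardization (from the earlier section) belongs in the workflow. One idiomatic way to combine both fixes is a pipeline evaluated with cross-validation; a minimal sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Pima data
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Scaling inside the pipeline is refit on each CV training fold,
# so no information leaks from the validation fold
model = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(model, X, y, cv=5)
print(round(scores.mean(), 3))  # cross-validated accuracy, a far more honest estimate
```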

How to Tune Algorithm Parameters

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
# prepare a range of alpha values to test
alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)
```

output

```
GridSearchCV(cv=None, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
       normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': array([  1.00000e+00,   1.00000e-01,   1.00000e-02,
       1.00000e-03,   1.00000e-04,   0.00000e+00])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
0.279617559313
1.0
```

```python
import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
rsearch = RandomizedSearchCV(estimator=Ridge(), param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)
```

output

```
RandomizedSearchCV(cv=None, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
       normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params={}, iid=True, n_iter=100, n_jobs=1,
       param_distributions={'alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x10efc1438>},
       pre_dispatch='2*n_jobs', random_state=None, refit=True,
       return_train_score=True, scoring=None, verbose=0)
0.279617531252
0.998565254036
```