# KNN Algorithm Basics

The KNN algorithm is one of the easiest machine learning algorithms to understand and is a classic example of lazy learning. "Lazy" means the model predicts purely by memorizing the training dataset rather than by learning a discriminative function.

###### Aside: parametric and non-parametric models

KNN belongs to a subclass of non-parametric models that can be described as instance-based learning. Models of this kind memorize the training data, and the lazy learning that KNN exemplifies is a special case of instance-based learning.

#### The KNN implementation in sklearn

The KNN algorithm itself is simple and boils down to three steps:
①Choose the number of neighbors k and a distance metric.
②Find the k nearest neighbors of the sample to be classified.
③Assign the class by majority vote among those neighbors' labels.
sklearn already provides a solid implementation of the KNN classifier in the KNeighborsClassifier class of the sklearn.neighbors package. A minimal example:

```python
from sklearn.model_selection import train_test_split
import sklearn.datasets as dataset
from sklearn.neighbors import KNeighborsClassifier

'''Load the data (the handwritten digits dataset reproduces the scores below)'''
data = dataset.load_digits()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=666)

knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)
score = knn_clf.score(X_test, y_test)
print('when k chose 5, fit score is : %s' % score)
```

```
when k chose 5, fit score is : 0.9866666666666667
```
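For intuition, the three steps listed earlier can also be sketched from scratch. This is a minimal NumPy version with a Euclidean metric on a toy dataset, not sklearn's actual implementation:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # Step 1: k and the metric (Euclidean here) are chosen by the caller.
    # Step 2: find the k training samples closest to x.
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote over the neighbors' class labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two well-separated clusters on a line.
X_train = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.15]), k=3))  # → 0
```

Note that "fitting" here is nothing but storing the training arrays, which is exactly what makes KNN a lazy learner.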

#### KNN hyperparameters

###### (1) Choosing k

The first hyperparameter is k. As a rule of thumb, k = 5 is often a good starting point, but the best value must be verified experimentally:

```python
'''Search for the best k'''
best_k = -1
best_score = 0.0
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_score = score
        best_k = k
print("best k is : %s, best_score is : %s\n" % (best_k, best_score))
```

```
best k is : 5, best_score is : 0.9866666666666667
```

###### (2) Distance-weighted voting

```python
'''Consider distance weighting'''
best_method = ''
best_k = -1
best_score = 0.0
for method in ['uniform', 'distance']:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_k = k
            best_method = method
print('best method is : %s, best k is : %s, best score is : %s\n' % (best_method, best_k, best_score))
```

```
best method is : uniform, best k is : 5, best score is : 0.9866666666666667
```
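With weights='distance', each neighbor's vote is weighted by the inverse of its distance to the query point instead of counting equally. A toy sketch of the idea (sklearn additionally handles exact zero distances, which this version does not):

```python
import numpy as np

def weighted_vote(distances, labels):
    # Inverse-distance weights: closer neighbors count more.
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    weights = 1.0 / distances
    # Sum the weights per class and return the class with the largest total.
    scores = {c: weights[labels == c].sum() for c in np.unique(labels)}
    return max(scores, key=scores.get)

# One very close class-1 neighbor outweighs two farther class-0 neighbors,
# so distance weighting flips the plain majority vote.
print(weighted_vote([0.1, 1.0, 1.2], [1, 0, 0]))  # → 1
```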

###### (3) The distance exponent p

```python
'''Search over the Minkowski exponent p'''
best_score = 0.0
best_k = -1
best_p = -1
for k in range(1, 11):
    for p in range(1, 6):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights='distance', p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_k = k
            best_p = p
print('best k is : %s, best p is : %s, best score is : %s\n' % (best_k, best_p, best_score))
```

```
best k is : 5, best p is : 2, best score is : 0.9866666666666667
```
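The p parameter is the exponent of the Minkowski metric that KNeighborsClassifier uses by default: p = 1 gives the Manhattan distance and p = 2 the Euclidean distance. A small illustration:

```python
import numpy as np

def minkowski(a, b, p):
    # Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p)
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p))

a, b = [0, 0], [3, 4]
print(minkowski(a, b, 1))  # Manhattan distance: 7.0
print(minkowski(a, b, 2))  # Euclidean distance: 5.0
```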

#### Grid search for KNN hyperparameters

```python
from sklearn.model_selection import train_test_split
import sklearn.datasets as dataset
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

'''Load the data (the handwritten digits dataset reproduces the scores below)'''
data = dataset.load_digits()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=666)

'''Parameter grid for the hyperparameter search'''
grid_params = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    }, {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [p for p in range(1, 6)]
    }
]
```

```python
'''Create a KNN classifier to pass to the grid search'''
knn_clf = KNeighborsClassifier()
grid_search = GridSearchCV(knn_clf, param_grid=grid_params, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
```

```
Fitting 3 folds for each of 60 candidates, totalling 180 fits
[CV] n_neighbors=1, weights=uniform ..................................
[CV] n_neighbors=1, weights=uniform ..................................
[CV] n_neighbors=1, weights=uniform ..................................
[CV] n_neighbors=2, weights=uniform ..................................
[CV] ................... n_neighbors=1, weights=uniform, total=   0.0s
[CV] ................... n_neighbors=1, weights=uniform, total=   0.1s
[CV] n_neighbors=2, weights=uniform ..................................
[CV] ................... n_neighbors=1, weights=uniform, total=   0.1s
......
```

The verbose argument makes the search print progress messages like those above. Note the "60 candidates" in the first line: looking back at grid_params, when weights is 'uniform' only k from 1 to 10 is tested, while when weights is 'distance' every combination of k from 1 to 10 and p from 1 to 5 is tested, giving 1 × 10 + 10 × 5 = 60 candidates in total.
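That count can be checked directly: each sub-grid contributes the product of its value-list lengths (sklearn's ParameterGrid computes the same thing). A quick check, re-declaring the grid from above in compact form:

```python
def n_candidates(param_grids):
    # Sum over sub-grids of the product of their value-list lengths.
    total = 0
    for grid in param_grids:
        combos = 1
        for values in grid.values():
            combos *= len(values)
        total += combos
    return total

grid_params = [
    {'weights': ['uniform'], 'n_neighbors': list(range(1, 11))},
    {'weights': ['distance'], 'n_neighbors': list(range(1, 11)), 'p': list(range(1, 6))},
]
print(n_candidates(grid_params))  # → 60 (1*10 + 1*10*5)
```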

```python
print('best estimator is :')
print(grid_search.best_estimator_)
print('best score is %s:' % grid_search.best_score_)
knn_clf = grid_search.best_estimator_
score = knn_clf.score(X_test, y_test)
print("test by KNN Classifier's score() function, the score is : %s" % score)
```

```
best estimator is :
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=3,
           weights='distance')
best score is 0.9866369710467706:
test by KNN Classifier's score() function, the score is : 0.9822222222222222
```

#### Drawbacks of KNN

KNN suffers from the curse of dimensionality: as the number of dimensions grows, the distance between two fixed "corner" points of the unit cube keeps increasing, so the feature space becomes ever sparser and nearest neighbors become less meaningful.

| Dimensions | Point 1 | Point 2 | Distance |
| --- | --- | --- | --- |
| 1 | 0 | 1 | 1 |
| 2 | (0, 0) | (1, 1) | 1.414 |
| 3 | (0, 0, 0) | (1, 1, 1) | 1.732 |
| ... | ... | ... | ... |
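The distances in the table are just the norm of the all-ones vector, which equals √d and therefore grows without bound as the dimension d increases; a quick check:

```python
import numpy as np

# Distance from the origin to the opposite corner (1, 1, ..., 1)
# of the unit cube: it equals sqrt(d).
for d in (1, 2, 3, 10, 100):
    print('dim=%3d  distance=%.3f' % (d, np.linalg.norm(np.ones(d))))
```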