# KNN Algorithm Implementation and Cross-Validation

The KNN Algorithm


The core idea of kNN is that if the majority of the k nearest neighbors of a sample in feature space belong to one category, then the sample belongs to that category too and shares the characteristics of the samples in it. The method bases its classification decision only on the categories of the nearest one or few samples. Because kNN relies on a limited number of nearby neighbors rather than on discriminating class regions, it is better suited than other methods to sample sets whose class regions intersect or overlap heavily.
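The majority-vote idea can be sketched in a few lines. The points and labels below are made up purely for illustration; the rest of this post uses kNN for regression rather than classification:

```python
import numpy as np
from collections import Counter

def knn_classify(X, labels, v, k=3):
    """Predict the label of v by majority vote among its k nearest neighbors."""
    dists = np.sqrt(np.sum((X - np.asarray(v)) ** 2, axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                            # indices of the k closest points
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]])
labels = ['a', 'a', 'a', 'b', 'b']
knn_classify(X, labels, [0.5, 0.5], k=3)  # 'a': all 3 nearest neighbors are class 'a'
```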

1. Data

```python
import numpy as np
import matplotlib.pyplot as plt

def wineprice(rating, age):
    """
    Input rating & age of wine and output its price.

    Example:
    ------
    input = [80., 20.]  ===>  output = 140.0
    """
    peak_age = rating - 50  # wine before its peak year is more expensive
    price = rating / 2.
    if age > peak_age:
        price = price * (5 - (age - peak_age))
    else:
        price = price * (5 * ((age + 1) / peak_age))
    if price < 0:
        price = 0
    return price
```

```python
a = wineprice(80., 20.)
a
```

140.0

```python
def wineset(n=500):
    """
    Input wineset size n and return feature array and target array.

    Example:
    ------
    n = 3
    X = np.array([[80, 20], [95, 30], [100, 15]])
    y = np.array([140.0, 163.6, 80.0])
    """
    X, y = [], []
    for i in range(n):
        rating = np.random.random() * 50 + 50
        age = np.random.random() * 50
        # get reference price
        price = wineprice(rating, age)
        # add some noise
        price = price * (np.random.random() * 0.4 + 0.8)  # [0.8, 1.2]
        X.append([rating, age])
        y.append(price)
    return np.array(X), np.array(y)
```

```python
X, y = wineset(500)
X[:3]
```

array([[ 88.89511317,  11.63751282],
       [ 91.57171713,  39.76279923],
       [ 98.38870877,  14.07015414]])

2. Similarity: Euclidean Distance

The K in kNN stands for K nearest neighbors, but "near" needs a mathematical definition. The most common choice is Euclidean distance, which in two or three dimensions is the ordinary distance in the plane or in space.

```python
def euclidean(arr1, arr2):
    """
    Input two arrays and output their distance array.

    Example:
    ------
    arr1 = np.array([[3, 20], [2, 30], [2, 15]])
    arr2 = np.array([[2, 20], [2, 20], [2, 20]])  # broadcast; np.array([2, 20]) and [2, 20] also work
    d    = np.array([1., 10., 5.])
    """
    ds = np.sum((arr1 - arr2) ** 2, axis=1)
    return np.sqrt(ds)
```

```python
arr1 = np.array([[3, 20], [2, 30], [2, 15]])
arr2 = np.array([[2, 20], [2, 20], [2, 20]])
euclidean(arr1, arr2)
```

array([1.,10.,5.])

```python
def getdistance(X, v):
    """
    Input train data set X and a sample v; output the sorted distances together with the sort order.

    Example:
    ------
    X = np.array([[3, 20], [2, 30], [2, 15]])
    v = np.array([2, 20])  # to be broadcast
    Output dlist = np.array([1, 5, 10]), index = np.array([0, 2, 1])
    """
    dlist = euclidean(X, np.array(v))
    index = np.argsort(dlist)
    dlist.sort()
    # dlist_with_index = np.stack((dlist, index), axis=1)
    return dlist, index
```

```python
dlist, index = getdistance(X, [80., 20.])
```

3. The KNN Algorithm

The kNN algorithm itself is straightforward: call the functions above to obtain the sorted distance list, then average the target values of its first k entries. Comparing this kNN estimate with the true price model shows the accuracy is quite good.

```python
def knn(X, y, v, kn=3):
    """
    Input train data and train target; output the average price of the new sample.

    X = X_train; y = y_train
    kn: number of neighbors
    """
    dlist, index = getdistance(X, v)
    avg = 0.0
    for i in range(kn):
        avg = avg + y[index[i]]
    avg = avg / kn
    return avg
```

```python
knn(X, y, [95.0, 5.0], kn=3)
```

32.043042600537092

```python
wineprice(95.0, 5.0)
```

31.666666666666664
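As a sanity check on the loop-based implementation, the same k-nearest average can be written without an explicit Python loop. This vectorized variant is an alternative formulation added for illustration, not part of the original code:

```python
import numpy as np

def knn_vectorized(X, y, v, kn=3):
    """Average target of the kn nearest neighbors, no explicit Python loop."""
    dists = np.sqrt(((X - np.asarray(v)) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:kn]                         # indices of the kn closest rows
    return y[nearest].mean()

# toy check: the neighbors of [2, 20] are rows 0, 2, 1 in order of distance
X = np.array([[3., 20.], [2., 30.], [2., 15.]])
y = np.array([10., 40., 20.])
knn_vectorized(X, y, [2., 20.], kn=2)  # (10 + 20) / 2 = 15.0
```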

4. Weighted KNN

Plain kNN gives all k neighbors equal weight. Weighted kNN instead weights each neighbor's target by a Gaussian function of its distance, so closer neighbors contribute more to the prediction.

```python
def gaussian(dist, sigma=10.0):
    """Input a distance and return its weight."""
    weight = np.exp(-dist ** 2 / (2 * sigma ** 2))
    return weight
```

```python
x1 = np.arange(0, 30, 0.1)
y1 = gaussian(x1)
plt.title('gaussian function')
plt.plot(x1, y1);
```

```python
def knn_weight(X, y, v, kn=3):
    dlist, index = getdistance(X, v)
    avg = 0.0
    total_weight = 0
    for i in range(kn):
        weight = gaussian(dlist[i])
        avg = avg + weight * y[index[i]]
        total_weight = total_weight + weight
    avg = avg / total_weight
    return avg
```

```python
knn_weight(X, y, [95.0, 5.0], kn=3)
```

32.063929602836012

Three common cross-validation strategies:

- Holdout Method
- K-fold Cross Validation
- Leave-One-Out Cross Validation

Leave-one-out has two notable properties:

a. In every round almost all of the samples are used to train the model, so the training set stays closest to the original distribution and the resulting estimate is comparatively reliable.

b. No random factor affects the procedure, so the experiment is fully reproducible.
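In code, leave-one-out is just K-fold with as many folds as samples; a minimal standalone sketch (added for illustration, not reused later in this post):

```python
import numpy as np

def leave_one_out(n):
    """Yield (train_index, test_index): each sample is the test set exactly once."""
    idx = np.arange(n)
    for i in range(n):
        yield np.delete(idx, i), idx[i:i + 1]

splits = list(leave_one_out(3))
len(splits)  # 3 rounds for 3 samples; round 0 trains on [1, 2] and tests on [0]
```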

The idea behind the holdout method is simple: given a train_size, split the data into two parts at that proportion and return them.

```python
# Holdout method
def my_train_test_split(X, y, train_size=0.95, shuffle=True):
    """
    Input X, y; split them and output X_train, X_test, y_train, y_test.

    Example (with train_size=0.6):
    ------
    X = np.array([[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]])
    y = np.array([0, 1, 2, 3, 4])
    Then one possible shuffled split is
    X_train = np.array([[4, 5], [0, 1], [6, 7]])
    X_test  = np.array([[2, 3], [8, 9]])
    y_train = np.array([2, 0, 3])
    y_test  = np.array([1, 4])
    """
    order = np.arange(len(y))
    if shuffle:
        order = np.random.permutation(order)
    border = int(train_size * len(y))
    # index through `order` so the shuffle actually takes effect
    X_train, X_test = X[order[:border]], X[order[border:]]
    y_train, y_test = y[order[:border]], y[order[border:]]
    return X_train, X_test, y_train, y_test
```

The K-fold algorithm splits the data into k parts and loops k times, with a different part serving as the test set in each round. Note that the algorithm does not operate on the data directly; instead it produces an iterator that yields the indices of the training and test samples. The example in the docstring should make this clear.

```python
# k folds: yields an iterator
def my_KFold(n, n_folds=5, shuffle=False):
    """
    K-Folds cross validation iterator.

    Provides train/test indices to split data into train/test sets. Splits the
    dataset into k consecutive folds (without shuffling by default).
    Each fold is then used as a validation set once, while the k - 1 remaining
    folds form the training set.

    Example:
    ------
    X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
    y = np.array([1, 2, 3, 4])
    kf = my_KFold(4, n_folds=2)
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        print("TRAIN:", train_index, "TEST:", test_index)

    TRAIN: [2, 3] TEST: [0, 1]
    TRAIN: [0, 1] TEST: [2, 3]
    """
    idx = np.arange(n)
    if shuffle:
        idx = np.random.permutation(idx)
    # folds have size n // n_folds; the first n % n_folds folds get one extra sample
    fold_sizes = (n // n_folds) * np.ones(n_folds, dtype=int)
    fold_sizes[:n % n_folds] += 1
    current = 0
    for fold_size in fold_sizes:
        start, stop = current, current + fold_size
        train_index = list(np.concatenate((idx[:start], idx[stop:])))
        test_index = list(idx[start:stop])
        yield train_index, test_index
        current = stop  # move one fold forward
```

```python
X1 = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y1 = np.array([1, 2, 3, 4])
kf = my_KFold(4, n_folds=2)

for train_index, test_index in kf:
    X_train, X_test = X1[train_index], X1[test_index]
    y_train, y_test = y1[train_index], y1[test_index]
    print("TRAIN:", train_index, "TEST:", test_index)
```

TRAIN: [2, 3] TEST: [0, 1]
TRAIN: [0, 1] TEST: [2, 3]

Cross-Validating the KNN Algorithm

```python
def test_algo(alg, X_train, X_test, y_train, y_test, kn=3):
    error = 0.0
    for i in range(len(y_test)):
        guess = alg(X_train, y_train, X_test[i], kn=kn)
        error += (y_test[i] - guess) ** 2
    return error / len(y_test)  # mean squared error over the test set
```

```python
X_train, X_test, y_train, y_test = my_train_test_split(X, y, train_size=0.8)
test_algo(knn, X_train, X_test, y_train, y_test, kn=3)
```

783.80937472673656

```python
def my_cross_validate(alg, X, y, n_folds=100, kn=3):
    error = 0.0
    kf = my_KFold(len(y), n_folds=n_folds)
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        error += test_algo(alg, X_train, X_test, y_train, y_test, kn=kn)
    return error / n_folds

```python
errors1, errors2 = [], []
for i in range(20):
    error1 = my_cross_validate(knn, X, y, kn=i + 1)
    error2 = my_cross_validate(knn_weight, X, y, kn=i + 1)
    errors1.append(error1)
    errors2.append(error2)
```

```python
xs = np.arange(len(errors1)) + 1
plt.plot(xs, errors1, color='c')
plt.plot(xs, errors2, color='r')
plt.xlabel('K')
plt.ylabel('Error')
plt.title('Error vs K');
```