python data analysis | python数据预处理（基于scikit-learn模块）

Dataset transformations| 数据转换

Combining estimators|组合学习器
Feature extration|特征提取
Preprocessing data|数据预处理

1 Dataset transformations

scikit-learn provides a library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised dimensionality reduction), expand (see Kernel Approximation) or generate (see Feature extraction) feature representations.

scikit-learn 提供了数据转换的模块，包括数据清理、降维、扩展和特征提取。

Like other estimators, these are represented by classes with fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.

scikit-learn模块有3种通用的方法：fit(X,y=None)、transform(X)、fit_transform(X)、inverse_transform(newX)。fit用来训练模型；transform在训练后用来降维；fit_transform先用训练模型，然后返回降维后的X；inverse_transform用来将降维后的数据转换成原始数据。

1.1 combining estimators

1.1.1 Pipeline:chaining estimators

Pipeline 模块是用来组合一系列估计器的。对固定的一系列操作非常便利，如：同时结合特征选择、数据标准化、分类。

Usage|使用
代码：

from sklearn.pipeline import Pipeline  
from sklearn.svm import SVC 
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
#define estimators
#the arg is a list of (key,value) pairs,where the key is a string you want to give this step and value is an estimators object
estimators=[('reduce_dim',PCA()),('svm',SVC())]  
#combine estimators
clf1=Pipeline(estimators)
clf2=make_pipeline(PCA(),SVC())  #use func make_pipeline() can do the same thing
print(clf1,'\n',clf2)

输出：

Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None, whiten=False)), ('svm',           SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]) 
 Pipeline(steps=[('pca', PCA(copy=True, n_components=None, whiten=False)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

可以通过set_params()方法设置学习器的属性，参数形式为<estimator>_<parameter>

clf.set_params(svm__C=10)

上面的方法在网格搜索时很重要：

from sklearn.grid_search import GridSearchCV
params = dict(reduce_dim__n_components=[2, 5, 10],svm__C=[0.1, 10, 100])
grid_search = GridSearchCV(clf, param_grid=params)

上面的例子相当于把pipeline生成的学习器作为一个普通的学习器，参数形式为<estimator>_<parameter>。

Note|说明
1.可以使用dir()函数查看clf的所有属性和方法。例如step属性就是每个操作步骤的属性。
如

('reduce_dim', PCA(copy=True, n_components=None, whiten=False))

2.调用pipeline生成的学习器的fit方法相当于依次调用其包含的所有学习器的方法，transform输入然后把结果扔向下一步骤。pipeline生成的学习器有着它包含的学习器的所有方法。如果最后一个学习器是分类，那么生成的学习器就是分类，如果最后一个是transform，那么生成的学习器就是transform，依次类推。

1.1.2 FeatureUnion: composite feature spaces

与pipeline不同的是FeatureUnion只组合transformer，它们也可以结合成更复杂的模型。

FeatureUnion combines several transformer objects into a new transformer that combines their output. AFeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors.

Usage|使用
代码：

from sklearn.pipeline import FeatureUnion   
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA
from sklearn.pipeline import make_union
#define transformers
#the arg is a list of (key,value) pairs,where the key is a string you want to give this step and value is an transformer object
estimators=[('linear_pca)',PCA()),('Kernel_pca',KernelPCA())]  
#combine transformers
clf1=FeatureUnion(estimators)
clf2=make_union(PCA(),KernelPCA())
print(clf1,'\n',clf2) 
print(dir(clf1))

输出：

FeatureUnion(n_jobs=1,
       transformer_list=[('linear_pca)', PCA(copy=True, n_components=None, whiten=False)), ('Kernel_pca', KernelPCA(alpha=1.0, coef0=1, degree=3, eigen_solver='auto',
     fit_inverse_transform=False, gamma=None, kernel='linear',
     kernel_params=None, max_iter=None, n_components=None,
     remove_zero_eig=False, tol=0))],
       transformer_weights=None) 
 FeatureUnion(n_jobs=1,
       transformer_list=[('pca', PCA(copy=True, n_components=None, whiten=False)), ('kernelpca', KernelPCA(alpha=1.0, coef0=1, degree=3, eigen_solver='auto',
     fit_inverse_transform=False, gamma=None, kernel='linear',
     kernel_params=None, max_iter=None, n_components=None,
     remove_zero_eig=False, tol=0))],
       transformer_weights=None)

可以看出FeatureUnion的用法与pipeline一致

Note|说明

(A [FeatureUnion
](http://scikit- learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html#sklearn.pipeline.FeatureUn ion) has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)

Here is a example python source code:[feature_stacker.py](http://scikit-learn.org/stable/_downloads/feature_stacker.py)

1.2 Feature extraction

The sklearn.feature_extraction
module can be used to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and image.

skilearn.feature_extraction模块是用机器学习算法所支持的数据格式来提取数据，如将text和image信息转换成dataset。
Note:
Feature extraction（特征提取）与Feature selection（特征选择）不同，前者是用来将非数值的数据转换成数值的数据，后者是用机器学习的方法对特征进行学习（如PCA降维）。

1.2.1 Loading features from dicts

The class DictVectorizer
can be used to convert feature arrays represented as lists of standard Python dict
objects to the NumPy/SciPy representation used by scikit-learn estimators.
Dictvectorizer类用来将python内置的dict类型转换成数值型的array。dict类型的好处是在存储稀疏数据时不用存储无用的值。

代码：

measurements=[{'city': 'Dubai', 'temperature': 33.}
,{'city': 'London', 'temperature':12.}
,{'city':'San Fransisco','temperature':18.},]
from sklearn.feature_extraction import DictVectorizer
vec=DictVectorizer()
x=vec.fit_transform(measurements).toarray()
print(x)
print(vec.get_feature_names())```
输出：

[[ 1. 0. 0. 33.]
[ 0. 1. 0. 12.]
[ 0. 0. 1. 18.]]
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
[Finished in 0.8s]

* ###<p id='1.2.2'>1.2.2 Feature hashing</p>
* ###<p id='1.2.3'>1.2.3 Text feature extraction</p>
* ###<p id='1.2.4'>1.2.4 Image feature extraction</p>
以上三小节暂未考虑（设计到语言处理及图像处理）[见官方文档][官方文档]
[官方文档]: http://scikit-learn.org/stable/data_transforms.html

##<p id='1.3'>1.3 Preprogressing data</p>
>The sklearn.preprocessing
 package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators

sklearn.preprogressing模块提供了几种常见的数据转换，如标准化、归一化等。
* ###<p id='1.3.1'>1.3.1 Standardization, or mean removal and variance scaling</p>
>**Standardization** of datasets is a **common requirement for many machine learning estimators** implemented in the scikit; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with **zero mean and unit variance**.

 很多学习算法都要求事先对数据进行标准化，如果不是像标准正太分布一样0均值1方差就可能会有很差的表现。

 * Usage|用法

 代码：
```python
from sklearn import preprocessing
import numpy as np
X = np.array([[1.,-1., 2.], [2.,0.,0.], [0.,1.,-1.]])
Y=X
Y_scaled = preprocessing.scale(Y)
y_mean=Y_scaled.mean(axis=0) #If 0, independently standardize each feature, otherwise (if 1) standardize each sample|axis=0 时求每个特征的均值，axis=1时求每个样本的均值
y_std=Y_scaled.std(axis=0)
print(Y_scaled)
scaler= preprocessing.StandardScaler().fit(Y)#用StandardScaler类也能完成同样的功能
print(scaler.transform(Y))

输出：

[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
[Finished in 1.4s]

Note|说明
1.func scale
2.class StandardScaler
3.StandardScaler 是一种Transformer方法，可以让pipeline来使用。
MinMaxScaler （min-max标准化[0,1]）类和MaxAbsScaler（[-1,1]）类是另外两个标准化的方式，用法和StandardScaler类似。
4.处理稀疏数据时用MinMax和MaxAbs很合适
5.鲁棒的数据标准化方法（适用于离群点很多的数据处理）：

the median and the interquartile range often give better results

用中位数代替均值（使均值为0），用上四分位数-下四分位数代替方差（IQR为1？）。

1.3.2 Impution of missing values|缺失值的处理
Usage
代码：

import scipy.sparse as sp
from sklearn.preprocessing import Imputer
X=sp.csc_matrix([[1,2],[0,3],[7,6]])
imp=preprocessing.Imputer(missing_value=0,strategy='mean',axis=0)
imp.fit(X)
X_test=sp.csc_matrix([[0, 2], [6, 0], [7, 6]])
print(X_test)
print(imp.transform(X_test))

输出：

  (1, 0)    6
  (2, 0)    7
  (0, 1)    2
  (2, 1)    6
[[ 4.          2.        ]
 [ 6.          3.66666675]
 [ 7.          6.        ]]
[Finished in 0.6s]

Note
1.scipy.sparse是用来存储稀疏矩阵的
2.Imputer可以用来处理scipy.sparse稀疏矩阵
1.3.3 Generating polynomial features
Usage
代码：

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X=np.arange(6).reshape(3,2)
print(X)
poly=PolynomialFeatures(2)
print(poly.fit_transform(X))

输出：

[[0 1]
 [2 3]
 [4 5]]
[[  1.   0.   1.   0.   0.   1.]
 [  1.   2.   3.   4.   6.   9.]
 [  1.   4.   5.  16.  20.  25.]]
[Finished in 0.8s]

Note
生成多项式特征用在多项式回归中以及多项式核方法中。
1.3.4 Custom transformers

这是用来构造transform方法的函数

Usage:
代码：

import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p)
x=np.array([[0,1],[2,3]])
print(transformer.transform(x))

输出：

[[ 0.          0.69314718]
 [ 1.09861229  1.38629436]]
[Finished in 0.8s]

Note

For a full code example that demonstrates using a FunctionTransformer
to do custom feature selection, see Using FunctionTransformer to select columns

最后编辑于：2017.12.03 05:08:38

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 158,736评论 4赞 362
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 67,167评论 1赞 291
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 108,442评论 0赞 243
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 43,902评论 0赞 204
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 52,302评论 3赞 287
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 40,573评论 1赞 216
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 31,847评论 2赞 312
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 30,562评论 0赞 197
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 34,260评论 1赞 241
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 30,531评论 2赞 245
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 32,021评论 1赞 258
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 28,367评论 2赞 253
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 33,016评论 3赞 235
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 26,068评论 0赞 8
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 26,827评论 0赞 194
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 35,610评论 2赞 274
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 35,514评论 2赞 269

python data analysis | python数据预处理（基于scikit-learn模块）

python data analysis | python数据预处理（基于scikit-learn模块）

<p id='1'>1 Dataset transformations</p>

<p id='1.1'>1.1 combining estimators</p>

<p id='1.1.1'>1.1.1 Pipeline:chaining estimators</p>

<p id='1.1.2'> 1.1.2 FeatureUnion: composite feature spaces</p>

<p id='1.2'>1.2 Feature extraction</p>

<p id='1.2.1'>1.2.1 Loading features from dicts</p>

<p id='1.3.2'>1.3.2 Impution of missing values|缺失值的处理</p>

<p id='1.3.3'>1.3.3 Generating polynomial features</p>

<p id='1.3.4'>1.3.4 Custom transformers</p>

推荐阅读更多精彩内容