Python Data Science Notes 2

Study notes on the Python Data Science Handbook

Part 2: NumPy (2)

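NumPy's vectorized operations are very similar to MATLAB's. Vectorized operations are far more efficient than explicit loops, so use them in place of loops wherever possible.

A "ufunc" (universal function) is one of a family of functions that operate element-wise on a whole array.

Some more specialized functions are available through the scipy package.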

In [29]: from scipy import special

In [30]: x = np.random.randint(15, size = (5,5), dtype = 'int32')

In [31]: x
Out[31]: 
array([[ 4, 14,  8,  5,  7],
       [ 0,  8,  8, 14,  9],
       [ 9,  9, 10, 14,  1],
       [13, 10,  0, 12, 12],
       [ 7,  3,  2, 14,  2]], dtype=int32)

In [32]: special.erf(x)
Out[32]: 
array([[ 0.99999998,  1.        ,  1.        ,  1.        ,  1.        ],
       [ 0.        ,  1.        ,  1.        ,  1.        ,  1.        ],
       [ 1.        ,  1.        ,  1.        ,  1.        ,  0.84270079],
       [ 1.        ,  1.        ,  0.        ,  1.        ,  1.        ],
       [ 1.        ,  0.99997791,  0.99532227,  1.        ,  0.99532227]])

In [33]: x
Out[33]: 
array([[ 4, 14,  8,  5,  7],
       [ 0,  8,  8, 14,  9],
       [ 9,  9, 10, 14,  1],
       [13, 10,  0, 12, 12],
       [ 7,  3,  2, 14,  2]], dtype=int32)
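
To make the earlier point about vectorization concrete, here is a minimal sketch that computes reciprocals once with an explicit loop and once with a single ufunc call. The helper name reciprocal_loop and the array size are assumptions made for this illustration only.

```python
import numpy as np


def reciprocal_loop(values):
    """Compute 1/v element by element with an explicit Python loop."""
    output = np.empty(len(values))
    for k in range(len(values)):
        output[k] = 1.0 / values[k]
    return output


rng = np.random.RandomState(0)
values = rng.randint(1, 10, size=1000)   # integers in [1, 9], so no division by zero

looped = reciprocal_loop(values)   # one Python-level operation per element
vectorized = 1.0 / values          # a single ufunc call over the whole array

# Both approaches give the same numbers; only the speed differs.
print(np.allclose(looped, vectorized))
```

Timing the two versions with %timeit in IPython should show the gap; the exact ratio depends on the machine and the array size.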

Specifying output


In [24]:
x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)
print(y)
[  0.  10.  20.  30.  40.]

y = np.zeros(10)
np.power(2, x, out=y[::2])
print(y)
[  1.   0.   2.   0.   4.   0.   8.   0.  16.   0.]

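You may ask what the benefit is compared with a plain assignment.
With y[::2] = 2 ** x, NumPy first creates a temporary array holding the value of the right-hand side and then copies it into the sub-array on the left. Passing out= skips that temporary array and the extra copy, which saves time and memory when the arrays are large.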
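
A small sketch of the same idea, checking that the ufunc really writes into the memory of y instead of allocating a new result. The variable names target and result are chosen for this example.

```python
import numpy as np

x = np.arange(5)
y = np.zeros(10)

# y[::2] is a view onto every other element of y, so the ufunc
# writes its results straight into y's memory.
target = y[::2]
result = np.power(2, x, out=target)

print(result is target)              # the ufunc returns the array passed as out
print(np.shares_memory(result, y))   # no separate result buffer was created
print(y)
```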

Aggregation

In [36]: x = np.linspace(0, 10, 5)

In [37]: x
Out[37]: array([  0. ,   2.5,   5. ,   7.5,  10. ])

In [38]: np.add.reduce(x)
Out[38]: 25.0

In [39]: np.multiply.reduce(x)
Out[39]: 0.0

In [40]: np.add.accumulate(x)
Out[40]: array([  0. ,   2.5,   7.5,  15. ,  25. ])
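
As a quick cross-check, the reduce and accumulate methods of np.add and np.multiply correspond to the more familiar np.sum, np.prod and np.cumsum. The sketch below only compares the results; it makes no claim about which spelling is faster.

```python
import numpy as np

x = np.linspace(0, 10, 5)

# reduce collapses the array with the ufunc; accumulate keeps the running results.
total = np.add.reduce(x)
running = np.add.accumulate(x)
product = np.multiply.reduce(x)

print(np.isclose(total, np.sum(x)))
print(np.allclose(running, np.cumsum(x)))
print(np.isclose(product, np.prod(x)))
```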

Outer product

In [41]: x = np.arange(1, 5)

In [42]: x
Out[42]: array([1, 2, 3, 4])

In [43]: np.multiply.outer(x, x)
Out[43]: 
array([[ 1,  2,  3,  4],
       [ 2,  4,  6,  8],
       [ 3,  6,  9, 12],
       [ 4,  8, 12, 16]])
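
The outer method exists on every binary ufunc, not only on np.multiply. The sketch below builds an addition table in the same way and checks that the multiplication table matches the result of multiplying a column by a row, which relies on broadcasting (covered later in these notes).

```python
import numpy as np

x = np.arange(1, 5)

table = np.multiply.outer(x, x)    # table[i, j] = x[i] * x[j]
sums = np.add.outer(x, x)          # sums[i, j] = x[i] + x[j]

# A column vector times a row vector gives the same multiplication table.
same_table = x[:, np.newaxis] * x

print(np.array_equal(table, same_table))
print(sums)
```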

Aggregation functions in NumPy such as min and max

In [44]: x = np.arange(1, 10)

In [45]: x
Out[45]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [46]: %timeit x.sum()
1.11 µs ± 72.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [47]: %timeit sum(x)         # Be careful: avoid Python's built-in sum() on NumPy arrays
1.3 µs ± 5.45 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [48]: x.min()
Out[48]: 1

In [49]: x.max()
Out[49]: 9

We can also set axis to aggregate along rows or columns.

In [50]: Mat = np.random.random((3,4))

In [51]: Mat.sum(axis = 1)
Out[51]: array([ 2.54634383,  2.42121143,  1.28962794])

In [52]: Mat
Out[52]: 
array([[ 0.77880176,  0.57543626,  0.6840498 ,  0.508056  ],
       [ 0.75612961,  0.15132258,  0.65047932,  0.86327992],
       [ 0.25738888,  0.5731711 ,  0.03401482,  0.42505314]])

In [53]: Mat.sum(axis = 0)
Out[53]: array([ 1.79232025,  1.29992993,  1.36854395,  1.79638906])

In [54]: # axis=0 collapses the row axis, i.e. it sums down each column
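
Because the random matrix above makes the sums hard to check by eye, here is the same operation on a small integer matrix (the values are chosen for this example). The axis argument names the dimension that is collapsed, not the one that is kept.

```python
import numpy as np

M = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

col_sums = M.sum(axis=0)   # collapses axis 0 (rows): one value per column
row_sums = M.sum(axis=1)   # collapses axis 1 (columns): one value per row

print(col_sums)   # [15 18 21 24]
print(row_sums)   # [10 26 42]
```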

Broadcasting

The simplest case of broadcasting

In [1]: import numpy as np

In [2]: a = np.array([1, 2, 3])

In [3]: b = 3

In [4]: a + b
Out[4]: array([4, 5, 6])

Some more complex examples

In [5]: M = np.ones((3, 3))

In [6]: M + a
Out[6]: 
array([[ 2.,  3.,  4.],
       [ 2.,  3.,  4.],
       [ 2.,  3.,  4.]])

In [7]: a = np.arange(3)

In [8]: b = np.arange(3)[:, np.newaxis]

In [9]: a
Out[9]: array([0, 1, 2])

In [10]: b
Out[10]: 
array([[0],
       [1],
       [2]])

In [11]: a + b
Out[11]: 
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])

Note that in this process, arrays with different shapes are "stretched" to match each other.

How it works

The three rules of broadcasting

Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
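
The three rules can be written out as a short function that predicts the result shape. This is only a sketch for understanding: broadcast_shape is a hypothetical helper written for these notes, not a NumPy function, and NumPy's own implementation is not shown here.

```python
import numpy as np


def broadcast_shape(shape_a, shape_b):
    """Apply the three broadcasting rules to two shapes (illustrative helper)."""
    ndim = max(len(shape_a), len(shape_b))
    # Rule 1: pad the shorter shape with ones on its left side.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for size_a, size_b in zip(a, b):
        if size_a == size_b or size_b == 1:
            result.append(size_a)   # Rule 2: sizes match, or b is stretched
        elif size_a == 1:
            result.append(size_b)   # Rule 2: a is stretched
        else:
            # Rule 3: sizes disagree and neither is 1.
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(result)


M = np.ones((3, 3))
a = np.arange(3)
b = np.arange(3)[:, np.newaxis]

# The predicted shapes agree with what NumPy actually produces.
print(broadcast_shape(M.shape, a.shape), (M + a).shape)
print(broadcast_shape(a.shape, b.shape), (a + b).shape)
```

For example, shapes (3, 2) and (3,) fail: the second is padded to (1, 3), and in the last dimension 2 and 3 disagree with neither equal to 1.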

Application example

Creating a dataset for z = f(x, y)

# x and y have 50 steps from 0 to 5
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)[:, np.newaxis]
z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

Boolean masking

Here the book uses a rainfall dataset to show how useful boolean masking is.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: rainfall = pd.read_csv('data/Seattle2014.csv')['PRCP'].values

In [4]: inches = rainfall / 254.0

In [5]: inches.shape
Out[5]: (365,)

Next, we can visualize the data to look for patterns in it.

ufuncs

We mentioned earlier that ufuncs are functions that operate on a whole array; here we combine them with boolean masking.

In [1]: import numpy as np

In [2]: rng = np.random.RandomState(0)

In [3]: x = rng.randint(10, size = (3, 4))

In [4]: x
Out[4]: 
array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6]])

In [5]: x < 6
Out[5]: 
array([[ True,  True,  True,  True],
       [False, False,  True,  True],
       [ True,  True, False, False]], dtype=bool)

The ufunc operation above gives us a boolean array. The author then shows what boolean arrays are good for.

In [5]: x < 6
Out[5]: 
array([[ True,  True,  True,  True],
       [False, False,  True,  True],
       [ True,  True, False, False]], dtype=bool)

In [6]: np.count_nonzero(_)
Out[6]: 8

In [7]: np.sum(x < 6)
Out[7]: 8

In [8]: np.any(x > 8)
Out[8]: True

In [9]: np.all(x < 8, axis = 1)
Out[9]: array([ True, False,  True], dtype=bool)

In [10]: # Working together with boolean operators

In [11]: np.sum((x < 6) & (x >= 0))
Out[11]: 8

Finally, a boolean array can also be used as a mask, which is very similar to logical arrays in MATLAB.

In [12]: x[x < 6]
Out[12]: array([5, 0, 3, 3, 3, 5, 2, 4])

Returning to the rainfall example, masks let us extract the data we want quite elegantly.

# construct a mask of all rainy days
rainy = (inches > 0)

# construct a mask of all summer days (June 21st is the 172nd day)
days = np.arange(365)
summer = (days > 172) & (days < 262)

print("Median precip on rainy days in 2014 (inches):   ",
      np.median(inches[rainy]))
print("Median precip on summer days in 2014 (inches):  ",
      np.median(inches[summer]))
print("Maximum precip on summer days in 2014 (inches): ",
      np.max(inches[summer]))
print("Median precip on non-summer rainy days (inches):",
      np.median(inches[rainy & ~summer]))
Median precip on rainy days in 2014 (inches):    0.194881889764
Median precip on summer days in 2014 (inches):   0.0
Maximum precip on summer days in 2014 (inches):  0.850393700787
Median precip on non-summer rainy days (inches): 0.200787401575

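Finally, note the difference between and / or and & / |. The keywords and / or evaluate the truth of an entire object, whereas & and | are bitwise operators that act element by element on arrays.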
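
A short sketch of that difference, using the same array as in the session above. The bounds 2 and 7 are arbitrary values picked for this example.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

# & compares element by element; the parentheses are required because
# & binds more tightly than the comparison operators.
between = (x > 2) & (x < 7)
print(between.sum())   # 7

# The keyword form asks for the truth value of each whole array,
# which NumPy refuses to guess for arrays with more than one element.
and_error = None
try:
    (x > 2) and (x < 7)
except ValueError as err:
    and_error = str(err)
    print("ValueError:", and_error)
```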

Fancy Indexing

Fancy indexing means using an array as the index into another array (as with the boolean masks in the previous section).

In [14]: ind = np.array([[3, 7], [4, 5]])

In [15]: rand = np.random.RandomState(45)

In [16]: x= rand.randint(100, size = (10, 5))

In [17]: x
Out[17]: 
array([[75, 30,  3, 32, 95],
       [61, 85, 35, 68, 15],
       [65, 14, 53, 57, 72],
       [87, 46,  8, 53, 12],
       [34, 24, 12, 17, 68],
       [30, 56, 14, 36, 31],
       [86, 36, 57, 61, 79],
       [17,  6, 42, 11,  8],
       [49, 77, 75, 63, 42],
       [54, 16, 24, 95, 63]])

In [18]: x[ind]
Out[18]: 
array([[[87, 46,  8, 53, 12],
        [17,  6, 42, 11,  8]],

       [[34, 24, 12, 17, 68],
        [30, 56, 14, 36, 31]]])

In [19]: # The shape of the result reflects the shape of the index arrays rather than the shape of the array being indexed

In [20]: X = np.arange(12).reshape((3, 4))

In [21]: X
Out[21]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [22]: row = np.array([0, 1, 2])

In [23]: col = np.array([2, 1, 3])

In [24]: X[row, col]
Out[24]: array([ 2,  5, 11])

In [25]: # We get the elements at positions (0, 2), (1, 1) and (2, 3)
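
Index arrays are paired up using the broadcasting rules from earlier, so turning the row indices into a column produces every combination of row and column index instead of three pairs. The names pairs and grid are chosen for this sketch.

```python
import numpy as np

X = np.arange(12).reshape((3, 4))
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])

pairs = X[row, col]                 # elements (0, 2), (1, 1), (2, 3)
grid = X[row[:, np.newaxis], col]   # every row index combined with every column index

print(pairs)   # [ 2  5 11]
print(grid)
```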

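From here on, X refers to a new 100 x 2 array of points; the lines that created it are not included in these notes.

In [34]: X.shape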
Out[34]: (100, 2)

In [35]: import matplotlib.pyplot as plt

In [36]: import seaborn; seaborn.set()

In [37]: plt.scatter(X[:, 0], X[:, 1])
Out[37]: <matplotlib.collections.PathCollection at 0x7f0cc9c461d0>
<matplotlib.figure.Figure at 0x7f0cc9c6b5f8>

In [38]: plt.show()

In [39]: indices = np.random.choice(X.shape[0], 20, replace = False)

In [40]: indices
Out[40]: 
array([15, 87, 73, 17, 44, 66, 89, 91,  8, 25, 19, 39, 85, 49, 26, 20, 58,
       41, 55, 24])

In [41]: selection = X[indices] # fancy indexing

In [42]: selection
Out[42]: 
array([[ -1.80623391e-01,  -2.15707232e+00],
       [ -8.04178492e-01,  -1.34828994e+00],
       [ -1.24272035e+00,  -2.42157557e+00],
       [  3.57111518e-01,   8.94495954e-02],
       [  2.15274973e+00,   3.24279140e+00],
       [ -4.18439156e-01,  -8.58736471e-01],
       [  6.08859877e-01,  -2.59284917e-01],
       [ -6.29633042e-01,   1.32258627e-01],
       [  1.11113414e+00,   1.77185490e+00],
       [  1.65522319e+00,   4.23558698e+00],
       [ -1.40629915e-01,  -1.62069848e-01],
       [  5.21162541e-01,   2.89756456e+00],
       [ -1.11282410e+00,  -1.82987036e+00],
       [ -5.71948987e-01,  -3.34258009e+00],
       [ -2.34528800e+00,  -3.77554207e+00],
       [ -2.58467915e-01,  -8.69598951e-01],
       [ -1.46270269e-01,  -1.27384266e-04],
       [ -7.79152780e-02,  -2.01423478e+00],
       [ -1.79097697e+00,  -1.08351482e+00],
       [ -1.31637907e+00,  -1.86128924e+00]])
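
Since the session above does not show how X was created, here is a self-contained sketch of the same random-subset selection. It assumes X holds 100 points drawn from a two-dimensional normal distribution; the mean, covariance and seed below are illustrative assumptions and will not reproduce the numbers printed above.

```python
import numpy as np

rng = np.random.RandomState(42)   # seed chosen only for reproducibility
mean = [0, 0]                     # assumed parameters, not taken from these notes
cov = [[1, 2],
       [2, 5]]
X = rng.multivariate_normal(mean, cov, 100)

# Pick 20 distinct row indices, then use them as a fancy index.
indices = rng.choice(X.shape[0], 20, replace=False)
selection = X[indices]

print(X.shape, selection.shape)
```

Selecting rows this way is a common step when splitting data into training and test subsets.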

Using Fancy Index to modify values

In [53]: x
Out[53]: array([ 0.,  0.,  2.,  3.,  4.,  0.])

In [54]: i
Out[54]: [2, 3, 3, 4, 4, 4]

In [55]: x[i] += 1

In [56]: x
Out[56]: array([ 0.,  0.,  3.,  4.,  5.,  0.])

In [57]: x = np.zeros(10)

In [58]: np.add.at(x, i, 1) # the proper way to accumulate when indices repeat

In [59]: x
Out[59]: array([ 0.,  0.,  1.,  2.,  3.,  0.,  0.,  0.,  0.,  0.])
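
The reason for the difference is that x[i] += 1 is shorthand for x[i] = x[i] + 1: the right-hand side is evaluated once, and then each indexed position is assigned, so repeated indices do not accumulate. The sketch below puts the two behaviours side by side and adds np.bincount as a cross-check; the variable names are chosen for this example.

```python
import numpy as np

i = [2, 3, 3, 4, 4, 4]

x = np.zeros(10)
x[i] += 1            # each indexed position is assigned once: repeats are lost
print(x)

y = np.zeros(10)
np.add.at(y, i, 1)   # unbuffered in-place add: index 3 gets +2, index 4 gets +3
print(y)

# Counting how often each index occurs gives the same numbers as np.add.at.
z = np.bincount(i, minlength=10)
print(z)
```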

Binning Data

In [67]: np.random.seed(42)

In [68]: x = np.random.randn(100)

In [69]: np.size(x)
Out[69]: 100

In [70]: bins = np.linspace(-5, 5, 20)

In [71]: counts = np.zeros_like(bins)

In [72]: np.size(counts)
Out[72]: 20

In [73]: i = np.searchsorted(bins, x)

In [74]: i
Out[74]: 
array([11, 10, 11, 13, 10, 10, 13, 11,  9, 11,  9,  9, 10,  6,  7,  9,  8,
       11,  8,  7, 13, 10, 10,  7,  9, 10,  8, 11,  9,  9,  9, 14, 10,  8,
       12,  8, 10,  6,  7, 10, 11, 10, 10,  9,  7,  9,  9, 12, 11,  7, 11,
        9,  9, 11, 12, 12,  8,  9, 11, 12,  9, 10,  8,  8, 12, 13, 10, 12,
       11,  9, 11, 13, 10, 13,  5, 12, 10,  9, 10,  6, 10, 11, 13,  9,  8,
        9, 12, 11,  9, 11, 10, 12,  9,  9,  9,  7, 11, 10, 10, 10])

In [75]: x
Out[75]: 
array([ 0.49671415, -0.1382643 ,  0.64768854,  1.52302986, -0.23415337,
       -0.23413696,  1.57921282,  0.76743473, -0.46947439,  0.54256004,
       -0.46341769, -0.46572975,  0.24196227, -1.91328024, -1.72491783,
       -0.56228753, -1.01283112,  0.31424733, -0.90802408, -1.4123037 ,
        1.46564877, -0.2257763 ,  0.0675282 , -1.42474819, -0.54438272,
        0.11092259, -1.15099358,  0.37569802, -0.60063869, -0.29169375,
       -0.60170661,  1.85227818, -0.01349722, -1.05771093,  0.82254491,
       -1.22084365,  0.2088636 , -1.95967012, -1.32818605,  0.19686124,
        0.73846658,  0.17136828, -0.11564828, -0.3011037 , -1.47852199,
       -0.71984421, -0.46063877,  1.05712223,  0.34361829, -1.76304016,
        0.32408397, -0.38508228, -0.676922  ,  0.61167629,  1.03099952,
        0.93128012, -0.83921752, -0.30921238,  0.33126343,  0.97554513,
       -0.47917424, -0.18565898, -1.10633497, -1.19620662,  0.81252582,
        1.35624003, -0.07201012,  1.0035329 ,  0.36163603, -0.64511975,
        0.36139561,  1.53803657, -0.03582604,  1.56464366, -2.6197451 ,
        0.8219025 ,  0.08704707, -0.29900735,  0.09176078, -1.98756891,
       -0.21967189,  0.35711257,  1.47789404, -0.51827022, -0.8084936 ,
       -0.50175704,  0.91540212,  0.32875111, -0.5297602 ,  0.51326743,
        0.09707755,  0.96864499, -0.70205309, -0.32766215, -0.39210815,
       -1.46351495,  0.29612028,  0.26105527,  0.00511346, -0.23458713])

In [76]: np.add.at(counts, i, 1)

In [77]: counts
Out[77]: 
array([  0.,   0.,   0.,   0.,   0.,   1.,   3.,   7.,   9.,  23.,  22.,
        17.,  10.,   7.,   1.,   0.,   0.,   0.,   0.,   0.])
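
The manual binning above can be compared with np.histogram. A sketch, assuming all samples fall strictly inside the bin range and none lands exactly on a bin edge (both hold for this kind of data in practice): counts[k + 1] from the manual version then matches bin k of the histogram, because np.searchsorted returns the index of the right-hand edge of the bin each value falls in.

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.randn(100)
bins = np.linspace(-5, 5, 20)

# Manual binning, as in the session above.
counts = np.zeros_like(bins)
i = np.searchsorted(bins, x)   # index of the first bin edge >= each value
np.add.at(counts, i, 1)

# Library version: 19 bins defined by the 20 edges.
hist, edges = np.histogram(x, bins)

print(np.array_equal(counts[1:], hist))
```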

Sorting

NumPy provides two main sorting-related functions, sort() and argsort().

In [18]: x
Out[18]: array([14, 92, 58, 74, 22])

In [19]: i = np.argsort(x)

In [20]: x[i]
Out[20]: array([14, 22, 58, 74, 92])

Given the index array returned by argsort, we can use fancy indexing to build the sorted array.
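
One practical use of argsort is reordering a second array by the values of the first. A minimal sketch; the labels array is made up for this example.

```python
import numpy as np

x = np.array([14, 92, 58, 74, 22])
labels = np.array(['a', 'b', 'c', 'd', 'e'])   # hypothetical labels, one per value

order = np.argsort(x)   # positions that would sort x

print(order)            # [0 4 2 3 1]
print(x[order])         # [14 22 58 74 92]
print(labels[order])    # ['a' 'e' 'c' 'd' 'b']

# np.sort returns a sorted copy and leaves x itself untouched.
sorted_copy = np.sort(x)
print(x)
```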

In [21]: x = np.arange(1,10)

In [22]: np.random.shuffle(x)

In [23]: x
Out[23]: array([2, 9, 4, 3, 8, 6, 7, 5, 1])

In [24]: np.partition(x, 5)
Out[24]: array([1, 2, 3, 4, 5, 6, 7, 9, 8])

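Using partition instead of sort, we can obtain the k smallest elements: they end up in the first k positions of the result, in arbitrary order, without the whole array being sorted.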
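
A sketch of picking the k smallest values with np.partition, plus np.argpartition for their positions. The value k = 3 is an arbitrary choice; the results are sorted before printing because the order inside each partition is not guaranteed.

```python
import numpy as np

x = np.array([2, 9, 4, 3, 8, 6, 7, 5, 1])
k = 3

smallest = np.partition(x, k)[:k]       # the k smallest values, in no particular order
positions = np.argpartition(x, k)[:k]   # their positions in the original array

print(sorted(smallest.tolist()))        # [1, 2, 3]
print(sorted(x[positions].tolist()))    # [1, 2, 3]
```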

Structured arrays

In [25]: name = ['Alice', 'Bob', 'Cathy', 'Doug']
    ...: age = [25, 45, 37, 19]
    ...: weight = [55.0, 85.5, 68.0, 61.5]
    ...: 

In [26]: x = np.zeros(4, dtype=int)

In [27]: # compound data type

In [28]: data = np.zeros(4, dtype={'names':('name', 'age', 'weight'), 'formats'
    ...: :('U10', 'i4', 'f8')})

In [29]: data.dtype
Out[29]: dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])

In [30]: data['name']=name;data['age']=age;data['weight']=weight

In [31]: data
Out[31]: 
array([('Alice', 25,  55. ), ('Bob', 45,  85.5), ('Cathy', 37,  68. ),
       ('Doug', 19,  61.5)],
      dtype=[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])

In [32]: data[data['age'] < 30]['name']
Out[32]: 
array(['Alice', 'Doug'],
      dtype='<U10')

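Besides structured arrays, NumPy also has record arrays built in. The main difference is that the keys above can be accessed as attributes; the downside is that attribute access is slower than access by key.
Finally, pandas gives us more powerful and efficient tools for working with this kind of data.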
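
A minimal sketch of the record array mentioned above, built by viewing the same structured data as np.recarray. The data matches the session above; the variable name records is chosen for this example.

```python
import numpy as np

data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
data['age'] = [25, 45, 37, 19]
data['weight'] = [55.0, 85.5, 68.0, 61.5]

# Viewing the same memory as a record array adds attribute access to the fields.
records = data.view(np.recarray)

print(records.age)                       # same values as data['age']
print(records[records.age < 30].name)    # names of everyone younger than 30
```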
