python 数据聚合groupby和分组运算transform , apply

小结

这个跟SQL很像,好处就是直接在python集成处理,不用像以前那样要分开在数据库里操作。

  1. Groupby 分组聚合
    1.1. 基本操作
    1.2. 对分组进行迭代
    1.3. 选取一个或一组列
  2. 数据聚合
    2.1. 分组级转换 transform
    2.2. apply "拆分-应用-合并"

1. Groupby 分组聚合

Groupby.png

1.1. 基本操作

引入相关库:

import pandas as pd
import numpy as np
from pandas import DataFrame,Series

基本格式

DataFrame['数据处理序列'].groupby(键值).操作()

例子:
以下的分组键是Series

In [96]: df = DataFrame({'key1':['a','a','b','b','a'],
    ...: 'key2':['one','two','one','two','one'],
    ...: 'data1':np.random.randn(5),
    ...: 'data2':np.random.randn(5)})

In [97]:

In [97]: df
Out[97]:
  key1 key2     data1     data2
0    a  one -1.390006  0.277334
1    a  two -1.309464  1.245893
2    b  one  1.293608  1.206705
3    b  two -0.549139  0.140945
4    a  one  2.292085  0.271638

In [98]: grouped = df['data1'].groupby(df['key1'])

# 操作

In [99]: grouped
Out[99]: <pandas.core.groupby.groupby.SeriesGroupBy object at 0x000000EA33F9B438
>

# mean() 看平均值
In [100]: grouped.mean()
Out[100]:
key1
a   -0.135795
b    0.372235
Name: data1, dtype: float64

In [101]: means = df['data1'].groupby([df['key1'],df['key2']]).mean()

In [102]: means
Out[102]:
key1  key2
a     one     0.451039
      two    -1.309464
b     one     1.293608
      two    -0.549139
Name: data1, dtype: float64

In [103]: means.unstack()
Out[103]:
key2       one       two
key1
a     0.451039 -1.309464
b     1.293608 -0.549139

以下键值是任意数组

#  np.array 数组作为键值

In [5]: states = np.array(['O','C','C','O','O'])

In [6]: years = np.array([2005,2005,2006,2005,2006])

In [7]: df['data1'].groupby([states,years]).mean()
Out[7]:
C  2005    1.325095
   2006    0.111973
O  2005   -0.428585
   2006    0.479145
Name: data1, dtype: float64

# 仅指定键值/键值对,对全部数据序列进行groupby

In [9]: df.groupby('key1').mean()
Out[9]:
         data1     data2
key1
a     0.543673  0.807876
b    -0.285988  1.684493

In [10]: df.groupby(['key1','key2']).mean()
Out[10]:
              data1     data2
key1 key2
a    one   0.152961  1.385948
     two   1.325095 -0.348269
b    one   0.111973  1.977837
     two  -0.683948  1.391149

# 查看分组大小

In [11]: df.groupby('key1').size()
Out[11]:
key1
a    3
b    2
dtype: int64

In [12]: df.groupby(['key1','key2']).size()
Out[12]:
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

1.2. 对分组进行迭代

  1. 查看某个键/键值对的详细分组信息
# 需要处理的DataFrame

In [14]: df
Out[14]:
  key1 key2     data1     data2
0    a  one -0.173222  1.621207
1    a  two  1.325095 -0.348269
2    b  one  0.111973  1.977837
3    b  two -0.683948  1.391149
4    a  one  0.479145  1.150689

# 针对一个键值的分组情况

In [16]: for name,group in df.groupby('key1'):
    ...:     print(name)
    ...:     print(group)
    ...:
a
  key1 key2     data1     data2
0    a  one -0.173222  1.621207
1    a  two  1.325095 -0.348269
4    a  one  0.479145  1.150689
b
  key1 key2     data1     data2
2    b  one  0.111973  1.977837
3    b  two -0.683948  1.391149

# 针对键值对的分组情况

In [19]: for (k1,k2),group in df.groupby(['key1','key2']):
    ...:     print(k1,k2)
    ...:     print(group)
    ...:
a one
  key1 key2     data1     data2
0    a  one -0.173222  1.621207
4    a  one  0.479145  1.150689
a two
  key1 key2     data1     data2
1    a  two  1.325095 -0.348269
b one
  key1 key2     data1     data2
2    b  one  0.111973  1.977837
b two
  key1 key2     data1     data2
3    b  two -0.683948  1.391149

  1. 把以上的分组变为字典
In [20]: pieces = dict(list(df.groupby('key1')))

In [21]: pieces
Out[21]:
{'a':   key1 key2     data1     data2
 0    a  one -0.173222  1.621207
 1    a  two  1.325095 -0.348269
 4    a  one  0.479145  1.150689, 'b':   key1 key2     data1     data2
 2    b  one  0.111973  1.977837
 3    b  two -0.683948  1.391149}

In [22]: pieces
Out[22]:
{'a':   key1 key2     data1     data2
 0    a  one -0.173222  1.621207
 1    a  two  1.325095 -0.348269
 4    a  one  0.479145  1.150689, 'b':   key1 key2     data1     data2
 2    b  one  0.111973  1.977837
 3    b  two -0.683948  1.391149}

In [23]: pieces['a']
Out[23]:
  key1 key2     data1     data2
0    a  one -0.173222  1.621207
1    a  two  1.325095 -0.348269
4    a  one  0.479145  1.150689

In [24]: pieces['b']
Out[24]:
  key1 key2     data1     data2
2    b  one  0.111973  1.977837
3    b  two -0.683948  1.391149

1.3. 选取一个或一组列

  1. 基本格式
# normal code 

In [29]: df.groupby('key1')['data1']

In [30]: df[['data2']].groupby(df['key1'])

# Syntactic sugar

df.groupby('key1')['data1']
df.groupby('key1')[['data2']]

# example 

In [33]: df.groupby(['key1','key2'])['data2'].mean()
Out[33]:
key1  key2
a     one     1.385948
      two    -0.348269
b     one     1.977837
      two     1.391149
Name: data2, dtype: float64

2. 数据聚合

简单aggaggregate例子:

In [48]: df
Out[48]:
  key1 key2     data1     data2
0    a  one -0.173222  1.621207
1    a  two  1.325095 -0.348269
2    b  one  0.111973  1.977837
3    b  two -0.683948  1.391149
4    a  one  0.479145  1.150689

In [49]: def PeakToPeak(arr):
    ...:     return arr.max() - arr.min()

In [50]: df.groupby('key1').agg(PeakToPeak)
Out[50]:
         data1     data2
key1
a     1.498317  1.969476
b     0.795921  0.586688

2.1. 分组级转换 transform

  1. 添加一个用于存放各索引分组平均值的列
# previous method

In [68]: df
Out[68]:
  key1 key2     data1     data2
0    a  one -0.173222  1.621207
1    a  two  1.325095 -0.348269
2    b  one  0.111973  1.977837
3    b  two -0.683948  1.391149
4    a  one  0.479145  1.150689

In [69]: k1_means = df.groupby('key1').mean().add_prefix('mean_')

In [70]: k1_means
Out[70]:
      mean_data1  mean_data2
key1
a       0.543673    0.807876
b      -0.285988    1.684493

In [71]:

In [71]: pd.merge(df,k1_means,left_on = 'key1' ,  right_index = True)
Out[71]:
  key1 key2     data1     data2  mean_data1  mean_data2
0    a  one -0.173222  1.621207    0.543673    0.807876
1    a  two  1.325095 -0.348269    0.543673    0.807876
4    a  one  0.479145  1.150689    0.543673    0.807876
2    b  one  0.111973  1.977837   -0.285988    1.684493
3    b  two -0.683948  1.391149   -0.285988    1.684493

# transform method 

In [83]: meanData = df.groupby('key1').transform(np.mean).add_prefix('mean_')

In [84]: meanData
Out[84]:
   mean_data1  mean_data2
0    0.543673    0.807876
1    0.543673    0.807876
2   -0.285988    1.684493
3   -0.285988    1.684493
4    0.543673    0.807876

In [85]: pd.concat([df,meanData], axis = 1)
Out[85]:
  key1 key2     data1     data2  mean_data1  mean_data2
0    a  one -0.173222  1.621207    0.543673    0.807876
1    a  two  1.325095 -0.348269    0.543673    0.807876
2    b  one  0.111973  1.977837   -0.285988    1.684493
3    b  two -0.683948  1.391149   -0.285988    1.684493
4    a  one  0.479145  1.150689    0.543673    0.807876
  1. transform 原数据转换为均值。
In [90]: people
Out[90]:
               a         b         c         d         e
Joe     0.498185  0.460470 -0.892633 -1.561500  0.279949
Steve  -0.885170 -1.490421 -0.787302  1.559050  1.183115
Wes    -0.237464       NaN       NaN -0.043788 -1.091813
Jim    -1.547607 -0.121682 -0.355623 -1.703322 -0.733741
Travis  0.638562  0.486515 -0.233517  0.023372  0.366325

In [94]: key = list('ototo')

# 按键值key,计算均值

In [95]: people.groupby(key).mean()
Out[95]:
          a         b         c         d         e
o  0.299761  0.473492 -0.563075 -0.527305 -0.148513
t -1.216388 -0.806052 -0.571462 -0.072136  0.224687

# 把原数据转换为以上均值

In [96]: people.groupby(key).transform(np.mean)
Out[96]:
               a         b         c         d         e
Joe     0.299761  0.473492 -0.563075 -0.527305 -0.148513
Steve  -1.216388 -0.806052 -0.571462 -0.072136  0.224687
Wes     0.299761  0.473492 -0.563075 -0.527305 -0.148513
Jim    -1.216388 -0.806052 -0.571462 -0.072136  0.224687
Travis  0.299761  0.473492 -0.563075 -0.527305 -0.148513

  1. 计算数据与均值的差值(一般用于规范化数据)
In [105]: people
Out[105]:
               a         b         c         d         e
Joe     0.498185  0.460470 -0.892633 -1.561500  0.279949
Steve  -0.885170 -1.490421 -0.787302  1.559050  1.183115
Wes    -0.237464       NaN       NaN -0.043788 -1.091813
Jim    -1.547607 -0.121682 -0.355623 -1.703322 -0.733741
Travis  0.638562  0.486515 -0.233517  0.023372  0.366325

# 设置规范化值,原数据和均值的差值

In [106]: def demean(arr):
     ...:     return arr - arr.mean()

In [107]: key
Out[107]: ['o', 't', 'o', 't', 'o']

# 将原数据转换为规范化值

In [108]: demeaned = people.groupby(key).transform(demean)

In [109]: demeaned
Out[109]:
               a         b         c         d         e
Joe     0.198424 -0.013023 -0.329558 -1.034194  0.428462
Steve   0.331218 -0.684370 -0.215840  1.631186  0.958428
Wes    -0.537225       NaN       NaN  0.483517 -0.943300
Jim    -0.331218  0.684370  0.215840 -1.631186 -0.958428
Travis  0.338801  0.013023  0.329558  0.550677  0.514838

# 理论上结果是0,由于计算机浮点值限制,实际上是无限趋近于零的极小值

In [110]: demeaned.groupby(key).mean()
Out[110]:
              a             b      ...                  d             e
o  1.850372e-17  2.775558e-17      ...      -3.700743e-17 -3.700743e-17
t  5.551115e-17  0.000000e+00      ...       0.000000e+00  0.000000e+00

[2 rows x 5 columns]

2.2. apply "拆分-应用-合并"

将待处理对象拆分为多个对象-->对每一个对象应用函数-->最后合并数据

  1. 基本使用方法
# 定义 top 函数,显示'tip_pct'最高的几列

In [114]: def top(df, n=5, column = 'tip_pct'):
     ...:     return df.sort_index(by = column)[-n:]

# 直接应用函数

In [115]: top(tips)
C:\Users\jasminezhang\Anaconda3\Scripts\ipython:2: FutureWarning: by argument to
 sort_index is deprecated, please use .sort_values(by=...)
Out[115]:
     total_bill   tip     sex smoker  day    time  size   tip_pct
183       23.17  6.50    Male    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39    Male     No  Sat  Dinner     2  0.291990
67         3.07  1.00  Female    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00  Female    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Male    Yes  Sun  Dinner     2  0.710345

In [117]: top(tips,n=6)
C:\Users\jasminezhang\Anaconda3\Scripts\ipython:2: FutureWarning: by argument to
 sort_index is deprecated, please use .sort_values(by=...)
Out[117]:
     total_bill   tip     sex smoker  day    time  size   tip_pct
109       14.31  4.00  Female    Yes  Sat  Dinner     2  0.279525
183       23.17  6.50    Male    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39    Male     No  Sat  Dinner     2  0.291990
67         3.07  1.00  Female    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00  Female    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Male    Yes  Sun  Dinner     2  0.710345

# groupby 后应用 apply 函数

In [118]: tips.groupby('smoker').apply(top)
C:\Users\jasminezhang\Anaconda3\Scripts\ipython:2: FutureWarning: by argument to
 sort_index is deprecated, please use .sort_values(by=...)
Out[118]:
            total_bill   tip     sex    ...       time size   tip_pct
smoker                                  ...
No     88        24.71  5.85    Male    ...      Lunch    2  0.236746
       185       20.69  5.00    Male    ...     Dinner    5  0.241663
       51        10.29  2.60  Female    ...     Dinner    2  0.252672
       149        7.51  2.00    Male    ...      Lunch    2  0.266312
       232       11.61  3.39    Male    ...     Dinner    2  0.291990
Yes    109       14.31  4.00  Female    ...     Dinner    2  0.279525
       183       23.17  6.50    Male    ...     Dinner    4  0.280535
       67         3.07  1.00  Female    ...     Dinner    1  0.325733
       178        9.60  4.00  Female    ...     Dinner    2  0.416667
       172        7.25  5.15    Male    ...     Dinner    2  0.710345

[10 rows x 8 columns]


In [119]: tips.groupby(['smoker', 'day']).apply(top, n = 1 , column = 'total_bi
     ...: ll')
C:\Users\jasminezhang\Anaconda3\Scripts\ipython:2: FutureWarning: by argument to
 sort_index is deprecated, please use .sort_values(by=...)
Out[119]:
                 total_bill    tip     sex    ...       time size   tip_pct
smoker day                                    ...
No     Fri  94        22.75   3.25  Female    ...     Dinner    2  0.142857
       Sat  212       48.33   9.00    Male    ...     Dinner    4  0.186220
       Sun  156       48.17   5.00    Male    ...     Dinner    6  0.103799
       Thur 142       41.19   5.00    Male    ...      Lunch    5  0.121389
Yes    Fri  95        40.17   4.73    Male    ...     Dinner    4  0.117750
       Sat  170       50.81  10.00    Male    ...     Dinner    3  0.196812
       Sun  182       45.35   3.50    Male    ...     Dinner    3  0.077178
       Thur 197       43.11   5.00  Female    ...      Lunch    4  0.115982

[8 rows x 8 columns]


In [123]: result = tips.groupby('smoker')['tip_pct'].describe()

In [124]: result
Out[124]:
        count      mean       std    ...          50%       75%       max
smoker                               ...
No      151.0  0.159328  0.039910    ...     0.155625  0.185014  0.291990
Yes      93.0  0.163196  0.085119    ...     0.153846  0.195059  0.710345

[2 rows x 8 columns]

In [125]:

In [125]: result.T
Out[125]:
smoker          No        Yes
count   151.000000  93.000000
mean      0.159328   0.163196
std       0.039910   0.085119
min       0.056797   0.035638
25%       0.136906   0.106771
50%       0.155625   0.153846
75%       0.185014   0.195059
max       0.291990   0.710345
  1. 分组键是否禁止 group_keys - True/False
In [127]: tips.groupby('smoker', group_keys= False).apply(top)
C:\Users\jasminezhang\Anaconda3\Scripts\ipython:2: FutureWarning: by argument to
 sort_index is deprecated, please use .sort_values(by=...)
Out[127]:
     total_bill   tip     sex smoker   day    time  size   tip_pct
88        24.71  5.85    Male     No  Thur   Lunch     2  0.236746
185       20.69  5.00    Male     No   Sun  Dinner     5  0.241663
51        10.29  2.60  Female     No   Sun  Dinner     2  0.252672
149        7.51  2.00    Male     No  Thur   Lunch     2  0.266312
232       11.61  3.39    Male     No   Sat  Dinner     2  0.291990
109       14.31  4.00  Female    Yes   Sat  Dinner     2  0.279525
183       23.17  6.50    Male    Yes   Sun  Dinner     4  0.280535
67         3.07  1.00  Female    Yes   Sat  Dinner     1  0.325733
178        9.60  4.00  Female    Yes   Sun  Dinner     2  0.416667
172        7.25  5.15    Male    Yes   Sun  Dinner     2  0.710345

In [128]: tips.groupby('smoker').apply(top)
C:\Users\jasminezhang\Anaconda3\Scripts\ipython:2: FutureWarning: by argument to
 sort_index is deprecated, please use .sort_values(by=...)
Out[128]:
            total_bill   tip     sex    ...       time size   tip_pct
smoker                                  ...
No     88        24.71  5.85    Male    ...      Lunch    2  0.236746
       185       20.69  5.00    Male    ...     Dinner    5  0.241663
       51        10.29  2.60  Female    ...     Dinner    2  0.252672
       149        7.51  2.00    Male    ...      Lunch    2  0.266312
       232       11.61  3.39    Male    ...     Dinner    2  0.291990
Yes    109       14.31  4.00  Female    ...     Dinner    2  0.279525
       183       23.17  6.50    Male    ...     Dinner    4  0.280535
       67         3.07  1.00  Female    ...     Dinner    1  0.325733
       178        9.60  4.00  Female    ...     Dinner    2  0.416667
       172        7.25  5.15    Male    ...     Dinner    2  0.710345

[10 rows x 8 columns]
  1. 数据集的桶(bucket)和分位数(quantity)分析

先把数据集用cut / qcut 分成数据桶(块),然后用groupby/apply进行分位数分析。

In [129]: frame = DataFrame({'data1':np.random.randn(1000),
     ...:                    'data2':np.random.randn(1000)})


# using cut 

In [130]: factor = pd.cut(frame.data1, 4)

In [131]: factor[:10]
Out[131]:
0     (-0.286, 1.462]
1     (-0.286, 1.462]
2    (-2.034, -0.286]
3      (1.462, 3.209]
4     (-0.286, 1.462]
5     (-0.286, 1.462]
6    (-3.788, -2.034]
7    (-3.788, -2.034]
8     (-0.286, 1.462]
9     (-0.286, 1.462]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.788, -2.034] < (-2.034, -0.286] < (-0.28
6, 1.462] <
                                    (1.462, 3.209]]

In [132]: len(factor)
Out[132]: 1000

# 定义 { 函数名:函数 }
In [134]: def get_stats(group):
     ...:     return { 'min': group.min(), 'max' : group.max(),
     ...:             'count': group.count(), 'mean' : group.mean()}


In [135]: grouped = frame.data2.groupby(factor)

In [136]: grouped.apply(get_stats).unstack()
Out[136]:
                  count       max      mean       min
data1
(-3.788, -2.034]   18.0  1.701853  0.276175 -1.210724
(-2.034, -0.286]  383.0  2.860290  0.006860 -2.509304
(-0.286, 1.462]   513.0  3.147908  0.081546 -2.712499
(1.462, 3.209]     86.0  2.066749 -0.001550 -2.043683


# use qcut

In [137]: grouping = pd.qcut(frame.data1, 10, labels=False)

In [138]: grouped = frame.data2.groupby(grouping)

In [139]: grouped.apply(get_stats).unstack()
Out[139]:
       count       max      mean       min
data1
0      100.0  2.585796  0.094647 -2.299329
1      100.0  2.601896  0.110122 -2.396772
2      100.0  2.860290 -0.025375 -2.177280
3      100.0  2.139157 -0.104499 -2.509304
4      100.0  2.826224  0.143575 -2.290512
5      100.0  3.147908  0.106912 -1.686569
6      100.0  2.173290  0.056089 -1.773193
7      100.0  1.974363 -0.036664 -2.352925
8      100.0  2.182190  0.161370 -2.712499
9      100.0  2.066749 -0.013192 -2.043683

  1. 用分组特定值填充缺失值

4.1. 填充单行序列


In [140]: s = Series(np.random.randn(6))

In [141]: s[::2] = np.nan

In [142]: s
Out[142]:
0         NaN
1   -0.504470
2         NaN
3   -0.358606
4         NaN
5   -0.257657
dtype: float64

In [143]: s.fillna(s.mean())
Out[143]:
0   -0.373577
1   -0.504470
2   -0.373577
3   -0.358606
4   -0.373577
5   -0.257657
dtype: float64

4.2. 分组填充缺失值

apply后因为是对各分组进行操作,一般会用到lambda函数,或类似的def函数


In [178]: numbers = ContinueLetter('a',8)

In [179]: numbers
Out[179]: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

In [180]: group_key = ['first'] * 4 + ['second'] * 4

In [181]: data = Series(np.random.randn(8) , index = numbers)

In [183]: data[['a','f','h']] = np.nan

In [184]: data
Out[184]:
a         NaN
b    0.670058
c   -0.931242
d   -0.512491
e    0.150320
f         NaN
g    0.266838
h         NaN
dtype: float64

In [187]: fill_mean = lambda g : g.fillna(g.mean())

In [188]: data.groupby(group_key).apply(fill_mean)
Out[188]:
a   -0.257892
b    0.670058
c   -0.931242
d   -0.512491
e    0.150320
f    0.208579
g    0.266838
h    0.208579
dtype: float64


2018.8.24 《用python进行数据分析》

推荐阅读更多精彩内容