# Pandas Data Analysis Tutorial: A Detailed Guide to the Very Handy Groupby

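``````
import numpy as np
import pandas as pd

company = ["A", "B", "C"]

# Build 9 rows of random sample data (matching the sample table shown in this article).
# The values are random, so your own output will differ.
data = pd.DataFrame({
    "company": [company[x] for x in np.random.randint(0, len(company), 9)],
    "salary": np.random.randint(5, 50, 9),
    "age": np.random.randint(15, 50, 9),
})
``````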
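|   | company | salary | age |
|---|---------|--------|-----|
| 0 | C | 43 | 35 |
| 1 | C | 17 | 25 |
| 2 | C | 8 | 30 |
| 3 | A | 20 | 22 |
| 4 | B | 10 | 17 |
| 5 | B | 21 | 40 |
| 6 | A | 23 | 33 |
| 7 | C | 49 | 19 |
| 8 | B | 8 | 30 |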

# 1. The Basic Principle of Groupby

``````
In [5]: group = data.groupby("company")
``````

``````
In [6]: group
Out[6]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002B7E2650240>
``````

``````
In [8]: list(group)
Out[8]:
[('A',   company  salary  age
3       A      20   22
6       A      23   33),
('B',   company  salary  age
4       B      10   17
5       B      21   40
8       B       8   30),
('C',   company  salary  age
0       C      43   35
1       C      17   25
2       C       8   30
7       C      49   19)]
``````

Figure: how groupby works (groupby原理.png)
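The `list(group)` output above shows that a grouped object is essentially a sequence of (key, sub-DataFrame) pairs. The sketch below walks through those pairs directly. It hard-codes the nine sample rows from this article so the result is reproducible; the names `group_sizes` and `company_a` are illustrative only.

```python
import pandas as pd

# Fixed copy of the sample table from this article (instead of random values).
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

group = data.groupby("company")

# Iterating over the grouped object yields (group key, sub-DataFrame) pairs,
# sorted by key by default.
group_sizes = {}
for name, sub_df in group:
    group_sizes[name] = len(sub_df)
    print(name, sub_df.index.tolist())

# get_group returns the sub-DataFrame for a single key.
company_a = group.get_group("A")
print(company_a)
```

Each sub-DataFrame keeps the original row labels, which is why rows 3 and 6 appear under company A.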

# 2. Aggregation with agg

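| Function | Purpose |
|----------|---------|
| `min` | minimum |
| `max` | maximum |
| `sum` | sum |
| `mean` | mean |
| `median` | median |
| `std` | standard deviation |
| `var` | variance |
| `count` | count |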

``````
In [12]: data.groupby("company").agg('mean')
Out[12]:
         salary    age
company
A         21.50  27.50
B         13.00  29.00
C         29.25  27.25
``````

``````
In [17]: data.groupby('company').agg({'salary':'median','age':'mean'})
Out[17]:
         salary    age
company
A          21.5  27.50
B          10.0  29.00
C          30.0  27.25
``````

The `agg` aggregation process can be illustrated as follows (using the second example):

Figure: the agg process illustrated (agg图解.png)
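To make the aggregation step concrete, here is a minimal sketch that reruns the dictionary form of `agg` on a hard-coded copy of the sample table, so the numbers can be checked against the output above. The second call, which passes a list of function names to get several statistics of one column, is not part of the original walkthrough; it is included on the assumption that it is a natural next step, and `summary` and `salary_min_max` are illustrative names.

```python
import pandas as pd

# Fixed copy of the sample table from this article (instead of random values).
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

# A dict maps each column to its own aggregation: one output row per company.
summary = data.groupby("company").agg({"salary": "median", "age": "mean"})
print(summary)

# A list of function names applies several aggregations to the same column.
salary_min_max = data.groupby("company")["salary"].agg(["min", "max"])
print(salary_min_max)
```

In both cases the result has one row per group, with the group key as the index.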

# 3. transform

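What kind of data operation is `transform`, and how does it differ from `agg`? To make the difference between `transform` and `agg` easier to see, the comparison below starts from a practical use case: adding a column that holds the average salary of each employee's company, first with a dictionary and `map`, then with `transform`.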

``````
In [21]: avg_salary_dict = data.groupby('company')['salary'].mean().to_dict()

In [22]: data['avg_salary'] = data['company'].map(avg_salary_dict)

In [23]: data
Out[23]:
  company  salary  age  avg_salary
0       C      43   35       29.25
1       C      17   25       29.25
2       C       8   30       29.25
3       A      20   22       21.50
4       B      10   17       13.00
5       B      21   40       13.00
6       A      23   33       21.50
7       C      49   19       29.25
8       B       8   30       13.00
``````

``````
In [24]: data['avg_salary'] = data.groupby('company')['salary'].transform('mean')

In [25]: data
Out[25]:
  company  salary  age  avg_salary
0       C      43   35       29.25
1       C      17   25       29.25
2       C       8   30       29.25
3       A      20   22       21.50
4       B      10   17       13.00
5       B      21   40       13.00
6       A      23   33       21.50
7       C      49   19       29.25
8       B       8   30       13.00
``````

Figure: the transform process illustrated (transform图解.png)
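The difference between the two operations shows up most clearly in the shape of the result. The sketch below, again on a hard-coded copy of the sample table, compares `agg` and `transform` side by side and checks that the `map` approach and the `transform` approach give the same column. The variable names are illustrative.

```python
import pandas as pd

# Fixed copy of the sample table from this article (instead of random values).
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

grouped_salary = data.groupby("company")["salary"]

# agg collapses each group to one value: the result is indexed by company.
per_company = grouped_salary.agg("mean")

# transform broadcasts the group value back to every original row:
# the result has the same index as `data`.
per_row = grouped_salary.transform("mean")

# The two-step dict + map approach produces the same column.
mapped = data["company"].map(per_company.to_dict())

print(len(per_company), len(per_row))
print(per_row.tolist())
```

Because `transform` preserves the original index, its result can be assigned straight back as a new column, which is what `In [24]` does in one step.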

# 4. apply

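`apply` should be an old friend by now. Compared with `agg` and `transform`, it is more flexible: it accepts any custom function, which makes complex data operations possible. The earlier article "The Three Essentials of Pandas Data Processing: map, apply, and applymap Explained" introduced how to use `apply`. So how does using `apply` after `groupby` differ from what was covered there?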

``````
In [38]: def get_oldest_staff(x):
    ...:     # Sort by age ascending, so the oldest employee ends up in the last row
    ...:     df = x.sort_values(by='age', ascending=True)
    ...:     return df.iloc[-1, :]
    ...:

In [39]: oldest_staff = data.groupby('company', as_index=False).apply(get_oldest_staff)

In [40]: oldest_staff
Out[40]:
  company  salary  age
0       A      23   33
1       B      21   40
2       C      43   35
``````

Figure: the apply process illustrated (apply过程.png)
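As a further illustration of passing a custom function to `apply`, the sketch below applies a small helper to the `salary` column of each group. The helper `salary_spread` is hypothetical and not part of the original article. The last lines cross-check the oldest-employee result using `idxmax` instead of `apply`; this is a different technique, shown only to confirm the same rows come out.

```python
import pandas as pd

# Fixed copy of the sample table from this article (instead of random values).
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})


def salary_spread(s):
    """Hypothetical helper: gap between the highest and lowest salary in a group."""
    return s.max() - s.min()


# apply calls the function once per group, passing that group's salary Series.
spread = data.groupby("company")["salary"].apply(salary_spread)
print(spread)

# Cross-check without apply: idxmax gives the row label of the oldest
# employee in each company, and loc pulls out those rows.
oldest_idx = data.groupby("company")["age"].idxmax()
oldest_staff = data.loc[oldest_idx].reset_index(drop=True)
print(oldest_staff)
```

For a simple "pick one row per group" task, the `idxmax` route avoids calling a Python function per group; `apply` remains the general tool when the per-group logic does not fit a built-in.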