Pandas数据规整 - 转换 - 层次化索引
层次化索引(hierarchical indexing)使你能在一个轴上拥有超过1个索引级别
层次化索引能以低维度形式处理高维度数据
In [1]:
import numpy as np
import pandas as pd
In [2]:
data = pd.Series(np.random.randn(9), index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'], [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data
Out[2]:
a 1 1.109219
2 -1.157204
3 -0.193946
b 1 1.446870
3 1.178472
c 1 -0.021896
2 0.517590
d 2 -1.629166
3 -0.174633
dtype: float64
In [3]:
data.index
Out[3]:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])
层次化索引的一维数据可以模拟二维数据
In [4]:
data.unstack()
Out[4]:
1 | 2 | 3 | |
---|---|---|---|
a | 1.109219 | -1.157204 | -0.193946 |
b | 1.446870 | NaN | 1.178472 |
c | -0.021896 | 0.517590 | NaN |
d | NaN | -1.629166 | -0.174633 |
Series层次化索引的查询
In [5]:
data
Out[5]:
a 1 1.109219
2 -1.157204
3 -0.193946
b 1 1.446870
3 1.178472
c 1 -0.021896
2 0.517590
d 2 -1.629166
3 -0.174633
dtype: float64
In [7]:
# 默认索引
data[0] # 查询单值
data[[0, 3]] # 查询多值
Out[7]:
a 1 1.109219
b 1 1.446870
dtype: float64
In [10]:
# 外层索引
data['a'] # 查询单值
data[['a', 'c']] # 查询多值
data['a':'c'] # 切片
Out[10]:
a 1 1.109219
2 -1.157204
3 -0.193946
b 1 1.446870
3 1.178472
c 1 -0.021896
2 0.517590
dtype: float64
In [13]:
# loc查询
data.loc['a'] # 查询外层索引
data.loc['a', 2] # 外层、内层
data.loc[:, 2] # 外层所有,内层2
Out[13]:
a -1.157204
c 0.517590
d -1.629166
dtype: float64
将层次化索引的Series转为DataFrame
In [14]:
data
Out[14]:
a 1 1.109219
2 -1.157204
3 -0.193946
b 1 1.446870
3 1.178472
c 1 -0.021896
2 0.517590
d 2 -1.629166
3 -0.174633
dtype: float64
In [15]:
data.unstack()
Out[15]:
1 | 2 | 3 | |
---|---|---|---|
a | 1.109219 | -1.157204 | -0.193946 |
b | 1.446870 | NaN | 1.178472 |
c | -0.021896 | 0.517590 | NaN |
d | NaN | -1.629166 | -0.174633 |
In [16]:
data.unstack().stack()
Out[16]:
a 1 1.109219
2 -1.157204
3 -0.193946
b 1 1.446870
3 1.178472
c 1 -0.021896
2 0.517590
d 2 -1.629166
3 -0.174633
dtype: float64
In [17]:
data
Out[17]:
a 1 1.109219
2 -1.157204
3 -0.193946
b 1 1.446870
3 1.178472
c 1 -0.021896
2 0.517590
d 2 -1.629166
3 -0.174633
dtype: float64
In [22]:
# 交换索引顺序
data.unstack().unstack().dropna()
data.unstack().T.stack()
Out[22]:
1 a 1.109219
b 1.446870
c -0.021896
2 a -1.157204
c 0.517590
d -1.629166
3 a -0.193946
b 1.178472
d -0.174633
dtype: float64
DataFrame层次化索引
In [23]:
frame = pd.DataFrame(
np.arange(12).reshape((4, 3)),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=[['Ohio', 'Ohio', 'Colorado'],['Green', 'Red', 'Green']],
)
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
Out[23]:
In [24]:
frame.index
frame.columns
Out[24]:
MultiIndex(levels=[['Colorado', 'Ohio'], ['Green', 'Red']],
labels=[[1, 1, 0], [0, 1, 0]],
names=['state', 'color'])
DataFrame层次化索引查询
In [25]:
frame
Out[25]:
In [26]:
frame.loc['a'] # 外层行索引
Out[26]:
In [35]:
# 外层行索引,内层行索引
frame.loc['a', 2]
Out[35]:
state color
Ohio Green 3
Red 4
Colorado Green 5
Name: (a, 2), dtype: int32
In [27]:
frame.loc['a', 'Ohio'] # 外层行,外层列
Out[27]:
color | Green | Red |
---|---|---|
key2 | ||
1 | 0 | 1 |
2 | 3 | 4 |
In [28]:
# 外层列,内层列
frame['Ohio', 'Green']
Out[28]:
key1 key2
a 1 0
2 3
b 1 6
2 9
Name: (Ohio, Green), dtype: int32
In [29]:
frame
Out[29]:
综合应用
In [33]:
# 外层行,内层行,外层列,内层列
frame.loc['a', 1]['Ohio', 'Red']
# 外层行,外层列,内层行,内层列
frame.loc['a', 'Ohio'].loc[1, 'Green']
Out[33]:
0
重排与分级排序
调整某条轴上各级别的顺序,或根据指定级别上的值对数据进行排序
In [44]:
frame
Out[44]:
In [59]:
# 行索引交换层级
frame.swaplevel()
frame.swaplevel('key1', 'key2')
frame.swaplevel('key2', 'key1')
frame.swaplevel(1, 0)
frame.swaplevel(0, 1)
Out[59]:
In [60]:
# 列索引交换层级
frame.swaplevel(axis=1)
Out[60]:
按索引层级排序
In [48]:
frame.sort_index(ascending=False) # 行索引排序
Out[48]:
In [49]:
frame.sort_index(ascending=False, level='key2') # 排序索引层级
Out[49]:
In [50]:
frame.sort_index(ascending=False, axis=1) # 列索引排序
Out[50]:
根据级别汇总统计
许多对DataFrame和Series的描述和汇总统计函数都有一个level选项,它用于指定在某条轴上计算的级别
其实是利用了pandas的groupby功能
In [51]:
frame
Out[51]:
In [61]:
frame.sum() # 按行求和
Out[61]:
state color
Ohio Green 18
Red 22
Colorado Green 26
dtype: int64
In [53]:
frame.sum(level='key1')) # 以key1索引分组求和
Out[53]:
In [64]:
# 用groupby实现
# 以外层行索引为分组基准
frame.groupby('key1').sum() # 分组基准,行索引name
frame.groupby(level='key1').sum() # level传入分组基准
frame.groupby(['a', 'a', 'b', 'b']).sum() # 手动构造分组基准
Out[64]:
state | Ohio | Colorado | |
---|---|---|---|
color | Green | Red | Green |
a | 3 | 5 | 7 |
b | 15 | 17 | 19 |
In [66]:
# 以内层行索引为分组基准
frame.groupby('key2').sum() # 分组基准,行索引name
frame.groupby(level='key2').sum() # level传入分组基准
frame.groupby([1,2,1,2]).sum() # 手动构造分组基准
Out[66]:
state | Ohio | Colorado | |
---|---|---|---|
color | Green | Red | Green |
1 | 6 | 8 | 10 |
2 | 12 | 14 | 16 |
按列求和
In [57]:
frame
Out[57]:
In [37]:
frame.sum() # 按行求和
Out[37]:
state color
Ohio Green 18
Red 22
Colorado Green 26
dtype: int64
In [58]:
frame.sum(axis=1) # 按列求和
Out[58]:
key1 key2
a 1 3
2 12
b 1 21
2 30
dtype: int64
以内层列索引为基准实现
In [67]:
frame.sum(axis=1, level='color') # 两个 Green 相加
Out[67]:
In [60]:
# 用分组实现
frame.groupby(['Green', 'Red', 'Green'], axis=1).sum()
frame.groupby(axis=1, level='color').sum()
Out[60]:
以外层列索引为基准
In [62]:
frame
Out[62]:
In [68]:
frame.sum(axis=1, level='state') # Ohio下的两列相加
Out[68]:
In [73]:
frame.groupby(axis=1, level='state').sum().sort_index(ascending=False, axis=1)
# 报错,直接传入分组索引值,默认使用最里层列索引
# frame.groupby(['Ohio', 'Ohio', 'Colorado'], axis=1).sum().sort_index(ascending=False, axis=1)
# 列索引交换层级
frame.swaplevel(axis=1)
frame.swaplevel(axis=1).groupby(['Ohio', 'Ohio', 'Colorado'], axis=1).sum().sort_index(ascending=False, axis=1)
Out[73]:
使用DataFrame的列或行进行索引
将DataFrame的一个或多个列当做行索引来用,或者将行索引变成DataFrame的列
In [74]:
frame2 = pd.DataFrame(
{'a': range(7), 'b': range(7, 0, -1),
'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
'd': [0, 1, 2, 0, 1, 2, 3]}
)
frame2
Out[74]:
a | b | c | d | |
---|---|---|---|---|
0 | 0 | 7 | one | 0 |
1 | 1 | 6 | one | 1 |
2 | 2 | 5 | one | 2 |
3 | 3 | 4 | two | 0 |
4 | 4 | 3 | two | 1 |
5 | 5 | 2 | two | 2 |
6 | 6 | 1 | two | 3 |
将普通列转为行索引
In [75]:
frame2.set_index('a')
frame3 = frame2.set_index(['c', 'd'])
frame3
Out[75]:
In [76]:
frame2.set_index(['c', 'd'], append=True) # 增加索引,非替换,保留原索引
Out[76]:
a | b | |||
---|---|---|---|---|
c | d | |||
0 | one | 0 | 0 | 7 |
1 | one | 1 | 1 | 6 |
2 | one | 2 | 2 | 5 |
3 | two | 0 | 3 | 4 |
4 | two | 1 | 4 | 3 |
5 | two | 2 | 5 | 2 |
6 | two | 3 | 6 | 1 |
In [77]:
frame2.set_index(['c', 'd'], drop=False) # 列转索引后,保留原列
Out[77]:
行索引转为普通列
In [78]:
frame3
Out[78]:
In [79]:
frame3.reset_index()
Out[79]:
c | d | a | b | |
---|---|---|---|---|
0 | one | 0 | 0 | 7 |
1 | one | 1 | 1 | 6 |
2 | one | 2 | 2 | 5 |
3 | two | 0 | 3 | 4 |
4 | two | 1 | 4 | 3 |
5 | two | 2 | 5 | 2 |
6 | two | 3 | 6 | 1 |
转换列索引
In [53]:
frame2
Out[53]:
a | b | c | d | |
---|---|---|---|---|
0 | 0 | 7 | one | 0 |
1 | 1 | 6 | one | 1 |
2 | 2 | 5 | one | 2 |
3 | 3 | 4 | two | 0 |
4 | 4 | 3 | two | 1 |
5 | 5 | 2 | two | 2 |
6 | 6 | 1 | two | 3 |
In [82]:
# 先将原表转置,修改行索引后,再转置
frame2.T.set_index(0, append=True).T
Out[82]:
a | b | c | d | |
---|---|---|---|---|
0 | 0 | 7 | one | 0 |
1 | 1 | 6 | one | 1 |
2 | 2 | 5 | one | 2 |
3 | 3 | 4 | two | 0 |
4 | 4 | 3 | two | 1 |
5 | 5 | 2 | two | 2 |
6 | 6 | 1 | two | 3 |