《Pandas Cookbook》第08章数据清理

第01章 Pandas基础
 第02章 DataFrame运算
 第03章数据分析入门
 第04章选取数据子集
 第05章布尔索引
 第06章索引对齐
 第07章分组聚合、过滤、转换
第08章数据清理
第09章合并Pandas对象
 第10章时间序列分析
 第11章用Matplotlib、Pandas、Seaborn进行可视化

 In[1]: import pandas as pd
        import numpy as np

1. 用stack清理变量值作为列名

# 加载state_fruit数据集
 In[2]: state_fruit = pd.read_csv('data/state_fruit.csv', index_col=0)
        state_fruit
out[2]:

# stack方法可以将所有列名，转变为垂直的一级行索引
 In[3]: state_fruit.stack()
out[3]: Texas    Apple      12
                 Orange     10
                 Banana     40
        Arizona  Apple       9
                 Orange      7
                 Banana     12
        Florida  Apple       0
                 Orange     14
                 Banana    190
        dtype: int64

# 使用reset_index()，将结果变为DataFrame
 In[4]: state_fruit_tidy = state_fruit.stack().reset_index()
        state_fruit_tidy
out[4]:

# 重命名列名
 In[5]: state_fruit_tidy.columns = ['state', 'fruit', 'weight']
        state_fruit_tidy
out[5]:

# 也可以使用rename_axis给不同的行索引层级命名
 In[6]: state_fruit.stack()\
                   .rename_axis(['state', 'fruit'])\
out[6]: state    fruit 
        Texas    Apple      12
                 Orange     10
                 Banana     40
        Arizona  Apple       9
                 Orange      7
                 Banana     12
        Florida  Apple       0
                 Orange     14
                 Banana    190
        dtype: int64

# 再次使用reset_index方法
 In[7]: state_fruit.stack()\
                   .rename_axis(['state', 'fruit'])\
                   .reset_index(name='weight')
out[7]:

# 读取state_fruit2数据集
 In[8]: state_fruit2 = pd.read_csv('data/state_fruit2.csv')
        state_fruit2
out[8]:

# 州名不在行索引的位置上，使用stack将所有列名变为一个长Series
 In[9]: state_fruit2.stack()
out[9]: 0  State       Texas
           Apple          12
           Orange         10
           Banana         40
        1  State     Arizona
           Apple           9
           Orange          7
           Banana         12
        2  State     Florida
           Apple           0
           Orange         14
           Banana        190
        dtype: object

# 先设定state作为行索引名，再stack，可以得到和前面相似的结果
 In[10]: state_fruit2.set_index('State').stack()
out[10]: 0  State       Texas
            Apple          12
            Orange         10
            Banana         40
         1  State     Arizona
            Apple           9
            Orange          7
            Banana         12
         2  State     Florida
            Apple           0
            Orange         14
            Banana        190
         dtype: object

2. 用melt清理变量值作为列名

# 读取state_fruit2数据集
 In[11]: state_fruit2 = pd.read_csv('data/state_fruit2.csv')
         state_fruit2
out[11]:

# 使用melt方法，将列传给id_vars和value_vars。melt可以将原先的列名作为变量，原先的值作为值。
 In[12]: state_fruit2.melt(id_vars=['State'],
                           value_vars=['Apple', 'Orange', 'Banana'])
out[12]:

# 随意设定一个行索引
 In[13]: state_fruit2.index=list('abc')
         state_fruit2.index.name = 'letter'
 In[14]: state_fruit2
out[14]:

# var_name和value_name可以用来重命名新生成的变量列和值的列
 In[15]: state_fruit2.melt(id_vars=['State'],
                      value_vars=['Apple', 'Orange', 'Banana'],
                      var_name='Fruit',
                      value_name='Weight')
out[15]:

# 如果你想让所有值都位于一列，旧的列标签位于另一列，可以直接使用melt
 In[16]: state_fruit2.melt()
out[16]:

# 要指明id变量，只需使用id_vars参数
 In[17]: state_fruit2.melt(id_vars='State')
out[17]:

3. 同时stack多组变量

# 读取movie数据集，选取所有演员名和其Facebook likes
 In[18]: movie = pd.read_csv('data/movie.csv')
         actor = movie[['movie_title', 'actor_1_name', 'actor_2_name', 'actor_3_name', 
               'actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes']]
         actor.head()
out[18]:

# 创建一个自定义函数，用来改变列名。wide_to_long要求分组的变量要有相同的数字结尾：
 In[19]: def change_col_name(col_name):
             col_name = col_name.replace('_name', '')
             if 'facebook' in col_name:
                 fb_idx = col_name.find('facebook')
                 col_name = col_name[:5] + col_name[fb_idx - 1:] + col_name[5:fb_idx-1]
             return col_name
 In[20]: actor2 = actor.rename(columns=change_col_name)
         actor2.head()
out[20]:

# 使用wide_to_long函数，同时stack两列actor和Facebook
 In[21]: stubs = ['actor', 'actor_facebook_likes']
         actor2_tidy = pd.wide_to_long(actor2, 
                                       stubnames=stubs, 
                                       i=['movie_title'], 
                                       j='actor_num', 
                                       sep='_').reset_index()
         actor2_tidy.head()
out[21]:

# 加载数据
 In[22]: df = pd.read_csv('data/stackme.csv')
         df
out[22]:

# 对列重命名
 In[23]: df2 = df.rename(columns = {'a1':'group1_a1', 'b2':'group1_b2',
                                    'd':'group2_a1', 'e':'group2_b2'})
         df2
out[23]:

# 设定stubnames=['group1', 'group2']，对任何数字都起作用
 In[24]: pd.wide_to_long(df2, 
                         stubnames=['group1', 'group2'], 
                         i=['State', 'Country', 'Test'], 
                         j='Label', 
                         suffix='.+', 
                         sep='_')
out[24]:

4. 反转stacked数据

# 读取college数据集，学校名作为行索引，，只选取本科生的列
 In[25]: usecol_func = lambda x: 'UGDS_' in x or x == 'INSTNM'
         college = pd.read_csv('data/college.csv', 
                                   index_col='INSTNM', 
                                   usecols=usecol_func)
         college.head()
out[25]:

# 用stack方法，将所有水平列名，转化为垂直的行索引
 In[26]: college_stacked = college.stack()
         college_stacked.head(18)
out[26]: INSTNM                                         
Alabama A & M University             UGDS_WHITE    0.0333
                                     UGDS_BLACK    0.9353
                                     UGDS_HISP     0.0055
                                     UGDS_ASIAN    0.0019
                                     UGDS_AIAN     0.0024
                                     UGDS_NHPI     0.0019
                                     UGDS_2MOR     0.0000
                                     UGDS_NRA      0.0059
                                     UGDS_UNKN     0.0138
University of Alabama at Birmingham  UGDS_WHITE    0.5922
                                     UGDS_BLACK    0.2600
                                     UGDS_HISP     0.0283
                                     UGDS_ASIAN    0.0518
                                     UGDS_AIAN     0.0022
                                     UGDS_NHPI     0.0007
                                     UGDS_2MOR     0.0368
                                     UGDS_NRA      0.0179
                                     UGDS_UNKN     0.0100
dtype: float64

# unstack方法可以将其还原
 In[27]: college_stacked.unstack().head()
out[27]:

# 另一种方式是先用melt，再用pivot。先加载数据，不指定行索引名
 In[28]: college2 = pd.read_csv('data/college.csv', 
                               usecols=usecol_func)
         college2.head()
out[28]:

# 使用melt，将所有race列变为一列
 In[29]: college_melted = college2.melt(id_vars='INSTNM', 
                                        var_name='Race',
                                        value_name='Percentage')
         college_melted.head()
out[29]:

# 用pivot还原
 In[30]: melted_inv = college_melted.pivot(index='INSTNM',
                                           columns='Race',
                                           values='Percentage')
         melted_inv.head()
out[30]:

# 用loc同时选取行和列，然后重置索引，可以获得和原先索引顺序一样的DataFrame
 In[31]: college2_replication = melted_inv.loc[college2['INSTNM'], 
                                               college2.columns[1:]]\
                                                  .reset_index()
         college2.equals(college2_replication)
out[31]: True

# 使用最外层的行索引做unstack
 In[32]: college.stack().unstack(0)
out[32]:

# 转置DataFrame更简单的方法是transpose()或T
 In[33]: college.T
out[33]:

5. 分组聚合后unstacking

# 读取employee数据集，求出每个种族的平均工资
 In[34]: employee = pd.read_csv('data/employee.csv')
 In[35]: employee.groupby('RACE')['BASE_SALARY'].mean().astype(int)
out[35]: RACE
         American Indian or Alaskan Native    60272
         Asian/Pacific Islander               61660
         Black or African American            50137
         Hispanic/Latino                      52345
         Others                               51278
         White                                64419
         Name: BASE_SALARY, dtype: int64

# 对种族和性别分组，求平均工资
 In[36]: agg = employee.groupby(['RACE', 'GENDER'])['BASE_SALARY'].mean().astype(int)
         agg
out[36]: RACE                               GENDER
         American Indian or Alaskan Native  Female    60238
                                            Male      60305
         Asian/Pacific Islander             Female    63226
                                            Male      61033
         Black or African American          Female    48915
                                            Male      51082
         Hispanic/Latino                    Female    46503
                                            Male      54782
         Others                             Female    63785
                                            Male      38771
         White                              Female    66793
                                            Male      63940
         Name: BASE_SALARY, dtype: int64

# 对索引层GENDER做unstack
 In[37]: agg.unstack('GENDER')
out[37]:

# 对索引层RACE做unstack
 In[38]: agg.unstack('RACE')
out[38]:

# 按RACE和GENDER分组，求工资的平均值、最大值和最小值
 In[39]: agg2 = employee.groupby(['RACE', 'GENDER'])['BASE_SALARY'].agg(['mean', 'max', 'min']).astype(int)
         agg2
out[39]:

# 此时unstack('GENDER')会生成多级列索引，可以用stack和unstack调整结构
agg2.unstack('GENDER')

6. 用分组聚合实现透视表

# 读取flights数据集
 In[40]: flights = pd.read_csv('data/flights.csv')
         flights.head()
out[40]:

# 用pivot_table方法求出每条航线每个始发地的被取消的航班总数
 In[41]: fp = flights.pivot_table(index='AIRLINE', 
                                  columns='ORG_AIR', 
                                  values='CANCELLED', 
                                  aggfunc='sum',
                                  fill_value=0).round(2)
         fp.head()
out[41]:

# groupby聚合不能直接复现这张表。需要先按所有index和columns的列聚合
 In[42]: fg = flights.groupby(['AIRLINE', 'ORG_AIR'])['CANCELLED'].sum()
         fg.head()
out[42]: AIRLINE  ORG_AIR
         AA       ATL         3
                  DEN         4
                  DFW        86
                  IAH         3
                  LAS         3
         Name: CANCELLED, dtype: int64

# 再使用unstack，将ORG_AIR这层索引作为列名
 In[43]: fg_unstack = fg.unstack('ORG_AIR', fill_value=0)
         fg_unstack.head()
out[43]:

# 判断两个方式是否等价
 In[44]: fg_unstack = fg.unstack('ORG_AIR', fill_value=0)
         fp.equals(fg_unstack)
out[44]: True

# 先实现一个稍微复杂的透视表
 In[45]: fp2 = flights.pivot_table(index=['AIRLINE', 'MONTH'],
                                   columns=['ORG_AIR', 'CANCELLED'],
                                   values=['DEP_DELAY', 'DIST'],
                                   aggfunc=[np.mean, np.sum],
                                   fill_value=0)
         fp2.head()
out[45]:

# 用groupby和unstack复现上面的方法
 In[46]: flights.groupby(['AIRLINE', 'MONTH', 'ORG_AIR', 'CANCELLED'])['DEP_DELAY', 'DIST'] \
                .agg(['mean', 'sum']) \
                .unstack(['ORG_AIR', 'CANCELLED'], fill_value=0) \
                .swaplevel(0, 1, axis='columns') \
                .head()
out[46]:

7. 为了更容易reshaping，重新命名索引层

# 读取college数据集，分组后，统计本科生的SAT数学成绩信息
 In[47]: college = pd.read_csv('data/college.csv')
 In[48]: cg = college.groupby(['STABBR', 'RELAFFIL'])['UGDS', 'SATMTMID'] \
                     .agg(['count', 'min', 'max']).head(6)
 In[49]: cg
out[49]:

# 行索引的两级都有名字，而列索引没有名字。用rename_axis给列索引的两级命名
 In[50]:cg = cg.rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
        cg
out[50]:

# 将AGG_FUNCS列移到行索引
 In[51]:cg.stack('AGG_FUNCS').head()
out[51]:

# stack默认是将列放到行索引的最内层，可以使用swaplevel改变层级
 In[52]:cg.stack('AGG_FUNCS').swaplevel('AGG_FUNCS', 'STABBR', axis='index').head()
out[52]:

# 在此前的基础上再做sort_index
 In[53]:cg.stack('AGG_FUNCS') \
          .swaplevel('AGG_FUNCS', 'STABBR', axis='index') \
          .sort_index(level='RELAFFIL', axis='index') \
          .sort_index(level='AGG_COLS', axis='columns').head(6)
out[53]:

# 对一些列做stack，对其它列做unstack
 In[54]:cg.stack('AGG_FUNCS').unstack(['RELAFFIL', 'STABBR'])
out[54]:

# 对所有列做stack，会返回一个Series
 In[55]:cg.stack(['AGG_FUNCS', 'AGG_COLS']).head(12)
out[55]:

# 删除行和列索引所有层级的名称
 In[56]:cg.rename_axis([None, None], axis='index').rename_axis([None, None], axis='columns')
out[56]:

8. 当多个变量被存储为列名时进行清理

# 读取weightlifting数据集
 In[57]:weightlifting = pd.read_csv('data/weightlifting_men.csv')
        weightlifting
out[57]:

# 用melt方法，将sex_age放入一个单独的列
 In[58]:wl_melt = weightlifting.melt(id_vars='Weight Category', 
                                     var_name='sex_age', 
                                     value_name='Qual Total')
        wl_melt.head()
out[58]:

# 用split方法将sex_age列分为两列
 In[59]:sex_age = wl_melt['sex_age'].str.split(expand=True)
        sex_age.head()
out[59]:      0         1
      0     M35     35-39
      1     M35     35-39
      2     M35     35-39
      3     M35     35-39
      4     M35     35-39

# 给列起名
 In[60]:sex_age.columns = ['Sex', 'Age Group']
        sex_age.head()
out[60]:

# 只取出字符串中的M
 In[61]:sex_age['Sex'] = sex_age['Sex'].str[0]
        sex_age.head()
out[61]:

# 用concat方法，将sex_age,与wl_cat_total连接起来
 In[62]:wl_cat_total = wl_melt[['Weight Category', 'Qual Total']]
        wl_tidy = pd.concat([sex_age, wl_cat_total], axis='columns')
        wl_tidy.head()
out[62]:

# 上面的结果也可以如下实现
 In[63]:cols = ['Weight Category', 'Qual Total']
        sex_age[cols] = wl_melt[cols]

# 也可以通过assign的方法，动态加载新的列
 In[64]: age_group = wl_melt.sex_age.str.extract('(\d{2}[-+](?:\d{2})?)', expand=False)
         sex = wl_melt.sex_age.str[0]
         new_cols = {'Sex':sex, 
                     'Age Group': age_group}
 In[65]: wl_tidy2 = wl_melt.assign(**new_cols).drop('sex_age', axis='columns')
         wl_tidy2.head()
out[65]:

# 判断两种方法是否等效
 In[66]: wl_tidy2.sort_index(axis=1).equals(wl_tidy.sort_index(axis=1))
out[66]: True

9. 当多个变量被存储为列的值时进行清理

# 读取restaurant_inspections数据集，将Date列的数据类型变为datetime64
 In[67]: inspections = pd.read_csv('data/restaurant_inspections.csv', parse_dates=['Date'])
         inspections.head(10)
out[67]:

# 用info列的所有值造一个新列。但是，Pandas不支持这种功能
 In[68]: inspections.pivot(index=['Name', 'Date'], columns='Info', values='Value')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in __init__(self, values, categories, ordered, fastpath)
    297             try:
--> 298                 codes, categories = factorize(values, sort=True)
    299             except TypeError:

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
    559     check_nulls = not is_integer_dtype(original)
--> 560     labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
    561 

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_labels (pandas/_libs/hashtable.c:21922)()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

During handling of the above exception, another exception occurred:

NotImplementedError                       Traceback (most recent call last)
<ipython-input-68-754f69d68d6c> in <module>()
----> 1 inspections.pivot(index=['Name', 'Date'], columns='Info', values='Value')

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in pivot(self, index, columns, values)
   3851         """
   3852         from pandas.core.reshape.reshape import pivot
-> 3853         return pivot(self, index=index, columns=columns, values=values)
   3854 
   3855     def stack(self, level=-1, dropna=True):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/reshape/reshape.py in pivot(self, index, columns, values)
    375             index = self[index]
    376         indexed = Series(self[values].values,
--> 377                          index=MultiIndex.from_arrays([index, self[columns]]))
    378         return indexed.unstack(columns)
    379 

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/multi.py in from_arrays(cls, arrays, sortorder, names)
   1098         from pandas.core.categorical import _factorize_from_iterables
   1099 
-> 1100         labels, levels = _factorize_from_iterables(arrays)
   1101         if names is None:
   1102             names = [getattr(arr, "name", None) for arr in arrays]

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in _factorize_from_iterables(iterables)
   2191         # For consistency, it should return a list of 2 lists.
   2192         return [[], []]
-> 2193     return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in <listcomp>(.0)
   2191         # For consistency, it should return a list of 2 lists.
   2192         return [[], []]
-> 2193     return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in _factorize_from_iterable(values)
   2163         codes = values.codes
   2164     else:
-> 2165         cat = Categorical(values, ordered=True)
   2166         categories = cat.categories
   2167         codes = cat.codes

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in __init__(self, values, categories, ordered, fastpath)
    308 
    309                 # FIXME
--> 310                 raise NotImplementedError("> 1 ndim Categorical are not "
    311                                           "supported at this time")
    312 

NotImplementedError: > 1 ndim Categorical are not supported at this time

# 将'Name','Date', 'Info'作为所索引
 In[69]: inspections.set_index(['Name','Date', 'Info']).head(10)
out[69]:

# 用pivot，将info列中的值变为新的列
 In[70]: inspections.set_index(['Name','Date', 'Info']).unstack('Info').head()
out[70]:

# 用reset_index方法，使行索引层级与列索引相同
 In[71]: insp_tidy = inspections.set_index(['Name','Date', 'Info']) \
                                        .unstack('Info') \
                                        .reset_index(col_level=-1)
         insp_tidy.head()
out[71]:

# 除掉列索引的最外层，重命名行索引的层为None
 In[72]: insp_tidy.columns = insp_tidy.columns.droplevel(0).rename(None)
         insp_tidy.head()
out[72]:

# 使用squeeze方法，可以避免前面的多级索引
 In[73]: inspections.set_index(['Name','Date', 'Info']) \
                    .squeeze() \
                    .unstack('Info') \
                    .reset_index() \
                    .rename_axis(None, axis='columns')
out[73]:

# pivot_table需要传入聚合函数，才能产生一个单一值
 In[74]: inspections.pivot_table(index=['Name', 'Date'], 
                                 columns='Info', 
                                 values='Value', 
                                 aggfunc='first') \
                    .reset_index()\
                    .rename_axis(None, axis='columns')
out[74]:

10. 当两个或多个值存储于一个单元格时进行清理

# 读取texas_cities数据集
 In[75]: cities = pd.read_csv('data/texas_cities.csv')
         cities
out[75]:

# 将Geolocation分解为四个单独的列
 In[76]: geolocations = cities.Geolocation.str.split(pat='. ', expand=True)
         geolocations.columns = ['latitude', 'latitude direction', 'longitude', 'longitude direction']
         geolocations
out[76]:

# 转变数据类型
 In[77]: geolocations = geolocations.astype({'latitude':'float', 'longitude':'float'})
         geolocations.dtypes
out[77]: latitude               float64
         latitude direction      object
         longitude              float64
         longitude direction     object
         dtype: object

# 将新列与原先的city列连起来
 In[78]: cities_tidy = pd.concat([cities['City'], geolocations], axis='columns')
         cities_tidy
out[78]:

# 忽略，作者这里是写重复了
 In[79]: pd.concat([cities['City'], geolocations], axis='columns')
out[79]:

原理

# 函数to_numeric可以将每列自动变为整数或浮点数
 In[80]: temp = geolocations.apply(pd.to_numeric, errors='ignore')
         temp
out[80]:

# 再查看数据类型
 In[81]: temp.dtypes
out[81]: latitude               float64
         latitude direction      object
         longitude              float64
         longitude direction     object
         dtype: object

# |符，可以对多个标记进行分割
 In[82]: cities.Geolocation.str.split(pat='° |, ', expand=True)
out[82]:

# 更复杂的提取方式
 In[83]: cities.Geolocation.str.extract('([0-9.]+). (N|S), ([0-9.]+). (E|W)', expand=True)
out[83]:

11. 当多个变量被存储为列名和列值时进行清理

# 读取sensors数据集
 In[84]: sensors = pd.read_csv('data/sensors.csv')
         sensors
out[84]:

# 用melt清理数据
 In[85]: sensors.melt(id_vars=['Group', 'Property'], var_name='Year').head(6)
out[85]:

# 用pivot_table，将Property列转化为新的列名
 In[86]: sensors.melt(id_vars=['Group', 'Property'], var_name='Year') \
                .pivot_table(index=['Group', 'Year'], columns='Property', values='value') \
                .reset_index() \
                .rename_axis(None, axis='columns')
out[86]:

# 用stack和unstack实现上述方法
 In[87]: sensors.set_index(['Group', 'Property']) \
                .stack() \
                .unstack('Property') \
                .rename_axis(['Group', 'Year'], axis='index') \
                .rename_axis(None, axis='columns') \
                .reset_index()
out[87]:

12. 当多个观察单位被存储于同一张表时进行清理

# 读取movie_altered数据集
 In[88]: movie = pd.read_csv('data/movie_altered.csv')
         movie.head()
out[88]:

# 插入新的列，用来标识每一部电影
 In[89]: movie.insert(0, 'id', np.arange(len(movie)))
         movie.head()
out[89]:

# 用wide_to_long，将所有演员放到一列，将所有Facebook likes放到一列
 In[90]: stubnames = ['director', 'director_fb_likes', 'actor', 'actor_fb_likes']
         movie_long = pd.wide_to_long(movie, 
                                      stubnames=stubnames, 
                                      i='id', 
                                      j='num', 
                                      sep='_').reset_index()
         movie_long['num'] = movie_long['num'].astype(int)
         movie_long.head(9)
out[90]:

# 将这个数据分解成多个小表
 In[91]: movie_table = movie_long[['id','title', 'year', 'duration', 'rating']]
         director_table = movie_long[['id', 'director', 'num', 'director_fb_likes']]
         actor_table = movie_long[['id', 'actor', 'num', 'actor_fb_likes']]
 In[92]: movie_table.head(9)
out[90]:

 In[93]: director_table.head(9)
out[93]:

 In[94]: actor_table.head(9)
out[94]:

# 做一些去重和去除缺失值的工作
 In[95]: movie_table = movie_table.drop_duplicates().reset_index(drop=True)
         director_table = director_table.dropna().reset_index(drop=True)
         actor_table = actor_table.dropna().reset_index(drop=True)
 In[96]: movie_table.head()
out[96]:

 In[97]: director_table.head()
out[97]:

# 比较内存的使用量
 In[98]: movie.memory_usage(deep=True).sum()
out[98]: 2318234

 In[99]: movie_table.memory_usage(deep=True).sum() + \
         director_table.memory_usage(deep=True).sum() + \
         actor_table.memory_usage(deep=True).sum()
out[99]: 2624898

# 创建演员和导演的id列
 In[100]: director_cat = pd.Categorical(director_table['director'])
          director_table.insert(1, 'director_id', director_cat.codes)

          actor_cat = pd.Categorical(actor_table['actor'])
          actor_table.insert(1, 'actor_id', actor_cat.codes)

          director_table.head()
out[100]:

 In[101]: actor_table.head()
out[101]:

# 可以用这两张表生成要用的中间表。先来做director表
 In[102]: director_associative = director_table[['id', 'director_id', 'num']]
          dcols = ['director_id', 'director', 'director_fb_likes']
          director_unique = director_table[dcols].drop_duplicates().reset_index(drop=True)
          director_associative.head()         
out[102]:

 In[103]: director_unique.head()
out[103]:

# 再来做actor表
 In[104]: actor_associative = actor_table[['id', 'actor_id', 'num']]
          acols = ['actor_id', 'actor', 'actor_fb_likes']
          actor_unique = actor_table[acols].drop_duplicates().reset_index(drop=True)
          actor_associative.head()
out[104]:

 In[105]: actor_unique.head()
out[105]:

# 查看新的表所使用的内存量
 In[106]: movie_table.memory_usage(deep=True).sum() + \
          director_associative.memory_usage(deep=True).sum() + \
          director_unique.memory_usage(deep=True).sum() + \
          actor_associative.memory_usage(deep=True).sum() + \
          actor_unique.memory_usage(deep=True).sum()
out[106]: 1833402

 In[107]: movie_table.head()
out[107]:

# 可以通过将左右表组合起来形成movie表。首先将附表与actor/director表结合，然后将num列pivot，再加上列的前缀
 In[108]: actors = actor_associative.merge(actor_unique, on='actor_id') \
                                    .drop('actor_id', 1) \
                                    .pivot_table(index='id', columns='num', aggfunc='first')

          actors.columns = actors.columns.get_level_values(0) + '_' + \
                           actors.columns.get_level_values(1).astype(str)

          directors = director_associative.merge(director_unique, on='director_id') \
                                          .drop('director_id', 1) \
                                          .pivot_table(index='id', columns='num', aggfunc='first')

          directors.columns = directors.columns.get_level_values(0) + '_' + \
                              directors.columns.get_level_values(1).astype(str)
 In[109]: actors.head()
out[109]:

 In[110]: directors.head()
out[110]:

 In[111]: movie2 = movie_table.merge(directors.reset_index(), on='id', how='left') \
                              .merge(actors.reset_index(), on='id', how='left')
 In[112]: movie2.head()
out[112]:

 In[113]: movie.equals(movie2[movie.columns])
out[113]: True

最后编辑于：2018.10.26 18:21:32

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 158,736评论 4赞 362
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 67,167评论 1赞 291
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 108,442评论 0赞 243
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 43,902评论 0赞 204
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 52,302评论 3赞 287
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 40,573评论 1赞 216
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 31,847评论 2赞 312
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 30,562评论 0赞 197
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 34,260评论 1赞 241
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 30,531评论 2赞 245
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 32,021评论 1赞 258
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 28,367评论 2赞 253
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 33,016评论 3赞 235
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 26,068评论 0赞 8
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 26,827评论 0赞 194
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 35,610评论 2赞 274
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 35,514评论 2赞 269

《Pandas Cookbook》第08章数据清理

《Pandas Cookbook》第08章数据清理

1. 用stack清理变量值作为列名

更多

2. 用melt清理变量值作为列名

3. 同时stack多组变量

更多

4. 反转stacked数据

更多

5. 分组聚合后unstacking

更多

6. 用分组聚合实现透视表

更多

7. 为了更容易reshaping，重新命名索引层

更多

8. 当多个变量被存储为列名时进行清理

更多

9. 当多个变量被存储为列的值时进行清理

更多

10. 当两个或多个值存储于一个单元格时进行清理

原理

更多

11. 当多个变量被存储为列名和列值时进行清理

更多

12. 当多个观察单位被存储于同一张表时进行清理

推荐阅读更多精彩内容

《Pandas Cookbook》第08章 数据清理

1. 用stack清理变量值作为列名

更多

2. 用melt清理变量值作为列名

3. 同时stack多组变量

更多

4. 反转stacked数据

更多

5. 分组聚合后unstacking

更多

6. 用分组聚合实现透视表

更多

7. 为了更容易reshaping，重新命名索引层

更多

8. 当多个变量被存储为列名时进行清理

更多

9. 当多个变量被存储为列的值时进行清理

更多

10. 当两个或多个值存储于一个单元格时进行清理

原理

更多

11. 当多个变量被存储为列名和列值时进行清理

更多

12. 当多个观察单位被存储于同一张表时进行清理

推荐阅读更多精彩内容

《Pandas Cookbook》第08章数据清理