Revised translation of Chapter 11, "Time Series," of Python for Data Analysis, 2nd Edition

Why This Translation

I bought the official Chinese print edition of the 2nd edition, translated by Xu Jingyi, and the translation quality made me want to swear; Chapter 11, "Time Series," in particular is baffling. The original English book also contains a few errors that get in the way of understanding. I have therefore produced my own revised translation of Chapter 11, and may revise other chapters later if time permits. This text draws on the official 1st-edition translation by Tang Xuetao and the version by Jianshu user SeanCheney, with revisions to parts of their content.

Translation Principles

1 Accurate terminology with appropriately scoped word senses

1.1 Terms are settled by consulting relevant standards, specialist websites, encyclopedias, and dictionaries. The main reference sites are:

1.2 Appropriately scoped word senses: a Chinese word and its English counterpart may differ in how broad their senses are, so I consider not only the English-to-Chinese rendering but also whether the Chinese word can be translated back to the original English word, keeping the breadth of the two senses as closely matched as possible.

2 Consistent vocabulary throughout

2.1 An English word may have several possible Chinese renderings; for clarity, this text uses only one wherever possible, kept consistent throughout.
2.2 The author may use several English words for the same idea; for clarity, these are consolidated into a single term where possible. For example, option and parameter are both rendered as "parameter."

3 Readability

3.1 The original has many long, nested sentences; these are split where appropriate to suit Chinese reading habits.
3.2 Filler phrases in the original are omitted or paraphrased where appropriate, and words the author elides are supplied. For example, the author may write just "..." for "the ... method," omitting "method."
3.3 The original English book does not number its headings hierarchically; hierarchical numbering is added here for clarity.

A mind map of the chapter follows:

MindMaster, PDF, and original PNG versions can be downloaded at https://pan.baidu.com/s/1xBcIQB2Qi2kT0AHusyt8_w (extraction code: s6kg)

Time series data is an important form of structured data in many different fields, such as finance, economics, ecology, neuroscience, and physics. Anything that is observed or measured at many points in time forms a time series. Many time series are fixed frequency, which is to say that data points occur at regular intervals according to some rule, such as every 15 seconds, every 5 minutes, or once per month. Time series can also be irregular, without a fixed unit of time or offset between units. How you mark and refer to time series data depends on the application, and you may have one of the following:

  • Timestamps, specific instants in time
  • Fixed periods, such as the month January 2007 or the full year 2010
  • Intervals of time, indicated by a start and end timestamp; periods can be thought of as special cases of intervals
  • Experiment or elapsed time; each timestamp is a measure of time relative to a particular start time (e.g., the diameter of a cookie baking each second since being placed in the oven)
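As a quick illustration (a minimal sketch using pandas, not from the original text), the first three categories map directly onto pandas objects:

```python
import pandas as pd

# Timestamp: a specific instant in time
stamp = pd.Timestamp("2007-01-15 09:30")

# Fixed period: the whole month of January 2007
period = pd.Period("2007-01", freq="M")

# Interval of time: indicated by a start and an end timestamp
interval = pd.Interval(pd.Timestamp("2007-01-01"),
                       pd.Timestamp("2007-02-01"), closed="left")

print(stamp)    # 2007-01-15 09:30:00
print(period)   # 2007-01
print(interval.left, interval.right)
```

All three types are discussed in this chapter and the next sections; periods in particular get their own treatment later in the chapter.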

In this chapter, I am mainly concerned with time series in the first three categories, though many of the techniques can be applied to experimental time series, where the index may be an integer or floating-point number indicating elapsed time from the start of the experiment. The simplest and most widely used kind of time series are those indexed by timestamp.

pandas also supports indexes based on timedeltas, which can be a useful way of representing experiment or elapsed time. We do not explore timedelta indexes in this book, but you can learn more in the pandas documentation (http://pandas.pydata.org).

pandas provides many built-in time series tools and data algorithms. You can efficiently work with very large time series and easily slice and dice, aggregate, and resample irregular- and fixed-frequency time series. Some of these tools are especially useful for financial and economics applications, but you could certainly use them to analyze server log data, too.

11.1 Date and Time Data Types and Tools

The Python standard library includes data types for date and time data, as well as calendar-related functionality. The datetime, time, and calendar modules are the main places to start. The datetime.datetime type, or simply datetime, is widely used:

import numpy as np
import pandas as pd
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.set_printoptions(precision=4, suppress=True)

In [10]: from datetime import datetime

In [11]: now_datetime = datetime.now() # translator's note: renamed from now (as in the original) to now_datetime to avoid ambiguity

In [12]: now_datetime
Out[12]: datetime.datetime(2017, 9, 25, 14, 5, 52, 72973)

In [13]: now_datetime.year, now_datetime.month, now_datetime.day
Out[13]: (2017, 9, 25)

datetime stores both the date and time down to the microsecond. timedelta represents the temporal difference between two datetime objects:

In [14]: delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)

In [15]: delta
Out[15]: datetime.timedelta(926, 56700)

In [16]: delta.days
Out[16]: 926

In [17]: delta.seconds
Out[17]: 56700

You can add (or subtract) a timedelta or multiple thereof to a datetime object to yield a new shifted object:

In [18]: from datetime import timedelta

In [19]: start = datetime(2011, 1, 7)

In [20]: start + timedelta(12)
Out[20]: datetime.datetime(2011, 1, 19, 0, 0)

In [21]: start - 2 * timedelta(12)
Out[21]: datetime.datetime(2010, 12, 14, 0, 0)

Table 11-1 summarizes the data types in the datetime module. While this chapter is mainly concerned with the data types in pandas and higher-level time series manipulation, you may encounter the datetime-based types in many other places in Python in the wild.
Table 11-1. Types in the datetime module
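The types the table covers can be exercised directly; a minimal standard-library sketch:

```python
from datetime import date, time, datetime, timedelta

d = date(2011, 1, 3)               # calendar date: year, month, day
t = time(14, 30, 0)                # time of day, down to microseconds
dt = datetime(2011, 1, 3, 14, 30)  # date and time combined
span = timedelta(days=1, hours=2)  # difference between two datetimes

print(dt + span)               # 2011-01-04 16:30:00
print(datetime.combine(d, t))  # 2011-01-03 14:30:00
```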


11.1.1 Converting Between String and Datetime

You can format datetime objects and pandas Timestamp objects, which I'll introduce later, as strings using str or the strftime method, passing a format specification:

In [22]: stamp = datetime(2011, 1, 3)

In [23]: str(stamp)
Out[23]: '2011-01-03 00:00:00'

In [24]: stamp.strftime('%Y-%m-%d')
Out[24]: '2011-01-03'

See Table 11-2 for a complete list of the format codes (reproduced from Chapter 2).

Table 11-2. Datetime format specification (ISO C89 compatible)
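A few of the most common codes from Table 11-2 in action (a small standard-library sketch):

```python
from datetime import datetime

stamp = datetime(2011, 1, 3, 20, 15)

# %Y four-digit year, %m two-digit month, %d two-digit day,
# %H 24-hour clock hour, %M two-digit minute
print(stamp.strftime("%Y-%m-%d %H:%M"))  # 2011-01-03 20:15

# %y is the two-digit year
print(stamp.strftime("%m/%d/%y"))        # 01/03/11
```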


You can use these same format codes to convert strings to dates using datetime.strptime:

In [25]: value = '2011-01-03'

In [26]: datetime.strptime(value, '%Y-%m-%d')
Out[26]: datetime.datetime(2011, 1, 3, 0, 0)

In [27]: datestrs = ['7/6/2011', '8/6/2011']

In [28]: [datetime.strptime(x, '%m/%d/%Y') for x in datestrs]
Out[28]: 
[datetime.datetime(2011, 7, 6, 0, 0),
 datetime.datetime(2011, 8, 6, 0, 0)]

datetime.strptime is a good way to parse a date with a known format. However, it can be a bit annoying to have to write a format spec each time, especially for common date formats. In this case, you can use the parser.parse method in the third-party dateutil package (this is installed automatically when you install pandas):

In [29]: from dateutil.parser import parse

In [30]: parse('2011-01-03')
Out[30]: datetime.datetime(2011, 1, 3, 0, 0)

dateutil is capable of parsing most human-intelligible date representations:

In [31]: parse('Jan 31, 1997 10:45 PM')
Out[31]: datetime.datetime(1997, 1, 31, 22, 45)

In international locales, day appearing before month is very common, so you can pass dayfirst=True to indicate this:

In [32]: parse('6/12/2011', dayfirst=True)
Out[32]: datetime.datetime(2011, 12, 6, 0, 0)

pandas is generally oriented toward working with arrays of dates, whether used as an axis index or a column in a DataFrame. The pandas.to_datetime function parses many different kinds of date representations. Standard date formats like ISO 8601 can be parsed very quickly:

In [33]: datestrs = ['2011-07-06 12:00:00', '2011-08-06 00:00:00']

In [34]: pd.to_datetime(datestrs)
Out[34]: DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='dat
etime64[ns]', freq=None)

pandas.to_datetime also handles values that should be considered missing (None, empty string, etc.):

In [35]: idx = pd.to_datetime(datestrs + [None])

In [36]: idx
Out[36]: DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dty
pe='datetime64[ns]', freq=None)

In [37]: idx[2]
Out[37]: NaT

In [38]: pd.isnull(idx)
Out[38]: array([False, False,  True], dtype=bool)

NaT (Not a Time) is pandas's null value for timestamp data.

dateutil.parser is a useful but imperfect tool. Notably, it will recognize some strings as dates that you might prefer it didn't; for example, '42' will be parsed as the year 2042 with today's calendar date.

datetime objects also have a number of locale-specific formatting options for systems in other countries or languages. For example, the abbreviated month names will be different on German or French systems compared with English systems. See Table 11-3 for a listing.

Table 11-3. Locale-specific date formatting


11.2 Time Series Basics

A basic kind of time series object in pandas is a Series indexed by timestamps, which is often represented external to pandas as Python strings or datetime objects:

In [39]: from datetime import datetime

In [40]: dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
   ....:          datetime(2011, 1, 7), datetime(2011, 1, 8),
   ....:          datetime(2011, 1, 10), datetime(2011, 1, 12)]

In [41]: ts = pd.Series(np.random.randn(6), index=dates)

In [42]: ts
Out[42]: 
2011-01-02   -0.204708
2011-01-05    0.478943
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
2011-01-12    1.393406
dtype: float64

Under the hood, these datetime objects have been put in a DatetimeIndex:

In [43]: ts.index
Out[43]: 
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)

Like other Series, arithmetic operations between differently indexed time series automatically align on the dates:

In [44]: ts + ts[::2]
Out[44]: 
2011-01-02   -0.409415
2011-01-05         NaN
2011-01-07   -1.038877
2011-01-08         NaN
2011-01-10    3.931561
2011-01-12         NaN
dtype: float64

Recall that ts[::2] selects every second element in ts.

pandas stores timestamps using NumPy's datetime64 data type at nanosecond resolution:

In [45]: ts.index.dtype
Out[45]: dtype('<M8[ns]')

Scalar values from a DatetimeIndex are pandas Timestamp objects:

In [46]: stamp = ts.index[0]

In [47]: stamp
Out[47]: Timestamp('2011-01-02 00:00:00')

A Timestamp can be substituted anywhere you would use a datetime object. Additionally, it can store frequency information (if any) and understands how to do time zone conversions and other kinds of manipulations. More on both of these things later.

11.2.1 Indexing, Selection, Subsetting

Time series behaves like any other pandas.Series when you are indexing and selecting data based on label:

In [48]: stamp = ts.index[2]

In [49]: ts[stamp]
Out[49]: -0.51943871505673811

As a convenience, you can also pass a string that is interpretable as a date:

In [50]: ts['1/10/2011']
Out[50]: 1.9657805725027142

In [51]: ts['20110110']
Out[51]: 1.9657805725027142

For longer time series, a year or only a year and month can be passed to easily select slices of data:

In [52]: longer_ts = pd.Series(np.random.randn(1000),
   ....:                       index=pd.date_range('1/1/2000', periods=1000))

In [53]: longer_ts
Out[53]: 
2000-01-01    0.092908
2000-01-02    0.281746
2000-01-03    0.769023
2000-01-04    1.246435
2000-01-05    1.007189
2000-01-06   -1.296221
2000-01-07    0.274992
2000-01-08    0.228913
2000-01-09    1.352917
2000-01-10    0.886429
                ...   
2002-09-17   -0.139298
2002-09-18   -1.159926
2002-09-19    0.618965
2002-09-20    1.373890
2002-09-21   -0.983505
2002-09-22    0.930944
2002-09-23   -0.811676
2002-09-24   -1.830156
2002-09-25   -0.138730
2002-09-26    0.334088
Freq: D, Length: 1000, dtype: float64

In [54]: longer_ts['2001']
Out[54]: 
2001-01-01    1.599534
2001-01-02    0.474071
2001-01-03    0.151326
2001-01-04   -0.542173
2001-01-05   -0.475496
2001-01-06    0.106403
2001-01-07   -1.308228
2001-01-08    2.173185
2001-01-09    0.564561
2001-01-10   -0.190481
                ...   
2001-12-22    0.000369
2001-12-23    0.900885
2001-12-24   -0.454869
2001-12-25   -0.864547
2001-12-26    1.129120
2001-12-27    0.057874
2001-12-28   -0.433739
2001-12-29    0.092698
2001-12-30   -1.397820
2001-12-31    1.457823
Freq: D, Length: 365, dtype: float64

Here, the string '2001' is interpreted as a year and selects that time period. This also works if you specify the month:

In [55]: longer_ts['2001-05']
Out[55]: 
2001-05-01   -0.622547
2001-05-02    0.936289
2001-05-03    0.750018
2001-05-04   -0.056715
2001-05-05    2.300675
2001-05-06    0.569497
2001-05-07    1.489410
2001-05-08    1.264250
2001-05-09   -0.761837
2001-05-10   -0.331617
                ...   
2001-05-22    0.503699
2001-05-23   -1.387874
2001-05-24    0.204851
2001-05-25    0.603705
2001-05-26    0.545680
2001-05-27    0.235477
2001-05-28    0.111835
2001-05-29   -1.251504
2001-05-30   -2.949343
2001-05-31    0.634634
Freq: D, Length: 31, dtype: float64

Slicing with datetime objects works as well:

In [56]: ts[datetime(2011, 1, 7):] # translator's note: ts['2011-01-07':] works as well
Out[56]: 
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
2011-01-12    1.393406
dtype: float64

Because most time series data is ordered chronologically, you can slice with timestamps not contained in a time series to perform a range query:

In [57]: ts
Out[57]: 
2011-01-02   -0.204708
2011-01-05    0.478943
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
2011-01-12    1.393406
dtype: float64

In [58]: ts['1/6/2011':'1/11/2011']
Out[58]: 
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
dtype: float64

As before, you can pass either a string date, datetime, or timestamp. Remember that slicing in this manner produces views on the source time series, like slicing NumPy arrays. This means that no data is copied, and modifications on the slice will be reflected in the original data.

There is an equivalent instance method, truncate, that slices a Series between two dates:

In [59]: ts.truncate(after='1/9/2011')
Out[59]: 
2011-01-02   -0.204708
2011-01-05    0.478943
2011-01-07   -0.519439
2011-01-08   -0.555730
dtype: float64

All of this holds true for DataFrame as well, indexing on its rows:

In [60]: dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')

In [61]: long_df = pd.DataFrame(np.random.randn(100, 4),
   ....:                        index=dates,
   ....:                        columns=['Colorado', 'Texas',
   ....:                                 'New York', 'Ohio'])

In [62]: long_df.loc['5-2001']
Out[62]: 
            Colorado     Texas  New York      Ohio
2001-05-02 -0.006045  0.490094 -0.277186 -0.707213
2001-05-09 -0.560107  2.735527  0.927335  1.513906
2001-05-16  0.538600  1.273768  0.667876 -0.969206
2001-05-23  1.676091 -0.817649  0.050188  1.951312
2001-05-30  3.260383  0.963301  1.201206 -1.852001

11.2.2 Time Series with Duplicate Indices

In some applications, there may be multiple data observations falling on a particular timestamp. Here is an example:

In [63]: dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000',
   ....:                           '1/2/2000', '1/3/2000'])
In [64]: dup_ts = pd.Series(np.arange(5), index=dates)

In [65]: dup_ts
Out[65]: 
2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int64

We can tell that the index is not unique by checking its is_unique property:

In [66]: dup_ts.index.is_unique
Out[66]: False

Indexing into this time series will now either produce scalar values or slices, depending on whether a timestamp is duplicated:

In [67]: dup_ts['1/3/2000']  # not duplicated
Out[67]: 4

In [68]: dup_ts['1/2/2000']  # duplicated
Out[68]: 
2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int64

Suppose you wanted to aggregate the data having non-unique timestamps. One way to do this is to use groupby and pass level=0:

In [69]: grouped = dup_ts.groupby(level=0)

In [70]: grouped.mean()
Out[70]: 
2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int64

In [71]: grouped.count()
Out[71]: 
2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64

11.3 Date Ranges, Frequencies, and Shifting

Generic time series in pandas are assumed to be irregular; that is, they have no fixed frequency. For many applications this is sufficient. However, it's often desirable to work relative to a fixed frequency, such as daily, monthly, or every 15 minutes, even if that means introducing missing values into a time series. Fortunately, pandas has a full suite of standard time series frequencies and tools for resampling, inferring frequencies, and generating fixed-frequency date ranges. For example, you can convert the sample time series to fixed daily frequency by calling resample:

In [72]: ts
Out[72]: 
2011-01-02   -0.204708
2011-01-05    0.478943
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
2011-01-12    1.393406
dtype: float64

In [73]: resampler = ts.resample('D')

The string 'D' is interpreted as daily frequency.

Conversion between frequencies, or resampling, is a big enough topic to have its own section later (Section 11.6, "Resampling and Frequency Conversion"). Here I'll show you how to use the base frequencies and multiples thereof.

11.3.1 Generating Date Ranges

While I used it previously without explanation, pandas.date_range is responsible for generating a DatetimeIndex with an indicated length according to a particular frequency:

In [74]: idx = pd.date_range('2012-04-01', '2012-06-01') # translator's note: renamed from index (as in the original) to idx to avoid ambiguity

In [75]: idx
Out[75]: 
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
               '2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
               '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')

By default, pandas.date_range generates daily timestamps. If you pass only a start or end date, you must pass a number of periods to generate:

In [76]: pd.date_range(start='2012-04-01', periods=20)
Out[76]: 
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20'],
              dtype='datetime64[ns]', freq='D')

In [77]: pd.date_range(end='2012-06-01', periods=20)
Out[77]: 
DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
               '2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
               '2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
               '2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
               '2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')

The start and end dates define strict boundaries for the generated date index. For example, if you wanted a date index containing the last business day of each month, you would pass the 'BM' frequency (business month end; see a more complete listing of frequencies in Table 11-4), and only dates falling on or inside the date interval will be included:

In [78]: pd.date_range('2000-01-01', '2000-12-01', freq='BM')
Out[78]: 
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-29', '2000-10-31', '2000-11-30'],
              dtype='datetime64[ns]', freq='BM')

Table 11-4. Base time series frequencies (not comprehensive)



pandas.date_range by default preserves the time (if any) of the start or end timestamp:

In [79]: pd.date_range('2012-05-02 12:56:31', periods=5)
Out[79]: 
DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
               '2012-05-04 12:56:31', '2012-05-05 12:56:31',
               '2012-05-06 12:56:31'],
              dtype='datetime64[ns]', freq='D')

Sometimes you will have start or end dates with time information but want to generate a set of timestamps normalized to midnight as a convention. To do this, there is a normalize option:

In [80]: pd.date_range('2012-05-02 12:56:31', periods=5, normalize=True)
Out[80]: 
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
               '2012-05-06'],
              dtype='datetime64[ns]', freq='D')

11.3.2 Frequencies and Date Offsets

Frequencies in pandas are composed of a base frequency and a multiplier. Base frequencies are typically referred to by a string alias, like 'M' for monthly or 'H' for hourly. For each base frequency, there is a defined object, generally referred to as a date offset. For example, hourly frequency can be represented with the Hour class:

In [81]: from pandas.tseries.offsets import Hour, Minute

In [82]: one_hour = Hour() # translator's note: renamed from hour (as in the original) to one_hour to avoid ambiguity

In [83]: one_hour
Out[83]: <Hour>

You can define a multiple of an offset by passing an integer:

In [84]: four_hours = Hour(4)

In [85]: four_hours
Out[85]: <4 * Hours>

In most applications, you would never need to explicitly create one of these objects; instead you'd use a string alias like 'H' or '4H'. Putting an integer before the base frequency creates a multiple:

In [86]: pd.date_range('2000-01-01', '2000-01-03 23:59', freq='4h')
Out[86]: 
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
               '2000-01-01 08:00:00', '2000-01-01 12:00:00',
               '2000-01-01 16:00:00', '2000-01-01 20:00:00',
               '2000-01-02 00:00:00', '2000-01-02 04:00:00',
               '2000-01-02 08:00:00', '2000-01-02 12:00:00',
               '2000-01-02 16:00:00', '2000-01-02 20:00:00',
               '2000-01-03 00:00:00', '2000-01-03 04:00:00',
               '2000-01-03 08:00:00', '2000-01-03 12:00:00',
               '2000-01-03 16:00:00', '2000-01-03 20:00:00'],
              dtype='datetime64[ns]', freq='4H')

Many offsets can be combined by addition:

In [87]: Hour() + Minute(30) # translator's note: in context, the quantity the author means is 1h30min
Out[87]: <90 * Minutes>

Similarly, you can pass frequency strings, like '1h30min', that will effectively be parsed into the same expression:

In [88]: pd.date_range('2000-01-01', periods=10, freq='1h30min')
Out[88]: 
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
               '2000-01-01 03:00:00', '2000-01-01 04:30:00',
               '2000-01-01 06:00:00', '2000-01-01 07:30:00',
               '2000-01-01 09:00:00', '2000-01-01 10:30:00',
               '2000-01-01 12:00:00', '2000-01-01 13:30:00'],
              dtype='datetime64[ns]', freq='90T')

Some frequencies describe points in time that are not evenly spaced. For example, 'M' (calendar month end) and 'BM' (last business/weekday of month) depend on the number of days in a month and, in the latter case, whether the month ends on a weekend or not. We refer to these as anchored offsets.

Refer back to Table 11-4 for a listing of frequency codes and date offset classes available in pandas.

Users can define their own custom frequency classes to provide date logic not available in pandas, though the full details of that are outside the scope of this book.

11.3.2.1 Week of Month Dates

One useful frequency class is "week of month," with aliases starting with WOM. This enables you to get dates like the third Friday of each month:

In [89]: rng = pd.date_range('2012-01-01', '2012-09-01', freq='WOM-3FRI')

In [90]: list(rng)
Out[90]: 
[Timestamp('2012-01-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-02-17 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-03-16 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-04-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-05-18 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-06-15 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-07-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-08-17 00:00:00', freq='WOM-3FRI')]
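As a sanity check (a small sketch, not from the original text), you can verify that each generated date really is the third Friday of its month:

```python
import pandas as pd

rng = pd.date_range("2012-01-01", "2012-09-01", freq="WOM-3FRI")

for d in rng:
    # dayofweek: Monday=0 ... Friday=4
    assert d.dayofweek == 4
    # the third Friday of a month always falls on day 15 through 21
    assert 15 <= d.day <= 21

print(len(rng))  # 8, one date per month from January through August 2012
```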

11.3.3 Shifting (Leading and Lagging) Data

"Shifting" refers to moving data backward and forward through time. Both Series and DataFrame have a shift method for doing naive shifts forward or backward, leaving the index unmodified:

In [91]: ts = pd.Series(np.random.randn(4),
   ....:                index=pd.date_range('1/1/2000', periods=4, freq='M'))

In [92]: ts
Out[92]: 
2000-01-31   -0.066748
2000-02-29    0.838639
2000-03-31   -0.117388
2000-04-30   -0.517795
Freq: M, dtype: float64

In [93]: ts.shift(2)
Out[93]: 
2000-01-31         NaN
2000-02-29         NaN
2000-03-31   -0.066748
2000-04-30    0.838639
Freq: M, dtype: float64

In [94]: ts.shift(-2)
Out[94]: 
2000-01-31   -0.117388
2000-02-29   -0.517795
2000-03-31         NaN
2000-04-30         NaN
Freq: M, dtype: float64

When we shift like this, missing data is introduced either at the start or the end of the time series.

A common use of shift is computing percent changes in a time series or in multiple time series as DataFrame columns. This is expressed as:

ts / ts.shift(1) - 1
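pandas also ships this computation as the built-in pct_change method; a short sketch (with made-up sample values) showing that the two expressions agree:

```python
import pandas as pd

prices = pd.Series([100.0, 102.0, 99.0, 101.0],
                   index=pd.date_range("2000-01-01", periods=4))

manual = prices / prices.shift(1) - 1   # the expression above
builtin = prices.pct_change()           # pandas's built-in equivalent

# identical, including the leading NaN both produce for the first element
print(manual.equals(builtin))            # True
print(manual.round(4).tolist())          # [nan, 0.02, -0.0294, 0.0202]
```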

Because naive shifts leave the index unmodified, some data is discarded. Thus, if the frequency is known, it can be passed to shift to advance the timestamps instead of simply the data:

In [95]: ts.shift(2, freq='M')
Out[95]: 
2000-03-31   -0.066748
2000-04-30    0.838639
2000-05-31   -0.117388
2000-06-30   -0.517795
Freq: M, dtype: float64

Other frequencies can be passed, too, giving you some flexibility in how to lead and lag the data:

In [96]: ts.shift(3, freq='D') # translator's note: equivalent to ts.shift(1, freq='3D')
Out[96]: 
2000-02-03   -0.066748
2000-03-03    0.838639
2000-04-03   -0.117388
2000-05-03   -0.517795
dtype: float64

In [97]: ts.shift(1, freq='90T') # translator's note: equivalent to ts.shift(90, freq='T')
Out[97]: 
2000-01-31 01:30:00   -0.066748
2000-02-29 01:30:00    0.838639
2000-03-31 01:30:00   -0.117388
2000-04-30 01:30:00   -0.517795
Freq: M, dtype: float64

The T here stands for minutes.

11.3.3.1 Shifting Dates with Offsets

The pandas date offsets can also be used with datetime or Timestamp objects:

In [98]: from pandas.tseries.offsets import Day, MonthEnd

In [99]: now = datetime(2011, 11, 17)

In [100]: now + 3 * Day()
Out[100]: Timestamp('2011-11-20 00:00:00')

If you add an anchored offset like MonthEnd, the first increment will "roll forward" a date to the next date according to the frequency rule:

In [101]: now + MonthEnd()
Out[101]: Timestamp('2011-11-30 00:00:00')

In [102]: now + MonthEnd(2)
Out[102]: Timestamp('2011-12-31 00:00:00')

Anchored offsets can explicitly "roll" dates forward or backward by simply using their rollforward and rollback methods, respectively:

In [103]: offset = MonthEnd()

In [104]: offset.rollforward(now)
Out[104]: Timestamp('2011-11-30 00:00:00')

In [105]: offset.rollback(now)
Out[105]: Timestamp('2011-10-31 00:00:00')

A creative use of date offsets is to use these methods with groupby:

In [106]: ts = pd.Series(np.random.randn(20),
   .....:                index=pd.date_range('1/15/2000', periods=20, freq='4d'))

In [107]: ts
Out[107]: 
2000-01-15   -0.116696
2000-01-19    2.389645
2000-01-23   -0.932454
2000-01-27   -0.229331
2000-01-31   -1.140330
2000-02-04    0.439920
2000-02-08   -0.823758
2000-02-12   -0.520930
2000-02-16    0.350282
2000-02-20    0.204395
2000-02-24    0.133445
2000-02-28    0.327905
2000-03-03    0.072153
2000-03-07    0.131678
2000-03-11   -1.297459
2000-03-15    0.997747
2000-03-19    0.870955
2000-03-23   -0.991253
2000-03-27    0.151699
2000-03-31    1.266151
Freq: 4D, dtype: float64

In [108]: ts.groupby(offset.rollforward).mean()
Out[108]: 
2000-01-31   -0.005833
2000-02-29    0.015894
2000-03-31    0.150209
dtype: float64

当然,更简单更快捷的方式是使用resample方法(11.6节将对此进行详细讲解):
Of course, an easier and faster way to do this is using resample (we’ll discuss this in much more depth in Section 11.6, “Resampling and Frequency Conversion,” on page 348):

In [109]: ts.resample('M').mean()
Out[109]: 
2000-01-31   -0.005833
2000-02-29    0.015894
2000-03-31    0.150209
Freq: M, dtype: float64

11.4 时区处理

11.4 Time Zone Handling

处理时区(time zone)通常被认为是时间序列操作中最令人不快的部分之一。因此,很多人选择以协调世界时(coordinated universal time, UTC)来处理时间序列。协调世界时是格林尼治标准时间(Greenwich Mean Time, GMT)的继任者,也是目前的国际标准。时区是以与UTC的偏移量形式表示的。例如,在夏令时(daylight saving time, DST)期间纽约比UTC晚4个小时,而在全年其它时间则比UTC晚5个小时。
Working with time zones is generally considered one of the most unpleasant parts of time series manipulation. As a result, many time series users choose to work with time series in coordinated universal time or UTC, which is the successor to Greenwich Mean Time and is the current international standard. Time zones are expressed as offsets from UTC; for example, New York is four hours behind UTC during daylight saving time and five hours behind the rest of the year.

在Python中,时区信息来自第三方pytz库(可通过pip或conda安装),它公开了Olson数据库(世界时区信息的汇编)。这对历史数据特别重要,因为夏令时转变日期(甚至UTC偏移量)已经根据地方政府的突发奇想改变了很多次。在美国,夏令时转变日期自1900年以来已经改变了很多次!
In Python, time zone information comes from the third-party pytz library (installable with pip or conda), which exposes the Olson database, a compilation of world time zone information. This is especially important for historical data because the daylight saving time (DST) transition dates (and even UTC offsets) have been changed numerous times depending on the whims of local governments. In the United States, the DST transition times have been changed many times since 1900!

有关pytz库的详细信息,请查阅该库的官方文档。就本书而言,pandas封装了pytz库的功能,这样你就可以忽略它在时区名称之外的API。时区名称可以交互式地找到,也可以在官方文档中找到:
For detailed information about the pytz library, you’ll need to look at that library’s documentation. As far as this book is concerned, pandas wraps pytz’s functionality so you can ignore its API outside of the time zone names. Time zone names can be found interactively and in the docs:

In [110]: import pytz

In [111]: pytz.common_timezones[-5:]
Out[111]: ['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']

要从pytz库获取时区对象,请使用pytz.timezone函数:
To get a time zone object from pytz, use pytz.timezone:

In [112]: tz = pytz.timezone('America/New_York')

In [113]: tz
Out[113]: <DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD>

pandas中的方法既可以接受时区名称也可以接受时区对象。
Methods in pandas will accept either time zone names or these objects.
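作为补充示意(非原书代码),下面验证传入时区名称字符串与传入pytz时区对象会得到相同的结果:

```python
import pandas as pd
import pytz

# 两种方式:传时区名称字符串,或传pytz时区对象
rng1 = pd.date_range('2012-03-09 09:30', periods=3, freq='D',
                     tz='America/New_York')
rng2 = pd.date_range('2012-03-09 09:30', periods=3, freq='D',
                     tz=pytz.timezone('America/New_York'))

# 两者底层的UTC纳秒值完全相同
print((rng1.asi8 == rng2.asi8).all())
```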

11.4.1 时区本地化和转换

Time Zone Localization and Conversion

默认情况下,pandas中的时间序列是时区朴素的(time zone naive)。例如,考虑以下时间序列:
By default, time series in pandas are time zone naive. For example, consider the following time series:

In [114]: rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')

In [115]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [116]: ts
Out[116]: 
2012-03-09 09:30:00   -0.202469
2012-03-10 09:30:00    0.050718
2012-03-11 09:30:00    0.639869
2012-03-12 09:30:00    0.597594
2012-03-13 09:30:00   -0.797246
2012-03-14 09:30:00    0.472879
Freq: D, dtype: float64

其索引的tz属性是None:
The index’s tz field is None:

In [117]: print(ts.index.tz)
None

可以生成带有时区集(time zone set)的日期范围:
Date ranges can be generated with a time zone set:

In [118]: pd.date_range('3/9/2012 9:30', periods=10, freq='D', tz='UTC')
Out[118]: 
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00', '2012-03-16 09:30:00+00:00',
               '2012-03-17 09:30:00+00:00', '2012-03-18 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')

从朴素到本地化的转换是通过tz_localize方法处理的:
Conversion from naive to localized is handled by the tz_localize method:

In [119]: ts
Out[119]: 
2012-03-09 09:30:00   -0.202469
2012-03-10 09:30:00    0.050718
2012-03-11 09:30:00    0.639869
2012-03-12 09:30:00    0.597594
2012-03-13 09:30:00   -0.797246
2012-03-14 09:30:00    0.472879
Freq: D, dtype: float64

In [120]: ts_utc = ts.tz_localize('UTC')

In [121]: ts_utc
Out[121]: 
2012-03-09 09:30:00+00:00   -0.202469
2012-03-10 09:30:00+00:00    0.050718
2012-03-11 09:30:00+00:00    0.639869
2012-03-12 09:30:00+00:00    0.597594
2012-03-13 09:30:00+00:00   -0.797246
2012-03-14 09:30:00+00:00    0.472879
Freq: D, dtype: float64

In [122]: ts_utc.index
Out[122]: 
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')

一旦时间序列被本地化到某个特定的时区,就可以通过tz_convert方法将其转换到另一个时区:
Once a time series has been localized to a particular time zone, it can be converted to another time zone with tz_convert:

In [123]: ts_utc.tz_convert('America/New_York')
Out[123]: 
2012-03-09 04:30:00-05:00   -0.202469
2012-03-10 04:30:00-05:00    0.050718
2012-03-11 05:30:00-04:00    0.639869
2012-03-12 05:30:00-04:00    0.597594
2012-03-13 05:30:00-04:00   -0.797246
2012-03-14 05:30:00-04:00    0.472879
Freq: D, dtype: float64

在前面的时间序列中(它跨越了America/New_York时区的夏令时转变),我们可以将其本地化到美国东部标准时间(Eastern Standard Time, EST),然后转换到UTC或柏林时间:
In the case of the preceding time series, which straddles a DST transition in the America/New_York time zone, we could localize to EST and convert to, say, UTC or Berlin time:

In [124]: ts_eastern = ts.tz_localize('America/New_York')

In [125]: ts_eastern.tz_convert('UTC')
Out[125]: 
2012-03-09 14:30:00+00:00   -0.202469
2012-03-10 14:30:00+00:00    0.050718
2012-03-11 13:30:00+00:00    0.639869
2012-03-12 13:30:00+00:00    0.597594
2012-03-13 13:30:00+00:00   -0.797246
2012-03-14 13:30:00+00:00    0.472879
Freq: D, dtype: float64

In [126]: ts_eastern.tz_convert('Europe/Berlin')
Out[126]: 
2012-03-09 15:30:00+01:00   -0.202469
2012-03-10 15:30:00+01:00    0.050718
2012-03-11 14:30:00+01:00    0.639869
2012-03-12 14:30:00+01:00    0.597594
2012-03-13 14:30:00+01:00   -0.797246
2012-03-14 14:30:00+01:00    0.472879
Freq: D, dtype: float64

tz_localize和tz_convert也是DatetimeIndex的实例方法:
tz_localize and tz_convert are also instance methods on DatetimeIndex:

In [127]: ts.index.tz_localize('Asia/Shanghai')
Out[127]: 
DatetimeIndex(['2012-03-09 09:30:00+08:00', '2012-03-10 09:30:00+08:00',
               '2012-03-11 09:30:00+08:00', '2012-03-12 09:30:00+08:00',
               '2012-03-13 09:30:00+08:00', '2012-03-14 09:30:00+08:00'],
              dtype='datetime64[ns, Asia/Shanghai]', freq='D')

对朴素时间戳的本地化操作还会检查夏令时转变附近含混不清的或不存在的时间。
Localizing naive timestamps also checks for ambiguous or nonexistent times around daylight saving time transitions.
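补充一个示意(非原书代码):默认情况下,本地化不存在的时间会抛出NonExistentTimeError,本地化含混的时间会抛出AmbiguousTimeError(也可以通过tz_localize方法的nonexistent参数和ambiguous参数改变该行为):

```python
import pandas as pd

# 2012-03-11 02:30 在 US/Eastern 并不存在:当天凌晨2点时钟直接拨到3点
try:
    pd.Timestamp('2012-03-11 02:30').tz_localize('US/Eastern')
except Exception as exc:
    print(type(exc).__name__)  # NonExistentTimeError

# 2012-11-04 01:30 则出现了两次(退出夏令时时时钟拨回1小时),是含混的
try:
    pd.Timestamp('2012-11-04 01:30').tz_localize('US/Eastern')
except Exception as exc:
    print(type(exc).__name__)  # AmbiguousTimeError
```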

11.4.2 时区意识型Timestamp对象的运算

Operations with Time Zone−Aware Timestamp Objects

与时间序列和日期范围类似,单独的Timestamp对象也能被从朴素的本地化为时区意识型的(time zone-aware),并从一个时区转换到另一个时区:
Similar to time series and date ranges, individual Timestamp objects similarly can be localized from naive to time zone–aware and converted from one time zone to another:

In [128]: stamp = pd.Timestamp('2011-03-12 04:00')

In [129]: stamp_utc = stamp.tz_localize('utc')

In [130]: stamp_utc.tz_convert('America/New_York')
Out[130]: Timestamp('2011-03-11 23:00:00-0500', tz='America/New_York')

在创建Timestamp对象时,也可以传入一个时区参数:
You can also pass a time zone when creating the Timestamp:

In [131]: stamp_moscow = pd.Timestamp('2011-03-12 04:00', tz='Europe/Moscow')

In [132]: stamp_moscow
Out[132]: Timestamp('2011-03-12 04:00:00+0300', tz='Europe/Moscow')

时区意识型Timestamp对象在内部存储了一个UTC时间戳数值----自UNIX纪元(1970 年1月1日)算起的纳秒数。这个UTC时间戳数值在时区转换过程中是不变的:
Time zone–aware Timestamp objects internally store a UTC timestamp value as nanoseconds since the Unix epoch (January 1, 1970); this UTC value is invariant between time zone conversions:

In [133]: stamp_utc.value
Out[133]: 1299902400000000000

In [134]: stamp_utc.tz_convert('America/New_York').value
Out[134]: 1299902400000000000
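这一点可以手工验证(补充示意,非原书代码):用时区意识型时间戳减去Unix纪元,得到的纳秒数与value属性一致:

```python
import pandas as pd

stamp_utc = pd.Timestamp('2011-03-12 04:00', tz='UTC')
epoch = pd.Timestamp('1970-01-01', tz='UTC')

# value 即自 Unix 纪元起的纳秒数
nanos = (stamp_utc - epoch) // pd.Timedelta('1ns')
print(nanos == stamp_utc.value)  # True
```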

当使用pandas的DateOffset对象执行时间算术运算时,pandas会尽可能遵从夏令时转变。这里我们创建恰好发生在夏令时转变(进入和退出夏令时)前的时间戳。首先是转变到夏令时前的30分钟:
When performing time arithmetic using pandas's DateOffset objects, pandas respects daylight saving time transitions where possible. Here we construct timestamps that occur right before DST transitions (forward and backward). First, 30 minutes before transitioning to DST:

In [135]: from pandas.tseries.offsets import Hour

In [136]: stamp = pd.Timestamp('2012-03-11 01:30', tz='US/Eastern') # gg注:原英文书中有误,作者的意图是2012-03-11,而不是2012-03-12

In [137]: stamp
Out[137]: Timestamp('2012-03-11 01:30:00-0500', tz='US/Eastern')

In [138]: stamp + Hour()
Out[138]: Timestamp('2012-03-11 03:30:00-0400', tz='US/Eastern')

接着是从夏令时转出前的90分钟:
Then, 90 minutes before transitioning out of DST:

In [139]: stamp = pd.Timestamp('2012-11-04 00:30', tz='US/Eastern')

In [140]: stamp
Out[140]: Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')

In [141]: stamp + 2 * Hour()
Out[141]: Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')

11.4.3 不同时区之间的运算

Operations Between Different Time Zones

如果组合两个带有不同时区的时间序列,结果会是UTC。由于在底层时间戳是以UTC存储的,所以这是个简单运算,不需要转换。
If two time series with different time zones are combined, the result will be UTC. Since the timestamps are stored under the hood in UTC, this is a straightforward operation and requires no conversion to happen:

In [142]: rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B')

In [143]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [144]: ts
Out[144]: 
2012-03-07 09:30:00    0.522356
2012-03-08 09:30:00   -0.546348
2012-03-09 09:30:00   -0.733537
2012-03-12 09:30:00    1.302736
2012-03-13 09:30:00    0.022199
2012-03-14 09:30:00    0.364287
2012-03-15 09:30:00   -0.922839
2012-03-16 09:30:00    0.312656
2012-03-19 09:30:00   -1.128497
2012-03-20 09:30:00   -0.333488
Freq: B, dtype: float64

In [145]: ts1 = ts[:7].tz_localize('Europe/London')

In [146]: ts2 = ts1[2:].tz_convert('Europe/Moscow')

In [147]: result = ts1 + ts2

In [148]: result.index
Out[148]: 
DatetimeIndex(['2012-03-07 09:30:00+00:00', '2012-03-08 09:30:00+00:00',
               '2012-03-09 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='B')

11.5 时期及其算术运算

Periods and Period Arithmetic

时期(period)表示的是时间跨度(timespan),例如数日、数月、数季或数年。Period类(The Period class)表示的就是这种数据类型,其构造函数(pandas.Period)需要一个“字符串或整数”以及一个表11-4中的频率。
Periods represent timespans, like days, months, quarters, or years. The Period class represents this data type, requiring a string or integer and a frequency from Table 11-4:

In [149]: p = pd.Period(2007, freq='A-DEC') # gg注:p = pd.Period('2007', freq='A-DEC')效果一样

In [150]: p
Out[150]: Period('2007', 'A-DEC')

在这个例子中,Period对象表示的是从2007年1月1日到2007年12月31日(包含在内)的整个时间跨度。在Period对象上加上或减去一个整数,即可方便地达到根据其频率进行移动的效果:
In this case, the Period object represents the full timespan from January 1, 2007, to December 31, 2007, inclusive. Conveniently, adding and subtracting integers from periods has the effect of shifting by their frequency:

In [151]: p + 5
Out[151]: Period('2012', 'A-DEC')

In [152]: p - 2
Out[152]: Period('2005', 'A-DEC')

如果两个Period对象拥有相同的频率,则它们的差就是它们之间的单位数量:
If two periods have the same frequency, their difference is the number of units between them:

In [153]: pd.Period('2014', freq='A-DEC') - p
Out[153]: <7 * YearEnds: month=12> # gg注:原英文书中的输出“7”可能是老版本的

pandas.period_range函数可用于创建规则的时期范围(range of periods):
Regular ranges of periods can be constructed with the period_range function:

In [154]: rng = pd.period_range('2000-01-01', '2000-06-30', freq='M')

In [155]: rng
Out[155]: PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')

PeriodIndex类存储的是Period对象的序列,它可以在任何pandas数据结构中作为轴索引:
The PeriodIndex class stores a sequence of periods and can serve as an axis index in any pandas data structure:

In [156]: pd.Series(np.random.randn(6), index=rng)
Out[156]: 
2000-01   -0.514551
2000-02   -0.559782
2000-03   -0.783408
2000-04   -1.797685
2000-05   -0.172670
2000-06    0.680215
Freq: M, dtype: float64

PeriodIndex类的构造函数(pandas.PeriodIndex)也可以使用字符串数组(array of strings):
If you have an array of strings, you can also use the PeriodIndex class:

In [157]: vals = ['2001Q3', '2002Q2', '2003Q1'] # gg注:为避免歧义,变量名从原文的values改为vals

In [158]: idx = pd.PeriodIndex(vals, freq='Q-DEC') # gg注:为避免歧义,变量名从原文的index改为idx

In [159]: idx
Out[159]: PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]', freq='Q-DEC')

11.5.1 时期的频率转换

Period Frequency Conversion

Period对象和PeriodIndex对象都可以通过其asfreq方法被转换到别的频率。例如,假设我们有一个年度时期(annual period),希望将其换为当年年初或年末的一个月度时期(monthly period)。这非常简单:
Periods and PeriodIndex objects can be converted to another frequency with their asfreq method. As an example, suppose we had an annual period and wanted to convert it into a monthly period either at the start or end of the year. This is fairly straightforward:

In [160]: p = pd.Period('2007', freq='A-DEC')

In [161]: p
Out[161]: Period('2007', 'A-DEC')

In [162]: p.asfreq('M', how='start')  # gg注:p.asfreq(freq='M', how='start')
Out[162]: Period('2007-01', 'M')

In [163]: p.asfreq('M', how='end') # gg注:p.asfreq(freq='M', how='end')
Out[163]: Period('2007-12', 'M')

你可以将Period('2007', 'A-DEC')看作一种游标,该游标指向一个被划分为多个月度时期的时间跨度,如图11-1所示。对于一个不以十二月结束的财政年度(fiscal year),月度子时期(monthly subperiods)的归属情况就不一样了:
You can think of Period('2007', 'A-DEC') as being a sort of cursor pointing to a span of time, subdivided by monthly periods. See Figure 11-1 for an illustration of this. For a fiscal year ending on a month other than December, the corresponding monthly subperiods are different:

In [164]: p = pd.Period('2007', freq='A-JUN')

In [165]: p
Out[165]: Period('2007', 'A-JUN')

In [166]: p.asfreq('M', 'start')
Out[166]: Period('2006-07', 'M')

In [167]: p.asfreq('M', 'end')
Out[167]: Period('2007-06', 'M')
图11-1:Period频率转换图解 Figure 11-1. Period frequency conversion illustration

在将高频率转换到低频率时,超时期(superperiod)是由子时期(subperiod)所属的位置决定的。例如,在A-JUN频率中,月份“2007年8月”实际上是“2008时期”的一部分:
When you are converting from high to low frequency, pandas determines the superperiod depending on where the subperiod “belongs.” For example, in A-JUN frequency, the month Aug-2007 is actually part of the 2008 period:

In [168]: p = pd.Period('Aug-2007', 'M')

In [169]: p.asfreq('A-JUN')
Out[169]: Period('2008', 'A-JUN')

完整的PeriodIndex对象或时间序列可以通过相同的语义进行类似地转换:
Whole PeriodIndex objects or time series can be similarly converted with the same semantics:

In [170]: rng = pd.period_range('2006', '2009', freq='A-DEC')

In [171]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [172]: ts
Out[172]: 
2006    1.607578
2007    0.200381
2008   -0.834068
2009   -0.302988
Freq: A-DEC, dtype: float64

In [173]: ts.asfreq('M', how='start')
Out[173]: 
2006-01    1.607578
2007-01    0.200381
2008-01   -0.834068
2009-01   -0.302988
Freq: M, dtype: float64

在这里,年度时期(annual period)被替换为月度时期(monthly period),该月度时期对应于每个年度时期内的第一个月。如果我们想要每年的最后一个工作日,我们可以使用“B”频率并指定想要该时期的末尾:
Here, the annual periods are replaced with monthly periods corresponding to the first month falling within each annual period. If we instead wanted the last business day of each year, we can use the 'B' frequency and indicate that we want the end of the period:

In [174]: ts.asfreq('B', how='end')

Out[174]: 
2006-12-29    1.607578
2007-12-31    0.200381
2008-12-31   -0.834068
2009-12-31   -0.302988
Freq: B, dtype: float64

11.5.2 季度时期频率

Quarterly Period Frequencies

季度数据(quarterly data)在会计、金融等领域中很常见。许多季度数据都会涉及财政年度结束日(fiscal year end)的概念,通常是一年12个月中某月的最后一个日历日或工作日。因此,“2012Q4时期”根据财政年度结束日的不同会有不同的含义。pandas支持全部12个可能的季度频率,即Q-JAN到Q-DEC:
Quarterly data is standard in accounting, finance, and other fields. Much quarterly data is reported relative to a fiscal year end, typically the last calendar or business day of one of the 12 months of the year. Thus, the period 2012Q4 has a different meaning depending on fiscal year end. pandas supports all 12 possible quarterly frequencies as Q-JAN through Q-DEC:

In [175]: p = pd.Period('2012Q4', freq='Q-JAN')

In [176]: p
Out[176]: Period('2012Q4', 'Q-JAN')

在以1月结束的财政年度中,“2012Q4时期”是从11月到1月,你可以通过将其转换到日度频率(daily frequency)来查看。如图11-2所示。
In the case of fiscal year ending in January, 2012Q4 runs from November through January, which you can check by converting to daily frequency. See Figure 11-2 for an illustration.

图11.2:不同季度频率约定 Figure 11-2. Different quarterly frequency conventions

In [177]: p.asfreq('D', 'start')
Out[177]: Period('2011-11-01', 'D')

In [178]: p.asfreq('D', 'end')
Out[178]: Period('2012-01-31', 'D')

因此,可以进行简单的时期算术运算(period arithmetic)。例如,要获得该季度倒数第二个工作日下午4点的时间戳,你可以这样做:
Thus, it’s possible to do easy period arithmetic; for example, to get the timestamp at 4PM on the second-to-last business day of the quarter, you could do:

In [179]: p4pm = (p.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60

In [180]: p4pm
Out[180]: Period('2012-01-30 16:00', 'T')

In [181]: p4pm.to_timestamp()
Out[181]: Timestamp('2012-01-30 16:00:00')

可以使用pandas.period_range函数生成季度范围(quarterly range)。季度范围的算术运算也是一样的:
You can generate quarterly ranges using period_range. Arithmetic is identical, too:

In [182]: rng = pd.period_range('2011Q3', '2012Q4', freq='Q-JAN')

In [183]: ts = pd.Series(np.arange(len(rng)), index=rng)

In [184]: ts
Out[184]: 
2011Q3    0
2011Q4    1
2012Q1    2
2012Q2    3
2012Q3    4
2012Q4    5
Freq: Q-JAN, dtype: int64

In [185]: new_rng = (rng.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60

In [186]: ts.index = new_rng.to_timestamp()

In [187]: ts
Out[187]:
2010-10-28 16:00:00    0
2011-01-28 16:00:00    1
2011-04-28 16:00:00    2
2011-07-28 16:00:00    3
2011-10-28 16:00:00    4
2012-01-30 16:00:00    5
dtype: int64

11.5.3 将时间戳转换为时期(及其反向过程)

Converting Timestamps to Periods (and Back)

通过to_period方法,可以将被时间戳索引的Series对象和DataFrame对象转换到被时期索引:
Series and DataFrame objects indexed by timestamps can be converted to periods with the to_period method:

In [188]: rng = pd.date_range('2000-01-01', periods=3, freq='M')

In [189]: ts = pd.Series(np.random.randn(3), index=rng)

In [190]: ts
Out[190]: 
2000-01-31    1.663261
2000-02-29   -0.996206
2000-03-31    1.521760
Freq: M, dtype: float64

In [191]: pts = ts.to_period()

In [192]: pts
Out[192]: 
2000-01    1.663261
2000-02   -0.996206
2000-03    1.521760
Freq: M, dtype: float64

由于时期指的是非重叠的时间跨度,因此对于给定的频率,一个时间戳只能属于一个时期。虽然默认新PeriodIndex的频率是从时间戳推断而来的,但你可以指定任何频率。结果中允许存在重复时期:
Since periods refer to non-overlapping timespans, a timestamp can only belong to a single period for a given frequency. While the frequency of the new PeriodIndex is inferred from the timestamps by default, you can specify any frequency you want. There is also no problem with having duplicate periods in the result:

In [193]: rng = pd.date_range('1/29/2000', periods=6, freq='D')

In [194]: ts2 = pd.Series(np.random.randn(6), index=rng)

In [195]: ts2
Out[195]: 
2000-01-29    0.244175
2000-01-30    0.423331
2000-01-31   -0.654040
2000-02-01    2.089154
2000-02-02   -0.060220
2000-02-03   -0.167933
Freq: D, dtype: float64

In [196]: ts2.to_period('M')
Out[196]: 
2000-01    0.244175
2000-01    0.423331
2000-01   -0.654040
2000-02    2.089154
2000-02   -0.060220
2000-02   -0.167933
Freq: M, dtype: float64
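这些重复的时期标签可以像普通的重复索引一样参与运算,例如按标签分组聚合(补充示意,非原书代码):

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/29/2000', periods=6, freq='D')
ts2 = pd.Series(np.arange(6), index=rng)

# 按重复的月度时期标签求和:2000-01对应0+1+2,2000-02对应3+4+5
print(ts2.to_period('M').groupby(level=0).sum())
```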

要转换回时间戳,使用to_timestamp方法即可:
To convert back to timestamps, use to_timestamp:

In [197]: pts = ts2.to_period()

In [198]: pts
Out[198]: 
2000-01-29    0.244175
2000-01-30    0.423331
2000-01-31   -0.654040
2000-02-01    2.089154
2000-02-02   -0.060220
2000-02-03   -0.167933
Freq: D, dtype: float64

In [199]: pts.to_timestamp(how='end') # gg注:原英文书中的输出有误,只有日期无时间信息
Out[199]: 
2000-01-29 23:59:59.999999999    0.244175
2000-01-30 23:59:59.999999999    0.423331
2000-01-31 23:59:59.999999999   -0.654040
2000-02-01 23:59:59.999999999    2.089154
2000-02-02 23:59:59.999999999   -0.060220
2000-02-03 23:59:59.999999999   -0.167933
Freq: D, dtype: float64

11.5.4 从数组创建PeriodIndex

Creating a PeriodIndex from Arrays

固定频率的数据集有时会将时间跨度信息分开存储在多个列中。 例如,在下面这个宏观经济数据集中,年份和季度就在不同的列中:
Fixed frequency datasets are sometimes stored with timespan information spread across multiple columns. For example, in this macroeconomic dataset, the year and quarter are in different columns:

In [200]: data = pd.read_csv('examples/macrodata.csv')

In [201]: data.head(5)
Out[201]: 
     year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi  \
0  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98   
1  1959.0      2.0  2778.801    1733.7  310.859   481.301   1919.7  29.15   
2  1959.0      3.0  2775.488    1751.8  289.226   491.260   1916.4  29.35   
3  1959.0      4.0  2785.204    1753.7  299.356   484.052   1931.3  29.37   
4  1960.0      1.0  2847.699    1770.5  331.722   462.199   1955.5  29.54   
      m1  tbilrate  unemp      pop  infl  realint  
0  139.7      2.82    5.8  177.146  0.00     0.00  
1  141.7      3.08    5.1  177.830  2.34     0.74  
2  140.5      3.82    5.3  178.657  2.74     1.09  
3  140.0      4.33    5.6  179.386  0.27     4.06  
4  139.6      3.50    5.2  180.007  2.31     1.19  

In [202]: data.year
Out[202]: 
0      1959.0
1      1959.0
2      1959.0
3      1959.0
4      1960.0
5      1960.0
6      1960.0
7      1960.0
8      1961.0
9      1961.0
        ...  
193    2007.0
194    2007.0
195    2007.0
196    2008.0
197    2008.0
198    2008.0
199    2008.0
200    2009.0
201    2009.0
202    2009.0
Name: year, Length: 203, dtype: float64

In [203]: data.quarter
Out[203]: 
0      1.0
1      2.0
2      3.0
3      4.0
4      1.0
5      2.0
6      3.0
7      4.0
8      1.0
9      2.0
      ... 
193    2.0
194    3.0
195    4.0
196    1.0
197    2.0
198    3.0
199    4.0
200    1.0
201    2.0
202    3.0
Name: quarter, Length: 203, dtype: float64

通过将这些数组以及一个频率传入pandas.PeriodIndex函数,就可以将它们组合成DataFrame的索引:
By passing these arrays to PeriodIndex with a frequency, you can combine them to form an index for the DataFrame:

In [204]: idx = pd.PeriodIndex(year=data.year, quarter=data.quarter,
   .....:                        freq='Q-DEC') # gg注:为避免歧义,变量名从原文的index改为idx

In [205]: idx
Out[205]: 
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', length=203, freq='Q-DEC')

In [206]: data.index = idx

In [207]: data.infl
Out[207]: 
1959Q1    0.00
1959Q2    2.34
1959Q3    2.74
1959Q4    0.27
1960Q1    2.31
1960Q2    0.14
1960Q3    2.70
1960Q4    1.21
1961Q1   -0.40
1961Q2    1.47
          ... 
2007Q2    2.75
2007Q3    3.45
2007Q4    6.38
2008Q1    2.82
2008Q2    8.53
2008Q3   -3.16
2008Q4   -8.79
2009Q1    0.94
2009Q2    3.37
2009Q3    3.56
Freq: Q-DEC, Name: infl, Length: 203, dtype: float64

11.6 重采样及频率转换

Resampling and Frequency Conversion

重采样(resampling)指的是将时间序列从一个频率转换到另一个频率的过程。将高频率数据聚合到低频率称为降采样(downsampling),而将低频率数据转换到高频率则称为升采样(upsampling)。并不是所有的重采样都能被划分到这两个大类中。例如,将W-WED(每周三)转换为W-FRI既不是降采样也不是升采样。
Resampling refers to the process of converting a time series from one frequency to another. Aggregating higher frequency data to lower frequency is called downsampling, while converting lower frequency to higher frequency is called upsampling. Not all resampling falls into either of these categories; for example, converting W-WED (weekly on Wednesday) to W-FRI is neither upsampling nor downsampling.

pandas对象都带有一个resample方法,它是各种频率转换工作的主力函数。resample方法有一个类似于groupby方法的API,先调用resample方法分组数据,然后再调用一个聚合函数:
pandas objects are equipped with a resample method, which is the workhorse function for all frequency conversion. resample has a similar API to groupby; you call resample to group the data, then call an aggregation function:

In [208]: rng = pd.date_range('2000-01-01', periods=100, freq='D')

In [209]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [210]: ts
Out[210]: 
2000-01-01    0.631634
2000-01-02   -1.594313
2000-01-03   -1.519937
2000-01-04    1.108752
2000-01-05    1.255853
2000-01-06   -0.024330
2000-01-07   -2.047939
2000-01-08   -0.272657
2000-01-09   -1.692615
2000-01-10    1.423830
                ...   
2000-03-31   -0.007852
2000-04-01   -1.638806
2000-04-02    1.401227
2000-04-03    1.758539
2000-04-04    0.628932
2000-04-05   -0.423776
2000-04-06    0.789740
2000-04-07    0.937568
2000-04-08   -2.253294
2000-04-09   -1.772919
Freq: D, Length: 100, dtype: float64

In [211]: ts.resample('M').mean()
Out[211]: 
2000-01-31   -0.165893
2000-02-29    0.078606
2000-03-31    0.223811
2000-04-30   -0.063643
Freq: M, dtype: float64

In [212]: ts.resample('M', kind='period').mean()
Out[212]: 
2000-01   -0.165893
2000-02    0.078606
2000-03    0.223811
2000-04   -0.063643
Freq: M, dtype: float64
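kind='period'的效果也可以通过“先按时间戳聚合,再用to_period方法转换索引”得到(补充示意,非原书代码;较新版本的pandas已弃用kind参数,更推荐这种写法):

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=100, freq='D')
ts = pd.Series(np.arange(100.0), index=rng)

# 先按月聚合得到时间戳索引,再把索引转换为时期索引
by_period = ts.resample('M').mean().to_period('M')
print(by_period.index[0])  # 2000-01
```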

resample方法是一个灵活高效的方法,可用于处理非常大的时间序列。我将通过一系列的示例说明其用法。表11-5总结它的一些参数。
resample is a flexible and high-performance method that can be used to process very large time series. The examples in the following sections illustrate its semantics and use. Table 11-5 summarizes some of its options.

表11-5. resample方法的参数
Table 11-5. Resample method arguments


11.6.1 降采样

Downsampling

将数据聚合到规律的低频率是一件非常普通的时间序列处理任务。待聚合的数据不必拥有固定的频率,期望的频率会自动定义聚合的箱边缘(bin edge),这些箱边缘用于将时间序列拆分为多个待聚合的片段。例如,要转换到月度频率('M'或'BM'),数据需要被划分到多个单月间隔(one-month interval)中。各间隔都是半开半闭的(half-open):一个数据点只能属于一个间隔,所有间隔的并集必须能组成整个时间范围(time frame,gg注:为方便理解采用“时间范围”,最精确的翻译是“时间框架”)。在用resample方法对数据进行降采样时,需要考虑两件事:

  • 各间隔哪端是闭合的。
  • 如何标记各聚合后的箱,采用间隔的起始还是末尾(gg注:即采用箱的左边缘还是右边缘)。

Aggregating data to a regular, lower frequency is a pretty normal time series task. The data you're aggregating doesn't need to be fixed frequency; the desired frequency defines bin edges that are used to slice the time series into pieces to aggregate. For example, to convert to monthly, 'M' or 'BM', you need to chop up the data into one-month intervals. Each interval is said to be half-open; a data point can only belong to one interval, and the union of the intervals must make up the whole time frame. There are a couple things to think about when using resample to downsample data:

  • Which side of each interval is closed
  • How to label each aggregated bin, either with the start of the interval or the end

为了说明,我们来看一些“1分钟”的数据:
To illustrate, let’s look at some one-minute data:

In [213]: rng = pd.date_range('2000-01-01', periods=12, freq='T')

In [214]: ts = pd.Series(np.arange(12), index=rng)

In [215]: ts
Out[215]: 
2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int64

假设你想通过求和的方式将这些数据聚合到“5分钟”的块中:
Suppose you wanted to aggregate this data into five-minute chunks or bars by taking the sum of each group:

In [216]: ts.resample('5min').sum()  # gg注:原英文书中有误,作者的意图是采用closed参数的默认值
Out[216]: 
2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int64

传入的频率将会以“5分钟”的增量定义箱边缘。默认情况下,箱的左边缘是包含的(gg注:即左闭右开),因此00:00到00:05间隔是包含00:00的[1]。传入closed='right'会让间隔变成左开右闭的:
The frequency you pass defines bin edges in five-minute increments. By default, the left bin edge is inclusive, so the 00:00 value is included in the 00:00 to 00:05 interval[1]. Passing closed='right' changes the interval to be closed on the right:

In [217]: ts.resample('5min', closed='right').sum()
Out[217]: 
1999-12-31 23:55:00     0
2000-01-01 00:00:00    15
2000-01-01 00:05:00    40
2000-01-01 00:10:00    11
Freq: 5T, dtype: int64

结果的时间序列默认是以各箱左边缘的时间戳进行标记的。传入label='right'即可用箱的右边缘对其进行标记:
The resulting time series is labeled by the timestamps from the left side of each bin. By passing label='right' you can label them with the right bin edge:

In [218]: ts.resample('5min', closed='right', label='right').sum()
Out[218]: 
2000-01-01 00:00:00     0
2000-01-01 00:05:00    15
2000-01-01 00:10:00    40
2000-01-01 00:15:00    11
Freq: 5T, dtype: int64

图11-3是“1分钟”频率的数据被重采样到“5分钟”频率的示意图。
See Figure 11-3 for an illustration of minute frequency data being resampled to five-minute frequency.

图11-3. 各种closed、label约定的“5分钟”重采样示意图 Figure 11-3. Five-minute resampling illustration of closed, label conventions

最后,你可能想对结果的索引进行一些移动,例如从右边缘减去一秒以便更容易明白该时间戳到底表示的是哪个间隔。只需要给loffset参数传入一个字符串或日期偏移量即可实现这个目的:
Lastly, you might want to shift the result index by some amount, say subtracting one second from the right edge to make it more clear which interval the timestamp refers to. To do this, pass a string or date offset to loffset:

In [219]: ts.resample('5min', closed='right',
   .....:             label='right', loffset='-1s').sum()
Out[219]: 
1999-12-31 23:59:59     0
2000-01-01 00:04:59    15
2000-01-01 00:09:59    40
2000-01-01 00:14:59    11
Freq: 5T, dtype: int64

也可以通过调用结果对象的shift方法来实现该效果,这样就不需要设置loffset参数了。
You also could have accomplished the effect of loffset by calling the shift method on the result without the loffset.
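上面提到的shift做法大致如下(补充示意,非原书代码):

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# 与loffset='-1s'等效:对重采样结果调用shift,把索引整体前移1秒
result = ts.resample('5min', closed='right', label='right').sum()
shifted = result.shift(-1, freq='s')
print(shifted.index[0])  # 1999-12-31 23:59:59
```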

11.6.1.1 OHLC重采样

Open-High-Low-Close (OHLC) resampling

金融领域中有一种聚合时间序列的常见方式,即计算各时间段的四个值:开盘价(open)、最高价(high)、最低价(low)和收盘价(close)。使用ohlc聚合函数即可得到一个含有这四个值的DataFrame对象,只需要对数据进行一次扫描就可以有效地计算出结果:
(gg注:为方便理解对原英文书中的“the first (open), last (close), maximum (high), and minimal (low) values”的顺序进行了调整)
In finance, a popular way to aggregate a time series is to compute four values for each bucket: the first (open), maximum (high), minimal (low), and last (close) values. By using the ohlc aggregate function you will obtain a DataFrame having columns containing these four aggregates, which are efficiently computed in a single sweep of the data:

In [220]: ts.resample('5min').ohlc()
Out[220]: 
                     open  high  low  close
2000-01-01 00:00:00     0     4    0      4
2000-01-01 00:05:00     5     9    5      9
2000-01-01 00:10:00    10    11   10     11

11.6.2 升采样和插值

Upsampling and Interpolation

在将数据从低频率转换到高频率时,就不需要聚合了。我们来看一个带有一些周度数据(weekly data)的DataFrame对象:
When converting from a low frequency to a higher frequency, no aggregation is needed. Let’s consider a DataFrame with some weekly data:

In [221]: frame = pd.DataFrame(np.random.randn(2, 4),
   .....:                      index=pd.date_range('1/1/2000', periods=2,
   .....:                                          freq='W-WED'),
   .....:                      columns=['Colorado', 'Texas', 'New York', 'Ohio'])

In [222]: frame
Out[222]: 
            Colorado     Texas  New York      Ohio
2000-01-05 -0.896431  0.677263  0.036503  0.087102
2000-01-12 -0.046662  0.927238  0.482284 -0.867130

当你对这个数据进行聚合时,每组只有一个值,这样就会引入缺失值。我们使用asfreq方法转换到高频率,不经过聚合:
When you are using an aggregation function with this data, there is only one value per group, and missing values result in the gaps. We use the asfreq method to convert to the higher frequency without any aggregation:

In [223]: df_daily = frame.resample('D').asfreq()

In [224]: df_daily
Out[224]: 
            Colorado     Texas  New York      Ohio
2000-01-05 -0.896431  0.677263  0.036503  0.087102
2000-01-06       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12 -0.046662  0.927238  0.482284 -0.867130

假设你想在“非星期三”向前填充各周度值(weekly value)。resample方法的填充和插值方式跟fillna方法和reindex方法的一样:
Suppose you wanted to fill forward each weekly value on the non-Wednesdays. The same filling or interpolation methods available in the fillna and reindex methods are available for resampling:

In [225]: frame.resample('D').ffill()
Out[225]: 
            Colorado     Texas  New York      Ohio
2000-01-05 -0.896431  0.677263  0.036503  0.087102
2000-01-06 -0.896431  0.677263  0.036503  0.087102
2000-01-07 -0.896431  0.677263  0.036503  0.087102
2000-01-08 -0.896431  0.677263  0.036503  0.087102
2000-01-09 -0.896431  0.677263  0.036503  0.087102
2000-01-10 -0.896431  0.677263  0.036503  0.087102
2000-01-11 -0.896431  0.677263  0.036503  0.087102
2000-01-12 -0.046662  0.927238  0.482284 -0.867130

同样,你可以选择只向前填充指定的时期数,以限制继续使用某个观测值的时间范围:
You can similarly choose to only fill a certain number of periods forward to limit how far to continue using an observed value:

In [226]: frame.resample('D').ffill(limit=2)
Out[226]:
            Colorado     Texas  New York      Ohio
2000-01-05 -0.896431  0.677263  0.036503  0.087102
2000-01-06 -0.896431  0.677263  0.036503  0.087102
2000-01-07 -0.896431  0.677263  0.036503  0.087102
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12 -0.046662  0.927238  0.482284 -0.867130
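
上面ffill(limit=2)的效果可以用一个自包含的小例子复现(gg注:示意数据,仅含一列value,与原书的四列随机数据不同):

```python
import pandas as pd

# 两个周度观测值(W-WED,即以星期三结尾的周),升采样到日度
idx = pd.date_range("2000-01-05", periods=2, freq="W-WED")
frame = pd.DataFrame({"value": [1.0, 2.0]}, index=idx)

# 每个观测值最多向前填充 2 天,其余位置保留为 NaN
daily = frame.resample("D").ffill(limit=2)
print(daily)
```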

注意,新的日期索引完全没必要跟旧的重叠:
Notably, the new date index need not overlap with the old one at all:

In [227]: frame.resample('W-THU').ffill()
Out[227]: 
            Colorado     Texas  New York      Ohio
2000-01-06 -0.896431  0.677263  0.036503  0.087102
2000-01-13 -0.046662  0.927238  0.482284 -0.867130

11.6.3 通过时期进行重采样

Resampling with Periods

对以时期为索引的数据进行重采样,与以时间戳为索引的数据类似:
Resampling data indexed by periods is similar to timestamps:

In [228]: frame = pd.DataFrame(np.random.randn(24, 4),
   .....:                      index=pd.period_range('1-2000', '12-2001',
   .....:                                            freq='M'),
   .....:                      columns=['Colorado', 'Texas', 'New York', 'Ohio'])

In [229]: frame[:5]
Out[229]: 
         Colorado     Texas  New York      Ohio
2000-01  0.493841 -0.155434  1.397286  1.507055
2000-02 -1.179442  0.443171  1.395676 -0.529658
2000-03  0.787358  0.248845  0.743239  1.267746
2000-04  1.302395 -0.272154 -0.051532 -0.467740
2000-05 -1.040816  0.426419  0.312945 -1.115689

In [230]: annual_frame = frame.resample('A-DEC').mean()

In [231]: annual_frame
Out[231]: 
      Colorado     Texas  New York      Ohio
2000  0.556703  0.016631  0.111873 -0.027445
2001  0.046303  0.163344  0.251503 -0.157276

升采样要稍微麻烦一些,因为在重采样之前,你必须决定将值放置在新频率中时间跨度的哪一端,就像asfreq方法那样。convention参数默认为'start',也可以设置为'end':
Upsampling is more nuanced, as you must make a decision about which end of the timespan in the new frequency to place the values before resampling, just like the asfreq method. The convention argument defaults to 'start' but can also be 'end':

In [232]: annual_frame.resample('Q-DEC').ffill()
Out[232]: 
        Colorado     Texas  New York      Ohio
2000Q1  0.556703  0.016631  0.111873 -0.027445
2000Q2  0.556703  0.016631  0.111873 -0.027445
2000Q3  0.556703  0.016631  0.111873 -0.027445
2000Q4  0.556703  0.016631  0.111873 -0.027445
2001Q1  0.046303  0.163344  0.251503 -0.157276
2001Q2  0.046303  0.163344  0.251503 -0.157276
2001Q3  0.046303  0.163344  0.251503 -0.157276
2001Q4  0.046303  0.163344  0.251503 -0.157276

In [233]: annual_frame.resample('Q-DEC', convention='end').ffill()
Out[233]: 
        Colorado     Texas  New York      Ohio
2000Q4  0.556703  0.016631  0.111873 -0.027445
2001Q1  0.556703  0.016631  0.111873 -0.027445
2001Q2  0.556703  0.016631  0.111873 -0.027445
2001Q3  0.556703  0.016631  0.111873 -0.027445
2001Q4  0.046303  0.163344  0.251503 -0.157276

由于时期指的是时间跨度,所以关于升采样和降采样的规则就比较严格:

  • 在降采样中,目标频率必须是源频率的子时期(subperiod)。
  • 在升采样中,目标频率必须是源频率的超时期(superperiod)。

Since periods refer to timespans, the rules about upsampling and downsampling are more rigid:

  • In downsampling, the target frequency must be a subperiod of the source frequency.
  • In upsampling, the target frequency must be a superperiod of the source frequency.

如果不满足这些规则,就会引发异常。这主要影响季度频率、年度频率和周度频率。例如,由Q-MAR定义的时间跨度只能与A-MAR、A-JUN、A-SEP和A-DEC对齐:
If these rules are not satisfied, an exception will be raised. This mainly affects the quarterly, annual, and weekly frequencies; for example, the timespans defined by Q-MAR only line up with A-MAR, A-JUN, A-SEP, and A-DEC:

In [234]: annual_frame.resample('Q-MAR').ffill()
Out[234]: 
        Colorado     Texas  New York      Ohio
2000Q4  0.556703  0.016631  0.111873 -0.027445
2001Q1  0.556703  0.016631  0.111873 -0.027445
2001Q2  0.556703  0.016631  0.111873 -0.027445
2001Q3  0.556703  0.016631  0.111873 -0.027445
2001Q4  0.046303  0.163344  0.251503 -0.157276
2002Q1  0.046303  0.163344  0.251503 -0.157276
2002Q2  0.046303  0.163344  0.251503 -0.157276
2002Q3  0.046303  0.163344  0.251503 -0.157276

11.7 移动窗口函数

Moving Window Functions

时间序列运算中所使用的数组转换有一个重要类别:在滑动窗口(sliding window)上或使用指数衰减权重(exponentially decaying weights)计算的各种统计函数及其它函数。它们可以用于平滑噪声数据(noisy data)或缺口数据(gappy data)。我将它们称为移动窗口函数(moving window function),尽管其中也包括窗口长度不固定的函数,例如指数加权移动平均。跟其它统计函数一样,移动窗口函数也会自动排除缺失数据。
An important class of array transformations used for time series operations are statistics and other functions evaluated over a sliding window or with exponentially decaying weights. This can be useful for smoothing noisy or gappy data. I call these moving window functions, even though it includes functions without a fixed-length window like exponentially weighted moving average. Like other statistical functions, these also automatically exclude missing data.

开始之前,我们加载一些时间序列数据,并将其重采样为工作日频率:
Before digging in, we can load up some time series data and resample it to business day frequency:

In [235]: close_px_all = pd.read_csv('examples/stock_px_2.csv',
   .....:                            parse_dates=True, index_col=0)

In [236]: close_px = close_px_all[['AAPL', 'MSFT', 'XOM']]

In [237]: close_px = close_px.resample('B').ffill()

现在引入rolling函数,它的行为与resample方法和groupby方法很像。可以在Series对象或DataFrame对象上结合一个窗口(window,表示为时期数)调用它(生成的图见图11-4):
I now introduce the rolling operator, which behaves similarly to resample and groupby. It can be called on a Series or DataFrame along with a window (expressed as a number of periods; see Figure 11-4 for the plot created):

In [238]: close_px.AAPL.plot()
Out[238]: <matplotlib.axes._subplots.AxesSubplot at 0x7f2f2570cf98>

In [239]: close_px.AAPL.rolling(250).mean().plot() # gg注:等价于close_px.AAPL.rolling(window=250).mean().plot()

图11-4 苹果公司股价250日的移动平均线 Figure 11-4. Apple Price with 250-day MA

表达式rolling(250)的行为与groupby方法很像,但它不是直接分组,而是创建一个对象,该对象允许在250日的滑动窗口上进行分组。这样,我们就得到了苹果公司股价250日的移动平均线。
The expression rolling(250) is similar in behavior to groupby, but instead of grouping it creates an object that enables grouping over a 250-day sliding window. So here we have the 250-day moving window average of Apple's stock price.

默认情况下,rolling函数要求窗口中的所有值都是非NA值。可以修改该行为以解决缺失数据的问题,尤其是在时间序列开头,数据量会少于窗口的时期数(见图11-5):
By default rolling functions require all of the values in the window to be non-NA. This behavior can be changed to account for missing data and, in particular, the fact that you will have fewer than window periods of data at the beginning of the time series (see Figure 11-5):

In [241]: appl_std250 = close_px.AAPL.rolling(250, min_periods=10).std()

In [242]: appl_std250[5:12]
Out[242]: 
2003-01-09         NaN
2003-01-10         NaN
2003-01-13         NaN
2003-01-14         NaN
2003-01-15    0.077496
2003-01-16    0.074760
2003-01-17    0.112368
Freq: B, Name: AAPL, dtype: float64

In [243]: appl_std250.plot()

图11-5 苹果公司250日的日收益标准差 Figure 11-5. Apple 250-day daily return standard deviation

为了计算扩展窗口均值(expanding window mean),使用expanding函数代替rolling函数。扩展窗口均值的时间窗口从时间序列的起始处开始,并不断增大窗口,直到包含整个时间序列。appl_std250时间序列的扩展窗口均值如下:
In order to compute an expanding window mean, use the expanding operator instead of rolling. The expanding mean starts the time window from the beginning of the time series and increases the size of the window until it encompasses the whole series. An expanding window mean on the apple_std250 time series looks like this:

In [244]: expanding_mean = appl_std250.expanding().mean()
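
rolling、min_periods与expanding三者的区别可以用一个很短的序列来验证(gg注:示意数据,非原书股价数据):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

roll = s.rolling(3).mean()                     # 前两个位置不足 3 个观测值,结果为 NaN
roll_min = s.rolling(3, min_periods=1).mean()  # 允许窗口内观测值不足时也进行计算
expand = s.expanding().mean()                  # 窗口从序列起始处开始并不断扩大
```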

在DataFrame对象上调用移动窗口函数,会将转换应用到每一列(见图11-6):
Calling a moving window function on a DataFrame applies the transformation to each column (see Figure 11-6):

In [246]: close_px.rolling(60).mean().plot(logy=True)

图11-6 各股价60日的移动平均线(对数Y轴) Figure 11-6. Stocks prices 60-day MA (log Y-axis)

rolling函数也可以接受一个字符串,该字符串表示固定大小的时间偏移量而不是固定数量的时期(gg注:即window参数可以等于字符串例如“20D”’)。使用这种表示法对不规则的时间序列很有用。这些字符串也可以传递给resample方法。例如,我们可以计算20日的滚动均值,如下所示:
The rolling function also accepts a string indicating a fixed-size time offset rather than a set number of periods. Using this notation can be useful for irregular time series. These are the same strings that you can pass to resample. For example, we could compute a 20-day rolling mean like so:

In [247]: close_px.rolling('20D').mean() # gg注:等价于close_px.rolling(window='20D').mean()
Out[247]:
                  AAPL       MSFT        XOM
2003-01-02    7.400000  21.110000  29.220000
2003-01-03    7.425000  21.125000  29.230000
2003-01-06    7.433333  21.256667  29.473333
2003-01-07    7.432500  21.425000  29.342500
2003-01-08    7.402000  21.402000  29.240000
2003-01-09    7.391667  21.490000  29.273333
2003-01-10    7.387143  21.558571  29.238571
2003-01-13    7.378750  21.633750  29.197500
2003-01-14    7.370000  21.717778  29.194444
2003-01-15    7.355000  21.757000  29.152000
...                ...        ...        ...
2011-10-03  398.002143  25.890714  72.413571
2011-10-04  396.802143  25.807857  72.427143
2011-10-05  395.751429  25.729286  72.422857
2011-10-06  394.099286  25.673571  72.375714
2011-10-07  392.479333  25.712000  72.454667
2011-10-10  389.351429  25.602143  72.527857
2011-10-11  388.505000  25.674286  72.835000
2011-10-12  388.531429  25.810000  73.400714
2011-10-13  388.826429  25.961429  73.905000
2011-10-14  391.038000  26.048667  74.185333
[2292 rows x 3 columns]
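
对于不规则时间序列,按固定大小的时间偏移量(如'2D')与按固定数量的时期(如2)设定的窗口会给出不同的结果。下面是一个自包含的对比示例(gg注:示意数据,其中1月4日故意缺失):

```python
import pandas as pd

# 不规则的时间索引:1 月 4 日缺失
idx = pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03", "2000-01-05"])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

by_days = s.rolling("2D").sum()  # 窗口为 2 天:1 月 5 日的窗口中没有 1 月 4 日,只剩自身
by_count = s.rolling(2).sum()    # 窗口为 2 个观测值:1 月 5 日与 1 月 3 日相加
```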

11.7.1 指数加权函数

Exponentially Weighted Functions

除了使用固定大小的窗口并赋予各观察结果同等权重之外,另一种方式是指定一个恒定的衰减因子(decay factor),以便赋予近期的观察结果更多的权重。指定衰减因子的方式有好几种。常用的方式是使用跨度(span),它使结果类似于窗口大小等于跨度的简单移动窗口函数。
An alternative to using a static window size with equally weighted observations is to specify a constant decay factor to give more weight to more recent observations. There are a couple of ways to specify the decay factor. A popular one is using a span, which makes the result comparable to a simple moving window function with window size equal to the span.

由于指数加权统计会赋予近期的观察结果更多的权重,因此与等权统计相比,它能更快地“适应”变化。
Since an exponentially weighted statistic places more weight on more recent observations, it “adapts” faster to changes compared with the equal-weighted version.

除了rolling函数和expanding函数,pandas还有ewm函数。下面这个例子比较了苹果公司股价60日的简单移动平均线和span=60的指数加权移动平均线(见图11-7):
pandas has the ewm operator to go along with rolling and expanding. Here’s an example comparing a 60-day moving average of Apple’s stock price with an EW moving average with span=60 (see Figure 11-7):

In [249]: aapl_px = close_px.AAPL['2006':'2007']

In [250]: ma60 = aapl_px.rolling(60, min_periods=20).mean() # gg注:原英文书中有误,作者的意图是60而不是30

In [251]: ewma60 = aapl_px.ewm(span=60).mean() # gg注:原英文书中有误,作者的意图是60而不是30

In [252]: ma60.plot(style='k--', label='Simple MA')
Out[252]: <matplotlib.axes._subplots.AxesSubplot at 0x7f2f252161d0>

In [253]: ewma60.plot(style='k-', label='EW MA')
Out[253]: <matplotlib.axes._subplots.AxesSubplot at 0x7f2f252161d0>

In [254]: plt.legend()

图11-7 简单移动平均线VS指数加权移动平均线 Figure 11-7. Simple moving average versus exponentially weighted
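
跨度(span)与衰减因子alpha的对应关系是alpha = 2 / (span + 1),可以用下面的小例子验证(gg注:示意数据,非原书股价数据):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

by_span = s.ewm(span=3).mean()      # 按跨度指定衰减
by_alpha = s.ewm(alpha=0.5).mean()  # span=3 时 alpha = 2/(3+1) = 0.5,两者等价
```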

11.7.2 二元移动窗口函数

Binary Moving Window Functions

一些统计运算符(例如相关系数和协方差)需要在两个时间序列上运算。例如,金融分析师通常对某只股票与标普500等基准指数的相关系数感兴趣。为了解这一点,先计算所有感兴趣的时间序列的百分比变化:
Some statistical operators, like correlation and covariance, need to operate on two time series. As an example, financial analysts are often interested in a stock’s correlation to a benchmark index like the S&P 500. To have a look at this, we first compute the percent change for all of our time series of interest:
(gg注:结合上下文,作者想计算的是相关系数correlation coefficient,但他写作时省略了“coefficient ”,翻译时进行补足)

In [256]: spx_px = close_px_all['SPX']

In [257]: spx_rets = spx_px.pct_change()

In [258]: returns = close_px.pct_change()

在调用rolling函数后,corr聚合函数计算与spx_rets的滚动相关系数(结果见图11-8):
The corr aggregation function after we call rolling can then compute the rolling correlation with spx_rets (see Figure 11-8 for the resulting plot):

In [259]: corr_coefficient  = returns.AAPL.rolling(125, min_periods=100).corr(spx_rets) # gg注:为避免歧义,变量名从原文的corr改为corr_coefficient

In [260]: corr_coefficient.plot()

图11-8 苹果公司与标普500六个月的收益相关系数 Figure 11-8. Six-month AAPL return correlation to S&P 500

假设你想一次性计算多只股票与标普500的相关系数。虽然编写一个循环并新建一个DataFrame对象不是什么难事,但比较啰嗦。其实,只需传入一个Series对象和一个DataFrame对象,rolling(...).corr将自动计算该Series对象(本例中就是spx_rets)与DataFrame对象中每列的相关系数(结果见图11-9):
Suppose you wanted to compute the correlation of the S&P 500 index with many stocks at once. Writing a loop and creating a new DataFrame would be easy but might get repetitive, so if you pass a Series and a DataFrame, rolling(...).corr(gg注:原英文书中rolling_corr是老版的函数,新版已取消)will compute the correlation of the Series (spx_rets, in this case) with each column in the DataFrame (see Figure 11-9 for the plot of the result):

In [262]: corr_coefficient2 = returns.rolling(125, min_periods=100).corr(spx_rets) # gg注:为避免歧义,变量名从原文的corr改为corr_coefficient2

In [263]: corr_coefficient2.plot()

图11-9 3只股票与标普500六个月的收益相关系数 Figure 11-9. Six-month return correlations to S&P 500
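
“传入一个Series对象,对DataFrame对象的每列分别计算滚动相关系数”这一行为可以用合成数据验证:若基准就是其中一列,则该列的滚动相关系数在窗口填满后应恒为1(gg注:示意代码,非原书股价数据):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=["A", "B"])
bench = df["A"]  # 把 A 列自身当作“基准”,便于验证结果

# 对 df 的每一列分别计算与 bench 的滚动相关系数
corr = df.rolling(4).corr(bench)
```

corr["A"]在前3个位置为NaN(窗口尚未填满),其余位置应全部为1.0。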

11.7.3 用户定义的移动窗口函数

User-Defined Moving Window Functions

在rolling及其相关函数上的apply方法,让你能够在移动窗口上应用自己设计的数组函数。唯一的要求是:该函数从数组的每一部分产生一个单值(即归约)。例如,虽然我们可以使用rolling(...).quantile(q)计算样本分位数,但我们可能对某个特定值在样本中的百分等级感兴趣。scipy.stats.percentileofscore函数就能达到这个目的(结果见图11-10):
The apply method on rolling and related methods provides a means to apply an array function of your own devising over a moving window. The only requirement is that the function produce a single value (a reduction) from each piece of the array. For example, while we can compute sample quantiles using rolling(...).quantile(q), we might be interested in the percentile rank of a particular value over the sample. The scipy.stats.percentileofscore function does just this (see Figure 11-10 for the resulting plot):

In [265]: from scipy.stats import percentileofscore

In [266]: score_at_2percent = lambda x: percentileofscore(x, 0.02)

In [267]: result = returns.AAPL.rolling(250).apply(score_at_2percent)

In [268]: result.plot()

图11-10 一年窗口的苹果公司股价2%收益的百分等级 Figure 11-10. Percentile rank of 2% AAPL return over one-year window
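
如果不想依赖SciPy,也可以用纯pandas写一个类似的百分等级函数(gg注:示意实现,(x <= score).mean() * 100对应percentileofscore的kind='weak',与默认的kind='rank'在存在并列值时结果略有差异;数据为假设数据):

```python
import pandas as pd

# 假设的收益率序列
returns = pd.Series([0.00, 0.01, 0.02, 0.05, 0.06])

# 窗口内小于等于 0.02 的观测值所占比例,乘以 100 即得到百分等级
score_at_2percent = lambda x: (x <= 0.02).mean() * 100
result = returns.rolling(4).apply(score_at_2percent, raw=True)
```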

如果你没安装SciPy,可以使用conda或pip安装。
If you don’t have SciPy installed already, you can install it with conda or pip.

11.8 本章小结

Conclusion

与前面章节讲解的其它类型的数据相比,时间序列数据需要不同类型的分析和数据转换工具。
Time series data calls for different types of analysis and data transformation tools than the other types of data we have explored in previous chapters.
在接下来的章节中,我们将继续介绍一些高级的pandas方法,并展示如何开始使用statsmodels和scikit-learn等建模库。
In the following chapters, we will move on to some advanced pandas methods and show how to start using modeling libraries like statsmodels and scikit-learn.


  1. closed参数和label参数的默认值可能会让部分用户感到奇怪。实际上,默认值的选择有些随意;对于某些目标频率,closed='left'更合适,而对于其它频率,closed='right'更合理。重要的是,你要清楚自己究竟是如何对数据分段的。
    The choice of the default values for closed and label might seem a bit odd to some users. In practice the choice is somewhat arbitrary; for some target frequencies, closed='left' is preferable, while for others closed='right' makes more sense. The important thing is that you keep in mind exactly how you are segmenting the data.
