Pandas Quick Start: 1 - Introduction, 2 - Series, 3 - DataFrame

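Introduction to Pandas

Pandas is an open-source library designed to make working with relational or labeled data easy and intuitive. It provides data structures and operations for manipulating numerical data and time series, and it is built on top of the NumPy library. Pandas is fast and offers high performance and productivity for its users.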

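History

Pandas was originally developed by Wes McKinney in 2008 while he was working at AQR Capital Management. He convinced AQR to let him open-source the library. Another AQR employee, Chang She, joined in 2012 as the second major contributor. Many versions of pandas have been released since then. At the time of writing, the latest version was 1.5.3, released on January 18, 2023.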

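Advantages

Fast and efficient data manipulation and analysis.
Data can be loaded from many different file formats.
Easy handling of missing data (represented as NaN) in both floating-point and non-floating-point data.
Size mutability: columns can be inserted into and deleted from DataFrames and higher-dimensional objects.
Merging and joining of data sets.
Flexible reshaping and pivoting of data sets.
Time series functionality.
Powerful group-by functionality for split-apply-combine operations on data sets.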


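Quick start

Pandas provides two main data structures for working with data:

Series
DataFrame

A Series is a one-dimensional labeled array that can hold data of any type (integers, strings, floats, Python objects, and so on). The axis labels are collectively called the index. A Series is comparable to a single column in an Excel sheet. Labels need not be unique, but they must be hashable. A Series supports both integer-based and label-based indexing and provides many methods for operations involving the index.

A Series can be created by loading data from existing storage, such as a SQL database, a CSV file, or an Excel file, or from a list, a dictionary, or a scalar value.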

import pandas as pd
import numpy as np
 
 
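# Create an empty Series; the dtype is passed explicitly because the
# default dtype of an empty Series depends on the pandas version
ser = pd.Series(dtype="float64")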
   
print(ser)
 
# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
   
ser = pd.Series(data)
print(ser)

Output:

Series([], dtype: float64)
0    g
1    e
2    e
3    k
4    s
dtype: object
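A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It consists of three main components: the data, the rows, and the columns.

A DataFrame can be created by loading data from existing storage, such as a SQL database, a CSV file, or an Excel file, or from lists, a dictionary, or a list of dictionaries.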

import pandas as pd
   
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame = pd.DataFrame(data)
print(frame)

Output:

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
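
The example above builds a DataFrame from a dictionary of lists. A list of dictionaries works as well, with each dictionary treated as one row. This is a small sketch with made-up rows; the name frame_from_rows is only illustrative.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names
rows = [
    {"state": "Ohio", "year": 2000, "pop": 1.5},
    {"state": "Nevada", "year": 2001},  # 'pop' is missing for this row
]

frame_from_rows = pd.DataFrame(rows)
print(frame_from_rows)
```

A key that is absent from one of the dictionaries shows up as NaN in the corresponding cell.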

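Why Pandas is used for data science

Pandas is widely used in data science largely because it works together with the other libraries in that ecosystem. It is built on top of NumPy, so many NumPy structures are used or replicated in Pandas. Data produced by Pandas is often used as input to plotting with Matplotlib, statistical analysis with SciPy, and machine learning algorithms in Scikit-learn.

Exercise 1

Series

Creation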

# import pandas as pd
import pandas as pd
 
# simple array
data = [1, 2, 3, 4]
 
ser = pd.Series(data)
print(ser)

Output:

0    1
1    2
2    3
3    4
dtype: int64

import pandas as pd
songs2 = pd.Series([145, 142, 38, 13],
     name='counts')

Accessing elements

There are two ways to access the elements of a Series:

Accessing elements by position
Accessing elements by label (index)

import pandas as pd
import numpy as np
 
ser = pd.Series([1,2,3,4,5,6,7,8], index=(0,1,3,4,5,6,7,8))
print(ser)
  

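# Access an element by its index label; note that 3 here is a label, not a position
print(ser[3])
  
# Get the first 5 elements
print(ser[:5])  
print(ser.head(5))
print(ser.head())

# Integer slices with [] are positional
print(ser[1:4])

print(ser.loc[3]) # the element whose index label is 3
print(ser.iloc[3]) # the 4th element (position 3)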

Output:

0    1
1    2
3    3
4    4
5    5
6    6
7    7
8    8
dtype: int64
3
0    1
1    2
3    3
4    4
5    5
dtype: int64
0    1
1    2
3    3
4    4
5    5
dtype: int64
0    1
1    2
3    3
4    4
5    5
dtype: int64
1    2
3    3
4    4
dtype: int64
3
4
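
Because the index in this example skips the label 2, labels and positions no longer coincide. The sketch below reuses the same Series to make the difference explicit; the helper variable names are illustrative only.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=(0, 1, 3, 4, 5, 6, 7, 8))

by_position = ser.iloc[4]  # fifth element, whatever its label is
by_label = ser.loc[4]      # element whose index label is 4

# The label 2 does not exist in this index, so a label lookup fails
try:
    ser.loc[2]
    label_2_found = True
except KeyError:
    label_2_found = False

print(by_position, by_label, label_2_found)  # 5 4 False
```

Using loc and iloc explicitly avoids having to remember how plain square brackets interpret an integer.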

Binary operations

We can perform binary operations on Series, such as addition, subtraction, and many others.

# -*- coding: utf-8 -*-
# importing pandas module  
import pandas as pd  
 
# creating a series
data = pd.Series([5, 2, 3,7], index=['a', 'b', 'c', 'd'])
 
# creating a series
data1 = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])
 
print(data, "\n\n", data1)

# Add the two Series with .add(); the original Series are not modified
print(data.add(data1, fill_value=0))

# Subtract the second Series from the first with .sub()
print(data.sub(data1, fill_value=0))

Output:

a    5
b    2
c    3
d    7
dtype: int64 

 a    1
b    6
d    4
e    9
dtype: int64
a     6.0
b     8.0
c     3.0
d    11.0
e     9.0
dtype: float64
a    4.0
b   -4.0
c    3.0
d    3.0
e   -9.0
dtype: float64
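
For comparison, here is a short sketch of what happens when the plain + operator is used instead of .add() with fill_value. It reuses the two Series from the example above; the name total is an illustrative addition.

```python
import pandas as pd

data = pd.Series([5, 2, 3, 7], index=["a", "b", "c", "d"])
data1 = pd.Series([1, 6, 4, 9], index=["a", "b", "d", "e"])

# Labels are aligned first; 'c' and 'e' exist in only one operand,
# so the result is NaN for them when no fill_value is given
total = data + data1
print(total)
```

The labels 'c' and 'e' come out as NaN here, whereas fill_value=0 in the earlier example treats the missing operand as 0.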

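Conversion operations

Conversion operations include changing the data type of a Series, converting a Series to a list, and so on. Pandas provides methods such as .astype() and .tolist() for this purpose.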

# -*- coding: utf-8 -*-
import pandas as pd  

data = pd.read_csv("nba.csv") 
    
# drop rows that contain null values so that astype(int) does not fail
data.dropna(inplace=True)
   
# storing dtype before converting 
before = data.dtypes 
   
# converting dtypes using astype 
data["Salary"]= data["Salary"].astype(int) 
data["Number"]= data["Number"].astype(str) 
   
# storing dtype after converting 
after = data.dtypes 
   
# printing to compare 
print("BEFORE CONVERSION\n", before, "\n") 
print("AFTER CONVERSION\n", after, "\n") 
   
# storing dtype before operation 
dtype_before = type(data["Salary"]) 
   
# converting to list 
salary_list = data["Salary"].tolist() 
   
# storing dtype after operation 
dtype_after = type(salary_list) 
   
# printing dtype 
print("Data type before converting = {}\nData type after converting = {}"
      .format(dtype_before, dtype_after))

Output:

BEFORE CONVERSION
 Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object 

AFTER CONVERSION
 Name         object
Team         object
Number       object
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary        int32
dtype: object 

Data type before converting = <class 'pandas.core.series.Series'>
Data type after converting = <class 'list'>

Pandas dtype	Python type
object	str
int64	int
float64	float
datetime64	datetime
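
The example above depends on an external nba.csv file. As a self-contained alternative, the following sketch applies the same .astype() and .tolist() calls to a small hand-written Series; the salary values are invented for illustration.

```python
import pandas as pd

# Hypothetical salary values stored as floats
salary = pd.Series([1500.0, 2300.0, 4100.0], name="Salary")

salary_int = salary.astype("int64")  # float64 -> int64
salary_str = salary_int.astype(str)  # int64 -> strings
salary_list = salary_int.tolist()    # Series -> plain Python list

print(salary.dtype, salary_int.dtype)
print(salary_str.tolist())
print(salary_list)
```

Requesting "int64" explicitly keeps the result independent of the platform default integer size.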

Further exercises (optional)

>>> import pandas as pd
>>> songs2 = pd.Series([145, 142, 38, 13],
...      name='counts')

>>> songs2
0    145
1    142
2     38
3     13
Name: counts, dtype: int64
>>>
>>> songs2.index
RangeIndex(start=0, stop=4, step=1)
>>> songs3 = pd.Series([145, 142, 38, 13],
...      name='counts',
...      index=['Paul', 'John', 'George', 'Ringo'])
>>> songs3
Paul      145
John      142
George     38
Ringo      13
Name: counts, dtype: int64

>>> songs3.index
Index(['Paul', 'John', 'George', 'Ringo'], dtype='object')
>>> class Foo:
...     pass
...
>>> ringo = pd.Series(
...      ['Richard', 'Starkey', 13, Foo()],
...      name='ringo') # the data need not be numeric or homogeneous
>>> ringo
0                                        Richard
1                                        Starkey
2                                             13
3    <__main__.Foo object at 0x0000021510766CD0>
Name: ringo, dtype: object
>>> import numpy as np # float64 supports NaN and int64 does not; the nullable Int64 dtype supports missing values (<NA>)
>>> nan_series = pd.Series([2, np.nan],
...    index=['Ono', 'Clapton'])
>>> nan_series
Ono        2.0
Clapton    NaN
dtype: float64
>>> nan_series.count()
1
>>> nan_series.size
2
>>> nan_series2 = pd.Series([2, None],
...    index=['Ono', 'Clapton'],
...    dtype='Int64')
>>> nan_series2
Ono           2
Clapton    <NA>
dtype: Int64
>>> nan_series2.count()
1
>>> nan_series.astype('Int64')
Ono           2
Clapton    <NA>
dtype: Int64
>>> import numpy as np
>>> numpy_ser = np.array([145, 142, 38, 13])
>>> songs3[1]
142
>>> numpy_ser[1]
142
>>> songs3.mean()
84.5
>>> numpy_ser.mean()
84.5
>>> mask = songs3 > songs3.median()  # boolean array
>>> mask
Paul       True
John       True
George    False
Ringo     False
Name: counts, dtype: bool
>>> songs3[mask]
Paul    145
John    142
Name: counts, dtype: int64
>>> numpy_ser[numpy_ser > np.median(numpy_ser)]
array([145, 142])
>>> # Categorical data saves memory and can also be ordered
>>> s = pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='category')
>>> s
0     m
1     l
2    xs
3     s
4    xl
dtype: category
Categories (5, object): ['l', 'm', 's', 'xl', 'xs']
>>> s.cat.ordered
False
>>> s2 = pd.Series(['m', 'l', 'xs', 's', 'xl'])
>>> size_type = pd.api.types.CategoricalDtype(
...     categories=['s','m','l'], ordered=True)
>>> s3 = s2.astype(size_type)
>>> s3
0      m
1      l
2    NaN
3      s
4    NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']
>>> s3 > 's'
0     True
1     True
2    False
3    False
4    False
dtype: bool
>>> s.cat.reorder_categories(['xs','s','m','l', 'xl'],
...                          ordered=True)
0     m
1     l
2    xs
3     s
4    xl
dtype: category
Categories (5, object): ['xs' < 's' < 'm' < 'l' < 'xl']
>>> s3.str.upper()
0      M
1      L
2    NaN
3      S
4    NaN
dtype: object
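
The session above remarks that categorical data saves memory. A rough way to check this is to compare memory_usage(deep=True) before and after the conversion. This is a sketch with made-up data; the exact byte counts depend on the pandas version and platform, so none are shown.

```python
import pandas as pd

# 30,000 values drawn from only three distinct strings (made-up data)
sizes = pd.Series(["s", "m", "l"] * 10_000)
sizes_cat = sizes.astype("category")

string_bytes = sizes.memory_usage(deep=True)
category_bytes = sizes_cat.memory_usage(deep=True)

print(string_bytes, category_bytes)
```

The saving comes from storing each distinct string once and keeping small integer codes for the individual elements, so it is largest when few distinct values repeat many times.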

Series attributes

index

The index (axis labels).

>>> s = pd.Series([1, 2, 3])
>>> s.index
RangeIndex(start=0, stop=3, step=1)

array

The ExtensionArray that wraps the values of this Series (a PandasArray in the pandas version used here).

>>> pd.Series([1, 2, 3]).array
<PandasArray>
[1, 2, 3]
Length: 3, dtype: int64

values

Returns the Series as an ndarray or ndarray-like object, depending on the dtype. Series.array or Series.to_numpy() is recommended instead.

dtype

Returns the dtype object of the underlying data.

>>> s = pd.Series([1, 2, 3])
>>> s.dtype
dtype('int64')

shape

Returns a tuple of the shape of the underlying data.

>>> s = pd.Series([1, 2, 3])
>>> s.shape
(3,)

nbytes

Returns the number of bytes in the underlying data.

>>> s = pd.Series([1, 2, 3])
>>> s.nbytes
24

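ndim

The number of dimensions of the underlying data, which is 1 by definition.

size

Returns the number of elements in the underlying data.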

>>> s = pd.Series([1, 2, 3])
>>> s.size
3

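T

The transpose, which for a Series is the Series itself, so it is of little practical use.

memory_usage

Memory usage of the Series.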

>>> s = pd.Series(range(3))
>>> s.memory_usage()
152
>>> s.memory_usage(index=False)
24
>>> s = pd.Series(["a", "b"])
>>> s.values
array(['a', 'b'], dtype=object)
>>> s.memory_usage()
144
>>> s.memory_usage(deep=True)

hasnans

Returns True if there are any NaN values.

empty

Whether the Series is empty.

>>> ser_empty = pd.Series({'A' : []})
>>> ser_empty
A    []
dtype: object
>>> ser_empty.empty
False
>>> ser_empty = pd.Series()
<stdin>:1: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
>>> ser_empty.empty
True

dtypes

Returns the dtype object of the underlying data. For a Series this is redundant; use dtype instead.

name

Returns the name of the Series.

>>> s = pd.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64
>>> s.name
>>> s.name = "wechat: pythontesting"
>>> s
0    1
1    2
2    3
Name: wechat: pythontesting, dtype: int64

flags

Gets the properties (flags) associated with this pandas object.

>>> df = pd.DataFrame({"A": [1, 2]})
>>> df.flags
<Flags(allows_duplicate_labels=True)>
>>> df.flags.allows_duplicate_labels
True
>>> df.flags.allows_duplicate_labels = False
>>> df.flags["allows_duplicate_labels"]
False

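at

Accesses a single value for a row/column label pair. For a Series it is rarely needed, because loc can be used instead.

attrs

Dictionary of global attributes of this object. Experimental.

axes

A list containing the index.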

>>> s = pd.Series([1, 2, 3])
>>> s.axes
[RangeIndex(start=0, stop=3, step=1)]

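iat

Accesses a single value for a row/column pair by integer position.

iloc

Integer-location based indexing for selection by position.

is_monotonic_decreasing

Returns a boolean indicating whether the values in the object are monotonically decreasing.

is_monotonic_increasing

Returns a boolean indicating whether the values in the object are monotonically increasing.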
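
A brief illustration of the two monotonicity attributes, using three made-up Series:

```python
import pandas as pd

rising = pd.Series([1, 2, 2, 5])
falling = pd.Series([9, 4, 1])
mixed = pd.Series([1, 3, 2])

# Equal neighbours are allowed: the check is non-strict
print(rising.is_monotonic_increasing)   # True
print(falling.is_monotonic_decreasing)  # True
print(mixed.is_monotonic_increasing)    # False
```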

is_unique

Returns a boolean indicating whether the values in the object are unique.

>>> s = pd.Series([1, 2, 3])
>>> s.is_unique
True

loc

Accesses a group of elements by label(s).

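Series methods

abs()

Returns the absolute value of each element.

add(other[, level, fill_value, axis])

Element-wise addition of the Series and other (binary operator add).

add_prefix(prefix[, axis])

Prefixes the labels with the string prefix.

add_suffix(suffix[, axis])

Suffixes the labels with the string suffix.

agg([func, axis])

Aggregates using one or more operations over the specified axis.

aggregate([func, axis])

Aggregates using one or more operations over the specified axis (agg is an alias of aggregate).

align(other[, join, axis, level, copy, ...])

Aligns two objects on their axes with the specified join method.

all([axis, bool_only, skipna])

Returns whether all elements are True.

any(*[, axis, bool_only, skipna])

Returns whether any element is True.

apply(func[, convert_dtype, args])

Invokes a function on the values of the Series.

argmax([axis, skipna])

Returns the integer position of the largest value in the Series.

argmin([axis, skipna])

Returns the integer position of the smallest value in the Series.

argsort([axis, kind, order])

Returns the integer indices that would sort the Series values.

asfreq(freq[, method, how, normalize, ...])

Converts a time series to the specified frequency.

asof(where[, subset])

Returns the last row(s) without any NaNs before where.

astype(dtype[, copy, errors])

Casts the Series to the specified dtype.

at_time(time[, asof, axis])

Selects values at a particular time of day (e.g., 9:30 AM).

autocorr([lag])

Computes the lag-N autocorrelation.

backfill(*[, axis, inplace, limit, downcast])

(DEPRECATED) Synonym for DataFrame.fillna() with method='bfill'.

between(left, right[, inclusive])

Returns a boolean Series equivalent to left <= series <= right.

between_time(start_time, end_time[, ...])

Selects values between particular times of the day (e.g., 9:00-9:30 AM).

bfill(*[, axis, inplace, limit, downcast])

Synonym for DataFrame.fillna() with method='bfill'.

bool()

Returns the bool of a single-element Series or DataFrame.

cat

Alias of pandas.core.arrays.categorical.CategoricalAccessor.

clip([lower, upper, axis, inplace])

Trims values at the input threshold(s).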
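
As a quick illustration of clip, the sketch below limits a made-up Series to the range 0 to 10; the names readings and clipped are illustrative.

```python
import pandas as pd

readings = pd.Series([-5, 0, 7, 12, 30])

# Values below 0 become 0, values above 10 become 10
clipped = readings.clip(lower=0, upper=10)
print(clipped.tolist())  # [0, 0, 7, 10, 10]
```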

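combine(other, func[, fill_value])

Combines the Series with another Series or a scalar according to func.

combine_first(other)

Updates null elements with the value at the same location in other.

compare(other[, align_axis, keep_shape, ...])

Compares to another Series and shows the differences.

convert_dtypes([infer_objects, ...])

Converts columns to the best possible dtypes using dtypes that support pd.NA.

copy([deep])

Makes a copy of this object's indices and data.

corr(other[, method, min_periods])

Computes the correlation with another Series, excluding missing values.

count()

Returns the number of non-NA/null observations in the Series.

cov(other[, min_periods, ddof])

Computes the covariance with another Series, excluding missing values.

cummax([axis, skipna])

Returns the cumulative maximum over a DataFrame or Series axis.

cummin([axis, skipna])

Returns the cumulative minimum over a DataFrame or Series axis.

cumprod([axis, skipna])

Returns the cumulative product over a DataFrame or Series axis.

cumsum([axis, skipna])

Returns the cumulative sum over a DataFrame or Series axis.

describe([percentiles, include, exclude])

Generates descriptive statistics.

diff([periods])

First discrete difference of the elements.

div(other[, level, fill_value, axis])

Returns the floating-point division of the Series and other, element-wise (binary operator truediv).

divide(other[, level, fill_value, axis])

Returns the floating-point division of the Series and other, element-wise (binary operator truediv).

divmod(other[, level, fill_value, axis])

Returns the integer division and modulo of the Series and other, element-wise (binary operator divmod).

dot(other)

Computes the dot product between the Series and the columns of other.

drop([labels, axis, index, columns, level, ...])

Returns the Series with the specified index labels removed.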

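drop_duplicates(*[, keep, inplace, ignore_index])

Returns the Series with duplicate values removed.

droplevel(level[, axis])

Returns the Series/DataFrame with the requested index/column level(s) removed.

dropna(*[, axis, inplace, how, ignore_index])

Returns a new Series with missing values removed.

dt

Alias of pandas.core.indexes.accessors.CombinedDatetimelikeProperties.

duplicated([keep])

Indicates duplicate Series values.

eq(other[, level, fill_value, axis])

Returns equal to (==) of the Series and other, element-wise (binary operator eq).

equals(other)

Tests whether two objects contain the same elements.

ewm([com, span, halflife, alpha, ...])

Provides exponentially weighted (EW) calculations.

expanding([min_periods, axis, method])

Provides expanding window calculations.

explode([ignore_index])

Transforms each element of a list-like to a row.

factorize([sort, use_na_sentinel])

Encodes the object as an enumerated type or categorical variable.

ffill(*[, axis, inplace, limit, downcast])

Synonym for DataFrame.fillna() with method='ffill'.

fillna([value, method, axis, inplace, ...])

Fills NA/NaN values using the specified method.

filter([items, like, regex, axis])

Subsets the rows or columns according to the specified index labels.

first(offset)

Selects the initial periods of time series data based on a date offset.

first_valid_index()

Returns the index of the first non-NA value, or None if no non-NA value is found.

floordiv(other[, level, fill_value, axis])

Returns the integer division of the Series and other, element-wise (binary operator floordiv).

ge(other[, level, fill_value, axis])

Returns greater than or equal to (>=) of the Series and other, element-wise (binary operator ge).

get(key[, default])

Gets the item from the object for the given key (e.g., a DataFrame column).

groupby([by, axis, level, as_index, sort, ...])

Groups the Series using a mapper or by a Series of columns.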
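
To make the split-apply-combine idea behind groupby concrete, here is a minimal sketch that groups a Series by its own index labels. The scores and degree labels are sample values chosen for illustration.

```python
import pandas as pd

scores = pd.Series([90, 40, 80, 98],
                   index=["MBA", "BCA", "M.Tech", "MBA"])

# Split by index label, apply mean to each group, combine into a new Series
mean_by_degree = scores.groupby(level=0).mean()
print(mean_by_degree)
```

The two "MBA" entries end up in one group, so their mean is (90 + 98) / 2 = 94.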

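gt(other[, level, fill_value, axis])

Returns greater than (>) of the Series and other, element-wise (binary operator gt).

head([n])

Returns the first n rows.

hist([by, ax, grid, xlabelsize, xrot, ...])

Draws a histogram of the input Series using matplotlib.

idxmax([axis, skipna])

Returns the row label of the maximum value.

idxmin([axis, skipna])

Returns the row label of the minimum value.

infer_objects([copy])

Attempts to infer better dtypes for object columns.

info([verbose, buf, max_cols, memory_usage, ...])

Prints a concise summary of the Series.

interpolate([method, axis, limit, inplace, ...])

Fills NaN values using an interpolation method.

isin(values)

Returns whether the elements of the Series are contained in values.

isna()

Detects missing values.

isnull()

Series.isnull is an alias of Series.isna.

item()

Returns the first element of the underlying data as a Python scalar.

items()

Iterates over (index, value) tuples.

keys()

Returns an alias for the index.

kurt([axis, skipna, numeric_only])

Returns the unbiased kurtosis over the requested axis.

kurtosis([axis, skipna, numeric_only])

Returns the unbiased kurtosis over the requested axis.

last(offset)

Selects the final periods of time series data based on a date offset.

last_valid_index()

Returns the index of the last non-NA value, or None if no non-NA value is found.

le(other[, level, fill_value, axis])

Returns less than or equal to (<=) of the Series and other, element-wise (binary operator le).

lt(other[, level, fill_value, axis])

Returns less than (<) of the Series and other, element-wise (binary operator lt).

map(arg[, na_action])

Maps the values of the Series according to an input mapping or function.

mask(cond[, other, inplace, axis, level])

Replaces values where the condition is True.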
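
mask is easiest to understand next to its counterpart where, which replaces values where the condition is False. The sketch below applies both to the same made-up Series with the same condition.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
is_even = s % 2 == 0

masked = s.mask(is_even, 0)  # even values are replaced by 0
kept = s.where(is_even, 0)   # everything except even values is replaced by 0

print(masked.tolist())  # [1, 0, 3, 0]
print(kept.tolist())    # [0, 2, 0, 4]
```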

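max([axis, skipna, numeric_only])

Returns the maximum of the values over the requested axis.

mean([axis, skipna, numeric_only])

Returns the mean of the values over the requested axis.

median([axis, skipna, numeric_only])

Returns the median of the values over the requested axis.

memory_usage([index, deep])

Returns the memory usage of the Series.

min([axis, skipna, numeric_only])

Returns the minimum of the values over the requested axis.

mod(other[, level, fill_value, axis])

Returns the modulo of the Series and other, element-wise (binary operator mod).

mode([dropna])

Returns the mode(s) of the Series.

mul(other[, level, fill_value, axis])

Element-wise multiplication of the Series and other (binary operator mul).

multiply(other[, level, fill_value, axis])

Element-wise multiplication of the Series and other (binary operator mul).

ne(other[, level, fill_value, axis])

Returns not equal to (!=) of the Series and other, element-wise (binary operator ne).

nlargest([n, keep])

Returns the largest n elements.

notna()

Detects existing (non-missing) values.

notnull()

Series.notnull is an alias of Series.notna.

nsmallest([n, keep])

Returns the smallest n elements.

nunique([dropna])

Returns the number of unique elements in the object.

pad(*[, axis, inplace, limit, downcast])

(DEPRECATED) Synonym for DataFrame.fillna() with method='ffill'.

pct_change([periods, fill_method, limit, freq])

Percentage change between the current and a prior element.

pipe(func, *args, **kwargs)

Applies chainable functions that expect Series or DataFrames.

plot

Alias of pandas.plotting._core.PlotAccessor.

pop(item)

Returns the item and removes it from the Series.

pow(other[, level, fill_value, axis])

Exponential power of the Series and other, element-wise (binary operator pow).

prod([axis, skipna, numeric_only, min_count])

Returns the product of the values over the requested axis.

product([axis, skipna, numeric_only, min_count])

Returns the product of the values over the requested axis.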

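quantile([q, interpolation])

Returns the value at the given quantile.

radd(other[, level, fill_value, axis])

Returns the addition of the Series and other, element-wise (binary operator radd).

rank([axis, method, numeric_only, ...])

Computes numerical data ranks (1 through n) along the axis.

ravel([order])

Returns the flattened underlying data as an ndarray or ExtensionArray.

rdiv(other[, level, fill_value, axis])

Returns the floating-point division of the Series and other, element-wise (binary operator rtruediv).

rdivmod(other[, level, fill_value, axis])

Returns the integer division and modulo of the Series and other, element-wise (binary operator rdivmod).

reindex([index, axis, method, copy, level, ...])

Conforms the Series to a new index, with optional filling logic.

reindex_like(other[, method, copy, limit, ...])

Returns an object with indices matching those of other.

rename([index, axis, copy, inplace, level, ...])

Alters the index labels or the name of the Series.

rename_axis([mapper, index, axis, copy, inplace])

Sets the name of the axis for the index or columns.

reorder_levels(order)

Rearranges the index levels using the input order.

repeat(repeats[, axis])

Repeats the elements of the Series.

replace([to_replace, value, inplace, limit, ...])

Replaces the values given in to_replace with value.

resample(rule[, axis, closed, label, ...])

Resamples time-series data.

reset_index([level, drop, name, inplace, ...])

Generates a new DataFrame or Series with the index reset.

rfloordiv(other[, level, fill_value, axis])

Returns the integer division of the Series and other, element-wise (binary operator rfloordiv).

rmod(other[, level, fill_value, axis])

Returns the modulo of the Series and other, element-wise (binary operator rmod).

rmul(other[, level, fill_value, axis])

Returns the multiplication of the Series and other, element-wise (binary operator rmul).

rolling(window[, min_periods, center, ...])

Provides rolling window calculations.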
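
A small sketch of a rolling window calculation: a three-element moving average over invented prices. The first two results are NaN because a full window is not yet available at those positions.

```python
import pandas as pd

prices = pd.Series([10, 20, 30, 40, 50])

# Mean of each element and the two elements before it
rolling_mean = prices.rolling(window=3).mean()
print(rolling_mean)
```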

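round([decimals])

Rounds each value in the Series to the given number of decimals.

rpow(other[, level, fill_value, axis])

Returns the exponential power of the Series and other, element-wise (binary operator rpow).

rsub(other[, level, fill_value, axis])

Returns the subtraction of the Series and other, element-wise (binary operator rsub).

rtruediv(other[, level, fill_value, axis])

Returns the floating-point division of the Series and other, element-wise (binary operator rtruediv).

sample([n, frac, replace, weights, ...])

Returns a random sample of items from an axis of the object.

searchsorted(value[, side, sorter])

Finds the indices where elements should be inserted to maintain order.

sem([axis, skipna, ddof, numeric_only])

Returns the unbiased standard error of the mean over the requested axis.

set_axis(labels, *[, axis, copy])

Assigns the desired index to the given axis.

set_flags(*[, copy, allows_duplicate_labels])

Returns a new object with updated flags.

shift([periods, freq, axis, fill_value])

Shifts the index by the desired number of periods, with an optional time frequency.

skew([axis, skipna, numeric_only])

Returns the unbiased skew over the requested axis.

sort_index(*[, axis, level, ascending, ...])

Sorts the Series by its index labels.

sort_values(*[, axis, ascending, inplace, ...])

Sorts by the values.

sparse

Alias of pandas.core.arrays.sparse.accessor.SparseAccessor.

squeeze([axis])

Squeezes one-dimensional axis objects into scalars.

std([axis, skipna, ddof, numeric_only])

Returns the sample standard deviation over the requested axis.

str

Alias of pandas.core.strings.accessor.StringMethods.

sub(other[, level, fill_value, axis])

Element-wise subtraction of other from the Series (binary operator sub).

subtract(other[, level, fill_value, axis])

Element-wise subtraction of other from the Series (binary operator sub).

sum([axis, skipna, numeric_only, min_count])

Returns the sum of the values over the requested axis.

swapaxes(axis1, axis2[, copy])

Interchanges the axes and swaps the values of the axes appropriately.

swaplevel([i, j, copy])

Swaps levels i and j in a MultiIndex.

tail([n])

Returns the last n rows.

take(indices[, axis])

Returns the elements at the given positional indices along an axis.

to_clipboard([excel, sep])

Copies the object to the system clipboard.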

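to_csv([path_or_buf, sep, na_rep, ...])

Writes the object to a comma-separated values (csv) file.

to_dict([into])

Converts the Series to a {label -> value} dict or dict-like object.

to_excel(excel_writer[, sheet_name, na_rep, ...])

Writes the object to an Excel sheet.

to_frame([name])

Converts the Series to a DataFrame.

to_hdf(path_or_buf, key[, mode, complevel, ...])

Writes the contained data to an HDF5 file using HDFStore.

to_json([path_or_buf, orient, date_format, ...])

Converts the object to a JSON string.

to_latex([buf, columns, header, index, ...])

Renders the object to a LaTeX tabular, longtable, or nested table.

to_list()

Returns a list of the values.

to_markdown([buf, mode, index, storage_options])

Prints the Series in Markdown-friendly format.

to_numpy([dtype, copy, na_value])

Returns a NumPy ndarray representing the values in the Series.

to_period([freq, copy])

Converts the Series from a DatetimeIndex to a PeriodIndex.

to_pickle(path[, compression, protocol, ...])

Pickles (serializes) the object to a file.

to_sql(name, con[, schema, if_exists, ...])

Writes records stored in a DataFrame to a SQL database.

to_string([buf, na_rep, float_format, ...])

Renders a string representation of the Series.

to_timestamp([freq, how, copy])

Casts to a DatetimeIndex of Timestamps, at the beginning of the period.

to_xarray()

Returns an xarray object from the pandas object.

tolist()

Returns a list of the values.

transform(func[, axis])

Calls func on self, producing a Series with the same axis shape as self.

transpose(*args, **kwargs)

Returns the transpose, which is by definition self.

truediv(other[, level, fill_value, axis])

Returns the floating-point division of the Series and other, element-wise (binary operator truediv).

truncate([before, after, axis, copy])

Truncates a Series or DataFrame before and after some index value.

tz_convert(tz[, axis, level, copy])

Converts a tz-aware axis to the target time zone.

tz_localize(tz[, axis, level, copy, ...])

Localizes the tz-naive index of a Series or DataFrame to the target time zone.

unique()

Returns the unique values of the Series object.

unstack([level, fill_value])

Unstacks (also known as pivoting) a Series with a MultiIndex to produce a DataFrame.

update(other)

Modifies the Series in place using the values from the passed Series.

value_counts([normalize, sort, ascending, ...])

Returns a Series containing counts of unique values.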
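
For example, value_counts can report either absolute counts or, with normalize=True, relative frequencies. The degree labels below are sample values chosen for illustration.

```python
import pandas as pd

degrees = pd.Series(["MBA", "BCA", "M.Tech", "MBA"])

counts = degrees.value_counts()                # absolute counts
shares = degrees.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```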

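var([axis, skipna, ddof, numeric_only])

Returns the unbiased variance over the requested axis.

view([dtype])

Creates a new view of the Series.

where(cond[, other, inplace, axis, level])

Replaces values where the condition is False.

xs(key[, axis, level, drop_level])

Returns a cross-section from the Series/DataFrame.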

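DataFrame

A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns); that is, the data is arranged in a table of rows and columns. A Pandas DataFrame consists of three main components: the data, the rows, and the columns.

Creating a DataFrame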

import pandas as pd
 
# initialise the data as a dictionary of lists
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
        'Age':[20, 21, 19, 18]}
 
# Create DataFrame
df = pd.DataFrame(data)
 
# Print the output.
print(df)

Output:

    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18

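Working with rows and columns

Because a DataFrame arranges data in rows and columns, we can perform basic operations on them such as selecting, deleting, adding, and renaming.

Columns are accessed by their names.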

# -*- coding: utf-8 -*-
import pandas as pd

# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']}

# Convert the dictionary into DataFrame 
df = pd.DataFrame(data)
print("DataFrame:")
print(df)

# Select a single column
print("\nSingle column selection:")
print(df['Name'])
print(type(df['Name'])) # returns a Series
print(df.Name)

# Select two columns
print("\nMultiple column selection:")
print(df[['Name', 'Qualification']])
print(type(df[['Name', 'Qualification']])) # returns a DataFrame

subset_loc = df.loc[0]
subset_head = df.head(n=1)
print("\n\nThe result of loc is a Series")
print(type(subset_loc))

print("\n\nThe result of head is a DataFrame")
print(type(subset_head))

print("\n\nRow selection")
print(df.loc[0]) # the row whose index label is 0
print(df.iloc[1]) # the 2nd row

print("\n\nSub-DataFrame selection")
print(df.loc[[0,1],['Name', 'Qualification']])
print(df.iloc[[0,1],[0,1]])
print(df.loc[[0,2]]) # select the 1st and 3rd rows

print("\n\nElement selection")
print(df.loc[1,'Address'])
print(df.iloc[1,2])
print(df.at[1,'Address'])
print(df.iat[1,2])

print("\n\nGetting the column names")
print(df.columns)
print(df.keys())
print(sorted(df))

# Rename a column
print("\nRenaming a column")
print(df.rename(columns={'Name': 'TFR'}))

# Slicing
print("\nSlicing: loc includes the end label, iloc excludes the end position")
print(df.loc[0:2])
print(df.iloc[0:2])

Output:

DataFrame:
     Name  Age    Address Qualification
0     Jai   27      Delhi           Msc
1  Princi   24     Kanpur            MA
2  Gaurav   22  Allahabad           MCA
3    Anuj   32    Kannauj           Phd

Single column selection:
0       Jai
1    Princi
2    Gaurav
3      Anuj
Name: Name, dtype: object
<class 'pandas.core.series.Series'>
0       Jai
1    Princi
2    Gaurav
3      Anuj
Name: Name, dtype: object

Multiple column selection:
     Name Qualification
0     Jai           Msc
1  Princi            MA
2  Gaurav           MCA
3    Anuj           Phd
<class 'pandas.core.frame.DataFrame'>


The result of loc is a Series
<class 'pandas.core.series.Series'>


The result of head is a DataFrame
<class 'pandas.core.frame.DataFrame'>


Row selection
Name               Jai
Age                 27
Address          Delhi
Qualification      Msc
Name: 0, dtype: object
Name             Princi
Age                  24
Address          Kanpur
Qualification        MA
Name: 1, dtype: object


Sub-DataFrame selection
     Name Qualification
0     Jai           Msc
1  Princi            MA
     Name  Age
0     Jai   27
1  Princi   24
     Name  Age    Address Qualification
0     Jai   27      Delhi           Msc
2  Gaurav   22  Allahabad           MCA


Element selection
Kanpur
Kanpur
Kanpur
Kanpur


Getting the column names
Index(['Name', 'Age', 'Address', 'Qualification'], dtype='object')
Index(['Name', 'Age', 'Address', 'Qualification'], dtype='object')
['Address', 'Age', 'Name', 'Qualification']

Renaming a column
      TFR  Age    Address Qualification
0     Jai   27      Delhi           Msc
1  Princi   24     Kanpur            MA
2  Gaurav   22  Allahabad           MCA
3    Anuj   32    Kannauj           Phd

Slicing: loc includes the end label, iloc excludes the end position
     Name  Age    Address Qualification
0     Jai   27      Delhi           Msc
1  Princi   24     Kanpur            MA
2  Gaurav   22  Allahabad           MCA
     Name  Age Address Qualification
0     Jai   27   Delhi           Msc
1  Princi   24  Kanpur            MA

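Row selection: DataFrame.loc[] retrieves rows from a Pandas DataFrame by label. Rows can also be selected by passing integer positions to iloc[].

Note: square brackets after a Series name select by index label by default, whereas square brackets after a DataFrame name select by column name by default.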
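
This section mentions adding and deleting columns, which the example above does not cover. The sketch below shows one common way to do each; the City column and its values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Jai", "Princi"], "Age": [27, 24]})

# Add a column by assignment (this modifies df in place)
df["City"] = ["Delhi", "Kanpur"]

# drop() returns a new DataFrame and leaves df unchanged
without_age = df.drop(columns=["Age"])

print(df.columns.tolist())           # ['Name', 'Age', 'City']
print(without_age.columns.tolist())  # ['Name', 'City']
```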

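Handling missing data

Missing data is also referred to as NA (Not Available) values in pandas.

Checking for missing values with isnull() and notnull():
To check for missing values in a Pandas DataFrame, we use the isnull() and notnull() functions. Both report, element by element, whether a value is NaN. They can also be used on a Pandas Series to find the null values in it.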

# -*- coding: utf-8 -*-
import pandas as pd
 
# importing numpy as np
import numpy as np
 
# dictionary of lists (named data to avoid shadowing the built-in dict)
data = {'First Score':[100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary
df = pd.DataFrame(data)
print(df.isnull())

# filling missing values with 0 using fillna()
print(df.fillna(0))
# dropping every row that contains at least one missing value
print(df.dropna())

Output:

   First Score  Second Score  Third Score
0        False         False         True
1        False         False        False
2         True         False        False
3        False          True        False
   First Score  Second Score  Third Score
0        100.0          30.0          0.0
1         90.0          45.0         40.0
2          0.0          56.0         80.0
3         95.0           0.0         98.0
   First Score  Second Score  Third Score
1         90.0          45.0         40.0

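Filling missing values with fillna(), replace(), and interpolate():
To fill null values in a data set, we use the fillna(), replace(), and interpolate() functions, which replace NaN values with other values. interpolate() fills missing values using an interpolation technique rather than a hard-coded value.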
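
The three approaches can be compared side by side on a small Series. This is a minimal sketch with invented scores; interpolate() is called with its default linear method.

```python
import numpy as np
import pandas as pd

scores = pd.Series([100.0, np.nan, 80.0, np.nan, 60.0])

filled_const = scores.fillna(0)        # replace NaN with a constant
replaced = scores.replace(np.nan, -1)  # replace() can also target NaN
interpolated = scores.interpolate()    # linear interpolation between neighbours

print(filled_const.tolist())   # [100.0, 0.0, 80.0, 0.0, 60.0]
print(replaced.tolist())       # [100.0, -1.0, 80.0, -1.0, 60.0]
print(interpolated.tolist())   # [100.0, 90.0, 80.0, 70.0, 60.0]
```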

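Iterating over rows and columns

Iterating over rows:
To iterate over rows we can use iterrows() or itertuples(). To iterate over columns as (column name, Series) pairs, use items(); the older iteritems() is deprecated as of pandas 1.5.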

# -*- coding: utf-8 -*-
import pandas as pd

# dictionary of lists (named data to avoid shadowing the built-in dict)
data = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score':[90, 40, 80, 98]}

# creating a dataframe from a dictionary 
df = pd.DataFrame(data)

print(df)

# iterating over rows: i is the index label, j is the row as a Series
for i, j in df.iterrows():
    print(i, j)
    print()

# creating a list of dataframe columns
print("Column iteration")
columns = list(df)
print(columns)

for i in columns: # iterating over df.columns works the same way
    # printing the third element of the column
    print(df[i].iloc[2])

Output:

     name  degree  score
0  aparna     MBA     90
1  pankaj     BCA     40
2  sudhir  M.Tech     80
3   Geeku     MBA     98
0 name      aparna
degree       MBA
score         90
Name: 0, dtype: object

1 name      pankaj
degree       BCA
score         40
Name: 1, dtype: object

2 name      sudhir
degree    M.Tech
score         80
Name: 2, dtype: object

3 name      Geeku
degree      MBA
score        98
Name: 3, dtype: object

Column iteration
['name', 'degree', 'score']
sudhir
M.Tech
80
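
itertuples() is mentioned in this section but not demonstrated. The following sketch shows its basic use on a shortened version of the same data; each row is returned as a namedtuple whose fields are the column names.

```python
import pandas as pd

df = pd.DataFrame({"name": ["aparna", "pankaj"],
                   "degree": ["MBA", "BCA"],
                   "score": [90, 40]})

# By default the first field, Index, holds the row label
for row in df.itertuples():
    print(row.Index, row.name, row.score)

# index=False leaves the row label out of the tuple
first = next(df.itertuples(index=False))
print(first)
```

itertuples() is generally faster than iterrows() because it does not build a Series for every row.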

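Tips

Showing all columns

pd.set_option("display.max_columns", None) # show all columns
pd.set_option("display.max_columns", 8) # show at most 8 columns

Floating-point display precision

pd.set_option("display.precision", 2) # number of decimal places shown for floats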
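
set_option changes a setting for the rest of the session. When a setting is needed only temporarily, pd.option_context can be used as a context manager instead. This is a sketch that assumes display.precision is still at its default value of 6 when the code runs.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.23456, 2.34567]})

# The option applies only inside the with block
with pd.option_context("display.precision", 2):
    inside = pd.get_option("display.precision")
    print(df)

# Outside the block the previous value is restored
after = pd.get_option("display.precision")
print(inside, after)
```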

Last edited: 2023.04.04 22:49:18
