How to use pandas for data testing

pandas study notes

Environment setup

1. Python 3.6.5, pandas 0.19.2

pip install pandas

1. Reading a CSV file with pandas, and common CSV methods

import pandas as pd

csv_path = './test.csv'

file = pd.read_csv(csv_path, skiprows=1, na_values="missing")

print(file)

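# Import pandas, store the CSV file's path in a variable, and read the file with read_csv

# skiprows sets how many lines at the top of the CSV file to skip; na_values names extra strings to treat as missing values (NaN)

print(file.head(5))

# Show the first five rows of the file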
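
To see what skiprows and na_values do without needing test.csv on disk, here is a minimal sketch that reads from an in-memory buffer. The file contents, the column names other than org_id, and the variable names raw and sample are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical file contents: one junk line, then the header, then two data rows.
raw = io.StringIO(
    "exported by some tool\n"
    "org_id,name,score\n"
    "13486,alpha,missing\n"
    "20001,beta,7.5\n"
)

# skiprows=1 drops the junk line, so the second line becomes the header.
# na_values="missing" turns the literal string "missing" into NaN.
sample = pd.read_csv(raw, skiprows=1, na_values="missing")

print(sample.head(5))
print(sample.isnull().sum())  # number of missing values per column
```

Because "missing" is parsed as NaN, the score column comes back as a float column rather than a column of strings.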

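# Show all columns

pd.set_option('display.max_columns', None)

# Show all rows

pd.set_option('display.max_rows', None)

# Set the maximum displayed width of a value to 100 characters (the default is 50)

pd.set_option('display.max_colwidth', 100)  # full option name; partial names may break in future versions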
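
set_option changes a setting for the rest of the session. If a setting is only needed for one print, pd.option_context can apply it temporarily, and pd.reset_option restores the default. The sketch below only illustrates these calls; the variable names are arbitrary.

```python
import pandas as pd

original = pd.get_option('display.max_rows')

# Options set here apply only inside the with block.
with pd.option_context('display.max_rows', 5, 'display.max_columns', None):
    inside = pd.get_option('display.max_rows')

# After the block, the previous value is back.
after = pd.get_option('display.max_rows')

# A global change, followed by a reset to the default.
pd.set_option('display.max_colwidth', 100)
width = pd.get_option('display.max_colwidth')
pd.reset_option('display.max_colwidth')

print(original, inside, after, width)
```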

test = []  # start with an empty list

for index, row in file.iterrows():  # iterrows yields an (index, row) pair for each row

    if row["org_id"] == 13486:  # check whether this row's org_id equals 13486

        test.append(row)  # if it does, append the whole row to test

test = pd.DataFrame(test)  # after the loop, convert the list to a DataFrame

print(test)

test.to_csv("aaa.csv")  # write test to a CSV file
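
The loop above works, but the same filter can usually be written as a boolean mask, which avoids iterating row by row. This is a sketch on made-up data: the records frame and its name column are assumptions, and only org_id comes from the notes above.

```python
import pandas as pd

# Hypothetical data standing in for the contents of test.csv.
records = pd.DataFrame({
    "org_id": [13486, 20001, 13486],
    "name": ["alpha", "beta", "gamma"],
})

# records["org_id"] == 13486 is a boolean Series; indexing with it keeps matching rows.
matched = records[records["org_id"] == 13486]

# With no path, to_csv returns the CSV text; index=False leaves out the row index column.
csv_text = matched.to_csv(index=False)
print(csv_text)
```

Passing a file name such as "aaa.csv" as the first argument writes the same text to disk instead of returning it.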

left = pd.DataFrame({'id': [1, 1], 'key': ['foo', 'foo'], 'lval': [1, 2]})

right = pd.DataFrame({'id': [1, 1], 'key': ['foo', 'foo'], 'rval': [4, 5]})

# print(pd.merge(left, right, on='id'))

# merge joins two DataFrame objects; on= names the column (or columns) to join on
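
Since both frames also share the key column, joining on id alone keeps both copies with suffixes, while joining on both columns keeps a single key column. The sketch below reuses the left and right frames from above. The names by_id and by_both are just for illustration.

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 1], 'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'id': [1, 1], 'key': ['foo', 'foo'], 'rval': [4, 5]})

# Join on id only: the shared, non-join column key appears as key_x and key_y.
by_id = pd.merge(left, right, on='id')

# Join on both shared columns: a single key column remains.
by_both = pd.merge(left, right, on=['id', 'key'])

# Every left row matches every right row here, so both results have 2 x 2 = 4 rows.
print(by_id)
print(by_both)
```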

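import numpy as np  # needed for np.random.randn below

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',

                        'foo', 'bar', 'foo', 'foo'],

                  'B': ['one', 'one', 'two', 'three',

                        'two', 'two', 'one', 'three'],

                  'C': np.random.randn(8),  # 8 random values from a standard normal distribution

                  'D': np.random.randn(8)})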

# print(df)

# print(df.groupby(['A', 'B']).sum())

# Group by columns A and B, then sum the remaining columns within each group
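
Because the frame above is filled with random numbers, its sums change on every run. A sketch with fixed values makes the grouping easier to check by hand. The scores frame and its numbers are made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "one", "two"],
    "C": [1, 2, 3, 4],
    "D": [10, 20, 30, 40],
})

# One output row per distinct (A, B) pair; C and D are summed within each pair.
totals = scores.groupby(["A", "B"]).sum()
print(totals)

# The group keys become a MultiIndex; reset_index turns them back into columns.
flat = totals.reset_index()
print(flat)
```

The two rows with A == "foo" and B == "one" collapse into one row with C = 1 + 3 and D = 10 + 30.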

All options available to set_option():

Available options:

- display.[chop_threshold, colheader_justify, column_space, date_dayfirst,

  date_yearfirst, encoding, expand_frame_repr, float_format, height, large_repr]

- display.latex.[escape, longtable, repr]

- display.[line_width, max_categories, max_columns, max_colwidth,

  max_info_columns, max_info_rows, max_rows, max_seq_items, memory_usage,

  mpl_style, multi_sparse, notebook_repr_html, pprint_nest_depth, precision,

  show_dimensions]

- display.unicode.[ambiguous_as_wide, east_asian_width]

- display.[width]

- io.excel.xls.[writer]

- io.excel.xlsm.[writer]

- io.excel.xlsx.[writer]

- io.hdf.[default_format, dropna_table]

- mode.[chained_assignment, sim_interactive, use_inf_as_null]

Parameters

----------

pat : str

    Regexp which should match a single option.

    Note: partial matches are supported for convenience, but unless you use the

    full option name (e.g. x.y.z.option_name), your code may break in future

    versions if new options with similar names are introduced.

value :

    new value of option.

Returns

-------

None

Raises

------

OptionError if no such option exists

Notes

-----

The available options with its descriptions:

display.chop_threshold : float or None

    if set to a float value, all float values smaller than the given threshold

    will be displayed as exactly 0 by repr and friends.

    [default: None] [currently: None]

display.colheader_justify : 'left'/'right'

    Controls the justification of column headers. used by DataFrameFormatter.

    [default: right] [currently: right]

display.column_space No description available.

    [default: 12] [currently: 12]

display.date_dayfirst : boolean

    When True, prints and parses dates with the day first, eg 20/01/2005

    [default: False] [currently: False]

display.date_yearfirst : boolean

    When True, prints and parses dates with the year first, eg 2005/01/20

    [default: False] [currently: False]

display.encoding : str/unicode

    Defaults to the detected encoding of the console.

    Specifies the encoding to be used for strings returned by to_string,

    these are generally strings meant to be displayed on the console.

    [default: UTF-8] [currently: UTF-8]

display.expand_frame_repr : boolean

    Whether to print out the full DataFrame repr for wide DataFrames across

    multiple lines, `max_columns` is still respected, but the output will

    wrap-around across multiple "pages" if its width exceeds `display.width`.

    [default: True] [currently: True]

display.float_format : callable

    The callable should accept a floating point number and return

    a string with the desired format of the number. This is used

    in some places like SeriesFormatter.

    See formats.format.EngFormatter for an example.

    [default: None] [currently: None]

display.height : int

    Deprecated.

    [default: 60] [currently: 60]

    (Deprecated, use `display.max_rows` instead.)

display.large_repr : 'truncate'/'info'

    For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can

    show a truncated table (the default from 0.13), or switch to the view from

    df.info() (the behaviour in earlier versions of pandas).

    [default: truncate] [currently: truncate]

display.latex.escape : bool

    This specifies if the to_latex method of a Dataframe uses escapes special

    characters.

    method. Valid values: False,True

    [default: True] [currently: True]

display.latex.longtable :bool

    This specifies if the to_latex method of a Dataframe uses the longtable

    format.

    method. Valid values: False,True

    [default: False] [currently: False]

display.latex.repr : boolean

    Whether to produce a latex DataFrame representation for jupyter

    environments that support it.

    (default: False)

    [default: False] [currently: False]

display.line_width : int

    Deprecated.

    [default: 80] [currently: 80]

    (Deprecated, use `display.width` instead.)

display.max_categories : int

    This sets the maximum number of categories pandas should output when

    printing out a `Categorical` or a Series of dtype "category".

    [default: 8] [currently: 8]

display.max_columns : int

    If max_cols is exceeded, switch to truncate view. Depending on

    `large_repr`, objects are either centrally truncated or printed as

    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`

    equals 'truncate' this can be set to 0 and pandas will auto-detect

    the width of the terminal and print a truncated object which fits

    the screen width. The IPython notebook, IPython qtconsole, or IDLE

    do not run in a terminal and hence it is not possible to do

    correct auto-detection.

    [default: 20] [currently: 20]

display.max_colwidth : int

    The maximum width in characters of a column in the repr of

    a pandas data structure. When the column overflows, a "..."

    placeholder is embedded in the output.

    [default: 50] [currently: 200]

display.max_info_columns : int

    max_info_columns is used in DataFrame.info method to decide if

    per column information will be printed.

    [default: 100] [currently: 100]

display.max_info_rows : int or None

    df.info() will usually show null-counts for each column.

    For large frames this can be quite slow. max_info_rows and max_info_cols

    limit this null check only to frames with smaller dimensions than

    specified.

    [default: 1690785] [currently: 1690785]

display.max_rows : int

    If max_rows is exceeded, switch to truncate view. Depending on

    `large_repr`, objects are either centrally truncated or printed as

    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`

    equals 'truncate' this can be set to 0 and pandas will auto-detect

    the height of the terminal and print a truncated object which fits

    the screen height. The IPython notebook, IPython qtconsole, or

    IDLE do not run in a terminal and hence it is not possible to do

    correct auto-detection.

    [default: 60] [currently: 60]

display.max_seq_items : int or None

    when pretty-printing a long sequence, no more than `max_seq_items`

    will be printed. If items are omitted, they will be denoted by the

    addition of "..." to the resulting string.

    If set to None, the number of items to be printed is unlimited.

    [default: 100] [currently: 100]

display.memory_usage : bool, string or None

    This specifies if the memory usage of a DataFrame should be displayed when

    df.info() is called. Valid values True,False,'deep'

    [default: True] [currently: True]

display.mpl_style : bool

    Setting this to 'default' will modify the rcParams used by matplotlib

    to give plots a more pleasing visual style by default.

    Setting this to None/False restores the values to their initial value.

    [default: None] [currently: None]

display.multi_sparse : boolean

    "sparsify" MultiIndex display (don't display repeated

    elements in outer levels within groups)

    [default: True] [currently: True]

display.notebook_repr_html : boolean

    When True, IPython notebook will use html representation for

    pandas objects (if it is available).

    [default: True] [currently: True]

display.pprint_nest_depth : int

    Controls the number of nested levels to process when pretty-printing

    [default: 3] [currently: 3]

display.precision : int

    Floating point output precision (number of significant digits). This is

    only a suggestion

    [default: 6] [currently: 6]

display.show_dimensions : boolean or 'truncate'

    Whether to print out dimensions at the end of DataFrame repr.

    If 'truncate' is specified, only print out the dimensions if the

    frame is truncated (e.g. not display all rows and/or columns)

    [default: truncate] [currently: truncate]

display.unicode.ambiguous_as_wide : boolean

    Whether to use the Unicode East Asian Width to calculate the display text

    width.

    Enabling this may affect to the performance (default: False)

    [default: False] [currently: False]

display.unicode.east_asian_width : boolean

    Whether to use the Unicode East Asian Width to calculate the display text

    width.

    Enabling this may affect to the performance (default: False)

    [default: False] [currently: False]

display.width : int

    Width of the display in characters. In case python/IPython is running in

    a terminal this can be set to None and pandas will correctly auto-detect

    the width.

    Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a

    terminal and hence it is not possible to correctly detect the width.

    [default: 80] [currently: 80]

io.excel.xls.writer : string

    The default Excel writer engine for 'xls' files. Available options:

    'xlwt' (the default).

    [default: xlwt] [currently: xlwt]

io.excel.xlsm.writer : string

    The default Excel writer engine for 'xlsm' files. Available options:

    'openpyxl' (the default).

    [default: openpyxl] [currently: openpyxl]

io.excel.xlsx.writer : string

    The default Excel writer engine for 'xlsx' files. Available options:

    'xlsxwriter' (the default), 'openpyxl'.

    [default: xlsxwriter] [currently: xlsxwriter]

io.hdf.default_format : format

    default format writing format, if None, then

    put will default to 'fixed' and append will default to 'table'

    [default: None] [currently: None]

io.hdf.dropna_table : boolean

    drop ALL nan rows when appending to a table

    [default: False] [currently: False]

mode.chained_assignment : string

    Raise an exception, warn, or no action if trying to use chained assignment,

    The default is warn

    [default: warn] [currently: warn]

mode.sim_interactive : boolean

    Whether to simulate interactive mode for purposes of testing

    [default: False] [currently: False]

mode.use_inf_as_null : boolean

    True means treat None, NaN, INF, -INF as null (old way),

    False means None and NaN are null, but INF, -INF are not null

    (new way).

    [default: False] [currently: False]
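
The Raises section of the docstring above says set_option raises OptionError when no such option exists. The import path of OptionError has moved between pandas versions, so this sketch catches KeyError instead, which OptionError subclasses. The helper try_set and the option name display.no_such_option are made up for illustration.

```python
import pandas as pd


def try_set(name, value):
    """Return True if pandas accepted the option name, False if it rejected it."""
    try:
        pd.set_option(name, value)
        return True
    except KeyError:  # OptionError is a subclass of KeyError
        return False


ok = try_set("display.max_colwidth", 100)    # a real option, spelled out in full
bad = try_set("display.no_such_option", 1)   # deliberately invalid name

pd.reset_option("display.max_colwidth")      # restore the default afterwards
print(ok, bad)
```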
