pandas series 11: cut/stack/melt

pandas series 10: numeric operations 2

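This post is the second installment of notes on the book 《对比Excel，轻松学习Python数据分析》. It covers:

Binning values into intervals
Inserting data (rows or columns)
Transposing
Reshaping the index
Converting between long and wide tables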

Binning values into intervals

Excel

In Excel, binning is done with nested IF functions:

=IF(A2<4,"<4",IF(A2<7,"4-6",">=7"))

Python

Example

In pandas, binning is done with the cut() function; its bins parameter specifies the intervals.
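As a rough pandas counterpart to the Excel formula above, the sketch below bins a few made-up integers into the same three groups. The sample values and the use of infinite outer edges are assumptions chosen for illustration, not taken from the book.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values standing in for column A of the Excel sheet
values = pd.Series([1, 4, 6, 7, 10])

# right=False makes every bin closed on the left: [-inf, 4), [4, 7), [7, inf),
# which matches the "<4", "4-6", ">=7" branches of the nested IF
binned = pd.cut(
    values,
    bins=[-np.inf, 4, 7, np.inf],
    right=False,
    labels=["<4", "4-6", ">=7"],
)

print(binned.tolist())  # ['<4', '4-6', '4-6', '>=7', '>=7']
```

The result is a categorical Series, so the labels keep their order when sorting or grouping.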

cut()

Here is how the official documentation describes the cut function:

pandas.cut(x, bins, right: bool = True, labels=None, retbins: bool = False, precision: int = 3, include_lowest: bool = False, duplicates: str = 'raise')

x: The input array to be binned. Must be 1-dimensional.
bins: int, sequence of scalars, or IntervalIndex. The criteria to bin by; there are three options:
- int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.
- sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of x is done.
- IntervalIndex : Defines the exact bins to be used. Note that IntervalIndex for bins must be non-overlapping.
right: bool, default True. By default each bin is open on the left and closed on the right.

Indicates whether bins includes the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1,2], (2,3], (3,4]. This argument is ignored when bins is an IntervalIndex.

labels: array or False, default None. Labels to use in place of the resulting intervals.

Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins. This affects the type of the output container (see below). This argument is ignored when bins is an IntervalIndex. If True, raises an error.

retbins: bool, default False. Whether to also return the bin edges.

Whether to return the bins or not. Useful when bins is provided as a scalar.
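To make the parameters above concrete, here is a small sketch that varies right, labels, and retbins on the same input. The ages Series is made-up sample data.

```python
import pandas as pd

# Hypothetical sample data
ages = pd.Series([1, 5, 10, 15, 20])

# Default right=True: bins are (0, 10] and (10, 20]
closed_right = pd.cut(ages, bins=[0, 10, 20])

# right=False: bins are [0, 10) and [10, 20); 20 falls outside and becomes NaN
closed_left = pd.cut(ages, bins=[0, 10, 20], right=False)

# labels=False: return integer bin indicators instead of intervals
codes = pd.cut(ages, bins=[0, 10, 20], labels=False)

# An int for bins asks for equal-width bins; retbins=True also returns the edges
binned, edges = pd.cut(ages, bins=2, retbins=True)

print(codes.tolist())               # [0, 0, 0, 1, 1]
print(closed_left.isna().tolist())  # [False, False, False, False, True]
print(len(edges))                   # 3 edges for 2 bins
```

With an integer bins, the lowest edge ends up slightly below the minimum of the data, which is the small range extension the documentation mentions.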

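qcut

qcut() does not need the bin edges up front; you only specify how many bins you want. It picks the edges so that each bin holds, as nearly as possible, the same number of values.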
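A minimal sketch of qcut on made-up data: twelve values are split into four bins, and each bin ends up with three values. The labels Q1 to Q4 are hypothetical names chosen for this example.

```python
import pandas as pd

# Hypothetical sample data: the integers 1 through 12
values = pd.Series(range(1, 13))

# q=4 asks for four quantile-based bins; the edges are computed from the data
quartiles = pd.qcut(values, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Count how many values landed in each bin, in bin order
counts = quartiles.value_counts().sort_index()
print(counts.tolist())  # [3, 3, 3, 3]
```

Compare this with cut(values, bins=4), which would make the bins equally wide rather than equally populated.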

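Inserting a new row or column

Excel

In Excel, select the row or column that the new one should go in front of, then choose Insert from the menu.

Python

In pandas this is done with the DataFrame insert() method: pass the position to insert at, the name of the new column, and the data to insert. Note that insert() adds columns only.

df.insert(2, "score", np.random.randint(80, 100, 10))  # insert a column named score at position 2 (0-based), i.e. after the first two columns; df must have 10 rows

pandas can also add a column by assigning directly to a new column name; the new column is appended at the end.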
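The two ways of adding a column can be compared side by side. The DataFrame and the column names below (name, age, city, score, passed) are made up for illustration.

```python
import pandas as pd

# Hypothetical three-row table
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "age": [20, 21, 22],
    "city": ["x", "y", "z"],
})

# insert() places the new column at a chosen 0-based position and modifies df in place
df.insert(2, "score", [90, 85, 88])

# Direct assignment always appends the new column at the end
df["passed"] = df["score"] >= 86

print(list(df.columns))  # ['name', 'age', 'score', 'city', 'passed']
```

insert() raises an error if the column name already exists, whereas direct assignment overwrites the existing column.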

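Swapping rows and columns

Swapping rows and columns is simply transposition.

Excel

First copy the data to be transposed.
When pasting, choose Paste Special, then select Transpose.

Screenshot of the transposed result

Python

In pandas, transposing only requires the .T attribute.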
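A minimal sketch of transposition on a made-up two-by-two table: the row labels become column labels and vice versa.

```python
import pandas as pd

# Hypothetical table with row labels x, y and column labels a, b
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

# .T is an attribute, not a method call; it returns the transposed DataFrame
transposed = df.T

print(list(transposed.index))    # ['a', 'b']
print(list(transposed.columns))  # ['x', 'y']
```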

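Reshaping the index

Reshaping the index means rebuilding the existing index into a new structure. There are two common ways to lay out data:

Tabular
Tree-shaped

The figure below shows the tabular layout: a value is located by one row label and one column label.

The figure below shows the tree-shaped layout: the column index of the tabular layout has also become a row index, which in effect gives the tabular data a hierarchical index.

Converting data from the tabular layout to the tree-shaped layout is called reshaping.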

stack

Excel has no direct way to do this; in pandas it is done with the stack() method.

unstack

unstack() does the reverse: it converts tree-shaped data back to tabular data.
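The round trip between the two layouts can be sketched as follows. The scores table is made-up sample data.

```python
import pandas as pd

# Hypothetical tabular data: one row label and one column label locate each value
scores = pd.DataFrame(
    {"math": [90, 80], "english": [85, 95]},
    index=["Tom", "Amy"],
)

# stack() moves the column labels into an inner level of the row index,
# giving a Series with a two-level (hierarchical) index
stacked = scores.stack()
print(stacked.index.nlevels)           # 2
print(stacked.loc[("Tom", "math")])    # 90

# unstack() moves the inner index level back out to the columns
restored = stacked.unstack()
print(restored.loc["Amy", "english"])  # 95
```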

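Converting between long and wide tables

Long and wide tables

Long table: many rows of records

Wide table: many attributes (columns)

In Excel, converting between long and wide tables is done directly by copying and pasting. In Python it is done with the stack() and melt() methods. During the conversion, the wide table and the long table must share some columns, namely the identifier columns that stay unchanged. For example, convert the wide table in the figure below into a long table.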

Wide table:

Long table:

Implementation

Using stack()
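One way to go from wide to long with stack() is sketched below: move the shared columns into the index, stack the remaining columns, then turn the index back into ordinary columns. The table contents are made up; only the column names company, name, Year, and Sale follow the example in this post.

```python
import pandas as pd

# Hypothetical wide table: one sales column per year
wide = pd.DataFrame({
    "company": ["Apple", "Google"],
    "name": ["Ann", "Bob"],
    "2019": [100, 200],
    "2020": [110, 210],
})

long = (
    wide.set_index(["company", "name"])  # keep the shared columns as the index
        .stack()                         # year columns become an inner index level
        .reset_index()                   # turn all index levels back into columns
)
# After reset_index the last two columns have default names; rename them
long.columns = ["company", "name", "Year", "Sale"]

print(long["Sale"].tolist())  # [100, 110, 200, 210]
```

The rows come out grouped by company and name, with the years for each pair listed together.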

Using melt()

Main parameters and their meanings

Name	Description	Type / Default Value	Required / Optional
frame	The DataFrame to melt.	DataFrame	Required
id_vars	Column(s) to use as identifier variables.	tuple, list, or ndarray	Optional
value_vars	Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.	tuple, list, or ndarray	Optional
var_name	Name to use for the 'variable' column. If None it uses frame.columns.name or 'variable'.	scalar, default None	Optional
value_name	Name to use for the 'value' column.	scalar, default 'value'	Optional
col_level	If columns are a MultiIndex then use this level to melt.	int or string	Optional

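company and name are the row identifiers (passed as id_vars)
Year is the name of the variable column (var_name)
Sale is the name of the value column (value_name)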
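Putting those three roles together, a melt() call might look like the sketch below. The table contents are made up; the column names company, name, Year, and Sale follow the description above.

```python
import pandas as pd

# Hypothetical wide table: one sales column per year
wide = pd.DataFrame({
    "company": ["Apple", "Google"],
    "name": ["Ann", "Bob"],
    "2019": [100, 200],
    "2020": [110, 210],
})

long = wide.melt(
    id_vars=["company", "name"],  # columns kept as identifiers
    var_name="Year",              # name for the column holding the old column labels
    value_name="Sale",            # name for the column holding the values
)

print(list(long.columns))     # ['company', 'name', 'Year', 'Sale']
print(long["Sale"].tolist())  # [100, 200, 110, 210]
```

Because value_vars is omitted, every column not listed in id_vars is unpivoted. The rows come out one year at a time, so the order differs from the stack() approach even though the content is the same.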