Fork me on GitHub

pandas系列6-重塑reshape

重新排列表格型数据的基础运算称之为重塑reshape或者轴向旋转pivot

  • stack:将数据的列旋转成行,AB由列属性变成行索引
  • unstack:将数据的行旋转成列,AB由行索引变成列属性

重点知识

  • stack和unstack的用法
  • 如何实现行和列的位置互换

层次化索引 MultiIndex

  • 数据分散在不同的文件或者数据库中
  • 层次化索引在⼀个轴上拥有多个(两个以上)索引级别
  • 低维度形式处理高维度数据
1
2
3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
1
2
3
4
5
6
7
8
data = pd.DataFrame(np.arange(6).reshape(2,3)
,index=pd.Index(['Inhio','Colorado'], name='state')
,columns=pd.Index(['one', 'two', 'three']
,name='number' # name 参数在column里面,不是单独的,相当于是给columns起名字
)
)

data
number one two three
state
Inhio 0 1 2
Colorado 3 4 5

列行的互转

1
2
res = data.stack()   # 列转成行
res
1
2
3
4
5
6
7
8
state     number
Inhio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32
1
res.index
1
2
3
MultiIndex(levels=[['Inhio', 'Colorado'], ['one', 'two', 'three']],
codes=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
names=['state', 'number'])
1
type(res)     # res 是S型对象
1
pandas.core.series.Series
1
res.unstack()   # 行转成列:one、two、three 变成列属性
number one two three
state
Inhio 0 1 2
Colorado 3 4 5
1
2
3
res.unstack(1)   # 默认操作最内层
# res.unstack()
# res.unstack('number') 同上;恢复原状
number one two three
state
Inhio 0 1 2
Colorado 3 4 5

实现行索引和列属性的位置互换

1
2
res.unstack(0)   # 实现了将行索引和列属性的位置互换
# res.unstack('state') 同上功能
state Inhio Colorado
number
one 0 3
two 1 4
three 2 5
1
2
s1 = pd.Series([0,1,2,3], index=['a','b','c','d'])
s2 = pd.Series([4,5,6], index=['c','d','e'])
1
2
data1 = pd.concat([s1,s2], keys=['one', 'two'])
data1
1
2
3
4
5
6
7
8
one  a    0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
1
data1.unstack()  # 行索引转成列属性,unstack引入缺失值
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
1
data1.unstack().stack()   # 操作可逆,stack操作默认过滤缺失值
1
2
3
4
5
6
7
8
one  a    0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
1
data1.unstack().stack(dropna=False)   # 通过dropna参数保留缺失值
1
2
3
4
5
6
7
8
9
10
11
one  a    0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64

官网demo

1
2
3
4
5
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two',
'one', 'two', 'one', 'two']]))
tuples
1
2
3
4
5
6
7
8
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]

zip用法

1
2
3
4
zip(*[['bar', 'bar', 'baz', 'baz',
'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two',
'one', 'two', 'one', 'two']])
1
<zip at 0x1d68ccb0d08>
1
2
list(zip(*[["name", "age", "fee"],    # zip函数的用法
["xiaoming", 18, "28.22"]]))
1
[('name', 'xiaoming'), ('age', 18), ('fee', '28.22')]
1
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])  # 错层次索引如何创建
1
2
# index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index
1
2
3
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=['first', 'second'])
1
2
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
A B
first second
bar one -0.904164 0.001987
two -0.299222 -1.895165
baz one 2.029077 -0.618868
two -3.423209 0.621895
foo one 0.278142 -0.029705
two -1.875680 -1.225554
qux one 0.593413 -1.241949
two -0.727231 0.394393
1
2
df2 = df[:4]
df2
A B
first second
bar one -0.904164 0.001987
two -0.299222 -1.895165
baz one 2.029077 -0.618868
two -3.423209 0.621895
1
2
stacked = df2.stack()   # 列---->行
stacked
1
2
3
4
5
6
7
8
9
10
first  second   
bar one A -0.904164
B 0.001987
two A -0.299222
B -1.895165
baz one A 2.029077
B -0.618868
two A -3.423209
B 0.621895
dtype: float64
1
2
3
stacked.unstack()  # 行----->列
# stacked.unstack(0)
# stacked.unstack('first')
first bar baz
second
one A -0.904164 2.029077
B 0.001987 -0.618868
two A -0.299222 -3.423209
B -1.895165 0.621895
1
2
stacked.unstack(1)
# stacked.unstack('second')
second one two
first
bar A -0.904164 -0.299222
B 0.001987 -1.895165
baz A 2.029077 -3.423209
B -0.618868 0.621895

关于unstack的数字标签

  • unstack(1) =unstack("second"),默认是最里层标签
  • unstack(0) =unstack("first")

pivot

本质

DF的pivot本质上就是set_index先创建层次化索引,再利用unstack进行重塑。

Pandas透视表详解

左边的表格类似于是Excel或者MySQL中的存储形式,通过轴向转换变成右边的DataFrame型数据。

1
2
3
df[df['bar'] == 'A']     # select out everything for variable A 

df.pivot(index='foo', columns='bar', values='baz') # 分别指定行索引、列属性还有value值

官网demo

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
In [1]: df
Out[1]:
date variable value
0 2000-01-03 A 0.469112
1 2000-01-04 A -0.282863
2 2000-01-05 A -1.509059
3 2000-01-03 B -1.135632
4 2000-01-04 B 1.212112
5 2000-01-05 B -0.173215
6 2000-01-03 C 0.119209
7 2000-01-04 C -1.044236
8 2000-01-05 C -0.861849
9 2000-01-03 D -2.104569
10 2000-01-04 D -0.494929
11 2000-01-05 D 1.071804

In [2]: df.pivot(index='date', columns='variable', values='value')
Out[2]:
variable A B C D
date
2000-01-03 0.469112 -1.135632 0.119209 -2.104569
2000-01-04 -0.282863 1.212112 -1.044236 -0.494929
2000-01-05 -1.509059 -0.173215 -0.861849 1.071804

In [3]: df[df['variable'] == 'A']
Out[3]:
date variable value
0 2000-01-03 A 0.469112
1 2000-01-04 A -0.282863
2 2000-01-05 A -1.509059

In [4]: df['value2'] = df['value'] * 2
In [5]: pivoted = df.pivot(index='date', columns='variable')

In [6]: pivoted
Out[6]:
value value2
variable A B C D A B C D
date
2000-01-03 0.469112 -1.135632 0.119209 -2.104569 0.938225 -2.271265 0.238417 -4.209138
2000-01-04 -0.282863 1.212112 -1.044236 -0.494929 -0.565727 2.424224 -2.088472 -0.989859
2000-01-05 -1.509059 -0.173215 -0.861849 1.071804 -3.018117 -0.346429 -1.723698 2.143608

本文标题:pandas系列6-重塑reshape

发布时间:2019年10月11日 - 08:10

原始链接:http://www.renpeter.cn/2019/10/11/pandas%E7%B3%BB%E5%88%976-%E9%87%8D%E5%A1%91reshaping.html

许可协议: 署名-非商业性使用-禁止演绎 4.0 国际 转载请保留原文链接及作者。

Coffee or Tea