Fork me on GitHub

pandas系列6-重塑reshape

Posted on 2019-10-11 | In pandas , 文档 |

Words count in article: 1.5k | Reading time ≈ 8

重新排列表格型数据的基础运算称之为重塑reshape或者轴向旋转pivot

stack：将数据的列旋转成行，AB由列属性变成行索引
unstack:将数据的行旋转成列，AB由行索引变成列属性

重点知识

stack和unstack的用法
如何实现行和列的位置互换

层次化索引 `MultiIndex`

数据分散在不同的文件或者数据库中
层次化索引在⼀个轴上拥有多个（两个以上）索引级别
低维度形式处理高维度数据

1
2
3

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.DataFrame(np.arange(6).reshape(2,3)
                   ,index=pd.Index(['Inhio','Colorado'], name='state')
                   ,columns=pd.Index(['one', 'two', 'three']
                   ,name='number'   # name 参数在column里面，不是单独的，相当于是给columns起名字
                     )    
                   )

data

number	one	two	three
state
Inhio	0	1	2
Colorado	3	4	5

列行的互转

1 2	res = data.stack() # 列转成行 res

state     number
Inhio     one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

res.index

1
2
3

MultiIndex(levels=[['Inhio', 'Colorado'], ['one', 'two', 'three']],
           codes=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
           names=['state', 'number'])

1	type(res) # res 是S型对象

1	pandas.core.series.Series

1	res.unstack() # 行转成列：one、two、three 变成列属性

number	one	two	three
state
Inhio	0	1	2
Colorado	3	4	5

1
2
3

res.unstack(1)   # 默认操作最内层
# res.unstack()  
# res.unstack('number')  同上；恢复原状

number	one	two	three
state
Inhio	0	1	2
Colorado	3	4	5

实现行索引和列属性的位置互换

1 2	res.unstack(0) # 实现了将行索引和列属性的位置互换 # res.unstack('state') 同上功能

state	Inhio	Colorado
number
one	0	3
two	1	4
three	2	5

1 2	s1 = pd.Series([0,1,2,3], index=['a','b','c','d']) s2 = pd.Series([4,5,6], index=['c','d','e'])

1 2	data1 = pd.concat([s1,s2], keys=['one', 'two']) data1

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

1	data1.unstack() # 行索引转成列属性，unstack引入缺失值

	a	b	c	d	e
one	0.0	1.0	2.0	3.0	NaN
two	NaN	NaN	4.0	5.0	6.0

1	data1.unstack().stack() # 操作可逆，stack操作默认过滤缺失值

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

1	data1.unstack().stack(dropna=False) # 通过dropna参数保留缺失值

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64

官网demo

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
tuples

[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

`zip`用法

zip(*[['bar', 'bar', 'baz', 'baz',
   'foo', 'foo', 'qux', 'qux'],
  ['one', 'two', 'one', 'two',
   'one', 'two', 'one', 'two']])

1	<zip at 0x1d68ccb0d08>

1 2	list(zip(*[["name", "age", "fee"], # zip函数的用法 ["xiaoming", 18, "28.22"]]))

1	[('name', 'xiaoming'), ('age', 18), ('fee', '28.22')]

1	index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second']) # 错层次索引如何创建

1 2	# index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second']) index

1
2
3

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

1 2	df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B']) df

		A	B
first	second
bar	one	-0.904164	0.001987
bar	two	-0.299222	-1.895165
baz	one	2.029077	-0.618868
baz	two	-3.423209	0.621895
foo	one	0.278142	-0.029705
foo	two	-1.875680	-1.225554
qux	one	0.593413	-1.241949
qux	two	-0.727231	0.394393

1 2	df2 = df[:4] df2

		A	B
first	second
bar	one	-0.904164	0.001987
bar	two	-0.299222	-1.895165
baz	one	2.029077	-0.618868
baz	two	-3.423209	0.621895

1 2	stacked = df2.stack() # 列---->行 stacked

first  second   
bar    one     A   -0.904164
               B    0.001987
       two     A   -0.299222
               B   -1.895165
baz    one     A    2.029077
               B   -0.618868
       two     A   -3.423209
               B    0.621895
dtype: float64

1
2
3

stacked.unstack()  # 行----->列
# stacked.unstack(0)
# stacked.unstack('first')

	first	bar	baz
second
one	A	-0.904164	2.029077
one	B	0.001987	-0.618868
two	A	-0.299222	-3.423209
two	B	-1.895165	0.621895

1 2	stacked.unstack(1) # stacked.unstack('second')

	second	one	two
first
bar	A	-0.904164	-0.299222
bar	B	0.001987	-1.895165
baz	A	2.029077	-3.423209
baz	B	-0.618868	0.621895

关于unstack的数字标签

unstack(1) =unstack("second")，默认是最里层标签
unstack(0) =unstack("first")

pivot

本质

DF的pivot本质上就是set_index先创建层次化索引，再利用unstack进行重塑。

Pandas透视表详解

左边的表格类似于是Excel或者MySQL中的存储形式，通过轴向转换变成右边的DataFrame型数据。

1
2
3

df[df['bar'] == 'A']     # select out everything for variable A 

df.pivot(index='foo', columns='bar', values='baz')   # 分别指定行索引、列属性还有value值

官网demo

In [1]: df
Out[1]: 
         date variable     value
0  2000-01-03        A  0.469112
1  2000-01-04        A -0.282863
2  2000-01-05        A -1.509059
3  2000-01-03        B -1.135632
4  2000-01-04        B  1.212112
5  2000-01-05        B -0.173215
6  2000-01-03        C  0.119209
7  2000-01-04        C -1.044236
8  2000-01-05        C -0.861849
9  2000-01-03        D -2.104569
10 2000-01-04        D -0.494929
11 2000-01-05        D  1.071804

In [2]: df.pivot(index='date', columns='variable', values='value')
Out[2]: 
variable           A         B         C         D
date                                              
2000-01-03  0.469112 -1.135632  0.119209 -2.104569
2000-01-04 -0.282863  1.212112 -1.044236 -0.494929
2000-01-05 -1.509059 -0.173215 -0.861849  1.071804

In [3]: df[df['variable'] == 'A']
Out[3]: 
        date variable     value
0 2000-01-03        A  0.469112
1 2000-01-04        A -0.282863
2 2000-01-05        A -1.509059
    
In [4]: df['value2'] = df['value'] * 2
In [5]: pivoted = df.pivot(index='date', columns='variable')

In [6]: pivoted
Out[6]: 
               value                                  value2                              
variable           A         B         C         D         A         B         C         D
date                                                                                      
2000-01-03  0.469112 -1.135632  0.119209 -2.104569  0.938225 -2.271265  0.238417 -4.209138
2000-01-04 -0.282863  1.212112 -1.044236 -0.494929 -0.565727  2.424224 -2.088472 -0.989859
2000-01-05 -1.509059 -0.173215 -0.861849  1.071804 -3.018117 -0.346429 -1.723698  2.143608

本文标题:pandas系列6-重塑reshape

发布时间:2019年10月11日 - 08:10

原始链接:http://www.renpeter.cn/2019/10/11/pandas%E7%B3%BB%E5%88%976-%E9%87%8D%E5%A1%91reshaping.html

许可协议: 署名-非商业性使用-禁止演绎 4.0 国际转载请保留原文链接及作者。

Coffee or Tea