
Pandas Type Operations

pandas data type operations

This post introduces four common pandas methods for working with data types:

  • to_numeric
  • astype
  • to_datetime
  • select_dtypes

import pandas as pd
import numpy as np

Pandas column types

to_numeric()

Official documentation: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html

pandas.to_numeric(arg,            # scalar, list, tuple, 1-d array, or Series
                  errors='raise', # 'ignore', 'raise', or 'coerce'; the default is 'raise'
                  downcast=None)

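The errors parameter accepts three values:

  • ignore: return the input unchanged when parsing fails (deprecated since pandas 2.2)
  • raise: raise an exception when parsing fails
  • coerce: set values that cannot be parsed to NaN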
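
The difference between raise and coerce can be sketched as follows. This is a minimal illustration, assuming a small hypothetical Series named raw; because errors="ignore" is deprecated in pandas 2.2 and later, the sketch catches the exception explicitly instead of using it.

```python
import pandas as pd

# Hypothetical sample data: the last entry cannot be parsed as a number
raw = pd.Series(["1", "2.5", "abc"], dtype=object)

# errors="raise" (the default): invalid input raises ValueError
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

# errors="coerce": invalid input becomes NaN, the rest is converted
coerced = pd.to_numeric(raw, errors="coerce")

print(raised)
print(coerced)
```

Catching ValueError and falling back to the original data gives roughly the behaviour that errors="ignore" used to provide.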

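Using downcast:

  1. It takes a string; the default is None, and the valid values are 'integer', 'signed', 'unsigned', or 'float'

  2. If it is not None and the data has been converted to a numeric dtype, the result is then downcast to the smallest numeric dtype that can hold the values

  3. The smallest dtype for each option

    • Signed integers: integer or signed, smallest dtype np.int8

    • Unsigned integers: unsigned, smallest dtype np.uint8

    • Floating point: float, smallest dtype np.float32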
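
The dtype that downcast picks depends on the range of the values, not only on the option. A minimal sketch, assuming three hypothetical Series (small, large, negative) created only for illustration:

```python
import pandas as pd

small = pd.Series([1, 2, 3], dtype="int64")       # fits in int8
large = pd.Series([1, 2, 300], dtype="int64")     # 300 exceeds the int8 maximum of 127
negative = pd.Series([-1, 2, 3], dtype="int64")   # negative values rule out unsigned dtypes

small_int = pd.to_numeric(small, downcast="integer")
large_int = pd.to_numeric(large, downcast="integer")
neg_unsigned = pd.to_numeric(negative, downcast="unsigned")

print(small_int.dtype, large_int.dtype, neg_unsigned.dtype)
```

When no smaller dtype can represent the data, as with the negative values and "unsigned", the original dtype is kept and no error is raised.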

Example 1: numeric-like values

s = pd.Series(["2.0", '1', -3, 5.0])  # numeric-like values: strings and numbers mixed
s
0    2.0
1      1
2     -3
3    5.0
dtype: object

Because the Series mixes strings and numbers, its dtype is object. Convert it to a numeric dtype:

# 1. Without downcast, the result here is float64

pd.to_numeric(s)
0    2.0
1    1.0
2   -3.0
3    5.0
dtype: float64
# 2. downcast="integer": the smallest signed integer dtype that fits

pd.to_numeric(s, downcast="integer")
0    2
1    1
2   -3
3    5
dtype: int8
# 3. downcast="signed" is equivalent to "integer"

pd.to_numeric(s, downcast="signed")
0    2
1    1
2   -3
3    5
dtype: int8
# 4. downcast="unsigned": -3 is negative, so no unsigned dtype fits and the result stays float64

pd.to_numeric(s, downcast="unsigned")
0    2.0
1    1.0
2   -3.0
3    5.0
dtype: float64
# 5. downcast="float": the smallest float dtype is float32

pd.to_numeric(s, downcast="float")
0    2.0
1    1.0
2   -3.0
3    5.0
dtype: float32

Example 2: numbers mixed with strings

s1 = pd.Series(["2.0", 'pandas', -3, 5.0])  # numbers plus a non-numeric string
s1
0       2.0
1    pandas
2        -3
3       5.0
dtype: object
# pd.to_numeric(s1)   # raises an exception by default

# Ignore the error and return the input unchanged

pd.to_numeric(s1, errors="ignore")
0       2.0
1    pandas
2        -3
3       5.0
dtype: object
# pd.to_numeric(s1, errors="raise")   # invalid input raises an exception

# Set values that cannot be parsed to NaN

pd.to_numeric(s1, errors="coerce")
0    2.0
1    NaN
2   -3.0
3    5.0
dtype: float64
# Set values that cannot be parsed to NaN and downcast the result to float32

pd.to_numeric(s1, errors="coerce", downcast="float")
0    2.0
1    NaN
2   -3.0
3    5.0
dtype: float32
# Set values that cannot be parsed to NaN, then replace them with 0

pd.to_numeric(s1, errors="coerce").fillna(0)
0    2.0
1    0.0
2   -3.0
3    5.0
dtype: float64

Example 3: numeric data

s2 = pd.Series([1,2.0,3.0], dtype="float64")
s2
0    1.0
1    2.0
2    3.0
dtype: float64
s3 = pd.to_numeric(s2, downcast="float")
s3
0    1.0
1    2.0
2    3.0
dtype: float32
s4 = pd.to_numeric(s2, downcast="integer")
s4
0    1
1    2
2    3
dtype: int8

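One benefit of converting types is lower memory use. Compare the memory taken by the same data under the three dtypes above. Note that memory_usage() includes the index (128 bytes here); the values themselves take 24, 12, and 3 bytes:
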
print("memory of float64: ", s2.memory_usage())
print("memory of float32: ", s3.memory_usage())
print("memory of int8: ", s4.memory_usage())
memory of float64:  152
memory of float32:  140
memory of int8:  131
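
To compare only the space taken by the values, the index can be left out. A minimal sketch, assuming a hypothetical Series named values that mirrors s2 above:

```python
import pandas as pd

values = pd.Series([1, 2.0, 3.0], dtype="float64")
as_float32 = pd.to_numeric(values, downcast="float")
as_int8 = pd.to_numeric(values, downcast="integer")

# index=False reports only the bytes used by the data itself
for name, ser in [("float64", values), ("float32", as_float32), ("int8", as_int8)]:
    print(name, ser.memory_usage(index=False))
```

With three elements this gives 8, 4, and 1 bytes per value, so the saving grows in proportion to the length of the Series.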

astype

astype is another way to convert types:

s2
0    1.0
1    2.0
2    3.0
dtype: float64
s2.astype("float32")
0    1.0
1    2.0
2    3.0
dtype: float32
s2.astype("int64")
0    1
1    2
2    3
dtype: int64
s2.astype("int32")
0    1
1    2
2    3
dtype: int32
s2.astype("category")
0    1.0
1    2.0
2    3.0
dtype: category
Categories (3, float64): [1.0, 2.0, 3.0]
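
On a DataFrame, astype also accepts a dict that maps column names to target dtypes. The sketch below assumes a hypothetical DataFrame named frame, and also shows that astype, unlike to_numeric with errors="coerce", raises when a string cannot be converted:

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
frame = pd.DataFrame({"qty": ["1", "2", "3"], "price": [1.5, 2.0, 3.25]})

# A dict maps column names to target dtypes
converted = frame.astype({"qty": "int32", "price": "float32"})
print(converted.dtypes)

# astype has no "coerce" option: an unparseable string raises ValueError
try:
    pd.Series(["1", "x"], dtype=object).astype("int64")
    failed = False
except ValueError:
    failed = True
print(failed)
```

For columns that may contain invalid entries, pd.to_numeric with errors="coerce" is usually the safer first step, with astype applied afterwards if a specific dtype is needed.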

to_datetime()

pandas.to_datetime(arg,
errors='raise',
dayfirst=False,
yearfirst=False,
utc=None,
format=None,
exact=True,
unit=None,
infer_datetime_format=False,
origin='unix',
cache=True)
df = pd.DataFrame({"Year":[2022,2021,2022],
"Month":[1,3,5],
"Day":["10","12","28"]
})
df
   Year  Month Day
0  2022      1  10
1  2021      3  12
2  2022      5  28
df.dtypes
Year      int64
Month     int64
Day      object
dtype: object

Concatenating the columns directly raises an error: strings and numbers cannot be added together.

# df["Date"] = df["Year"] + df["Month"] + df["Day"]
df["Date"] = pd.to_datetime(df)
df
   Year  Month Day       Date
0  2022      1  10 2022-01-10
1  2021      3  12 2021-03-12
2  2022      5  28 2022-05-28
df.dtypes
Year              int64
Month             int64
Day              object
Date     datetime64[ns]
dtype: object
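
If the columns are not named year, month, and day, one alternative is to build the date strings by hand. A minimal sketch, assuming a hypothetical DataFrame named df_parts; every part is cast to str first so that + means string concatenation, and month and day are zero-padded to match the format:

```python
import pandas as pd

# Hypothetical data mirroring the mixed int/str columns above
df_parts = pd.DataFrame({"Year": [2022, 2021], "Month": [1, 3], "Day": ["10", "12"]})

# Cast each part to str and zero-pad month and day to two digits
date_text = (
    df_parts["Year"].astype(str) + "-"
    + df_parts["Month"].astype(str).str.zfill(2) + "-"
    + df_parts["Day"].astype(str).str.zfill(2)
)
dates = pd.to_datetime(date_text, format="%Y-%m-%d")
print(dates)
```
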
pd.to_datetime("10/2/21")  # default: parsed as month/day/year
Timestamp('2021-10-02 00:00:00')
pd.to_datetime("10-2-21")  # default: parsed as month-day-year
Timestamp('2021-10-02 00:00:00')
pd.to_datetime("10/2/21",dayfirst=True)
Timestamp('2021-02-10 00:00:00')
pd.to_datetime("10/2/21",yearfirst=True)
Timestamp('2010-02-21 00:00:00')
pd.to_datetime("22-01-21",dayfirst=True)
Timestamp('2021-01-22 00:00:00')
pd.to_datetime("22-01-21",yearfirst=True)
Timestamp('2022-01-21 00:00:00')
pd.to_datetime('20220107', format='%Y%m%d', errors='ignore')
Timestamp('2022-01-07 00:00:00')
pd.to_datetime('20220107112347', errors='ignore')
Timestamp('2022-01-07 11:23:47')
pd.to_datetime('20220107112233', format='%Y%m%d%H%M%S')
Timestamp('2022-01-07 11:22:33')
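
The same errors options apply when parsing a whole column. A minimal sketch, assuming a hypothetical Series named raw_dates in which one entry is not a valid date; with errors="coerce" that entry becomes NaT instead of raising:

```python
import pandas as pd

# Hypothetical raw column; the last entry is not a valid date
raw_dates = pd.Series(["20220107", "20220215", "not a date"])

# An explicit format avoids ambiguity; invalid entries become NaT
parsed = pd.to_datetime(raw_dates, format="%Y%m%d", errors="coerce")
print(parsed)

# NaT entries can be located with isna() and handled afterwards
print(parsed.isna().tolist())
```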

select_dtypes

Select the columns of a DataFrame that have the given dtypes.

df.dtypes
Year              int64
Month             int64
Day              object
Date     datetime64[ns]
dtype: object
df.select_dtypes(include=["int"])
   Year  Month
0  2022      1
1  2021      3
2  2022      5
df.select_dtypes(include=["object"])
  Day
0  10
1  12
2  28
df.select_dtypes(include=["O"])  # same result as above
  Day
0  10
1  12
2  28
# Exclude object columns

df.select_dtypes(exclude=["object"])
   Year  Month       Date
0  2022      1 2022-01-10
1  2021      3 2021-03-12
2  2022      5 2022-05-28
# Exclude object and int columns

df.select_dtypes(exclude=["object","int"])
        Date
0 2022-01-10
1 2021-03-12
2 2022-05-28
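
Besides concrete dtype names, select_dtypes accepts generic selectors such as "number" and "datetime". A minimal sketch, assuming a hypothetical DataFrame named frame built only for illustration:

```python
import pandas as pd

# Hypothetical DataFrame with one column per kind of dtype
frame = pd.DataFrame({
    "qty": [1, 2],
    "ratio": [0.5, 0.25],
    "label": ["a", "b"],
    "when": pd.to_datetime(["2022-01-10", "2021-03-12"]),
})

# "number" matches every integer and float column
numeric_cols = frame.select_dtypes(include="number").columns.tolist()
# "datetime" matches datetime64 columns
datetime_cols = frame.select_dtypes(include="datetime").columns.tolist()
# exclude works with the same selectors
non_numeric = frame.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols, datetime_cols, non_numeric)
```

Using "number" avoids depending on whether the integer columns happen to be int32 or int64 on a given platform.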

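Title: Pandas Type Operations

Published: 2022-08-31, 09:08

Original link: http://www.renpeter.cn/2022/08/31/Pandas%E7%B1%BB%E5%9E%8B%E6%93%8D%E4%BD%9C.html

License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please keep the original link and author when reposting.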
