pandas Data Type Operations
This post introduces four common pandas methods for working with data types:
- to_numeric
- astype
- to_datetime
- select_dtypes
import pandas as pd
pandas column data types
to_numeric()
Official documentation: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html
pandas.to_numeric(arg, errors="raise", downcast=None)  # arg: scalar, list, tuple, 1-d array, or Series
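The errors parameter accepts three values:
- ignore: return the input unchanged when parsing fails (deprecated since pandas 2.2)
- raise: raise an exception when parsing fails (the default)
- coerce: set values that cannot be parsed to NaN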
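Using downcast:
- It is a string and defaults to None. Allowed values are 'integer', 'signed', 'unsigned', or 'float'.
- If it is not None, and the data has already been converted to a numeric dtype, the result is downcast to the smallest dtype of that family that can hold the values.
- The numeric families and their smallest dtypes:
  - Signed integer: integer or signed, smallest dtype np.int8
  - Unsigned integer: unsigned, smallest dtype np.uint8
  - Float: float, smallest dtype np.float32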
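The downcast rules above can be sketched as follows. The sample Series `values` and `with_negative` are made up for illustration and do not come from the original post. The sketch tries each option on non-negative whole numbers, then checks what happens when `unsigned` is requested for data containing a negative value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: non-negative whole numbers stored as float64.
values = pd.Series([1.0, 2.0, 3.0])

# Try every downcast option and record the dtype of the result.
result = {
    option: pd.to_numeric(values, downcast=option).dtype
    for option in ("integer", "signed", "unsigned", "float")
}

for option, dtype in result.items():
    print(f"downcast={option!r}: {dtype}")

# A negative value cannot be stored in an unsigned dtype,
# so no downcast is applied and the float64 dtype is kept.
with_negative = pd.to_numeric(pd.Series([-1.0, 2.0]), downcast="unsigned")
print(with_negative.dtype)
```

A downcast that cannot hold the data is skipped without an error, so it is worth checking the resulting dtype rather than assuming the downcast happened.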
Example 1: values that only look numeric
s = pd.Series(["2.0", "1", -3, 5.0])  # number-like values
0 2.0
1 1
2 -3
3 5.0
dtype: object
The Series has dtype object because it mixes strings and numbers. Convert it to a numeric dtype:
pd.to_numeric(s)  # 1. default: converts to float64
0 2.0
1 1.0
2 -3.0
3 5.0
dtype: float64
pd.to_numeric(s, downcast="integer")  # 2. downcast to the smallest signed integer dtype
0 2
1 1
2 -3
3 5
dtype: int8
pd.to_numeric(s, downcast="signed")  # 3. "signed" behaves the same as "integer"
0 2
1 1
2 -3
3 5
dtype: int8
pd.to_numeric(s, downcast="unsigned")  # 4. -3 does not fit an unsigned dtype, so float64 is kept
0 2.0
1 1.0
2 -3.0
3 5.0
dtype: float64
pd.to_numeric(s, downcast="float")  # 5. downcast to the smallest float dtype
0 2.0
1 1.0
2 -3.0
3 5.0
dtype: float32
Example 2: numbers mixed with strings
s1 = pd.Series(["2.0", "pandas", -3, 5.0])  # numbers plus a non-numeric string
0 2.0
1 pandas
2 -3
3 5.0
dtype: object
# pd.to_numeric(s1)  # raises an exception by default
pd.to_numeric(s1, errors="ignore")  # ignore the error and return the input unchanged
0 2.0
1 pandas
2 -3
3 5.0
dtype: object
# pd.to_numeric(s1, errors="raise")  # invalid values raise an exception
pd.to_numeric(s1, errors="coerce")  # invalid values become NaN
0 2.0
1 NaN
2 -3.0
3 5.0
dtype: float64
pd.to_numeric(s1, errors="coerce", downcast="float")  # invalid values become NaN, result downcast to float32
0 2.0
1 NaN
2 -3.0
3 5.0
dtype: float32
pd.to_numeric(s1, errors="coerce").fillna(0)  # invalid values become NaN, then are replaced with 0
0 2.0
1 0.0
2 -3.0
3 5.0
dtype: float64
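When `errors="coerce"` is used, it is often useful to know which entries failed to parse before filling them with a default. A minimal sketch, assuming a hypothetical Series `raw` that also contains a genuinely missing value:

```python
import pandas as pd

# Hypothetical input: numbers, a non-numeric string, and a real missing value.
raw = pd.Series(["2.0", "pandas", -3, 5.0, None])

converted = pd.to_numeric(raw, errors="coerce")

# A value failed to parse if it was present in the input
# but is NaN after the conversion.
failed = converted.isna() & raw.notna()

print(raw[failed].tolist())
```

Comparing against `raw.notna()` separates values that could not be parsed from values that were missing all along, which `fillna(0)` alone cannot distinguish.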
Example 3: data that is already numeric
s2 = pd.Series([1, 2.0, 3.0], dtype="float64")
0 1.0
1 2.0
2 3.0
dtype: float64
s3 = pd.to_numeric(s2, downcast="float")
0 1.0
1 2.0
2 3.0
dtype: float32
s4 = pd.to_numeric(s2, downcast="integer")
0 1
1 2
2 3
dtype: int8
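One benefit of converting dtypes is lower memory use. Compare the memory taken by the three Series above. In the output below, each figure includes 128 bytes for the index, so the values themselves take 24, 12, and 3 bytes:
print("memory of float64: ", s2.memory_usage())
print("memory of float32: ", s3.memory_usage())
print("memory of int8: ", s4.memory_usage())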
memory of float64: 152
memory of float32: 140
memory of int8: 131
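To compare only the space taken by the values, the index can be left out with `memory_usage(index=False)`. A minimal sketch that rebuilds the three Series under new, illustrative names:

```python
import pandas as pd

s_float64 = pd.Series([1, 2.0, 3.0], dtype="float64")
s_float32 = pd.to_numeric(s_float64, downcast="float")
s_int8 = pd.to_numeric(s_float64, downcast="integer")

# index=False reports only the bytes used by the values:
# 3 elements times 8, 4, and 1 bytes per element.
data_bytes = {
    str(series.dtype): series.memory_usage(index=False)
    for series in (s_float64, s_float32, s_int8)
}
print(data_bytes)
```

The saving per element stays the same as the Series grows, so the fixed index overhead matters less on large data.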
astype
Another way to convert a dtype is astype:
s2
0 1.0
1 2.0
2 3.0
dtype: float64
s2.astype("float32")
0 1.0
1 2.0
2 3.0
dtype: float32
s2.astype("int64")
0 1
1 2
2 3
dtype: int64
s2.astype("int32")
0 1
1 2
2 3
dtype: int32
s2.astype("category")
0 1.0
1 2.0
2 3.0
dtype: category
Categories (3, float64): [1.0, 2.0, 3.0]
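Two properties of astype are worth checking before relying on it: it has no `errors="coerce"` option, and it truncates when casting floats to integers. The sketch below uses made-up data (`frame`, `mixed`) to show a per-column cast with a dict, the failure on a non-numeric string, and the truncation.

```python
import pandas as pd

# Hypothetical frame: cast several columns at once with a dict.
frame = pd.DataFrame({"count": [1.0, 2.0, 3.0], "price": [1.5, 2.5, 3.5]})
converted = frame.astype({"count": "int32", "price": "float32"})
print(converted.dtypes)

# astype raises on a value it cannot convert ...
mixed = pd.Series(["2.0", "pandas", "5"])
try:
    mixed.astype("float64")
    astype_failed = False
except ValueError:
    astype_failed = True

# ... whereas to_numeric can turn the bad value into NaN.
coerced = pd.to_numeric(mixed, errors="coerce")

# Casting float to int drops the fractional part (truncates toward zero).
truncated = pd.Series([1.7, -1.7]).astype("int64").tolist()
print(astype_failed, coerced.tolist(), truncated)
```

As a rule of thumb, astype suits data that is already clean, while to_numeric is the safer choice for raw input.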
to_datetime()
pandas.to_datetime(arg, ...)
df = pd.DataFrame({"Year": [2022, 2021, 2022],
                   "Month": [1, 3, 5],
                   "Day": ["10", "12", "28"]})
| | Year | Month | Day |
|---|---|---|---|
| 0 | 2022 | 1 | 10 |
| 1 | 2021 | 3 | 12 |
| 2 | 2022 | 5 | 28 |
df.dtypes
Year int64
Month int64
Day object
dtype: object
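Adding the columns directly raises an error, because numbers and strings cannot be added together:
# df["Date"] = df["Year"] + df["Month"] + df["Day"]
Instead, to_datetime can assemble a date from the Year, Month, and Day columns:
df["Date"] = pd.to_datetime(df)
| | Year | Month | Day | Date |
|---|---|---|---|---|
| 0 | 2022 | 1 | 10 | 2022-01-10 |
| 1 | 2021 | 3 | 12 | 2021-03-12 |
| 2 | 2022 | 5 | 28 | 2022-05-28 |
df.dtypes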
Year int64
Month int64
Day object
Date datetime64[ns]
dtype: object
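When a DataFrame has other columns besides the date parts, one cautious approach is to pass only the year, month, and day columns to to_datetime, so that unrelated columns never reach the date assembly. A minimal sketch, where the `Note` column and the `frame` and `date_parts` names are hypothetical:

```python
import pandas as pd

frame = pd.DataFrame({
    "Year": [2022, 2021, 2022],
    "Month": [1, 3, 5],
    "Day": ["10", "12", "28"],   # strings, as in the example above
    "Note": ["a", "b", "c"],     # hypothetical unrelated column
})

# Select only the columns that describe the date before assembling.
date_parts = frame[["Year", "Month", "Day"]]
frame["Date"] = pd.to_datetime(date_parts)

# The .dt accessor is available once the column holds datetimes.
formatted = frame["Date"].dt.strftime("%Y-%m-%d").tolist()
print(formatted)
```

Selecting the parts explicitly also makes the code safe to re-run after the `Date` column has been added.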
pd.to_datetime("10/2/21")  # default: month first
Timestamp('2021-10-02 00:00:00')
pd.to_datetime("10-2-21")  # default: month first
Timestamp('2021-10-02 00:00:00')
pd.to_datetime("10/2/21", dayfirst=True)
Timestamp('2021-02-10 00:00:00')
pd.to_datetime("10/2/21", yearfirst=True)
Timestamp('2010-02-21 00:00:00')
pd.to_datetime("22-01-21", dayfirst=True)
Timestamp('2021-01-22 00:00:00')
pd.to_datetime("22-01-21", yearfirst=True)
Timestamp('2022-01-21 00:00:00')
pd.to_datetime('20220107', format='%Y%m%d', errors='ignore')
Timestamp('2022-01-07 00:00:00')
pd.to_datetime('20220107112347', errors='ignore')
Timestamp('2022-01-07 11:23:47')
pd.to_datetime('20220107112233', format='%Y%m%d%H%M%S')
Timestamp('2022-01-07 11:22:33')
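The same parameters apply to a whole Series. A minimal sketch, assuming a hypothetical column of compact date strings with one bad entry: an explicit `format` removes the day-first versus month-first ambiguity, and `errors="coerce"` turns unparseable values into NaT instead of raising.

```python
import pandas as pd

# Hypothetical input with one value that is not a date.
raw = pd.Series(["20220107", "20211231", "not a date"])

parsed = pd.to_datetime(raw, format="%Y%m%d", errors="coerce")

# NaT marks the entries that could not be parsed.
bad_rows = raw[parsed.isna()].tolist()
print(parsed.dt.year.tolist(), bad_rows)
```

An explicit format is usually the safer default for machine-generated data, because guessing can read the same string differently depending on dayfirst and yearfirst.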
select_dtypes
select_dtypes returns the columns of a DataFrame that match the given dtypes.
df.dtypes
Year int64
Month int64
Day object
Date datetime64[ns]
dtype: object
df.select_dtypes(include=["int"])
| | Year | Month |
|---|---|---|
| 0 | 2022 | 1 |
| 1 | 2021 | 3 |
| 2 | 2022 | 5 |
df.select_dtypes(include=["object"])
| | Day |
|---|---|
| 0 | 10 |
| 1 | 12 |
| 2 | 28 |
df.select_dtypes(include=["O"])  # same result as above
| | Day |
|---|---|
| 0 | 10 |
| 1 | 12 |
| 2 | 28 |
df.select_dtypes(exclude=["object"])  # exclude object columns
| | Year | Month | Date |
|---|---|---|---|
| 0 | 2022 | 1 | 2022-01-10 |
| 1 | 2021 | 3 | 2021-03-12 |
| 2 | 2022 | 5 | 2022-05-28 |
df.select_dtypes(exclude=["object", "int"])  # exclude object and int columns
| | Date |
|---|---|
| 0 | 2022-01-10 |
| 1 | 2021-03-12 |
| 2 | 2022-05-28 |
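Besides concrete dtypes, select_dtypes also accepts generic selectors such as "number" and "datetime". A minimal sketch that rebuilds the example frame under the illustrative name `frame` and groups its column names by kind:

```python
import pandas as pd

frame = pd.DataFrame({
    "Year": [2022, 2021, 2022],
    "Month": [1, 3, 5],
    "Day": ["10", "12", "28"],
})
frame["Date"] = pd.to_datetime(frame[["Year", "Month", "Day"]])

# "number" matches every numeric dtype, "datetime" matches datetime64 columns.
numeric_cols = frame.select_dtypes(include=["number"]).columns.tolist()
datetime_cols = frame.select_dtypes(include=["datetime"]).columns.tolist()
other_cols = frame.select_dtypes(exclude=["number", "datetime"]).columns.tolist()

print(numeric_cols, datetime_cols, other_cols)
```

The generic selectors avoid depending on the exact integer width, which can differ between platforms.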