Python-for-data: Categorical Data (Category)
This article covers how to use categorical data (the `category` dtype) in pandas.
Categorical Data
import numpy as np
import pandas as pd
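Background and goals
A column often contains repeated values drawn from a small set of distinct values.
`unique()` and `value_counts()` extract the distinct values from an array and count how often each one occurs.
values = pd.Series(["apple","orange","apple","apple"] * 2)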
0 apple
1 orange
2 apple
3 apple
4 apple
5 orange
6 apple
7 apple
dtype: object
pd.unique(values)  # list the distinct values
array(['apple', 'orange'], dtype=object)
values.value_counts()  # count how often each value occurs
apple 6
orange 2
dtype: int64
Dimension tables
A dimension table holds the distinct values, and the main observations are stored as integer keys that reference it.
values = pd.Series([0,1,0,0] * 2)
values
0 0
1 1
2 0
3 0
4 0
5 1
6 0
7 0
dtype: int64
dim = pd.Series(["apple","orange"])  # the dimension table
dim
0 apple
1 orange
dtype: object
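The take method: categorical (dictionary-encoded) representation
The array of distinct values is called the categories, dictionary, or levels of the data.
dim.take(values)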
0 apple
1 orange
0 apple
0 apple
0 apple
1 orange
0 apple
0 apple
dtype: object
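The round trip between integer keys and their values can be sketched as follows. This is a minimal illustration, not part of the original walkthrough: `decoded`, `codes`, and `uniques` are names introduced here, and `pd.factorize` is brought in only to show the reverse direction.

```python
import pandas as pd

# Dimension table: the distinct values (the "categories")
dim = pd.Series(["apple", "orange"])
# Observations stored as integer keys into the dimension table
values = pd.Series([0, 1, 0, 0] * 2)

# take() looks up each integer key by position in the dimension table
decoded = dim.take(values)
print(decoded.tolist())

# The reverse direction: factorize() derives integer codes and the
# distinct values from the raw data
codes, uniques = pd.factorize(decoded)
print(codes.tolist(), list(uniques))
```

Storing the small integer keys instead of the repeated strings is the idea the `category` dtype builds on.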
Using the Categorical type
fruits = ["apple","orange","apple","apple"] * 2
df
| | basket_id | fruit | count | weight |
|---|---|---|---|---|
| 0 | 0 | apple | 14 | 0.569836 |
| 1 | 1 | orange | 12 | 1.239917 |
| 2 | 2 | apple | 13 | 2.587898 |
| 3 | 3 | apple | 10 | 2.768119 |
| 4 | 4 | apple | 6 | 3.867747 |
| 5 | 5 | orange | 8 | 0.194426 |
| 6 | 6 | apple | 12 | 2.686968 |
| 7 | 7 | apple | 9 | 0.113434 |
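The code that built `df` is not shown in full. A DataFrame with the same columns could be constructed roughly as below. This is a sketch under assumptions: the random generator, its seed, and the value ranges (counts from 3 to 14, weights from 0 to 4) are guesses based on the table, so the numbers will not match it.

```python
import numpy as np
import pandas as pd

fruits = ["apple", "orange", "apple", "apple"] * 2
N = len(fruits)

rng = np.random.default_rng(0)  # assumed seed, not taken from the article
df = pd.DataFrame(
    {
        "basket_id": np.arange(N),
        "fruit": fruits,
        "count": rng.integers(3, 15, size=N),    # assumed range: 3..14
        "weight": rng.uniform(0, 4, size=N),     # assumed range: [0, 4)
    }
)
print(df)
```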
df["fruit"]
0 apple
1 orange
2 apple
3 apple
4 apple
5 orange
6 apple
7 apple
Name: fruit, dtype: object
Creating a Categorical instance
fruit_cat = df["fruit"].astype("category")  # convert with astype
0 apple
1 orange
2 apple
3 apple
4 apple
5 orange
6 apple
7 apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]
c = fruit_cat.values
[apple, orange, apple, apple, apple, orange, apple, apple]
Categories (2, object): [apple, orange]
Two attributes: categories and codes
print(c.categories, "-----", c.codes, sep="\n")
Index(['apple', 'orange'], dtype='object')
-----
[0 1 0 0 0 1 0 0]
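How `categories` and `codes` fit together can be sketched as below. The example builds its own `Categorical` so it runs on its own; `restored` and `mapping` are names introduced here for illustration.

```python
import pandas as pd

c = pd.Categorical(["apple", "orange", "apple", "apple"] * 2)

# categories holds the distinct values; codes holds one small integer per element
print(c.categories.tolist())
print(c.codes.tolist())

# Looking up each code in categories rebuilds the original values
restored = c.categories.take(c.codes)
print(restored.tolist() == list(c))

# The code -> category mapping as a plain dict
mapping = dict(enumerate(c.categories))
print(mapping)
```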
# Convert a DataFrame column to categorical
df["fruit"] = df["fruit"].astype("category")
df.fruit
0 apple
1 orange
2 apple
3 apple
4 apple
5 orange
6 apple
7 apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]
Creating a pd.Categorical object from another sequence
my_categories = pd.Categorical(['foo','bar','baz','foo','bar'])
[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]
When the codes are already known: from_codes
categories = ["foo","bar","baz"]
codes = [0,1,0,0,1,0,1,0]
my_code = pd.Categorical.from_codes(codes, categories)
my_code
[foo, bar, foo, foo, bar, foo, bar, foo]
Categories (3, object): [foo, bar, baz]
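Specifying the category order explicitly: ordered=True
Unless an order is specified, the conversion produces an unordered categorical. We can specify the order explicitly.
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)  # pass the categories and mark them as ordered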
[foo, bar, foo, foo, bar, foo, bar, foo]
Categories (3, object): [foo < bar < baz]
Making an unordered instance ordered with as_ordered
# An unordered instance can be made ordered with as_ordered
my_categories.as_ordered()
[foo, bar, baz, foo, bar]
Categories (3, object): [bar < baz < foo]
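What an order buys you can be sketched as follows: with `ordered=True`, `min`/`max` and comparisons follow the declared category order rather than alphabetical order. The codes below are inferred from the output shown earlier, and `unordered` is a name introduced for this example.

```python
import pandas as pd

codes = [0, 1, 0, 0, 1, 0, 1, 0]
categories = ["foo", "bar", "baz"]
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)

# Declared order is foo < bar < baz, so "bar" is the largest value present
print(ordered_cat.min(), ordered_cat.max())  # foo bar

# Element-wise comparison against a category uses the same order
print((ordered_cat < "bar").tolist())

# as_ordered() returns an ordered copy of an unordered categorical
unordered = pd.Categorical.from_codes(codes, categories)
print(unordered.ordered, unordered.as_ordered().ordered)  # False True
```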
Computing with Categorical objects
np.random.seed(12345)  # set the random seed
draws = np.random.randn(1000)  # 1000 standard-normal draws
array([-0.20470766, 0.47894334, -0.51943872, -0.5557303 , 1.96578057])
The qcut() function: quartiles
# Compute quartile bins
bins = pd.qcut(draws, 4)
[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]
Naming the quartiles with labels
bins = pd.qcut(draws,4,labels=["Q1","Q2","Q3","Q4"])
[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]
bins.codes[:10]
array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)
Extracting summary statistics with groupby
bins = pd.Series(bins, name="quartile")
| | quartile | count | min | max |
|---|---|---|---|---|
| 0 | Q1 | 250 | -2.949343 | -0.685484 |
| 1 | Q2 | 250 | -0.683066 | -0.010115 |
| 2 | Q3 | 250 | -0.010032 | 0.628894 |
| 3 | Q4 | 250 | 0.634238 | 3.927528 |
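The step that produced `results` is not shown. One plausible reconstruction is sketched below, assuming `draws` came from `np.random.randn(1000)` after seeding with 12345. Passing `observed=True` is an addition here, meant to avoid a deprecation warning in recent pandas.

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
draws = np.random.randn(1000)  # assumed: 1000 standard-normal draws

# Label each draw with its quartile and name the Series so the
# grouped column is called "quartile"
bins = pd.Series(
    pd.qcut(draws, 4, labels=["Q1", "Q2", "Q3", "Q4"]), name="quartile"
)

# Group the raw draws by quartile and summarise each group
results = (
    pd.Series(draws)
    .groupby(bins, observed=True)
    .agg(["count", "min", "max"])
    .reset_index()
)
print(results)
```

Because `qcut` splits at the sample quantiles, each of the four groups holds 250 of the 1000 draws.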
results["quartile"]  # the original categorical information is preserved
0 Q1
1 Q2
2 Q3
3 Q4
Name: quartile, dtype: category
Categories (4, object): [Q1 < Q2 < Q3 < Q4]
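Better performance with categoricals
If you run a lot of analysis on a particular dataset, converting the data to categorical can improve performance and reduce memory use considerably.
N = 10000000
labels = pd.Series(["foo","bar","baz","qux"] * (N // 4))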
0 foo
1 bar
2 baz
3 qux
4 foo
...
9999995 qux
9999996 foo
9999997 bar
9999998 baz
9999999 qux
Length: 10000000, dtype: object
Converting to categorical
# Convert to categorical
categories = labels.astype("category")
0 foo
1 bar
2 baz
3 qux
4 foo
...
9999995 qux
9999996 foo
9999997 bar
9999998 baz
9999999 qux
Length: 10000000, dtype: category
Categories (4, object): [bar, baz, foo, qux]
Memory comparison
labels.memory_usage()
80000128
categories.memory_usage()
10000320
The cost of converting to categorical
%time _ = labels.astype("category")
CPU times: user 374 ms, sys: 34.8 ms, total: 409 ms
Wall time: 434 ms
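The memory comparison can be reproduced on a smaller scale with the sketch below. It uses 1,000,000 rows rather than the article's 10,000,000 so that it runs quickly, and it passes `deep=True`, which also counts the string payloads rather than only the per-element references. Exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

N = 1_000_000  # assumed smaller size, chosen only to keep the example fast
labels = pd.Series(["foo", "bar", "baz", "qux"] * (N // 4))
categories = labels.astype("category")

# deep=True includes the memory held by the strings themselves
obj_bytes = labels.memory_usage(deep=True)
cat_bytes = categories.memory_usage(deep=True)
print(obj_bytes, cat_bytes, round(obj_bytes / cat_bytes, 1))

# With only four categories, each element is stored as a one-byte code
print(categories.cat.codes.dtype)
```

The conversion itself is a one-time cost, so it tends to pay off when the same column is grouped or compared repeatedly.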
Categorical methods
s = pd.Series(["a","b","c","d"] * 2)
cat_s = s.astype("category")
0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
dtype: category
Categories (4, object): [a, b, c, d]
The cat attribute
The special `cat` accessor provides access to categorical attributes and methods, for example:
- codes
- categories
- set_categories
cat_s.cat.codes
0 0
1 1
2 2
3 3
4 0
5 1
6 2
7 3
dtype: int8
cat_s.cat.categories
Index(['a', 'b', 'c', 'd'], dtype='object')
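When the actual set of categories is larger than what appears in the data
actual_categories = ["a","b","c","d","e"]
cat_s2 = cat_s.cat.set_categories(actual_categories)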
0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
dtype: category
Categories (5, object): [a, b, c, d, e]
cat_s2.value_counts()
d 2
c 2
b 2
a 2
e 0
dtype: int64
Removing categories that do not appear in the data
cat_s3 = cat_s[cat_s.isin(["a","b"])]
0 a
1 b
4 a
5 b
dtype: category
Categories (4, object): [a, b, c, d]
# c and d no longer appear in the data, so drop them
cat_s3.cat.remove_unused_categories()
0 a
1 b
4 a
5 b
dtype: category
Categories (2, object): [a, b]
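The two operations above, widening the category set and trimming unused categories, can be sketched together as below. This is a self-contained illustration, and `trimmed` is a name introduced here.

```python
import pandas as pd

cat_s = pd.Series(["a", "b", "c", "d"] * 2, dtype="category")

# Widen the category set: "e" is a valid category that never occurs
cat_s2 = cat_s.cat.set_categories(["a", "b", "c", "d", "e"])
print(cat_s2.value_counts().sort_index().tolist())  # counts for a, b, c, d, e

# Filtering rows keeps the full category set...
cat_s3 = cat_s[cat_s.isin(["a", "b"])]
print(cat_s3.cat.categories.tolist())

# ...until the unused categories are removed explicitly
trimmed = cat_s3.cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())
```

Both methods return new objects, so `cat_s` and `cat_s3` themselves are left unchanged.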
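Creating dummy variables: get_dummies()
In machine learning and statistics, categorical data often has to be converted to dummy variables, also known as one-hot encoding.
cat_s = pd.Series(["a","b","c","d"] * 2, dtype="category")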
0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
dtype: category
Categories (4, object): [a, b, c, d]
pd.get_dummies(cat_s)
| | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 6 | 0 | 0 | 1 | 0 |
| 7 | 0 | 0 | 0 | 1 |
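A small sketch of the encoding and its inverse follows. In pandas 2.0 and later, `get_dummies` returns boolean columns by default, so `dtype=int` is passed here to get the 0/1 table shown above. `recovered` is a name introduced for this example.

```python
import pandas as pd

cat_s = pd.Series(["a", "b", "c", "d"] * 2, dtype="category")

# One column per category; dtype=int yields 0/1 instead of True/False
dummies = pd.get_dummies(cat_s, dtype=int)
print(dummies.head(4))

# Each row contains exactly one 1, so idxmax along the columns
# recovers the original label
recovered = dummies.idxmax(axis=1)
print(recovered.tolist())
```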