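This post records a requirement I ran into recently while processing data at work: filling in columns according to a specified rule. The data is simulated by me, but it resembles the real business data.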
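Simulated data
Description
Data
The DataFrame has two fields, time and userid, holding the date and the user's name respectively. Both fields contain duplicate values.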
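Requirements
Add three fields: 二十九 (the 29th), 三十 (the 30th), and 三十一 (the 31st). Their values must follow the rules below (the only allowed values are 0 and 1):
- If a person logged in on the 29th, the 二十九 field is set to 1 on all of that person's records; otherwise it is 0.
- The same rule applies to the 30th and the 31st.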
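The rule above makes each flag a property of the user rather than of the individual row. A minimal sketch of that reading on a tiny made-up frame; the frame `demo` and the helper `flag_for_day` are illustrative names, not part of the original post:

```python
import pandas as pd

# Tiny illustrative frame (hypothetical values, not the article's data).
demo = pd.DataFrame({
    "time": ["2020-05-28", "2020-05-29", "2020-05-30"],
    "userid": ["a", "a", "b"],
})

def flag_for_day(frame, day):
    """Return a 0/1 Series: 1 on every row of a user who has a record on `day`."""
    # Users that appear at least once on the given day.
    active_users = frame.loc[frame["time"] == day, "userid"]
    # Mark all rows of those users, whatever the row's own date is.
    return frame["userid"].isin(active_users).astype(int)

flag_29 = flag_for_day(demo, "2020-05-29")
print(flag_29.tolist())  # [1, 1, 0]
```

User `a` has a record on the 29th, so both of `a`'s rows get 1, including the one dated the 28th.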
Simulated data
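import numpy as np
import pandas as pd
df = pd.DataFrame({"time":["2020-05-28","2020-05-28","2020-05-28","2020-05-29","2020-05-29","2020-05-30","2020-05-30","2020-05-31","2020-05-31"],
                   "userid":["xiaoming","zhangsan","lisi","zhangsan","wangwu","lisi","zhoujun","wangwu","xiaoming"],
                   "二十九":np.nan, "三十":np.nan, "三十一":np.nan})  # flag columns start as NaN, as in the table below
df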
| | time | userid | 二十九 | 三十 | 三十一 |
---|---|---|---|---|---|
0 | 2020-05-28 | xiaoming | NaN | NaN | NaN |
1 | 2020-05-28 | zhangsan | NaN | NaN | NaN |
2 | 2020-05-28 | lisi | NaN | NaN | NaN |
3 | 2020-05-29 | zhangsan | NaN | NaN | NaN |
4 | 2020-05-29 | wangwu | NaN | NaN | NaN |
5 | 2020-05-30 | lisi | NaN | NaN | NaN |
6 | 2020-05-30 | zhoujun | NaN | NaN | NaN |
7 | 2020-05-31 | wangwu | NaN | NaN | NaN |
8 | 2020-05-31 | xiaoming | NaN | NaN | NaN |
Solution
for i in range(len(df)):

df
| | time | userid | 二十九 | 三十 | 三十一 |
---|---|---|---|---|---|
0 | 2020-05-28 | xiaoming | NaN | NaN | 1.0 |
1 | 2020-05-28 | zhangsan | 1.0 | NaN | NaN |
2 | 2020-05-28 | lisi | NaN | 1.0 | NaN |
3 | 2020-05-29 | zhangsan | 1.0 | NaN | NaN |
4 | 2020-05-29 | wangwu | 1.0 | NaN | 1.0 |
5 | 2020-05-30 | lisi | NaN | 1.0 | NaN |
6 | 2020-05-30 | zhoujun | NaN | 1.0 | NaN |
7 | 2020-05-31 | wangwu | 1.0 | NaN | 1.0 |
8 | 2020-05-31 | xiaoming | NaN | NaN | 1.0 |
The rows that belong to a single user, and their index labels, can be looked up with isin:
df1 = df[df['userid'].isin(["zhangsan"])]
df1.index
Int64Index([1, 3], dtype='int64')
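The body of the loop is not shown above, so the following is only a sketch of one loop that reproduces the intermediate table, assuming it uses the index lookup just demonstrated. The mapping `day_cols` and the variable `day` are names introduced here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "time": ["2020-05-28", "2020-05-28", "2020-05-28", "2020-05-29", "2020-05-29",
             "2020-05-30", "2020-05-30", "2020-05-31", "2020-05-31"],
    "userid": ["xiaoming", "zhangsan", "lisi", "zhangsan", "wangwu",
               "lisi", "zhoujun", "wangwu", "xiaoming"],
    "二十九": np.nan, "三十": np.nan, "三十一": np.nan,
})

# Assumed mapping from a date to the flag column it controls.
day_cols = {"2020-05-29": "二十九", "2020-05-30": "三十", "2020-05-31": "三十一"}

for i in range(len(df)):
    day = df.loc[i, "time"]
    if day in day_cols:
        # Select every row of the user seen on this row ...
        df1 = df[df["userid"].isin([df.loc[i, "userid"]])]
        # ... and set the flag for that day on all of them.
        df.loc[df1.index, day_cols[day]] = 1

print(df)
```

Rows dated the 28th never trigger an assignment themselves, but they still receive a 1 when the same user shows up on a later date, which matches the table above.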
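Other fields
The remaining empty values can simply be filled with 0 using the fillna method. Note that fillna returns a new DataFrame, so assign the result back if it should be kept.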
df.fillna(0)
| | time | userid | 二十九 | 三十 | 三十一 |
---|---|---|---|---|---|
0 | 2020-05-28 | xiaoming | 0.0 | 0.0 | 1.0 |
1 | 2020-05-28 | zhangsan | 1.0 | 0.0 | 0.0 |
2 | 2020-05-28 | lisi | 0.0 | 1.0 | 0.0 |
3 | 2020-05-29 | zhangsan | 1.0 | 0.0 | 0.0 |
4 | 2020-05-29 | wangwu | 1.0 | 0.0 | 1.0 |
5 | 2020-05-30 | lisi | 0.0 | 1.0 | 0.0 |
6 | 2020-05-30 | zhoujun | 0.0 | 1.0 | 0.0 |
7 | 2020-05-31 | wangwu | 1.0 | 0.0 | 1.0 |
8 | 2020-05-31 | xiaoming | 0.0 | 0.0 | 1.0 |
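The result above holds floats (0.0 and 1.0) because the columns started as NaN. As an alternative sketch, not the approach used in this post, the same flags can be built without a row loop by counting records per user and date with `pd.crosstab` and joining the counts back. The names `day_cols`, `flags`, and `result` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["2020-05-28", "2020-05-28", "2020-05-28", "2020-05-29", "2020-05-29",
             "2020-05-30", "2020-05-30", "2020-05-31", "2020-05-31"],
    "userid": ["xiaoming", "zhangsan", "lisi", "zhangsan", "wangwu",
               "lisi", "zhoujun", "wangwu", "xiaoming"],
})

# Assumed mapping from a date to the flag column it controls.
day_cols = {"2020-05-29": "二十九", "2020-05-30": "三十", "2020-05-31": "三十一"}

# One row per user, one column per date, values = number of records.
counts = pd.crosstab(df["userid"], df["time"])

# Cap counts at 1, keep only the three dates of interest, rename to the flag names.
flags = counts.clip(upper=1)[list(day_cols)].rename(columns=day_cols).reset_index()

# Attach each user's flags to every one of that user's rows.
result = df.merge(flags, on="userid", how="left")
print(result)
```

This version yields integer 0/1 columns directly, so no fillna step is needed. It assumes every date in `day_cols` occurs at least once in the data; otherwise the column selection would raise a KeyError.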