Pandas求解差集、交集、并集
本文讲解的是如何利用Pandas函数求解两个DataFrame的差集、交集、并集。
模拟数据
模拟一份简单的数据:
In [1]:
1 | import pandas as pd |
In [2]:
1 | df1 = pd.DataFrame({"col1":[1,2,3,4,5], |
In [3]:
1 | df1 |
Out[3]:
col1 | col2 | |
---|---|---|
0 | 1 | 6 |
1 | 2 | 7 |
2 | 3 | 8 |
3 | 4 | 9 |
4 | 5 | 10 |
In [4]:
1 | df2 |
Out[4]:
col1 | col2 | |
---|---|---|
0 | 1 | 6 |
1 | 3 | 8 |
2 | 7 | 10 |
两个DataFrame的相同部分:
差集
方法1:concat + drop_duplicates
In [5]:
1 | df3 = pd.concat([df1,df2]) |
Out[5]:
col1 | col2 | |
---|---|---|
0 | 1 | 6 |
1 | 2 | 7 |
2 | 3 | 8 |
3 | 4 | 9 |
4 | 5 | 10 |
0 | 1 | 6 |
1 | 3 | 8 |
2 | 7 | 10 |
In [6]:
1 | # 结果1 |
Out[6]:
col1 | col2 | |
---|---|---|
1 | 2 | 7 |
3 | 4 | 9 |
4 | 5 | 10 |
2 | 7 | 10 |
方法2:append + drop_duplicates
In [7]:
1 | df4 = df1.append(df2) |
Out[7]:
col1 | col2 | |
---|---|---|
0 | 1 | 6 |
1 | 2 | 7 |
2 | 3 | 8 |
3 | 4 | 9 |
4 | 5 | 10 |
0 | 1 | 6 |
1 | 3 | 8 |
2 | 7 | 10 |
In [8]:
1 | # 结果2 |
Out[8]:
col1 | col2 | |
---|---|---|
1 | 2 | 7 |
3 | 4 | 9 |
4 | 5 | 10 |
2 | 7 | 10 |
交集
方法1:merge
In [9]:
1 | # 结果 |
Out[9]:
col1 | col2 | |
---|---|---|
0 | 1 | 6 |
1 | 3 | 8 |
方法2:concat + duplicated + loc
In [10]:
1 | df6 = pd.concat([df1,df2]) |
Out[10]:
col1 | col2 | |
---|---|---|
0 | 1 | 6 |
1 | 2 | 7 |
2 | 3 | 8 |
3 | 4 | 9 |
4 | 5 | 10 |
0 | 1 | 6 |
1 | 3 | 8 |
2 | 7 | 10 |
In [11]:
1 | s = df6.duplicated(subset=['col1','col2'], keep='first') |
Out[11]:
1 | 0 False |
In [12]:
1 | # 结果 |
Out[12]:
col1 | col2 | |
---|---|---|
0 | 1 | 6 |
1 | 3 | 8 |
方法3:concat + groupby + query
In [13]:
1 |
|
Out[13]:
col1 | col2 | |
---|---|---|
0 | 1 | 6 |
1 | 2 | 7 |
2 | 3 | 8 |
3 | 4 | 9 |
4 | 5 | 10 |
0 | 1 | 6 |
1 | 3 | 8 |
2 | 7 | 10 |
In [14]:
1 | df9 = df6.groupby(["col1", "col2"]).size().reset_index() |
Out[14]:
col1 | col2 | count | |
---|---|---|---|
0 | 1 | 6 | 2 |
1 | 2 | 7 | 1 |
2 | 3 | 8 | 2 |
3 | 4 | 9 | 1 |
4 | 5 | 10 | 1 |
5 | 7 | 10 | 1 |
In [15]:
1 | df10 = df9.query("count > 1")[["col1", "col2"]] |
Out[15]:
col1 | col2 | |
---|---|---|
0 | 1 | 6 |
2 | 3 | 8 |
并集
方法1:concat + drop_duplicates
In [16]:
1 | df11 = pd.concat([df1,df2]) |
Out[16]:
col1 | col2 | |
---|---|---|
0 | 1 | 6 |
1 | 2 | 7 |
2 | 3 | 8 |
3 | 4 | 9 |
4 | 5 | 10 |
0 | 1 | 6 |
1 | 3 | 8 |
2 | 7 | 10 |
In [17]:
1 | # 结果 |
Out[17]:
col1 | col2 | |
---|---|---|
0 | 1 | 6 |
1 | 2 | 7 |
2 | 3 | 8 |
3 | 4 | 9 |
4 | 5 | 10 |
2 | 7 | 10 |
方法2:append + drop_duplicates
In [18]:
1 | df13 = df1.append(df2) |
Out[18]:
col1 | col2 | |
---|---|---|
0 | 1 | 6 |
1 | 2 | 7 |
2 | 3 | 8 |
3 | 4 | 9 |
4 | 5 | 10 |
2 | 7 | 10 |
方法3:merge
In [19]:
1 | pd.merge(df1,df2,how="outer") |
Out[19]:
col1 | col2 | |
---|---|---|
0 | 1 | 6 |
1 | 2 | 7 |
2 | 3 | 8 |
3 | 4 | 9 |
4 | 5 | 10 |
5 | 7 | 10 |