
Kaggle in Practice: Customer Personality Analysis and Segmentation

This post implements customer personality analysis and customer segmentation on a supermarket purchase dataset, analyzing it along four dimensions: people, products, promotion, and place.

Original dataset:

https://www.kaggle.com/code/sonalisingh1411/customer-personality-analysis-segmentation/data?select=marketing_campaign.csv

Data overview

In [1]:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
import matplotlib

import plotly_express as px

from datetime import date
import warnings

warnings.filterwarnings("ignore")
pd.set_option("display.max_columns",None)
pd.set_option("display.max_rows",None)

In [2]:

List all files in the current directory

# List all files in the current directory
import os

for dirname, _, filenames in os.walk("./"):
    for file in filenames:
        print(os.path.join(dirname, file))

./kaggle实战-商场用户个性化分析及细分.md
./kaggle-用户个性化分析与分群.ipynb
./marketing_campaign.csv

Read the data

In [3]:

df = pd.read_csv("marketing_campaign.csv", sep="\t")
df.head(3)

df.head().style.background_gradient(cmap="YlGnBu_r")

Basic information

In [5]:

df.shape  # number of rows and columns

Out[5]:

(2240, 29)

In [6]:

df.columns  # column names

Out[6]:

Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response'],
      dtype='object')

In [7]:

pd.value_counts(df.dtypes)  # number of columns of each dtype

Out[7]:

int64      25
object      3
float64     1
dtype: int64

In [8]:

df.info()  # DataFrame info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2240 non-null   int64
 1   Year_Birth           2240 non-null   int64
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64
 6   Teenhome             2240 non-null   int64
 7   Dt_Customer          2240 non-null   object
 8   Recency              2240 non-null   int64
 9   MntWines             2240 non-null   int64
 10  MntFruits            2240 non-null   int64
 11  MntMeatProducts      2240 non-null   int64
 12  MntFishProducts      2240 non-null   int64
 13  MntSweetProducts     2240 non-null   int64
 14  MntGoldProds         2240 non-null   int64
 15  NumDealsPurchases    2240 non-null   int64
 16  NumWebPurchases      2240 non-null   int64
 17  NumCatalogPurchases  2240 non-null   int64
 18  NumStorePurchases    2240 non-null   int64
 19  NumWebVisitsMonth    2240 non-null   int64
 20  AcceptedCmp3         2240 non-null   int64
 21  AcceptedCmp4         2240 non-null   int64
 22  AcceptedCmp5         2240 non-null   int64
 23  AcceptedCmp1         2240 non-null   int64
 24  AcceptedCmp2         2240 non-null   int64
 25  Complain             2240 non-null   int64
 26  Z_CostContact        2240 non-null   int64
 27  Z_Revenue            2240 non-null   int64
 28  Response             2240 non-null   int64
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB

Summary statistics

In [9]:

# Descriptive statistics
df.describe().style.background_gradient(cmap="ocean_r")

Out[9]:

Unique values per column

In [10]:

# Number of unique values in each column
df.nunique()

Out[10]:

ID                     2240
Year_Birth               59
Education                 5
Marital_Status            8
Income                 1974
Kidhome                   3
Teenhome                  3
Dt_Customer             663
Recency                 100
MntWines                776
MntFruits               158
MntMeatProducts         558
MntFishProducts         182
MntSweetProducts        177
MntGoldProds            213
NumDealsPurchases        15
NumWebPurchases          15
NumCatalogPurchases      14
NumStorePurchases        14
NumWebVisitsMonth        16
AcceptedCmp3              2
AcceptedCmp4              2
AcceptedCmp5              2
AcceptedCmp1              2
AcceptedCmp2              2
Complain                  2
Z_CostContact             1
Z_Revenue                 1
Response                  2
dtype: int64

We can see that Z_CostContact and Z_Revenue each take only a single value, so they contribute nothing to the analysis or modeling and can be dropped.
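As a quick programmatic check (a sketch, not part of the original notebook), constant columns can also be listed directly:

# Columns with a single unique value across the whole dataset
constant_cols = df.columns[df.nunique() <= 1].tolist()
print(constant_cols)  # expected here: ['Z_CostContact', 'Z_Revenue']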

In [11]:

df.drop(["Z_CostContact","Z_Revenue"],axis=1,inplace=True)

Handling missing values

Missing-value overview

In [12]:

df.isnull().sum()  # missing values per column

Out[12]:

ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Response                0
dtype: int64

Missing-value counts and percentages

In [13]:

total = df.isnull().sum().sort_values(ascending=False)
Percentage = total / len(df)

In [14]:

missing_data = pd.DataFrame({"Total": total,
                             "Percentage": Percentage})
missing_data

Out[14]:

Visualizing missing values

In [16]:

import missingno as mn
mn.matrix(df, color=(0,0,0))

plt.show()

Filling missing values with the median

In [17]:

# Fill missing Income values with the median of the existing values
df['Income'] = df['Income'].fillna(df['Income'].median())

df.isnull().sum()

Out[17]:

ID                     0
Year_Birth             0
Education              0
Marital_Status         0
Income                 0
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
MntWines               0
MntFruits              0
MntMeatProducts        0
MntFishProducts        0
MntSweetProducts       0
MntGoldProds           0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Complain               0
Response               0
dtype: int64

People

Field 1: Year_Birth

In [19]:

year_birth = df["Year_Birth"].value_counts().reset_index()
year_birth.columns = ["year_birth", "number"]
year_birth.head(10)

Out[19]:

   year_birth  number
0        1976      89
1        1971      87
2        1975      83
3        1972      79
4        1978      77
5        1970      77
6        1973      74
7        1965      74
8        1969      71
9        1974      69

In [20]:

px.bar(year_birth,x="year_birth",y="number",color="number")

The chart above shows that customers are concentrated among birth years 1970-1975.
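The share of customers born in that window can be checked directly (a quick sketch over the same df):

# Fraction of customers born between 1970 and 1975 (inclusive)
df["Year_Birth"].between(1970, 1975).mean()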

Field 2: Education

Customers' education levels

In [21]:

education = df["Education"].value_counts().reset_index()
education.columns = ["Education","Number"]
education

Out[21]:

    Education  Number
0  Graduation    1127
1         PhD     486
2      Master     370
3    2n Cycle     203
4       Basic      54

In [22]:

fig = px.pie(education, names="Education", values="Number")

fig.update_traces(
    textposition='inside',  # text position: 'inside', 'outside', 'auto', 'none'
    textinfo='percent+label'
)

fig.show()

The same distribution as a bar chart:

In [23]:

plt.figure(figsize=(8,8))

df["Education"].value_counts().plot(kind="bar", color="turquoise", edgecolor="black", linewidth=3)

plt.title("Frequency Of Each Category in the Education Variable", fontsize=12)

plt.show()

From the pie and bar charts we can observe (the exact shares can be verified with the snippet below):

  • Graduation accounts for roughly half of the customers
  • PhD and Master together make up close to 38%, so highly educated customers are also a sizeable group
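A quick sketch that prints the percentage share of each education level:

# Percentage share of each education level
(df["Education"].value_counts(normalize=True) * 100).round(1)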

Field 3: Marital_Status

Analysis of marital status

In [24]:

ms = df['Marital_Status'].value_counts().reset_index()

ms

Out[24]:

      index  Marital_Status
0   Married             864
1  Together             580
2    Single             480
3  Divorced             232
4     Widow              77
5     Alone               3
6    Absurd               2
7      YOLO               2

In [25]:

fig = px.pie(ms, names="index", values="Marital_Status")

fig.update_traces(
    textposition='inside',
    textinfo='percent+label'
)

fig.show()

Field 4: Income

Analyze the income distribution and its outliers, mainly with violin and box plots.

In [26]:

px.violin(df,y="Income")

The extreme point of roughly 666k income can be treated as an outlier.
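The original notebook keeps this record; if one wanted to exclude it, a minimal filter could look like the sketch below (the 600,000 cut-off and the df_no_outlier name are illustrative assumptions, not taken from the source):

# Illustrative only: keep rows below an assumed income threshold
df_no_outlier = df[df["Income"] < 600_000]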

Comparing income distributions across education levels (see the sketch after this list):

  1. The outlier falls within the Graduation group
  2. Excluding the outlier, among Graduation, PhD and Master, the PhD group's mean income is slightly higher
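The grouped plot behind this comparison is not reproduced above; a minimal plotly sketch over the same df could be:

# Income distribution per education level; box=True overlays a box plot inside each violin
px.violin(df, x="Education", y="Income", color="Education", box=True)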

Histogram view:

plt.figure(figsize=(8,8))

sns.distplot(df["Income"],color = 'turquoise')

plt.show()

Box-plot view:

df["Income"].plot.box(figsize=(8,8),color = 'turquoise')
plt.show()

Field 5: Kidhome and Teenhome

In [30]:

df["Kidhome"].value_counts()

Out[30]:

0    1293
1     899
2      48
Name: Kidhome, dtype: int64

In [31]:

df["Teenhome"].value_counts()

Out[31]:

0    1158
1    1030
2      52
Name: Teenhome, dtype: int64

In [32]:

# Create a new field: kids (total children at home)
df["kids"] = df["Kidhome"] + df["Teenhome"]

Roughly half of the customers have exactly one child at home.
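A quick tabulation of the new field is an easy way to check this claim (a sketch):

# Distribution of the combined kids field (Kidhome + Teenhome)
df["kids"].value_counts()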

PRODUCTS

The main fields are:

  1. MntWines: Amount spent on wine in last 2 years.
  2. MntFruits: Amount spent on fruits in last 2 years.
  3. MntMeatProducts: Amount spent on meat in last 2 years.
  4. MntFishProducts: Amount spent on fish in last 2 years.
  5. MntSweetProducts: Amount spent on sweets in last 2 years.
  6. MntGoldProds: Amount spent on gold in last 2 years.

Use violin plots to inspect each field's distribution and spot outliers:

MntWines

In [35]:

px.violin(y=df["MntWines"])

MntFruits

MntMeatProducts

The other product fields follow the same pattern and are omitted here (a loop sketch covering them is given below). Next, create a new field:
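A minimal sketch (using plotly express as above) that draws the same violin plot for every product-spend column:

# Violin plots for all product-spend columns in one loop
mnt_cols = ["MntWines", "MntFruits", "MntMeatProducts",
            "MntFishProducts", "MntSweetProducts", "MntGoldProds"]
for col in mnt_cols:
    px.violin(df, y=col).show()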

Total spending

In [40]:

df['Expenses'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']
df['Expenses'].head(10)

Out[40]:

0    1617
1      27
2     776
3      53
4     422
5     716
6     590
7     169
8      46
9      49
Name: Expenses, dtype: int64

In [41]:

px.violin(y=df["Expenses"])

Total spending by education level:

Spending by marital status (married, single, etc.):
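The code for these two comparison plots is not included above; a minimal plotly sketch over the same df (illustrative only) might be:

# Total spending (Expenses) broken down by education level and by marital status
px.box(df, x="Education", y="Expenses", color="Education").show()
px.box(df, x="Marital_Status", y="Expenses", color="Marital_Status").show()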

plt.figure(figsize=(8,8))

sns.distplot(df["Expenses"],color = 'turquoise')

plt.show()

PROMOTION

  1. NumDealsPurchases: Number of purchases made with a discount.
  2. AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise.
  3. AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise.
  4. AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise.
  5. AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise.
  6. AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise.
  7. Response: 1 if customer accepted the offer in the last campaign, 0 otherwise.

Analysis of the promotion campaigns:

In [45]:

df['AcceptedCmp1'].unique()

Out[45]:

array([0, 1])

In [46]:

df['AcceptedCmp2'].unique()

Out[46]:

array([0, 1])

In [47]:

# New field: total number of accepted campaign offers
df['TotalAcceptedCmp'] = df['AcceptedCmp1'] + df['AcceptedCmp2'] + df['AcceptedCmp3'] + df['AcceptedCmp4'] + df['AcceptedCmp5']
df.head()

plt.figure(figsize=(8,8))

df['TotalAcceptedCmp'].value_counts().plot(kind='bar', color='turquoise', edgecolor="black", linewidth=3)

plt.title("Frequency Of TotalAcceptedCmp", fontsize=12)

plt.show()

PLACE

  • NumWebPurchases: Number of purchases made through the company’s web site.
  • NumCatalogPurchases: Number of purchases made using a catalogue.
  • NumStorePurchases: Number of purchases made directly in stores.
  • NumWebVisitsMonth: Number of visits to company’s web site in the last month.

In [50]:

df['NumTotalPurchases'] = df['NumWebPurchases'] + df['NumCatalogPurchases'] + df['NumStorePurchases'] + df['NumDealsPurchases']

df['NumTotalPurchases'].unique()

Out[50]:

array([25,  6, 21,  8, 19, 22, 10,  2,  4, 16, 15,  5, 26,  9, 13, 12, 43,
       17, 20, 14, 27, 11, 18, 28,  7, 24, 29, 23, 32, 30, 37, 31, 33, 35,
       39,  1, 34,  0, 44])

In [51]:

px.violin(y=df["NumTotalPurchases"])

plt.figure(figsize=(8,8))

sns.distplot(df["NumTotalPurchases"],color = 'turquoise')

plt.show()

After the analysis, drop the original columns and keep the newly created ones:

In [53]:

col_del = ["ID", "AcceptedCmp1", "AcceptedCmp2",
           "AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5",
           "NumWebVisitsMonth", "NumWebPurchases",
           "NumCatalogPurchases", "NumStorePurchases",
           "NumDealsPurchases", "Kidhome", "Teenhome",
           "MntWines", "MntFruits", "MntMeatProducts",
           "MntFishProducts", "MntSweetProducts", "MntGoldProds"]

df = df.drop(columns=col_del, axis=1)

In [54]:

df.head(10).style.background_gradient(cmap='crest_r')

The remaining columns:

In [55]:

columns = df.columns
columns

Out[55]:

Index(['Year_Birth', 'Education', 'Marital_Status', 'Income', 'Dt_Customer',
       'Recency', 'Complain', 'Response', 'kids', 'Expenses',
       'TotalAcceptedCmp', 'NumTotalPurchases'],
      dtype='object')

Date field processing

Date parsing

In [56]:

# Dt_Customer: date of the customer's enrollment with the company
df['Dt_Customer'] = pd.to_datetime(df.Dt_Customer)

In [57]:

df.dtypes

Out[57]:

Year_Birth                    int64
Education                    object
Marital_Status               object
Income                      float64
Dt_Customer          datetime64[ns]
Recency                       int64
Complain                      int64
Response                      int64
kids                          int64
Expenses                      int64
TotalAcceptedCmp              int64
NumTotalPurchases             int64
dtype: object

In [58]:

df["Dt_Customer"].value_counts().sort_index(ascending=False).head()  # latest enrollment date: 2014-12-06

Out[58]:

2014-12-06     1
2014-12-05    11
2014-12-04     2
2014-12-03     5
2014-12-02     3
Name: Dt_Customer, dtype: int64

Set a reference date and compute, for each customer, the number of days between enrollment and that date:

In [59]:

df['First_day'] = '01-01-2015'
df['First_day'] = pd.to_datetime(df.First_day)

df['Day_engaged'] = (df['First_day'] - df['Dt_Customer']).dt.days

df.head()

Customer tenure (years since enrollment)

In [60]:

# Number of years from enrollment until now
df['Age'] = pd.Timestamp('now').year - pd.to_datetime(df['Dt_Customer']).dt.year

plt.figure(figsize=(8,8))

df['Age'].value_counts().plot(kind='bar',
                              color='turquoise',
                              edgecolor="black",
                              linewidth=3)

plt.title("Frequency Of Each Category in the Age", fontsize=12)

plt.show()

Drop columns that are no longer needed:

In [61]:

df.drop(columns=["Dt_Customer", "First_day", "Year_Birth", "Recency", "Complain","Response"],axis=1,inplace=True)

Bivariate Analysis

Education vs Expenses

In [62]:

pd.crosstab(df['Education'],df['Expenses'],margins=True).style.background_gradient(cmap='nipy_spectral_r')

Out[62]:

sns.set_theme(style="white")

plt.figure(figsize=(8,8))

plt.title("How does Education impact Expenses?", fontsize=15)

ax = sns.barplot(x="Education",
                 y="Expenses",
                 data=df,
                 palette="rainbow")

Marital status vs Expenses
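No code is shown for this pairing in the text above; a sketch in the same style as the Education barplot might be:

# Hypothetical sketch: mean Expenses per Marital_Status category
plt.figure(figsize=(8, 8))
plt.title("How does Marital_Status impact Expenses?", fontsize=15)
ax = sns.barplot(x="Marital_Status", y="Expenses", data=df, palette="rainbow")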

Kids vs Expenses

In [66]:

pd.crosstab(df['kids'],df['Expenses'],margins=True).style.background_gradient(cmap='BuPu_r')

Out[66]:

Day_engaged vs Expenses

In [68]:

pd.crosstab(df['Day_engaged'],df['Expenses'],margins=True).head(10).style.background_gradient(cmap='Oranges')

Out[68]:

sns.set_theme(style="white")

plt.figure(figsize=(18,8))

plt.title("How does Day_engaged impact Expenses?", fontsize=15)

ax = sns.barplot(x="Day_engaged", y="Expenses", data=df, palette="rainbow")

NumTotalPurchases vs Expenses

In [70]:

pd.crosstab(df['NumTotalPurchases'],df['Expenses'],margins=True).head().style.background_gradient(cmap='Blues')

Modeling

Correlation analysis

In [72]:

df.describe(include = 'all').style.background_gradient(cmap='RdPu_r')

plt.figure(figsize=(10,8))

sns.heatmap(df.corr(),
            annot=True,
            cmap='RdYlBu_r',
            linewidths=1)

Correlation of features with the target variable

In [75]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Education          2240 non-null   object
 1   Marital_Status     2240 non-null   object
 2   Income             2240 non-null   float64
 3   kids               2240 non-null   int64
 4   Expenses           2240 non-null   int64
 5   TotalAcceptedCmp   2240 non-null   int64
 6   NumTotalPurchases  2240 non-null   int64
 7   Day_engaged        2240 non-null   int64
 8   Age                2240 non-null   int64
dtypes: float64(1), int64(6), object(2)
memory usage: 157.6+ KB

In [76]:

cont_features = df.iloc[:, 2:]
cont_features.head()

import matplotlib

background_color = "#f6f5f5"

fig = plt.figure(figsize=(16, 8),
                 facecolor=background_color)
gs = fig.add_gridspec(1, 1)

ax0 = fig.add_subplot(gs[0, 0])
colors = ["#2f5586", "#f6f5f5", "#2f5586"]

colormap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors)

ax0.set_facecolor(background_color)
ax0.text(-1.1, 1.25, 'Correlation of Continuous Features with Target', fontsize=20, fontweight='bold')

# Correlation coefficient of each feature with the target variable Expenses
chart_df = pd.DataFrame(cont_features.corrwith(df['Expenses']))

chart_df.columns = ['corr']
sns.barplot(x=chart_df.index,
            y=chart_df['corr'],
            ax=ax0,
            color='turquoise',
            zorder=3,
            edgecolor='black',
            linewidth=3)

ax0.grid(which='major',
         axis='x',
         zorder=0,
         color='#EEEEEE',
         linewidth=0.4)

ax0.grid(which='major',
         axis='y',
         zorder=0,
         color='#EEEEEE',
         linewidth=0.4)

ax0.set_ylabel('')

for s in ["top", "right", 'left']:
    ax0.spines[s].set_visible(False)

plt.show()

The bar chart shows that only kids is negatively correlated with the target variable.
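The same correlations can be read off numerically with the objects already defined above (a quick sketch):

# Correlation of each continuous feature with Expenses, sorted ascending
cont_features.corrwith(df["Expenses"]).sort_values()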

Group columns by dtype

In [78]:

cate = []
Int = []
Float = []

for i in df.columns:
    if df[i].dtypes == "object":
        cate.append(i)
    elif df[i].dtypes == "int64":
        Int.append(i)
    elif df[i].dtypes in ("float32", "float64"):
        Float.append(i)

Label encoding for categorical fields

In [79]:

from sklearn.preprocessing import LabelEncoder

lbl_encode = LabelEncoder()

for i in cate:
    df[i] = df[[i]].apply(lbl_encode.fit_transform)

Feature standardization

In [80]:

df1 = df.copy()

from sklearn.preprocessing import StandardScaler

ssf = StandardScaler().fit_transform(df1.values)

sf_df = pd.DataFrame(ssf,
                     index=df1.index,
                     columns=df1.columns)
sf_df.head()

Extract the feature matrix X

From the scaled data, take the two columns used for clustering: Income and Expenses (columns 2 and 4 of sf_df).

In [81]:

X = sf_df.iloc[:, [2, 4]].values
X

Out[81]:

array([[ 0.23569584,  1.67941681],
       [-0.23545419, -0.96127545],
       [ 0.77399892,  0.28267329],
       ...,
       [ 0.18947568,  1.05495122],
       [ 0.67940139,  0.39394775],
       [ 0.02520835, -0.7204576 ]])

Customer segmentation: the elbow method

Use an elbow plot to choose the number of clusters K.

In [82]:

from sklearn.cluster import KMeans

wcss = []

for i in range(1, 18):
    kmeans = KMeans(n_clusters=i,
                    init='k-means++',
                    random_state=50)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 18),
         wcss,
         color="turquoise",
         marker='*')

plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

We finally choose k = 5.

# Use k = 5
kmeans = KMeans(n_clusters=5,
                init='k-means++',
                random_state=50)
y = kmeans.fit_predict(X)
y  # predicted cluster labels, 5 clusters in total

Visualizing the clusters

In [84]:

plt.figure(figsize=(15,8))
plt.scatter(X[y == 0, 0], X[y == 0, 1],
            s=25, c='mediumblue',
            label='one')
plt.scatter(X[y == 1, 0], X[y == 1, 1],
            s=25, c='turquoise',
            label='two')
plt.scatter(X[y == 2, 0], X[y == 2, 1],
            s=25, c='red',
            label='three')
plt.scatter(X[y == 3, 0], X[y == 3, 1],
            s=25, c='green',
            label='four')
plt.scatter(X[y == 4, 0], X[y == 4, 1],
            s=25, c='yellow',
            label='five')

plt.scatter(kmeans.cluster_centers_[:, 0],
            kmeans.cluster_centers_[:, 1],
            s=55, c='black',
            label='Centroids')

plt.title('Clusters of customers', fontsize=20)
plt.xlabel('Income', fontsize=15)
plt.ylabel('Expenses', fontsize=15)

plt.legend(fontsize=15)
plt.show()

Plotly-based visualization

In [87]:

result = pd.DataFrame(X,columns=["Income","Expenses"])
result.head()

Out[87]:

     Income  Expenses
0  0.235696  1.679417
1 -0.235454 -0.961275
2  0.773999  0.282673
3 -1.022355 -0.918094
4  0.241888 -0.305254

In [88]:

result["Label"] = y
result.head()

Out[88]:

     Income  Expenses  Label
0  0.235696  1.679417      1
1 -0.235454 -0.961275      0
2  0.773999  0.282673      4
3 -1.022355 -0.918094      2
4  0.241888 -0.305254      0

In [89]:

fig = px.scatter(result,
                 x="Income",
                 y="Expenses",
                 color="Label")
fig.show()

Title: Kaggle in Practice: Customer Personality Analysis and Segmentation

Published: 2022-08-13 10:08

Original link: http://www.renpeter.cn/2022/08/13/kaggle%E5%AE%9E%E6%88%98-%E7%94%A8%E6%88%B7%E4%B8%AA%E6%80%A7%E5%8C%96%E5%88%86%E6%9E%90%E5%8F%8A%E5%88%86%E7%BE%A4.html

License: CC BY-NC-ND 4.0 International. Please keep the original link and credit the author when reposting.
