Kaggle in practice: customer personality analysis and segmentation
Customer personality analysis and customer segmentation built on a supermarket marketing dataset; the analysis works through four dimensions: People, Products, Promotion, and Place.
Original dataset:
Dataset overview
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
In [2]:
List all files in the current directory
# list all files in the current directory
import os
os.listdir(".")
Read the file
In [3]:
df = pd.read_csv("marketing_campaign.csv", sep="\t")
Basic information
In [5]:
df.shape  # size of the dataset
Out[5]:
(2240, 29)
In [6]:
df.columns
Out[6]:
Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
In [7]:
df.dtypes.value_counts()  # how many columns of each dtype
Out[7]:
int64    25
In [8]:
df.info()  # concise DataFrame summary
Summary statistics
In [9]:
df.describe()  # descriptive statistics
Out[9]:
Unique values per column
In [10]:
df.nunique()
Out[10]:
ID    2240
We find that Z_CostContact and Z_Revenue each take only a single value, so they contribute nothing to the analysis or modeling and can be dropped directly.
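Constant columns like these can also be found programmatically rather than by eye; a minimal sketch (the toy frame below is made up, only the column names come from the dataset):

```python
import pandas as pd

# toy frame standing in for the marketing data
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Z_CostContact": [3, 3, 3],   # constant
    "Z_Revenue": [11, 11, 11],    # constant
    "Income": [50, 60, 70],
})

# a column with exactly one unique value carries no information
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
```

The same one-liner generalizes to any DataFrame, so it is a safer habit than scanning the `nunique()` output by hand.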
In [11]:
df.drop(["Z_CostContact", "Z_Revenue"], axis=1, inplace=True)
Missing values
Missing-value overview
In [12]:
df.isnull().sum()
Out[12]:
ID    0
Missing-value counts and proportions
In [13]:
total = df.isnull().sum().sort_values(ascending=False)
In [14]:
percent = (df.isnull().sum() / len(df)).sort_values(ascending=False)
missing_data = pd.DataFrame({"Total": total, "Percent": percent})
missing_data.head()
Out[14]:
Visualizing missing values
In [16]:
import missingno as mn
mn.matrix(df)  # matrix plot of missingness
Filling missing values with the median
In [17]:
# fill missing Income values with the median of the existing values
df["Income"].fillna(df["Income"].median(), inplace=True)
df.isnull().sum()
Out[17]:
ID    0
People
Field 1: Year_Birth
In [19]:
year_birth = df["Year_Birth"].value_counts().reset_index()
year_birth.columns = ["year_birth", "number"]
year_birth.head(10)
Out[19]:
| | year_birth | number |
|---|---|---|
| 0 | 1976 | 89 |
| 1 | 1971 | 87 |
| 2 | 1975 | 83 |
| 3 | 1972 | 79 |
| 4 | 1978 | 77 |
| 5 | 1970 | 77 |
| 6 | 1973 | 74 |
| 7 | 1965 | 74 |
| 8 | 1969 | 71 |
| 9 | 1974 | 69 |
In [20]:
px.bar(year_birth, x="year_birth", y="number", color="number")
The chart shows that customers are concentrated in the 1970-1975 birth years.
Field 2: Education
Distribution of customers' education levels
In [21]:
education = df["Education"].value_counts().reset_index()
education.columns = ["Education", "Number"]
education
Out[21]:
| | Education | Number |
|---|---|---|
| 0 | Graduation | 1127 |
| 1 | PhD | 486 |
| 2 | Master | 370 |
| 3 | 2n Cycle | 203 |
| 4 | Basic | 54 |
In [22]:
fig = px.pie(education, names="Education", values="Number")
fig.show()
The same data as a bar chart:
In [23]:
df["Education"].value_counts().plot(kind="bar", color="turquoise", edgecolor="black", linewidth=3)
The pie and bar charts show:
- Graduation accounts for roughly half of all customers
- PhD and Master together make up close to 38%, so highly educated customers are well represented
Field 3: Marital_Status
Breakdown of customers' marital status
In [24]:
ms = df['Marital_Status'].value_counts().reset_index()
Out[24]:
| | index | Marital_Status |
|---|---|---|
| 0 | Married | 864 |
| 1 | Together | 580 |
| 2 | Single | 480 |
| 3 | Divorced | 232 |
| 4 | Widow | 77 |
| 5 | Alone | 3 |
| 6 | Absurd | 2 |
| 7 | YOLO | 2 |
In [25]:
fig = px.pie(ms, names="index", values="Marital_Status")
fig.show()
Field 4: Income
Distribution and outlier analysis of income, mainly via violin and box plots.
In [26]:
px.violin(df, y="Income")
The income value of about 666k stands out and can be treated as an outlier.
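One common way to make "can be treated as an outlier" precise is the 1.5 × IQR fence; a minimal sketch on synthetic incomes (the values and the 1.5 factor are assumptions for illustration, not something the original analysis states):

```python
import pandas as pd

# synthetic incomes standing in for the real column; 666_666 plays the 666k point
income = pd.Series([40_000, 52_000, 58_000, 61_000, 75_000, 666_666])

q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr           # classic 1.5 * IQR fence

outliers = income[income > upper]
clean = income[income <= upper]
```

Anything above the fence is flagged, which matches the visual judgment from the violin plot.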
Income distribution by education level:
- The outlier sits in the Graduation group
- Excluding the outlier, PhD incomes average somewhat higher than Graduation and Master
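The second bullet can be checked directly with a groupby; a minimal sketch on made-up values (only the column names come from the dataset):

```python
import pandas as pd

# toy data: Education levels with fabricated incomes
df = pd.DataFrame({
    "Education": ["PhD", "PhD", "Graduation", "Graduation", "Master"],
    "Income": [70_000, 80_000, 50_000, 60_000, 65_000],
})

# median income per education level, highest first
median_income = df.groupby("Education")["Income"].median().sort_values(ascending=False)
```

Using the median rather than the mean keeps the comparison robust to the 666k outlier discussed above.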
Viewed as a histogram:
plt.figure(figsize=(8, 8))
df["Income"].plot(kind="hist", color="turquoise", edgecolor="black")
Viewed as a box plot:
df["Income"].plot.box(figsize=(8, 8), color="turquoise")
Field 5: Kidhome and Teenhome
In [30]:
df["Kidhome"].value_counts()
Out[30]:
0    1293
In [31]:
df["Teenhome"].value_counts()
Out[31]:
0    1158
In [32]:
# create a new field, kids: total children at home (Kidhome + Teenhome)
df["kids"] = df["Kidhome"] + df["Teenhome"]
df["kids"].value_counts()
About half of the customers have exactly one child at home.
PRODUCTS
The main fields are:
- MntWines: Amount spent on wine in last 2 years.
- MntFruits: Amount spent on fruits in last 2 years.
- MntMeatProducts: Amount spent on meat in last 2 years.
- MntFishProducts: Amount spent on fish in last 2 years.
- MntSweetProducts: Amount spent on sweets in last 2 years.
- MntGoldProds: Amount spent on gold in last 2 years.
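Before plotting each field individually, the spend mix across the six categories can be summarized by summing the Mnt columns; a minimal sketch (toy values, real column names):

```python
import pandas as pd

# toy frame: two customers with fabricated spend figures
df = pd.DataFrame({
    "MntWines": [100, 200],
    "MntFruits": [10, 20],
    "MntMeatProducts": [50, 50],
    "MntFishProducts": [5, 15],
    "MntSweetProducts": [5, 5],
    "MntGoldProds": [20, 20],
})

mnt_cols = [c for c in df.columns if c.startswith("Mnt")]
totals = df[mnt_cols].sum()          # total spend per category
share = totals / totals.sum()        # fraction of overall spend per category
```

On the real dataset the same two lines show which product category dominates revenue.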
Violin plots of each field show the distributions and highlight outliers:
MntWines
In [35]:
px.violin(y=df["MntWines"])
MntFruits
MntMeatProducts
The remaining product fields are analyzed the same way (omitted)… Next, create a new field:
Total expenses
In [40]:
df['Expenses'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']
Out[40]:
0    1617
In [41]:
px.violin(y=df["Expenses"])
Total expenses by education level:
Total expenses by marital status (married, single, etc.):
plt.figure(figsize=(8, 8))
PROMOTION
- NumDealsPurchases: Number of purchases made with a discount.
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise.
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise.
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise.
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise.
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise.
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise.
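Since each flag above is 0/1, the per-campaign acceptance rate is simply the column mean; a minimal sketch on toy flags (the real columns have the same names):

```python
import pandas as pd

# toy frame: four customers, three of the campaign flags
df = pd.DataFrame({
    "AcceptedCmp1": [0, 1, 0, 0],
    "AcceptedCmp2": [0, 0, 0, 0],
    "Response":     [1, 1, 0, 0],
})

cmp_cols = ["AcceptedCmp1", "AcceptedCmp2", "Response"]
acceptance_rate = df[cmp_cols].mean()   # mean of a 0/1 flag = acceptance rate
```

This gives a quick ranking of which campaign performed best before any per-customer aggregation.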
Analysis of the promotion campaigns:
In [45]:
df['AcceptedCmp1'].unique()
Out[45]:
array([0, 1])
In [46]:
df['AcceptedCmp2'].unique()
Out[46]:
array([0, 1])
In [47]:
# new field: total offers accepted (assuming the five campaigns plus the final Response)
df['TotalAcceptedCmp'] = (df['AcceptedCmp1'] + df['AcceptedCmp2'] + df['AcceptedCmp3']
                          + df['AcceptedCmp4'] + df['AcceptedCmp5'] + df['Response'])
df['TotalAcceptedCmp'].value_counts().plot(kind='bar', color='turquoise', edgecolor="black", linewidth=3)
PLACE
- NumWebPurchases: Number of purchases made through the company’s web site.
- NumCatalogPurchases: Number of purchases made using a catalogue.
- NumStorePurchases: Number of purchases made directly in stores.
- NumWebVisitsMonth: Number of visits to company’s web site in the last month.
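The purchase-channel mix can be summarized the same way as the product mix; a minimal sketch with made-up counts and the dataset's column names:

```python
import pandas as pd

# toy frame: two customers, purchase counts per channel
df = pd.DataFrame({
    "NumWebPurchases":     [4, 6],
    "NumCatalogPurchases": [1, 3],
    "NumStorePurchases":   [5, 5],
})

channel_totals = df.sum()                          # purchases per channel
channel_share = channel_totals / channel_totals.sum()  # each channel's share
```

On the real dataset this shows at a glance whether web, catalogue, or store purchases dominate.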
In [50]:
df['NumTotalPurchases'] = df['NumWebPurchases'] + df['NumCatalogPurchases'] + df['NumStorePurchases'] + df['NumDealsPurchases']
Out[50]:
array([25,  6, 21,  8, 19, 22, 10,  2,  4, 16, 15,  5, 26,  9, 13, 12, 43,
In [51]:
px.violin(y=df["NumTotalPurchases"])
plt.figure(figsize=(8, 8))
After the analysis, drop the original fields and keep only the newly derived ones:
In [53]:
col_del = ["ID", "AcceptedCmp1", "AcceptedCmp2",
In [54]:
df.head(10).style.background_gradient(cmap='crest_r')
The new column list:
In [55]:
columns = df.columns
columns
Out[55]:
Index(['Year_Birth', 'Education', 'Marital_Status', 'Income', 'Dt_Customer',
Date field processing
Date parsing
In [56]:
# Dt_Customer: date of the customer's enrollment with the company
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], dayfirst=True)
In [57]:
df.dtypes
Out[57]:
Year_Birth    int64
In [58]:
df["Dt_Customer"].value_counts().sort_index(ascending=False).head()  # latest date: 2014-12-06
Out[58]:
2014-12-06    1
Set a reference date and compute each enrollment date's difference from it:
In [59]:
df['First_day'] = '01-01-2015'
Customer tenure
In [60]:
# number of days from enrollment to the reference date
df['Day_engaged'] = (pd.to_datetime(df['First_day'], dayfirst=True) - df['Dt_Customer']).dt.days
Drop the fields no longer needed:
In [61]:
df.drop(columns=["Dt_Customer", "First_day", "Year_Birth", "Recency", "Complain", "Response"], inplace=True)
Bivariate Analysis
Education vs Expenses
In [62]:
pd.crosstab(df['Education'], df['Expenses'], margins=True).style.background_gradient(cmap='nipy_spectral_r')
Out[62]:
sns.set_theme(style="white")
Marital status vs Expenses
Kids vs Expenses
In [66]:
pd.crosstab(df['kids'], df['Expenses'], margins=True).style.background_gradient(cmap='BuPu_r')
Out[66]:
Day engaged vs Expenses
In [68]:
pd.crosstab(df['Day_engaged'], df['Expenses'], margins=True).head(10).style.background_gradient(cmap='Oranges')
Out[68]:
sns.set_theme(style="white")
NumTotalPurchases vs Expenses
In [70]:
pd.crosstab(df['NumTotalPurchases'], df['Expenses'], margins=True).head().style.background_gradient(cmap='Blues')
Modeling
Correlation analysis
In [72]:
df.describe(include='all').style.background_gradient(cmap='RdPu_r')
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True)  # correlation heatmap (call assumed; original cell truncated)
Correlation between features and the target variable
In [75]:
df.info()
In [76]:
cont_features = df.iloc[:, 2:]
import matplotlib
The bar chart of correlations shows that only kids is negatively correlated with the target variable.
Counting columns by type
In [78]:
# collect the names of the categorical (object-dtype) columns
cate = [col for col in df.columns if df[col].dtype == "object"]
cate
Categorical encoding
In [79]:
from sklearn.preprocessing import LabelEncoder

# label-encode each categorical column
le = LabelEncoder()
for col in cate:
    df[col] = le.fit_transform(df[col])
Feature scaling
In [80]:
df1 = df.copy()
from sklearn.preprocessing import StandardScaler

# standardize every column; the output below shows z-score-like values
sf_df = pd.DataFrame(StandardScaler().fit_transform(df1), columns=df1.columns)
Extracting the feature matrix X
From the scaled data, take the two variables used for clustering; judging from the labeled cluster table further down, columns 2 and 4 correspond to Income and Expenses.
In [81]:
X = sf_df.iloc[:, [2, 4]].values
Out[81]:
array([[ 0.23569584,  1.67941681],
Customer segmentation: the elbow method
Use the elbow method to choose the number of clusters k
In [82]:
from sklearn.cluster import KMeans
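The elbow computation itself is truncated above; a minimal sketch of the usual loop, run here on synthetic two-cluster data standing in for the standardized (Income, Expenses) matrix X:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# synthetic stand-in for the standardized feature matrix X: two blobs
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])

# within-cluster sum of squares (inertia) for k = 1..9
wcss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)
# the elbow is the k after which inertia stops dropping sharply
```

Plotting `wcss` against k and picking the bend is exactly the "elbow" read off in the notebook.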
We finally choose k = 5
# fit KMeans with k = 5 and assign a cluster label to every customer
km = KMeans(n_clusters=5, init="k-means++", random_state=42)
y = km.fit_predict(X)
Cluster visualization
In [84]:
plt.figure(figsize=(15, 8))
Visualization with plotly
In [87]:
result = pd.DataFrame(X, columns=["Income", "Expenses"])
result.head()
Out[87]:
| | Income | Expenses |
|---|---|---|
| 0 | 0.235696 | 1.679417 |
| 1 | -0.235454 | -0.961275 |
| 2 | 0.773999 | 0.282673 |
| 3 | -1.022355 | -0.918094 |
| 4 | 0.241888 | -0.305254 |
In [88]:
result["Label"] = y
result.head()
Out[88]:
| | Income | Expenses | Label |
|---|---|---|---|
| 0 | 0.235696 | 1.679417 | 1 |
| 1 | -0.235454 | -0.961275 | 0 |
| 2 | 0.773999 | 0.282673 | 4 |
| 3 | -1.022355 | -0.918094 | 2 |
| 4 | 0.241888 | -0.305254 | 0 |
In [89]:
fig = px.scatter(result,
                 x="Income",
                 y="Expenses",
                 color="Label")
fig.show()