
Kaggle in Practice: Bank Customer Churn Prediction

Kaggle in Practice: Credit Card Customer Churn Early Warning

This article walks through data analysis and modeling for a Kaggle customer churn prediction task. Main contents:

  • Basic data information
  • Exploratory data analysis (EDA)
  • Feature engineering and encoding
  • Modeling, prediction, and scoring with three classification models
  • Hyperparameter tuning with randomized search and grid search

Background

In recent years, both traditional industries and internet companies have faced the problem of customer churn. Banks, telephone service providers, internet companies, insurers, and similar businesses commonly use churn analysis and churn rate as key business metrics.

In general, the cost of retaining an existing customer is far lower than the cost of acquiring a new one, which is why these companies have customer service teams dedicated to winning back customers who are about to leave: existing customers are more valuable to the company than new ones.

Remember: acquisition is expensive, so retention matters.

Importing Libraries

In [1]:

import numpy as np
import pandas as pd

import plotly as py
import plotly_express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score as f1
from sklearn.metrics import confusion_matrix
import scikitplot as skplt

In [2]:

df = pd.read_csv("BankChurners.csv")
df.head()

Out[2]:

Basic Data Information

df.shape

# Result
(10127, 23)

The result shows a total of 10127 rows and 23 columns.

In [3]:

# All columns
columns = df.columns
columns

Out[3]:

Index(['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender',
'Dependent_count', 'Education_Level', 'Marital_Status',
'Income_Category', 'Card_Category', 'Months_on_book',
'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio',
'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'],
dtype='object')

The columns are described as follows:

  • CLIENTNUM: Client number - unique identifier for the customer holding the account
  • Attrition_Flag: Flag indicating account closure in the next 6 months (between Jan and Jun 2013)
  • Customer_Age: Age of the account holder
  • Gender: Gender of the account holder
  • Dependent_count: Number of people financially dependent on the account holder
  • Education_Level: Educational qualification of the account holder (e.g. high school, college graduate)
  • Marital_Status: Marital status of the account holder (Single, Married, Divorced, Unknown)
  • Income_Category: Annual income category of the account holder
  • Card_Category: Card type, reflecting the card variants by value proposition (Blue, Silver and Platinum)
  • Months_on_book: Number of months since the account holder opened an account with the lender
  • Total_Relationship_Count: Total number of products held by the customer, i.e. the number of relationships the account holder has with the bank (e.g. retail bank, mortgage, wealth management)
  • Months_Inactive_12_mon: Total number of months inactive in the last 12 months
  • Contacts_Count_12_mon: Number of contacts in the last 12 months, i.e. how many times the account holder called the call center in the past 12 months
  • Credit_Limit: Credit limit
  • Total_Revolving_Bal: Total revolving balance
  • Avg_Open_To_Buy: Open-to-buy credit line (average of the last 12 months)
  • Total_Amt_Chng_Q4_Q1: Change in transaction amount (Q4 over Q1)
  • Total_Trans_Amt: Total transaction amount (last 12 months)
  • Total_Trans_Ct: Total transaction count (last 12 months)
  • Total_Ct_Chng_Q4_Q1: Change in transaction count (Q4 over Q1)
  • Avg_Utilization_Ratio: Average card utilization ratio
  • Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1
  • Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2

In [4]:

df.dtypes   # column dtypes (partial screenshot in the original post)

Out[4]:

The following code counts how many columns there are of each dtype:

# Count of columns per dtype

pd.value_counts(df.dtypes)

int64      10
float64     7
object      6
dtype: int64

df.describe().style.background_gradient(cmap="ocean_r")  # styled table output

Styled descriptive statistics of df (selected columns shown).

Missing Values

In [7]:

# Missing value count per column
df.isnull().sum()

# Missing value ratio: the data has no missing values
total = df.isnull().sum().sort_values(ascending=False)
Percentage = total / len(df)

Sorted in descending order, the largest count is 0, which confirms that the dataset has no missing values.
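To view counts and ratios side by side, they can be combined into a small summary table (a quick sketch; the variable name missing_summary is my own):

missing_summary = pd.concat([total, Percentage], axis=1, keys=["missing", "ratio"])
missing_summary.head()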

Dropping Irrelevant Columns

In [9]:

no_use = np.arange(21, df.shape[1])  # indices of the last two columns
no_use

Out[9]:

array([21, 22])

In [10]:

# 1. Drop multiple columns
df.drop(df.columns[no_use], axis=1, inplace=True)

In [11]:

CLIENTNUM is the customer ID; it is useless for modeling, so drop it directly:

# 2. Drop a single column
df.drop("CLIENTNUM", axis=1, inplace=True)

Columns of the new df (after the useless columns have been dropped):

In [12]:

df.columns

Out[12]:

Index(['Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count',
'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category',
'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
dtype='object')

In [13]:

View the descriptive statistics again:

df.describe().style.background_gradient(cmap="ocean_r")

EDA-Exploratory Data Analysis

Usage Frequency and Numeric Features

In [14]:

Extract the numeric columns related to customer usage:

# df_frequency = df[["Customer_Age","Total_Trans_Ct","Total_Trans_Amt","Months_Inactive_12_mon","Credit_Limit","Attrition_Flag"]]  # equivalent to the code below

df_frequency = pd.concat([df['Customer_Age'],
                          df['Total_Trans_Ct'],
                          df['Total_Trans_Amt'],
                          df['Months_Inactive_12_mon'],
                          df['Credit_Limit'],
                          df['Attrition_Flag']],
                         axis=1)

df_frequency.head()

Explore the pairwise relationships between columns under different values of Attrition_Flag:

In [15]:

df["Attrition_Flag"].value_counts()

Out[15]:

Existing Customer    8500  # existing customers
Attrited Customer    1627  # attrited (churned) customers
Name: Attrition_Flag, dtype: int64

The result shows 8500 existing customers and 1627 attrited customers.

In [16]:

# Set up the figure

fig, ax = plt.subplots(ncols=4, figsize=(20,6))

sns.scatterplot(data=df_frequency,
                x="Total_Trans_Amt",
                y="Total_Trans_Ct",
                hue="Attrition_Flag",
                ax=ax[0])

sns.scatterplot(data=df_frequency,
                x="Months_Inactive_12_mon",
                y="Total_Trans_Ct",
                hue="Attrition_Flag",
                ax=ax[1])

sns.scatterplot(data=df_frequency,
                x="Credit_Limit",
                y="Total_Trans_Ct",
                hue="Attrition_Flag",
                ax=ax[2])

sns.scatterplot(data=df_frequency,
                x="Customer_Age",
                y="Total_Trans_Ct",
                hue="Attrition_Flag",
                ax=ax[3])

plt.show()

The same plots with plotly:

for col in ["Customer_Age","Total_Trans_Amt","Months_Inactive_12_mon","Credit_Limit"]:
    fig = px.scatter(df_frequency,
                     x=col,
                     y="Total_Trans_Ct",
                     color="Attrition_Flag")
    fig.show()


The plots above show the relationship between each of these columns and Total_Trans_Ct. Below is an implementation based on go.Scatter:

# Make a copy

df_frequency_copy = df_frequency.copy()
df_frequency_copy["Attrition_Flag_number"] = df_frequency_copy["Attrition_Flag"].apply(lambda x: 1 if x == "Existing Customer" else 2)

# Two basic parameters: number of rows and columns

four_columns = ["Total_Trans_Amt","Months_Inactive_12_mon","Credit_Limit","Customer_Age"]

fig = make_subplots(rows=1,
                    cols=4,
                    start_cell="top-left",
                    shared_yaxes=True,
                    subplot_titles=four_columns  # subplot titles
                    )

for i, v in enumerate(four_columns):
    r = i // 4 + 1    # row
    c = (i + 1) % 4   # column (remainder)

    if c == 0:
        fig.add_trace(go.Scatter(x=df_frequency_copy[v].tolist(),
                                 y=df_frequency_copy["Total_Trans_Ct"].tolist(),
                                 mode='markers',
                                 marker=dict(color=df_frequency_copy.Attrition_Flag_number)),
                      row=r, col=4)

    else:
        fig.add_trace(go.Scatter(x=df_frequency_copy[v].tolist(),
                                 y=df_frequency_copy["Total_Trans_Ct"].tolist(),
                                 mode='markers',
                                 marker=dict(color=df_frequency_copy.Attrition_Flag_number)),
                      row=r, col=c)

fig.update_layout(width=1000, height=450, showlegend=False)

fig.show()

Blue: existing customers; yellow: attrited customers.

We can draw the following conclusions:

  1. Plot 1: the more a customer spends per year, the more likely they are to stay (not churn)
  2. Customers who have been inactive for 2-3 months are more likely to churn
  3. The higher a customer's credit limit, the more likely they are to stay
  4. From plot 3: most attrited customers made fewer than 100 card transactions
  5. From plot 4: the customer age distribution is not an important factor

Customer Demographics

The demographic information mainly includes the customer's age, gender, education level, marital status (single, married, etc.), and income level.

In [21]:

Extract the relevant columns for analysis:

df_demographic = df[['Customer_Age',
                     'Gender',
                     'Education_Level',
                     'Marital_Status',
                     'Income_Category',
                     'Attrition_Flag']]

df_demographic.head()

Age Distribution by Customer Type

In [22]:

px.violin(df_demographic,
          y="Customer_Age",
          color="Attrition_Flag")

The violin plots show that the age distributions of the two customer types are similar.

Conclusion: age is not a key factor in whether a customer churns.

Age Distribution

Look at the overall age distribution of customers in the data:

fig = make_subplots(rows=2, cols=1)

trace1 = go.Box(x=df['Customer_Age'], name='Age With Box Plot', boxmean=True)
trace2 = go.Histogram(x=df['Customer_Age'], name='Age With Histogram')

fig.add_trace(trace1, row=1, col=1)
fig.add_trace(trace2, row=2, col=1)

fig.update_layout(height=500, width=1000, title_text="Customer Age Distribution")
fig.show()

Age is roughly normally distributed, mostly concentrated between 40 and 55.

Gender Counts by Customer Type

In [23]:

flag_gender = df.groupby(["Attrition_Flag","Gender"]).size().reset_index().rename(columns={0:"number"})
flag_gender

Out[23]:

      Attrition_Flag  Gender  number
0  Attrited Customer       F     930
1  Attrited Customer       M     697
2  Existing Customer       F    4428
3  Existing Customer       M    4072

In [24]:

fig = px.bar(flag_gender,
             x="Attrition_Flag",
             y="number",
             color="Gender",
             barmode="group",
             text="number")

fig.show()

From the bar chart above we can see:

  1. There are more female customers than male in this dataset, and females also outnumber males within both customer types
  2. The data is imbalanced: existing vs. attrited customers is roughly 8500:1600 (see the quick check below)
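The exact class proportions can also be checked directly (a quick sketch; the rounded values follow from the counts of 8500 and 1627 shown above):

df["Attrition_Flag"].value_counts(normalize=True).round(3)

# Existing Customer    0.839
# Attrited Customer    0.161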

Crosstab Analysis

Statistical analysis based on pandas crosstabs. A good article explaining crosstabs: https://pbpython.com/pandas-crosstab.html

In [25]:

fig, (ax1,ax2,ax3,ax4) = plt.subplots(ncols=4, figsize=(20,5))

pd.crosstab(df["Attrition_Flag"],df["Gender"]).plot(kind="bar", ax=ax1, ylim=[0,5000])
pd.crosstab(df["Attrition_Flag"],df["Education_Level"]).plot(kind="bar", ax=ax2, ylim=[0,5000])
pd.crosstab(df["Attrition_Flag"],df["Marital_Status"]).plot(kind="bar", ax=ax3, ylim=[0,5000])
pd.crosstab(df["Attrition_Flag"],df["Income_Category"]).plot(kind="bar", ax=ax4, ylim=[0,5000])


fig, (ax1,ax2,ax3) = plt.subplots(ncols=3, figsize=(20,5))
pd.crosstab(df['Attrition_Flag'],df['Dependent_count']).plot(kind='bar',ax=ax1, ylim=[0,5000])
pd.crosstab(df['Attrition_Flag'],df['Card_Category']).plot(kind='bar',ax=ax2, ylim=[0,10000])

_box = sns.boxplot(data=df_demographic,x='Attrition_Flag',y='Customer_Age', ax=ax3)

plt.show()

Zooming in:

We can observe that, across the two customer types, the distributions of education level and marital status are similar. This again supports the earlier conclusion that age is not a factor distinguishing existing from attrited customers.

Education Level

fig = px.pie(df, names='Education_Level', title='Proportion Of Education Levels')
fig.show()

Comparing the Two Customer Counts

In [26]:

churn = df["Attrition_Flag"].value_counts()
churn

Out[26]:

Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64

In [27]:

churn.keys()

Out[27]:

Index(['Existing Customer', 'Attrited Customer'], dtype='object')

In [28]:

plt.pie(x=churn, labels=churn.keys(),autopct="%.1f%%")

plt.show()

The pie chart above shows:

  • Existing customers still account for the vast majority
  • Later we will use resampling to balance the two customer classes.

Correlation

The dataset contains both categorical and numeric columns, which call for different analysis and encoding approaches:

  • Numeric variables: Pearson correlation coefficient
  • Categorical variables: Cramér's V, a statistic commonly used to measure the association between two categorical variables (see the formula below)

Reference: https://blog.csdn.net/deecheanW/article/details/120474864
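For reference, for an r × c contingency table over n observations, Cramér's V is conventionally defined as

$$V = \sqrt{\frac{\chi^2}{n\,\left(\min(r, c) - 1\right)}}$$

where $\chi^2$ is the chi-squared statistic of the table. Note that the helper function defined below returns the quantity inside the square root (i.e. $V^2$) rather than $V$ itself; since the square root is monotonic, this still ranks the associations in the heatmap consistently.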

# Categorical (string) columns
# Equivalent: df.select_dtypes(include="O")
df_categorical = df.loc[:, df.dtypes == object]
df_categorical.head()

# Numeric columns
df_number = df.select_dtypes(exclude="O")
df_number.head()

One-hot encode the Attrition_Flag column:

In [31]:

# Keep the original label column first
df_number["Attrition_Flag"] = df.loc[:, "Attrition_Flag"]
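The cells between In [31] and In [34] are not shown in the post; judging from the heatmaps further down, which reference the columns "Existing Customer" and "Attrited Customer", the flag is presumably one-hot encoded into df_number at this point. A minimal sketch of that assumed step:

# Assumed step (not shown in the original post): expand Attrition_Flag into dummy columns
flag_dummies = pd.get_dummies(df_number["Attrition_Flag"])  # -> "Attrited Customer", "Existing Customer"
df_number = pd.concat([df_number.drop("Attrition_Flag", axis=1), flag_dummies], axis=1)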

Label Encoding

In [34]:

from sklearn import preprocessing

label = preprocessing.LabelEncoder()
df_categorical_encoded = pd.DataFrame()

# Label-encode each categorical column
for i in df_categorical.columns:
    df_categorical_encoded[i] = label.fit_transform(df_categorical[i])

Computing Cramér's V

In [35]:

from scipy.stats import chi2_contingency

# Helper for Cramér's V (note: without the square root this is V squared,
# which still works as a relative association measure)
def cal_cramers_v(v1, v2):
    crosstab = np.array(pd.crosstab(v1, v2, rownames=None, colnames=None))
    stat = chi2_contingency(crosstab)[0]   # chi-squared statistic

    obs = np.sum(crosstab)                 # number of observations
    mini = min(crosstab.shape) - 1         # min(rows, cols) - 1

    return stat / (obs * mini)

In [36]:

rows = []
for v1 in df_categorical_encoded:
    col = []
    for v2 in df_categorical_encoded:
        # Compute Cramér's V for every pair of columns
        cramers = cal_cramers_v(df_categorical_encoded[v1], df_categorical_encoded[v2])
        col.append(round(cramers, 2))
    rows.append(col)

In [37]:

# Matrix of Cramér's V values

cramers_results = np.array(rows)

cramerv_matrix = pd.DataFrame(cramers_results,
                              columns=df_categorical_encoded.columns,
                              index=df_categorical_encoded.columns)
cramerv_matrix.head()

Plot the corresponding heatmap:

mask = np.triu(np.ones_like(cramerv_matrix, dtype=bool))
cat_heatmap = sns.heatmap(cramerv_matrix,  # coefficient matrix
                          mask=mask,
                          vmin=-1,
                          vmax=1,
                          annot=True,
                          cmap="BrBG")

cat_heatmap.set_title("Heatmap of Correlation(Categorical)", fontdict={"fontsize": 14}, pad=12)

plt.show()

# Correlation matrix of the numeric columns

from scipy import stats

num_corr = df_number.corr()  # Pearson correlation
plt.figure(figsize=(16,6))

mask = np.triu(np.ones_like(num_corr, dtype=bool))
heatmap_number = sns.heatmap(num_corr, mask=mask,
                             vmin=-1, vmax=1,
                             annot=True, cmap="RdYlBu")

heatmap_number.set_title("Heatmap of Correlation(Number)", fontdict={"fontsize": 14}, pad=12)

plt.show()

fig, ax = plt.subplots(ncols=2, figsize=(15,6))

heatmap = sns.heatmap(num_corr[["Existing Customer"]].sort_values(by="Existing Customer", ascending=False),
                      ax=ax[0],
                      vmin=-1,
                      vmax=1,
                      annot=True,
                      cmap="coolwarm_r")
heatmap.set_title("Features Correlating with Existing Customers", fontdict={"fontsize":18}, pad=16);

heatmap = sns.heatmap(num_corr[["Attrited Customer"]].sort_values(by="Attrited Customer", ascending=False),
                      ax=ax[1],
                      vmin=-1,
                      vmax=1,
                      annot=True,
                      cmap="coolwarm_r")
heatmap.set_title("Features Correlating with Attrited Customers", fontdict={"fontsize":18}, pad=16);

fig.tight_layout(pad=5)

plt.show()

Summary: the heatmap on the right shows that the following columns are essentially uncorrelated with customer attrition; their correlation coefficients lie between -0.1 and 0.1 (right plot):

  • Credit Limit
  • Average Open To Buy
  • Months On Book
  • Age
  • Dependent Count

We therefore drop these columns:

In [41]:

df_model = df.copy()

df_model = df_model.drop(['Credit_Limit','Customer_Age','Avg_Open_To_Buy','Months_on_book','Dependent_count'],axis=1)

Encoding the Attrition Flag

In [42]:

df_model['Attrition_Flag'] = df_model['Attrition_Flag'].map({'Existing Customer': 1, 'Attrited Customer': 0})

One-hot encode the remaining categorical columns:

df_model=pd.get_dummies(df_model)

Modeling

Splitting the Data

We verified earlier that existing and attrited customers are imbalanced, so we balance the training data with SMOTE (Synthetic Minority Oversampling Technique), which oversamples the minority class by synthesizing new samples.

In [50]:

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

In [51]:

# Features and target variable

# X = df_model.drop("Attrition_Flag", axis=1)  # equivalent alternative
X = df_model.loc[:, df_model.columns != "Attrition_Flag"]
y = df_model["Attrition_Flag"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

SMOTE Oversampling

In [52]:

sm = SMOTE(sampling_strategy="minority", k_neighbors=20, random_state=42)

# Apply the oversampling
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
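A quick sanity check (a sketch, not in the original notebook) to confirm the resampling balanced the two classes in the training set:

print(y_train.value_counts())
print(pd.Series(y_train_res).value_counts())   # after SMOTE both classes should have equal counts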

Three Models

In [53]:

# 1. Random forest

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train_res, y_train_res)

Out[53]:

RandomForestClassifier()

Tree-based models generally do not require feature scaling, but support vector machines do:

In [54]:

# 2. Support vector machine

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The data must be standardized for the SVM

svm = make_pipeline(StandardScaler(), SVC(gamma='auto'))
svm.fit(X_train_res, y_train_res)

Out[54]:

Pipeline(steps=[('standardscaler', StandardScaler()),
('svc', SVC(gamma='auto'))])

In [55]:

# 3. Gradient boosting

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=100,   # number of trees
                                learning_rate=1.0,  # learning rate
                                max_depth=1,        # maximum depth of each tree
                                random_state=42)

gb.fit(X_train_res, y_train_res)

Out[55]:

GradientBoostingClassifier(learning_rate=1.0, max_depth=1, random_state=42)

Model Prediction

In [56]:

y_rf = rf.predict(X_test)
y_svm = svm.predict(X_test)
y_gb = gb.predict(X_test)

Confusion Matrices

In [57]:

from sklearn.metrics import plot_confusion_matrix

fig,ax=plt.subplots(ncols=3, figsize=(20,6))

plot_confusion_matrix(rf, X_test, y_test, ax=ax[0])
ax[0].title.set_text('RF')

plot_confusion_matrix(svm, X_test, y_test, ax=ax[1])
ax[1].title.set_text('SVM')

plot_confusion_matrix(gb, X_test, y_test, ax=ax[2])
ax[2].title.set_text('GB')
fig.tight_layout(pad=5)

plt.show()
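Note: plot_confusion_matrix was deprecated and later removed from scikit-learn (version 1.2). On newer versions, the equivalent calls (a sketch with the same arguments) would be:

from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(rf, X_test, y_test, ax=ax[0])
ConfusionMatrixDisplay.from_estimator(svm, X_test, y_test, ax=ax[1])
ConfusionMatrixDisplay.from_estimator(gb, X_test, y_test, ax=ax[2])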

Classification Scores

In [58]:

# classification_report, recall_score, precision_score, f1_score

from sklearn.metrics import classification_report, recall_score, precision_score, f1_score

print('Random Forest Classifier')
print(classification_report(y_test, y_rf))

print('------------------------')
print('Support Vector Machine')
print(classification_report(y_test, y_svm))

print('------------------------')
print('Gradient Boosting')
print(classification_report(y_test, y_gb))
Random Forest Classifier
              precision    recall  f1-score   support

           0       0.85      0.83      0.84       541
           1       0.97      0.97      0.97      2801

    accuracy                           0.95      3342
   macro avg       0.91      0.90      0.90      3342
weighted avg       0.95      0.95      0.95      3342

------------------------
Support Vector Machine
              precision    recall  f1-score   support

           0       0.81      0.55      0.66       541
           1       0.92      0.98      0.95      2801

    accuracy                           0.91      3342
   macro avg       0.87      0.76      0.80      3342
weighted avg       0.90      0.91      0.90      3342

------------------------
Gradient Boosting
              precision    recall  f1-score   support

           0       0.83      0.84      0.84       541
           1       0.97      0.97      0.97      2801

    accuracy                           0.95      3342
   macro avg       0.90      0.90      0.90      3342
weighted avg       0.95      0.95      0.95      3342

Judging from the confusion matrices and the classification metrics, both the random forest and the gradient boosting model outperform the support vector machine.
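To compare the three models on a single number for the attrited class, the f1 alias imported at the top can be used (a quick sketch; pos_label=0 targets the attrited class given the earlier label mapping):

for name, pred in [("RF", y_rf), ("SVM", y_svm), ("GB", y_gb)]:
    print(name, round(f1(y_test, pred, pos_label=0), 3))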

Hyperparameter Tuning

We tune the random forest and the gradient boosting model with two different methods:

  • Random forest: randomized search (RandomizedSearchCV)
  • Gradient boosting: grid search (GridSearchCV)

Randomized Search: Random Forest

In [59]:

from sklearn.model_selection import RandomizedSearchCV

Set the candidate values for each hyperparameter:

In [60]:

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# n_estimators: number of trees in the random forest

max_features = ['auto', 'sqrt']

In [61]:

# Maximum depth of each tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

max_depth

Out[61]:

[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None]

In [62]:

min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

Randomized Search Parameters

In [64]:

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

The search results are as follows:

In [65]:

rf_random = RandomizedSearchCV(
    estimator=rf,                     # the random forest model
    param_distributions=random_grid,  # parameter distributions to sample from
    n_iter=30,
    cv=3,
    verbose=2,
    random_state=42,
    n_jobs=-1)

rf_random.fit(X_train_res, y_train_res)
print(rf_random.best_params_)
Fitting 3 folds for each of 30 candidates, totalling 90 fits
# Result
{'n_estimators': 1400,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 110,
 'bootstrap': True}

Rebuilding the Model with the Searched Parameters

Retrain with the parameters found by the search:

In [67]:

rf_clf_search = RandomForestClassifier(n_estimators=1400,
                                       min_samples_split=2,
                                       min_samples_leaf=1,
                                       max_features='auto',
                                       max_depth=110,
                                       bootstrap=True)

rf_clf_search.fit(X_train_res, y_train_res)
y_rf_opt = rf_clf_search.predict(X_test)

print('Random Forest Classifier (Optimized)')

print(classification_report(y_test, y_rf_opt))

_rf_opt = plot_confusion_matrix(rf_clf_search, X_test, y_test)
Random Forest Classifier (Optimized)
              precision    recall  f1-score   support

           0       0.86      0.84      0.85       541
           1       0.97      0.97      0.97      2801

    accuracy                           0.95      3342
   macro avg       0.91      0.90      0.91      3342
weighted avg       0.95      0.95      0.95      3342

Tuned confusion matrix: the count in the top-left cell rises from 449 to 452, indicating slightly more accurate classification.

Grid Search: Gradient Boosting

Grid Search Parameters

In [68]:

from sklearn.model_selection import GridSearchCV

param_test1 = {'n_estimators':range(20,100,10)}
param_test1

Out[68]:

{'n_estimators': range(20, 100, 10)}

In [69]:

# Run the search

grid_search1 = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=1.0,  # model to tune
                                                                 min_samples_split=500,
                                                                 min_samples_leaf=50,
                                                                 max_depth=8,
                                                                 max_features='sqrt',
                                                                 subsample=0.8,
                                                                 random_state=10),
                            param_grid=param_test1,  # parameter grid
                            scoring='roc_auc',
                            n_jobs=4,
                            cv=5)

grid_search1.fit(X_train_res, y_train_res)

grid_search1.best_params_

Out[69]:

{'n_estimators': 90}

Rebuilding the Model with the Searched Parameters

In [71]:

gb_clf_opt = GradientBoostingClassifier(n_estimators=90,  # value found by the search
                                        learning_rate=1.0,
                                        min_samples_split=500,
                                        min_samples_leaf=50,
                                        max_depth=8,
                                        max_features='sqrt',
                                        subsample=0.8,
                                        random_state=10)
# Refit
gb_clf_opt.fit(X_train_res, y_train_res)

y_gb_opt = gb_clf_opt.predict(X_test)
print('Gradient Boosting (Optimized)')
print(classification_report(y_test, y_gb_opt))

print(recall_score(y_test, y_gb_opt, pos_label=0))
_gbopt = plot_confusion_matrix(gb_clf_opt, X_test, y_test)
_gbopt

# Result
Gradient Boosting (Optimized)
              precision    recall  f1-score   support

           0       0.85      0.84      0.85       541
           1       0.97      0.97      0.97      2801

    accuracy                           0.95      3342
   macro avg       0.91      0.91      0.91      3342
weighted avg       0.95      0.95      0.95      3342

0.8428835489833642

The count in the top-left cell rises from 454 to 456, a modest improvement, though not a dramatic one.

Summary

Starting from a customer dataset, this article walked through the full churn early-warning workflow: data preprocessing, feature engineering and encoding, modeling, and hyperparameter tuning. The final model reaches about 95% accuracy and 84.2% recall on the attrited class. There is certainly still room for improvement; feedback and discussion are welcome.

Title: Kaggle in Practice: Bank Customer Churn Prediction

Published: 2022-08-26 09:08

Original link: http://www.renpeter.cn/2022/08/26/kaggle%E5%AE%9E%E6%88%98-%E9%93%B6%E8%A1%8C%E7%94%A8%E6%88%B7%E6%B5%81%E5%A4%B1%E9%A2%84%E6%B5%8B.html

License: CC BY-NC-ND 4.0 International (Attribution-NonCommercial-NoDerivatives). Please keep the original link and credit the author when reposting.
