
Kaggle in Practice: Bank Customer Profiling and Churn Prediction

This article works through a bank-customer dataset from Kaggle, covering statistical analysis, visualization, and churn-prediction modeling.

Introduction

Original dataset: https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers

Main reference solution: https://www.kaggle.com/code/thomaskonstantin/bank-churn-data-exploration-and-churn-prediction

I also made some improvements of my own; the final F1 score improved by roughly 4-5 points on each of the three models:

1. Original results:

2. My results:

Importing Libraries

Import the libraries used for data processing, visualization, modeling, and evaluation.

In [2]:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import missingno as ms
import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.offline as pyo
pyo.init_notebook_mode()
sns.set_style('darkgrid')

from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split,cross_val_score

from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score as f1
from sklearn.metrics import confusion_matrix

plt.rc('figure',figsize=(18,9))

Basic Dataset Information

In [4]:

df = pd.read_csv("BankChurners.csv")
df.head()

Check some basic information about the data:

In [3]:

df.shape  # number of rows and columns

Out[3]:

(10127, 23)

In [4]:

pd.value_counts(df.dtypes)   # counts of the different column dtypes

Out[4]:

Counts of the different column dtypes:

int64      10
float64     7
object      6
dtype: int64

In [5]:

Visualize the descriptive statistics:

df.describe().style.background_gradient(cmap="ocean_r")

Out[5]:

Note: the screenshot shows only some of the columns; the descriptive statistics cover only the numeric columns.
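
For completeness, the non-numeric columns can be summarized separately; a quick sketch using pandas' include argument:

# Summary of the object (categorical) columns: count, unique, top, freq
df.describe(include="object")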

Dropping Irrelevant Columns

The columns to drop are the last two plus the first one. Field descriptions:

  • CLIENTNUM:Client number - Unique identifier for the customer holding the account
  • Attrition_Flag:Flag indicative of account closure in next 6 months (between Jan to Jun 2013)
  • Customer_Age:Age of the account holder
  • Gender:Gender of the account holder
  • Dependent_count:Number of people financially dependent on the account holder
  • Education_Level:Educational qualification of account holder (ex - high school, college grad etc.)
  • Marital_Status:Marital status of account holder (Single, Married, Divorced, Unknown)
  • Income_Category:Annual income category of the account holder
  • Card_Category:Card type depicting the variants of the cards by value proposition (Blue, Silver and Platinum)
  • Months_on_book:Number of months since the account holder opened an account with the lender
  • Total_Relationship_Count:Total number of products held by the customer. Total number of relationships the account holder has with the bank (example - retail bank, mortgage, wealth management etc.)
  • Months_Inactive_12_mon:Total number of months inactive in last 12 months
  • Contacts_Count_12_mon:Number of Contacts in the last 12 months. No. of times the account holder called to the call center in the past 12 months
  • Credit_Limit:Credit limit
  • Total_Revolving_Bal:Total amount as revolving balance
  • Avg_Open_To_Buy:Open to Buy Credit Line (Average of last 12 months)
  • Total_Amt_Chng_Q4_Q1:Change in Transaction Amount (Q4 over Q1)
  • Total_Trans_Amt:Total Transaction Amount (Last 12 months)
  • Total_Trans_Ct:Total Transaction Count (Last 12 months)
  • Total_Ct_Chng_Q4_Q1:Change in Transaction Count (Q4 over Q1)
  • Avg_Utilization_Ratio:Average Card Utilization Ratio
  • Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1
  • Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2

In [7]:

df = df[df.columns[:-2]]
df.drop("CLIENTNUM",axis=1,inplace=True) # drop the first column (customer ID)

We first drop these irrelevant columns.

Missing Values

In [9]:

ms.bar(df,color="blue")

plt.show()

The chart above shows that there are no missing values in the data. We can verify this directly:

df.isnull().sum().sort_values(ascending=False)

Since the counts are sorted in descending order and the first value is 0, there are indeed no missing values.

EDA: Exploratory Data Analysis

Age Distribution

In [11]:

fig = make_subplots(rows=1, cols=2)

trace1 = go.Box(x=df['Customer_Age'],
                name='Age Box Plot',
                boxmean=True)

trace2 = go.Histogram(x=df['Customer_Age'],
                      name='Age Histogram')

fig.add_trace(trace1, row=1, col=1)
fig.add_trace(trace2, row=1, col=2)

fig.update_layout(height=500, width=1000, title_text="Customer Age Distribution")

fig.show()

Overall, the customer age distribution is approximately normal.
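
As a quick numeric sanity check of this claim (a sketch using pandas' built-in skewness and kurtosis; values near 0 are consistent with a roughly normal shape):

# Skewness and excess kurtosis of the age column
print("skewness:", df["Customer_Age"].skew())
print("kurtosis:", df["Customer_Age"].kurt())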

Share of Customers by Education Level

In [12]:

fig = px.pie(df,
             names='Education_Level',
             title='Share of customers by education level')

fig.show()

Customers by Gender and Card Category

In [13]:

fig = make_subplots(
    rows=2,  # grid of rows and columns
    cols=2,
    subplot_titles=('',  # title of each subplot
                    '<b>Platinum Card Holders<b>',
                    '<b>Blue Card Holders<b>',
                    ''),
    vertical_spacing=0.09,  # spacing between subplot rows
    specs=[[{"type": "pie", "rowspan": 2},  # type of each subplot
            {"type": "pie"}],
           [None,
            {"type": "pie"}],
          ])

fig.add_trace(go.Pie(values=df["Gender"].value_counts().values,
                     labels=['<b>Female<b>',
                             '<b>Male<b>'],
                     hole=0.3,
                     pull=[0, 0.3]),
              row=1, col=1
)

fig.add_trace(go.Pie(values=df.query('Card_Category=="Platinum"')["Gender"].value_counts().values,
                     labels=['Female Platinum Card Holders',
                             'Male Platinum Card Holders'],
                     hole=0.3,
                     pull=[0, 0.05]),
              row=1, col=2
)


fig.add_trace(go.Pie(values=df.query('Card_Category=="Blue"').Gender.value_counts().values,
                     labels=['Female Blue Card Holders',
                             'Male Blue Card Holders'],
                     hole=0.3,
                     pull=[0, 0.2]),
              row=2, col=2)


fig.update_layout(
    height=800,
    showlegend=True,
    title="<b>Distribution Of Gender And Different Card Statuses<b>"
)

fig.show()

We can see that female customers outnumber male customers overall.

Number of Dependents (Dependent_count)

In [14]:

fig = make_subplots(rows=2, cols=1)

trace1 = go.Box(x=df['Dependent_count'],
                name='Dependent count - box plot',
                boxmean=True)

trace2 = go.Histogram(x=df['Dependent_count'],
                      name='Dependent count - histogram')

fig.add_trace(trace1, row=1, col=1)
fig.add_trace(trace2, row=2, col=1)

fig.update_layout(height=700,
                  width=1200,
                  title_text="Distribution of the Number of Dependents")

fig.show()

The dependent-count distribution is roughly normal, with a slight left skew.

Marital Status

In [15]:

fig = px.pie(df,
             names="Marital_Status",
             title="Customers by marital status")

fig.show()

Most credit-card customers are married; single customers also account for a large share, indicating real demand from that group as well.

Income Levels (Income_Category)

In [16]:

number_of_income = df["Income_Category"].value_counts().reset_index()
number_of_income.columns = ["Income_Category","number"]

number_of_income

fig = px.bar(number_of_income,
             x="Income_Category",
             y="number",
             text="number")

fig.show()

Months_on_book

The number of months the customer has had an account (transactions or activity) with the bank.

In [19]:

fig = make_subplots(rows=2, cols=1)

trace1 = go.Box(x=df['Months_on_book'],
                name='Months on book Box-Plot',
                boxmean=True)
trace2 = go.Histogram(x=df['Months_on_book'],
                      name='Months on book Histogram')

fig.add_trace(trace1, row=1, col=1)
fig.add_trace(trace2, row=2, col=1)

fig.update_layout(height=700,
                  width=1200,
                  title_text="Relationships on Bank")

fig.show()

The data shows a noticeably peaked distribution; let's compute the kurtosis:

In [20]:

print("Kurtosis of Months_on_book:", df["Months_on_book"].kurt())
Kurtosis of Months_on_book: 0.40010012019986707

Use a violin plot to inspect the distribution:

In [21]:

px.violin(y=df["Months_on_book"])

This distribution is interesting: there is a pronounced spike at 36-37 months, which matches the bar in the histogram and stretches the middle of the violin plot.

We observe the same shape for both customer types (existing and attrited).

Given this, we cannot treat this feature as normally distributed.
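
The code for that per-group comparison is not shown above; a minimal sketch of one way to draw it with plotly express, assuming the original (unencoded) Attrition_Flag labels:

# One violin of Months_on_book per customer type
fig = px.violin(df, y="Months_on_book", color="Attrition_Flag", box=True)
fig.show()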

Credit Limit (Credit_Limit)

In [23]:

fig = px.violin(y=df["Credit_Limit"])

fig.show()

Female customers generally have higher credit limits than male customers.
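
Note that the violin above does not split by gender, so it cannot support this comparison on its own; a sketch of a gender-split version, assuming plotly express as elsewhere in this article:

# Compare credit-limit distributions between genders
fig = px.violin(df, y="Credit_Limit", color="Gender", box=True)
fig.show()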

sns.displot(data=df, x="Credit_Limit", kde=True)

plt.show()

The distribution is heavily right-skewed, with a pronounced long tail.

Transactions in the Last 12 Months (Total_Trans_Amt)

In [26]:

fig = make_subplots(rows=2, cols=1)

trace1 = go.Box(x=df['Total_Trans_Amt'],
                name='Total_Trans_Amt Box Plot',
                boxmean=True)
trace2 = go.Histogram(x=df['Total_Trans_Amt'],
                      name='Total_Trans_Amt Histogram')

fig.add_trace(trace1, row=1, col=1)
fig.add_trace(trace2, row=2, col=1)

fig.update_layout(height=700,
                  width=1200,
                  title_text="Distribution of total transaction amount (last 12 months)")
fig.show()

The histogram shows that customers' 12-month transaction amounts cluster around several distinct modes, suggesting this feature could be used to segment the data into groups for separate analysis.
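
One way to make such groups explicit is to cluster on this feature. A sketch using scikit-learn's KMeans; the import and the cluster count are assumptions (chosen by eye from the histogram), not part of the original pipeline:

from sklearn.cluster import KMeans

# Cluster transaction amounts into a few groups;
# n_clusters=4 is a guess based on the visible modes
km = KMeans(n_clusters=4, random_state=42, n_init=10)
groups = km.fit_predict(df[["Total_Trans_Amt"]])
print(pd.Series(groups).value_counts())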

Existing vs. Attrited Customers

In [27]:

fig = px.pie(df,
             names='Attrition_Flag',
             title='Share of existing vs. attrited customers',
             hole=0.33)

fig.show()

The two classes are clearly imbalanced; we will address this later with SMOTE oversampling.
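
The imbalance can be quantified directly (nothing beyond pandas needed):

# Roughly 84% existing vs. 16% attrited customers
print(df["Attrition_Flag"].value_counts(normalize=True))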

Existing vs. Attrited Customers Across Multiple Dimensions

In [28]:

df.columns

Out[28]:

Index(['Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count',
       'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category',
       'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
      dtype='object')

In [29]:

df1 = (df.groupby(["Gender","Education_Level","Marital_Status","Income_Category","Attrition_Flag"])
         .size()
         .reset_index()
         .rename(columns={0:"number"}))

df1.head()

fig = px.treemap(
    df1,
    path=[px.Constant("all"), "Gender", "Education_Level", "Marital_Status", "Income_Category", "Attrition_Flag"],  # key step: the drill-down path
    values="number",
    color="Education_Level"
)

fig.update_traces(root_color="lightskyblue")

fig.update_layout(margin=dict(t=30, l=20, r=25, b=30))

fig.show()

Data Preprocessing

Feature Encoding

In [31]:

df.Attrition_Flag = df.Attrition_Flag.replace({'Attrited Customer':1,'Existing Customer':0})

df.Gender = df.Gender.map({'F':1,'M':0})

In [32]:

# one-hot encoding

df = pd.concat([df,pd.get_dummies(df['Education_Level']).drop(columns=['Unknown'])],axis=1)

Education level is one-hot encoded, and the Unknown dummy column is dropped. The remaining categorical variables are handled the same way (a compact loop form is sketched after the next cell):

In [33]:

df = pd.concat([df,pd.get_dummies(df['Income_Category']).drop(columns=['Unknown'])],axis=1)
df = pd.concat([df,pd.get_dummies(df['Marital_Status']).drop(columns=['Unknown'])],axis=1)
df = pd.concat([df,pd.get_dummies(df['Card_Category']).drop(columns=['Platinum'])],axis=1)
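
Since the pattern is identical for each column, the three lines above could equally be written as a loop; a sketch that reproduces the same behavior (the dropped dummy level per column mirrors the cell above):

# Equivalent loop form: one-hot encode each column, dropping one level
to_drop = {"Income_Category": "Unknown",
           "Marital_Status": "Unknown",
           "Card_Category": "Platinum"}
for col, level in to_drop.items():
    dummies = pd.get_dummies(df[col]).drop(columns=[level])
    df = pd.concat([df, dummies], axis=1)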

Now drop the original categorical columns:

In [34]:

df.drop(columns=['Education_Level','Income_Category',
                 'Marital_Status','Card_Category'], inplace=True)

Correlation Coefficients

In [35]:

# implemented with plotly
# (colorscale entries must run ascending from 0.0 to 1.0)

colorscale = [[0.0, "rgb(49,54,149)"],
              [0.1, "rgb(69,117,180)"],
              [0.2, "rgb(116,173,209)"],
              [0.3, "rgb(171,217,233)"],
              [0.4, "rgb(224,243,248)"],
              [0.5, "rgb(254,224,144)"],
              [0.6, "rgb(253,174,97)"],
              [0.7, "rgb(244,109,67)"],
              [0.8, "rgb(215,48,39)"],
              [1.0, "rgb(165,0,38)"]]

fig = make_subplots(rows=1, cols=1)

corr = df.corr("pearson")
x = corr.columns
y = corr.index
z = corr.values

fig.add_trace(go.Heatmap(x=x, y=y, z=z,
                         name="correlation",
                         showscale=False,
                         xgap=0.7,
                         ygap=0.7,
                         colorscale=colorscale
                         ), row=1, col=1)

fig.update_layout(
    hoverlabel=dict(
        bgcolor="white",
        font_size=16,
        font_family="Rockwell"
    )
)
fig.update_layout(height=800,
                  width=800,
                  title_text="Correlation matrix")
fig.show()

Resampling with SMOTE

In [36]:

# 1. split features and target

X = df.iloc[:, 1:]
y = df.iloc[:, 0]

In [37]:

pd.value_counts(y)  # before resampling

Out[37]:

0    8500
1    1627
Name: Attrition_Flag, dtype: int64

In [38]:

# 2. SMOTE oversampling

sm = SMOTE(sampling_strategy="minority",
           k_neighbors=20,
           random_state=42)

X, y = sm.fit_resample(X, y)

In [39]:

# 3. all resampled features plus the label (Churn)

sm_df = X.assign(Churn=y)
sm_df.head()

Out[39]:

Checking the resampled data, the two classes now have identical counts:

In [40]:

pd.value_counts(y)   # after resampling

Out[40]:

0    8500
1    8500
Name: Attrition_Flag, dtype: int64

Dimensionality Reduction with PCA

Visualizing the Explained Variance

In [41]:

sm_df.columns

Out[41]:

Index(['Customer_Age', 'Gender', 'Dependent_count', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio',
       'College', 'Doctorate', 'Graduate', 'High School', 'Post-Graduate',
       'Uneducated', '$120K +', '$40K - $60K', '$60K - $80K', '$80K - $120K',
       'Less than $40K', 'Divorced', 'Married', 'Single', 'Blue', 'Gold',
       'Silver', 'Churn'],
      dtype='object')

In [42]:

# columns produced by one-hot encoding
one_hot_data = sm_df[sm_df.columns[15:-1]].copy()
# the resampled data without the one-hot columns
sm_df = sm_df.drop(columns=sm_df.columns[15:-1])

In [43]:

sm_df.shape

Out[43]:

(17000, 16)

In [44]:

one_hot_data.shape

Out[44]:

(17000, 17)

In [45]:

n = 8

pca = PCA(n_components=n)
pca_matrix = pca.fit_transform(one_hot_data)

# explained variance ratio of each component
evr = pca.explained_variance_ratio_
# total explained variance (percent)
total_var = evr.sum() * 100
# cumulative explained variance ratio
cumsum_evr = np.cumsum(evr)


# ----------
trace1 = {
    "name": "Individual explained variance",
    "type": "bar",
    "y": evr}

trace2 = {
    "name": "Cumulative explained variance",
    "type": "scatter",
    "y": cumsum_evr}

data = [trace1, trace2]
layout = {
    "xaxis": {"title": "Principal components"},
    "yaxis": {"title": "Explained variance ratio"},
}

fig = go.Figure(data=data, layout=layout)
fig.update_layout(title='Explained Variance Using {} Dimensions'.format(n))
fig.show()

To retain at least 80% of the variance, trial and error shows that n=8 almost exactly meets the requirement (a programmatic check is sketched below).
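
Rather than trying values by hand, the smallest n reaching a target share of variance can be computed directly; a sketch using a full PCA fit (with the 80% target this likely lands just above the n=8 used here, since 8 components explain 79.61%):

# Fit PCA with all components, then find the smallest n whose
# cumulative explained variance ratio reaches the target
full_pca = PCA().fit(one_hot_data)
cum = np.cumsum(full_pca.explained_variance_ratio_)
n_min = int(np.argmax(cum >= 0.80)) + 1
print("smallest n for >= 80%:", n_min)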

PCA Results

Build a DataFrame from the retained principal components:

In [46]:

pca_df = pd.DataFrame(pca_matrix,
                      columns=["P{}".format(i) for i in range(0, n)])
pca_df.head()

Out[46]:

The resulting DataFrame of 8 principal components:

Merge sm_df and pca_df to form the modeling dataset:

In [47]:

df_model = pd.concat([sm_df, pca_df],axis=1)

Explanatory Power

In [48]:

fig = px.scatter_matrix(
    pca_df.values,
    color=df_model["Credit_Limit"],
    dimensions=range(n),
    labels={str(i): 'P{}'.format(i) for i in range(0, n)},
    title=f'Total Explained Variance: {total_var:.2f}%'  # total explained variance
)

fig.update_traces(diagonal_visible=False)
fig.update_layout(
    coloraxis_colorbar=dict(
        title="Credit_Limit",
    ),
)
fig.show()

We can see that 79.61% of the variance has been retained.

Correlation Matrix Revisited

In [49]:

# implemented with plotly
# (colorscale entries must run ascending from 0.0 to 1.0)

colorscale = [[0.0, "rgb(49,54,149)"],
              [0.1111111111111111, "rgb(69,117,180)"],
              [0.2222222222222222, "rgb(116,173,209)"],
              [0.3333333333333333, "rgb(171,217,233)"],
              [0.4444444444444444, "rgb(224,243,248)"],
              [0.5555555555555556, "rgb(254,224,144)"],
              [0.6666666666666666, "rgb(253,174,97)"],
              [0.7777777777777778, "rgb(244,109,67)"],
              [0.8888888888888888, "rgb(215,48,39)"],
              [1.0, "rgb(165,0,38)"]]

fig = make_subplots(rows=1, cols=1)

corr = df_model.corr("pearson")  # df ---> df_model
x = corr.columns
y = corr.index
z = corr.values

fig.add_trace(go.Heatmap(x=x, y=y, z=z,
                         name="correlation",
                         showscale=False,
                         xgap=0.7,
                         ygap=0.7,
                         colorscale=colorscale
                         ), row=1, col=1)

fig.update_layout(
    hoverlabel=dict(
        bgcolor="white",
        font_size=16,
        font_family="Rockwell"
    )
)
fig.update_layout(height=800,
                  width=700,
                  title_text="Correlation matrix (after PCA)")
fig.show()

Modeling

Extracting the Feature Matrix

In [50]:

df_model_corr = df_model.corr()

In [51]:

fig = plt.figure(figsize=(16, 12))

heatmap = sns.heatmap(df_model_corr[["Churn"]].sort_values(by="Churn", ascending=False),
                      vmin=-1,
                      vmax=1,
                      annot=True,
                      cmap="coolwarm_r")
heatmap.set_title("Features Correlating with Churn",
                  fontdict={"fontsize": 18},
                  pad=16)

fig.tight_layout(pad=5)

plt.show()

From the heat map above we can see how strongly each feature correlates with Churn (the target variable). Here we drop the features whose absolute correlation coefficient is below 0.2:

In [53]:

df_model_corr.sort_values("Churn")["Churn"]

Out[53]:

Total_Trans_Ct             -0.535498
P3                         -0.452090
Total_Ct_Chng_Q4_Q1        -0.405432
Total_Revolving_Bal        -0.347875
Total_Relationship_Count   -0.309749
Total_Trans_Amt            -0.257634
Avg_Utilization_Ratio      -0.249698
Total_Amt_Chng_Q4_Q1       -0.192537
P2                         -0.181671
P5                         -0.133274
P1                         -0.112865
Dependent_count            -0.098599
Gender                     -0.090104
Credit_Limit               -0.038770
Avg_Open_To_Buy            -0.004766
P6                         -0.003208
Months_on_book             -0.001183
Customer_Age                0.003416
P7                          0.010319
P4                          0.050547
P0                          0.064520
Months_Inactive_12_mon      0.086637
Contacts_Count_12_mon       0.154657
Churn                       1.000000
Name: Churn, dtype: float64

New Feature Matrix

In [54]:

no_use_col = ['Total_Amt_Chng_Q4_Q1', 'P1', 'Dependent_count',
              'Credit_Limit', 'P7', 'Avg_Open_To_Buy',
              'Customer_Age', 'Months_Inactive_12_mon',
              'Months_on_book', 'P6', 'P4', 'P0',
              'Contacts_Count_12_mon']
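
This list can also be derived programmatically from the 0.2 threshold stated above; a sketch (note that a strict threshold would additionally flag P2, P5, and Gender, which the hand-picked list keeps):

# Columns whose absolute correlation with Churn falls below 0.2
churn_corr = df_model_corr["Churn"].abs()
auto_drop = churn_corr[churn_corr < 0.2].index.tolist()
print(auto_drop)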

In [55]:

df_new = df_model.drop(no_use_col, axis=1)
df_new.head()

Splitting the Data

In [56]:

# feature matrix and target of the new dataset

X_ = df_new.drop("Churn",axis=1)
y_ = df_new["Churn"]


X_train, X_test, y_train, y_test = train_test_split(X_, y_, test_size=0.3, random_state=42)

Cross-Validation

In [57]:

rf = Pipeline(steps=[('scale', StandardScaler()),  # scaling is optional for tree-based models
                     ("RF", RandomForestClassifier(random_state=42))])

ada = Pipeline(steps=[('scale', StandardScaler()),
                      ("ADB", AdaBoostClassifier(random_state=42,
                                                 learning_rate=0.6))])

svm = Pipeline(steps=[('scale', StandardScaler()),
                      ("SVC", SVC(random_state=42, kernel='rbf'))])

rf_f1_scores = cross_val_score(rf, X_train, y_train,
                               cv=5, scoring='f1')

ada_f1_scores = cross_val_score(ada, X_train, y_train,
                                cv=5, scoring='f1')
svm_f1_scores = cross_val_score(svm, X_train, y_train,
                                cv=5, scoring='f1')

Comparing the Results

In [58]:

length = len(rf_f1_scores)

x = list(range(length))

In [59]:

fig = make_subplots(rows=3,
                    cols=1,
                    shared_xaxes=True,
                    subplot_titles=('RF', 'Adaboost', 'SVM'))

fig.add_trace(
    go.Scatter(x=x, y=rf_f1_scores, name='Random Forest'),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(x=x, y=ada_f1_scores, name='Adaboost'),
    row=2, col=1
)
fig.add_trace(
    go.Scatter(x=x, y=svm_f1_scores, name='SVM'),
    row=3, col=1
)

fig.update_layout(height=700,
                  width=900,
                  title_text="Different Model 5 Fold Cross Validation")

fig.update_yaxes(title_text="F1 Score")

fig.update_xaxes(title_text="Fold #")

fig.show()

Model Prediction

Make predictions with all three models:

In [60]:

rf.fit(X_train,y_train)
rf_prediction = rf.predict(X_test)

ada.fit(X_train,y_train)
ada_prediction = ada.predict(X_test)

svm.fit(X_train,y_train)
svm_prediction = svm.predict(X_test)

In [61]:

# f1_score expects (y_true, y_pred); binary F1 is unchanged by the swap,
# but the conventional argument order is used here for clarity
rf_f1 = f1(y_test, rf_prediction)
ada_f1 = f1(y_test, ada_prediction)
svm_f1 = f1(y_test, svm_prediction)
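
confusion_matrix was imported at the top but never used; a quick sketch to inspect per-class errors, e.g. for the random-forest predictions:

# Rows are true classes (0 = existing, 1 = attrited), columns are predictions
print(confusion_matrix(y_test, rf_prediction))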

Comparing the Three Models

F1 scores of the three models on the test set:

In [62]:

df_f1_test = pd.DataFrame({"Model": ["RF", "AdaBoost", "SVM"],
                           "F1_score": [rf_f1, ada_f1, svm_f1]})
df_f1_test

Clearly, the random forest is still the best model. Compared with the original solution, the F1 score is still about 4-5 points higher.
