Kaggle in Practice: Bank Customer Profiling and Churn Early Warning

This article uses a bank customer dataset published on Kaggle to carry out statistical analysis and churn-prediction modeling.

Introduction

Dataset: https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers

Main reference solution: https://www.kaggle.com/code/thomaskonstantin/bank-churn-data-exploration-and-churn-prediction

I also made some changes of my own; the final F1 score improved by roughly 4-5 points on all three models:

1. Original results:

2. My results:

Importing libraries

Import the libraries used for data processing, visualization, modeling, and evaluation.

In [2]:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import missingno as ms
import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.offline as pyo
pyo.init_notebook_mode()
sns.set_style('darkgrid')

from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split,cross_val_score

from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score as f1
from sklearn.metrics import confusion_matrix

plt.rc('figure',figsize=(18,9))

Basic data information

In [4]:

df = pd.read_csv("BankChurners.csv")
df.head()

Basic information about the data:

In [3]:

df.shape  # number of rows and columns

Out[3]:

(10127, 23)

In [4]:

df.dtypes.value_counts()  # number of columns of each dtype

Out[4]:

Number of columns per dtype:

int64      10
float64     7
object      6
dtype: int64

In [5]:

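Visualizing the descriptive statistics:

df.describe().style.background_gradient(cmap="ocean_r")

Out[5]:

Note: the screenshot shows only some of the columns; descriptive statistics cover numeric columns only.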

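Dropping irrelevant columns

The last two columns and the first column (CLIENTNUM) are irrelevant here.

The columns are described as follows: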

CLIENTNUM：Client number - Unique identifier for the customer holding the account
Attrition_Flag：Flag indicative of account closure in next 6 months (between Jan to Jun 2013)
Customer_Age：Age of the account holder
Gender：Gender of the account holder
Dependent_count：Number of people financially dependent on the account holder
Education_Level：Educational qualification of account holder (ex - high school, college grad etc.)
Marital_Status：Marital status of account holder (Single, Married, Divorced, Unknown)
Income_Category：Annual income category of the account holder
Card_Category：Card type depicting the variants of the cards by value proposition (Blue, Silver and Platinum)
Months_on_book：Number of months since the account holder opened an account with the lender
Total_Relationship_Count：Total number of products held by the customer. Total number of relationships the account holder has with the bank (example - retail bank, mortgage, wealth management etc.)
Months_Inactive_12_mon：Total number of months inactive in last 12 months
Contacts_Count_12_mon：Number of Contacts in the last 12 months. No. of times the account holder called to the call center in the past 12 months
Credit_Limit：Credit limit
Total_Revolving_Bal：Total amount as revolving balance
Avg_Open_To_Buy：Open to Buy Credit Line (Average of last 12 months)
Total_Amt_Chng_Q4_Q1：Change in Transaction Amount (Q4 over Q1)
Total_Trans_Amt：Total Transaction Amount (Last 12 months)
Total_Trans_Ct：Total Transaction Count (Last 12 months)
Total_Ct_Chng_Q4_Q1：Change in Transaction Count (Q4 over Q1)
Avg_Utilization_Ratio：Average Card Utilization Ratio
Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1
Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2

In [7]:

df = df[df.columns[:-2]].drop(columns="CLIENTNUM")  # drop the last two columns and the first column (CLIENTNUM)

The irrelevant columns are now removed.

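Missing values

In [9]:

ms.bar(df,color="blue")

plt.show()

The chart above shows that the data contains no missing values. The following confirms it:

df.isnull().sum().sort_values(ascending=False)

Because the counts are sorted in descending order and the first value is 0, no column has missing values.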
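The same check can be written as a single number that is easy to assert on. The sketch below is only an illustration: the two small frames are hypothetical stand-ins for `df`, and `count_missing` is a helper name introduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df: one complete frame, one with a single gap.
complete = pd.DataFrame({"Customer_Age": [45, 49, 51],
                         "Credit_Limit": [12691.0, 8256.0, 3418.0]})
with_gap = complete.copy()
with_gap.loc[1, "Credit_Limit"] = np.nan


def count_missing(frame: pd.DataFrame) -> int:
    """Total number of missing cells across all columns."""
    return int(frame.isnull().sum().sum())


n_complete = count_missing(complete)
n_gap = count_missing(with_gap)
print(n_complete, n_gap)
```

A total of zero means no column has any missing value, which is the same conclusion as the sorted per-column counts.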

Exploratory Data Analysis (EDA)

Age distribution

In [11]:

fig = make_subplots(rows=1, cols=2)

trace1=go.Box(x=df['Customer_Age'],
              name='Age Box Plot',
              boxmean=True)

trace2=go.Histogram(x=df['Customer_Age'],
                    name='Age Histogram')

fig.add_trace(trace1, row=1, col=1)
fig.add_trace(trace2, row=1, col=2)

fig.update_layout(height=500, width=1000, title_text="Customer age distribution")

fig.show()

The customer age distribution is roughly normal.

Share of customers by education level

In [12]:

fig = px.pie(df,
             names='Education_Level',
             title='Share of customers by education level')

fig.show()

Comparing customers by gender and card category

In [13]:

fig = make_subplots(
    rows=2,  # grid of 2 rows x 2 columns
    cols=2,
    subplot_titles=('',  # one title per subplot
                    '<b>Platinum Card Holders</b>',
                    '<b>Blue Card Holders</b>'),
    vertical_spacing=0.09,  # vertical gap between subplot rows
    specs=[[{"type": "pie","rowspan": 2},  # type of each subplot
            {"type": "pie"}] ,
           [None,
            {"type": "pie"}],
          ])

gender_names = {'F': 'Female', 'M': 'Male'}

def gender_pie(frame, template, pull):
    # Build labels from the value_counts index so each label matches its count
    counts = frame["Gender"].value_counts()
    labels = [template.format(gender_names[g]) for g in counts.index]
    return go.Pie(values=counts.values, labels=labels, hole=0.3, pull=pull)

fig.add_trace(gender_pie(df, '<b>{}</b>', [0, 0.3]),
              row=1, col=1)

fig.add_trace(gender_pie(df.query('Card_Category=="Platinum"'),
                         '{} Platinum Card Holders', [0, 0.05]),
              row=1, col=2)

fig.add_trace(gender_pie(df.query('Card_Category=="Blue"'),
                         '{} Blue Card Holders', [0, 0.2]),
              row=2, col=2)


fig.update_layout(
    height=800,
    showlegend=True,
    title="<b>Distribution Of Gender And Different Card Statuses</b>"
)

fig.show()

As the chart shows, female customers outnumber male customers overall.

Number of dependents (Dependent_count)

In [14]:

fig = make_subplots(rows=2, cols=1)

trace1=go.Box(x=df['Dependent_count'],
              name='Dependent count box plot',
              boxmean=True)

trace2=go.Histogram(x=df['Dependent_count'],
                    name='Dependent count histogram')

fig.add_trace(trace1,row=1,col=1)
fig.add_trace(trace2,row=2,col=1)

fig.update_layout(height=700,
                  width=1200,
                  title_text="Distribution of dependent count")

fig.show()

The dependent counts are roughly normally distributed, with a slight left skew.

Marital status

In [15]:

fig = px.pie(df,
             names="Marital_Status",
             title="Customers by marital status"
            )

fig.show()

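Most credit card customers are married. Single customers also make up a large share, which suggests real demand in that group as well.

Income level comparison (Income_Category)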

In [16]:

number_of_income = df["Income_Category"].value_counts().reset_index()
number_of_income.columns = ["Income_Category","number"]

number_of_income

fig = px.bar(number_of_income,
             x="Income_Category",
             y="number",
             text="number"
            )

fig.show()

Months_on_book

How long the customer has had transaction or activity records with the bank, in months.

In [19]:

fig = make_subplots(rows=2, cols=1)

trace1=go.Box(x=df['Months_on_book'],
           name='Months on book Box-Plot',
           boxmean=True)
trace2=go.Histogram(x=df['Months_on_book'],
                 name='Months on book Histogram')

fig.add_trace(trace1,row=1,col=1)
fig.add_trace(trace2,row=2,col=1)

fig.update_layout(height=700,
                  width=1200,
                  title_text="Relationships on Bank")

fig.show()

The data shows a pronounced peak; compute the kurtosis:

In [20]:

print("Kurtosis:", df["Months_on_book"].kurt())

Kurtosis: 0.40010012019986707
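For reading this number: pandas `Series.kurt()` reports excess kurtosis under Fisher's definition, so a normal distribution scores about 0 and positive values indicate a sharper peak or heavier tails. The sketch below illustrates the scale on synthetic samples; the sample sizes, seed, and choice of a Laplace distribution are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic samples: a normal distribution and a more peaked Laplace distribution.
normal_sample = pd.Series(rng.normal(size=100_000))
laplace_sample = pd.Series(rng.laplace(size=100_000))

# Series.kurt() uses Fisher's definition: a normal distribution scores about 0.
normal_kurt = normal_sample.kurt()
laplace_kurt = laplace_sample.kurt()

print(f"normal: {normal_kurt:.3f}, laplace: {laplace_kurt:.3f}")
```

On that scale, a value of about 0.40 is a mild departure from normal-like peakedness.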

A violin plot shows the distribution:

In [21]:

px.violin(y=df["Months_on_book"])

In [22]:

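The distribution is interesting: the values around 36-37 stand out sharply, matching the histogram, which stretches the middle of the violin plot.

The same pattern appears for both customer types:

In this situation, the feature cannot be treated as normally distributed.

Credit limit (Credit_Limit)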

In [23]:

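fig = px.violin(y=df["Credit_Limit"])

fig.show()

Female customers generally have higher credit limits than male customers.

sns.displot(data=df, x="Credit_Limit", kde=True)

plt.show()

The distribution is heavily concentrated at low values with a long tail to the right, i.e. it is strongly right-skewed.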
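Skewness can also be measured rather than read off the plot. The sketch below uses pandas `Series.skew()` on synthetic data; the log-normal sample is a hypothetical stand-in for a long-tailed variable such as a credit limit, not the real column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical samples: a symmetric one and a long-right-tailed one.
symmetric = pd.Series(rng.normal(loc=10_000, scale=2_000, size=50_000))
long_tailed = pd.Series(rng.lognormal(mean=8.5, sigma=1.0, size=50_000))

# Positive skew means the tail extends to the right; values near 0 mean symmetry.
symmetric_skew = symmetric.skew()
long_tailed_skew = long_tailed.skew()

print(f"symmetric: {symmetric_skew:.3f}, long-tailed: {long_tailed_skew:.3f}")
```

Running `df["Credit_Limit"].skew()` on the real data would give the corresponding figure for this column.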

Transaction amount in the last 12 months (Total_Trans_Amt)

In [26]:

fig = make_subplots(rows=2, cols=1)

trace1=go.Box(x=df['Total_Trans_Amt'],
              name='Total_Trans_Amt Box Plot',
              boxmean=True)
trace2=go.Histogram(x=df['Total_Trans_Amt'],
                    name='Total_Trans_Amt Histogram')

fig.add_trace(trace1,row=1,col=1)
fig.add_trace(trace2,row=2,col=1)

fig.update_layout(height=700,
                  width=1200,
                  title_text="Distribution of transaction amount in the last 12 months")
fig.show()

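The histogram shows that transaction amounts over the last 12 months are multimodal, clustering into several groups. This suggests the feature can be used to split the data into groups that are then analyzed separately.

Existing and attrited customers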

In [27]:

fig = px.pie(df,
       names='Attrition_Flag',
       title='Share of existing and attrited customers',
       hole=0.33)

fig.show()

The two customer types are heavily imbalanced, so SMOTE resampling is used later to handle this.
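The degree of imbalance can be quantified from the class counts reported in the SMOTE section of this article (8500 existing and 1627 attrited customers). A small worked calculation:

```python
# Class counts as reported before resampling in this article.
counts = {"Existing Customer": 8500, "Attrited Customer": 1627}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# How many existing customers there are for each attrited customer.
imbalance_ratio = counts["Existing Customer"] / counts["Attrited Customer"]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

Roughly one customer in six has attrited, so a model that always predicts "existing" would already be right about 84% of the time, which is why F1 rather than accuracy is used for evaluation.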

Existing vs. attrited customers across multiple dimensions

In [28]:

df.columns

Out[28]:

Index(['Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count',
       'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category',
       'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
      dtype='object')

In [29]:

df1 = (df.groupby(["Gender","Education_Level","Marital_Status","Income_Category","Attrition_Flag"])
       .size()
       .reset_index()
       .rename(columns={0:"number"}))

df1.head()

fig = px.treemap(
    df1,
    path=[px.Constant("all"),"Gender","Education_Level","Marital_Status","Income_Category","Attrition_Flag"],  # key step: the hierarchy path
    values="number",
    color="Education_Level"
)

fig.update_traces(root_color="lightskyblue")

fig.update_layout(margin=dict(t=30,l=20,r=25,b=30))

fig.show()

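Data preprocessing

Feature encoding

In [31]:

df.Attrition_Flag = df.Attrition_Flag.replace({'Attrited Customer':1,'Existing Customer':0})

df.Gender = df.Gender.map({'F':1,'M':0})

In [32]:

# One-hot encoding

df = pd.concat([df,pd.get_dummies(df['Education_Level']).drop(columns=['Unknown'])],axis=1)

Education level is one-hot encoded and the Unknown column is dropped. The other categorical variables are handled the same way (for Card_Category, the Platinum column is dropped instead):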

In [33]:

df = pd.concat([df,pd.get_dummies(df['Income_Category']).drop(columns=['Unknown'])],axis=1)
df = pd.concat([df,pd.get_dummies(df['Marital_Status']).drop(columns=['Unknown'])],axis=1)
df = pd.concat([df,pd.get_dummies(df['Card_Category']).drop(columns=['Platinum'])],axis=1)
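To make the effect of dropping one dummy column concrete, here is a minimal sketch on a hypothetical four-row frame (the `toy` data is invented for illustration, and `dtype=int` is used only so the output prints as 0/1):

```python
import pandas as pd

# Hypothetical toy column standing in for df['Education_Level'].
toy = pd.DataFrame({"Education_Level": ["Graduate", "Unknown", "College", "Graduate"]})

# One indicator column per category, then drop the 'Unknown' indicator.
dummies = pd.get_dummies(toy["Education_Level"], dtype=int).drop(columns=["Unknown"])
encoded = pd.concat([toy, dummies], axis=1)

print(encoded)
```

After the drop, a row whose category was Unknown is represented by zeros in all remaining indicator columns, so no information is lost and one redundant column is avoided.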

The original columns are then dropped:

In [34]:

df.drop(columns = ['Education_Level','Income_Category', 'Marital_Status','Card_Category'],inplace=True)

Computing correlation coefficients

In [35]:

# Implemented with plotly

colorscale= [[1.0, "rgb(165,0,38)"],
            [0.8, "rgb(215,48,39)"],
            [0.7, "rgb(244,109,67)"],
            [0.6, "rgb(253,174,97)"],
            [0.5, "rgb(254,224,144)"],
            [0.4, "rgb(224,243,248)"],
            [0.3, "rgb(171,217,233)"],
            [0.2, "rgb(116,173,209)"],
            [0.1, "rgb(69,117,180)"],
            [0.0, "rgb(49,54,149)"]]

fig = make_subplots(rows=1,cols=1)

corr = df.corr("pearson")
x = corr.columns
y = corr.index
z = corr.values

fig.add_trace(go.Heatmap(x=x,y=y,z=z,
                         name="Correlation",
                         showscale=False,
                         xgap=0.7,
                         ygap=0.7,
                         colorscale=colorscale
                        ),row=1,col=1)

fig.update_layout(
    hoverlabel=dict(
        bgcolor="white",
        font_size=16,
        font_family="Rockwell"
    )
)
fig.update_layout(height=800,
                  width=800,
                  title_text="Correlation coefficients")
fig.show()

SMOTE resampling

In [36]:

# 1. Split features and target

X = df.iloc[:,1:]
y = df.iloc[:,0]

In [37]:

y.value_counts()  # before resampling

Out[37]:

0    8500
1    1627
Name: Attrition_Flag, dtype: int64

In [38]:

# 2. SMOTE resampling

sm = SMOTE(sampling_strategy="minority",
           k_neighbors=20,
           random_state=42)

X, y = sm.fit_resample(X, y)

In [39]:

# 3. All resampled features plus the label (Churn)

sm_df = X.assign(Churn=y)
sm_df.head()

Out[39]:

After resampling, the two classes have the same number of samples:

In [40]:

y.value_counts()  # after resampling

Out[40]:

0    8500
1    8500
Name: Attrition_Flag, dtype: int64
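For intuition about where the 6873 extra minority rows come from: SMOTE creates each synthetic sample by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that idea, not imbalanced-learn's implementation; `smote_like` and the four toy points are hypothetical.

```python
import numpy as np


def smote_like(X_min, n_new, k=2, seed=0):
    """Simplified SMOTE-style oversampling: each new point lies on the segment
    between a minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances among the minority points.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)            # starting points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    lam = rng.random((n_new, 1))                               # position on segment
    return X_min[base] + lam * (X_min[picked] - X_min[base])


# Hypothetical minority class: the four corners of the unit square.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(X_minority, n_new=6)
print(synthetic)
```

Because every synthetic point is a convex combination of two real minority points, the new rows stay inside the region already occupied by the minority class.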

Dimensionality reduction with PCA

Visualizing the reduction

In [41]:

sm_df.columns

Out[41]:

Index(['Customer_Age', 'Gender', 'Dependent_count', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio',
       'College', 'Doctorate', 'Graduate', 'High School', 'Post-Graduate',
       'Uneducated', '$120K +', '$40K - $60K', '$60K - $80K', '$80K - $120K',
       'Less than $40K', 'Divorced', 'Married', 'Single', 'Blue', 'Gold',
       'Silver', 'Churn'],
      dtype='object')

In [42]:

# Columns produced by one-hot encoding
one_hot_data = sm_df[sm_df.columns[15:-1]].copy()
# Resampled data without the one-hot columns
sm_df = sm_df.drop(columns=sm_df.columns[15:-1])

In [43]:

sm_df.shape

Out[43]:

(17000, 16)

In [44]:

one_hot_data.shape

Out[44]:

(17000, 17)

In [45]:

n = 8

pca = PCA(n_components=n)
pca_matrix = pca.fit_transform(one_hot_data)

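# Explained variance ratio of each component
evr = pca.explained_variance_ratio_
# Total explained variance (%)
total_var = evr.sum() * 100
# Cumulative explained variance ratio
cumsum_evr = np.cumsum(evr)


# ----------
trace1 = {
    "name": "Individual explained variance",
    "type": "bar",
    'y':evr}

trace2 = {
    "name": "Cumulative explained variance",
    "type": "scatter",
     'y':cumsum_evr}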

data = [trace1, trace2]
layout = {
    "xaxis": {"title": "Principal components"},
    "yaxis": {"title": "Explained variance ratio"},
  }

fig = go.Figure(data=data, layout=layout)
fig.update_layout(title='Explained Variance Using {} Dimensions'.format(n))
fig.show()

The goal is to retain at least 80% of the variance. After several attempts, n=8 turned out to meet that target almost exactly.
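The trial-and-error search for n can also be done in one pass by reading the cumulative explained variance. The sketch below runs on a random 0/1 matrix that merely stands in for `one_hot_data`; the matrix shape, the seed, and the 0.80 target are assumptions for illustration. scikit-learn's `PCA` also accepts a fraction as `n_components` when `svd_solver="full"`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical 0/1 indicator matrix standing in for one_hot_data.
toy_one_hot = (rng.random((500, 12)) < 0.3).astype(float)

# Fit with all components, then find the smallest count reaching the target.
full = PCA().fit(toy_one_hot)
cumulative = np.cumsum(full.explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, 0.80) + 1)

# Equivalent shortcut: pass the target fraction directly.
auto = PCA(n_components=0.80, svd_solver="full").fit(toy_one_hot)

print(n_needed, auto.n_components_)
```

Note that this picks the smallest n whose cumulative ratio is at least the target, so on the real data it may suggest one more component than a choice that falls just short of 80%.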

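Reduction result

Build a DataFrame from the retained components:

In [46]:

pca_df = pd.DataFrame(pca_matrix,
                      columns=["P{}".format(i) for i in range(0, n)])
pca_df.head()

Out[46]:

The resulting DataFrame of 8 principal components:

sm_df and pca_df are then concatenated to form the modeling data: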

In [47]:

df_model = pd.concat([sm_df, pca_df],axis=1)

Explained variance

In [48]:

fig = px.scatter_matrix(
    pca_df.values,
    color = df_model["Credit_Limit"],
    dimensions = range(n),
    labels = {str(i):'P{}'.format(i) for i in range(0,n)},
    title = f'Total Explained Variance: {total_var:.2f}%' # total explained variance
)

fig.update_traces(diagonal_visible=False)
fig.update_layout(
    coloraxis_colorbar=dict(
        title="Credit_Limit",  # credit limit
    ),
)
fig.show()

As shown, 79.61% of the variance is retained.

Correlation matrix revisited

In [49]:

# Implemented with plotly

colorscale=     [[1.0              , "rgb(165,0,38)"],
                [0.8888888888888888, "rgb(215,48,39)"],
                [0.7777777777777778, "rgb(244,109,67)"],
                [0.6666666666666666, "rgb(253,174,97)"],
                [0.5555555555555556, "rgb(254,224,144)"],
                [0.4444444444444444, "rgb(224,243,248)"],
                [0.3333333333333333, "rgb(171,217,233)"],
                [0.2222222222222222, "rgb(116,173,209)"],
                [0.1111111111111111, "rgb(69,117,180)"],
                [0.0               , "rgb(49,54,149)"]]

fig = make_subplots(rows=1,cols=1)

corr = df_model.corr("pearson")  # df ---> df_model
x = corr.columns
y = corr.index
z = corr.values

fig.add_trace(go.Heatmap(x=x,y=y,z=z,
                         name="Correlation",
                         showscale=False,
                         xgap=0.7,
                         ygap=0.7,
                         colorscale=colorscale
                        ),row=1,col=1)

fig.update_layout(
    hoverlabel=dict(
        bgcolor="white",
        font_size=16,
        font_family="Rockwell"
    )
)
fig.update_layout(height=800,
                  width=700,
                  title_text="Correlation coefficients (after PCA)")
fig.show()

Modeling

Extracting the feature matrix

In [50]:

df_model_corr = df_model.corr()

In [51]:

fig = plt.figure(figsize=(16,12))

heatmap = sns.heatmap(df_model_corr[["Churn"]].sort_values(by="Churn", ascending=False),
                     vmin=-1,
                     vmax=1,
                     annot=True,
                     cmap="coolwarm_r")
heatmap.set_title("Features Correlating with Churn",
                  fontdict={"fontsize":18},
                  pad=16)

fig.tight_layout(pad=5)

plt.show()

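The heatmap shows how strongly each feature correlates with Churn (the target variable). Most features whose absolute correlation is below 0.2 are dropped here; P2, P5, and Gender also fall under that threshold but are kept by the list used below:

In [53]:

df_model_corr.sort_values("Churn")["Churn"]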

Out[53]:

Total_Trans_Ct             -0.535498
P3                         -0.452090
Total_Ct_Chng_Q4_Q1        -0.405432
Total_Revolving_Bal        -0.347875
Total_Relationship_Count   -0.309749
Total_Trans_Amt            -0.257634
Avg_Utilization_Ratio      -0.249698
Total_Amt_Chng_Q4_Q1       -0.192537
P2                         -0.181671
P5                         -0.133274
P1                         -0.112865
Dependent_count            -0.098599
Gender                     -0.090104
Credit_Limit               -0.038770
Avg_Open_To_Buy            -0.004766
P6                         -0.003208
Months_on_book             -0.001183
Customer_Age                0.003416
P7                          0.010319
P4                          0.050547
P0                          0.064520
Months_Inactive_12_mon      0.086637
Contacts_Count_12_mon       0.154657
Churn                       1.000000
Name: Churn, dtype: float64

New feature matrix

In [54]:

no_use_col = ['Total_Amt_Chng_Q4_Q1', 'P1', 'Dependent_count',
              'Credit_Limit', 'P7', 'Avg_Open_To_Buy',
              'Customer_Age', 'Months_Inactive_12_mon',
              'Months_on_book', 'P6', 'P4', 'P0',
              'Contacts_Count_12_mon']
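The list above was written by hand. A list like it could also be derived from the correlation series itself, as in this sketch; `corr_with_churn` holds only a subset of the values printed above, and `threshold` and `weak_cols` are names introduced here for illustration.

```python
import pandas as pd

# A subset of the correlations with Churn listed above (values copied from Out[53]).
corr_with_churn = pd.Series({
    "Total_Trans_Ct": -0.535498,
    "P3": -0.452090,
    "Total_Amt_Chng_Q4_Q1": -0.192537,
    "Gender": -0.090104,
    "Contacts_Count_12_mon": 0.154657,
    "Churn": 1.000000,
})

threshold = 0.2

# Features (excluding the target itself) whose absolute correlation is below the threshold.
is_weak = corr_with_churn.drop("Churn").abs() < threshold
weak_cols = is_weak[is_weak].index.tolist()

print(weak_cols)
```

Applied to the full series with a strict 0.2 cut-off, this approach would also select Gender, P2, and P5, which the hand-written list keeps.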

In [55]:

df_new = df_model.drop(no_use_col, axis=1)
df_new.head()

Splitting the data

In [56]:

# Feature matrix and target variable for the new dataset

X_ = df_new.drop("Churn",axis=1)
y_ = df_new["Churn"]


X_train, X_test, y_train, y_test = train_test_split(X_, y_, test_size=0.3, random_state=42)
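One point to keep in mind when reading the scores that follow: SMOTE was applied to the whole dataset before this split, so synthetic training rows may have been interpolated from rows that end up in the test set, which can make test scores optimistic. The sketch below shows the alternative ordering on synthetic data: split first, then oversample only the training portion. It is an illustration under stated assumptions, not the pipeline used in this article: the arrays are hypothetical, and plain random duplication stands in for SMOTE.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical imbalanced data: 30 positives, 170 negatives.
X_all = rng.normal(size=(200, 3))
y_all = np.array([1] * 30 + [0] * 170)

# 1. Split first, keeping the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, stratify=y_all, random_state=42)

# 2. Oversample the minority class in the training part only
#    (random duplication here; SMOTE would be applied at this same step).
minority_idx = np.flatnonzero(y_tr == 1)
n_extra = int((y_tr == 0).sum() - (y_tr == 1).sum())
extra = rng.choice(minority_idx, size=n_extra, replace=True)

X_tr_bal = np.vstack([X_tr, X_tr[extra]])
y_tr_bal = np.concatenate([y_tr, y_tr[extra]])

print(np.bincount(y_tr_bal), np.bincount(y_te))
```

With this ordering the test set keeps the original class ratio and contains only real rows, so its scores reflect how the model would behave on unseen customers.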

Cross-validation

In [57]:

rf = Pipeline(steps =[('scale',StandardScaler()),  # scaling is optional for tree-based models
                      ("RF",RandomForestClassifier(random_state=42))])

ada = Pipeline(steps =[('scale',StandardScaler()),
                       ("ADB",AdaBoostClassifier(random_state=42,
                                                  learning_rate=0.6))])

svm = Pipeline(steps =[('scale',StandardScaler()),
                       ("SVC", SVC(random_state=42, kernel='rbf'))])

rf_f1_scores = cross_val_score(rf,X_train,y_train,
                              cv=5,scoring='f1')

ada_f1_scores = cross_val_score(ada,X_train,y_train,
                                cv=5,scoring='f1')
svm_f1_scores = cross_val_score(svm,X_train,y_train,
                                cv=5,scoring='f1')
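`cross_val_score` returns one score per fold, so a mean and standard deviation summarize each model compactly. A minimal sketch on synthetic data follows; `make_classification`, the sample size, and the reduced `n_estimators` are assumptions chosen for speed, not the article's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical binary classification data.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

model = Pipeline(steps=[("scale", StandardScaler()),
                        ("RF", RandomForestClassifier(n_estimators=50, random_state=42))])

# One F1 score per fold.
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="f1")
summary = {"mean": scores.mean(), "std": scores.std()}

print(scores.round(3), summary)
```

The same two lines of summary code can be applied to `rf_f1_scores`, `ada_f1_scores`, and `svm_f1_scores`.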

Comparing the results

In [58]:

length = len(rf_f1_scores)

x = list(range(length))

In [59]:

fig = make_subplots(rows=3,
                    cols=1,
                    shared_xaxes=True,
                    subplot_titles=('RF','Adaboost','SVM'))

fig.add_trace(
    go.Scatter(x=x,y=rf_f1_scores,name='Random Forest'),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(x=x,y=ada_f1_scores,name='Adaboost'),
    row=2, col=1
)
fig.add_trace(
    go.Scatter(x=x,y=svm_f1_scores,name='SVM'),
    row=3, col=1
)

fig.update_layout(height=700,
                  width=900,
                  title_text="Different Model 5 Fold Cross Validation")

fig.update_yaxes(title_text="F1 Score")

fig.update_xaxes(title_text="Fold #")

fig.show()

Model prediction

Predict with the three models:

In [60]:

rf.fit(X_train,y_train)
rf_prediction = rf.predict(X_test)

ada.fit(X_train,y_train)
ada_prediction = ada.predict(X_test)

svm.fit(X_train,y_train)
svm_prediction = svm.predict(X_test)

In [61]:

rf_f1 = f1(y_test, rf_prediction)
ada_f1 = f1(y_test, ada_prediction)
svm_f1 = f1(y_test, svm_prediction)
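As a reminder of what this metric measures, F1 is the harmonic mean of precision and recall for the positive (churn) class. The worked example below computes it by hand on two short hypothetical label vectors and compares the result with scikit-learn's `f1_score`, whose signature is `f1_score(y_true, y_pred)`.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: 1 = churned, 0 = stayed.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # correctly flagged churners
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false alarms
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # missed churners

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

f1_sklearn = f1_score(y_true, y_pred)
f1_swapped = f1_score(y_pred, y_true)

print(f1_manual, f1_sklearn, f1_swapped)
```

For binary labels, swapping the two arguments only exchanges false positives with false negatives, so the F1 value itself does not change; passing the true labels first still matters for metrics such as precision, recall, and the confusion matrix.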

Comparing the three models

F1 scores of the three models:

In [62]:

df_f1_test = pd.DataFrame({"Model":["RF","AdaBoost","SVM"],
                            "F1_score":[rf_f1,ada_f1,svm_f1]
                           })
df_f1_test

Random forest is clearly still the best. Compared with the original solution, the scores improve by 4-5 points.