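Diabetes Prediction with Machine Learning

Predicting Diabetes in Pima Indians with Machine Learning

This article builds classification models on the Pima Indians Diabetes dataset from Kaggle, a binary classification problem in machine learning. The source of the data: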

https://www.kaggle.com/code/vincentlugat/pima-indians-diabetes-eda-prediction-0-906/data

Importing libraries

import pandas as pd
import numpy as np

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.express as px
import plotly.offline as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.tools as tls
import plotly.figure_factory as ff
py.init_notebook_mode(connected=True)
import squarify

# Data processing and modeling
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split, RandomizedSearchCV
from sklearn.metrics import precision_score, recall_score, confusion_matrix,  roc_curve, precision_recall_curve, accuracy_score, roc_auc_score
import lightgbm as lgbm
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve,auc
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from yellowbrick.classifier import DiscriminationThreshold

import scipy.stats as ss
from numpy import interp  # scipy.interp was deprecated and later removed from SciPy; numpy.interp is a drop-in replacement
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

Data overview

Basic information

In [3]:

df = pd.read_csv("diabetes.csv")
df.head()

Out[3]:

Overall size of the dataset:

In [3]:

df.shape

Out[3]:

(768, 9)

Missing values per column:

In [4]:

df.isnull().sum()

Out[4]:

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [5]:

# Column data types
df.dtypes

Out[5]:

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

In [6]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

Descriptive statistics for the numeric fields. In this case every field is numeric:

In [7]:

df.describe()

Out[7]:

Field descriptions

In [8]:

columns = df.columns
columns

Out[8]:

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

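Pregnancies: number of pregnancies
Glucose: plasma glucose concentration
BloodPressure: diastolic blood pressure
SkinThickness: triceps skin-fold thickness
Insulin: serum insulin level
BMI: body mass index
DiabetesPedigreeFunction: diabetes pedigree function value
Age: age in years
Outcome: the class label; 0 = healthy, 1 = diabetic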

Comparing the groups by Outcome

Distribution of the class label:

In [9]:

D = df[df["Outcome"] != 0]  # diabetic
H = df[df["Outcome"] == 0]  # healthy

In [10]:

df1 = df["Outcome"].value_counts().reset_index()

df1.columns = ["Outcome","Count"]
df1

Out[10]:

	Outcome	Count
0	0	500
1	1	268
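
With 500 healthy and 268 diabetic records, the classes are moderately imbalanced. As a quick reference point (a small sketch added for illustration, not part of the original notebook), the share of positives and the accuracy of a trivial majority-class predictor follow directly from those two counts:

```python
# Class counts taken from the value_counts() output above.
counts = {0: 500, 1: 268}

total = sum(counts.values())
positive_share = counts[1] / total                 # fraction of diabetic records
majority_baseline = max(counts.values()) / total   # accuracy of always predicting the larger class

print(f"total={total}, positive share={positive_share:.3f}, "
      f"baseline accuracy={majority_baseline:.3f}")
```

Any model evaluated later should beat this baseline accuracy by a clear margin to be considered useful.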

In [11]:

fig = px.bar(df1,
             x="Count",
             y=["Healthy","Diabetic"],
             text="Count",
             orientation="h")

fig.show()

In [12]:

def target_percent():
    trace = go.Pie(
        labels = ['Healthy','Diabetic'], # class labels
        values = df['Outcome'].value_counts(),  # counts per class
        textfont=dict(size=15),   # font size
        opacity = 0.8,  # opacity
        marker=dict(colors=['lightskyblue', 'gold'], # slice colors
                    line=dict(color='#000000',
                                         width=1.5)
                   )
    )
    layout = dict(title='Distribution of Outcome variable')

    # Combine the trace and the layout
    fig = dict(data = [trace], layout=layout)
    py.iplot(fig)


target_percent()

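Handling missing values

Several fields contain the value 0, which is not physiologically plausible for them, so we treat 0 as a missing value here.

In [13]:

# Replace 0 with NaN in the affected columns
cols_with_zeros = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
df[cols_with_zeros] = df[cols_with_zeros].replace(0, np.nan)

Visualizing the share of missing values: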

In [14]:

def missing_plot(df, key):
    """
    Parameters: df  - the input DataFrame
                key - a column used to get the total row count
    """
    null_data = pd.DataFrame(len(df[key]) - df.isnull().sum(), columns=['Count'])
    percentage_null = pd.DataFrame((len(df[key]) - (len(df[key]) - df.isnull().sum())) / len(df[key]) * 100, columns=["Count"])
    percentage_null = percentage_null.round(2)

    trace = go.Bar(x=null_data.index,
                   y=null_data["Count"],
                  opacity=0.8,
                  text=percentage_null["Count"],
                  textposition="auto",
                  marker=dict(color = '#7EC0EE',
                              line=dict(color='#000000',
                                        width=1.5))
                  )
    layout = dict(title =  "missing Values (count & %)")

    fig = dict(data = [trace], layout=layout)
    py.iplot(fig)


# Call the function
missing_plot(df, "Outcome")
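
The figures behind such a chart can also be read off numerically. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) to show the zero-to-NaN replacement followed by a per-column count and percentage of missing entries:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "Age": [50, 31, 32, 21],
})

# Treat 0 as missing only in columns where 0 is not a plausible measurement.
zero_as_missing = ["Glucose", "Insulin"]
toy[zero_as_missing] = toy[zero_as_missing].replace(0, np.nan)

missing_count = toy.isnull().sum()
missing_pct = (toy.isnull().mean() * 100).round(2)
print(pd.DataFrame({"missing": missing_count, "percent": missing_pct}))
```

Restricting the replacement to selected columns matters: a 0 in Pregnancies or Outcome is a legitimate value and should not be turned into NaN.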

Field distributions

Distribution of the data in each field:

In [15]:

plt.style.use('ggplot') # set the plot style

f, ax = plt.subplots(figsize=(11, 15))

ax.set_facecolor('#eafbfa')
ax.set(xlim=(-.05, 200))

plt.ylabel('Variables')
plt.title("Overview Data Set")

ax = sns.boxplot(data = df,  # the data
                 orient = 'h',  # horizontal boxes
                 palette = 'Set2'
                )

Filling missing values with the median

In [21]:

df.isnull().sum() # missing values per column

Out[21]:

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

Group by Outcome and use each group's median to fill the missing values of a field:

In [22]:

def fill_median(col):
    temp = df[df[col].notnull()]
    temp = temp[[col, "Outcome"]].groupby(["Outcome"])[[col]].median().reset_index()
    return temp
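
The per-field steps that follow look up the two medians and write them back by hand. A more compact variant, shown here as a sketch on a toy frame (the helper name fill_with_group_median is hypothetical and not part of the original notebook), does both in one step with groupby and transform:

```python
import numpy as np
import pandas as pd

# Toy data: two Outcome groups, one missing Insulin value in each.
toy = pd.DataFrame({
    "Outcome": [0, 0, 0, 1, 1, 1],
    "Insulin": [100.0, np.nan, 105.0, 160.0, 180.0, np.nan],
})

def fill_with_group_median(frame, col, by="Outcome"):
    """Return frame[col] with NaNs replaced by the median of the row's `by` group."""
    group_median = frame.groupby(by)[col].transform("median")
    return frame[col].fillna(group_median)

toy["Insulin"] = fill_with_group_median(toy, "Insulin")
print(toy)
```

Note that imputing by Outcome uses the label itself, which is not available for new patients at prediction time, so scores obtained after this step may be optimistic.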

The Insulin field

In [23]:

# D: Outcome = 1 (diabetic)
# H: Outcome = 0 (healthy)

In [24]:

def plot_distribution(col, bin_size):
    temp1 = D[col]  # values of the field in each of the two groups
    temp2 = H[col]

    hist_data = [temp1, temp2]

    group_labels = ["D","H"]  # D-diabetic   H-healthy
    colors = ['#EFD700', '#7EC0EE']

    fig = ff.create_distplot(hist_data,
                            group_labels,
                            colors=colors,
                            show_hist=True,
                            bin_size=bin_size,
                            curve_type='kde')
    fig["layout"].update(title="Density plot")

    py.iplot(fig, filename="Density plot")

plot_distribution('Insulin', 0)

Check the median under each Outcome and fill with it:

In [25]:

fill_median("Insulin")

Out[25]:

	Outcome	Insulin
0	0	102.5
1	1	169.5

In [26]:

df.loc[(df['Outcome'] == 0 ) & (df['Insulin'].isnull()), 'Insulin'] = 102.5
df.loc[(df['Outcome'] == 1 ) & (df['Insulin'].isnull()), 'Insulin'] = 169.5

The Glucose field

In [27]:

plot_distribution("Glucose",0)

df.loc[(df['Outcome'] == 0 ) & (df['Glucose'].isnull()), 'Glucose'] = 107
df.loc[(df['Outcome'] == 1 ) & (df['Glucose'].isnull()), 'Glucose'] = 140

The SkinThickness field

In [30]:

plot_distribution("SkinThickness",0)

fill_median("SkinThickness")

df.loc[(df['Outcome'] == 0 ) & (df['SkinThickness'].isnull()), 'SkinThickness'] = 27
df.loc[(df['Outcome'] == 1 ) & (df['SkinThickness'].isnull()), 'SkinThickness'] = 32

The BMI field

In [33]:

plot_distribution("BMI",0)

The BloodPressure field

In [36]:

plot_distribution("BloodPressure", 0)

Distribution of a field that has no missing values:

In [39]:

plot_distribution("Age",0)

After filling, check the missing values again; none remain:

In [42]:

missing_plot(df,"Outcome")

Feature engineering and EDA

We define three plotting functions:

Pairwise feature relationships

In [43]:

def plot_feat1_feat2(feat1, feat2) :
    D = df[(df['Outcome'] != 0)]  # diabetic
    H = df[(df['Outcome'] == 0)]  # healthy
    trace0 = go.Scatter(
        x = D[feat1],  # scatter of the two features among diabetic patients
        y = D[feat2],
        name = 'diabetic',
        mode = 'markers',
        marker = dict(color = '#FFD700',
            line = dict(
                width = 1)))

    trace1 = go.Scatter(
        x = H[feat1],  # scatter of the two features among healthy people
        y = H[feat2],
        name = 'healthy',
        mode = 'markers',
        marker = dict(color = '#7EC0EE',
            line = dict(
                width = 1)))

    layout = dict(title = feat1 +" "+"vs"+" "+ feat2,
                  xaxis = dict(title = feat1,zeroline = False),
                  yaxis = dict(title = feat2,zeroline = False)
                 )

    plots = [trace0, trace1]

    fig = dict(data = plots, layout=layout)
    py.iplot(fig)

Plotting a chosen feature

In [44]:

def barplot(var_select, sub) :
    tmp1 = df[(df['Outcome'] != 0)]
    tmp2 = df[(df['Outcome'] == 0)]
    tmp3 = pd.DataFrame(pd.crosstab(df[var_select],df['Outcome']), )
    tmp3['% diabetic'] = tmp3[1] / (tmp3[1] + tmp3[0]) * 100

    color=['lightskyblue','gold' ]

    # add trace1
    trace1 = go.Bar(
        x=tmp1[var_select].value_counts().keys().tolist(),  # keys
        y=tmp1[var_select].value_counts().values.tolist(),  # values
        text=tmp1[var_select].value_counts().values.tolist(),
        textposition = 'auto',
        name='diabetic',
        opacity = 0.8,
        marker=dict(
            color='gold',
            line=dict(color='#000000',width=1)))

    # add trace2
    trace2 = go.Bar(
        x=tmp2[var_select].value_counts().keys().tolist(),
        y=tmp2[var_select].value_counts().values.tolist(),
        text=tmp2[var_select].value_counts().values.tolist(),
        textposition = 'auto',
        name='healthy',
        opacity = 0.8,
        marker=dict(
            color='lightskyblue',
            line=dict(color='#000000',width=1)))

    # add trace3
    trace3 = go.Scatter(
        x=tmp3.index,
        y=tmp3['% diabetic'],
        yaxis = 'y2',
        name='% diabetic',
        opacity = 0.6,
        marker=dict(
            color='black',
            line=dict(color='#000000',width=0.5
                     )
        ))

    # Layout settings
    layout = dict(title =  str(var_select)+' '+(sub),
                  xaxis=dict(),
                  yaxis=dict(title= 'Count'),
                  yaxis2=dict(range= [-0, 75],
                          overlaying= 'y',
                          anchor= 'x',
                          side= 'right',
                          zeroline=False,
                          showgrid= False,
                          title= '% diabetic'
                         ))

    fig = go.Figure(data=[trace1, trace2, trace3],
                    layout=layout)
    py.iplot(fig)

Share of each Outcome group for a chosen feature

In [45]:

def plot_pie(var_select, sub) :
    D = df[(df['Outcome'] != 0)]  # diabetic
    H = df[(df['Outcome'] == 0)]  # healthy

    # Color settings
    col =['Silver', 'mediumturquoise','#CF5C36','lightblue',
          'magenta', '#FF5D73','#F2D7EE','mediumturquoise']

    # One pie trace per group
    trace1 = go.Pie(values  = D[var_select].value_counts().values.tolist(),
                    labels  = D[var_select].value_counts().keys().tolist(),
                    textfont=dict(size=15),  # font size
                    opacity = 0.8,  # opacity
                    hole = 0.5,   # size of the donut hole
                    hoverinfo = "label+percent+name",  # hover text
                    domain = dict(x = [.0,.48]),
                    name = "Diabetic",
                    marker = dict(colors = col,
                                  line = dict(width = 1.5)))
    trace2 = go.Pie(values = H[var_select].value_counts().values.tolist(),
                    labels = H[var_select].value_counts().keys().tolist(),
                    textfont=dict(size=15),
                    opacity = 0.8,
                    hole = 0.5,
                    hoverinfo = "label+percent+name",
                    marker = dict(line = dict(width = 1.5)),
                    domain = dict(x = [.52,1]),
                    name = "Healthy")

    layout = go.Layout(dict(
        title = var_select + " distribution by target <br>"+(sub),
        annotations = [ dict(text = "Diabetic"+" : "+"268",
                             font = dict(size = 13),
                             showarrow = False,
                             x = .22,
                             y = -0.1),
                       dict(text = "Healthy"+" : "+"500",
                            font = dict(size = 13),
                            showarrow = False,
                            x = .8,
                            y = -.1)]))


    fig  = go.Figure(data = [trace1,trace2],layout = layout)
    py.iplot(fig)

Relationship between Glucose and Age

In [46]:

plot_feat1_feat2('Glucose','Age')

Draw a scatter plot of the two features and mark the dense region:

In [47]:

palette ={0 : 'lightblue', 1 : 'gold'}
edgecolor = 'black'

fig = plt.figure(figsize=(12,8))

ax1 = sns.scatterplot(x = df['Glucose'],
                      y = df['Age'],
                      hue = "Outcome",  # color points by class
                      data = df,
                      palette = palette,
                      edgecolor=edgecolor)
# Add the annotation
plt.annotate('N1',
             size=25,
             color='black',
             xy=(80, 30),
             xytext=(60, 35),
            arrowprops=dict(facecolor='black',
                            shrink=0.05),)

# Box around the marked region
plt.plot([50, 120], [30, 30], linewidth=2, color = 'red')
plt.plot([120, 120], [20, 30], linewidth=2, color = 'red')
plt.plot([50, 120], [20, 20], linewidth=2, color = 'red')
plt.plot([50, 50], [20, 30], linewidth=2, color = 'red')
plt.title('Glucose vs Age')
plt.show()

Create a new feature, N1:

In [48]:

df.loc[:,'N1']=0
df.loc[(df['Age']<=30) & (df['Glucose']<=120),'N1']=1

In [49]:

barplot('N1', ':Glucose <= 120 and Age <= 30')

A new feature based on BMI: N2

In [51]:

df.loc[:,'N2']=0
df.loc[(df['BMI']<=30),'N2']=1

In [52]:

barplot('N2', ': BMI <= 30')

Share under each Outcome:

In [53]:

plot_pie('N2', 'BMI <= 30')

Relationship between Pregnancies and Age

In [54]:

plot_feat1_feat2('Pregnancies','Age')

palette ={0 : 'lightblue', 1 : 'gold'}
edgecolor = 'black'

fig = plt.figure(figsize=(12,8))

ax1 = sns.scatterplot(x = df['Pregnancies'],
                      y = df['Age'],
                      hue = "Outcome",
                      data = df,
                      palette = palette,
                      edgecolor=edgecolor)

plt.annotate('N3',
             size=25,
             color='black',
             xy=(6, 25),
             xytext=(10, 25),
            arrowprops=dict(facecolor='black',
                            shrink=0.05),
            )
plt.plot([0, 6], [30, 30], linewidth=2, color = 'red')
plt.plot([6, 6], [20, 30], linewidth=2, color = 'red')
plt.plot([0, 6], [20, 20], linewidth=2, color = 'red')
plt.plot([0, 0], [20, 30], linewidth=2, color = 'red')
plt.title('Pregnancies vs Age')
plt.show()

Constructing feature N3:
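
The cell that builds N3 is not included in this write-up. Judging from the red box in the plot above (Pregnancies from 0 to 6, Age from 20 to 30) and from the pattern used for N1, it presumably looks like the sketch below. The thresholds are read off the plot and are an assumption; the rule is shown on a toy frame:

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Pregnancies": [1, 8, 3, 6],
    "Age": [25, 28, 45, 30],
})

# Assumed rule, inferred from the box drawn on the scatter plot:
# flag rows with Age <= 30 and Pregnancies <= 6.
toy["N3"] = ((toy["Age"] <= 30) & (toy["Pregnancies"] <= 6)).astype(int)
print(toy)
```

Applied to the real data, the same expression with df in place of toy would produce the binary N3 column that appears in the df.nunique() output of the modeling section.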

Relationship between Glucose and BloodPressure

In [59]:

px.scatter(df,
           x="Glucose",
           y="BloodPressure",
           color="Outcome")

palette ={0 : 'lightblue', 1 : 'gold'}
edgecolor = 'black'

fig = plt.figure(figsize=(12,8))

ax1 = sns.scatterplot(x = df['Glucose'],
                      y = df['BloodPressure'],
                      hue = "Outcome",
                      data = df,
                      palette = palette,
                      edgecolor=edgecolor)

plt.annotate('N4',
             size=25,
             color='black',
             xy=(70, 80),
             xytext=(50, 110),
            arrowprops=dict(facecolor='black', shrink=0.05),
            )

plt.plot([40, 105], [80, 80], linewidth=2, color = 'red')
plt.plot([40, 40], [20, 80], linewidth=2, color = 'red')
plt.plot([40, 105], [20, 20], linewidth=2, color = 'red')
plt.plot([105, 105], [20, 80], linewidth=2, color = 'red')
plt.title('Glucose vs BloodPressure')
plt.show()

Build the new feature N4:

In [62]:

# New feature

df.loc[:,'N4']=0
df.loc[(df['Glucose']<=105) & (df['BloodPressure']<=80),'N4']=1

Feature based on SkinThickness

In [63]:

df.loc[:,'N5']=0
df.loc[(df['SkinThickness']<=20) ,'N5']=1

In [64]:

barplot('N5', ':SkinThickness <= 20')
plot_pie('N5', 'SkinThickness <= 20')

Relationship between SkinThickness and BMI

In [65]:

plot_feat1_feat2('SkinThickness','BMI')

df.loc[:,'N6']=0
df.loc[(df['BMI']<30) & (df['SkinThickness']<=20),'N6']=1

palette ={0 : 'lightblue', 1 : 'gold'}
edgecolor = 'black'

fig = plt.figure(figsize=(12,8))

ax1 = sns.scatterplot(x = df['SkinThickness'],
                      y = df['BMI'],
                      hue = "Outcome",
                      data = df,
                      palette = palette,
                      edgecolor=edgecolor)

plt.annotate('N6',
             size=25,
             color='black',
             xy=(20, 20),
             xytext=(50, 25),
            arrowprops=dict(facecolor='black', shrink=0.05),
            )
plt.plot([0, 20], [30, 30], linewidth=2, color = 'red')
plt.plot([0, 0], [16, 30], linewidth=2, color = 'red')
plt.plot([0, 20], [16, 16], linewidth=2, color = 'red')
plt.plot([20, 20], [16, 30], linewidth=2, color = 'red')
plt.title('SkinThickness vs BMI')
plt.show()

barplot('N6', ': BMI < 30 and SkinThickness <= 20')
plot_pie('N6', 'BMI < 30 and SkinThickness <= 20')

Relationship between Glucose and BMI

In [69]:

plot_feat1_feat2('Glucose','BMI')

palette ={0 : 'lightblue', 1 : 'gold'}
edgecolor = 'black'

fig = plt.figure(figsize=(12,8))

ax1 = sns.scatterplot(x = df['Glucose'],
                      y = df['BMI'],
                      hue = "Outcome",
                      data = df,
                      palette = palette,
                      edgecolor=edgecolor)

plt.annotate('N7',
             size=25,
             color='black',
             xy=(70, 35),
             xytext=(40, 60),
            arrowprops=dict(facecolor='black', shrink=0.05),
            )

plt.plot([105, 105], [16, 30], linewidth=2, color = 'red')
plt.plot([40, 40], [16, 30], linewidth=2, color = 'red')
plt.plot([40, 105], [16, 16], linewidth=2, color = 'red')
plt.plot([40, 105], [30, 30], linewidth=2, color = 'red')
plt.title('Glucose vs BMI')
plt.show()

df.loc[:,'N7']=0
df.loc[(df['Glucose']<=105) & (df['BMI']<=30),'N7']=1

# barplot('N7', ': Glucose <= 105 and BMI <= 30')
# plot_pie('N7', 'Glucose <= 105 and BMI <= 30')

Feature based on Insulin

In [72]:

# Distribution of the feature
plot_distribution('Insulin', 0)

df.loc[:,'N9']=0
df.loc[(df['Insulin']<200),'N9']=1

# barplot('N9', ': Insulin < 200')

# plot_pie('N9', 'Insulin < 200')

Feature based on BloodPressure

In [74]:

df.loc[:,'N10']=0
df.loc[(df['BloodPressure']<80),'N10']=1

In [75]:

barplot('N10', ': BloodPressure < 80')
plot_pie('N10', 'BloodPressure < 80')

Feature based on Pregnancies

In [76]:

plot_distribution('Pregnancies', 0)

df.loc[:,'N11']=0
df.loc[(df['Pregnancies']<4) & (df['Pregnancies']!=0) ,'N11']=1

barplot('N11', ': Pregnancies > 0 and < 4')
plot_pie('N11', 'Pregnancies > 0 and < 4')

Other derived features

In [78]:

df['N0'] = df['BMI'] * df['SkinThickness']
df['N8'] =  df['Pregnancies'] / df['Age']
df['N13'] = df['Glucose'] / df['DiabetesPedigreeFunction']
df['N12'] = df['Age'] * df['DiabetesPedigreeFunction']
df['N14'] = df['Age'] / df['Insulin']

In [79]:

D = df[(df['Outcome'] != 0)]
H = df[(df['Outcome'] == 0)]

In [80]:

plot_distribution('N0', 0)

df.loc[:,'N15']=0
df.loc[(df['N0']<1034) ,'N15']=1

barplot('N15', ': N0 < 1034')
plot_pie('N15', 'N0 < 1034')

The data after constructing the new features:

Modeling

In [83]:

df.nunique() # number of unique values per column

Out[83]:

Pregnancies                  17
Glucose                     135
BloodPressure                47
SkinThickness                50
Insulin                     187
BMI                         247
DiabetesPedigreeFunction    517
Age                          52
Outcome                       2
N1                            2
N2                            2
N3                            2
N4                            2
N5                            2
N6                            2
N7                            2
N9                            2
N10                           2
N11                           2
N0                          637
N8                          206
N13                         765
N12                         741
N14                         435
N15                           2
dtype: int64

In [84]:

df.nunique()[df.nunique() < 12] # columns with fewer than 12 unique values

Out[84]:

Outcome    2
N1         2
N2         2
N3         2
N4         2
N5         2
N6         2
N7         2
N9         2
N10        2
N11        2
N15        2
dtype: int64

In [85]:

df.nunique()[df.nunique() < 12].keys().tolist()

Out[85]:

['Outcome',
 'N1',
 'N2',
 'N3',
 'N4',
 'N5',
 'N6',
 'N7',
 'N9',
 'N10',
 'N11',
 'N15']

Classifying the features

Based on the number of unique values, the features are split into numeric and categorical ones:

In [86]:

target_col = ["Outcome"]

# Categorical features (fewer than 12 unique values)
cat_cols = df.nunique()[df.nunique() < 12].keys().tolist()
cat_cols = [x for x in cat_cols]

print(cat_cols)


# Numeric features
num_cols = [x for x in df.columns if x not in cat_cols + target_col]

# Features with exactly two classes
bin_cols = df.nunique()[df.nunique() == 2].keys().tolist()

# Multi-class features
multi_cols = [i for i in cat_cols if i not in bin_cols]

['Outcome', 'N1', 'N2', 'N3', 'N4', 'N5', 'N6', 'N7', 'N9', 'N10', 'N11', 'N15']

Encoding

In [87]:

le = LabelEncoder()
for i in bin_cols:
    df[i] = le.fit_transform(df[i])

In [88]:

df = pd.get_dummies(data=df,columns=multi_cols)
df.head()

Standardization

In [89]:

std = StandardScaler()

scaled = std.fit_transform(df[num_cols])
scaled = pd.DataFrame(scaled, columns=num_cols)
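
For reference, StandardScaler rescales each column to zero mean and unit variance using the population standard deviation. The sketch below reproduces that arithmetic for a single column with the standard library only; the values are made up for illustration:

```python
from statistics import fmean, pstdev

ages = [21.0, 33.0, 50.0, 32.0]  # toy values, not taken from the dataset

mu = fmean(ages)
sigma = pstdev(ages)  # population standard deviation (ddof=0), the convention StandardScaler uses
z = [(a - mu) / sigma for a in ages]

print([round(v, 3) for v in z])
```

Because the mean and standard deviation are estimated from all rows here, the cross-validation folds used later are not fully independent of the scaling step; fitting the scaler inside each training fold would be the stricter option.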

In [90]:

df_copy = df.copy()
df = df.drop(columns=num_cols,axis=1)
df = df.merge(scaled,
              left_index=True,
              right_index=True,
              how="left")

df.head()

Out[90]:

Correlation matrix of the new data

In [91]:

corr = df.corr()
# print(corr)
matrix_cols = corr.columns.tolist()
corr_array = np.array(corr)


trace = go.Heatmap(x=matrix_cols,
                  y=matrix_cols,
                  z=corr_array,
                  colorscale="Viridis",
                  colorbar=dict())

layout = go.Layout(dict(title="New Correlation Matrix"),
                  margin=dict(r = 0 ,
                              l = 100,
                              t = 0,
                              b = 100),
                   yaxis=dict(tickfont=dict(size = 9)),
                   xaxis=dict(tickfont=dict(size = 9)),
                  )

fig = go.Figure(data=[trace], layout=layout)

py.iplot(fig)

Splitting features and labels

In [92]:

X = df.drop(columns='Outcome')  # the positional axis argument is no longer accepted by recent pandas
y = df['Outcome']

Model evaluation

We define a function, model_performance, to evaluate the performance of different classification models:

In [93]:

def model_performance(model, subtitle) :
    # Cross-validation splitter
    cv = KFold(n_splits=5,
               shuffle=True,
               random_state = 42)
    y_real = []
    y_proba = []
    tprs = []
    aucs = []
    mean_fpr = np.linspace(0,1,100)
    i = 1

    for train,test in cv.split(X,y):  # loop over the folds
        model.fit(X.iloc[train], y.iloc[train])  # train the model
        pred_proba = model.predict_proba(X.iloc[test])  # predicted probabilities
        # precision and recall
        precision, recall, _ = precision_recall_curve(y.iloc[test], pred_proba[:,1])
        y_real.append(y.iloc[test])
        y_proba.append(pred_proba[:,1])
        fpr, tpr, t = roc_curve(y.iloc[test], pred_proba[:, 1])
        tprs.append(np.interp(mean_fpr, fpr, tpr))
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)

    # Confusion matrix from out-of-fold predictions
    y_pred = cross_val_predict(model, X, y, cv=5)
    conf_matrix = confusion_matrix(y, y_pred)
    # Plot the confusion matrix
    trace1 = go.Heatmap(z = conf_matrix,
                        x = ["0 (pred)","1 (pred)"],
                        y = ["0 (true)","1 (true)"],
                        xgap = 2,
                        ygap = 2,
                        colorscale = 'Viridis',
                        showscale  = False)

    #Show metrics
    tp = conf_matrix[1,1]
    fn = conf_matrix[1,0]
    fp = conf_matrix[0,1]
    tn = conf_matrix[0,0]
    Accuracy  =  ((tp+tn)/(tp+tn+fp+fn))
    Precision =  (tp/(tp+fp))
    Recall    =  (tp/(tp+fn))
    F1_score  =  (2*(((tp/(tp+fp))*(tp/(tp+fn)))/((tp/(tp+fp))+(tp/(tp+fn)))))


    show_metrics = pd.DataFrame(data=[[Accuracy ,
                                       Precision,
                                       Recall,
                                       F1_score]])
    show_metrics = show_metrics.T

    colors = ['gold', 'lightgreen',
              'lightcoral', 'lightskyblue']

    trace2 = go.Bar(x = (show_metrics[0].values),
                    y = ['Accuracy', 'Precision', 'Recall', 'F1_score'],
                    text = np.round(show_metrics[0].values,4),
                    textposition = 'auto',
                    textfont=dict(color='black'),
                    orientation = 'h',
                    opacity = 1,
                    marker=dict(
                        color=colors,
                        line=dict(color='#000000',width=1.5)))

    # ROC curve
    mean_tpr = np.mean(tprs, axis=0)
    mean_auc = auc(mean_fpr, mean_tpr)

    trace3 = go.Scatter(x=mean_fpr,
                        y=mean_tpr,
                        name = "Roc : " ,
                        line = dict(color = ('rgb(22, 96, 167)'),
                                    width = 2),
                        fill='tozeroy')
    trace4 = go.Scatter(x = [0,1],y = [0,1],
                        line = dict(color = ('black'),
                                    width = 1.5,
                        dash = 'dot'))

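    #Precision - recall curve
    # Concatenate labels and scores fold by fold so that they stay aligned
    # (the shuffled folds do not follow the original row order)
    y_real = np.concatenate(y_real)
    y_proba = np.concatenate(y_proba)
    precision, recall, _ = precision_recall_curve(y_real, y_proba)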

    trace5 = go.Scatter(x = recall,
                        y = precision,
                        name = "Precision" + str(precision),
                        line = dict(color = ('lightcoral'),
                                    width = 2),
                        fill='tozeroy')

    mean_auc=round(mean_auc,3)

    # Build the 2x2 grid of subplots
    fig = tls.make_subplots(rows=2,
                            cols=2,
                            print_grid=False,
                            specs=[[{}, {}],
                                 [{}, {}]],
                            subplot_titles=('Confusion Matrix',
                                          'Metrics',
                                          'ROC curve'+" "+ '('+ str(mean_auc)+')',
                                          'Precision - Recall curve',
                                          ))
    # Add the traces to their subplots
    fig.append_trace(trace1,1,1)
    fig.append_trace(trace2,1,2)
    fig.append_trace(trace3,2,1)
    fig.append_trace(trace4,2,1)
    fig.append_trace(trace5,2,2)

    fig['layout'].update(showlegend = False,
                         title = '<b>Model performance report (5 folds)</b><br>'+subtitle,
                        autosize = False,
                        height = 830,
                        width = 830,
                        plot_bgcolor = 'black',
                        paper_bgcolor = 'black',
                        margin = dict(b = 195),
                        font=dict(color='white'))
    fig["layout"]["xaxis1"].update(color = 'white')
    fig["layout"]["yaxis1"].update(color = 'white')
    fig["layout"]["xaxis2"].update((dict(range=[0, 1],
                                         color = 'white')))
    fig["layout"]["yaxis2"].update(color = 'white')
    fig["layout"]["xaxis3"].update(dict(title = "false positive rate"),
                                   color = 'white')
    fig["layout"]["yaxis3"].update(dict(title = "true positive rate"),
                                   color = 'white')
    fig["layout"]["xaxis4"].update(dict(title = "recall"), range = [0,1.05],
                                   color = 'white')
    fig["layout"]["yaxis4"].update(dict(title = "precision"), range = [0,1.05],
                                   color = 'white')
    for annotation in fig['layout']['annotations']:
        annotation['font'] = dict(color='white', size = 14)
    py.iplot(fig)
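
The four scores shown in the Metrics panel are plain ratios of the confusion-matrix cells. The sketch below restates them as a standalone helper (the function name and the counts are illustrative, not results from this article) and adds guards for empty denominators, which the in-line formulas above do not have:

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from the cells of a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Toy counts for illustration only.
acc, prec, rec, f1 = metrics_from_confusion(tn=50, fp=10, fn=5, tp=35)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
```

For a screening task such as this one, recall on the diabetic class is usually the number to watch, since a false negative is a missed patient.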

A table of a model's scores on several metrics:

In [94]:

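def scores_table(model, subtitle):
    metrics = ['accuracy', 'precision',
               'recall', 'f1', 'roc_auc']
    res = []
    for sc in metrics:
        fold_scores = cross_val_score(model,
                                      X,
                                      y,
                                      cv = 5,
                                      scoring = sc)
        res.append(fold_scores)

    df = pd.DataFrame(res).T
    # Compute both summaries from the fold rows only, before appending them;
    # otherwise the std would also include the 'mean' row
    fold_mean, fold_std = df.mean(), df.std()
    df.loc['mean'] = fold_mean
    df.loc['std'] = fold_std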
    df= df.rename(columns={0: 'accuracy',
                           1:'precision',
                           2:'recall',
                           3:'f1',
                           4:'roc_auc'}
                 )

    trace = go.Table(
        # Header settings
        header=dict(values=['<b>Fold', '<b>Accuracy',
                            '<b>Precision', '<b>Recall',
                            '<b>F1 score', '<b>Roc auc'],

                    line = dict(color='#7D7F80'),
                    fill = dict(color='#a1c3d1'),
                    align = ['center'],
                    font = dict(size = 15)),

        cells=dict(values=[('1','2','3','4',
                            '5','mean', 'std'),
                           np.round(df['accuracy'],3),
                           np.round(df['precision'],3),
                           np.round(df['recall'],3),
                           np.round(df['f1'],3),
                           np.round(df['roc_auc'],3)],
                   line = dict(color='#7D7F80'),
                   fill = dict(color='#EDFAFF'),
                   align = ['center'],
                   font = dict(size = 15)))

    layout = dict(width=800,
                  height=400,
                  title = '<b>Cross Validation - 5 folds</b><br>'+subtitle,
                  font = dict(size = 15))
    fig = dict(data=[trace], layout=layout)

    py.iplot(fig, filename = 'styled_table')

Building the LightGBM model

In [95]:

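random_state=42

# Note: early_stopping_rounds and verbose were removed from fit() in LightGBM 4.0;
# newer versions expect callbacks=[lgbm.early_stopping(100)] instead.
fit_params = {"early_stopping_rounds" : 100,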
             "eval_metric" : 'auc',
             "eval_set" : [(X,y)],
             'eval_names': ['valid'],
             'verbose': 0,
             'categorical_feature': 'auto'}

param_test = {'learning_rate' : [0.01, 0.02, 0.03, 0.04, 0.05,
                                 0.08, 0.1, 0.2, 0.3, 0.4],
              'n_estimators' : [100, 200, 300, 400, 500,
                                600, 800, 1000, 1500, 2000],
              'num_leaves': sp_randint(6, 50),
              'min_child_samples': sp_randint(100, 500),
              'min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1,
                                   1, 1e1, 1e2, 1e3, 1e4],
              'subsample': sp_uniform(loc=0.2, scale=0.8),
              'max_depth': [-1, 1, 2, 3, 4, 5, 6, 7],
              'colsample_bytree': sp_uniform(loc=0.4, scale=0.6),
              'reg_alpha': [0, 1e-1, 1, 2, 5, 7, 10, 50, 100],
              'reg_lambda': [0, 1e-1, 1, 5, 10, 20, 50, 100]}

# Number of parameter settings sampled
n_iter = 300

# Base model
lgbm_clf = lgbm.LGBMClassifier(random_state=random_state,
                               silent=True,
                               metric='None',
                               n_jobs=4)

# Randomized search setup
grid_search = RandomizedSearchCV(
    estimator=lgbm_clf,
    param_distributions=param_test,
    n_iter=n_iter,
    scoring='accuracy',
    cv=5,
    refit=True,
    random_state=random_state,
    verbose=True)

grid_search.fit(X, y, **fit_params)
# Best parameter combination found by the search
opt_parameters =  grid_search.best_params_
# Rebuild the model with the best parameters
lgbm_clf = lgbm.LGBMClassifier(**opt_parameters)

Fitting 5 folds for each of 300 candidates, totalling 1500 fits
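
In param_test, plain lists are sampled uniformly while the SciPy objects are sampled through their rvs method, and their arguments are easy to misread. The short check below (a sketch, with sample size and seed chosen arbitrarily) shows the ranges that the two distributions used above actually cover:

```python
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

# uniform(loc, scale) covers [loc, loc + scale], so this is [0.2, 1.0], not [0.2, 0.8].
subsample_draws = sp_uniform(loc=0.2, scale=0.8).rvs(size=1000, random_state=0)

# randint(low, high) draws integers from low up to high - 1.
num_leaves_draws = sp_randint(6, 50).rvs(size=1000, random_state=0)

print(subsample_draws.min(), subsample_draws.max())
print(num_leaves_draws.min(), num_leaves_draws.max())
```

With n_iter = 300 sampled settings and 5 folds each, the search trains the 1500 models reported in the log line above.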

In [96]:

model_performance(lgbm_clf, 'LightGBM')
scores_table(lgbm_clf, 'LightGBM')

Building the KNN classifier

knn_clf = KNeighborsClassifier()

# Soft-voting ensemble
voting_clf = VotingClassifier(estimators=[
    ('lgbm_clf', lgbm_clf),
    ('knn', KNeighborsClassifier())],
                              voting='soft',
                              weights = [1,1])

params = {
      'knn__n_neighbors': np.arange(1,30)
      }

grid = GridSearchCV(estimator=voting_clf, param_grid=params, cv=5)

grid.fit(X,y)

print("Best Score:" + str(grid.best_score_))
print("Best Parameters: " + str(grid.best_params_))

# Output
Best Score:0.8919701213818861
Best Parameters: {'knn__n_neighbors': 9}
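
With voting='soft', the ensemble averages the class probabilities predicted by its members, weighted by weights, and picks the class with the highest average. A minimal sketch of that rule for the positive class of a single sample (the helper name and the probabilities are hypothetical):

```python
def soft_vote(probas, weights):
    """Weighted average of per-model positive-class probabilities."""
    total = sum(weights)
    return sum(p * w for p, w in zip(probas, weights)) / total

# Hypothetical probabilities from LightGBM and KNN for one sample.
p = soft_vote([0.8, 0.3], weights=[1, 1])
label = int(p >= 0.5)  # in the binary case this matches taking the argmax
print(p, label)
```

Equal weights, as used above, let a confident model outvote a hesitant one; unequal weights would shift that balance toward one of the two members.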

visualizer = DiscriminationThreshold(voting_clf)

visualizer.fit(X, y)
visualizer.show()  # poof() is the older name, deprecated since Yellowbrick 1.0
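
The discrimination threshold plot shows how precision, recall and F1 change as the probability cut-off for the positive class moves. The same idea can be sketched in a few lines of plain Python; the labels and probabilities below are made up for illustration, and sweep_thresholds is a hypothetical helper rather than a Yellowbrick function:

```python
def sweep_thresholds(y_true, y_prob, thresholds):
    """Return (threshold, precision, recall, f1) for each candidate threshold."""
    rows = []
    for t in thresholds:
        pred = [1 if p >= t else 0 for p in y_prob]
        tp = sum(1 for yt, yp in zip(y_true, pred) if yt == 1 and yp == 1)
        fp = sum(1 for yt, yp in zip(y_true, pred) if yt == 0 and yp == 1)
        fn = sum(1 for yt, yp in zip(y_true, pred) if yt == 1 and yp == 0)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        rows.append((t, precision, recall, f1))
    return rows

# Toy labels and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

rows = sweep_thresholds(y_true, y_prob, [0.3, 0.5, 0.7])
best = max(rows, key=lambda r: r[3])  # threshold with the highest F1
print(best)
```

Lowering the threshold trades precision for recall, which may be the preferable direction when missing a diabetic patient is costlier than a false alarm.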
