
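Diabetes Prediction with Machine Learning

Predicting Diabetes in Pima Indians with Machine Learning

This article builds models on a Pima Indians diabetes dataset from Kaggle. It is a binary classification problem in machine learning. Original data source: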

https://www.kaggle.com/code/vincentlugat/pima-indians-diabetes-eda-prediction-0-906/data


Importing Libraries

import pandas as pd
import numpy as np

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.express as px
import plotly.offline as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.tools as tls
import plotly.figure_factory as ff
py.init_notebook_mode(connected=True)
import squarify

# Data processing and modeling
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split, RandomizedSearchCV
from sklearn.metrics import precision_score, recall_score, confusion_matrix, roc_curve, precision_recall_curve, accuracy_score, roc_auc_score, auc
import lightgbm as lgbm
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from yellowbrick.classifier import DiscriminationThreshold

import scipy.stats as ss
from numpy import interp  # same function as the deprecated scipy.interp alias
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

Data Overview

Basic Information

In [3]:

df = pd.read_csv("diabetes.csv")
df.head()

Out[3]:

Overall size of the data:

In [3]:

df.shape

Out[3]:

(768, 9)

Missing values per field:

In [4]:

df.isnull().sum()

Out[4]:

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [5]:

# Data type of each field
df.dtypes

Out[5]:

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

In [6]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

View descriptive statistics for the numeric fields. In this dataset every field is numeric:

In [7]:

df.describe()

Out[7]:

Field Descriptions

In [8]:

columns = df.columns
columns

Out[8]:

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')
  • Pregnancies: number of pregnancies
  • Glucose: glucose level
  • BloodPressure: blood pressure
  • SkinThickness: skin thickness
  • Insulin: insulin level
  • BMI: body mass index
  • DiabetesPedigreeFunction: diabetes pedigree function value
  • Age: age
  • Outcome: the final class label; 0 = healthy, 1 = diabetic

Comparing Groups by Outcome

Distribution of the final class label:

In [9]:

D = df[df["Outcome"] != 0]  # diabetic
H = df[df["Outcome"] == 0]  # healthy

In [10]:

df1 = df["Outcome"].value_counts().reset_index()

df1.columns = ["Outcome","Count"]
df1

Out[10]:

   Outcome  Count
0        0    500
1        1    268

In [11]:

fig = px.bar(df1,
             x="Count",
             y=["Healthy","Diabetic"],
             text="Count",
             orientation="h")

fig.show()

In [12]:

def target_percent():
    trace = go.Pie(
        labels = ['Healthy','Diabetic'],         # class labels
        values = df['Outcome'].value_counts(),   # values
        textfont=dict(size=15),                  # font size
        opacity = 0.8,                           # opacity
        marker=dict(colors=['lightskyblue', 'gold'],  # colors
                    line=dict(color='#000000',
                              width=1.5)
                    )
    )
    layout = dict(title='Distribution of Outcome variable')

    # Combine the data and the layout
    fig = dict(data = [trace], layout=layout)
    py.iplot(fig)


target_percent()

Handling Missing Values

Some fields contain the value 0. Here we treat 0 in those fields as a missing value:

In [13]:

cols_with_zeros = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
df[cols_with_zeros] = df[cols_with_zeros].replace(0, np.nan)  # replace 0 with NaN
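The zero-to-NaN step can be tried on a small frame first. The sketch below uses made-up values (not rows from diabetes.csv) and a hypothetical `measured` list; it shows that only the listed columns are touched, so a legitimate 0 in Pregnancies survives.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "Insulin": [0, 0, 94],
    "Pregnancies": [6, 0, 8],  # 0 is a valid value here, so this column is left alone
})

# Only these columns treat 0 as "not measured"
measured = ["Glucose", "Insulin"]
toy[measured] = toy[measured].replace(0, np.nan)

missing_counts = toy.isnull().sum()
print(missing_counts)
```

Restricting the replacement to a column list is what keeps fields such as Pregnancies and Outcome, where 0 is meaningful, unchanged.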

Visualizing the proportion of missing values:

In [14]:

def missing_plot(df, key):
    """
    Parameters: df is the input data;
                key is the field used to count the rows
    """
    n_rows = len(df[key])
    # Number of non-missing values per field (bar height)
    null_data = pd.DataFrame(n_rows - df.isnull().sum(), columns=['Count'])
    # Percentage of missing values per field (bar label)
    percentage_null = pd.DataFrame(df.isnull().sum() / n_rows * 100, columns=["Count"])
    percentage_null = percentage_null.round(2)

    trace = go.Bar(x=null_data.index,
                   y=null_data["Count"],
                   opacity=0.8,
                   text=percentage_null["Count"],
                   textposition="auto",
                   marker=dict(color = '#7EC0EE',
                               line=dict(color='#000000',
                                         width=1.5))
                   )
    layout = dict(title = "missing Values (count & %)")

    fig = dict(data = [trace], layout=layout)
    py.iplot(fig)


# Call the function
missing_plot(df, "Outcome")
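The percentages shown on the bars can also be obtained without any plotting. This is a minimal sketch on invented values; it relies on the fact that the mean of a boolean mask is the fraction of True entries.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Insulin": [np.nan, 94.0, np.nan, 168.0],
    "BMI": [33.6, 26.6, np.nan, 28.1],
    "Age": [50, 31, 32, 21],
})

# isnull() gives a boolean frame; its column means are the missing fractions
missing_pct = (toy.isnull().mean() * 100).round(2)
print(missing_pct)
```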

Field Distributions

Distribution of the data in each field:

In [15]:

plt.style.use('ggplot')  # set the plot style

f, ax = plt.subplots(figsize=(11, 15))

ax.set_facecolor('#eafbfa')
ax.set(xlim=(-.05, 200))

plt.ylabel('Variables')
plt.title("Overview Data Set")

ax = sns.boxplot(data = df,        # data
                 orient = 'h',     # horizontal layout
                 palette = 'Set2'
                 )

Correlation Heatmap

Check how strongly the fields are correlated with each other:

In [16]:

corr = df.corr()
# print(corr)
matrix_cols = corr.columns.tolist()
corr_array = np.array(corr)

corr_array[:3]

Out[16]:

array([[ 1.        ,  0.12813455,  0.21417848,  0.10023907,  0.08217103,
         0.02171892, -0.03352267,  0.54434123,  0.22189815],
       [ 0.12813455,  1.        ,  0.22319178,  0.22804323,  0.58118621,
         0.23277051,  0.13724574,  0.26713555,  0.49465026],
       [ 0.21417848,  0.22319178,  1.        ,  0.22683907,  0.0982723 ,
         0.28923034, -0.00280453,  0.33010743,  0.17058928]])

In [17]:

corr.head()

Out[17]:

Correlation heatmap for the original data:

trace = go.Heatmap(x=matrix_cols,
                   y=matrix_cols,
                   z=corr_array,
                   colorscale="Viridis",
                   colorbar=dict())

layout = go.Layout(dict(title="Correlation Matrix for variables"),
                   margin=dict(r = 0,
                               l = 100,
                               t = 0,
                               b = 100),
                   yaxis=dict(tickfont=dict(size = 9)),
                   xaxis=dict(tickfont=dict(size = 9)),
                   )

fig = go.Figure(data=[trace], layout=layout)

py.iplot(fig)

A quick version with seaborn:

In [20]:

sns.heatmap(corr)

plt.show()

Filling Missing Values with the Median

In [21]:

df.isnull().sum()  # missing values per field

Out[21]:

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

Group by Outcome, compute the median of the field within each group, and use it to fill that field's missing values:

In [22]:

def fill_median(col):
    # Keep only the rows where the field is present
    temp = df[df[col].notnull()]
    # Median of the field for each Outcome value
    temp = temp[[col, "Outcome"]].groupby(["Outcome"])[[col]].median().reset_index()
    return temp
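The article looks up each group median with fill_median and then writes the numbers in by hand. A possible one-step alternative, shown here as a sketch on invented values rather than the author's method, is to let `groupby(...).transform("median")` align the group median to every row and pass it to `fillna`.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Outcome": [0, 0, 0, 1, 1, 1],
    "Insulin": [100.0, np.nan, 110.0, 150.0, 190.0, np.nan],
})

# Median of the observed values within each Outcome group, repeated on every row
group_median = toy.groupby("Outcome")["Insulin"].transform("median")
toy["Insulin"] = toy["Insulin"].fillna(group_median)
print(toy)
```

Because the fill value depends on Outcome, this kind of imputation can only be applied to rows whose label is already known.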

Field: Insulin

In [23]:

# D: Outcome=1
# H: Outcome=0

In [24]:

def plot_distribution(col, bin_size):
    temp1 = D[col]  # select the two groups
    temp2 = H[col]

    hist_data = [temp1, temp2]

    group_labels = ["D","H"]  # D-diabetic H-healthy
    colors = ['#EFD700', '#7EC0EE']

    fig = ff.create_distplot(hist_data,
                             group_labels,
                             colors=colors,
                             show_hist=True,
                             bin_size=bin_size,
                             curve_type='kde')
    fig["layout"].update(title="Density plot")

    py.iplot(fig, filename="Density plot")

plot_distribution('Insulin', 0)

View the median under each Outcome and fill with it:

In [25]:

fill_median("Insulin")

Out[25]:

   Outcome  Insulin
0        0    102.5
1        1    169.5

In [26]:

df.loc[(df['Outcome'] == 0 ) & (df['Insulin'].isnull()), 'Insulin'] = 102.5
df.loc[(df['Outcome'] == 1 ) & (df['Insulin'].isnull()), 'Insulin'] = 169.5

Field: Glucose

In [27]:

plot_distribution("Glucose",0)

df.loc[(df['Outcome'] == 0 ) & (df['Glucose'].isnull()), 'Glucose'] = 107
df.loc[(df['Outcome'] == 1 ) & (df['Glucose'].isnull()), 'Glucose'] = 140

Field: SkinThickness

In [30]:

plot_distribution("SkinThickness",0)

fill_median("SkinThickness")

df.loc[(df['Outcome'] == 0 ) & (df['SkinThickness'].isnull()), 'SkinThickness'] = 27
df.loc[(df['Outcome'] == 1 ) & (df['SkinThickness'].isnull()), 'SkinThickness'] = 32

Field: BMI

In [33]:

plot_distribution("BMI",0)

Field: BloodPressure

In [36]:

plot_distribution("BloodPressure", 0)

Distribution of the other fields, which have no missing values:

In [39]:

plot_distribution("Age",0)

After filling, check the missing values again. None remain:

In [42]:

missing_plot(df,"Outcome")

Feature Construction and EDA

Create three plotting functions:

Pairwise Feature Relationships

In [43]:

def plot_feat1_feat2(feat1, feat2):
    D = df[(df['Outcome'] != 0)]  # diabetic
    H = df[(df['Outcome'] == 0)]  # healthy
    trace0 = go.Scatter(
        x = D[feat1],  # scatter of the two features for diabetic patients
        y = D[feat2],
        name = 'diabetic',
        mode = 'markers',
        marker = dict(color = '#FFD700',
                      line = dict(
                          width = 1)))

    trace1 = go.Scatter(
        x = H[feat1],  # scatter of the two features for healthy people
        y = H[feat2],
        name = 'healthy',
        mode = 'markers',
        marker = dict(color = '#7EC0EE',
                      line = dict(
                          width = 1)))

    layout = dict(title = feat1 +" "+"vs"+" "+ feat2,
                  xaxis = dict(title = feat1,zeroline = False),
                  yaxis = dict(title = feat2,zeroline = False)
                  )

    plots = [trace0, trace1]

    fig = dict(data = plots, layout=layout)
    py.iplot(fig)

Plotting a Given Feature

In [44]:

def barplot(var_select, sub):
    tmp1 = df[(df['Outcome'] != 0)]
    tmp2 = df[(df['Outcome'] == 0)]
    tmp3 = pd.DataFrame(pd.crosstab(df[var_select],df['Outcome']), )
    tmp3['% diabetic'] = tmp3[1] / (tmp3[1] + tmp3[0]) * 100

    color=['lightskyblue','gold' ]

    # add trace1: counts for diabetic patients
    trace1 = go.Bar(
        x=tmp1[var_select].value_counts().keys().tolist(),    # keys
        y=tmp1[var_select].value_counts().values.tolist(),    # values
        text=tmp1[var_select].value_counts().values.tolist(),
        textposition = 'auto',
        name='diabetic',
        opacity = 0.8,
        marker=dict(
            color='gold',
            line=dict(color='#000000',width=1)))

    # add trace2: counts for healthy people
    trace2 = go.Bar(
        x=tmp2[var_select].value_counts().keys().tolist(),
        y=tmp2[var_select].value_counts().values.tolist(),
        text=tmp2[var_select].value_counts().values.tolist(),
        textposition = 'auto',
        name='healthy',
        opacity = 0.8,
        marker=dict(
            color='lightskyblue',
            line=dict(color='#000000',width=1)))

    # add trace3: percentage of diabetic patients on a second y axis
    trace3 = go.Scatter(
        x=tmp3.index,
        y=tmp3['% diabetic'],
        yaxis = 'y2',
        name='% diabetic',
        opacity = 0.6,
        marker=dict(
            color='black',
            line=dict(color='#000000',width=0.5
                      )
        ))

    # Set up the layout
    layout = dict(title = str(var_select)+' '+(sub),
                  xaxis=dict(),
                  yaxis=dict(title= 'Count'),
                  yaxis2=dict(range= [-0, 75],
                              overlaying= 'y',
                              anchor= 'x',
                              side= 'right',
                              zeroline=False,
                              showgrid= False,
                              title= '% diabetic'
                              ))

    fig = go.Figure(data=[trace1, trace2, trace3],
                    layout=layout)
    py.iplot(fig)

Share of Each Outcome Group for a Given Feature

In [45]:

def plot_pie(var_select, sub):
    D = df[(df['Outcome'] != 0)]  # diabetic
    H = df[(df['Outcome'] == 0)]  # healthy

    # Colors
    col =['Silver', 'mediumturquoise','#CF5C36','lightblue',
          'magenta', '#FF5D73','#F2D7EE','mediumturquoise']

    # One pie trace per group
    trace1 = go.Pie(values = D[var_select].value_counts().values.tolist(),
                    labels = D[var_select].value_counts().keys().tolist(),
                    textfont=dict(size=15),             # font size
                    opacity = 0.8,                      # opacity
                    hole = 0.5,                         # size of the hole in the donut
                    hoverinfo = "label+percent+name",   # hover information
                    domain = dict(x = [.0,.48]),
                    name = "Diabetic",
                    marker = dict(colors = col,
                                  line = dict(width = 1.5)))
    trace2 = go.Pie(values = H[var_select].value_counts().values.tolist(),
                    labels = H[var_select].value_counts().keys().tolist(),
                    textfont=dict(size=15),
                    opacity = 0.8,
                    hole = 0.5,
                    hoverinfo = "label+percent+name",
                    marker = dict(line = dict(width = 1.5)),
                    domain = dict(x = [.52,1]),
                    name = "Healthy")

    layout = go.Layout(dict(
        title = var_select + " distribution by target <br>"+(sub),
        annotations = [dict(text = "Diabetic"+" : "+"268",
                            font = dict(size = 13),
                            showarrow = False,
                            x = .22,
                            y = -0.1),
                       dict(text = "Healthy"+" : "+"500",
                            font = dict(size = 13),
                            showarrow = False,
                            x = .8,
                            y = -.1)]))

    fig = go.Figure(data = [trace1,trace2],layout = layout)
    py.iplot(fig)

Relationship Between Glucose and Age

In [46]:

plot_feat1_feat2('Glucose','Age')

Draw a scatter plot of the two features and mark the dense region:

In [47]:

palette ={0 : 'lightblue', 1 : 'gold'}
edgecolor = 'black'

fig = plt.figure(figsize=(12,8))

ax1 = sns.scatterplot(x = df['Glucose'],
                      y = df['Age'],
                      hue = "Outcome",  # class label
                      data = df,
                      palette = palette,
                      edgecolor=edgecolor)
# Add the annotation
plt.annotate('N1',
             size=25,
             color='black',
             xy=(80, 30),
             xytext=(60, 35),
             arrowprops=dict(facecolor='black',
                             shrink=0.05),)

# Outline of the marked region
plt.plot([50, 120], [30, 30], linewidth=2, color = 'red')
plt.plot([120, 120], [20, 30], linewidth=2, color = 'red')
plt.plot([50, 120], [20, 20], linewidth=2, color = 'red')
plt.plot([50, 50], [20, 30], linewidth=2, color = 'red')
plt.title('Glucose vs Age')
plt.show()

Create a new feature, N1:

In [48]:

df.loc[:,'N1']=0
df.loc[(df['Age']<=30) & (df['Glucose']<=120),'N1']=1

In [49]:

barplot('N1', ':Glucose <= 120 and Age <= 30')
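The N1 flag is a boolean condition stored as 0/1. The sketch below, on invented values, builds the same kind of flag by casting the mask directly; it is an equivalent formulation, not the article's code.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Age": [25, 45, 30, 22],
    "Glucose": [100, 110, 121, 120],
})

# True where the row falls inside the marked region, then cast True/False to 1/0
in_box = (toy["Age"] <= 30) & (toy["Glucose"] <= 120)
toy["N1"] = in_box.astype(int)
print(toy)
```

Both bounds are inclusive, so the row with Glucose 120 is flagged while the row with Glucose 121 is not.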

New Feature Based on BMI: N2

In [51]:

df.loc[:,'N2']=0
df.loc[(df['BMI']<=30),'N2']=1

In [52]:

barplot('N2', ': BMI <= 30')

Share within each Outcome group:

In [53]:

plot_pie('N2', 'BMI <= 30')

Relationship Between Pregnancies and Age

In [54]:

plot_feat1_feat2('Pregnancies','Age')

palette ={0 : 'lightblue', 1 : 'gold'}
edgecolor = 'black'

fig = plt.figure(figsize=(12,8))

ax1 = sns.scatterplot(x = df['Pregnancies'],
                      y = df['Age'],
                      hue = "Outcome",
                      data = df,
                      palette = palette,
                      edgecolor=edgecolor)

plt.annotate('N3',
             size=25,
             color='black',
             xy=(6, 25),
             xytext=(10, 25),
             arrowprops=dict(facecolor='black',
                             shrink=0.05),
             )
plt.plot([0, 6], [30, 30], linewidth=2, color = 'red')
plt.plot([6, 6], [20, 30], linewidth=2, color = 'red')
plt.plot([0, 6], [20, 20], linewidth=2, color = 'red')
plt.plot([0, 0], [20, 30], linewidth=2, color = 'red')
plt.title('Pregnancies vs Age')
plt.show()

Constructing feature N3:
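The cell that defines N3 does not appear in the text. As a sketch only, assuming N3 follows the same pattern as N1 and takes its thresholds from the red rectangle in the plot above (Pregnancies up to 6, Age up to 30), it could look like the following. The thresholds are an assumption read off the figure, and the data is invented.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Pregnancies": [1, 8, 6, 2],
    "Age": [22, 28, 30, 41],
})

# Assumed thresholds, taken from the red box in the plot (not confirmed by the source)
toy["N3"] = 0
toy.loc[(toy["Age"] <= 30) & (toy["Pregnancies"] <= 6), "N3"] = 1
print(toy)
```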

Relationship Between Glucose and BloodPressure

In [59]:

px.scatter(df,
           x="Glucose",
           y="BloodPressure",
           color="Outcome")

palette ={0 : 'lightblue', 1 : 'gold'}
edgecolor = 'black'

fig = plt.figure(figsize=(12,8))

ax1 = sns.scatterplot(x = df['Glucose'],
                      y = df['BloodPressure'],
                      hue = "Outcome",
                      data = df,
                      palette = palette,
                      edgecolor=edgecolor)

plt.annotate('N4',
             size=25,
             color='black',
             xy=(70, 80),
             xytext=(50, 110),
             arrowprops=dict(facecolor='black', shrink=0.05),
             )

plt.plot([40, 105], [80, 80], linewidth=2, color = 'red')
plt.plot([40, 40], [20, 80], linewidth=2, color = 'red')
plt.plot([40, 105], [20, 20], linewidth=2, color = 'red')
plt.plot([105, 105], [20, 80], linewidth=2, color = 'red')
plt.title('Glucose vs BloodPressure')
plt.show()

Constructing the new feature N4:

In [62]:

# New feature
df.loc[:,'N4']=0
df.loc[(df['Glucose']<=105) & (df['BloodPressure']<=80),'N4']=1

Feature Based on SkinThickness

In [63]:

df.loc[:,'N5']=0
df.loc[(df['SkinThickness']<=20) ,'N5']=1

In [64]:

barplot('N5', ':SkinThickness <= 20')
plot_pie('N5', 'SkinThickness <= 20')

Relationship Between SkinThickness and BMI

In [65]:

plot_feat1_feat2('SkinThickness','BMI')

df.loc[:,'N6']=0
df.loc[(df['BMI']<30) & (df['SkinThickness']<=20),'N6']=1

palette ={0 : 'lightblue', 1 : 'gold'}
edgecolor = 'black'

fig = plt.figure(figsize=(12,8))

ax1 = sns.scatterplot(x = df['SkinThickness'],
                      y = df['BMI'],
                      hue = "Outcome",
                      data = df,
                      palette = palette,
                      edgecolor=edgecolor)

plt.annotate('N6',
             size=25,
             color='black',
             xy=(20, 20),
             xytext=(50, 25),
             arrowprops=dict(facecolor='black', shrink=0.05),
             )
plt.plot([0, 20], [30, 30], linewidth=2, color = 'red')
plt.plot([0, 0], [16, 30], linewidth=2, color = 'red')
plt.plot([0, 20], [16, 16], linewidth=2, color = 'red')
plt.plot([20, 20], [16, 30], linewidth=2, color = 'red')
plt.title('SkinThickness vs BMI')
plt.show()

barplot('N6', ': BMI < 30 and SkinThickness <= 20')
plot_pie('N6', 'BMI < 30 and SkinThickness <= 20')

Relationship Between Glucose and BMI

In [69]:

plot_feat1_feat2('Glucose','BMI')

palette ={0 : 'lightblue', 1 : 'gold'}
edgecolor = 'black'

fig = plt.figure(figsize=(12,8))

ax1 = sns.scatterplot(x = df['Glucose'],
                      y = df['BMI'],
                      hue = "Outcome",
                      data = df,
                      palette = palette,
                      edgecolor=edgecolor)

plt.annotate('N7',
             size=25,
             color='black',
             xy=(70, 35),
             xytext=(40, 60),
             arrowprops=dict(facecolor='black', shrink=0.05),
             )

plt.plot([105, 105], [16, 30], linewidth=2, color = 'red')
plt.plot([40, 40], [16, 30], linewidth=2, color = 'red')
plt.plot([40, 105], [16, 16], linewidth=2, color = 'red')
plt.plot([40, 105], [30, 30], linewidth=2, color = 'red')
plt.title('Glucose vs BMI')
plt.show()

df.loc[:,'N7']=0
df.loc[(df['Glucose']<=105) & (df['BMI']<=30),'N7']=1

# barplot('N7', ': Glucose <= 105 and BMI <= 30')
# plot_pie('N7', 'Glucose <= 105 and BMI <= 30')

Feature Based on Insulin

In [72]:

# Distribution of the feature
plot_distribution('Insulin', 0)

df.loc[:,'N9']=0
df.loc[(df['Insulin']<200),'N9']=1

# barplot('N9', ': Insulin < 200')

# plot_pie('N9', 'Insulin < 200')

Feature Based on BloodPressure

In [74]:

df.loc[:,'N10']=0
df.loc[(df['BloodPressure']<80),'N10']=1

In [75]:

barplot('N10', ': BloodPressure < 80')
plot_pie('N10', 'BloodPressure < 80')

Feature Based on Pregnancies

In [76]:

plot_distribution('Pregnancies', 0)

df.loc[:,'N11']=0
df.loc[(df['Pregnancies']<4) & (df['Pregnancies']!=0) ,'N11']=1

barplot('N11', ': Pregnancies > 0 and < 4')
plot_pie('N11', 'Pregnancies > 0 and < 4')

Other Derived Features

In [78]:

df['N0'] = df['BMI'] * df['SkinThickness']
df['N8'] = df['Pregnancies'] / df['Age']
df['N13'] = df['Glucose'] / df['DiabetesPedigreeFunction']
df['N12'] = df['Age'] * df['DiabetesPedigreeFunction']
df['N14'] = df['Age'] / df['Insulin']

In [79]:

D = df[(df['Outcome'] != 0)]
H = df[(df['Outcome'] == 0)]

In [80]:

plot_distribution('N0', 0)

df.loc[:,'N15']=0
df.loc[(df['N0']<1034) ,'N15']=1

barplot('N15', ': N0 < 1034')
plot_pie('N15', 'N0 < 1034')

The data after constructing the new features:

Modeling

In [83]:

df.nunique()  # number of unique values in each field

Out[83]:

Pregnancies                  17
Glucose                     135
BloodPressure                47
SkinThickness                50
Insulin                     187
BMI                         247
DiabetesPedigreeFunction    517
Age                          52
Outcome                       2
N1                            2
N2                            2
N3                            2
N4                            2
N5                            2
N6                            2
N7                            2
N9                            2
N10                           2
N11                           2
N0                          637
N8                          206
N13                         765
N12                         741
N14                         435
N15                           2
dtype: int64

In [84]:

df.nunique()[df.nunique() < 12]  # fields with fewer than 12 unique values

Out[84]:

Outcome    2
N1         2
N2         2
N3         2
N4         2
N5         2
N6         2
N7         2
N9         2
N10        2
N11        2
N15        2
dtype: int64

In [85]:

df.nunique()[df.nunique() < 12].keys().tolist()

Out[85]:

['Outcome',
 'N1',
 'N2',
 'N3',
 'N4',
 'N5',
 'N6',
 'N7',
 'N9',
 'N10',
 'N11',
 'N15']

Feature Types

Based on the number of unique values in each feature, split the features into numeric and categorical ones:

In [86]:

target_col = ["Outcome"]

# Categorical features
cat_cols = df.nunique()[df.nunique() < 12].keys().tolist()
cat_cols = [x for x in cat_cols]

print(cat_cols)


# Numeric features
num_cols = [x for x in df.columns if x not in cat_cols + target_col]

# Features with exactly two classes
bin_cols = df.nunique()[df.nunique() == 2].keys().tolist()

# Multi-class features
multi_cols = [i for i in cat_cols if i not in bin_cols]
['Outcome', 'N1', 'N2', 'N3', 'N4', 'N5', 'N6', 'N7', 'N9', 'N10', 'N11', 'N15']
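The split by unique-value count can be checked on a small frame. In this sketch the data, the `Grade` column and the cutoff `max_levels` are all hypothetical; the article itself uses a cutoff of 12.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Outcome": [0, 1, 0, 1, 0, 1],
    "N1": [1, 0, 1, 1, 0, 0],
    "Grade": [0, 1, 2, 0, 1, 2],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
})

max_levels = 4  # hypothetical cutoff for this toy frame
n_unique = toy.nunique()

# Few distinct values: categorical; exactly two: binary; the rest: numeric
cat_cols = n_unique[n_unique < max_levels].index.tolist()
bin_cols = n_unique[n_unique == 2].index.tolist()
multi_cols = [c for c in cat_cols if c not in bin_cols]
num_cols = [c for c in toy.columns if c not in cat_cols]
print(cat_cols, bin_cols, multi_cols, num_cols)
```

In the article's data every categorical column is binary, so multi_cols comes out empty and the later get_dummies call has no columns to expand.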

Encoding

In [87]:

le = LabelEncoder()
for i in bin_cols:
    df[i] = le.fit_transform(df[i])

In [88]:

df = pd.get_dummies(data=df,columns=multi_cols)
df.head()

Standardization

In [89]:

std = StandardScaler()

scaled = std.fit_transform(df[num_cols])
scaled = pd.DataFrame(scaled, columns=num_cols)
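StandardScaler rescales each column to zero mean and unit variance, using the population standard deviation. The sketch below reproduces that formula by hand with pandas on invented values, so the arithmetic can be inspected without fitting a scaler.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"Age": [20.0, 30.0, 40.0, 50.0],
                    "BMI": [22.0, 26.0, 30.0, 34.0]})

# z = (x - mean) / std, with the population standard deviation (ddof=0)
scaled_toy = (toy - toy.mean()) / toy.std(ddof=0)
print(scaled_toy.round(3))
```

Note the `ddof=0`: pandas defaults to the sample standard deviation (ddof=1), which would give slightly different values than the scaler.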

In [90]:

df_copy = df.copy()
df = df.drop(columns=num_cols)
df = df.merge(scaled,
              left_index=True,
              right_index=True,
              how="left")

df.head()

Out[90]:

Correlation Matrix of the New Data

In [91]:

corr = df.corr()
# print(corr)
matrix_cols = corr.columns.tolist()
corr_array = np.array(corr)


trace = go.Heatmap(x=matrix_cols,
                   y=matrix_cols,
                   z=corr_array,
                   colorscale="Viridis",
                   colorbar=dict())

layout = go.Layout(dict(title="New Correlation Matrix"),
                   margin=dict(r = 0,
                               l = 100,
                               t = 0,
                               b = 100),
                   yaxis=dict(tickfont=dict(size = 9)),
                   xaxis=dict(tickfont=dict(size = 9)),
                   )

fig = go.Figure(data=[trace], layout=layout)

py.iplot(fig)

Splitting Features and Label

In [92]:

X = df.drop(columns='Outcome')  # keyword form; a positional axis is not accepted by recent pandas
y = df['Outcome']

Model Evaluation

Define a function, model_performance, to evaluate the performance of different classification models:

In [93]:

def model_performance(model, subtitle):
    # Cross-validation
    cv = KFold(n_splits=5,
               shuffle=True,
               random_state = 42)
    y_real = []
    y_proba = []
    tprs = []
    aucs = []
    mean_fpr = np.linspace(0,1,100)

    for train,test in cv.split(X,y):  # loop over the folds
        model.fit(X.iloc[train], y.iloc[train])          # train the model
        pred_proba = model.predict_proba(X.iloc[test])   # predicted probabilities
        # Precision and recall
        precision, recall, _ = precision_recall_curve(y.iloc[test], pred_proba[:,1])
        y_real.append(y.iloc[test])
        y_proba.append(pred_proba[:,1])
        fpr, tpr, t = roc_curve(y.iloc[test], pred_proba[:, 1])
        # Interpolate each fold's ROC curve onto a common FPR grid
        tprs.append(interp(mean_fpr, fpr, tpr))
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)

    # Compute the confusion matrix
    y_pred = cross_val_predict(model, X, y, cv=5)
    conf_matrix = confusion_matrix(y, y_pred)
    # Plot the confusion matrix
    trace1 = go.Heatmap(z = conf_matrix,
                        x = ["0 (pred)","1 (pred)"],
                        y = ["0 (true)","1 (true)"],
                        xgap = 2,
                        ygap = 2,
                        colorscale = 'Viridis',
                        showscale = False)

    # Show metrics
    tp = conf_matrix[1,1]
    fn = conf_matrix[1,0]
    fp = conf_matrix[0,1]
    tn = conf_matrix[0,0]
    Accuracy = ((tp+tn)/(tp+tn+fp+fn))
    Precision = (tp/(tp+fp))
    Recall = (tp/(tp+fn))
    F1_score = 2 * Precision * Recall / (Precision + Recall)

    show_metrics = pd.DataFrame(data=[[Accuracy,
                                       Precision,
                                       Recall,
                                       F1_score]])
    show_metrics = show_metrics.T

    colors = ['gold', 'lightgreen',
              'lightcoral', 'lightskyblue']

    trace2 = go.Bar(x = (show_metrics[0].values),
                    y = ['Accuracy', 'Precision', 'Recall', 'F1_score'],
                    text = np.round(show_metrics[0].values,4),
                    textposition = 'auto',
                    textfont=dict(color='black'),
                    orientation = 'h',
                    opacity = 1,
                    marker=dict(
                        color=colors,
                        line=dict(color='#000000',width=1.5)))

    # ROC curve
    mean_tpr = np.mean(tprs, axis=0)
    mean_auc = auc(mean_fpr, mean_tpr)

    trace3 = go.Scatter(x=mean_fpr,
                        y=mean_tpr,
                        name = "Roc : ",
                        line = dict(color = ('rgb(22, 96, 167)'),
                                    width = 2),
                        fill='tozeroy')
    trace4 = go.Scatter(x = [0,1],y = [0,1],
                        line = dict(color = ('black'),
                                    width = 1.5,
                                    dash = 'dot'))

    # Precision-recall curve
    # The folds are shuffled, so concatenate the true labels in the same
    # fold order as the predicted probabilities
    y_real = np.concatenate(y_real)
    y_proba = np.concatenate(y_proba)
    precision, recall, _ = precision_recall_curve(y_real, y_proba)

    trace5 = go.Scatter(x = recall,
                        y = precision,
                        name = "Precision-Recall",
                        line = dict(color = ('lightcoral'),
                                    width = 2),
                        fill='tozeroy')

    mean_auc=round(mean_auc,3)

    # Build the 2x2 grid of subplots
    fig = tls.make_subplots(rows=2,
                            cols=2,
                            print_grid=False,
                            specs=[[{}, {}],
                                   [{}, {}]],
                            subplot_titles=('Confusion Matrix',
                                            'Metrics',
                                            'ROC curve'+" "+ '('+ str(mean_auc)+')',
                                            'Precision - Recall curve',
                                            ))
    # Add the traces to their subplots
    fig.append_trace(trace1,1,1)
    fig.append_trace(trace2,1,2)
    fig.append_trace(trace3,2,1)
    fig.append_trace(trace4,2,1)
    fig.append_trace(trace5,2,2)

    fig['layout'].update(showlegend = False,
                         title = '<b>Model performance report (5 folds)</b><br>'+subtitle,
                         autosize = False,
                         height = 830,
                         width = 830,
                         plot_bgcolor = 'black',
                         paper_bgcolor = 'black',
                         margin = dict(b = 195),
                         font=dict(color='white'))
    fig["layout"]["xaxis1"].update(color = 'white')
    fig["layout"]["yaxis1"].update(color = 'white')
    fig["layout"]["xaxis2"].update((dict(range=[0, 1],
                                         color = 'white')))
    fig["layout"]["yaxis2"].update(color = 'white')
    fig["layout"]["xaxis3"].update(dict(title = "false positive rate"),
                                   color = 'white')
    fig["layout"]["yaxis3"].update(dict(title = "true positive rate"),
                                   color = 'white')
    fig["layout"]["xaxis4"].update(dict(title = "recall"), range = [0,1.05],
                                   color = 'white')
    fig["layout"]["yaxis4"].update(dict(title = "precision"), range = [0,1.05],
                                   color = 'white')
    for annotation in fig['layout']['annotations']:
        annotation['font'] = dict(color='white', size = 14)
    py.iplot(fig)
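The four metrics in the bar chart are derived from the confusion-matrix counts alone. A minimal sketch with hypothetical counts (chosen for round numbers, not taken from any model in this article) makes the formulas easy to verify by hand.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration
tn, fp, fn, tp = 90, 10, 20, 80

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that are found
f1_score = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1_score, 4))
```

The harmonic-mean form of F1 is algebraically the same as 2·tp / (2·tp + fp + fn), which is a convenient cross-check.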

Displaying each model's scores on several metrics:

In [94]:

def scores_table(model, subtitle):
    metric_names = ['accuracy', 'precision',
                    'recall', 'f1', 'roc_auc']
    res = []
    for sc in metric_names:
        fold_scores = cross_val_score(model,
                                      X,
                                      y,
                                      cv = 5,
                                      scoring = sc)
        res.append(fold_scores)

    table = pd.DataFrame(res).T
    # Compute both summary rows from the five fold rows only,
    # before either of them is appended to the table
    fold_mean = table.mean()
    fold_std = table.std()
    table.loc['mean'] = fold_mean
    table.loc['std'] = fold_std
    table = table.rename(columns={0: 'accuracy',
                                  1: 'precision',
                                  2: 'recall',
                                  3: 'f1',
                                  4: 'roc_auc'}
                         )

    trace = go.Table(
        # Header settings
        header=dict(values=['<b>Fold', '<b>Accuracy',
                            '<b>Precision', '<b>Recall',
                            '<b>F1 score', '<b>Roc auc'],

                    line = dict(color='#7D7F80'),
                    fill = dict(color='#a1c3d1'),
                    align = ['center'],
                    font = dict(size = 15)),

        cells=dict(values=[('1','2','3','4',
                            '5','mean', 'std'),
                           np.round(table['accuracy'],3),
                           np.round(table['precision'],3),
                           np.round(table['recall'],3),
                           np.round(table['f1'],3),
                           np.round(table['roc_auc'],3)],
                   line = dict(color='#7D7F80'),
                   fill = dict(color='#EDFAFF'),
                   align = ['center'],
                   font = dict(size = 15)))

    layout = dict(width=800,
                  height=400,
                  title = '<b>Cross Validation - 5 folds</b><br>'+subtitle,
                  font = dict(size = 15))
    fig = dict(data=[trace], layout=layout)

    py.iplot(fig, filename = 'styled_table')
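When a mean row and a std row are appended to a table of fold scores, the order of operations matters: if the mean row is added first and the standard deviation is computed afterwards, the mean row is counted as a sixth observation. The sketch below uses hypothetical fold accuracies to show the size of that effect.

```python
import pandas as pd

# Hypothetical per-fold accuracies, made up for illustration
folds = pd.Series([0.80, 0.85, 0.90, 0.85, 0.80], name="accuracy")

fold_mean = folds.mean()
fold_std = folds.std()  # computed from the five folds only

# Appending the mean as an extra row first shrinks the measured spread
with_mean_row = pd.concat([folds, pd.Series([fold_mean])], ignore_index=True)
std_after_append = with_mean_row.std()

print(round(fold_std, 4), round(std_after_append, 4))
```

Computing both summaries from the fold rows before appending either one avoids the understated spread.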

Building the LightGBM Model

In [95]:

random_state=42

# Note: these fit() keyword arguments follow older LightGBM releases;
# newer releases may expect early stopping to be configured through callbacks
fit_params = {"early_stopping_rounds" : 100,
              "eval_metric" : 'auc',
              "eval_set" : [(X,y)],
              'eval_names': ['valid'],
              'verbose': 0,
              'categorical_feature': 'auto'}

param_test = {'learning_rate' : [0.01, 0.02, 0.03, 0.04, 0.05,
                                 0.08, 0.1, 0.2, 0.3, 0.4],
              'n_estimators' : [100, 200, 300, 400, 500,
                                600, 800, 1000, 1500, 2000],
              'num_leaves': sp_randint(6, 50),
              'min_child_samples': sp_randint(100, 500),
              'min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1,
                                   1, 1e1, 1e2, 1e3, 1e4],
              'subsample': sp_uniform(loc=0.2, scale=0.8),
              'max_depth': [-1, 1, 2, 3, 4, 5, 6, 7],
              'colsample_bytree': sp_uniform(loc=0.4, scale=0.6),
              'reg_alpha': [0, 1e-1, 1, 2, 5, 7, 10, 50, 100],
              'reg_lambda': [0, 1e-1, 1, 5, 10, 20, 50, 100]}

# Number of parameter settings to sample
n_iter = 300

# Initial model
lgbm_clf = lgbm.LGBMClassifier(random_state=random_state,
                               silent=True,
                               metric='None',
                               n_jobs=4)

# Randomized search setup
grid_search = RandomizedSearchCV(
    estimator=lgbm_clf,
    param_distributions=param_test,
    n_iter=n_iter,
    scoring='accuracy',
    cv=5,
    refit=True,
    random_state=random_state,
    verbose=True)

grid_search.fit(X, y, **fit_params)
# Best parameter combination found by the search
opt_parameters = grid_search.best_params_
# Rebuild the model with the best parameters
lgbm_clf = lgbm.LGBMClassifier(**opt_parameters)
Fitting 5 folds for each of 300 candidates, totalling 1500 fits

In [96]:

model_performance(lgbm_clf, 'LightGBM')
scores_table(lgbm_clf, 'LightGBM')

Building the KNN Classification Model

knn_clf = KNeighborsClassifier()

# Voting ensemble
voting_clf = VotingClassifier(estimators=[
    ('lgbm_clf', lgbm_clf),
    ('knn', KNeighborsClassifier())],
    voting='soft',
    weights = [1,1])

params = {
    'knn__n_neighbors': np.arange(1,30)
}

grid = GridSearchCV(estimator=voting_clf, param_grid=params, cv=5)

grid.fit(X,y)

print("Best Score:" + str(grid.best_score_))
print("Best Parameters: " + str(grid.best_params_))

# Result
Best Score:0.8919701213818861
Best Parameters: {'knn__n_neighbors': 9}
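With voting='soft', the ensemble averages the class probabilities of its members (weighted by weights) and predicts the class with the highest average. The sketch below reproduces that rule with NumPy on hypothetical probabilities; the arrays stand in for the outputs of the two models and are not real predictions.

```python
import numpy as np

# Hypothetical class probabilities [P(0), P(1)] from two models for three samples
proba_lgbm = np.array([[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]])
proba_knn = np.array([[0.7, 0.3], [0.8, 0.2], [0.4, 0.6]])
weights = [1, 1]

# Soft voting: weighted average of the probabilities, then argmax per sample
avg_proba = np.average([proba_lgbm, proba_knn], axis=0, weights=weights)
voted = avg_proba.argmax(axis=1)
print(avg_proba, voted)
```

For the second sample the two models disagree; the more confident one (0.8 for class 0) outweighs the other (0.6 for class 1), which is the behavior that distinguishes soft voting from a simple majority vote.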
visualizer = DiscriminationThreshold(voting_clf)

visualizer.fit(X, y)
visualizer.show()  # called poof() in older Yellowbrick releases
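A discrimination-threshold plot shows how precision and recall move as the probability cutoff for the positive class changes. The underlying calculation can be sketched with NumPy on hypothetical labels and scores; `precision_recall_at` is a helper defined here for illustration, not a library function.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.2, 0.6, 0.7, 0.4, 0.9, 0.1])

def precision_recall_at(threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    y_pred = (y_score >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.3, 0.5, 0.8):
    print(threshold, precision_recall_at(threshold))
```

Raising the threshold trades recall for precision, which is why the choice of cutoff depends on whether missed patients or false alarms are more costly.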

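Getting the Data

To get the dataset used in this article, follow 【尤而小屋】 and reply with 糖尿病 (diabetes) to receive it.

Article title: Diabetes Prediction with Machine Learning

Published: December 24, 2022, 23:12

Original link: http://www.renpeter.cn/2022/12/24/%E5%9F%BA%E4%BA%8E%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E7%9A%84%E7%B3%96%E5%B0%BF%E7%97%85%E9%A2%84%E6%B5%8B.html

License: Attribution-NonCommercial-NoDerivatives 4.0 International. When reposting, please keep the original link and the author.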
