Fork me on GitHub

kaggle实战-基于机器学习的中风病人预测

基于随机森林、逻辑回归、SVM的中风病人预测

本文是kaggle上一个关于中风案例的数据集建模分析,文章的主要内容:

  1. 不同字段的分布情况
  2. 基于随机森林的缺失值填充
  3. 中风和未中风的样本不均衡如何解决?
  4. 不同字段的编码工作
  5. 交叉验证
  6. 网格搜索调参

原数据地址:https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset?datasetId=1120859&sortBy=voteCount&select=healthcare-dataset-stroke-data.csv

导入库

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
import numpy as np
import pandas as pd

# 绘图
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.gridspec as grid_spec
import seaborn as sns
plt.style.use("fivethirtyeight")

import plotly.express as px
import plotly.graph_objs as go

# 采样
from imblearn.over_sampling import SMOTE

# 数据标准化、分割、交叉验证
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,LabelEncoder
from sklearn.model_selection import train_test_split,cross_val_score

# 各种模型
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.tree import DecisionTreeRegressor,DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# 模型评价
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score, precision_score, f1_score
import warnings
warnings.filterwarnings('ignore')

数据基本信息

先把数据导进来,查看数据的基本信息

image-20220625103148872

下面我们查看数据基本信息

In [3]:

1
df.shape

Out[3]:

1
(5110, 12)

In [4]:

1
df.dtypes

Out[4]:

1
2
3
4
5
6
7
8
9
10
11
12
13
id                     int64
gender object
age float64
hypertension int64
heart_disease int64
ever_married object # 字符型
work_type object
Residence_type object
avg_glucose_level float64
bmi float64
smoking_status object
stroke int64
dtype: object

In [5]:

1
df.describe()  # 描述统计信息

Out[5]:

字段分布

gender统计

In [6]:

1
2
3
4
plt.figure(1, figsize=(12,5))

sns.countplot(y="gender", data=df)
plt.show()

age分布

In [7]:

1
px.violin(y=df["age"])

1
2
3
4
5
fig = px.histogram(df,
x="age",
color_discrete_sequence=['firebrick'])

fig.show()

ever_married

In [9]:

1
2
3
4
5
plt.figure(1, figsize=(12,5))

sns.countplot(y="ever_married", data=df)

plt.show()

本数据集中的结婚人士大约是未结婚的两倍。

work-type

查看不同工作状态的人员数量

In [10]:

1
2
3
4
5
plt.figure(1, figsize=(12,8))

sns.countplot(y="work_type", data=df)

plt.show()

Residence_type

In [11]:

1
2
3
4
5
plt.figure(1, figsize=(12,8))

sns.countplot(y="Residence_type", data=df)

plt.show()

avg_glucose_level

血糖水平的分布

In [12]:

1
2
3
4
5
fig = px.histogram(df,
x="avg_glucose_level",
color_discrete_sequence=['firebrick'])

fig.show()

可以看到大部分人的血糖还是在100以下,说明是正常的

bmi

bmi指标的分布情况

In [13]:

1
2
3
4
5
fig = px.histogram(df,
x="bmi",
color_discrete_sequence=['firebrick'])

fig.show()

bmi指标的均值大约在28左右,呈现一定的正态分布

smoking_status

抽烟情况的统计

In [14]:

1
2
3
4
5
plt.figure(1, figsize=(12,8))

sns.countplot(y="smoking_status", data=df)

plt.show()

可以看到抽烟或者曾经抽烟的人相对来说是少一些的

缺失值情况

缺失值统计

In [15]:

1
df.isnull().sum()

Out[15]:

1
2
3
4
5
6
7
8
9
10
11
12
13
id                     0
gender 0
age 0
hypertension 0
heart_disease 0
ever_married 0
work_type 0
Residence_type 0
avg_glucose_level 0
bmi 201
smoking_status 0
stroke 0
dtype: int64

In [16]:

1
201 / len(df)  # 缺失比例

Out[16]:

1
0.03933463796477495

缺失值可视化

In [17]:

1
2
3
4
5
6
7
8
9
plt.title('Missing Value Status',fontweight='bold')
ax = sns.heatmap(df.isna().sum().to_frame(),
annot=True,
fmt='d',
cmap='vlag')

ax.set_xlabel('Amount Missing')

plt.show()

1
2
3
import missingno as mso
mso.bar(df,color="blue")
plt.show()

缺失值处理

使用决策树回归来预测缺失值的BMI值:通过年龄、性别和现有的bmi值来进行预测填充

In [19]:

1
2
3
dt_bmi = Pipeline(steps=[("scale",StandardScaler()), # 数据标准化
("lr",DecisionTreeRegressor(random_state=42))
])

In [20]:

1
2
3
4
5
6
X = df[["age","gender","bmi"]].copy()

dic = {"Male":0, "Female":1, "Other":-1}

X["gender"] = X["gender"].map(dic).astype(np.uint8)
X.head()

取出非缺失值的部分进行训练:

1
2
3
4
5
6
# 缺失值部分
missing = X[X.bmi.isna()]

# 非缺失值部分
X = X[~X.bmi.isna()]
y = X.pop("bmi")
1
2
3
# 模型训练

dt_bmi.fit(X,y)

Out[22]:

1
2
Pipeline(steps=[('scale', StandardScaler()),
('lr', DecisionTreeRegressor(random_state=42))])

In [23]:

1
2
3
4
# 模型预测

y_pred = dt_bmi.predict(missing[["age","gender"]])
y_pred[:5]

Out[23]:

1
array([29.87948718, 30.55609756, 27.24722222, 30.84186047, 33.14666667])

将预测的值转成Series,并且注意索引号:

1
2
predict_bmi = pd.Series(y_pred, index=missing.index)
predict_bmi

Out[24]:

1
2
3
4
5
6
7
8
9
10
11
12
1       29.879487
8 30.556098
13 27.247222
19 30.841860
27 33.146667
...
5039 32.716000
5048 28.313636
5093 31.459322
5099 28.313636
5105 28.476923
Length: 201, dtype: float64

填充到原来的df数据中:

In [25]:

1
df.loc[missing.index, "bmi"] = predict_bmi

进行上面的预测和填充之后,我们再次查看缺失值情况,发现已经没有任何缺失值:

In [26]:

1
df.isnull().sum()

Out[26]:

1
2
3
4
5
6
7
8
9
10
11
12
13
id                   0
gender 0
age 0
hypertension 0
heart_disease 0
ever_married 0
work_type 0
Residence_type 0
avg_glucose_level 0
bmi 0
smoking_status 0
stroke 0
dtype: int64

数据EDA

In [27]:

1
2
3
4
variables = [variable for variable in df.columns if variable not in ['id','stroke']]

# 除去id号和是否中风外的全部字段
variables

Out[27]:

1
2
3
4
5
6
7
8
9
10
['gender',
'age',
'hypertension',
'heart_disease',
'ever_married',
'work_type',
'Residence_type',
'avg_glucose_level',
'bmi',
'smoking_status']

连续型变量

In [28]:

1
2
3
4
5
6
7
conts = ['age','avg_glucose_level','bmi']

for cont in conts:
plt.figure(1, figsize=(15,6))
sns.distplot(df[cont])

plt.show()

几点结论:

  • 年龄age:整体分布比较均衡,不同年龄段的人数差异小
  • 血糖水平:主要集中在100以下
  • bmi指标:呈现一定的正态分布

中风和未中风

上面我们查看了连续型变量的分布情况;可以看到bmi呈现明显的左偏态的分布。下面我们对比中风和未中风的情况:

1
2
3
4
5
6
7
8
9
conts = ['age','avg_glucose_level','bmi']

for cont in conts:
plt.figure(1, figsize=(15,12))
sns.displot(data=df,
x=cont,
hue="stroke",
kind="kde")
plt.show()

从3个密度图中能够观察到:从上面的密度图中可以看出来:对于是否中风,年龄age是一个最主要的因素

对比不同年龄段的血糖和BMI指数

In [30]:

1
2
3
4
5
px.scatter(df,x="age",
y="avg_glucose_level",
color="stroke",
trendline='ols'
)

年龄和血糖、bmi关系

1
2
3
4
5
px.scatter(df,x="age",
y="bmi",
color="stroke",
trendline='ols'
)

年龄和患病几率

从散点分布图中看到:年龄可能真的是一个比较重要的因素,和BMI以及平均的血糖水平有着一定的关系。

可能随着年龄的增长,风险在增加。果真如此吗?

In [32]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
background_color = "#fafafa"

fig = plt.figure(figsize=(12, 6),
dpi=160,
facecolor=background_color)

gs = fig.add_gridspec(2, 1)
gs.update(wspace=0.11, hspace=0.5)

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)

# 字段类型转化
df['age'] = df['age'].astype(int)

rate = []
for i in range(df['age'].min(), df['age'].max()):
rate.append(df[df['age'] < i]['stroke'].sum() / len(df[df['age'] < i]['stroke'])) # sum求和就是中风人数 / 总人数

sns.lineplot(data=rate,color='#0f4c81',ax=ax0)

for s in ["top","right","left"]:
ax0.spines[s].set_visible(False)

ax0.tick_params(axis='both',
which='major',
labelsize=8)

ax0.tick_params(axis=u'both',
which=u'both',
length=0)

ax0.text(-3,
0.055,
'Risk Increase by Age',
fontsize=18,
fontfamily='serif',
fontweight='bold')

ax0.text(-3,0.047,
'As age increase, so too does risk of having a stroke',
fontsize=14,
fontfamily='serif')


plt.show()

上面的图形说明了两点:

  1. 年龄越大,中风的几率的确越来越高
  2. 中风的几率是非常低的(y轴的值很低),这是由于中风和未中风的样本不均衡造成的

原数据5000个样本中只有249个中风样本,比例接近1:20

样本不均衡

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
from pywaffle import Waffle

fig = plt.figure(
figsize=(7, 2),
dpi=150,
facecolor=background_color,
FigureClass=Waffle,
rows=1,
values=[1, 19],
colors=['#0f4c81', "lightgray"],
characters='⬤',
font_size=16,
vertical=True,
)

# 主标题
fig.text(0.035,0.78,
'Stroked People in our dataset',
fontfamily='serif',
fontsize=10,
fontweight='bold')
# 子标题
fig.text(0.035,
0.65,
'1:20 [249 out of 5000]',
fontfamily='serif',
fontsize=10)

plt.show()

属性分布

整体变量情况

首先我们剔除gender中为Other的情况

In [34]:

1
2
str_only = df[df['stroke'] == 1]   # 中风
no_str_only = df[df['stroke'] == 0] # 未中风

In [35]:

1
len(str_only)

Out[35]:

1
249

In [36]:

1
2
# 剔除other
no_str_only = no_str_only[(no_str_only['gender'] != 'Other')]

下面的代码是比较在不同的属性下中风和未中风的情况:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
fig = plt.figure(figsize=(22,15))
gs = fig.add_gridspec(3, 3)
gs.update(wspace=0.35, hspace=0.27)

# 生成9个子图
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[0, 1])
ax2 = fig.add_subplot(gs[0, 2])
ax3 = fig.add_subplot(gs[1, 0])
ax4 = fig.add_subplot(gs[1, 1])
ax5 = fig.add_subplot(gs[1, 2])
ax6 = fig.add_subplot(gs[2, 0])
ax7 = fig.add_subplot(gs[2, 1])
ax8 = fig.add_subplot(gs[2, 2])

# 背景色
background_color = "#f6f6f6"
fig.patch.set_facecolor(background_color)

## 1、Age

ax0.grid(color='gray',
linestyle=':',
axis='y',
zorder=0, dashes=(1,5))

# 中风和未中风
positive = pd.DataFrame(str_only["age"])
negative = pd.DataFrame(no_str_only["age"])
# kde密度图
sns.kdeplot(positive["age"], # 中风数据
ax=ax0, # 指定子图
color="#0f4c81", # 颜色
shade=True, # 阴影
ec='black', # 边缘色
label="positive" # label
)

sns.kdeplot(negative["age"], # 未中风
ax=ax0,
color="#9bb7d4",
shade=True,
ec='black',
label="negative")

ax0.yaxis.set_major_locator(mtick.MultipleLocator(2))
ax0.set_ylabel('')
ax0.set_xlabel('')
ax0.text(-20, # 文本信息设置
0.0465,
'Age',
fontsize=14,
fontweight='bold',
fontfamily='serif',
color="#323232")


# 2、Smoking
# 不同状态的人数
positive = pd.DataFrame(str_only["smoking_status"].value_counts())
# 比例情况
positive["Percentage"] = positive["smoking_status"].apply(lambda x: x/sum(positive["smoking_status"])*100)

negative = pd.DataFrame(no_str_only["smoking_status"].value_counts())
negative["Percentage"] = negative["smoking_status"].apply(lambda x: x/sum(negative["smoking_status"])*100)

ax1.text(0, 4,
'Smoking Status',
fontsize=14,
fontweight='bold',
fontfamily='serif',
color="#323232")
ax1.barh(positive.index,
positive['Percentage'],
color="#0f4c81",
zorder=3,
height=0.7)
ax1.barh(negative.index,
negative['Percentage'],
color="#9bb7d4",
zorder=3,
ec='black',
height=0.3)
ax1.xaxis.set_major_formatter(mtick.PercentFormatter())
ax1.xaxis.set_major_locator(mtick.MultipleLocator(10))

# gender
# 1、统计人数
positive = pd.DataFrame(str_only["gender"].value_counts())
# 2、转成比例
positive["Percentage"] = positive["gender"].apply(lambda x: x/sum(positive["gender"])*100)
negative = pd.DataFrame(no_str_only["gender"].value_counts())
negative["Percentage"] = negative["gender"].apply(lambda x: x/sum(negative["gender"])*100)

x = np.arange(len(positive))
ax2.text(-0.4, 68.5,
'Gender',
fontsize=14,
fontweight='bold',
fontfamily='serif',
color="#323232")

ax2.grid(color='gray',
linestyle=':',
axis='y',
zorder=0,
dashes=(1,5))
ax2.bar(x,
height=positive["Percentage"],
zorder=3,
color="#0f4c81",
width=0.4)

ax2.bar(x+0.4,
height=negative["Percentage"],
zorder=3,
color="#9bb7d4",
width=0.4)
ax2.set_xticks(x + 0.4 / 2)
ax2.set_xticklabels(['Male','Female'])
ax2.yaxis.set_major_formatter(mtick.PercentFormatter())
ax2.yaxis.set_major_locator(mtick.MultipleLocator(10))

for i,j in zip([0, 1], positive["Percentage"]):
ax2.annotate(f'{j:0.0f}%',
xy=(i, j/2),
color='#f6f6f6',
horizontalalignment='center',
verticalalignment='center')

for i,j in zip([0, 1], negative["Percentage"]):
ax2.annotate(f'{j:0.0f}%',
xy=(i+0.4, j/2),
color='#f6f6f6',
horizontalalignment='center',
verticalalignment='center')


# heart_disease

positive = pd.DataFrame(str_only["heart_disease"].value_counts())
positive["Percentage"] = positive["heart_disease"].apply(lambda x: x/sum(positive["heart_disease"])*100)
negative = pd.DataFrame(no_str_only["heart_disease"].value_counts())
negative["Percentage"] = negative["heart_disease"].apply(lambda x: x/sum(negative["heart_disease"])*100)

x = np.arange(len(positive))
ax3.text(-0.3, 110,
'Heart Disease',
fontsize=14,
fontweight='bold',
fontfamily='serif',
color="#323232")
ax3.grid(color='gray',
linestyle=':',
axis='y',
zorder=0,
dashes=(1,5))

ax3.bar(x,
height=positive["Percentage"],
zorder=3,
color="#0f4c81",
width=0.4)

ax3.bar(x+0.4,
height=negative["Percentage"],
zorder=3,
color="#9bb7d4",
width=0.4)

ax3.set_xticks(x + 0.4 / 2)
ax3.set_xticklabels(['No History','History'])
ax3.yaxis.set_major_formatter(mtick.PercentFormatter())
ax3.yaxis.set_major_locator(mtick.MultipleLocator(20))

for i,j in zip([0, 1], positive["Percentage"]):
ax3.annotate(f'{j:0.0f}%',
xy=(i, j/2),
color='#f6f6f6',
horizontalalignment='center',
verticalalignment='center')
for i,j in zip([0, 1], negative["Percentage"]):
ax3.annotate(f'{j:0.0f}%',
xy=(i+0.4, j/2),
color='#f6f6f6',
horizontalalignment='center',
verticalalignment='center')


# ## AX4 - TITLE

ax4.spines["bottom"].set_visible(False)
ax4.tick_params(left=False, bottom=False)
ax4.set_xticklabels([])
ax4.set_yticklabels([])
ax4.text(0.5, 0.6, 'Can we see patterns for\n\n patients in our data?', horizontalalignment='center', verticalalignment='center',
fontsize=22, fontweight='bold', fontfamily='serif', color="#323232")

ax4.text(0.15,0.57,"Stroke", fontweight="bold", fontfamily='serif', fontsize=22, color='#0f4c81')
ax4.text(0.41,0.57,"&", fontweight="bold", fontfamily='serif', fontsize=22, color='#323232')
ax4.text(0.49,0.57,"No-Stroke", fontweight="bold", fontfamily='serif', fontsize=22, color='#9bb7d4')


# Glucose

ax5.grid(color='gray',
linestyle=':',
axis='y',
zorder=0,
dashes=(1,5))
positive = pd.DataFrame(str_only["avg_glucose_level"])
negative = pd.DataFrame(no_str_only["avg_glucose_level"])
sns.kdeplot(positive["avg_glucose_level"],
ax=ax5,
color="#0f4c81",
ec='black',
shade=True,
label="positive")

sns.kdeplot(negative["avg_glucose_level"],
ax=ax5,
color="#9bb7d4",
ec='black',
shade=True,
label="negative")

ax5.text(-55, 0.01855,
'Avg. Glucose Level',
fontsize=14,
fontweight='bold',
fontfamily='serif',
color="#323232")
ax5.yaxis.set_major_locator(mtick.MultipleLocator(2))
ax5.set_ylabel('')
ax5.set_xlabel('')



# bmi

ax6.grid(color='gray',
linestyle=':',
axis='y',
zorder=0,
dashes=(1,5))

positive = pd.DataFrame(str_only["bmi"])
negative = pd.DataFrame(no_str_only["bmi"])
sns.kdeplot(positive["bmi"],
ax=ax6,
color="#0f4c81",
ec='black',
shade=True,
label="positive")

sns.kdeplot(negative["bmi"],
ax=ax6,
color="#9bb7d4",
ec='black',
shade=True,
label="negative")
ax6.text(-0.06,
0.09,
'BMI',
fontsize=14,
fontweight='bold',
fontfamily='serif',
color="#323232")
ax6.yaxis.set_major_locator(mtick.MultipleLocator(2))
ax6.set_ylabel('')
ax6.set_xlabel('')


# Work Type

positive = pd.DataFrame(str_only["work_type"].value_counts())
positive["Percentage"] = positive["work_type"].apply(lambda x: x/sum(positive["work_type"])*100)
positive = positive.sort_index()

negative = pd.DataFrame(no_str_only["work_type"].value_counts())
negative["Percentage"] = negative["work_type"].apply(lambda x: x/sum(negative["work_type"])*100)
negative = negative.sort_index()

ax7.bar(negative.index, height=negative["Percentage"], zorder=3, color="#9bb7d4", width=0.05)
ax7.scatter(negative.index, negative["Percentage"], zorder=3,s=200, color="#9bb7d4")
ax7.bar(np.arange(len(positive.index))+0.4, height=positive["Percentage"], zorder=3, color="#0f4c81", width=0.05)
ax7.scatter(np.arange(len(positive.index))+0.4, positive["Percentage"], zorder=3,s=200, color="#0f4c81")

ax7.yaxis.set_major_formatter(mtick.PercentFormatter())
ax7.yaxis.set_major_locator(mtick.MultipleLocator(10))
ax7.set_xticks(np.arange(len(positive.index))+0.4 / 2)
ax7.set_xticklabels(list(positive.index),rotation=0)
ax7.text(-0.5, 66, 'Work Type', fontsize=14, fontweight='bold', fontfamily='serif', color="#323232")

# hypertension

positive = pd.DataFrame(str_only["hypertension"].value_counts())
positive["Percentage"] = positive["hypertension"].apply(lambda x: x/sum(positive["hypertension"])*100)
negative = pd.DataFrame(no_str_only["hypertension"].value_counts())
negative["Percentage"] = negative["hypertension"].apply(lambda x: x/sum(negative["hypertension"])*100)

x = np.arange(len(positive))
ax8.text(-0.45, 100, 'Hypertension', fontsize=14, fontweight='bold', fontfamily='serif', color="#323232")
ax8.grid(color='gray', linestyle=':', axis='y', zorder=0, dashes=(1,5))
ax8.bar(x, height=positive["Percentage"], zorder=3, color="#0f4c81", width=0.4)
ax8.bar(x+0.4, height=negative["Percentage"], zorder=3, color="#9bb7d4", width=0.4)
ax8.set_xticks(x + 0.4 / 2)
ax8.set_xticklabels(['No History','History'])
ax8.yaxis.set_major_formatter(mtick.PercentFormatter())
ax8.yaxis.set_major_locator(mtick.MultipleLocator(20))
for i,j in zip([0, 1], positive["Percentage"]):
ax8.annotate(f'{j:0.0f}%',xy=(i, j/2), color='#f6f6f6', horizontalalignment='center', verticalalignment='center')
for i,j in zip([0, 1], negative["Percentage"]):
ax8.annotate(f'{j:0.0f}%',xy=(i+0.4, j/2), color='#f6f6f6', horizontalalignment='center', verticalalignment='center')


# tidy up

for s in ["top","right","left"]:
for i in range(0,9):
locals()["ax"+str(i)].spines[s].set_visible(False)

for i in range(0,9):
locals()["ax"+str(i)].set_facecolor(background_color)
locals()["ax"+str(i)].tick_params(axis=u'both', which=u'both',length=0)
locals()["ax"+str(i)].set_facecolor(background_color)


plt.show()

建模

模型baseline

In [38]:

1
len(str_only)

Out[38]:

1
249

In [39]:

1
249 / len(df)

Out[39]:

1
0.0487279843444227

说明总共有249个人是中风的。本数据的总人数是len(df),根据下面的表达式能够得到本次模型的baseline。

也就说,对于阳性中风患者的召回率,一个好的目标是4.8%。

字段编码

对4个字符型的字段进行编码工作:

In [40]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
df['gender'] = df['gender'].replace({'Male':0,
'Female':1,
'Other':-1}
).astype(np.uint8)

df['Residence_type'] = df['Residence_type'].map({'Rural':0,
'Urban':1}
).astype(np.uint8)

df['work_type'] = df['work_type'].map({'Private':0,
'Self-employed':1,
'Govt_job':2,
'children':-1,
'Never_worked':-2}
).astype(np.uint8)

df['ever_married'] = df['ever_married'].map({'No':0,'Yes':1}).astype(np.uint8)

df.head()

抽烟状态的独热码转换:

In [41]:

1
df["smoking_status"].value_counts()

Out[41]:

1
2
3
4
5
never smoked       1892
Unknown 1544
formerly smoked 885
smokes 789
Name: smoking_status, dtype: int64

In [42]:

1
2
df = df.join(pd.get_dummies(df["smoking_status"]))
df.drop("smoking_status",axis=1,inplace=True)

数据分割

In [43]:

1
2
3
4
5
6
7
8
# 选取特征
X = df.drop("stroke",axis=1)
# 目标变量
y = df['stroke']
from sklearn.model_selection import train_test_split

# 3-7比例
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3, random_state=42)

上采样

前文中提到,本案例中风和未中风的数据比例接近1:20,在这里我们采样基于SMOTE的上采样方法

In [44]:

1
2
oversample = SMOTE()
X_train_smote, y_train_smote = oversample.fit_resample(X_train, y_train.ravel())

In [45]:

1
len(y_train_smote)

Out[45]:

1
2914

In [46]:

1
len(X_train_smote)

Out[46]:

1
2914

建模

采用3种不同的分类模型来建立模型:Random Forest, SVM, Logisitc Regression

In [47]:

1
2
3
4
5
6
7
rf_pipeline = Pipeline(steps = [('scale',StandardScaler()), # 标准化
('RF',RandomForestClassifier(random_state=42))] # 模型
)
svm_pipeline = Pipeline(steps = [('scale',StandardScaler()),
('SVM',SVC(random_state=42))])
logreg_pipeline = Pipeline(steps = [('scale',StandardScaler()),
('LR',LogisticRegression(random_state=42))])

10折交叉验证

In [48]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
rf_cv = cross_val_score(rf_pipeline,
X_train_smote,
y_train_smote,
cv=10,
scoring='f1' # 模型得分评价指标
)

svm_cv = cross_val_score(svm_pipeline,
X_train_smote,
y_train_smote,
cv=10,
scoring='f1'
)

logreg_cv = cross_val_score(logreg_pipeline,
X_train_smote,
y_train_smote,
cv=10,
scoring='f1'
)

3种模型得分对比

In [49]:

1
2
3
4
5
6
print('随机森林:', rf_cv.mean())
print('支持向量机:',svm_cv.mean())
print('逻辑回归:', logreg_cv.mean())
随机森林: 0.9628909366701726
支持向量机: 0.9363667907023254
逻辑回归: 0.8859930523017683

很明显:随机森林表现的最好!

模型训练fit

In [50]:

1
2
3
4
5
rf_pipeline.fit(X_train_smote,y_train_smote)

svm_pipeline.fit(X_train_smote,y_train_smote)

logreg_pipeline.fit(X_train_smote,y_train_smote)

Out[50]:

1
2
Pipeline(steps=[('scale', StandardScaler()),
('LR', LogisticRegression(random_state=42))])

In [51]:

1
2
3
4
5
# 3种模型预测

rf_pred =rf_pipeline.predict(X_test)
svm_pred = svm_pipeline.predict(X_test)
logreg_pred = logreg_pipeline.predict(X_test)

评价指标

In [52]:

1
2
3
4
5
# 1、混淆矩阵

rf_cm = confusion_matrix(y_test, rf_pred )
svm_cm = confusion_matrix(y_test, svm_pred)
logreg_cm = confusion_matrix(y_test, logreg_pred)

In [53]:

1
2
3
4
5
6
7
8
9
10
11
12
13
print(rf_cm)
print("----")
print(svm_cm)
print("----")
print(logreg_cm)
[[3338 66]
[ 164 9]]
----
[[3196 208]
[ 148 25]]
----
[[3138 266]
[ 116 57]]

In [54]:

1
2
3
4
5
6
# 2、F_1得分
# F1分数可以看作是模型准确率和召回率的一种加权平均,它的最大值是1,最小值是0,值越大意味着模型越好

rf_f1 = f1_score(y_test,rf_pred)
svm_f1 = f1_score(y_test,svm_pred)
logreg_f1 = f1_score(y_test,logreg_pred)

In [55]:

1
2
3
4
5
6
print('RF mean :',rf_f1)
print('SVM mean :',svm_f1)
print('LR mean :',logreg_f1)
RF mean : 0.07258064516129033
SVM mean : 0.1231527093596059
LR mean : 0.22983870967741934

随机森林模型的分类报告:

In [56]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
from sklearn.metrics import plot_confusion_matrix, classification_report

print(classification_report(y_test,rf_pred))

print('Accuracy Score: ',accuracy_score(y_test,rf_pred))
precision recall f1-score support

0 0.95 0.98 0.97 3404
1 0.12 0.05 0.07 173

accuracy 0.94 3577
macro avg 0.54 0.52 0.52 3577
weighted avg 0.91 0.94 0.92 3577

Accuracy Score: 0.9357003075202683

随机森林模型调参

基于网格搜索的参数调优:

In [57]:

1
2
3
4
5
6
7
8
9
10
11
from sklearn.model_selection import GridSearchCV

n_estimators =[64,100,128,200]
max_features = [2,3,5,7]
bootstrap = [True,False]

param_grid = {'n_estimators':n_estimators,
'max_features':max_features,
'bootstrap':bootstrap}

rfc = RandomForestClassifier()

In [58]:

1
2
3
grid = GridSearchCV(rfc,param_grid)

grid.fit(X_train,y_train)

Out[58]:

1
2
3
4
GridSearchCV(estimator=RandomForestClassifier(),
param_grid={'bootstrap': [True, False],
'max_features': [2, 3, 5, 7],
'n_estimators': [64, 100, 128, 200]})

In [59]:

1
grid.best_params_  # 找到最优的参数

Out[59]:

1
{'bootstrap': False, 'max_features': 3, 'n_estimators': 200}

In [60]:

1
2
3
4
5
6
7
8
9
10
# 再次建立随机森林模型

rfc = RandomForestClassifier(
max_features=3,
n_estimators=200,
bootstrap=False)

rfc.fit(X_train_smote,y_train_smote)

rfc_tuned_pred = rfc.predict(X_test)

In [61]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# 新的分类报告得分

print(classification_report(y_test,rfc_tuned_pred))

print('Accuracy Score: ',accuracy_score(y_test,rfc_tuned_pred))
print('F1 Score: ',f1_score(y_test,rfc_tuned_pred))
precision recall f1-score support

0 0.95 0.98 0.97 3404
1 0.05 0.02 0.03 173

accuracy 0.94 3577
macro avg 0.50 0.50 0.50 3577
weighted avg 0.91 0.94 0.92 3577

Accuracy Score: 0.9362594352809617
F1 Score: 0.025641025641025644

逻辑回归模型调参

In [62]:

1
2
3
4
5
6
7
8
penalty = ['l1','l2']
C = [0.001, 0.01, 0.1, 1, 10, 100]

log_param_grid = {'penalty': penalty,
'C': C}

logreg = LogisticRegression()
grid = GridSearchCV(logreg,log_param_grid)

In [63]:

1
grid.fit(X_train_smote,y_train_smote)

Out[63]:

1
2
3
GridSearchCV(estimator=LogisticRegression(),
param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100],
'penalty': ['l1', 'l2']})

In [64]:

1
grid.best_params_

Out[64]:

1
{'C': 1, 'penalty': 'l2'}

In [65]:

1
2
3
4
logreg_pipeline = Pipeline(steps = [('scale',StandardScaler()),
('LR',LogisticRegression(C=1,penalty='l2',random_state=42))])

logreg_pipeline.fit(X_train_smote,y_train_smote)

Out[65]:

1
2
Pipeline(steps=[('scale', StandardScaler()),
('LR', LogisticRegression(C=1, random_state=42))])

In [66]:

1
logreg_new_pred   = logreg_pipeline.predict(X_test) # 新预测

In [67]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
print(classification_report(y_test,logreg_new_pred))

print('Accuracy Score: ',accuracy_score(y_test,logreg_new_pred))
print('F1 Score: ',f1_score(y_test,logreg_new_pred))
precision recall f1-score support

0 0.96 0.92 0.94 3404
1 0.18 0.33 0.23 173

accuracy 0.89 3577
macro avg 0.57 0.63 0.59 3577
weighted avg 0.93 0.89 0.91 3577

Accuracy Score: 0.8932065977075762
F1 Score: 0.22983870967741934

支持向量机调参

In [68]:

1
2
3
4
5
6
7
8
svm_param_grid = {
'C': [0.1, 1, 10, 100, 1000],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'kernel': ['rbf']}

svm = SVC(random_state=42)

grid = GridSearchCV(svm, svm_param_grid)

In [69]:

1
grid.fit(X_train_smote,y_train_smote)

Out[69]:

1
2
3
4
GridSearchCV(estimator=SVC(random_state=42),
param_grid={'C': [0.1, 1, 10, 100, 1000],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'kernel': ['rbf']})

In [70]:

1
grid.best_params_

Out[70]:

1
{'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}

In [71]:

1
2
3
4
5
svm_pipeline = Pipeline(steps = [('scale',StandardScaler()),('SVM',SVC(C=100,gamma=0.0001,kernel='rbf',random_state=42))])

svm_pipeline.fit(X_train_smote,y_train_smote)

svm_tuned_pred = svm_pipeline.predict(X_test)

In [72]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
print(classification_report(y_test,svm_tuned_pred))

print('Accuracy Score: ',accuracy_score(y_test,svm_tuned_pred))
print('F1 Score: ',f1_score(y_test,svm_tuned_pred))
precision recall f1-score support

0 0.96 0.93 0.94 3404
1 0.16 0.27 0.20 173

accuracy 0.90 3577
macro avg 0.56 0.60 0.57 3577
weighted avg 0.92 0.90 0.91 3577

Accuracy Score: 0.8951635448700028
F1 Score: 0.19700214132762314

结论

  1. 在交叉验证的过程中,随机森林表现的最好。
  2. 3种模型的对比:随机森林的精度最好,但是F1-score缺失最低的
  3. 模型可能特点:更能预测哪些人将会中风,而不是哪些人不会中风

本文标题:kaggle实战-基于机器学习的中风病人预测

发布时间:2022年06月25日 - 12:06

原始链接:http://www.renpeter.cn/2022/06/25/kaggle%E5%AE%9E%E6%88%98-%E5%9F%BA%E4%BA%8E%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E7%9A%84%E4%B8%AD%E9%A3%8E%E7%97%85%E4%BA%BA%E9%A2%84%E6%B5%8B.html

许可协议: 署名-非商业性使用-禁止演绎 4.0 国际 转载请保留原文链接及作者。

Coffee or Tea