
Kaggle in Practice: Predicting Kidney Disease with Machine Learning

This article builds models on a kidney disease dataset from Kaggle, covering:

  • Data preprocessing
  • Feature engineering
  • Missing-value imputation
  • Building classification models
  • Comparing model results
  • Model interpretability with SHAP

Original dataset:

https://www.kaggle.com/datasets/mansoordaku/ckdisease?datasetId=1111&sortBy=voteCount

Results

A preview of the final comparison:

  • KNN scores lowest; LGBM comes first. On Kaggle, LGBM is used very frequently for classification problems and usually performs well.
  • Among the tree models, every ensemble improves on the plain decision tree baseline.

Importing Libraries

Note 1: in a modeling project, the imports typically cover:

  • Data handling, mainly pandas
  • Visualization: I mostly use Plotly together with seaborn, and occasionally plain matplotlib or pyecharts
  • The various regression and classification models, plus evaluation metrics
  • Everything else: train/test splitting, dimensionality reduction, sampling, standardization, etc.
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import missingno as ms
import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.offline as pyo
pyo.init_notebook_mode()
sns.set_style('darkgrid')

plt.style.use('fivethirtyeight')
%matplotlib inline

from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split,cross_val_score

from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score as f1

import eli5
from eli5.sklearn import PermutationImportance
import shap
from pdpbox import pdp, info_plots

plt.rc('figure',figsize=(18,9))

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', 26)

Basic Data Information

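The original post jumps straight to a DataFrame named df; as a minimal sketch (assuming the Kaggle CSV is named kidney_disease.csv, which may differ locally), the loading step would look like:

df = pd.read_csv("kidney_disease.csv")  # file name assumed from the Kaggle dataset page
df.head()
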
Clearly, the id column is useless for modeling, so we drop it right away:

In [3]:

df.drop("id",axis=1,inplace=True)

Check the data size: number of rows and number of columns.

In [4]:

df.shape

Out[4]:

(400, 25)

400 records and 25 columns in total.

Counts of the different column dtypes:

In [5]:

df.dtypes

Out[5]:

age               float64
bp                float64
sg                float64
al                float64
su                float64
rbc                object
pc                 object
pcc                object
ba                 object
bgr               float64
bu                float64
sc                float64
sod               float64
pot               float64
hemo              float64
pcv                object
wc                 object
rc                 object
htn                object
dm                 object
cad                object
appet              object
pe                 object
ane                object
classification     object
dtype: object

In [6]:

pd.value_counts(df.dtypes)

Out[6]:

Only two dtypes are present:

object     14
float64    11
dtype: int64

Check the missing values:

In [7]:

df.isnull().sum().sort_values(ascending=False)

Out[7]:

rbc               152
rc                130
wc                105
pot                88
sod                87
pcv                70
pc                 65
hemo               52
su                 49
sg                 47
al                 46
bgr                44
bu                 19
sc                 17
bp                 12
age                 9
ba                  4
pcc                 4
htn                 2
dm                  2
cad                 2
appet               1
pe                  1
ane                 1
classification      0
dtype: int64

Descriptive statistics for the numeric columns: count, min/max, quartiles, etc.:

In [8]:

df.describe().style.background_gradient(cmap="ocean_r")  # descriptive statistics

Overall DataFrame info:

In [9]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             391 non-null    float64
 1   bp              388 non-null    float64
 2   sg              353 non-null    float64
 3   al              354 non-null    float64
 4   su              351 non-null    float64
 5   rbc             248 non-null    object
 6   pc              335 non-null    object
 7   pcc             396 non-null    object
 8   ba              396 non-null    object
 9   bgr             356 non-null    float64
 10  bu              381 non-null    float64
 11  sc              383 non-null    float64
 12  sod             313 non-null    float64
 13  pot             312 non-null    float64
 14  hemo            348 non-null    float64
 15  pcv             330 non-null    object
 16  wc              295 non-null    object
 17  rc              270 non-null    object
 18  htn             398 non-null    object
 19  dm              398 non-null    object
 20  cad             398 non-null    object
 21  appet           399 non-null    object
 22  pe              399 non-null    object
 23  ane             399 non-null    object
 24  classification  400 non-null    object
dtypes: float64(11), object(14)
memory usage: 78.2+ KB

Column Descriptions

What each column means:

In [10]:

columns = df.columns
columns

Out[10]:

Index(['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr', 'bu',
       'sc', 'sod', 'pot', 'hemo', 'pcv', 'wc', 'rc', 'htn', 'dm', 'cad',
       'appet', 'pe', 'ane', 'classification'],
      dtype='object')
  • age: age
  • bp: blood_pressure
  • sg: specific_gravity; for kidney disease this usually means urine specific gravity
  • al: albumin
  • su: sugar (glucose)
  • rbc: red_blood_cells, normal or abnormal
  • pc: pus_cell, normal or abnormal
  • pcc: pus_cell_clumps, present or not
  • ba: bacteria, present or not
  • bgr: blood_glucose_random, random blood glucose
  • bu: blood_urea
  • sc: serum_creatinine
  • sod: sodium
  • pot: potassium
  • hemo: haemoglobin
  • pcv: packed_cell_volume (PCV), the fraction of blood volume occupied by red blood cells
  • wc: white_blood_cell_count
  • rc: red_blood_cell_count
  • htn: hypertension, yes/no
  • dm: diabetes_mellitus, yes/no
  • cad: coronary_artery_disease, yes/no
  • appet: appetite, good/poor
  • pe: pedal_edema, swelling of the feet, yes/no
  • ane: anemia, yes/no
  • classification: the target label, whether the patient has chronic kidney disease

Column Preprocessing

Next we clean up a few problem columns.

Column: classification

Cleaning the target label

In [11]:

df["classification"].value_counts()  # before the fix

Out[11]:

ckd       248
notckd    150
ckd\t       2
Name: classification, dtype: int64

Two records carry an anomalous label (ckd followed by a tab). This is dirty data that has to be located by hand; we normalize all such records to ckd:

In [12]:

df["classification"] = df["classification"].apply(lambda x: x if x == "notckd" else "ckd")

In [13]:

df["classification"].value_counts()  # after the fix

Out[13]:

ckd       250
notckd    150
Name: classification, dtype: int64

Column: age

In [14]:

px.violin(df,y="age",color="classification")

Column: pcv (packed_cell_volume)

PCV, packed cell volume: the fraction of blood volume occupied by red blood cells

In [15]:

df["pcv"].value_counts()  # before the fix

(output shown as a screenshot in the original post)

This column contains malformed entries that also need handling; coercing it to numeric turns them into NaN:

In [16]:

df["pcv"] = pd.to_numeric(df["pcv"], errors="coerce")

In [17]:

df["pcv"].value_counts()  # after the fix

Column: wc (white_blood_cell_count)

White blood cell count

In [18]:

df["wc"].value_counts()  # before the fix

Out[18]:

9800     11
6700     10
9200      9
9600      9
7200      9
         ..
19100     1
\t?       1
12300     1
14900     1
12700     1
Name: wc, Length: 92, dtype: int64

In [19]:

df["wc"] = pd.to_numeric(df["wc"], errors="coerce")

Column: rc (red_blood_cell_count)

Red blood cell count

In [20]:

df["rc"].value_counts()  # before the fix

It needs the same conversion:

In [21]:

df["rc"] = pd.to_numeric(df["rc"], errors="coerce")

In [22]:

# dtype counts after the conversions

pd.value_counts(df.dtypes)

Out[22]:

float64    14
object     11
dtype: int64

Column: dm (diabetes_mellitus)

Does the patient have diabetes?

In [23]:

df["dm"].value_counts()

Out[23]:

no       258
yes      134
\tno       3
\tyes      2
 yes       1
Name: dm, dtype: int64

The dm column has anomalies, typically caused by stray spaces and tab characters; we normalize the values to no and yes:

In [24]:

df["dm"] = df["dm"].str.strip()  # strip surrounding whitespace

In [25]:

df["dm"].value_counts()

Out[25]:

no     261
yes    137
Name: dm, dtype: int64

Column: cad (coronary_artery_disease)

Does the patient have coronary artery disease?

In [26]:

df["cad"].value_counts()

Out[26]:

no      362
yes      34
\tno      2
Name: cad, dtype: int64

In [27]:

df["cad"] = df["cad"].str.strip()  # strip surrounding whitespace

Check df again after these fixes:

In [28]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             391 non-null    float64
 1   bp              388 non-null    float64
 2   sg              353 non-null    float64
 3   al              354 non-null    float64
 4   su              351 non-null    float64
 5   rbc             248 non-null    object
 6   pc              335 non-null    object
 7   pcc             396 non-null    object
 8   ba              396 non-null    object
 9   bgr             356 non-null    float64
 10  bu              381 non-null    float64
 11  sc              383 non-null    float64
 12  sod             313 non-null    float64
 13  pot             312 non-null    float64
 14  hemo            348 non-null    float64
 15  pcv             329 non-null    float64
 16  wc              294 non-null    float64
 17  rc              269 non-null    float64
 18  htn             398 non-null    object
 19  dm              398 non-null    object
 20  cad             398 non-null    object
 21  appet           399 non-null    object
 22  pe              399 non-null    object
 23  ane             399 non-null    object
 24  classification  400 non-null    object
dtypes: float64(14), object(11)
memory usage: 78.2+ KB

Feature Distributions

In [29]:

# categorical variables
cat_cols = [col for col in df.columns if df[col].dtype == "object"]
# numeric variables
num_cols = [col for col in df.columns if df[col].dtype != "object"]

Categorical Variable Values

Value counts for each categorical column:

In [30]:

for col in cat_cols:
    print("Column:", col)
    print(df[col].value_counts())
    print("-" * 10)
Column: rbc
normal      201
abnormal     47
Name: rbc, dtype: int64
----------
Column: pc
normal      259
abnormal     76
Name: pc, dtype: int64
----------
Column: pcc
notpresent    354
present        42
Name: pcc, dtype: int64
----------
Column: ba
notpresent    374
present        22
Name: ba, dtype: int64
----------
Column: htn
no     251
yes    147
Name: htn, dtype: int64
----------
Column: dm
no     261
yes    137
Name: dm, dtype: int64
----------
Column: cad
no     364
yes     34
Name: cad, dtype: int64
----------
Column: appet
good    317
poor     82
Name: appet, dtype: int64
----------
Column: pe
no     323
yes     76
Name: pe, dtype: int64
----------
Column: ane
no     339
yes     60
Name: ane, dtype: int64
----------
Column: classification
ckd       250
notckd    150
Name: classification, dtype: int64
----------

In [31]:

# number of categorical variables

len(cat_cols)

Out[31]:

11

In [32]:

plt.figure(figsize = (20, 16))
sub_plotnumber = 1

for col in cat_cols:
    if sub_plotnumber <= 12:
        ax = plt.subplot(4, 3, sub_plotnumber)  # 4-row, 3-column grid; sub_plotnumber is the subplot position
        sns.countplot(df[col], palette = "gist_earth")  # draw the count plot
        plt.xlabel(col)  # x-axis label

        sub_plotnumber += 1  # move to the next subplot

plt.tight_layout()
plt.show()

Numeric Variable Distributions

In [33]:

len(num_cols)  # 14 numeric variables in total

Out[33]:

14

In [34]:

plt.figure(figsize = (20, 16))
sub_plotnumber = 1

for col in num_cols:
    if sub_plotnumber <= 14:
        ax = plt.subplot(4, 4, sub_plotnumber)  # 4-row, 4-column grid; sub_plotnumber is the subplot position
        sns.distplot(df[col])  # distribution plot
        plt.xlabel(col)  # x-axis label

        sub_plotnumber += 1  # move to the next subplot

plt.tight_layout()
plt.show()

Takeaway: several features show clear skewness (mostly left skew).

Feature Distributions by Class

Define three plotting helpers

In [35]:

# 1. Violin plot: inspect a column's distribution by class
def violin(col):
    fig = px.violin(df,
                    y=col,
                    x="classification",
                    color="classification",
                    box=True,
                    template = 'plotly_dark')

    return fig.show()

# 2. KDE plot: check whether a column looks normally distributed
def kde(col):
    grid = sns.FacetGrid(df,
                         hue="classification",
                         height = 6,
                         aspect=2)
    grid.map(sns.kdeplot, col)
    grid.add_legend()

# 3. Scatter plot: relationship between two columns
def scatter(col1, col2):
    fig = px.scatter(df,
                     x=col1,
                     y=col2,
                     color="classification",
                     template = 'plotly_dark')
    return fig.show()

In [36]:

violin("rc")

kde("rc")

Pairwise Relationships

Relationship between two numeric variables:
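
No code for this section survived in the original post; the scatter helper defined above would be used along these lines (hemo against pcv is an illustrative pair, not necessarily the one the author plotted):

scatter("hemo", "pcv")  # two strongly related blood measurements, colored by class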

Missing-Value Imputation

Overall missingness

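missingno was imported at the top but never shown in use; a quick visual overview of the gaps could be added here as a sketch:

ms.matrix(df)  # one bar per column; white gaps mark missing entries
plt.show()
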
In [56]:

# missing values across all columns

df.isnull().sum().sort_values(ascending = False)

Out[56]:

rbc               152
rc                131
wc                106
pot                88
sod                87
pcv                71
pc                 65
hemo               52
su                 49
sg                 47
al                 46
bgr                44
bu                 19
sc                 17
bp                 12
age                 9
ba                  4
pcc                 4
htn                 2
dm                  2
cad                 2
appet               1
pe                  1
ane                 1
classification      0
dtype: int64

In [57]:

# missing values in the numeric variables

df[num_cols].isnull().sum()

Out[57]:

age       9
bp       12
sg       47
al       46
su       49
bgr      44
bu       19
sc       17
sod      87
pot      88
hemo     52
pcv      71
wc      106
rc      131
dtype: int64

In [58]:

# missing values in the categorical variables

df[cat_cols].isnull().sum()

Out[58]:

rbc               152
pc                 65
pcc                 4
ba                  4
htn                 2
dm                  2
cad                 2
appet               1
pe                  1
ane                 1
classification      0
dtype: int64

Two Imputation Strategies

  1. Random-sample imputation: fill missing entries by sampling at random from the column's observed values; used for columns with many missing values.
  2. Mean/mode imputation: for columns with few missing values, fill with the column's mean or mode.

In [59]:

df["rbc"].isna().sum()  # number of missing values in a column

Out[59]:

152

In [60]:

df["dm"].mode()[0]   # the mode of a column

Out[60]:

'no'

In [61]:

def random_value_imputate(col):
    """
    Random-sample imputation (for columns with many missing values)
    """

    # 1. sample as many observed values as there are missing entries
    random_sample = df[col].dropna().sample(df[col].isna().sum())
    # 2. give the sample the index of the rows that are missing
    random_sample.index = df[df[col].isnull()].index
    # 3. fill the missing slots in place via loc
    df.loc[df[col].isnull(), col] = random_sample


def mode_impute(col):
    """
    Mode imputation
    """
    # 1. find the mode
    mode = df[col].mode()[0]
    # 2. fill missing values with it
    df[col] = df[col].fillna(mode)

1. Numeric variables get random-sample imputation:

In [62]:

for col in num_cols:
    random_value_imputate(col)

2. Categorical variables: the method depends on the column:

In [63]:

# random-sample imputation for the two columns with many gaps
random_value_imputate('rbc')
random_value_imputate('pc')

In [64]:

# the remaining columns get mode imputation

for col in cat_cols:
    mode_impute(col)

After imputation, the data has no missing values left:

In [65]:

df.isnull().sum()

Out[65]:

age               0
bp                0
sg                0
al                0
su                0
rbc               0
pc                0
pcc               0
ba                0
bgr               0
bu                0
sc                0
sod               0
pot               0
hemo              0
pcv               0
wc                0
rc                0
htn               0
dm                0
cad               0
appet             0
pe                0
ane               0
classification    0
dtype: int64

Correlation Analysis

In [67]:

df["classification"] = df["classification"].map({"ckd": 0, "notckd": 1})

corr = df.corr()  # numeric columns only

plt.figure(figsize = (15, 8))

sns.heatmap(corr,
            annot = True,
            linewidths = 2,
            linecolor = 'lightgrey')

plt.show()

The features most strongly correlated with classification are sg (urine specific gravity), hemo (haemoglobin), pcv (packed cell volume, the fraction of blood occupied by red cells) and rc (red blood cell count).
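
To read the same ranking off numerically rather than from the heatmap, a small convenience snippet (not in the original notebook) would be:

corr["classification"].drop("classification").abs().sort_values(ascending=False)  # strongest correlates first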

Feature Encoding

Encoding the categorical variables:

In [68]:

for col in cat_cols:
    print(f"Categories of {col}: {df[col].nunique()} ")
Categories of rbc: 2
Categories of pc: 2
Categories of pcc: 2
Categories of ba: 2
Categories of htn: 2
Categories of dm: 2
Categories of cad: 2
Categories of appet: 2
Categories of pe: 2
Categories of ane: 2
Categories of classification: 2

Every categorical variable takes exactly two values, so a simple label encoding to 0/1 is enough:

In [69]:

1
2
3
4
5
6
from sklearn.preprocessing import LabelEncoder

led = LabelEncoder()

for col in cat_cols:
df[col] = led.fit_transform(df[col])

For convenience of analysis, the classification column has also been encoded (ckd → 0, notckd → 1):

In [70]:

df["classification"].value_counts()

Out[70]:

0    250
1    150
Name: classification, dtype: int64

Modeling

Features and Target

In [71]:

X = df.drop("classification", axis=1)
y = df["classification"]

Train/Test Split

In [72]:

# shuffle the data
# (note: X and y were already extracted above, and train_test_split shuffles
#  by default, so this step is effectively redundant)

from sklearn.utils import shuffle
df = shuffle(df)

In [73]:

# from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
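
The classes sit at 250/150, and SMOTE was imported at the top but never shown in use; a hedged sketch of oversampling the training portion only (never the test set) would be:

sm = SMOTE(random_state=0)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)  # balances the minority class in the training data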

A Modeling Helper

In [75]:

def create_model(model):
    # train the model
    model.fit(X_train, y_train)
    # predict on the test set
    y_pred = model.predict(X_test)
    # accuracy
    acc = accuracy_score(y_test, y_pred)
    # confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    # classification report
    cr = classification_report(y_test, y_pred)

    print(f"Test Accuracy of {model} : {acc}")
    print(f"Confusion Matrix of {model}: \n{cm}")
    print(f"Classification Report of {model} : \n {cr}")

Nine Models

KNN

In [76]:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
create_model(knn)
Test Accuracy of KNeighborsClassifier() : 0.6875
Confusion Matrix of KNeighborsClassifier():
[[35 17]
 [ 8 20]]
Classification Report of KNeighborsClassifier() :
              precision    recall  f1-score   support

           0       0.81      0.67      0.74        52
           1       0.54      0.71      0.62        28

    accuracy                           0.69        80
   macro avg       0.68      0.69      0.68        80
weighted avg       0.72      0.69      0.69        80
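
KNN is distance-based and the features here sit on very different scales, which partly explains its weak score; StandardScaler and Pipeline were imported at the top, so a scaled variant would be the natural follow-up (a sketch, not run in the original post):

knn_scaled = Pipeline([("scaler", StandardScaler()),
                       ("knn", KNeighborsClassifier())])
create_model(knn_scaled)  # standardizing the features typically helps KNN considerably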

Decision Tree

In [77]:

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
create_model(dt)
Test Accuracy of DecisionTreeClassifier() : 0.9375
Confusion Matrix of DecisionTreeClassifier():
[[50  2]
 [ 3 25]]
Classification Report of DecisionTreeClassifier() :
              precision    recall  f1-score   support

           0       0.94      0.96      0.95        52
           1       0.93      0.89      0.91        28

    accuracy                           0.94        80
   macro avg       0.93      0.93      0.93        80
weighted avg       0.94      0.94      0.94        80

Random Forest Classifier

In [78]:

from sklearn.ensemble import RandomForestClassifier

rd_clf = RandomForestClassifier(criterion = 'entropy',
                                max_depth = 11,
                                max_features = 'auto',
                                min_samples_leaf = 2,
                                min_samples_split = 3,
                                n_estimators = 130)
create_model(rd_clf)
Test Accuracy of RandomForestClassifier(criterion='entropy', max_depth=11, min_samples_leaf=2,
                       min_samples_split=3, n_estimators=130) : 0.95
Confusion Matrix of RandomForestClassifier(criterion='entropy', max_depth=11, min_samples_leaf=2,
                       min_samples_split=3, n_estimators=130):
[[52  0]
 [ 4 24]]
Classification Report of RandomForestClassifier(criterion='entropy', max_depth=11, min_samples_leaf=2,
                       min_samples_split=3, n_estimators=130) :
              precision    recall  f1-score   support

           0       0.93      1.00      0.96        52
           1       1.00      0.86      0.92        28

    accuracy                           0.95        80
   macro avg       0.96      0.93      0.94        80
weighted avg       0.95      0.95      0.95        80

Ada Boost Classifier

In [80]:

from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(base_estimator = dt)
create_model(ada)
Test Accuracy of AdaBoostClassifier(base_estimator=DecisionTreeClassifier()) : 0.95
Confusion Matrix of AdaBoostClassifier(base_estimator=DecisionTreeClassifier()):
[[51  1]
 [ 3 25]]
Classification Report of AdaBoostClassifier(base_estimator=DecisionTreeClassifier()) :
              precision    recall  f1-score   support

           0       0.94      0.98      0.96        52
           1       0.96      0.89      0.93        28

    accuracy                           0.95        80
   macro avg       0.95      0.94      0.94        80
weighted avg       0.95      0.95      0.95        80

Gradient Boosting Classifier

In [81]:

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier()
create_model(gb)
Test Accuracy of GradientBoostingClassifier() : 0.95
Confusion Matrix of GradientBoostingClassifier():
[[51  1]
 [ 3 25]]
Classification Report of GradientBoostingClassifier() :
              precision    recall  f1-score   support

           0       0.94      0.98      0.96        52
           1       0.96      0.89      0.93        28

    accuracy                           0.95        80
   macro avg       0.95      0.94      0.94        80
weighted avg       0.95      0.95      0.95        80

XGBoost

In [82]:

from xgboost import XGBClassifier

xgb = XGBClassifier(objective = 'binary:logistic',
                    learning_rate = 0.5,
                    max_depth = 5,
                    n_estimators = 150)

create_model(xgb)
Test Accuracy of XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.5, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=150, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None) : 0.9625
Confusion Matrix of XGBClassifier(...):
[[52  0]
 [ 3 25]]
Classification Report of XGBClassifier(...) :
              precision    recall  f1-score   support

           0       0.95      1.00      0.97        52
           1       1.00      0.89      0.94        28

    accuracy                           0.96        80
   macro avg       0.97      0.95      0.96        80
weighted avg       0.96      0.96      0.96        80

CatBoost Classifier

In [83]:

from catboost import CatBoostClassifier

cab = CatBoostClassifier(iterations=10)
create_model(cab)
Learning rate set to 0.432149
0:  learn: 0.2531464  total: 62.5ms  remaining: 562ms
1:  learn: 0.1524287  total: 63.9ms  remaining: 256ms
2:  learn: 0.0906595  total: 65.3ms  remaining: 152ms
3:  learn: 0.0578563  total: 66.8ms  remaining: 100ms
4:  learn: 0.0460263  total: 68.2ms  remaining: 68.2ms
5:  learn: 0.0356541  total: 69.6ms  remaining: 46.4ms
6:  learn: 0.0268575  total: 70.9ms  remaining: 30.4ms
7:  learn: 0.0206936  total: 72.2ms  remaining: 18ms
8:  learn: 0.0186242  total: 73.6ms  remaining: 8.17ms
9:  learn: 0.0162996  total: 75ms    remaining: 0us
Test Accuracy of <catboost.core.CatBoostClassifier object at 0x1296bfe50> : 0.95
Confusion Matrix of <catboost.core.CatBoostClassifier object at 0x1296bfe50>:
[[51  1]
 [ 3 25]]
Classification Report of <catboost.core.CatBoostClassifier object at 0x1296bfe50> :
              precision    recall  f1-score   support

           0       0.94      0.98      0.96        52
           1       0.96      0.89      0.93        28

    accuracy                           0.95        80
   macro avg       0.95      0.94      0.94        80
weighted avg       0.95      0.95      0.95        80

Extra Trees Classifier

In [84]:

from sklearn.ensemble import ExtraTreesClassifier

etc = ExtraTreesClassifier()
create_model(etc)
Test Accuracy of ExtraTreesClassifier() : 0.9625
Confusion Matrix of ExtraTreesClassifier():
[[52  0]
 [ 3 25]]
Classification Report of ExtraTreesClassifier() :
              precision    recall  f1-score   support

           0       0.95      1.00      0.97        52
           1       1.00      0.89      0.94        28

    accuracy                           0.96        80
   macro avg       0.97      0.95      0.96        80
weighted avg       0.96      0.96      0.96        80

LGBM

In [85]:

from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(learning_rate = 0.1)
create_model(lgbm)
Test Accuracy of LGBMClassifier() : 0.9625
Confusion Matrix of LGBMClassifier():
[[51  1]
 [ 2 26]]
Classification Report of LGBMClassifier() :
              precision    recall  f1-score   support

           0       0.96      0.98      0.97        52
           1       0.96      0.93      0.95        28

    accuracy                           0.96        80
   macro avg       0.96      0.95      0.96        80
weighted avg       0.96      0.96      0.96        80

Model Comparison

In [86]:

models = pd.DataFrame({"model": ["KNN", "Decision Tree", "Random Forest", "Ada Boost",
                                 "Gradient Boosting", "Xgboost", "Cat Boost", "Extra Trees", "LGBM"],
                       "acc": [0.6875, 0.9375, 0.95, 0.95, 0.95, 0.9625, 0.95, 0.9625, 0.9625]})

models

Out[86]:

In [87]:

models = models.sort_values("acc", ascending=True)  # sort ascending
models

Out[87]:

In [88]:

px.bar(models,
       x="acc",
       y="model",
       text="acc",
       color = 'acc',
       template = 'plotly_dark',
       title = 'Nine Models Comparison')
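
With only 80 test samples, a single split's accuracy is noisy; cross_val_score was imported at the top, and a quick sanity check of the ranking could look like this sketch (not part of the original run):

for name, model in [("KNN", knn), ("Random Forest", rd_clf), ("LGBM", lgbm)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")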

Model Interpretability

Here we use the random forest model (rd_clf) and explain it with the shap library.

Computing SHAP Values

In [89]:

explainer = shap.TreeExplainer(rd_clf)
# pass the test features to the explainer to compute the SHAP values
shap_values = explainer.shap_values(X_test)
shap_values

Out[89]:

[array([[ 0.00082722, -0.00174422,  0.08114011, ..., -0.00147523,
          0.01219736,  0.00090329],
        [ 0.00241676, -0.00674329, -0.10784976, ..., -0.00708956,
         -0.00392391, -0.00032504],
        [ 0.00211145, -0.00874862, -0.12467772, ..., -0.00752903,
         -0.00413827, -0.00035173],
        ...,
        [-0.00775482, -0.00930411, -0.12662254, ..., -0.00807709,
         -0.00414294, -0.00035052],
        [-0.00719232, -0.00715492, -0.11854169, ..., -0.00821187,
         -0.00448367, -0.00035973],
        [-0.00499459, -0.00897095, -0.12441733, ..., -0.0081471 ,
         -0.00434574, -0.00035173]]),
 array([[-0.00082722,  0.00174422, -0.08114011, ...,  0.00147523,
         -0.01219736, -0.00090329],
        [-0.00241676,  0.00674329,  0.10784976, ...,  0.00708956,
          0.00392391,  0.00032504],
        [-0.00211145,  0.00874862,  0.12467772, ...,  0.00752903,
          0.00413827,  0.00035173],
        ...,
        [ 0.00775482,  0.00930411,  0.12662254, ...,  0.00807709,
          0.00414294,  0.00035052],
        [ 0.00719232,  0.00715492,  0.11854169, ...,  0.00821187,
          0.00448367,  0.00035973],
        [ 0.00499459,  0.00897095,  0.12441733, ...,  0.0081471 ,
          0.00434574,  0.00035173]])]

Feature Importance

In [90]:

shap.summary_plot(shap_values[1], X_test, plot_type="bar")

The ranking shows that sg (urine specific gravity), sc (serum creatinine) and hemo (haemoglobin) are the most influential features.
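
eli5's PermutationImportance was imported at the top but not shown in use; as a model-agnostic cross-check of this ranking, a sketch would be:

perm = PermutationImportance(rd_clf, random_state=1).fit(X_test, y_test)  # shuffles each feature and measures the score drop
eli5.show_weights(perm, feature_names=X_test.columns.tolist())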

shap.summary_plot(shap_values[1], X_test)

The summary plot draws each sample's SHAP value for every feature; one point is one sample, and the colour encodes the feature value (red = high, blue = low).

Individual Differences

How the individual feature values push the prediction for a single patient:

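The per-patient plots appeared as images in the original post; a minimal sketch of how one such plot is produced with shap, where i is an arbitrary patient index in X_test:

i = 0  # pick a patient
shap.force_plot(explainer.expected_value[1],
                shap_values[1][i],
                X_test.iloc[i],
                matplotlib=True)  # shows how each feature pushes this patient toward or away from class 1
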
Across the three patients examined, the SHAP values differ substantially between individuals, even though all of them have the disease.

Title: Kaggle in Practice: Predicting Kidney Disease with Machine Learning

Published: 2022-10-20 23:10

Original link: http://www.renpeter.cn/2022/10/20/kaggle%E5%AE%9E%E6%88%98-%E5%9F%BA%E4%BA%8E%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E8%82%BE%E8%84%8F%E7%97%85%E9%A2%84%E6%B5%8B.html

License: CC BY-NC-ND 4.0 International. Please keep the original link and author attribution when reposting.
