Breast Cancer Classification in Practice with Three Tree Models
Starting from exploratory analysis of the features, this article walks through feature engineering and class-imbalance handling, then builds and compares decision tree, random forest, and gradient boosting models on a breast cancer dataset.
Dataset
The data comes from the UCI Machine Learning Repository. It is an old dataset intended for classification tasks, and you can download it yourself:
https://archive.ics.uci.edu/ml/datasets/breast+cancer
Importing Libraries
import pandas as pd
Importing the Data
The data comes from the UCI repository.
# from UCI
Basic Information
In [3]:
df.dtypes
Out[3]:
Class    object
In [4]:
df.isnull().sum()
Out[4]:
Class    0
In [5]:
## field descriptions
Out[5]:
['Class',
The meaning and value range of each field:
Attribute | Meaning | Values |
---|---|---|
Class | recurrence status | no-recurrence-events, recurrence-events |
age | age range | 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99 |
menopause | menopausal status | lt40 (menopause before 40), ge40 (menopause at 40 or later), premeno (premenopausal) |
tumor-size | tumor size | 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59 |
inv-nodes | number of involved lymph nodes | 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39 |
node-caps | node capsule present | yes, no |
deg-malig | degree of malignancy | 1, 2, 3 |
breast | breast side | left, right |
breast-quad | breast quadrant | left-up, left-low, right-up, right-low, central |
irradiat | radiation therapy | yes, no |
Removing Missing Values
In [6]:
df = df[(df["node-caps"] != "?") & (df["breast-quad"] != "?")]
Out[6]:
277
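Missing values in this dataset are encoded as the literal string "?" rather than NaN, which is why a boolean mask is used instead of dropna. A minimal sketch of the same filter on toy data (the column names match the dataset; the toy values are made up):

```python
import pandas as pd

# toy frame mimicking the dataset's "?" placeholder for missing values
df = pd.DataFrame({
    "node-caps":   ["yes", "?", "no", "no"],
    "breast-quad": ["left_low", "central", "?", "right_up"],
})

# keep only rows where neither column holds the "?" placeholder
clean = df[(df["node-caps"] != "?") & (df["breast-quad"] != "?")]
print(len(clean))  # 2
```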
Feature Processing
In [7]:
from sklearn.preprocessing import LabelEncoder
Age group: age
In [8]:
age = df["age"].value_counts().reset_index()
Most patients fall into the 40-59 age range. One-hot encode the age groups:
df = df.join(pd.get_dummies(df["age"]))
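pd.get_dummies expands one categorical column into a 0/1 indicator column per category, and join attaches them back onto the original frame. A small sketch with two toy age bins:

```python
import pandas as pd

df = pd.DataFrame({"age": ["40-49", "50-59", "40-49"]})
dummies = pd.get_dummies(df["age"])  # one indicator column per age bin
df = df.join(dummies)
print(df.columns.tolist())  # ['age', '40-49', '50-59']
```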
Menopause status: menopause
In [11]:
menopause = df["menopause"].value_counts().reset_index()
Out[11]:
index | menopause | |
---|---|---|
0 | premeno | 149 |
1 | ge40 | 123 |
2 | lt40 | 5 |
In [12]:
fig = px.pie(menopause, names="index", values="menopause")
df = df.join(pd.get_dummies(df["menopause"]))  # one-hot encoding
Tumor size: tumor-size
In [14]:
tumor_size = df["tumor-size"].value_counts().reset_index()
Out[14]:
index | tumor-size | |
---|---|---|
0 | 30-34 | 57 |
1 | 25-29 | 51 |
2 | 20-24 | 48 |
3 | 15-19 | 29 |
4 | 10-14 | 28 |
5 | 40-44 | 22 |
6 | 35-39 | 19 |
7 | 0-4 | 8 |
8 | 50-54 | 8 |
9 | 5-9 | 4 |
10 | 45-49 | 3 |
In [15]:
fig = px.bar(tumor_size,
df = df.join(pd.get_dummies(df["tumor-size"]))
Involved lymph nodes: inv-nodes
In [17]:
df["inv-nodes"].value_counts()
Out[17]:
0-2    209
In [18]:
df = df.join(pd.get_dummies(df["inv-nodes"]))
Node caps: node-caps
In [19]:
df["node-caps"].value_counts()
Out[19]:
no     221
In [20]:
df = df.join(pd.get_dummies(df["node-caps"]).rename(columns={"no": "node_capes_no", "yes": "node_capes_yes"}))
Degree of malignancy: deg-malig
In [21]:
df["deg-malig"].value_counts()
Out[21]:
2    129
Breast side: breast
In [22]:
df["breast"].value_counts()
Out[22]:
left     145
In [23]:
df = df.join(pd.get_dummies(df["breast"]))
Breast quadrant: breast-quad
In [24]:
breast_quad = df["breast-quad"].value_counts().reset_index()
Out[24]:
index | breast-quad | |
---|---|---|
0 | left_low | 106 |
1 | left_up | 94 |
2 | right_up | 33 |
3 | right_low | 23 |
4 | central | 21 |
In [25]:
fig = px.bar(breast_quad,
df = df.join(pd.get_dummies(df["breast-quad"]))
Radiation therapy: irradiat
In [27]:
df["irradiat"].value_counts()
Out[27]:
no     215
In [28]:
df = df.join(pd.get_dummies(df["irradiat"]).rename(columns={"no": "irradiat_no", "yes": "irradiat_yes"}))
Recurrence: Class
This is the target variable to be predicted.
In [29]:
dic = {"no-recurrence-events": 0, "recurrence-events": 1}
Counts of recurrence vs. non-recurrence:
sns.countplot(df['Class'], label="Count")
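Since the target has only two string labels, a dict plus Series.map is all the encoding it needs; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"Class": ["no-recurrence-events", "recurrence-events",
                             "no-recurrence-events"]})
dic = {"no-recurrence-events": 0, "recurrence-events": 1}
df["Class"] = df["Class"].map(dic)  # strings -> 0/1 target
print(df["Class"].tolist())  # [0, 1, 0]
```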
Handling Class Imbalance
In [31]:
# class distribution
Out[31]:
0    196
In [32]:
from imblearn.over_sampling import SMOTE
In [33]:
X = df.iloc[:,1:]
Out[33]:
0    0
In [34]:
groupby_df = df.groupby('Class').count()
# create the SMOTE oversampler
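The article uses imblearn's SMOTE. In case imbalanced-learn is unavailable, the core idea — synthesizing minority samples by interpolating between a minority point and one of its k nearest minority neighbors — can be sketched with scikit-learn alone (the function name smote_like is ours, not a library API):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples, SMOTE-style."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the point itself
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))         # pick a minority sample
        j = idx[i, rng.integers(1, k + 1)]   # pick one of its k neighbors
        lam = rng.random()                   # interpolation coefficient
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.random.default_rng(1).normal(size=(10, 4))
synth = smote_like(X_min, n_new=5)
print(synth.shape)  # (5, 4)
```

Because every synthetic point lies on a segment between two real minority points, it stays inside the minority class's convex hull, which is what distinguishes SMOTE from naive duplication.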
Modeling
Correlation
Analyze the correlation between each new feature and the target variable.
In [36]:
corr = df_smoted.corr()
Plot the correlation heatmap:
fig = plt.figure(figsize=(12,8))
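DataFrame.corr() gives the full correlation matrix; when the goal is feature screening, ranking features by absolute correlation with the target is often easier to read than the heatmap itself. A toy sketch (columns a, b, Class are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = df["a"] * 0.9 + rng.normal(scale=0.1, size=100)  # strongly tied to a
df["Class"] = (df["a"] > 0).astype(int)                    # toy binary target

corr = df.corr()
# features ranked by absolute correlation with the target
ranking = corr["Class"].drop("Class").abs().sort_values(ascending=False)
print(ranking.index.tolist())
```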
Train/Test Split
In [38]:
X = df_smoted.iloc[:,:-1]
Decision Tree
In [39]:
dt = DecisionTreeClassifier(max_depth=5)
Out[39]:
DecisionTreeClassifier(max_depth=5)
In [40]:
# predict
Out[40]:
1.0
In [41]:
# confusion matrix
Out[41]:
array([[29, 8],
In [42]:
# classification report
In [43]:
# ROC
Out[43]:
0.6657014157014157
In [44]:
# ROC curve
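The evaluation steps above (fit, predict, confusion matrix, classification report, ROC AUC) can be sketched end to end on synthetic data; the dataset and split parameters here are illustrative, not the article's actual data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

# synthetic stand-in for the encoded breast-cancer features
X, y = make_classification(n_samples=400, n_features=10, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

dt = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
y_pred = dt.predict(X_test)

cm = confusion_matrix(y_test, y_pred)  # rows: true class, cols: predicted class
auc = roc_auc_score(y_test, dt.predict_proba(X_test)[:, 1])
print(cm)
print(classification_report(y_test, y_pred))
print(auc)
```

Note that roc_auc_score is computed from predict_proba scores, not hard predictions; feeding it y_pred would understate the AUC.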
Random Forest
In [45]:
rf = RandomForestClassifier(max_depth=5)
Out[45]:
RandomForestClassifier(max_depth=5)
In [46]:
# predict
Out[46]:
1.0
In [47]:
# confusion matrix
Out[47]:
array([[31, 6],
In [48]:
# ROC
Out[48]:
0.7522522522522522
In [49]:
# ROC curve
Gradient Boosting Trees
In [50]:
from sklearn.ensemble import GradientBoostingClassifier
In [51]:
gbc = GradientBoostingClassifier(loss='deviance',
Out[51]:
GradientBoostingClassifier(n_estimators=5, subsample=1)
In [52]:
# predict
Out[52]:
1.0
In [53]:
# confusion matrix
Out[53]:
array([[32, 5],
In [54]:
# ROC
Out[54]:
0.6467181467181469
In [55]:
# ROC curve
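One portability note: the loss='deviance' argument in In [51] was renamed to 'log_loss' in scikit-learn 1.1 and removed in 1.3, so the cell as written fails on recent versions. A version-proof sketch simply omits it, since log loss is the default (the data here is synthetic, not the article's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=400, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

# log loss is the default objective; n_estimators=5 matches Out[51]
gbc = GradientBoostingClassifier(n_estimators=5, subsample=1.0,
                                 random_state=123).fit(X_train, y_train)
auc = roc_auc_score(y_test, gbc.predict_proba(X_test)[:, 1])
print(round(auc, 3))
```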
PCA Dimensionality Reduction
Reduction Process
In [56]:
from sklearn.decomposition import PCA
In [57]:
sum(pca.explained_variance_ratio_)
Out[57]:
0.9026686181152915
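Instead of fixing the component count and then checking sum(pca.explained_variance_ratio_), PCA can be asked directly for enough components to cover a variance fraction by passing a float n_components. A sketch on synthetic data where five directions dominate:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 30))
X[:, :5] *= 10  # five directions carry most of the variance

pca = PCA(n_components=0.90).fit(X)  # keep enough PCs for >= 90% variance
X_new = pca.transform(X)
print(X_new.shape, round(sum(pca.explained_variance_ratio_), 3))
```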
Data After Reduction
In [58]:
X_NEW = pca.transform(X)
Out[58]:
array([[ 1.70510215e-01,  5.39929099e-01, -1.04314303e+00, ...,
In [59]:
X_NEW.shape
Out[59]:
(392, 17)
Re-splitting the Data
In [60]:
X_train, X_test, y_train, y_test = train_test_split(X_NEW, y, test_size=0.20, random_state=123)
Random Forest Again
In [61]:
rf = RandomForestClassifier(max_depth=5)
Out[61]:
RandomForestClassifier(max_depth=5)
In [62]:
# predict
Out[62]:
1.0
In [63]:
# confusion matrix
Out[63]:
array([[26, 11],
In [64]:
# ROC
Out[64]:
0.6965894465894465
In [65]:
# ROC curve
Summary
Starting from data preprocessing and feature engineering, we built several tree models; the random forest performed best, with an AUC of up to 0.81. After a simple PCA reduction keeping the first 17 components, which together explain more than 90% of the variance, rebuilding the model raised the AUC to 0.83.