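Steel Defect Detection with Seven Classification Models

Detecting Steel Defect Types with Machine Learning

The dataset used in this article comes from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Steel+Plates+Faults

It covers seven types of steel plate faults (Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps, and Other_Faults), each sample being described by 27 features.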

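The main topics covered in this article are data preprocessing, exploratory data analysis, handling class imbalance with SMOTE, feature scaling, and building and evaluating seven classification models.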

Data Preprocessing

Importing the Data

In [1]:

import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
# Subplots
from plotly.subplots import make_subplots

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid")
%matplotlib inline

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:

df = pd.read_excel("faults.xlsx")
df.head()

Out[2]:

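Splitting the Data

Separate the seven fault-type columns from the feature columns that precede them:

df1 = df.loc[:,"Pastry":]  # the 7 fault-type columns
df2 = df.loc[:,:"SigmoidOfAreas"]  # the feature columns only

# Fault-type columns
df1.head()

The remaining 27 columns hold the features: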

Generating the Class Label

Collapse the seven fault-type columns into a single label column:
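The code for this step is not shown here. One possible way to do it, assuming every row has exactly one of the seven indicator columns set to 1, is to take the name of the column that holds the row maximum. The toy frame below is made up for illustration and uses only three of the column names.

```python
import pandas as pd

# Hypothetical toy data: three of the seven one-hot fault columns.
toy = pd.DataFrame({
    "Pastry":    [1, 0, 0, 0],
    "Z_Scratch": [0, 1, 0, 1],
    "K_Scatch":  [0, 0, 1, 0],
})

# Sanity check for the assumption: exactly one indicator is set per row.
assert (toy.sum(axis=1) == 1).all()

# idxmax(axis=1) returns, for each row, the name of the column with the largest value.
toy_labels = toy.idxmax(axis=1)
print(toy_labels.tolist())
```

On the real data the equivalent would be something like `df1["Label"] = df1.idxmax(axis=1)`, which yields string labels that can then be mapped to integers.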

Encoding the Fault Types

In [7]:

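# Names of the seven fault-type columns
columns = df.loc[:, "Pastry":].columns.tolist()

dic = {}
for i, v in enumerate(columns):
    dic[v] = i  # class indices start at 0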

dic

Out[7]:

{'Pastry': 0,
 'Z_Scratch': 1,
 'K_Scatch': 2,
 'Stains': 3,
 'Dirtiness': 4,
 'Bumps': 5,
 'Other_Faults': 6}

In [8]:

df1["Label"] = df1["Label"].map(dic)

df1.head()

Out[8]:

Merging the Data

In [9]:

df2["Label"] = df1["Label"]
df2.head()

EDA

Basic Statistics

In [10]:

# Missing values
df2.isnull().sum()

The output shows that there are no missing values.

Distribution of Individual Features

parameters = df2.columns[:-1].tolist()

sns.boxplot(data=df2, y="Steel_Plate_Thickness")
plt.show()

The box plot shows how the values of a single feature are distributed. Next, draw box plots for all the features:

# Grid of subplots: 7 rows x 4 columns, enough for the 27 features
fig = make_subplots(rows=7, cols=4)

# Add one box-plot trace per feature, filling the grid row by row
for i, v in enumerate(parameters):
    r = i // 4 + 1  # row index, 1-based
    c = i % 4 + 1   # column index, 1-based
    fig.add_trace(go.Box(y=df2[v].tolist(), name=v), row=r, col=c)

fig.update_layout(width=1000, height=900)

fig.show()

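A few observations:

The features span very different ranges, from negative values up to about 10M.
Some features contain outliers.
Some features only take the values 0 and 1.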
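These observations can also be checked numerically rather than read off the plots. The sketch below runs on a small made-up frame (the column names and values are illustrative, not taken from the dataset) and uses the common 1.5 x IQR rule as an assumed definition of an outlier.

```python
import pandas as pd

# Hypothetical stand-in for the feature columns of df2.
toy = pd.DataFrame({
    "wide_range": [-5.0, 0.0, 3.0, 8.0, 1.0e7],
    "binary_flag": [0, 1, 1, 0, 1],
    "small_range": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# 1) Value range of each feature.
ranges = toy.agg(["min", "max"]).T

# 2) Features whose values are all 0 or 1.
binary_cols = [c for c in toy.columns if set(toy[c].unique()) <= {0, 1}]

# 3) Number of points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each feature.
q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
below = toy.lt(q1 - 1.5 * iqr, axis="columns")
above = toy.gt(q3 + 1.5 * iqr, axis="columns")
outliers = (below | above).sum()

print(ranges)
print(binary_cols)
print(outliers)
```

Replacing `toy` with `df2[parameters]` would apply the same checks to the real features.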

Class Imbalance

Number of Samples per Class

In [15]:

# Number of samples per class
df2["Label"].value_counts()

Out[15]:

6    673
5    402
2    391
1    190
0    158
3     72
4     55
Name: Label, dtype: int64

Class 6 has 673 samples while class 4 has only 55, so the classes are clearly imbalanced.

Fixing It with SMOTE

In [16]:

X = df2.drop("Label", axis=1)
y = df2[["Label"]]

In [17]:

# SMOTE from the over-sampling module of the imblearn library
from imblearn.over_sampling import SMOTE

# Fix the random seed for reproducibility
smo = SMOTE(random_state=42)
X_smo, y_smo = smo.fit_resample(X, y)
y_smo
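For intuition, SMOTE creates each synthetic minority sample by picking a minority point, choosing one of its nearest minority neighbors, and taking a random point on the segment between the two. The snippet below is a simplified NumPy sketch of that single interpolation step, not imblearn's implementation; `toy_minority`, `synth_one`, and the choice of `k` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class samples (one per row) in a 2-D feature space.
toy_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [3.0, 3.0]])

def synth_one(X, i, k, rng):
    """Create one synthetic point from X[i] and one of its k nearest neighbors."""
    dists = np.linalg.norm(X - X[i], axis=1)
    # Position 0 of the sorted order is X[i] itself (distance 0, assuming no duplicates).
    neighbors = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbors)
    gap = rng.random()  # uniform in [0, 1)
    return X[i] + gap * (X[j] - X[i])

new_point = synth_one(toy_minority, i=0, k=2, rng=rng)
print(new_point)
```

Because the new point lies between two existing minority samples, it stays inside the region the minority class already occupies instead of duplicating a row exactly.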

Count the samples in each class after resampling:
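With imblearn's default `sampling_strategy`, SMOTE oversamples every class except the largest one up to the size of the largest class. Assuming that default applies here, the expected counts follow directly from the class sizes listed earlier:

```python
import math

# Class counts before resampling, copied from the value_counts() output.
before = {6: 673, 5: 402, 2: 391, 1: 190, 0: 158, 3: 72, 4: 55}

target = max(before.values())  # size of the majority class
after = {label: target for label in before}
total_after = sum(after.values())

print(after)
print(total_after)  # 7 * 673 = 4711

# A 20% hold-out of 4711 rows is 942.2; scikit-learn rounds the test size up.
print(math.ceil(0.2 * total_after))  # 943
```

The last number agrees with the 943 test samples that appear in the evaluation code later on.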

Feature Scaling

Standardizing the Feature Matrix

In [19]:

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

ss = StandardScaler()
data_ss = ss.fit_transform(X_smo)

# To map the scaled values back to the original scale:
# origin_data = ss.inverse_transform(data_ss)
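`StandardScaler` rescales each column to zero mean and unit variance, using the population standard deviation. The NumPy sketch below reproduces that computation on a made-up matrix, so the effect on columns with very different ranges is easy to see; `toy_X` is hypothetical.

```python
import numpy as np

# Hypothetical feature matrix: two columns on very different scales.
toy_X = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 5000.0],
                  [4.0, 7000.0]])

mean = toy_X.mean(axis=0)
std = toy_X.std(axis=0)  # ddof=0, i.e. the population standard deviation
toy_scaled = (toy_X - mean) / std

# Inverse transform: undo the scaling to recover the original values.
toy_restored = toy_scaled * std + mean

print(toy_scaled.mean(axis=0))  # approximately 0 for every column
print(toy_scaled.std(axis=0))   # approximately 1 for every column
```

After scaling, both columns carry the same values even though one was originally thousands of times larger, which is the point of this step for scale-sensitive models such as logistic regression.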

The Standardized Feature Matrix

In [21]:

df3 = pd.DataFrame(data_ss, columns=X_smo.columns)
df3.head()

Out[21]:

Adding y_smo

In [22]:

df3["Label"] = y_smo
df3.head()

Modeling

Shuffling the Data

In [23]:

from sklearn.utils import shuffle
df3 = shuffle(df3)

Splitting into Training and Test Sets

In [24]:

X = df3.drop("Label", axis=1)
y = df3[["Label"]]

In [25]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=4)

Building and Evaluating Models

Wrap the fit, predict, and score steps in a function:

In [26]:

from sklearn.model_selection import cross_val_score  # cross-validation scores
from sklearn import metrics  # evaluation metrics


def build_model(model, X_train, y_train, X_test, y_test):
    # Flatten the labels to 1-D arrays
    y_train = np.ravel(y_train)
    y_test = np.ravel(y_test)

    model.fit(X_train, y_train)
    # Predicted class probabilities
    y_proba = model.predict_proba(X_test)
    # The index of the highest probability is the predicted class
    # (the labels are 0..6, so the index equals the label)
    y_pred = np.argmax(y_proba, axis=1)

    print(f"{model} model scores:")
    print("Recall: ", metrics.recall_score(y_test, y_pred, average="macro"))
    print("Precision: ", metrics.precision_score(y_test, y_pred, average="macro"))

# Logistic regression (classification)
from sklearn.linear_model import LogisticRegression
# Create the model
model_LR = LogisticRegression()
# Call the helper
build_model(model_LR, X_train, y_train, X_test, y_test)

LogisticRegression() model scores:
Recall:  0.8247385525937151
Precision:  0.8126617210922679
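Because the helper takes the model as an argument, the same evaluation can be run over several classifiers in a loop. The sketch below is self-contained and runs on synthetic data from `make_classification`, so its scores say nothing about the steel dataset; `evaluate` and the particular models and settings are illustrative choices, not necessarily the ones behind the results reported in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the real data.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

def evaluate(model, Xtr, ytr, Xte, yte):
    """Fit the model and return (macro recall, macro precision) on the test split."""
    model.fit(Xtr, ytr)
    y_pred = model.predict(Xte)
    return (recall_score(yte, y_pred, average="macro"),
            precision_score(yte, y_pred, average="macro", zero_division=0))

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {name: evaluate(m, Xtr, ytr, Xte, yte) for name, m in models.items()}
for name, (rec, prec) in scores.items():
    print(f"{name}: recall={rec:.3f}, precision={prec:.3f}")
```

Returning the scores instead of only printing them makes it easy to collect them into a comparison table afterwards.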

Each model is built separately below:

Logistic Regression

Building the Model

In [28]:

from sklearn.linear_model import LogisticRegression  # logistic regression (classification)
from sklearn.model_selection import cross_val_score  # cross-validation scores
from sklearn import metrics  # evaluation metrics

# Create the model
model_LR = LogisticRegression()
model_LR.fit(X_train, y_train)

Out[28]:

LogisticRegression()

Prediction

In [29]:

# Predicted class probabilities
y_proba = model_LR.predict_proba(X_test)
y_proba[:3]

Out[29]:

array([[4.83469692e-01, 4.23685363e-07, 1.08028560e-10, 3.19294899e-07,
        8.92035714e-02, 1.33695855e-02, 4.13956408e-01],
       [3.49120137e-03, 6.25018002e-03, 9.36037717e-03, 3.64702993e-01,
        1.96814910e-01, 1.35722642e-01, 2.83657697e-01],
       [1.82751269e-05, 5.55981861e-01, 3.16768568e-05, 4.90023258e-03,
        2.84504970e-03, 3.67190965e-01, 6.90319398e-02]])

In [30]:

# The index of the highest probability is the predicted class
y_pred = np.argmax(y_proba,axis=1)
y_pred[:3]

Out[30]:

array([0, 3, 1])
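Using the three probability rows printed above, this step can be checked by hand: each row sums to 1 (up to printing precision), and the position of the largest entry is the predicted class.

```python
import numpy as np

# The first three rows of y_proba, copied from the output above.
proba = np.array([
    [4.83469692e-01, 4.23685363e-07, 1.08028560e-10, 3.19294899e-07,
     8.92035714e-02, 1.33695855e-02, 4.13956408e-01],
    [3.49120137e-03, 6.25018002e-03, 9.36037717e-03, 3.64702993e-01,
     1.96814910e-01, 1.35722642e-01, 2.83657697e-01],
    [1.82751269e-05, 5.55981861e-01, 3.16768568e-05, 4.90023258e-03,
     2.84504970e-03, 3.67190965e-01, 6.90319398e-02],
])

row_sums = proba.sum(axis=1)          # each row is a probability distribution
pred = np.argmax(proba, axis=1)       # column index of the largest probability

print(row_sums)
print(pred)  # [0 3 1]
```

The first row shows why the probabilities are worth inspecting: class 0 wins with about 0.48, but class 6 is close behind at about 0.41. Since the labels here are the integers 0 to 6 in order, calling `model_LR.predict(X_test)` should give the same predictions.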

Evaluation

In [31]:

# Confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
confusion_matrix

Out[31]:

array([[114,   6,   0,   0,   7,  11,  10],
       [  0, 114,   1,   0,   2,   4,   4],
       [  0,   1, 130,   0,   0,   0,   2],
       [  0,   0,   0, 140,   0,   1,   0],
       [  1,   0,   0,   0, 120,   3,   6],
       [ 13,   3,   2,   0,   3,  84,  11],
       [ 21,  13,   9,   2,   9,  25,  71]])

In [32]:

y_pred.shape

Out[32]:

(943,)

In [33]:

y_test = np.array(y_test).reshape(943)

In [34]:

print("Recall: ", metrics.recall_score(y_test, y_pred, average="macro"))
print("Precision: ", metrics.precision_score(y_test, y_pred, average="macro"))

Recall:  0.8247385525937151
Precision:  0.8126617210922679
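The macro-averaged scores can be reproduced from the confusion matrix alone: per-class recall divides the diagonal by the row sums, per-class precision divides it by the column sums, and the macro average is the plain mean over the seven classes. A minimal NumPy check, with the matrix copied from the output above:

```python
import numpy as np

# Confusion matrix (rows: true class, columns: predicted class).
cm = np.array([
    [114,   6,   0,   0,   7,  11,  10],
    [  0, 114,   1,   0,   2,   4,   4],
    [  0,   1, 130,   0,   0,   0,   2],
    [  0,   0,   0, 140,   0,   1,   0],
    [  1,   0,   0,   0, 120,   3,   6],
    [ 13,   3,   2,   0,   3,  84,  11],
    [ 21,  13,   9,   2,   9,  25,  71],
])

true_positives = np.diag(cm)
recall_per_class = true_positives / cm.sum(axis=1)     # TP / samples of that true class
precision_per_class = true_positives / cm.sum(axis=0)  # TP / samples predicted as that class

macro_recall = recall_per_class.mean()
macro_precision = precision_per_class.mean()

print(recall_per_class.round(3))
print(round(float(macro_recall), 4), round(float(macro_precision), 4))  # 0.8247 0.8127
```

The per-class view also shows where logistic regression struggles: class 6 (Other_Faults) has by far the lowest recall, 71 out of 150, while class 3 (Stains) is recognized almost perfectly.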

Random Forest

Support Vector Machine

Decision Tree

Neural Network

GBDT

from sklearn.ensemble import GradientBoostingClassifier
gbdt = GradientBoostingClassifier(
#     loss='deviance',
#     learning_rate=1,
#     n_estimators=5,
#     subsample=1,
#     min_samples_split=2,
#     min_samples_leaf=1,
#     max_depth=2,
#     init=None,
#     random_state=None,
#     max_features=None,
#     verbose=0,
#     max_leaf_nodes=None,
#     warm_start=False
)

gbdt.fit(X_train, y_train)

# Predicted class probabilities
y_proba = gbdt.predict_proba(X_test)
# Index of the highest probability
y_pred = np.argmax(y_proba, axis=1)

print("Recall: ", metrics.recall_score(y_test, y_pred, average="macro"))
print("Precision: ", metrics.precision_score(y_test, y_pred, average="macro"))

Recall:  0.9034547294196564
Precision:  0.9000750791353891

LightGBM

Results

Model	Recall	Precision
Logistic regression	0.8247	0.8126
Random forest	0.9176	0.9149
Support vector machine	0.8897	0.8856
Decision tree	0.8698	0.8646
Neural network	0.8908	0.8863
GBDT	0.9034	0.9000
LightGBM	0.9363	0.9331

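The results point to two conclusions:

The ensemble methods (LightGBM, GBDT, and random forest) outperform all the other models on both recall and precision.
LightGBM performs best, with a recall of 0.9363 and a precision of 0.9331.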
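The ranking can be reproduced by loading the reported scores into a DataFrame and sorting. The numbers below are copied from the results table; the English model names are this sketch's own labels.

```python
import pandas as pd

# Scores copied from the results table.
results = pd.DataFrame({
    "model": ["Logistic regression", "Random forest", "Support vector machine",
              "Decision tree", "Neural network", "GBDT", "LightGBM"],
    "recall": [0.8247, 0.9176, 0.8897, 0.8698, 0.8908, 0.9034, 0.9363],
    "precision": [0.8126, 0.9149, 0.8856, 0.8646, 0.8863, 0.9000, 0.9331],
})

# Rank the models from best to worst by macro recall.
ranked = results.sort_values("recall", ascending=False).reset_index(drop=True)
print(ranked)
```

Sorting by precision instead of recall gives the same order, so the conclusion does not depend on which of the two metrics is used.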