天池大赛-工业蒸汽量预测-预处理篇

天池大赛：工业蒸汽量预测baseline

本文是对阿里云天池官网的工业蒸汽数据的建模分析，数据来自天池官网，本文仅供学习使用。

https://tianchi.aliyun.com/competition/entrance/231693/information

赛题背景

火力发电的基本原理是：燃料在燃烧时加热水生成蒸汽，蒸汽压力推动汽轮机旋转，然后汽轮机带动发电机旋转，产生电能。在这一系列的能量转化中，影响发电效率的核心是锅炉的燃烧效率，即燃料燃烧加热水产生高温高压蒸汽。

锅炉的燃烧效率的影响因素很多，包括锅炉的可调参数，如燃烧给量，一二次风，引风，返料风，给水水量；以及锅炉的工况，比如锅炉床温、床压，炉膛温度、压力，过热器的温度等。

常见回归模型

常见的回归预测模型使用的算法包括：

线性回归(Linear Regression)
岭回归(Ridge Regression)
LASSO (Least Absolute Shrinkage and Selection Operator)回归
决策树回归(Decision TreeRegressor)
梯度提升树回归(Gradient Boosting Decison Tree Regressor)

导入库

主要是用于数据处理、可视化、建模等

import pandas as pd
import numpy as np
from scipy import stats

# 1、基于plotly
import plotly as py
import plotly.express as px
import plotly.graph_objects as go
py.offline.init_notebook_mode(connected = True)
from plotly.subplots import make_subplots  # 多子图

# 2、基于matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline
# 中文显示问题
plt.rcParams["font.sans-serif"]=["SimHei"] #设置字体
plt.rcParams["axes.unicode_minus"]=False #正常显示负号

# 3、基于seaborn
import seaborn as sns
# plt.style.use("fivethirtyeight")
plt.style.use('ggplot')

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit

# 标准化
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler

# 忽略notebook中的警告
import warnings
warnings.filterwarnings("ignore")

导入数据

导入训练集和测试集

数据来自阿里天池官网，两个txt的文件：

In [2]:

train = pd.read_table("zhengqi_train.txt")
test = pd.read_table("zhengqi_test.txt")

train.head()

Out[2]:

数据说明

数据分成训练数据（train.txt）和测试数据（test.txt），其中字段”V0”-“V37”，这38个字段是作为特征变量，”target”作为目标变量。

利用训练数据训练出模型，预测测试数据的目标变量，排名结果依据预测结果的均方误差MSE（mean square error）为评判标准。评价指标为：

$$\mathrm{MSE}=\frac{S S E}{n}=\frac{1}{n} \sum_{i=1}^{n} w_{i}\left(y_{i}-\hat{y}_{i}\right)^{2}$$

在scikit-learn中如何实现指标的验证？

1 2	from sklearn.metrics import mean_squared_error mean_squared_error(y_test, y_predict)

数据EDA

基本信息

主要是针对训练集train的基本信息探索：

In [3]:

1	train.shape

Out[3]:

1	(2888, 39)

In [4]:

1 2	# 缺失值情况 train.isnull().sum()

Out[4]:

数据比较完美，没有任何缺失值

V0        0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
V29       0
V30       0
V31       0
V32       0
V33       0
V34       0
V35       0
V36       0
V37       0
target    0
dtype: int64

In [5]:

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2888 entries, 0 to 2887
Data columns (total 39 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   V0      2888 non-null   float64
 1   V1      2888 non-null   float64
 2   V2      2888 non-null   float64
 3   V3      2888 non-null   float64
 4   V4      2888 non-null   float64
 5   V5      2888 non-null   float64
 6   V6      2888 non-null   float64
 7   V7      2888 non-null   float64
 8   V8      2888 non-null   float64
 9   V9      2888 non-null   float64
 10  V10     2888 non-null   float64
 11  V11     2888 non-null   float64
 12  V12     2888 non-null   float64
 13  V13     2888 non-null   float64
 14  V14     2888 non-null   float64
 15  V15     2888 non-null   float64
 16  V16     2888 non-null   float64
 17  V17     2888 non-null   float64
 18  V18     2888 non-null   float64
 19  V19     2888 non-null   float64
 20  V20     2888 non-null   float64
 21  V21     2888 non-null   float64
 22  V22     2888 non-null   float64
 23  V23     2888 non-null   float64
 24  V24     2888 non-null   float64
 25  V25     2888 non-null   float64
 26  V26     2888 non-null   float64
 27  V27     2888 non-null   float64
 28  V28     2888 non-null   float64
 29  V29     2888 non-null   float64
 30  V30     2888 non-null   float64
 31  V31     2888 non-null   float64
 32  V32     2888 non-null   float64
 33  V33     2888 non-null   float64
 34  V34     2888 non-null   float64
 35  V35     2888 non-null   float64
 36  V36     2888 non-null   float64
 37  V37     2888 non-null   float64
 38  target  2888 non-null   float64
dtypes: float64(39)
memory usage: 880.1 KB

In [6]:

描述统计信息主要是查看数值型字段的统计指标，比如：数量、均值、方差、最值、分位数等

1
2
3

# 描述统计信息

train.describe()

Out[6]:

数据可视化-箱型图

箱型图主要是查看特征变量的分布，检验是否有异常值：

In [7]:

1	parameters = train.columns.tolist()[:-1]

In [8]:

# 两个基本参数：设置行、列
fig = make_subplots(rows=8, cols=5)  # 8行5列

for i, v in enumerate(parameters):  # parameters 长度是28
    r = i // 5 + 1
    c = (i+1) % 5

    if c ==0:
        fig.add_trace(go.Box(y=train[v].tolist(),name=v),
                 row=r, col=5)
    else:
        fig.add_trace(go.Box(y=train[v].tolist(),name=v),
                 row=r, col=c)

fig.update_layout(width=800, height=900)

fig.show()

从上面的图形箱型图中能够观察到：很多字段是存在异常值的（圆点部分），可以考虑进行删除

获取异常数据

通过建模的方法来检查数据中的异常值：

In [9]:

def find_outliers(model,X,y,sigma=3):
    try:
        y_pred = pd.Series(model.predict(X), index=y.index)
    except:
        model.fit(X,y)
        y_pred = pd.Series(model.predict(X), index=y.index)

    # 计算残差
    resid = y - y_pred
    # 均值
    mean_resid = resid.mean()
    # 方差
    std_resid = resid.std()

    z = (resid - mean_resid) / std_resid
    outliers = z[abs(z) > sigma].index

    # 得分
    print("R2=", model.score(X,y))
    print("mse=", mean_squared_error(y,y_pred))

    # ----------------------------
    print("mean of residuals", mean_resid)
    print("std of residuals", std_resid)

    # -------------打印离群点信息--------------
    print(len(outliers))
    print(outliers.tolist())

    # 绘图
    # 图形大小设置
    plt.figure(figsize=(15,5))

    # 第一个子图信息：真实值和预测值的散点图
    ax_131 = plt.subplot(1,3,1)
    plt.plot(y, y_pred, ".")
    plt.plot(y.loc[outliers],y_pred[outliers],"ro")
    plt.legend(["Accepted", "Outlier"])  # 图例
    plt.xlabel("y")  # 两个轴
    plt.ylabel("y_pred")

    # 第二个子图：真实值和残差值
    ax_132 = plt.subplot(1,3,2)
    plt.plot(y,y-y_pred,".")
    plt.plot(y.loc[outliers], y.loc[outliers] - y_pred[outliers], "ro")
    plt.legend(["Accepted", "Outlier"])
    plt.xlabel("y")
    plt.ylabel("y - y_pred")

    # 第三个子图
    ax_133 = plt.subplot(1,3,3)
    z.plot.hist(bins=50,ax=ax_133)
    z.loc[outliers].plot.hist(color="r",bins=50,ax=ax_133)
    plt.legend(["Accepted", "Outlier"])
    plt.xlabel("z")

     # 保存图片
    plt.savefig("outliers.png")

    return outliers

岭回归

通过岭回归来找到异常值，并且绘制出分布图形：

In [10]:

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X_train = train.iloc[:, 0:-1]  # 特征数据
y_train = train.iloc[:,-1]  # 标签数据

outliers = find_outliers(Ridge(), X_train, y_train)
outliers

R2= 0.8890858938210386
mse= 0.10734857773123632
mean of residuals -1.4031558309840103e-18
std of residuals 0.3276976673193502
31
[321, 348, 376, 777, 884, 1145, 1164, 1310, 1458, 1466, 1484, 1523, 1704, 1874, 1879, 1979,
 2002, 2279, 2528, 2620, 2645, 2647, 2667, 2668, 2669, 2696, 2767, 2769, 2807, 2842, 2863]

Out[10]:

Int64Index([ 321,  348,  376,  777,  884, 1145, 1164, 1310, 1458, 1466, 1484,
            1523, 1704, 1874, 1879, 1979, 2002, 2279, 2528, 2620, 2645, 2647,
            2667, 2668, 2669, 2696, 2767, 2769, 2807, 2842, 2863],
           dtype='int64')

下面的散点图中能够观察到异常值：

随机森林回归

In [11]:

from sklearn import ensemble
rfr= ensemble.RandomForestRegressor(n_estimators=50) # 使用50个决策树

outliers = find_outliers(rfr, X_train, y_train)
outliers
R2= 0.9814837452622395
mse= 0.017921017258171745
mean of residuals -0.0034968836565097126
std of residuals 0.13384689878757441
37
[344, 348, 376, 419, 693, 771, 776, 777, 847, 848, 1036, 1069, 1070, 1083, 1141, 1145, 1164, 1458, 1704, 1837, 1921, 1932, 2002, 2166, 2211, 2279, 2532, 2620, 2645, 2647, 2655, 2667, 2668, 2669, 2800, 2801, 2807]

Out[11]:

Int64Index([ 344,  348,  376,  419,  693,  771,  776,  777,  847,  848, 1036,
            1069, 1070, 1083, 1141, 1145, 1164, 1458, 1704, 1837, 1921, 1932,
            2002, 2166, 2211, 2279, 2532, 2620, 2645, 2647, 2655, 2667, 2668,
            2669, 2800, 2801, 2807],
           dtype='int64')

直方图和QQ图

QQ图：数据的分位数和正态分布的分位数的对比参照图。如果数据符合正态分布，则所有的图都会落在直线上。以变量V0为例：

In [12]:

plt.figure(figsize=(10,5))

ax = plt.subplot(1,2,1)
sns.distplot(train["V0"], fit=stats.norm)

ax = plt.subplot(1,2,2)
res = stats.probplot(train["V0"], plot=plt)

可以看到V0的分布不是正态分布。下面绘制全部变量的QQ图：

In [13]:

V_list = train.columns[:-1]

for v in V_list:
    plt.figure(figsize=(10,5))
    ax = plt.subplot(1,2,1)
    sns.distplot(train[v], fit=stats.norm)
    ax = plt.subplot(1,2,2)
    res = stats.probplot(train[v], plot=plt)

部分截图展示：

KDE图

KDE(Kernel Density Estimation，核密度图)，可以认为是对直方图的加窗平滑。通过KDE分布图，可以查看训练集和测试集中特征变量的分布情况。

对比V0在训练集和测试集中的分布情况：

In [14]:

plt.figure(figsize=(6,3), dpi=150)

ax = sns.kdeplot(train["V0"], color="green", shade=True)
ax = sns.kdeplot(test["V0"], color="blue", shade=True)

ax.set_xlabel("V0")
ax.set_ylabel("Frequency")
ax = ax.legend(["train","test"])

结果显示：V0变量在训练集和测试集上分布基本是一致的。

线性回归关系图

单个变量

线性回归关系图主要是用于分析变量之间的线性回归关系：

In [15]:

plt.figure(figsize=(6,3), dpi=150)

ax = plt.subplot(1,2,1)
sns.regplot(x="V0",y="target",data=train,ax=ax,
            scatter_kws={"marker":".", "s":3, "alpha":0.3},
            line_kws={"color":"k"})

plt.xlabel("V0")
plt.ylabel("target")

ax = plt.subplot(1,2,2)
sns.distplot(train["V0"].dropna())
plt.xlabel("V0")

plt.show()

plt.figure(figsize=(6,3), dpi=150)

ax = plt.subplot(1,2,1)
sns.regplot(x="V10",y="target",data=train,ax=ax,
            scatter_kws={"marker":".", "s":3, "alpha":0.3},
            line_kws={"color":"k"})

plt.xlabel("V0")
plt.ylabel("target")

ax = plt.subplot(1,2,2)
sns.distplot(train["V0"].dropna())
plt.xlabel("V0")

plt.show()

所有变量

# fcols = 6
# frows = len(test)
# plt.figure(figsize=(fcols, frows))

# i = 0
# for col in test.columns:
#     i += 1
#     ax = plt.subplot(frows, fcols, i)
#     sns.regplot(x=col, y="target", data=train,
#                 ax=ax, scatter_kws={"marker":".", "s":3, "alpha":0.3},
#                 line_kws={"color":"k"}
#                )
#     plt.xlabel(col)
#     plt.ylabel("target")

#     i += 1

#     ax = plt.subplot(frows, fcols, i)
#     sns.distplot(train[col].dropna())

#     plt.xlabel(col)

drop_columns = ["V5","V9","V11","V17","V22","V28"]

train_x = train.drop(["target"],axis=1)

# 数据合并
data = pd.concat([train_x, test], axis=0, ignore_index=True)

# 删除无效字段
data.drop(drop_columns, axis=1, inplace=True)
data.head()

归一化操作

In [28]:

columns_all = data.columns.tolist()

def scale_minmax(col):
    result = (col - col.min()) / (col.max() - col.min())
    return result

In [29]:

1
2
3

data[columns_all] = data[columns_all].apply(scale_minmax, axis=0)

data[columns_all].describe()

Out[29]:

# 训练集和测试集单独归一化操作

train_process = train[columns_all]
train_process = train_process[columns_all].apply(scale_minmax,axis=0)

test_process = test[columns_all]
test_process = test_process[columns_all].apply(scale_minmax,axis=0)

可视化展示

将进行转换后的数据进行展示：

In [31]:

cols_left = columns_all[:13]
cols_right = columns_all[13:]

train_process = pd.concat([train_process, train["target"]], axis=1)
train_process.head()

Out[31]:

fcols = 6
frows = len(cols_left)

plt.figure(figsize=(3 * fcols, 3 * frows))
i = 0

for var in cols_left:
    data = train_process[[var, "target"]].dropna()
    i += 1
    plt.subplot(frows,fcols,i)
    sns.distplot(data[var], fit=stats.norm)
    plt.title(var + " Original")
    plt.xlabel("")

    i += 1
    plt.subplot(frows,fcols,i)
    _ = stats.probplot(data[var], plot=plt)
    plt.title('skew=' + '{:.4f}'.format(stats.skew(data[var])))
    plt.xlabel('')
    plt.ylabel('')

    i += 1
    plt.subplot(frows, fcols, i)
    plt.plot(data[var], data["target"],'.',alpha=0.5)
    plt.title('corr=' + '{:.2f}'.format(np.corrcoef(data[var], data["target"])[0][1]))

    i += 1
    plt.subplot(frows, fcols, i)
    trans_var, lambda_var = stats.boxcox(data[var].dropna() + 1)
    trans_var = scale_minmax(trans_var)
    sns.distplot(trans_var,fit=stats.norm)
    plt.title(var + " Transformed")
    plt.xlabel("")

    i += 1
    plt.subplot(frows,fcols,i)
    _ = stats.probplot(trans_var, plot=plt)
    plt.title('skew=' + '{:.4f}'.format(stats.skew(trans_var)))
    plt.xlabel('')
    plt.ylabel('')

    i += 1
    plt.subplot(frows,fcols,i)
    plt.plot(trans_var,data["target"],'.',alpha=0.5)
    plt.title('corr=' + '{:.2f}'.format(np.corrcoef(trans_var, data["target"])[0][1]))

特征工程

数据和特征决定了机器学习的上限，而模型和算法只是逼近这个上限而已。

特征工程一般包括：

特征使用
特征获取
特征处理
特征选择
特征监控

主要的处理流程：

先去掉无用特征
接着去除冗余特征
可以考虑生成新的有用特征
特征转换（数值化、类型转换、归一化等）
特征处理以及符合模型的使用