Fork me on GitHub

kaggle实战-精美可视化与时序预测

kaggle实战-销售数据的精美可视化分析与时序预测

本文是基于一份商品销售数据,使用Pandas、seaborn、statmodels、sklearn、线性回归预测、xgboost等库和方法进行多角度的可视化分析和时序预测。

结果

先展示部分的可视化结果:

导入库

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess
from sklearn.linear_model import LinearRegression
from pandas import date_range
from statsmodels.graphics.tsaplots import plot_pacf

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import LabelEncoder
from xgboost import XGBRegressor

from sklearn.linear_model import ElasticNet, Lasso, Ridge

from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.multioutput import RegressorChain
import warnings
warnings.filterwarnings("ignore")

读取数据

1
2
3
4
5
6
7
holidays = pd.read_csv('holidays_events.csv')
oil = pd.read_csv('oil.csv')
stores = pd.read_csv('stores.csv')
trans = pd.read_csv('transactions.csv')

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

下面是具体的字段解释:

train/test

id:id号

date:日期

store_nbr:所在商店;比如1-2-3-4…

family:所属商品类型;比如:AUTOMOTIVE, HOME APPLIANCES, SCHOOL AND OFFICE SUPPLIES

sales:当日的销售额

onpromotion:商品在当日的进货数量

holidays

date:日期

type:类型;包含Holiday/Event/Other3个取值

locale:事件范围区域;National/Local/Other(全国的、本地的或者其他)

locale_name:区域名称

description:节日的描述信息

transferred:是否是推迟后节日;True或者False

oil

id:日期

dcoilwtico:当天的油价

stores

store_nbr:所属商店类型

city:所在城市

state:所在州

type:类型;A-B-C-D-E

cluster:聚类结果;就是一群类似的商店组成的团体;1-2-3-4-5…

trans

date:日期

store_nbr:所在商店

transcations:当天的交易额

思维导图

思维导图中整理了5个csv文件中的数据字段以及它们之间的关联关系:

数据基本信息

以holidays数据为例,查看数据的基本信息:

1、head方法查看前5条数据

1
2
3
# holidays

holidays.head()

2、shape方法查看数据和行和列

1
holidays.shape
(350, 6)

3、describe查看数据的描述统计信息;只针对数值型字段

1
holidays.describe()

4、isnull方法查看数据的缺失值情况

1
holidays.isnull().sum()
date           0
type           0
locale         0
locale_name    0
description    0
transferred    0
dtype: int64
1
train.head()

1
oil.head()

1
stores.head()

1
trans.head()

时间格式转化

1
2
3
4
5
holidays['date'] = pd.to_datetime(holidays['date'], format = "%Y-%m-%d")
oil['date'] = pd.to_datetime(oil['date'], format = "%Y-%m-%d")
trans['date'] = pd.to_datetime(trans['date'], format = "%Y-%m-%d")
train['date'] = pd.to_datetime(train['date'], format = "%Y-%m-%d")
test['date'] = pd.to_datetime(test['date'], format = "%Y-%m-%d")

精美可视化

油价随时间变化(原始数据)

1
oil.head()

1
2
3
4
5
6
7
8
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(25,15))
oil.plot.line(x="date",
y="dcoilwtico",
color='b', # 颜色
title ="dcoilwtico", # 标题
ax = axes,
rot=0)
plt.show()

原始数据分组可视化-groupby

1
2
3
4
5
6
7
8
9
# 定义分组统计函数

def grouped(df, key, freq, col):
"""
定义分组统计均值的函数
"""
df_groupby = df.groupby([pd.Grouper(key=key,freq=freq)]).agg(mean=(col,'mean'))
df_groupby = df_groupby.reset_index()
return df_groupby
1
2
df_groupby_trans_w = grouped(trans, 'date', 'W', 'transactions')
df_groupby_trans_w.head(10)

在分组统计后的信息中加入time字段:

1
2
3
4
5
6
7
8
9
def add_time(df, key, freq, col):
# 调用上面的函数
df_groupby = grouped(df, key, freq, col)
# 添加time字段
df_groupby['time'] = np.arange(len(df_groupby.index))
# 将time字段放在索引为1的位置
column_time = df_groupby.pop('time')
df_groupby.insert(1, 'time', column_time)
return df_groupby

将训练集train分别按照不同的时间频率进行统计:

1
2
3
4
5
# 基于week和month
df_groupby_train_w = add_time(train, 'date', 'W', 'sales')
df_groupby_train_m = add_time(train, 'date', 'M', 'sales')

df_groupby_train_w.head()
date time mean
0 2013-01-06 0 206.843478
1 2013-01-13 1 190.285220
2 2013-01-20 2 189.835452
3 2013-01-27 3 182.152050
4 2013-02-03 4 198.564267

线性回归-Linear Regression

对原始数据的先线性回归

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
fig, axes = plt.subplots(nrows=3,
ncols=1,
figsize=(30,20)
)

# transactions基于weekly
axes[0].plot('date',
'mean',
data=df_groupby_trans_w,
color='grey',
marker='o')

axes[0].set_title("Transactions (groupby by week)", fontsize=20)

# sales-weekly
axes[1].plot('time',
'mean',
data=df_groupby_train_w,
color='0.75')
axes[1].set_title("Sales (groupby by week)", fontsize=20)

# 线性回归
axes[1] = sns.regplot(x='time',
y='mean',
data=df_groupby_train_w,
scatter_kws=dict(color='0.75'),
ax = axes[1])

# sales-monthly
axes[2].plot('time',
'mean',
data=df_groupby_train_m,
color='0.75')

axes[2].set_title("Sales (groupby by month)", fontsize=20)
# 基于线性回归
axes[2] = sns.regplot(x='time',
y='mean',
data=df_groupby_train_m,
scatter_kws=dict(color='0.75'),
line_kws={"color": "red"},
ax = axes[2])

plt.show()

基于差分平移特征的线性回归

构造添加差分特征函数
1
2
3
4
5
6
7
8
def add_lag(df, key, freq, col, lag):
"""
lag表示移动的标志位长度
"""
df_groupby = grouped(df, key, freq, col)
name = 'Lag_' + str(lag)
df_groupby['Lag'] = df_groupby['mean'].shift(lag)
return df_groupby
1
2
3
4
5
6
7
# 移动一个单位

df_groupby_train_w_lag1 = add_lag(train, 'date', 'W','sales',1)

df_groupby_train_m_lag1 = add_lag(train, 'date', 'W', 'sales', 1)

df_groupby_train_w_lag1.head()
date mean Lag
0 2013-01-06 206.843478 NaN
1 2013-01-13 190.285220 206.843478
2 2013-01-20 189.835452 190.285220
3 2013-01-27 182.152050 189.835452
4 2013-02-03 198.564267 182.152050
可视化绘图-train数据
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
fig, axes = plt.subplots(nrows=2,
ncols=1,
figsize=(30,20))

axes[0].plot('Lag',
'mean',
data=df_groupby_train_w_lag1,
color='0.75',
linestyle=(0, (1, 10)))

axes[0].set_title("Sales (groupby by week)", fontsize=20)
axes[0] = sns.regplot(x='Lag',
y='mean',
data=df_groupby_train_w_lag1,
scatter_kws=dict(color='0.75'),
ax = axes[0])


axes[1].plot('Lag',
'mean',
data=df_groupby_train_m_lag1,
color='0.75',
linestyle=(0, (1, 10)))

axes[1].set_title("Sales (groupby by month)", fontsize=20)
axes[1] = sns.regplot(x='Lag',
y='mean',
data=df_groupby_train_m_lag1,
scatter_kws=dict(color='0.75'),
line_kws={"color": "red"},
ax = axes[1])

plt.show()

基于不同属性个数value_counts

1
2
3
4
5
6
7
8
9
10
11
12
13
def plot_stats(df, col, ax, color, angle):
# 统计数量
count_classes = df[col].value_counts()

ax = sns.barplot(x=count_classes.index,
y=count_classes,
ax=ax,
palette=color)
# 标题
ax.set_title(col.upper(), fontsize=18)
# 设置x轴旋转角度
for tick in ax.get_xticklabels():
tick.set_rotation(angle)

基于holidays

1
2
3
4
5
6
7
8
fig, axes = plt.subplots(nrows=1,
ncols=2,
figsize=(15,5))
fig.autofmt_xdate()
fig.suptitle("Stats of df_holidays".upper())
plot_stats(holidays, "type", axes[0], "pastel", 45)
plot_stats(holidays, "locale", axes[1], "rocket", 45)
plt.show()

基于stores

统计在不同city、state、type和cluster下的stores数量

1
2
3
4
5
6
7
fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(20,40))

plot_stats(stores, "city", axes[0], "mako_r", 45)
plot_stats(stores, "state", axes[1], "rocket_r", 45)
plot_stats(stores, "type", axes[2], "magma", 0)
plot_stats(stores, "cluster", axes[3], "viridis", 0)
plt.show()

基于train

统计在不同family下的数量

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(20,10))
count_classes = train['family'].value_counts()
print("count_classes: \n", count_classes)

plt.title("Stats of train")
colors = ['#ff9999','#66b3ff','#99ff99',
'#ffcc99', '#ffccf9', '#ff99f8',
'#ff99af', '#ffe299', '#a8ff99',
'#cc99ff', '#9e99ff', '#99c9ff',
'#99f5ff', '#99ffe4', '#99ffaf']

plt.pie(count_classes,
labels = count_classes.index,
autopct='%1.1f%%',
shadow=True,
startangle=90,
colors=colors)

plt.show()

数据分布区间对比

构造绘图函数-箱型图BoxPlot

1
2
3
4
5
6
7
def plot_boxplot(palette,x,y,hue,ax,title):
# 设置主题
sns.set_theme(style="ticks",palette=palette)
# 绘图
ax = sns.boxplot(x=x,y=y,hue=hue,ax=ax)
# 设置标题和字体大小
ax.set_title(title,fontsize=18)

oil数据

1
2
3
4
5
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10,20))

plot_boxplot("pastel", oil['date'].dt.year, oil['dcoilwtico'], oil['date'].dt.month, axes[0], "oil")
plot_boxplot("pastel", oil['date'].dt.year, oil['dcoilwtico'], oil['date'].dt.year, axes[1], "oil")
plt.show()

trans数据

1
2
3
4
5
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10,20))

plot_boxplot("pastel", trans['date'].dt.year, trans['transactions'], trans['date'].dt.month, axes[0], "trans")
plot_boxplot("pastel", trans['date'].dt.year, trans['transactions'], trans['date'].dt.year, axes[1], "trans")
plt.show()

趋势变化 & 移动平均

如何理解移动平均?

结合之前构造grouped函数进行理解

1
2
3
4
5
6
7
8
9
# 0-定义分组统计函数-grouped

def grouped(df, key, freq, col):
"""
定义分组统计均值的函数
"""
df_groupby = df.groupby([pd.Grouper(key=key,freq=freq)]).agg(mean=(col,'mean'))
df_groupby = df_groupby.reset_index()
return df_groupby
1
2
3
# 1、groupby函数直接统计-以train数据为例

train.head()
id date store_nbr family sales onpromotion
0 0 2013-01-01 1 AUTOMOTIVE 0.0 0
1 1 2013-01-01 1 BABY CARE 0.0 0
2 2 2013-01-01 1 BEAUTY 0.0 0
3 3 2013-01-01 1 BEVERAGES 0.0 0
4 4 2013-01-01 1 BOOKS 0.0 0
1
2
3
4
5
# 2、直接统计:基于date和W进行统计sales的均值

train_groupby = train.groupby([pd.Grouper(key="date",freq="W")]).agg(mean=("sales",'mean'))
train_groupby = train_groupby.reset_index()
train_groupby

可以看到上面的日期是以进行统计的。

1
train_groupby.head(10)
date mean
0 2013-01-06 206.843478
1 2013-01-13 190.285220
2 2013-01-20 189.835452
3 2013-01-27 182.152050
4 2013-02-03 198.564267
5 2013-02-10 187.788172
6 2013-02-17 194.052318
7 2013-02-24 190.572470
8 2013-03-03 210.889059
9 2013-03-10 206.540301
1
2
3
4
5
6
# 3、加上移动平均功能
# center 表示以当前元素为中心进行统计
# min_periods表示窗口的最小观测值个数

# 一个窗口周期内至少要有4天的数据;前面3天不满足,均为NaN
train_groupby.rolling(window=7,min_periods=4).mean()
mean
0 NaN
1 NaN
2 NaN
3 192.279050
4 193.536094
... ...
237 476.977266
238 473.438121
239 479.967617
240 475.546673
241 466.683298

242 rows × 1 columns

如何理解参数center?以当前的元素为中心,上下筛选数据:

1
2
3
# 如何理解参数center

train_groupby.rolling(window=7,center=True,min_periods=4).mean()
mean
0 192.279050
1 193.536094
2 192.578107
3 192.788708
4 190.464279
... ...
237 475.546673
238 466.683298
239 461.530692
240 461.668874
241 461.959927

242 rows × 1 columns

如何理解上面的结果呢?

1
2
3
4
5
# 3+中心+3

# 第一个元素前面没有数据,只能从本身向后取数据:0(中心) + 3----->[0:4]

train_groupby.iloc[:4,1].sum() / 4
192.27905005901843
1
2
3
# 第二个元素为中心:0+1(中心)+2/3/4----->[0:5]

train_groupby.iloc[0:5,1].sum() / 5
193.5360935220985
1
2
3
# 第三个元素为中心:0/1 + 2(中心)+3/4/5----->[0:6]

train_groupby.iloc[0:6, 1].sum() / 6
192.57810663810037
1
2
3
# 第四个元素为中心:0/1/2 + 3(中心)+4/5/6----->[0:7]

train_groupby.iloc[0:7, 1].sum() / 7
192.7887082607605

构造移动平均绘图函数

1
2
3
4
5
6
7
8
9
10
def plot_moving_average(df,key,freq,col,window,min_periods,ax,title):
df_groupby = grouped(df,key,freq,col)
moving_average = df_groupby["mean"].rolling(window=window,
center=True,
min_periods=min_periods).mean()
# print("moving_average: \n", moving_average)

ax = df_groupby["mean"].plot(color="0.75",linestyle="dashdot",ax=ax)
ax = moving_average.plot(linewidth=3,color="g",ax=ax)
ax.set_title(title,fontsize=18)

可视化绘图

1
2
3
4
5
6
7
8
9
fig, axes = plt.subplots(nrows=2,
ncols=1,
figsize=(30,20))

# 基于train和trans数据
plot_moving_average(train, 'date', 'W', 'sales', 7, 4, axes[0], 'Sales Moving Average')
plot_moving_average(trans, 'date', 'W', 'transactions', 7, 4, axes[1], 'Transactions Moving Average')

plt.show()

趋势预测

构造函数

基于statsmodels库的DeterministicProcess类进行线性回归预测

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
def plot_deterministic_process(df, key, freq, col, ax1, title1, ax2, title2):
# 分组统计
df_groupby = grouped(df, key, freq, col)
df_groupby["date"] = pd.to_datetime(df_groupby["date"],format="%Y-%m-%d")

dp = DeterministicProcess(index=df_groupby["date"],
constant=True,
order=1,
drop=True)
# 将频率设置成索引
dp.index.freq = freq

X1 = dp.in_sample() # 属性特征X
y1 = df_groupby["mean"] # 待预测的目标值y
y1.index = X1.index
# 实例化类
model = LinearRegression(fit_intercept=False)
model.fit(X1, y1)

# 预测结果
y1_pred = pd.Series(model.predict(X1), index=X1.index)

ax1 = y1.plot(linestyle='dashed', label="mean", color="0.75", ax=ax1, use_index=True)
ax1 = y1_pred.plot(linewidth=3, label="Trend", color='r', ax=ax1, use_index=True)
# 设置标题
ax1.set_title(title1, fontsize=18)
_ = ax1.legend()


# 预测未来30个步长的数据
steps = 30
X2 = dp.out_of_sample(steps=steps)
y2_fore = pd.Series(model.predict(X2), index=X2.index)

ax2 = y1.plot(linestyle='dashed', label="mean", color="0.75", ax=ax2, use_index=True)
ax2 = y1_pred.plot(linewidth=3, label="Trend", color='r', ax=ax2, use_index=True)
ax2 = y2_fore.plot(linewidth=3, label="Predicted Trend", color='r', ax=ax2, use_index=True)
ax2.set_title(title2, fontsize=18)
_ = ax2.legend()

数据可视化

1
2
3
4
5
6
7
8
9
10
fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(30,30))

plot_deterministic_process(train, 'date', 'W', 'sales',
axes[0], "Sales Linear Trend",
axes[1], "Sales Linear Trend Forecast")
plot_deterministic_process(trans, 'date', 'W', 'transactions',
axes[2], "Transactions Linear Trend",
axes[3], "Transactions Linear Trend Forecast")

plt.show()

季节性趋势

构造季节性绘图函数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
def seasonal_plot(X,y,period,freq,ax=None):
"""
seasonal_plot函数
"""
if ax is None:
_, ax = plot.subplots()

# 颜色设置
palette = sns.color_palette("husl", n_colors=X[period].nunique(),)

ax = sns.lineplot(x=X[freq],
y=X[y],
ax=ax,
hue=X[period],
palette=palette,
legend=False)
# 标题设置
ax.set_title(f'Seasonal Plot({period}/{freq})')

for line, name in zip(ax.lines, X[period].unique()):
y_ = line.get_ydata()[-1]
ax.annotate(name,
xy=(1,y_),
xytext=(6,0),
color=line.get_color(),
xycoords=ax.get_yaxis_transform(),
textcoords="offset points",
size=14,
va="center")

return ax
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
def plot_periodogram(ts, detrend="linear", ax=None):
"""
plot_periodogram函数:基于周期的直方图
"""

from scipy.signal import periodogram # 导入模块
fs = pd.Timedelta("365D") / pd.Timedelta("1D")

freqencies, spectrum = periodogram(ts,
fs=fs,
detrend=detrend,
window="boxcar",
scaling="spectrum")

if ax is None:
_, ax = plt.subplots()

ax.step(freqencies, spectrum, color="purple")
ax.set_xscale("log")
ax.set_xticks([1, 2, 4, 6, 12, 26, 52, 104])
ax.set_xticklabels(["Annual (1)", "Semiannual (2)", "Quarterly (4)",
"Bimonthly (6)", "Monthly (12)", "Biweekly (26)",
"Weekly (52)", "Semiweekly (104)"], rotation=35)

ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))
ax.set_ylabel("Variance")
ax.set_title("Periodogram")
return ax
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
def seasonality(df, key, freq, col):
"""
seasonality函数:调用seasonal_plot 和 plot_periodogram函数
"""
df_groupby = grouped(df, key, freq, col) # 调用grouped函数
df_groupby['date'] = pd.to_datetime(df_groupby['date'], format = "%Y-%m-%d")
df_groupby.index = df_groupby['date'] # 索引设置
df_groupby = df_groupby.drop(columns=['date']) # 删除date列
df_groupby.index.freq = freq # 索引设置成频率

X = df_groupby.copy() # 副本
X.index = pd.to_datetime(X.index, format = "%Y-%m-%d")
X.index.freq = freq # 索引设置成频率

# 一周中的第几天
X["day"] = X.index.dayofweek
X["week"] = pd.Int64Index(X.index.isocalendar().week)
# 一年中的第几天
X["dayofyear"] = X.index.dayofyear
X["year"] = X.index.year

fig, (ax0, ax1, ax2) = plt.subplots(3, 1, figsize=(20, 30))

# 调用seasonal_plot函数
seasonal_plot(X, y='mean', period="week", freq="day", ax=ax0) # 周期按照week
seasonal_plot(X, y='mean', period="year", freq="dayofyear", ax=ax1) # 周期按照year
X_new = (X['mean'].copy()).dropna()

# 调用plot_periodogram函数
plot_periodogram(X_new, ax=ax2)

数据可视化-trans

1
seasonality(trans, 'date', 'D', 'transactions')

数据可视化-train

1
seasonality(train, 'date', 'D', 'sales')

季节性预测

构造预测函数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
def predict_seasonality(df, key, freq, col, ax1, title1):
fourier = CalendarFourier(freq="A", order=10)
df_groupby = grouped(df,key,freq,col)

# 时间格式转化
df_groupby['date'] = pd.to_datetime(df_groupby['date'], format = "%Y-%m-%d")
# 频率设置
df_groupby['date'].freq = freq

dp = DeterministicProcess(index=df_groupby["date"],
constant=True,
order=1,
period=None,
seasonal=True,
additional_terms=[fourier],
drop=True)

dp.index.freq = freq

# X-y
X1 = dp.in_sample()
y1 = df_groupby["mean"]
y1.index = X1.index

# 线性回归模型实例化
model = LinearRegression(fit_intercept=False)
model.fit(X1, y1)
# 模型预测
y1_pred = pd.Series(model.predict(X1), index=X1.index)
X1_fore = dp.out_of_sample(steps=90)
y1_fore = pd.Series(model.predict(X1_fore), index=X1_fore.index)
# 绘图
ax1 = y1.plot(linestyle='dashed', style='.', label="init mean values", color="0.4", ax=ax1, use_index=True)
ax1 = y1_pred.plot(linewidth=3, label="Seasonal", color='b', ax=ax1, use_index=True)
ax1 = y1_fore.plot(linewidth=3, label="Seasonal Forecast", color='r', ax=ax1, use_index=True)
# 标题
ax1.set_title(title1, fontsize=18)
_ = ax1.legend()

季节预测可视化

1
2
3
4
5
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(30,30))

predict_seasonality(trans, 'date', 'W', 'transactions', axes[0], "Transactions Seasonal Forecast")
predict_seasonality(train, 'date', 'W', 'sales', axes[1], "Sales Seasonal Forecast")
plt.show()

时序预测

构造时序数据

基于train数据构造时序特征:

1
2
3
4
5
store_sales = train.copy() # copy副本

store_sales["date"] = store_sales.date.dt.to_period("D") # 提取日期D
store_sales = store_sales.set_index(["store_nbr","family","date"]).sort_index() # 设置双层索引
store_sales.head()
id sales onpromotion
store_nbr family date
1 AUTOMOTIVE 2013-01-01 0 0.0 0
2013-01-02 1782 2.0 0
2013-01-03 3564 3.0 0
2013-01-04 5346 3.0 0
2013-01-05 7128 5.0 0
1
2
3
4
5
6
7
8
9
family_sales = (
store_sales # 数据df
.groupby(['family', 'date']) # 分组对象
.mean() # 统计函数
.unstack('family') # 反堆叠对象
.loc['2017', ['sales', 'onpromotion']] # 位置索引
)

family_sales.head()
sales ... onpromotion
family AUTOMOTIVE BABY CARE BEAUTY BEVERAGES BOOKS BREAD/BAKERY CELEBRATION CLEANING DAIRY DELI ... MAGAZINES MEATS PERSONAL CARE PET SUPPLIES PLAYERS AND ELECTRONICS POULTRY PREPARED FOODS PRODUCE SCHOOL AND OFFICE SUPPLIES SEAFOOD
date
2017-01-01 0.092593 0.037037 0.055556 74.222222 0.000000 9.084685 0.129630 7.500000 11.518519 3.629167 ... 0.0 0.018519 0.111111 0.018519 0.0 0.000000 0.037037 0.129630 0.0 0.000000
2017-01-02 11.481481 0.259259 11.648148 6208.055556 0.481481 844.836296 14.203704 2233.648148 1545.000000 539.114833 ... 0.0 0.462963 10.592593 0.537037 0.0 0.259259 1.166667 5.629630 0.0 0.407407
2017-01-03 8.296296 0.296296 7.185185 4507.814815 0.814815 665.124111 10.629630 1711.907407 1204.203704 404.300074 ... 0.0 0.481481 9.722222 0.444444 0.0 0.388889 1.351852 56.296296 0.0 0.407407
2017-01-04 6.833333 0.333333 6.888889 3911.833333 0.759259 594.160611 11.185185 1508.037037 1107.796296 309.397685 ... 0.0 0.370370 12.037037 0.444444 0.0 0.296296 5.444444 101.277778 0.0 0.333333
2017-01-05 6.333333 0.351852 5.925926 3258.796296 0.407407 495.511611 12.444444 1241.833333 829.277778 260.776500 ... 0.0 8.981481 5.666667 0.000000 0.0 0.296296 0.907407 5.018519 0.0 0.444444

5 rows × 66 columns

1
2
mag_sales = family_sales.loc(axis=1)[:, 'MAGAZINES']
mag_sales.head(10)
sales onpromotion
family MAGAZINES MAGAZINES
date
2017-01-01 0.074074 0.0
2017-01-02 7.777778 0.0
2017-01-03 3.500000 0.0
2017-01-04 3.500000 0.0
2017-01-05 3.203704 0.0
2017-01-06 2.759259 0.0
2017-01-07 5.962963 0.0
2017-01-08 5.074074 0.0
2017-01-09 3.537037 0.0
2017-01-10 3.222222 0.0

时序特征可视化

1
2
y = mag_sales.loc[:,"sales"].squeeze()
y
date
2017-01-01    0.074074
2017-01-02    7.777778
2017-01-03    3.500000
2017-01-04    3.500000
2017-01-05    3.203704
                ...
2017-08-11    9.259259
2017-08-12    8.944444
2017-08-13    8.685185
2017-08-14    8.462963
2017-08-15    8.537037
Freq: D, Name: MAGAZINES, Length: 227, dtype: float64
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
fourier = CalendarFourier(freq="M", order=4)
dp = DeterministicProcess(constant=True,
index=y.index,
order=1,
seasonal=True,
drop=True,
additional_terms=[fourier]
)

X_time = dp.in_sample()
# 元旦
X_time["NewYearsDay"] = (X_time.index.dayofyear == 1)

model = LinearRegression(fit_intercept=False)
model.fit(X_time,y)

y_deseason = y - model.predict(X_time)
y_deseason.name = "sales_deseasoned"

ax = y_deseason.plot()
ax.set_title("Magazine Sales (deseasonalized)")
Text(0.5, 1.0, 'Magazine Sales (deseasonalized)')

自相关和偏自相关

  1. 自相关ACF:也叫序列相关,是一个序列于其自身在不同时间点的互相关

  2. 偏自相关PACF:偏自相关是剔除干扰后时间序列观察与先前时间步长时间序列观察之间关系的总结。

在滞后k处的偏自相关是在消除由于较短滞后条件导致的任何相关性的影响之后产生的相关性。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
def lagplot(x, y=None, lag=1, standardize=False, ax=None, **kwargs):
"""
lagplot函数
"""
from matplotlib.offsetbox import AnchoredText

x_ = x.shift(lag) # 移位函数
if standardize: # 数据标准化
x_ = (x_ - x_.mean()) / x_.std()
if y is not None:
y_ = (y - y.mean()) / y.std() if standardize else y
else:
y_ = x

# 相关系数
corr = y_.corr(x_)
if ax is None:
fig, ax = plt.subplots()
scatter_kws = dict(alpha=0.75,s=3)
line_kws = dict(color='C3', )
ax = sns.regplot(x=x_,
y=y_,
scatter_kws=scatter_kws,
line_kws=line_kws,
lowess=True,
ax=ax,
**kwargs)

at = AnchoredText(f"{corr:.2f}",
prop=dict(size="large"),
frameon=True,
loc="upper left"
)

at.patch.set_boxstyle("square, pad=0.0")
ax.add_artist(at)
ax.set(title=f"Lag {lag}", xlabel=x_.name, ylabel=y_.name)
return ax
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
def plot_lags(x, y=None, lags=6, nrows=1, lagplot_kwargs={}, **kwargs):
import math
kwargs.setdefault('nrows', nrows)
kwargs.setdefault('ncols', math.ceil(lags / nrows))
kwargs.setdefault('figsize', (kwargs['ncols'] * 2 + 10, nrows * 2 + 5))
fig, axs = plt.subplots(sharex=True,
sharey=True,
squeeze=False,
**kwargs)

for ax, k in zip(fig.get_axes(), range(kwargs['nrows'] * kwargs['ncols'])):
if k + 1 <= lags:
ax = lagplot(x,
y,
lag=k + 1,
ax=ax,
**lagplot_kwargs)

ax.set_title(f"Lag {k + 1}",
fontdict=dict(fontsize=14))
ax.set(xlabel="", ylabel="")
else:
ax.axis('off')
plt.setp(axs[-1, :], xlabel=x.name)
plt.setp(axs[:, 0], ylabel=y.name if y is not None else x.name)
fig.tight_layout(w_pad=0.1, h_pad=0.1)
return fig

自相关可视化

1
fig = plot_lags(y_deseason, lags=8, nrows=2)  # 多行多列数据

偏自相关可视化

1
fig = plot_pacf(y_deseason, lags=8)  # 一个图形

基于magazine的可视化

1
2
3
4
onpromotion = mag_sales.loc[:,'onpromotion'].squeeze().rename('onpromotion')

# 去除离群点:每年的第一天
plot_lags(x=onpromotion.iloc[1:], y=y_deseason.iloc[1:], lags=3, nrows=1)

时序预测可视化

1
2
3
4
5
6
7
def make_lags(ts, lags):
return pd.concat(
{
f'y_lag_{i}': ts.shift(i)
for i in range(1, lags + 1)
},
axis=1)
1
2
3
4
X = make_lags(y_deseason, lags=4)
X = X.fillna(0.0)

X
y_lag_1 y_lag_2 y_lag_3 y_lag_4
date
2017-01-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
2017-01-02 -3.586020e-14 0.000000e+00 0.000000e+00 0.000000e+00
2017-01-03 3.257003e+00 -3.586020e-14 0.000000e+00 0.000000e+00
2017-01-04 -2.105813e-01 3.257003e+00 -3.586020e-14 0.000000e+00
2017-01-05 -1.304980e-01 -2.105813e-01 3.257003e+00 -3.586020e-14
... ... ... ... ...
2017-08-11 1.795609e-01 8.802533e-01 1.126343e+00 3.000790e+00
2017-08-12 1.112449e+00 1.795609e-01 8.802533e-01 1.126343e+00
2017-08-13 -1.466291e+00 1.112449e+00 1.795609e-01 8.802533e-01
2017-08-14 -7.308822e-01 -1.466291e+00 1.112449e+00 1.795609e-01
2017-08-15 1.522645e+00 -7.308822e-01 -1.466291e+00 1.112449e+00

227 rows × 4 columns

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
y = y_deseason.copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=60, shuffle=False)

# 线性回归
model = LinearRegression()
# 模型训练
model.fit(X_train, y_train)
# 预测
y_pred = pd.Series(model.predict(X_train), index=y_train.index)
y_fore = pd.Series(model.predict(X_test), index=y_test.index)

# 绘图
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20,10))
ax = y_train.plot(color="0.75",
style=".-",
markeredgecolor="0.25",
markerfacecolor="0.25",
ax=ax)

ax = y_test.plot(color="0.75",
style=".-",
markeredgecolor="0.25",
markerfacecolor="0.25",
ax=ax)

ax = y_pred.plot(ax=ax)
_ = y_fore.plot(ax=ax, color='C3')

plt.show()

基于混合模型预测

构建混合模型

将XGboost模型和线性回归模型结合起来:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
store_sales = train.copy()

store_sales['date'] = store_sales.date.dt.to_period('D')

store_sales = store_sales.set_index(['store_nbr', 'family', 'date']).sort_index()

family_sales = (
store_sales
.groupby(['family', 'date'])
.mean()
.unstack('family')
.loc['2017']
)
family_sales.head()
id ... onpromotion
family AUTOMOTIVE BABY CARE BEAUTY BEVERAGES BOOKS BREAD/BAKERY CELEBRATION CLEANING DAIRY DELI ... MAGAZINES MEATS PERSONAL CARE PET SUPPLIES PLAYERS AND ELECTRONICS POULTRY PREPARED FOODS PRODUCE SCHOOL AND OFFICE SUPPLIES SEAFOOD
date
2017-01-01 2597248.5 2597249.5 2597250.5 2597251.5 2597252.5 2597253.5 2597254.5 2597255.5 2597256.5 2597257.5 ... 0.0 0.018519 0.111111 0.018519 0.0 0.000000 0.037037 0.129630 0.0 0.000000
2017-01-02 2599030.5 2599031.5 2599032.5 2599033.5 2599034.5 2599035.5 2599036.5 2599037.5 2599038.5 2599039.5 ... 0.0 0.462963 10.592593 0.537037 0.0 0.259259 1.166667 5.629630 0.0 0.407407
2017-01-03 2600812.5 2600813.5 2600814.5 2600815.5 2600816.5 2600817.5 2600818.5 2600819.5 2600820.5 2600821.5 ... 0.0 0.481481 9.722222 0.444444 0.0 0.388889 1.351852 56.296296 0.0 0.407407
2017-01-04 2602594.5 2602595.5 2602596.5 2602597.5 2602598.5 2602599.5 2602600.5 2602601.5 2602602.5 2602603.5 ... 0.0 0.370370 12.037037 0.444444 0.0 0.296296 5.444444 101.277778 0.0 0.333333
2017-01-05 2604376.5 2604377.5 2604378.5 2604379.5 2604380.5 2604381.5 2604382.5 2604383.5 2604384.5 2604385.5 ... 0.0 8.981481 5.666667 0.000000 0.0 0.296296 0.907407 5.018519 0.0 0.444444

5 rows × 99 columns

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
class BoostedHybrid:
def __init__(self, model_1, model_2):
self.model_1 = model_1
self.model_2 = model_2
self.y_columns = None


def fit(self, X_1, X_2, y):
"""
模型训练fit
"""
# 训练模型1
self.model_1.fit(X_1, y)

# 预测
y_fit = pd.DataFrame(
self.model_1.predict(X_1),
index=X_1.index,
columns=y.columns
)

# 计算残差
y_resid = y - y_fit
# 宽表转成长表
y_resid = y_resid.stack().squeeze()

# 基于残差训练模型2
self.model_2.fit(X_2, y_resid)
self.y_columns = y.columns
self.y_fit = y_fit
self.y_resid = y_resid

# 将fit方法添加类中
BoostedHybrid.fit = fit
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
def predict(self, X_1, X_2):
"""
模型预测
"""
y_pred = pd.DataFrame(
self.model_1.predict(X_1),
index=X_1.index,
columns=self.y_columns
)

# 宽表转成长表
y_pred = y_pred.stack().squeeze()
# 将两个模型预测的结果累加
y_pred += self.model_2.predict(X_2)

return y_pred.unstack()

# 将预测predict添加到类中
BoostedHybrid.predict = predict

构建数据集

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# 目标值y
y = family_sales.loc[:, 'sales']

dp = DeterministicProcess(index=y.index, order=1)
X_1 = dp.in_sample()

# X_2:XBoost
X_2 = family_sales.drop('sales', axis=1).stack()

# family类比编码
le = LabelEncoder()
X_2 = X_2.reset_index('family')
X_2['family'] = le.fit_transform(X_2['family'])

# 季节性特征
X_2["day"] = X_2.index.day

模型训练

1
2
3
4
5
6
7
8
9
model = BoostedHybrid(
model_1 = LinearRegression(),
model_2 = XGBRegressor()
)

model.fit(X_1, X_2, y)

y_pred = model.predict(X_1,X_2)
y_pred = y_pred.clip(0.0)
1
2
3
4
5
6
# 模型混合

model = BoostedHybrid(
model_1=Ridge(),
model_2=KNeighborsRegressor(),
)
1
2
3
4
5
# 构造训练集和验证集

y_train, y_valid = y[:"2017-07-01"], y["2017-07-02":]
X1_train, X1_valid = X_1[: "2017-07-01"], X_1["2017-07-02" :]
X2_train, X2_valid = X_2.loc[:"2017-07-01"], X_2.loc["2017-07-02":]
1
2
3
4
5
6
# 训练
model.fit(X1_train, X2_train, y_train)

# 预测
y_fit = model.predict(X1_train, X2_train).clip(0.0)
y_pred = model.predict(X1_valid, X2_valid).clip(0.0)

预测可视化

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
families = y.columns[0:6]
axs = y.loc(axis=1)[families].plot(subplots=True,
sharex=True,
figsize=(30, 20),
color="0.75",
style=".-",
markeredgecolor="0.25",
markerfacecolor="0.25",
alpha=0.5)
_ = y_fit.loc(axis=1)[families].plot(subplots=True,
sharex=True,
color='C0',
ax=axs)
_ = y_pred.loc(axis=1)[families].plot(subplots=True,
sharex=True,
color='C3',
ax=axs)
for ax, family in zip(axs, families):
ax.legend([])
ax.set_ylabel(family)

基于机器学习预测

训练集数据

1
2
3
store_sales = train.copy()
store_sales['date'] = store_sales.date.dt.to_period('D')
store_sales = store_sales.set_index(['store_nbr', 'family', 'date']).sort_index()
1
2
3
4
5
6
7
family_sales = (
store_sales
.groupby(['family', 'date'])
.mean()
.unstack('family')
.loc['2017']
)

测试集数据

1
2
3
test = test.copy()
test['date'] = test.date.dt.to_period('D')
test = test.set_index(['store_nbr', 'family', 'date']).sort_index()

构建多步预测目标函数

1
2
3
4
5
def make_multistep_target(ts, steps):
return pd.concat(
{f'y_step_{i + 1}': ts.shift(-i)
for i in range(steps)},
axis=1)
1
2
3
4
5
6
7
8
y = family_sales.loc[:, 'sales']

X = make_lags(y, lags=5).dropna()

# 构建多步预测
y = make_multistep_target(y, steps=16).dropna()

y, X = y.align(X, join='inner', axis=0)
1
2
3
4
5
6
7
8
9
10
11
# 准备好输入到XGBoost模型的数据

le = LabelEncoder()
X = (X
.stack('family')
.reset_index('family')
.assign(family=lambda x: le.fit_transform(x.family)) # 编码
)
y = y.stack('family') # 宽表转成长表

y
y_step_1 y_step_2 y_step_3 y_step_4 y_step_5 y_step_6 y_step_7 y_step_8 y_step_9 y_step_10 y_step_11 y_step_12 y_step_13 y_step_14 y_step_15 y_step_16
date family
2017-01-06 AUTOMOTIVE 6.018519 10.259259 9.388889 5.944444 4.777778 6.314815 5.388889 5.240741 8.500000 10.259259 6.407407 5.685185 5.703704 4.777778 5.148148 8.685185
BABY CARE 0.277778 0.259259 0.240741 0.444444 0.240741 0.277778 0.296296 0.296296 0.388889 0.425926 0.314815 0.166667 0.222222 0.129630 0.166667 0.277778
BEAUTY 6.518519 10.037037 11.611111 5.648148 6.500000 5.277778 4.370370 4.703704 7.777778 9.037037 5.648148 5.351852 4.740741 3.981481 4.592593 7.111111
BEVERAGES 3507.277778 4848.518519 5503.648148 3448.203704 3171.740741 3046.870370 2693.722222 3226.037037 4667.296296 5580.611111 3700.370370 3409.796296 3263.462963 2676.574074 3003.555556 4900.611111
BOOKS 0.537037 0.481481 0.722222 0.500000 0.518519 0.481481 0.388889 0.444444 0.574074 0.555556 0.388889 0.500000 0.407407 0.277778 0.351852 0.685185
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2017-07-31 POULTRY 364.955648 403.601334 377.313980 316.436093 533.497054 416.454018 464.596557 344.051740 313.780869 305.270204 278.819870 468.857370 354.342779 379.801204 344.398297 325.679815
PREPARED FOODS 84.698648 87.836796 88.735963 77.173000 91.886760 100.384963 102.248148 86.627444 77.344130 84.796537 78.791444 96.286926 84.693815 91.509426 86.062500 85.954129
PRODUCE 2257.140589 2609.180150 3122.895724 1792.220910 2079.319469 2418.970157 2675.105815 2111.133423 2168.535465 2663.076241 1670.264889 2198.854500 2070.154646 2331.922267 2134.399926 2316.832796
SCHOOL AND OFFICE SUPPLIES 30.111111 49.333333 57.481481 51.907407 63.222222 85.203704 100.277778 64.407407 59.759259 53.740741 42.962963 65.240741 67.481481 68.851852 52.333333 46.851852
SEAFOOD 20.488333 20.346852 20.801037 17.116296 25.553963 24.209519 23.512852 18.419852 18.481130 18.181426 13.284463 23.566963 19.037593 20.704574 17.975556 17.966241

6831 rows × 16 columns

建模预测

1
2
3
4
5
6
7
8
9
# 模型实例化
model = RegressorChain(base_estimator=XGBRegressor())

# 模型预测
model.fit(X, y)
# 预测结果
y_pred = pd.DataFrame(model.predict(X),
index=y.index,
columns=y.columns).clip(0.0)

预测可视化

1
2
3
4
5
6
7
8
9
10
11
12
def plot_multistep(y, every=1, ax=None, palette_kwargs=None):
palette_kwargs_ = dict(palette='husl', n_colors=16, desat=None)
if palette_kwargs is not None:
palette_kwargs_.update(palette_kwargs)
palette = sns.color_palette(**palette_kwargs_)
if ax is None:
fig, ax = plt.subplots()
ax.set_prop_cycle(plt.cycler('color', palette))
for date, preds in y[::every].iterrows():
preds.index = pd.period_range(start=date, periods=len(preds))
preds.plot(ax=ax)
return ax
1
2
3
4
5
6
7
8
9
10
11
FAMILY = 'BEAUTY'
START = '2017-04-01'
EVERY = 16

y_pred_ = y_pred.xs(FAMILY, level='family', axis=0).loc[START:]
y_ = family_sales.loc[START:, 'sales'].loc[:, FAMILY]

fig, ax = plt.subplots(1, 1, figsize=(11, 4))
ax = y_.plot(color="0.75",style=".-",markeredgecolor="0.25", markerfacecolor="0.25",ax=ax, alpha=0.5)
ax = plot_multistep(y_pred_, ax=ax, every=EVERY)
_ = ax.legend([FAMILY, FAMILY + ' Forecast'])

本文标题:kaggle实战-精美可视化与时序预测

发布时间:2023年03月15日 - 23:03

原始链接:http://www.renpeter.cn/2023/03/15/kaggle%E5%AE%9E%E6%88%98-%E7%B2%BE%E7%BE%8E%E5%8F%AF%E8%A7%86%E5%8C%96%E4%B8%8E%E6%97%B6%E5%BA%8F%E9%A2%84%E6%B5%8B.html

许可协议: 署名-非商业性使用-禁止演绎 4.0 国际 转载请保留原文链接及作者。

Coffee or Tea