Kaggle in Practice: Polished Visual Analysis and Time Series Forecasting of Sales Data
This article analyzes a product sales dataset from multiple angles, using Pandas, seaborn, statsmodels, sklearn, linear regression, XGBoost, and related libraries and methods for visualization and time series forecasting.
Results
First, a preview of some of the visualizations:
Importing Libraries
import math
Reading the Data
holidays = pd.read_csv('holidays_events.csv')
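The post reads five files in total. Below is a minimal sketch of the pattern, parsing an inline stand-in for holidays_events.csv so it runs without the Kaggle files; the other filenames (train.csv, oil.csv, stores.csv, transactions.csv) follow the Store Sales competition layout and are assumptions here:

```python
import pandas as pd
from io import StringIO

# The remaining files are read the same way, e.g. (assumed names):
#   train = pd.read_csv('train.csv')
#   oil = pd.read_csv('oil.csv')
#   stores = pd.read_csv('stores.csv')
#   trans = pd.read_csv('transactions.csv')

# Tiny inline stand-in for holidays_events.csv so the snippet is runnable.
sample = StringIO(
    "date,type,locale,locale_name,description,transferred\n"
    "2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False\n"
    "2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False\n"
)
holidays = pd.read_csv(sample, parse_dates=['date'])
print(holidays.shape)  # (2, 6)
```

Passing parse_dates=['date'] at read time saves the separate pd.to_datetime conversion done later in the post.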
The fields are described below:
train/test
id: row id
date: date
store_nbr: store number, e.g. 1, 2, 3, 4, …
family: product family, e.g. AUTOMOTIVE, HOME APPLIANCES, SCHOOL AND OFFICE SUPPLIES
sales: sales for that day
onpromotion: number of items in the family that were on promotion that day
holidays
date: date
type: holiday type, e.g. Holiday, Event, Transfer
locale: scope of the event; National, Regional, or Local
locale_name: name of the region
description: description of the holiday
transferred: whether the holiday was moved to another date; True or False
oil
date: date
dcoilwtico: WTI crude oil price for that day
stores
store_nbr: store number
city: city
state: state
type: store type; A, B, C, D, or E
cluster: cluster label, i.e. a group of similar stores; 1, 2, 3, 4, 5, …
trans
date: date
store_nbr: store number
transactions: number of transactions that day
Mind Map
The mind map organizes the fields of the five csv files and the relationships between them:
Basic Data Information
Taking the holidays data as an example, inspect its basic properties:
1. head shows the first 5 rows
holidays.head()
2. The shape attribute gives the numbers of rows and columns
holidays.shape
(350, 6)
3. describe shows summary statistics; numeric columns only
holidays.describe()
4. isnull counts the missing values per column
holidays.isnull().sum()
date 0
type 0
locale 0
locale_name 0
description 0
transferred 0
dtype: int64
train.head()
oil.head()
stores.head()
trans.head()
Converting the Date Column
holidays['date'] = pd.to_datetime(holidays['date'], format = "%Y-%m-%d")
Polished Visualizations
Oil price over time (raw data)
oil.head()
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(25,15))
Grouped visualization of the raw data - groupby
# define the grouping helper
df_groupby_trans_w = grouped(trans, 'date', 'W', 'transactions')
Add a time column to the grouped result:
def add_time(df, key, freq, col):
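The full bodies of grouped and add_time are not shown above. A plausible sketch, consistent with the weekly table that follows (the exact implementations are assumptions):

```python
import pandas as pd
import numpy as np

def grouped(df, key, freq, col):
    # resample df[col] on the datetime column `key` at frequency `freq`
    # and return the per-period mean
    return (df.groupby(pd.Grouper(key=key, freq=freq))[col]
              .mean()
              .reset_index()
              .rename(columns={col: 'mean'}))

def add_time(df, key, freq, col):
    # same aggregation, plus an integer time-step column
    out = grouped(df, key, freq, col)
    out['time'] = np.arange(len(out))
    return out[[key, 'time', 'mean']]

# toy data: three weeks of daily sales
demo = pd.DataFrame({
    'date': pd.date_range('2013-01-01', periods=21, freq='D'),
    'sales': np.arange(21, dtype=float),
})
weekly = add_time(demo, 'date', 'W', 'sales')
print(weekly.head())
```

The integer time column is what later lets a linear regression treat the dates as an ordinary numeric feature.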
Aggregate the training set train at different time frequencies:
# by week and by month
| | date | time | mean |
|---|---|---|---|
| 0 | 2013-01-06 | 0 | 206.843478 |
| 1 | 2013-01-13 | 1 | 190.285220 |
| 2 | 2013-01-20 | 2 | 189.835452 |
| 3 | 2013-01-27 | 3 | 182.152050 |
| 4 | 2013-02-03 | 4 | 198.564267 |
Linear Regression
Linear regression on the raw data
fig, axes = plt.subplots(nrows=3,
Linear regression with shifted (lag) features
Building the lag-feature helper
def add_lag(df, key, freq, col, lag):
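Judging from the table that follows, add_lag aggregates to the given frequency and then shifts the aggregated column by lag periods. A runnable sketch (the exact body is an assumption):

```python
import pandas as pd

def add_lag(df, key, freq, col, lag):
    # aggregate col to the given frequency, then append a shifted copy
    out = (df.groupby(pd.Grouper(key=key, freq=freq))[col]
             .mean()
             .reset_index()
             .rename(columns={col: 'mean'}))
    out['Lag'] = out['mean'].shift(lag)  # the first `lag` rows become NaN
    return out

# two full Monday-to-Sunday weeks of toy data
demo = pd.DataFrame({
    'date': pd.date_range('2012-12-31', periods=14, freq='D'),
    'sales': [1.0] * 7 + [3.0] * 7,
})
lagged = add_lag(demo, 'date', 'W', 'sales', lag=1)
print(lagged)
```

Regressing `mean` on `Lag` then amounts to predicting each period from the previous one.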
# shift by one period
| | date | mean | Lag |
|---|---|---|---|
| 0 | 2013-01-06 | 206.843478 | NaN |
| 1 | 2013-01-13 | 190.285220 | 206.843478 |
| 2 | 2013-01-20 | 189.835452 | 190.285220 |
| 3 | 2013-01-27 | 182.152050 | 189.835452 |
| 4 | 2013-02-03 | 198.564267 | 182.152050 |
Plotting the train data
fig, axes = plt.subplots(nrows=2,
Counting values per attribute - value_counts
def plot_stats(df, col, ax, color, angle):
For holidays
fig, axes = plt.subplots(nrows=1,
For stores
Count the number of stores in each city, state, type, and cluster:
fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(20,40))
For train
Count the number of records for each family:
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(20,10))
Comparing Data Distributions
Building the box-plot helper - BoxPlot
def plot_boxplot(palette,x,y,hue,ax,title):
oil data
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10,20))
trans data
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10,20))
Trend & Moving Average
How should we understand a moving average?
Let's work through it with the grouped function built earlier.
# 0. define the grouping helper - grouped
# 1. aggregate directly with groupby, using train as the example
| | id | date | store_nbr | family | sales | onpromotion |
|---|---|---|---|---|---|---|
| 0 | 0 | 2013-01-01 | 1 | AUTOMOTIVE | 0.0 | 0 |
| 1 | 1 | 2013-01-01 | 1 | BABY CARE | 0.0 | 0 |
| 2 | 2 | 2013-01-01 | 1 | BEAUTY | 0.0 | 0 |
| 3 | 3 | 2013-01-01 | 1 | BEVERAGES | 0.0 | 0 |
| 4 | 4 | 2013-01-01 | 1 | BOOKS | 0.0 | 0 |
# 2. mean of sales, grouped on date at weekly ('W') frequency
The dates above are binned by week.
train_groupby.head(10)
| | date | mean |
|---|---|---|
| 0 | 2013-01-06 | 206.843478 |
| 1 | 2013-01-13 | 190.285220 |
| 2 | 2013-01-20 | 189.835452 |
| 3 | 2013-01-27 | 182.152050 |
| 4 | 2013-02-03 | 198.564267 |
| 5 | 2013-02-10 | 187.788172 |
| 6 | 2013-02-17 | 194.052318 |
| 7 | 2013-02-24 | 190.572470 |
| 8 | 2013-03-03 | 210.889059 |
| 9 | 2013-03-10 | 206.540301 |
# 3. add a moving average
| | mean |
|---|---|
| 0 | NaN |
| 1 | NaN |
| 2 | NaN |
| 3 | 192.279050 |
| 4 | 193.536094 |
| ... | ... |
| 237 | 476.977266 |
| 238 | 473.438121 |
| 239 | 479.967617 |
| 240 | 475.546673 |
| 241 | 466.683298 |

242 rows × 1 column
How should we understand the center parameter? With center=True the window is centred on the current element, taking values from both sides:
# how to understand the center parameter
| | mean |
|---|---|
| 0 | 192.279050 |
| 1 | 193.536094 |
| 2 | 192.578107 |
| 3 | 192.788708 |
| 4 | 190.464279 |
| ... | ... |
| 237 | 475.546673 |
| 238 | 466.683298 |
| 239 | 461.530692 |
| 240 | 461.668874 |
| 241 | 461.959927 |

242 rows × 1 column
How do we read this result? Here the window is 7 (3 before + centre + 3 after) with min_periods=4, and the window is truncated at the edges:
# window shape: 3 + centre + 3; centred on the 1st element -----> rows [0:4]
192.27905005901843
# centred on the 2nd element: 0 + 1 (centre) + 2/3/4 -----> rows [0:5]
193.5360935220985
# centred on the 3rd element: 0/1 + 2 (centre) + 3/4/5 -----> rows [0:6]
192.57810663810037
# centred on the 4th element: 0/1/2 + 3 (centre) + 4/5/6 -----> rows [0:7]
192.7887082607605
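The same arithmetic can be reproduced on a toy series. This sketch uses a window of 3 with min_periods=2 to keep it short:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# trailing window: the label sits at the right edge of the window
trailing = s.rolling(window=3, min_periods=2).mean()

# centred window: the label sits in the middle, so the window also
# reaches forward; at the edges it is truncated, just like the
# [0:4]/[0:5] slices worked out above
centered = s.rolling(window=3, min_periods=2, center=True).mean()

print(trailing.tolist())  # [nan, 15.0, 20.0, 30.0, 40.0]
print(centered.tolist())  # [15.0, 20.0, 30.0, 40.0, 45.0]
```

Note how the centred result is effectively the trailing result shifted so each average lines up with the middle of its window.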
Building the moving-average plotting helper
def plot_moving_average(df,key,freq,col,window,min_periods,ax,title):
Plotting
fig, axes = plt.subplots(nrows=2,
Trend Forecasting
Building the helper
Linear-regression forecasting with the DeterministicProcess class from the statsmodels library:
def plot_deterministic_process(df, key, freq, col, ax1, title1, ax2, title2):
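The helper's body is not shown in full; at its core it fits a linear regression on trend features generated by DeterministicProcess. A minimal standalone sketch on toy data (variable names are ours):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.deterministic import DeterministicProcess

# toy weekly series with a perfectly linear upward trend
idx = pd.period_range('2013-01-06', periods=30, freq='W')
y = pd.Series(10.0 + 2.0 * np.arange(30), index=idx)

# constant=True adds an intercept column, order=1 a linear time step
dp = DeterministicProcess(index=y.index, constant=True, order=1)
X = dp.in_sample()

model = LinearRegression(fit_intercept=False)  # the const column is already in X
model.fit(X, y)

# continue the fitted trend 8 steps past the end of the sample
X_fore = dp.out_of_sample(steps=8)
y_fore = pd.Series(model.predict(X_fore), index=X_fore.index)
print(float(y_fore.iloc[0]))
```

out_of_sample extends the same deterministic features beyond the training index, which is what makes the trend forecastable.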
Plotting
fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(30,30))
Seasonality
Building the seasonal plotting helpers
def seasonal_plot(X,y,period,freq,ax=None):
def plot_periodogram(ts, detrend="linear", ax=None):
def seasonality(df, key, freq, col):
Plotting - trans
seasonality(trans, 'date', 'D', 'transactions')
Plotting - train
seasonality(train, 'date', 'D', 'sales')
Forecasting Seasonality
Building the forecasting helper
def predict_seasonality(df, key, freq, col, ax1, title1):
Plotting the seasonal forecast
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(30,30))
Time Series Forecasting
Building the time series data
Construct time series features from the train data:
store_sales = train.copy()  # work on a copy
| store_nbr | family | date | id | sales | onpromotion |
|---|---|---|---|---|---|
| 1 | AUTOMOTIVE | 2013-01-01 | 0 | 0.0 | 0 |
| | | 2013-01-02 | 1782 | 2.0 | 0 |
| | | 2013-01-03 | 3564 | 3.0 | 0 |
| | | 2013-01-04 | 5346 | 3.0 | 0 |
| | | 2013-01-05 | 7128 | 5.0 | 0 |
family_sales = (
sales | ... | onpromotion | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
family | AUTOMOTIVE | BABY CARE | BEAUTY | BEVERAGES | BOOKS | BREAD/BAKERY | CELEBRATION | CLEANING | DAIRY | DELI | ... | MAGAZINES | MEATS | PERSONAL CARE | PET SUPPLIES | PLAYERS AND ELECTRONICS | POULTRY | PREPARED FOODS | PRODUCE | SCHOOL AND OFFICE SUPPLIES | SEAFOOD |
date | |||||||||||||||||||||
2017-01-01 | 0.092593 | 0.037037 | 0.055556 | 74.222222 | 0.000000 | 9.084685 | 0.129630 | 7.500000 | 11.518519 | 3.629167 | ... | 0.0 | 0.018519 | 0.111111 | 0.018519 | 0.0 | 0.000000 | 0.037037 | 0.129630 | 0.0 | 0.000000 |
2017-01-02 | 11.481481 | 0.259259 | 11.648148 | 6208.055556 | 0.481481 | 844.836296 | 14.203704 | 2233.648148 | 1545.000000 | 539.114833 | ... | 0.0 | 0.462963 | 10.592593 | 0.537037 | 0.0 | 0.259259 | 1.166667 | 5.629630 | 0.0 | 0.407407 |
2017-01-03 | 8.296296 | 0.296296 | 7.185185 | 4507.814815 | 0.814815 | 665.124111 | 10.629630 | 1711.907407 | 1204.203704 | 404.300074 | ... | 0.0 | 0.481481 | 9.722222 | 0.444444 | 0.0 | 0.388889 | 1.351852 | 56.296296 | 0.0 | 0.407407 |
2017-01-04 | 6.833333 | 0.333333 | 6.888889 | 3911.833333 | 0.759259 | 594.160611 | 11.185185 | 1508.037037 | 1107.796296 | 309.397685 | ... | 0.0 | 0.370370 | 12.037037 | 0.444444 | 0.0 | 0.296296 | 5.444444 | 101.277778 | 0.0 | 0.333333 |
2017-01-05 | 6.333333 | 0.351852 | 5.925926 | 3258.796296 | 0.407407 | 495.511611 | 12.444444 | 1241.833333 | 829.277778 | 260.776500 | ... | 0.0 | 8.981481 | 5.666667 | 0.000000 | 0.0 | 0.296296 | 0.907407 | 5.018519 | 0.0 | 0.444444 |
5 rows × 66 columns
mag_sales = family_sales.loc(axis=1)[:, 'MAGAZINES']
| | sales | onpromotion |
|---|---|---|
| family | MAGAZINES | MAGAZINES |
| date | | |
| 2017-01-01 | 0.074074 | 0.0 |
| 2017-01-02 | 7.777778 | 0.0 |
| 2017-01-03 | 3.500000 | 0.0 |
| 2017-01-04 | 3.500000 | 0.0 |
| 2017-01-05 | 3.203704 | 0.0 |
| 2017-01-06 | 2.759259 | 0.0 |
| 2017-01-07 | 5.962963 | 0.0 |
| 2017-01-08 | 5.074074 | 0.0 |
| 2017-01-09 | 3.537037 | 0.0 |
| 2017-01-10 | 3.222222 | 0.0 |
Plotting the time series features
y = mag_sales.loc[:,"sales"].squeeze()
date
2017-01-01 0.074074
2017-01-02 7.777778
2017-01-03 3.500000
2017-01-04 3.500000
2017-01-05 3.203704
...
2017-08-11 9.259259
2017-08-12 8.944444
2017-08-13 8.685185
2017-08-14 8.462963
2017-08-15 8.537037
Freq: D, Name: MAGAZINES, Length: 227, dtype: float64
fourier = CalendarFourier(freq="M", order=4)
Text(0.5, 1.0, 'Magazine Sales (deseasonalized)')
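Deseasonalizing works by fitting seasonal features and subtracting the fit. The post uses CalendarFourier(freq="M", order=4) on the magazine series; the toy sketch below uses weekly Fourier terms (freq='W', our choice for brevity) on a synthetic day-of-week pattern:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess

# synthetic daily series with a pure weekly (day-of-week) pattern
idx = pd.date_range('2017-01-01', periods=120, freq='D')
y = pd.Series(100 + 10 * np.sin(2 * np.pi * idx.dayofweek / 7), index=idx)

fourier = CalendarFourier(freq='W', order=2)  # sin/cos terms within each week
dp = DeterministicProcess(index=idx, constant=True,
                          additional_terms=[fourier], drop=True)
X = dp.in_sample()

model = LinearRegression(fit_intercept=False).fit(X, y)
y_season = pd.Series(model.predict(X), index=idx)
y_deseason = y - y_season  # residual once the weekly pattern is removed
print(float(y_deseason.abs().max()))  # ~0 for this noise-free toy series
```

On real data the residual keeps whatever the Fourier terms cannot explain, which is exactly what the lag features are fitted on next.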
Autocorrelation and Partial Autocorrelation
- Autocorrelation (ACF): also called serial correlation, it is the correlation of a series with itself at different time offsets.
- Partial autocorrelation (PACF): a summary of the relationship between an observation in a time series and an observation at an earlier time step, with the effects of the intervening observations removed. The partial autocorrelation at lag k is the correlation that remains after eliminating any correlation already explained by shorter lags.
def lagplot(x, y=None, lag=1, standardize=False, ax=None, **kwargs):
def plot_lags(x, y=None, lags=6, nrows=1, lagplot_kwargs={}, **kwargs):
Plotting the autocorrelation
fig = plot_lags(y_deseason, lags=8, nrows=2)  # a grid of lag plots
Plotting the partial autocorrelation
fig = plot_pacf(y_deseason, lags=8)  # a single figure
Visualizations for magazine
onpromotion = mag_sales.loc[:,'onpromotion'].squeeze().rename('onpromotion')
Plotting the time series forecast
def make_lags(ts, lags):
X = make_lags(y_deseason, lags=4)
| | y_lag_1 | y_lag_2 | y_lag_3 | y_lag_4 |
|---|---|---|---|---|
| date | | | | |
| 2017-01-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 2017-01-02 | -3.586020e-14 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 2017-01-03 | 3.257003e+00 | -3.586020e-14 | 0.000000e+00 | 0.000000e+00 |
| 2017-01-04 | -2.105813e-01 | 3.257003e+00 | -3.586020e-14 | 0.000000e+00 |
| 2017-01-05 | -1.304980e-01 | -2.105813e-01 | 3.257003e+00 | -3.586020e-14 |
| ... | ... | ... | ... | ... |
| 2017-08-11 | 1.795609e-01 | 8.802533e-01 | 1.126343e+00 | 3.000790e+00 |
| 2017-08-12 | 1.112449e+00 | 1.795609e-01 | 8.802533e-01 | 1.126343e+00 |
| 2017-08-13 | -1.466291e+00 | 1.112449e+00 | 1.795609e-01 | 8.802533e-01 |
| 2017-08-14 | -7.308822e-01 | -1.466291e+00 | 1.112449e+00 | 1.795609e-01 |
| 2017-08-15 | 1.522645e+00 | -7.308822e-01 | -1.466291e+00 | 1.112449e+00 |

227 rows × 4 columns
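Judging from the table above (the leading NaNs have been replaced by 0), make_lags plausibly looks like this sketch (the exact body is an assumption):

```python
import pandas as pd

def make_lags(ts, lags):
    # one column per lag, shifted 1..lags steps; leading NaNs become 0.0
    return pd.concat(
        {f'y_lag_{i}': ts.shift(i) for i in range(1, lags + 1)},
        axis=1,
    ).fillna(0.0)

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name='y')
X = make_lags(s, lags=2)
print(X)
```

Each row of X then holds the previous `lags` values of the series, ready to feed a regressor.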
y = y_deseason.copy()
Forecasting with a Hybrid Model
Building the hybrid model
Combine an XGBoost model with a linear regression model:
store_sales = train.copy()
id | ... | onpromotion | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
family | AUTOMOTIVE | BABY CARE | BEAUTY | BEVERAGES | BOOKS | BREAD/BAKERY | CELEBRATION | CLEANING | DAIRY | DELI | ... | MAGAZINES | MEATS | PERSONAL CARE | PET SUPPLIES | PLAYERS AND ELECTRONICS | POULTRY | PREPARED FOODS | PRODUCE | SCHOOL AND OFFICE SUPPLIES | SEAFOOD |
date | |||||||||||||||||||||
2017-01-01 | 2597248.5 | 2597249.5 | 2597250.5 | 2597251.5 | 2597252.5 | 2597253.5 | 2597254.5 | 2597255.5 | 2597256.5 | 2597257.5 | ... | 0.0 | 0.018519 | 0.111111 | 0.018519 | 0.0 | 0.000000 | 0.037037 | 0.129630 | 0.0 | 0.000000 |
2017-01-02 | 2599030.5 | 2599031.5 | 2599032.5 | 2599033.5 | 2599034.5 | 2599035.5 | 2599036.5 | 2599037.5 | 2599038.5 | 2599039.5 | ... | 0.0 | 0.462963 | 10.592593 | 0.537037 | 0.0 | 0.259259 | 1.166667 | 5.629630 | 0.0 | 0.407407 |
2017-01-03 | 2600812.5 | 2600813.5 | 2600814.5 | 2600815.5 | 2600816.5 | 2600817.5 | 2600818.5 | 2600819.5 | 2600820.5 | 2600821.5 | ... | 0.0 | 0.481481 | 9.722222 | 0.444444 | 0.0 | 0.388889 | 1.351852 | 56.296296 | 0.0 | 0.407407 |
2017-01-04 | 2602594.5 | 2602595.5 | 2602596.5 | 2602597.5 | 2602598.5 | 2602599.5 | 2602600.5 | 2602601.5 | 2602602.5 | 2602603.5 | ... | 0.0 | 0.370370 | 12.037037 | 0.444444 | 0.0 | 0.296296 | 5.444444 | 101.277778 | 0.0 | 0.333333 |
2017-01-05 | 2604376.5 | 2604377.5 | 2604378.5 | 2604379.5 | 2604380.5 | 2604381.5 | 2604382.5 | 2604383.5 | 2604384.5 | 2604385.5 | ... | 0.0 | 8.981481 | 5.666667 | 0.000000 | 0.0 | 0.296296 | 0.907407 | 5.018519 | 0.0 | 0.444444 |
5 rows × 99 columns
class BoostedHybrid:
def predict(self, X_1, X_2):
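Only the class and method headers are shown above. The idea of the hybrid is: fit model_1 (the linear regression) on one feature set, fit model_2 on a second feature set against model_1's residuals, and sum the two predictions. A runnable sketch, with a DecisionTreeRegressor standing in for the XGBRegressor used in the post:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

class BoostedHybrid:
    def __init__(self, model_1, model_2):
        self.model_1 = model_1
        self.model_2 = model_2

    def fit(self, X_1, X_2, y):
        # stage 1: fit the first model on X_1
        self.model_1.fit(X_1, y)
        y_fit = pd.Series(self.model_1.predict(X_1), index=X_1.index)
        # stage 2: fit the second model on what the first one missed
        self.model_2.fit(X_2, y - y_fit)
        return self

    def predict(self, X_1, X_2):
        y_pred = pd.Series(self.model_1.predict(X_1), index=X_1.index)
        return y_pred + self.model_2.predict(X_2)

# toy target: linear trend plus a weekly spike the tree can pick up
t = np.arange(100, dtype=float)
y = pd.Series(2.0 * t + np.where(t % 7 == 0, 10.0, 0.0))
X_1 = pd.DataFrame({'t': t})        # trend feature for the linear part
X_2 = pd.DataFrame({'dow': t % 7})  # day-of-week feature for the tree

model = BoostedHybrid(LinearRegression(), DecisionTreeRegressor(max_depth=3))
model.fit(X_1, X_2, y)
pred = model.predict(X_1, X_2)
print(round(float((y - pred).abs().mean()), 3))
```

The division of labour matters: the linear model extrapolates the trend, while the tree model, which cannot extrapolate, only has to learn the bounded residual pattern.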
Building the dataset
# the target y
Training the model
model = BoostedHybrid(
# combine the models
# build the training and validation sets
# fit
Plotting the predictions
families = y.columns[0:6]
Forecasting with Machine Learning
Training data
store_sales = train.copy()
family_sales = (
Test data
test = test.copy()
Building the multi-step target helper
def make_multistep_target(ts, steps):
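A plausible body for make_multistep_target, consistent with the y_step_1 … y_step_16 columns shown below: stack the next `steps` values of the series as columns and drop the incomplete rows at the end (the exact implementation is an assumption):

```python
import pandas as pd

def make_multistep_target(ts, steps):
    # column y_step_k holds the value k-1 steps ahead of the row's index
    return pd.concat(
        {f'y_step_{i + 1}': ts.shift(-i) for i in range(steps)},
        axis=1,
    ).dropna()

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
Y = make_multistep_target(s, steps=3)
print(Y)
```

With a 16-step target, a single multi-output regressor predicts the whole 16-day horizon at once instead of forecasting recursively.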
y = family_sales.loc[:, 'sales']
# prepare the inputs for the XGBoost model
y_step_1 | y_step_2 | y_step_3 | y_step_4 | y_step_5 | y_step_6 | y_step_7 | y_step_8 | y_step_9 | y_step_10 | y_step_11 | y_step_12 | y_step_13 | y_step_14 | y_step_15 | y_step_16 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | family | ||||||||||||||||
2017-01-06 | AUTOMOTIVE | 6.018519 | 10.259259 | 9.388889 | 5.944444 | 4.777778 | 6.314815 | 5.388889 | 5.240741 | 8.500000 | 10.259259 | 6.407407 | 5.685185 | 5.703704 | 4.777778 | 5.148148 | 8.685185 |
BABY CARE | 0.277778 | 0.259259 | 0.240741 | 0.444444 | 0.240741 | 0.277778 | 0.296296 | 0.296296 | 0.388889 | 0.425926 | 0.314815 | 0.166667 | 0.222222 | 0.129630 | 0.166667 | 0.277778 | |
BEAUTY | 6.518519 | 10.037037 | 11.611111 | 5.648148 | 6.500000 | 5.277778 | 4.370370 | 4.703704 | 7.777778 | 9.037037 | 5.648148 | 5.351852 | 4.740741 | 3.981481 | 4.592593 | 7.111111 | |
BEVERAGES | 3507.277778 | 4848.518519 | 5503.648148 | 3448.203704 | 3171.740741 | 3046.870370 | 2693.722222 | 3226.037037 | 4667.296296 | 5580.611111 | 3700.370370 | 3409.796296 | 3263.462963 | 2676.574074 | 3003.555556 | 4900.611111 | |
BOOKS | 0.537037 | 0.481481 | 0.722222 | 0.500000 | 0.518519 | 0.481481 | 0.388889 | 0.444444 | 0.574074 | 0.555556 | 0.388889 | 0.500000 | 0.407407 | 0.277778 | 0.351852 | 0.685185 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2017-07-31 | POULTRY | 364.955648 | 403.601334 | 377.313980 | 316.436093 | 533.497054 | 416.454018 | 464.596557 | 344.051740 | 313.780869 | 305.270204 | 278.819870 | 468.857370 | 354.342779 | 379.801204 | 344.398297 | 325.679815 |
PREPARED FOODS | 84.698648 | 87.836796 | 88.735963 | 77.173000 | 91.886760 | 100.384963 | 102.248148 | 86.627444 | 77.344130 | 84.796537 | 78.791444 | 96.286926 | 84.693815 | 91.509426 | 86.062500 | 85.954129 | |
PRODUCE | 2257.140589 | 2609.180150 | 3122.895724 | 1792.220910 | 2079.319469 | 2418.970157 | 2675.105815 | 2111.133423 | 2168.535465 | 2663.076241 | 1670.264889 | 2198.854500 | 2070.154646 | 2331.922267 | 2134.399926 | 2316.832796 | |
SCHOOL AND OFFICE SUPPLIES | 30.111111 | 49.333333 | 57.481481 | 51.907407 | 63.222222 | 85.203704 | 100.277778 | 64.407407 | 59.759259 | 53.740741 | 42.962963 | 65.240741 | 67.481481 | 68.851852 | 52.333333 | 46.851852 | |
SEAFOOD | 20.488333 | 20.346852 | 20.801037 | 17.116296 | 25.553963 | 24.209519 | 23.512852 | 18.419852 | 18.481130 | 18.181426 | 13.284463 | 23.566963 | 19.037593 | 20.704574 | 17.975556 | 17.966241 |
6831 rows × 16 columns
Modeling and Prediction
# instantiate the model
Plotting the predictions
def plot_multistep(y, every=1, ax=None, palette_kwargs=None):
FAMILY = 'BEAUTY'