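Kaggle in Practice: Hotel Booking Modeling

This article analyzes and models a hotel booking dataset from Kaggle. It walks through loading and inspecting the data, handling missing and abnormal values, exploratory analysis, feature engineering, and a comparison of several classification models.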

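Results

A notebook on Kaggle reports an accuracy of 99.5%. I borrowed some of its code, added some of my own processing ideas, and reached 99.6%. Most of the improvement comes from the effort put into feature engineering.

Original notebook on Kaggle: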

https://www.kaggle.com/code/niteshyadav3103/hotel-booking-prediction-99-5-acc/notebook

Importing libraries

These libraries are mainly used for data processing, visualization, modeling, and scoring.

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 32)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import plotly.express as px
%matplotlib inline

# Missing-value visualization
import missingno as msno
# Map visualization
import folium
from folium.plugins import HeatMap

# Modeling
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score, precision_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import VotingClassifier

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

Loading the data
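The cell that reads the data is not part of this write-up, so the following is only a minimal sketch of how df could be created, not the original code. It assumes the CSV has been downloaded from Kaggle and saved locally as hotel_bookings.csv; both the file name and the helper load_bookings are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical local path to the downloaded CSV; adjust it to your own copy.
DATA_PATH = "hotel_bookings.csv"


def load_bookings(path=DATA_PATH):
    """Read the bookings CSV into a DataFrame (illustrative helper)."""
    return pd.read_csv(path)


# In the notebook this would be followed by:
# df = load_bookings()
# df.head()
```

Wrapping the read in a small function keeps the path in one place, which helps if the file is moved later.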

Inspect the basic information of the data:

In [3]:

# 1. Column names

df.columns

Out[3]:

Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
'arrival_date_month', 'arrival_date_week_number',
'arrival_date_day_of_month', 'stays_in_weekend_nights',
'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
'country', 'market_segment', 'distribution_channel',
'is_repeated_guest', 'previous_cancellations',
'previous_bookings_not_canceled', 'reserved_room_type',
'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
'company', 'days_in_waiting_list', 'customer_type', 'adr',
'required_car_parking_spaces', 'total_of_special_requests',
'reservation_status', 'reservation_status_date'],
dtype='object')

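What each field means:

  • hotel: hotel type
  • is_canceled: whether the booking was canceled
  • lead_time: lead time (days between booking and arrival)
  • arrival_date_year: arrival year
  • arrival_date_month: arrival month
  • arrival_date_week_number: arrival week number
  • arrival_date_day_of_month: arrival day of the month
  • stays_in_weekend_nights: number of weekend nights
  • stays_in_week_nights: number of weekday nights
  • adults: number of adults
  • children: number of children
  • babies: number of babies
  • meal: meal plan
  • country: country
  • market_segment: market segment
  • distribution_channel: distribution channel
  • is_repeated_guest: whether the guest is a repeat guest
  • previous_cancellations: number of previous cancellations
  • previous_bookings_not_canceled: number of previous bookings not canceled
  • reserved_room_type: reserved room type
  • assigned_room_type: assigned room type
  • booking_changes: number of booking changes
  • deposit_type: deposit type
  • agent: travel agency (ID)
  • company: company (ID)
  • days_in_waiting_list: days on the waiting list
  • customer_type: customer type
  • adr: average daily rate
  • required_car_parking_spaces: number of parking spaces required
  • total_of_special_requests: number of special requests (e.g. a high floor or twin beds)
  • reservation_status: reservation status
  • reservation_status_date: date on which the reservation status was set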

In [4]:

# 2. Total number of columns
len(df.columns)

Out[4]:

32

In [5]:

# 3. Column dtypes
df.dtypes

Out[5]:

hotel                              object
is_canceled int64
lead_time int64
arrival_date_year int64
arrival_date_month object
arrival_date_week_number int64
arrival_date_day_of_month int64
stays_in_weekend_nights int64
stays_in_week_nights int64
adults int64
children float64
babies int64
meal object
country object
market_segment object
distribution_channel object
is_repeated_guest int64
previous_cancellations int64
previous_bookings_not_canceled int64
reserved_room_type object
assigned_room_type object
booking_changes int64
deposit_type object
agent float64
company float64
days_in_waiting_list int64
customer_type object
adr float64
required_car_parking_spaces int64
total_of_special_requests int64
reservation_status object
reservation_status_date object
dtype: object

In [6]:

# 4. Number of columns per dtype
df.dtypes.value_counts()

Out[6]:

int64      16
object     12
float64     4
dtype: int64

In [7]:

# 5. Data size
df.shape

Out[7]:

(119390, 32)

In [8]:

# 6. Descriptive statistics
df.describe()

The info method gives a full summary of the data, including column dtypes, the index range, the number of columns, and the non-null count of each column.

# 7. Full data summary
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 hotel 119390 non-null object
1 is_canceled 119390 non-null int64
2 lead_time 119390 non-null int64
3 arrival_date_year 119390 non-null int64
4 arrival_date_month 119390 non-null object
5 arrival_date_week_number 119390 non-null int64
6 arrival_date_day_of_month 119390 non-null int64
7 stays_in_weekend_nights 119390 non-null int64
8 stays_in_week_nights 119390 non-null int64
9 adults 119390 non-null int64
10 children 119386 non-null float64
11 babies 119390 non-null int64
12 meal 119390 non-null object
13 country 118902 non-null object
14 market_segment 119390 non-null object
15 distribution_channel 119390 non-null object
16 is_repeated_guest 119390 non-null int64
17 previous_cancellations 119390 non-null int64
18 previous_bookings_not_canceled 119390 non-null int64
19 reserved_room_type 119390 non-null object
20 assigned_room_type 119390 non-null object
21 booking_changes 119390 non-null int64
22 deposit_type 119390 non-null object
23 agent 103050 non-null float64
24 company 6797 non-null float64
25 days_in_waiting_list 119390 non-null int64
26 customer_type 119390 non-null object
27 adr 119390 non-null float64
28 required_car_parking_spaces 119390 non-null int64
29 total_of_special_requests 119390 non-null int64
30 reservation_status 119390 non-null object
31 reservation_status_date 119390 non-null object
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB

Missing values

Missing values per column

Count the number and percentage of missing values in each column:

In [10]:

null_df = pd.DataFrame({
    "Null Values": df.isnull().sum(),
    "Percentage Null Values": df.isnull().sum() / df.shape[0] * 100,
})

null_df

Visualizing missing values

Plot the missing-value information:

In [11]:

msno.bar(df, color="blue")

plt.show()

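Handling missing values

1. The columns children and country each have less than 1% missing values, a very small share, so we simply drop the rows where either one is missing:

# Keep only the rows where both columns are present
df = df.dropna(subset=["country", "children"])

df.head()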

2. The column company is 94.3% missing, so we drop the whole column:

In [14]:

df.drop("company", axis=1, inplace=True)

3. The column agent (the ID of the travel agency that made the booking) is 13.68% missing. We handle it as follows:

In [15]:

# 1. Inspect the column first
df["agent"].value_counts()

Out[15]:

9.0      31959
240.0 13871
1.0 7191
14.0 3638
7.0 3539
...
70.0 1
93.0 1
54.0 1
497.0 1
59.0 1
Name: agent, Length: 332, dtype: int64

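Candidate fill values include:

  • 0: the missing value cannot be determined
  • 9: the mode
  • the mean: the mean of the existing values

Here we fill with 0:

In [16]:

# Assign the result back instead of calling fillna(inplace=True) on a column selection
df["agent"] = df["agent"].fillna(0)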
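To make the three candidate fill values concrete, here is a small sketch on a made-up column. The numbers are invented for illustration and are not taken from the dataset. Because agent holds agency IDs rather than a quantity, the mean does not correspond to any real agency, which is one reason a marker value such as 0 is a reasonable choice.

```python
import pandas as pd

# Toy stand-in for the agent column (values are made up for illustration).
agent = pd.Series([9.0, 9.0, 240.0, None, 1.0, None])

fill_zero = agent.fillna(0)                # 0 acts as a "no agent recorded" marker
fill_mode = agent.fillna(agent.mode()[0])  # most frequent observed value
fill_mean = agent.fillna(agent.mean())     # mean of the observed values

comparison = pd.DataFrame({"zero": fill_zero, "mode": fill_mode, "mean": fill_mean})
print(comparison)
```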

Special handling

Handling 1: the number of guests cannot be zero

The numbers of adults, children, and babies in a booking cannot all be 0 at the same time:

In [18]:

special = (df["children"] == 0) & (df["adults"] == 0) & (df["babies"] == 0)
special.head()

Out[18]:

0    False
1    False
2    False
3    False
4    False
dtype: bool

# Exclude these rows
df = df[~special]

Handling 2: adr (average daily rate)

  • Values cannot be negative
  • The maximum is 5400, which is clearly an outlier

In [21]:

df["adr"].value_counts().sort_index()

Out[21]:

-6.38          1
0.00 1799
0.26 1
0.50 1
1.00 14
...
450.00 1
451.50 1
508.00 1
510.00 1
5400.00 1
Name: adr, Length: 8857, dtype: int64

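In [22]:

A violin plot shows the distribution. Before cleaning there is an obvious outlier:

px.violin(y=df["adr"])   # before cleaning

A box plot shows the same thing:

px.box(df, y="adr")

The outlier turns out to be a City Hotel booking. Remove the abnormal rows:

In [25]:

# Alternative: drop every row with adr above 1000
# df = df.drop(df[df.adr > 1000].index)

df = df[(df["adr"] >= 0) & (df["adr"] < 5400)]  # exclude the abnormal values

In [26]:

px.violin(y=df["adr"])  # after cleaning

px.box(df, y="adr", color="hotel")   # after cleaning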
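The outlier can also be located in code rather than read off a plot. The sketch below looks up the row with the largest adr and then applies the same range filter as above. The rows are made up for illustration and only reuse the extreme values that appear in the value counts.

```python
import pandas as pd

# Made-up rows standing in for the bookings table.
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel", "Resort Hotel"],
    "adr": [-6.38, 5400.0, 95.0, 120.5],
})

# Row holding the largest adr, i.e. the suspected outlier.
outlier = toy.loc[toy["adr"].idxmax()]
print(outlier["hotel"], outlier["adr"])

# Same filter as in the cell above: keep rows with 0 <= adr < 5400.
cleaned = toy[(toy["adr"] >= 0) & (toy["adr"] < 5400)]
print(len(cleaned))
```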

Exploratory Data Analysis (EDA)

Canceled versus non-canceled bookings

In [28]:

df["is_canceled"].value_counts()

Out[28]:

0    74589
1    44137
Name: is_canceled, dtype: int64

In [29]:

# Canceled vs. not canceled: 0 = not canceled, 1 = canceled
sns.countplot(x="is_canceled", data=df)

plt.show()

Where do the guests who did not cancel come from?

In [30]:

data = df[df.is_canceled == 0].copy()  # bookings that were not canceled

In [31]:

number_no_canceled = data["country"].value_counts().reset_index()
number_no_canceled.columns = ["country", "number_of_no_canceled"]

number_no_canceled

# Map visualization

basemap = folium.Map()
guests_map = px.choropleth(number_no_canceled,  # data
                           locations=number_no_canceled['country'],  # locations
                           color=number_no_canceled['number_of_no_canceled'],  # color values
                           hover_name=number_no_canceled['country'])  # hover label
guests_map.show()

Conclusion 1: most guests come from Portugal, and most of the other countries are European.

What is the average daily rate of a room?

In [33]:

px.box(data,  # data
       x="reserved_room_type",  # x
       y="adr",  # y
       color="hotel",  # color
       template="plotly_dark",  # theme
       category_orders={"reserved_room_type": ["A","B","C","D","E","F","G","H","L"]}  # display order
       )

Conclusion 2: the average price of a room depends on its type, and so does the price spread (standard deviation).

What is the nightly price over the year?

How the average price of the two hotel types changes over the year

In [34]:

data_resort = data[data["hotel"] == "Resort Hotel"]
data_city = data[data["hotel"] == "City Hotel"]

In [35]:

resort_hotel = data_resort.groupby(['arrival_date_month'])['adr'].mean().reset_index()
city_hotel = data_city.groupby(['arrival_date_month'])['adr'].mean().reset_index()
city_hotel

# Merge the two frames

total_hotel = pd.merge(resort_hotel, city_hotel,
                       on="arrival_date_month")
total_hotel.columns = ["month", "price_resort", "price_city"]
total_hotel

To sort the months in calendar order, install two packages.

In a Jupyter notebook they can be installed directly by prefixing the command with !:

In [37]:

!pip install sort-dataframeby-monthorweek
!pip install sorted-months-weekdays

In [39]:

import sort_dataframeby_monthorweek as sd

# Custom sort helper
def sort_month(df, column):
    result = sd.Sort_Dataframeby_Month(df, column)
    return result

In [40]:

new_total_hotel = sort_month(total_hotel, "month")
new_total_hotel
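If installing extra packages is not an option, the same ordering can be obtained with pandas and the standard library alone. This is an alternative sketch, not the method used in this article. It assumes English month names, as in the dataset, and pandas 1.1 or later for the key argument of sort_values. The helper name sort_by_month and the prices are made up.

```python
import calendar

import pandas as pd

# Map 'January' -> 1 ... 'December' -> 12 (the empty first entry is skipped).
MONTH_ORDER = {name: i for i, name in enumerate(calendar.month_name) if name}


def sort_by_month(frame, column):
    """Sort rows by an English month-name column in calendar order."""
    return (frame.sort_values(column, key=lambda s: s.map(MONTH_ORDER))
                 .reset_index(drop=True))


# Made-up prices, deliberately out of calendar order.
toy = pd.DataFrame({
    "month": ["March", "January", "December", "February"],
    "price": [60.0, 50.0, 70.0, 55.0],
})

print(sort_by_month(toy, "month"))
```

Sorting through a key function leaves the month column as plain strings, so later plotting code does not need to change.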

The months are now in the correct order:

fig = px.line(new_total_hotel,
              x="month",
              y=["price_resort", "price_city"],
              title="Price per night over the months",
              template="plotly_dark")

fig.show()

Conclusions:

  • In summer, Resort Hotel is clearly more expensive than City Hotel
  • City Hotel prices vary less over the year, but they are already high in April and stay high through September

KDE plot

KDE (Kernel Density Estimation) can be thought of as a windowed smoothing of a histogram. A KDE plot shows how the data are distributed under different conditions.

In [42]:

plt.figure(figsize=(6,3), dpi=150)

# fill=True is the current spelling of the deprecated shade=True
ax = sns.kdeplot(new_total_hotel["price_resort"],
                 color="green",
                 fill=True)

ax = sns.kdeplot(new_total_hotel["price_city"],
                 color="blue",
                 fill=True)

# The x-axis of a KDE plot is the variable itself (price); the y-axis is its density
ax.set_xlabel("Average price per night")
ax.set_ylabel("Density")
ax.legend(["Resort", "City"])
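To make the idea of "windowed smoothing" concrete, the sketch below builds a Gaussian KDE by hand with NumPy: each sample contributes one bell-shaped bump, and the density is the average of the bumps. The prices and the bandwidth of 8.0 are made up for illustration. seaborn chooses its own bandwidth, so its curve will not match this one exactly.

```python
import numpy as np


def gaussian_kde(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at every grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = standardized distance between grid point i and sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)


# Made-up nightly prices, for illustration only.
prices = [50, 55, 60, 62, 90, 95]
grid = np.linspace(0, 150, 301)
density = gaussian_kde(prices, grid, bandwidth=8.0)

# A density should integrate to roughly 1 over a wide enough grid.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual samples, while a larger one smooths them together.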

The busiest months

In [43]:

resort_guests = data_resort['arrival_date_month'].value_counts().reset_index()
resort_guests.columns = ['Month', 'No_Resort_Guests']

city_guests = data_city['arrival_date_month'].value_counts().reset_index()
city_guests.columns = ['Month', 'No_City_Guests']

# Merge the two DataFrames on the shared Month column
final_guests = pd.merge(resort_guests, city_guests, on="Month")

Sort the months in the same way:

In [45]:

new_final_guests = sort_month(final_guests, "Month")
new_final_guests

fig = px.line(new_final_guests,
              x="Month",
              y=["No_Resort_Guests", "No_City_Guests"],
              title="Number of guests per month",
              template="plotly_dark")

fig.show()

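Conclusions:

  1. City Hotel clearly has more guests than Resort Hotel, so it is the more popular of the two
  2. In July and August City Hotel's guest numbers peak, even though prices are high (see the price chart above)
  3. Both hotels have few guests in winter

How long do guests stay?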

In [47]:

data["total_nights"] = data['stays_in_weekend_nights'] + data['stays_in_week_nights']
data.head()

Number of stays by length of stay, for each of the two hotels:

In [48]:

stay_groupby = (data.groupby(['total_nights', 'hotel'])["is_canceled"]
                .agg("count")
                .reset_index()
                .rename(columns={"is_canceled": "Number of stays"}))

stay_groupby.head()

Out[48]:

total_nights hotel Number of stays
0 0 City Hotel 251
1 0 Resort Hotel 366
2 1 City Hotel 9155
3 1 Resort Hotel 6368
4 2 City Hotel 10983

In [49]:

fig = px.bar(stay_groupby,
x = "total_nights",
y = "Number of stays",
color = "hotel",
barmode = "group"
)

fig.show()

Data preprocessing

Correlation check

In [50]:

plt.figure(figsize=(24,12))

# Correlation is only defined for numeric columns
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, linewidths=1)
plt.show()

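Take the absolute value of each feature's correlation coefficient with the target is_canceled and sort in descending order:

In [51]:

corr_with_iscanceled = (df.select_dtypes("number").corr()["is_canceled"]
                        .abs()
                        .sort_values(ascending=False))

corr_with_iscanceled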

Out[51]:

is_canceled                       1.000000
lead_time 0.291619
total_of_special_requests 0.235923
required_car_parking_spaces 0.195013
booking_changes 0.145139
previous_cancellations 0.109911
is_repeated_guest 0.084115
adults 0.056129
previous_bookings_not_canceled 0.055494
days_in_waiting_list 0.054114
agent 0.046842
adr 0.045910
babies 0.032605
stays_in_week_nights 0.024825
arrival_date_year 0.016419
arrival_date_week_number 0.007668
arrival_date_day_of_month 0.006022
children 0.004536
stays_in_weekend_nights 0.002192
Name: is_canceled, dtype: float64

Dropping unused columns

In [52]:

no_use_col = ['arrival_date_year', 'assigned_room_type',
              'booking_changes', 'reservation_status',
              'country', 'days_in_waiting_list']

In [53]:

df.drop(no_use_col, axis=1, inplace=True)

Feature engineering

Categorical variables

In [54]:

df["hotel"].dtype  # dtype of a single column (a Series)

Out[54]:

dtype('O')

In [55]:

cat_cols = [col for col in df.columns if df[col].dtype == "O"]
cat_cols

Out[55]:

['hotel',
'arrival_date_month',
'meal',
'market_segment',
'distribution_channel',
'reserved_room_type',
'deposit_type',
'customer_type',
'reservation_status_date']

In [56]:

cat_df = df[cat_cols].copy()  # copy, so that later column assignments do not touch df

In [57]:

cat_df.dtypes

Out[57]:

hotel                      object
arrival_date_month object
meal object
market_segment object
distribution_channel object
reserved_room_type object
deposit_type object
customer_type object
reservation_status_date object
dtype: object

In [58]:

# 1. Convert to datetime

cat_df['reservation_status_date'] = pd.to_datetime(cat_df['reservation_status_date'])

In [59]:

# 2. Extract year, month, and day

cat_df["year"] = cat_df['reservation_status_date'].dt.year
cat_df['month'] = cat_df['reservation_status_date'].dt.month
cat_df['day'] = cat_df['reservation_status_date'].dt.day

In [60]:

# 3. Drop the columns that are no longer needed
cat_df.drop(['reservation_status_date', 'arrival_date_month'], axis=1, inplace=True)

In [61]:

# 4. Unique values of each column

for col in cat_df.columns:
    print(f"{col}: \n{cat_df[col].unique()}\n")

hotel:
['Resort Hotel' 'City Hotel']

meal:
['BB' 'FB' 'HB' 'SC' 'Undefined']

market_segment:
['Direct' 'Corporate' 'Online TA' 'Offline TA/TO' 'Complementary' 'Groups'
'Aviation']

distribution_channel:
['Direct' 'Corporate' 'TA/TO' 'Undefined' 'GDS']

reserved_room_type:
['C' 'A' 'D' 'E' 'G' 'F' 'H' 'L' 'B']

deposit_type:
['No Deposit' 'Refundable' 'Non Refund']

customer_type:
['Transient' 'Contract' 'Transient-Party' 'Group']

year:
[2015 2014 2016 2017]

month:
[ 7 5 4 6 3 8 9 1 11 10 12 2]

day:
[ 1 2 3 6 22 23 5 7 8 11 16 29 19 18 9 13 4 12 26 17 15 10 20 14
30 28 25 21 27 24 31]

Feature encoding

In [62]:

# Hotel
cat_df['hotel'] = cat_df['hotel'].map({'Resort Hotel': 0,
                                       'City Hotel': 1})
# Meal plan
cat_df['meal'] = cat_df['meal'].map({'BB': 0,
                                     'FB': 1,
                                     'HB': 2,
                                     'SC': 3,
                                     'Undefined': 4})
# Market segment
cat_df['market_segment'] = (cat_df['market_segment']
                            .map({'Direct': 0,
                                  'Corporate': 1,
                                  'Online TA': 2,
                                  'Offline TA/TO': 3,
                                  'Complementary': 4,
                                  'Groups': 5,
                                  'Undefined': 6,
                                  'Aviation': 7}))
# Distribution channel
cat_df['distribution_channel'] = (cat_df['distribution_channel']
                                  .map({'Direct': 0,
                                        'Corporate': 1,
                                        'TA/TO': 2,
                                        'Undefined': 3,
                                        'GDS': 4}))
# Reserved room type
cat_df['reserved_room_type'] = (cat_df['reserved_room_type']
                                .map({'C': 0,
                                      'A': 1,
                                      'D': 2,
                                      'E': 3,
                                      'G': 4,
                                      'F': 5,
                                      'H': 6,
                                      'L': 7,
                                      'B': 8}))
# Deposit type
cat_df['deposit_type'] = (cat_df['deposit_type']
                          .map({'No Deposit': 0,
                                'Refundable': 1,
                                'Non Refund': 3}))
# Customer type
cat_df['customer_type'] = (cat_df['customer_type']
                           .map({'Transient': 0,
                                 'Contract': 1,
                                 'Transient-Party': 2,
                                 'Group': 3}))
# Year
cat_df['year'] = cat_df['year'].map({2015: 0, 2014: 1, 2016: 2, 2017: 3})
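The hand-written dictionaries above largely follow the order in which the values appear in the unique() output. As a sketch of how such a mapping could be generated instead of typed, the example below builds it from the data and compares it with pd.factorize. The sample column is made up, and this is an alternative illustration, not the encoding used in this article. Note that map returns NaN for any category missing from the dictionary.

```python
import pandas as pd

# Made-up sample of one categorical column.
meal = pd.Series(["BB", "FB", "BB", "HB", "SC", "HB"])

# Build {category: code} in order of first appearance, then apply it.
mapping = {value: code for code, value in enumerate(meal.unique())}
encoded = meal.map(mapping)

# pd.factorize yields the same codes, plus the list of categories.
codes, uniques = pd.factorize(meal)

print(mapping)
print(encoded.tolist())
```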

Continuous variables

In [63]:

num_df = df.drop(columns=cat_cols)

num_df.drop("is_canceled", axis=1, inplace=True)

# Log-transform the columns with large variance
log_col = ["lead_time", "arrival_date_week_number", "arrival_date_day_of_month", "agent", "adr"]

for col in log_col:
    num_df[col] = np.log1p(num_df[col])  # log(1 + x), which is defined at x = 0

num_df.head()
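A small sketch of what the log(1 + x) transform does to a wide-ranging column: zeros stay at zero, large values are compressed, and the transform can be undone with np.expm1. The lead-time values below are made up for illustration.

```python
import numpy as np

# Made-up lead times in days, spanning several orders of magnitude.
lead_time = np.array([0, 1, 7, 30, 365, 700], dtype=float)

logged = np.log1p(lead_time)   # log(1 + x)
restored = np.expm1(logged)    # exp(y) - 1, the inverse transform

print(lead_time.var(), logged.var())
```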

Modeling

Combining the two DataFrames

In [68]:

X = pd.concat([cat_df, num_df], axis=1)
y = df["is_canceled"]

In [69]:

print(X.shape)
print(y.shape)

(118726, 25)
(118726,)

Splitting the data

In [70]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=412)

Model 1: Logistic Regression

In [71]:

# Instantiate and fit the model
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Predictions
y_pred_lr = lr.predict(X_test)

# Classification metrics
acc_lr = accuracy_score(y_test, y_pred_lr)
conf = confusion_matrix(y_test, y_pred_lr)
clf_report = classification_report(y_test, y_pred_lr)

print(f"Accuracy Score of Logistic Regression is : {acc_lr}")
print(f"Confusion Matrix : \n{conf}")
print(f"Classification Report : \n{clf_report}")

Accuracy Score of Logistic Regression is : 0.8126842415564727
Confusion Matrix :
[[14133   836]
 [ 3612  5165]]
Classification Report :
              precision    recall  f1-score   support

           0       0.80      0.94      0.86     14969
           1       0.86      0.59      0.70      8777

    accuracy                           0.81     23746
   macro avg       0.83      0.77      0.78     23746
weighted avg       0.82      0.81      0.80     23746
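The figures in the report can be recomputed by hand from the confusion matrix, which helps when reading the other models' results. The sketch below uses the four counts printed above, with class 1 (canceled) as the positive class.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
conf = np.array([[14133, 836],
                 [3612, 5165]])

tn, fp, fn, tp = conf.ravel()

accuracy = (tp + tn) / conf.sum()
precision = tp / (tp + fp)  # of the bookings predicted as canceled, the share that really were
recall = tp / (tp + fn)     # of the bookings actually canceled, the share that was found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
```

The recall of about 0.59 for class 1 shows where logistic regression falls short: it misses roughly four in ten cancellations even though overall accuracy is 81%.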

In [72]:

# Visualize the confusion matrix

classes = ["0", "1"]

disp = ConfusionMatrixDisplay(confusion_matrix=conf, display_labels=classes)
disp.plot(
    include_values=True,  # show the count in each cell
    cmap="GnBu",  # any colormap name that matplotlib recognizes
    ax=None,
    xticks_rotation="horizontal",
    values_format="d"
)

plt.show()

Model 2: KNN

In [73]:

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

y_pred_knn = knn.predict(X_test)

acc_knn = accuracy_score(y_test, y_pred_knn)
conf = confusion_matrix(y_test, y_pred_knn)
clf_report = classification_report(y_test, y_pred_knn)

Model 3: Decision Tree

In [74]:

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

y_pred_dtc = dtc.predict(X_test)

acc_dtc = accuracy_score(y_test, y_pred_dtc)
conf = confusion_matrix(y_test, y_pred_dtc)
clf_report = classification_report(y_test, y_pred_dtc)

Model 4: Random Forest

In [75]:

rd_clf = RandomForestClassifier()
rd_clf.fit(X_train, y_train)

y_pred_rd_clf = rd_clf.predict(X_test)

acc_rd_clf = accuracy_score(y_test, y_pred_rd_clf)
conf = confusion_matrix(y_test, y_pred_rd_clf)
clf_report = classification_report(y_test, y_pred_rd_clf)

Model 5: AdaBoost

In [76]:

# The base estimator is passed positionally because its keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2
ada = AdaBoostClassifier(dtc)
ada.fit(X_train, y_train)

y_pred_ada = ada.predict(X_test)

acc_ada = accuracy_score(y_test, y_pred_ada)
conf = confusion_matrix(y_test, y_pred_ada)
clf_report = classification_report(y_test, y_pred_ada)

Model 6: Gradient Boosting Classifier

In [77]:

gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

y_pred_gb = gb.predict(X_test)

acc_gb = accuracy_score(y_test, y_pred_gb)
conf = confusion_matrix(y_test, y_pred_gb)
clf_report = classification_report(y_test, y_pred_gb)

Model 7: XGBoost Classifier

In [78]:

xgb = XGBClassifier(booster='gbtree',
                    learning_rate=0.1,
                    max_depth=5,
                    n_estimators=180)

xgb.fit(X_train, y_train)

y_pred_xgb = xgb.predict(X_test)

acc_xgb = accuracy_score(y_test, y_pred_xgb)
conf = confusion_matrix(y_test, y_pred_xgb)
clf_report = classification_report(y_test, y_pred_xgb)

Model 8: CatBoost Classifier

In [79]:

cat = CatBoostClassifier(iterations=100)
cat.fit(X_train, y_train)

y_pred_cat = cat.predict(X_test)

acc_cat = accuracy_score(y_test, y_pred_cat)
conf = confusion_matrix(y_test, y_pred_cat)
clf_report = classification_report(y_test, y_pred_cat)

Model 9: Extra Trees Classifier

In [80]:

etc = ExtraTreesClassifier()
etc.fit(X_train, y_train)

y_pred_etc = etc.predict(X_test)

acc_etc = accuracy_score(y_test, y_pred_etc)
conf = confusion_matrix(y_test, y_pred_etc)
clf_report = classification_report(y_test, y_pred_etc)

Model 10: LGBM

In [81]:

lgbm = LGBMClassifier(learning_rate=1)
lgbm.fit(X_train, y_train)

y_pred_lgbm = lgbm.predict(X_test)

acc_lgbm = accuracy_score(y_test, y_pred_lgbm)
conf = confusion_matrix(y_test, y_pred_lgbm)
clf_report = classification_report(y_test, y_pred_lgbm)
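The ten model cells above all repeat the same fit, predict, and score steps. As a sketch of how that pattern could be factored out, the hypothetical helper fit_and_score below wraps the three metric calls. It runs on synthetic data from make_classification so that it is self-contained; the data, the helper name, and the two example models are assumptions for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_and_score(model, X_train, X_test, y_train, y_test):
    """Fit `model`, then return (accuracy, confusion matrix, text report)."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return (accuracy_score(y_test, y_pred),
            confusion_matrix(y_test, y_pred),
            classification_report(y_test, y_pred))


# Synthetic stand-in data so the sketch runs on its own.
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
split = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

scores = {}
for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Decision Tree", DecisionTreeClassifier(random_state=0))]:
    acc, conf, report = fit_and_score(model, *split)
    scores[name] = acc

print(scores)
```

Collecting the accuracies in one dictionary also makes it straightforward to build the comparison table at the end.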

Model 11: Voting Classifier

This is the key model: a vote among multiple classifiers.

In [82]:

classifiers = [('Gradient Boosting Classifier', gb),
               ('Cat Boost Classifier', cat),
               ('XGboost', xgb),
               ('Decision Tree', dtc),
               ('Extra Tree', etc),
               ('Light Gradient', lgbm),
               ('Random Forest', rd_clf),
               ('Ada Boost', ada),
               ('Logistic', lr),
               ('Knn', knn)]

vc = VotingClassifier(estimators=classifiers)
vc.fit(X_train, y_train)

y_pred_vc = vc.predict(X_test)

acc_vtc = accuracy_score(y_test, y_pred_vc)
conf = confusion_matrix(y_test, y_pred_vc)
clf_report = classification_report(y_test, y_pred_vc)
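VotingClassifier is used here with its default voting='hard', which means each model casts one vote per sample and the majority label wins. The sketch below reproduces that idea with NumPy for binary labels. The predictions and the helper majority_vote are made up for illustration and are not scikit-learn's implementation.

```python
import numpy as np

# Made-up 0/1 predictions from three hypothetical models for five bookings.
predictions = np.array([
    [1, 0, 1, 1, 0],  # model A
    [1, 1, 0, 1, 0],  # model B
    [0, 0, 1, 1, 0],  # model C
])


def majority_vote(preds):
    """Return the label chosen by most models for each sample (binary labels)."""
    votes_for_one = preds.sum(axis=0)
    # Label 1 wins only with a strict majority, so a tie falls back to label 0.
    return (votes_for_one * 2 > preds.shape[0]).astype(int)


print(majority_vote(predictions))
```

With ten estimators, as in the cell above, ties are possible, which is why the tie rule matters. An odd number of voters avoids the question altogether.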

Deep learning model with Keras

Data preparation and split

In [84]:

from tensorflow.keras.utils import to_categorical

X = pd.concat([cat_df, num_df], axis=1)
# One-hot encode the target
y = to_categorical(df['is_canceled'])
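For a 0/1 target, one-hot encoding turns each label into a two-element row with a single 1 in the position of its class. The sketch below shows the equivalent operation in plain NumPy on made-up labels. It is meant to illustrate the shape of y, not to replace to_categorical.

```python
import numpy as np

labels = np.array([0, 1, 1, 0])  # made-up is_canceled values

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(2)[labels]
print(one_hot)

# argmax along each row recovers the original labels.
recovered = one_hot.argmax(axis=1)
```

This is why the network defined below ends in a layer with two units: one output per column of y.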

In [85]:

# Split the data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [86]:

import tensorflow as tf
import keras
from keras.layers import Dense
from keras.models import Sequential

Building the network

In [87]:

X.shape[1]

Out[87]:

25

In [88]:

model = Sequential()

model.add(Dense(100, activation="relu",input_shape=(X.shape[1], )))
model.add(Dense(100, activation="relu"))
model.add(Dense(2, activation="sigmoid"))

model.compile(optimizer="adam",
loss="binary_crossentropy",
metrics=["accuracy"])

model_history = model.fit(X_train,
y_train,
validation_data = (X_test, y_test),
epochs = 50)

Visualizing the metrics: loss

In [89]:

train_loss = model_history.history["loss"]
val_loss = model_history.history["val_loss"]

epoch = range(1,51)

loss = pd.DataFrame({"train_loss": train_loss,
                     "val_loss": val_loss})
loss.head()

Out[89]:

train_loss val_loss
0 0.320637 0.189505
1 0.152539 0.394518
2 0.112900 0.094703
3 0.091933 0.105522
4 0.078177 0.096059

In [90]:

fig = px.line(loss,
x=epoch,
y=['val_loss','train_loss'],
title='Train and Val Loss')

fig.show()

Visualizing the metrics: accuracy

In [91]:

train_acc = model_history.history["accuracy"]
val_acc = model_history.history["val_accuracy"]

epoch = range(1,51)

acc = pd.DataFrame({"train_acc": train_acc,
                    "val_acc": val_acc})

px.line(acc,
        x=epoch,
        y=['val_acc', 'train_acc'],
        title='Train and Val Accuracy',
        template='plotly_dark')

# Final accuracy on the test set

acc_ann = model.evaluate(X_test, y_test)[1]
acc_ann

# Output
743/743 [==============================] - 2s 2ms/step - loss: 0.0504 - accuracy: 0.9867
0.986692488193512

Model comparison

Comparing the results of the different models

In [93]:

models = pd.DataFrame({
    'Model': ['Logistic Regression', 'KNN',
              'Decision Tree Classifier',
              'Random Forest Classifier',
              'Ada Boost Classifier',
              'Gradient Boosting Classifier',
              'XgBoost', 'Cat Boost',
              'Extra Trees Classifier',
              'LGBM', 'Voting Classifier', 'ANN'],
    'Score': [acc_lr, acc_knn, acc_dtc,
              acc_rd_clf, acc_ada, acc_gb,
              acc_xgb, acc_cat, acc_etc,
              acc_lgbm, acc_vtc, acc_ann]
})

models = models.sort_values(by='Score', ascending=True, ignore_index=True)

models["Score"] = models["Score"].round(4)
models

Visual comparison of the model scores:

fig = px.bar(models,
             x="Score",
             y="Model",
             text="Score",
             color="Score",
             template="plotly_dark",
             title="Model Comparison")

fig.show()

Cat Boost reaches a remarkable 99.61% accuracy.

Another article full of takeaways ✌🏻!

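Title: Kaggle in Practice: Hotel Booking Modeling

Published: July 25, 2022, 22:07

Original link: http://www.renpeter.cn/2022/07/25/kaggle%E5%AE%9E%E6%88%98-%E9%85%92%E5%BA%97%E9%A2%84%E8%AE%A2%E5%BB%BA%E6%A8%A1.html

License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please keep the original link and author when reposting.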
