Kaggle in Practice: Hotel Booking Modeling

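This article analyzes and models a hotel booking dataset from Kaggle. It covers loading the data, handling missing values, exploratory analysis, preprocessing, feature engineering, modeling, and a comparison of the models.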

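Results

A notebook on Kaggle by an experienced user reaches 99.5% accuracy. I borrowed some of that code, added some processing ideas of my own, and ended up at 99.6%. Most of the gain comes from the feature engineering.

Original Kaggle notebook and data source: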

https://www.kaggle.com/code/niteshyadav3103/hotel-booking-prediction-99-5-acc/notebook

Importing libraries

These cover data handling, visualization, modeling, and scoring.

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 32)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import plotly.express as px
%matplotlib inline

# Missing-value visualization
import missingno as msno
# Map visualization
import folium
from folium.plugins import HeatMap

# Modeling
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score, precision_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import VotingClassifier

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

Loading the data
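The cell that reads the file is not shown in this write-up. Below is a minimal sketch of that step, assuming the Kaggle CSV has been saved locally as hotel_bookings.csv (the filename is an assumption). A tiny in-memory sample stands in for the real file so the snippet runs on its own:

```python
import io
import pandas as pd

# Tiny stand-in for the real file; the column names come from the dataset,
# the rows are illustrative only.
sample_csv = io.StringIO(
    "hotel,is_canceled,lead_time,adr\n"
    "Resort Hotel,0,342,0.0\n"
    "City Hotel,1,88,76.5\n"
    "City Hotel,0,65,68.0\n"
)

# With the real data this would be something like:
#   df = pd.read_csv("hotel_bookings.csv")   # assumed filename
sample_df = pd.read_csv(sample_csv)
print(sample_df.shape)
```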

A first look at the data:

In [3]:

# 1. Column names
df.columns

Out[3]:

Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
       'company', 'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date'],
      dtype='object')

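Meaning of each column:

hotel: hotel type (Resort Hotel or City Hotel)
is_canceled: whether the booking was canceled (0 = no, 1 = yes)
lead_time: days between booking and arrival
arrival_date_year: year of arrival
arrival_date_month: month of arrival
arrival_date_week_number: week number of arrival
arrival_date_day_of_month: day of the month of arrival
stays_in_weekend_nights: number of weekend nights
stays_in_week_nights: number of weekday nights
adults: number of adults
children: number of children
babies: number of babies
meal: meal plan
country: country of origin
market_segment: market segment
distribution_channel: distribution channel
is_repeated_guest: whether the guest is a repeat guest
previous_cancellations: number of previous cancellations
previous_bookings_not_canceled: number of previous bookings that were not canceled
reserved_room_type: room type reserved
assigned_room_type: room type actually assigned
booking_changes: number of changes made to the booking
deposit_type: deposit type
agent: travel agency ID
company: company ID
days_in_waiting_list: days spent on the waiting list
customer_type: customer type
adr: average daily rate
required_car_parking_spaces: number of parking spaces required
total_of_special_requests: number of special requests (for example a high floor or twin beds)
reservation_status: reservation status
reservation_status_date: date on which the reservation status was set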

In [4]:

# 2. Total number of columns
len(df.columns)

Out[4]:

32

In [5]:

# 3. Column types
df.dtypes

Out[5]:

hotel                              object
is_canceled                         int64
lead_time                           int64
arrival_date_year                   int64
arrival_date_month                 object
arrival_date_week_number            int64
arrival_date_day_of_month           int64
stays_in_weekend_nights             int64
stays_in_week_nights                int64
adults                              int64
children                          float64
babies                              int64
meal                               object
country                            object
market_segment                     object
distribution_channel               object
is_repeated_guest                   int64
previous_cancellations              int64
previous_bookings_not_canceled      int64
reserved_room_type                 object
assigned_room_type                 object
booking_changes                     int64
deposit_type                       object
agent                             float64
company                           float64
days_in_waiting_list                int64
customer_type                      object
adr                               float64
required_car_parking_spaces         int64
total_of_special_requests           int64
reservation_status                 object
reservation_status_date            object
dtype: object

In [6]:

# 4. Number of columns of each type
df.dtypes.value_counts()

Out[6]:

int64      16
object     12
float64     4
dtype: int64

In [7]:

# 5. Shape of the data
df.shape

Out[7]:

(119390, 32)

In [8]:

# 6. Descriptive statistics
df.describe()

The info method gives a complete overview of the data, including column types, the row index, and non-null counts.

# 7. Full summary
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype
---  ------                          --------------   -----
 0   hotel                           119390 non-null  object
 1   is_canceled                     119390 non-null  int64
 2   lead_time                       119390 non-null  int64
 3   arrival_date_year               119390 non-null  int64
 4   arrival_date_month              119390 non-null  object
 5   arrival_date_week_number        119390 non-null  int64
 6   arrival_date_day_of_month       119390 non-null  int64
 7   stays_in_weekend_nights         119390 non-null  int64
 8   stays_in_week_nights            119390 non-null  int64
 9   adults                          119390 non-null  int64
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64
 12  meal                            119390 non-null  object
 13  country                         118902 non-null  object
 14  market_segment                  119390 non-null  object
 15  distribution_channel            119390 non-null  object
 16  is_repeated_guest               119390 non-null  int64
 17  previous_cancellations          119390 non-null  int64
 18  previous_bookings_not_canceled  119390 non-null  int64
 19  reserved_room_type              119390 non-null  object
 20  assigned_room_type              119390 non-null  object
 21  booking_changes                 119390 non-null  int64
 22  deposit_type                    119390 non-null  object
 23  agent                           103050 non-null  float64
 24  company                         6797 non-null    float64
 25  days_in_waiting_list            119390 non-null  int64
 26  customer_type                   119390 non-null  object
 27  adr                             119390 non-null  float64
 28  required_car_parking_spaces     119390 non-null  int64
 29  total_of_special_requests       119390 non-null  int64
 30  reservation_status              119390 non-null  object
 31  reservation_status_date         119390 non-null  object
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB

Missing values

Missing values per column

Count and percentage of missing values in each column:

In [10]:

null_df = pd.DataFrame({"Null Values": df.isnull().sum(),
                         "Percentage Null Values": (df.isnull().sum()) / (df.shape[0]) * 100
                         })

null_df

Visualizing missing values

Plot the missing-value information:

In [11]:

msno.bar(df, color="blue")

plt.show()

Handling missing values

1. The children and country columns each have less than 1% missing values, so the affected rows are simply dropped:

# Keep only the rows where country and children are present
df = df.dropna(subset=["country", "children"])

df.head()

2. The company column is 94.3% missing, so the whole column is dropped:

In [14]:

df.drop("company", axis=1, inplace=True)

3. The agent column (the ID of the travel agency that made the booking) is 13.68% missing. It is handled as follows:

In [15]:

# 1. Inspect the values first
df["agent"].value_counts()

Out[15]:

9.0      31959
240.0    13871
1.0       7191
14.0      3638
7.0       3539
         ...
70.0         1
93.0         1
54.0         1
497.0        1
59.0         1
Name: agent, Length: 332, dtype: int64

Candidate fill values include:

0: marks the missing value as unknown
9: the mode
mean: the mean of the existing values

Here 0 is used:

In [16]:

df["agent"] = df["agent"].fillna(0)
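To make the three candidate fills concrete, here is a small sketch on a toy Series (the values are illustrative, not taken from the dataset). Because agent is an identifier rather than a quantity, the mean produces a value that is not a real ID, which is one argument for using 0 as an explicit "unknown" marker:

```python
import numpy as np
import pandas as pd

# Toy agent IDs with missing entries (illustrative values only).
agent = pd.Series([9.0, 9.0, 240.0, np.nan, 1.0, np.nan])

filled_zero = agent.fillna(0)                # 0 as an explicit "unknown" marker
filled_mode = agent.fillna(agent.mode()[0])  # most frequent ID
filled_mean = agent.fillna(agent.mean())     # mean of the existing IDs

print(filled_zero.tolist())
print(filled_mode.tolist())
print(filled_mean.tolist())
```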

Special cases

Case 1: the number of guests cannot be 0

The adults, children, and babies counts of a booking cannot all be 0:

In [18]:

special = (df["children"] == 0) & (df["adults"] == 0) & (df["babies"] == 0)
special.head()

Out[18]:

0    False
1    False
2    False
3    False
4    False
dtype: bool

# Exclude these rows
df = df[~special]

Case 2: adr (average daily rate)

The value cannot be negative
The maximum is 5400, which is clearly an outlier

In [21]:

df["adr"].value_counts().sort_index()

Out[21]:

-6.38          1
 0.00       1799
 0.26          1
 0.50          1
 1.00         14
            ...
 450.00        1
 451.50        1
 508.00        1
 510.00        1
 5400.00       1
Name: adr, Length: 8857, dtype: int64

In [22]:

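A violin plot shows the distribution. Before cleaning there is an obvious outlier:

px.violin(y=df["adr"])  # before cleaning

A box plot shows the same thing:

px.box(df, y="adr")

The outlier belongs to City Hotel.

Removing the invalid rows:

In [25]:

# Alternative: drop the rows with adr above 1000
# df = df.drop(df[df.adr > 1000].index)

df = df[(df["adr"] >= 0) & (df["adr"] < 5400)]  # exclude invalid values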
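The cutoff above is picked by hand. As a hedged alternative (this is not what the notebook does), Tukey's IQR rule flags values that lie more than 1.5 interquartile ranges outside the quartiles. The sketch below uses made-up prices plus the two extreme values seen above:

```python
import pandas as pd

# Made-up daily rates, plus the two extreme values from the output above.
adr = pd.Series([-6.38, 0.0, 62.0, 75.0, 94.5, 110.0, 126.0, 5400.0])

q1, q3 = adr.quantile(0.25), adr.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged as outliers.
outliers = adr[(adr < lower) | (adr > upper)]
print(outliers.tolist())
```

On this toy sample the rule flags 5400 but not the small negative value, so a domain check such as adr >= 0 is still needed alongside it.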

In [26]:

px.violin(y=df["adr"])  # after cleaning

px.box(df, y="adr", color="hotel")  # after cleaning

Exploratory Data Analysis (EDA)

Canceled vs. not canceled bookings

In [28]:

df["is_canceled"].value_counts()

Out[28]:

0    74589
1    44137
Name: is_canceled, dtype: int64

In [29]:

# Canceled vs. not canceled: 0 = not canceled, 1 = canceled
sns.countplot(x=df["is_canceled"])

plt.show()

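Where do the guests who did not cancel come from?

In [30]:

data = df[df.is_canceled == 0].copy()  # bookings that were not canceled; copy so columns can be added later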

In [31]:

number_no_canceled = data["country"].value_counts().reset_index()
number_no_canceled.columns = ["country", "number_of_no_canceled"]

number_no_canceled

# Map visualization

basemap = folium.Map()
guests_map = px.choropleth(number_no_canceled,  # data
                           locations = number_no_canceled['country'],  # country codes
                           color = number_no_canceled['number_of_no_canceled'],  # value that drives the color
                           hover_name = number_no_canceled['country'])  # hover label
guests_map.show()

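Conclusion 1: guests come mainly from Portugal, and most of the others come from European countries.

What is the average daily rate of each room type?

In [33]:

px.box(data,  # data
       x="reserved_room_type",  # x axis
       y="adr",  # y axis
       color="hotel",  # color grouping
       template="plotly_dark",  # theme
       category_orders={"reserved_room_type":["A","B","C","D","E","F","G","H","L"]}  # display order
      )

Conclusion 2: the average price of a room depends mainly on its type, and the spread of prices (standard deviation) also differs between room types.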

How does the nightly price change over the year?

Average price across the year for the two hotel types

In [34]:

data_resort = data[data["hotel"] == "Resort Hotel"]
data_city = data[data["hotel"] == "City Hotel"]

In [35]:

resort_hotel = data_resort.groupby(['arrival_date_month'])['adr'].mean().reset_index()
city_hotel = data_city.groupby(['arrival_date_month'])['adr'].mean().reset_index()
city_hotel

# Merge the two results

total_hotel = pd.merge(resort_hotel, city_hotel,
                        on="arrival_date_month"
                        )
total_hotel.columns = ["month","price_resort","price_city"]
total_hotel

To sort the months in calendar order, install two packages.

In a Jupyter notebook, prefix the command with !:

In [37]:

!pip install sort-dataframeby-monthorweek
!pip install sorted-months-weekdays

In [39]:

import sort_dataframeby_monthorweek as sd

# Helper that sorts a DataFrame by month name
def sort_month(df, column):
    result = sd.Sort_Dataframeby_Month(df,column)
    return result
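The same calendar ordering can also be obtained with pandas alone, without extra packages. The following is a sketch of that alternative rather than the approach used in this article; it turns the month column into an ordered categorical, and the prices are made-up values:

```python
import pandas as pd

# Calendar order, written out explicitly to avoid any locale dependence.
month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Toy frame; the prices are made-up values.
toy = pd.DataFrame({
    "month": ["August", "April", "December", "January"],
    "price_resort": [181.2, 75.9, 68.4, 48.8],
})

# An ordered categorical sorts by position in month_order, not alphabetically.
toy["month"] = pd.Categorical(toy["month"], categories=month_order, ordered=True)
toy_sorted = toy.sort_values("month").reset_index(drop=True)
print(toy_sorted)
```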

In [40]:

new_total_hotel = sort_month(total_hotel, "month")
new_total_hotel

The months are now in calendar order:

fig = px.line(new_total_hotel,
        x = "month",
        y = ["price_resort", "price_city"],
        title = "Price of per night over the Months",
        template = "plotly_dark"
       )

fig.show()

Conclusions:

Resort Hotel is clearly more expensive than City Hotel in summer
City Hotel prices vary less, but they are already high in April and stay high through September

KDE plot

KDE (kernel density estimation) can be thought of as a smoothed, windowed version of a histogram. A KDE plot shows how the data are distributed in each case.
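To illustrate the "smoothed histogram" idea, here is a minimal NumPy sketch of a one-dimensional Gaussian KDE. Each sample contributes one Gaussian bump, and the estimate is the average of the bumps. The function name, the prices, and the fixed bandwidth are all assumptions made for illustration:

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

prices = [50.0, 55.0, 60.0, 90.0, 120.0, 125.0]  # made-up monthly prices
grid = np.linspace(-100.0, 300.0, 2001)
density = gaussian_kde_1d(prices, grid, bandwidth=10.0)

# A density should integrate to roughly 1 over a wide enough grid.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual samples more closely.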

In [42]:

plt.figure(figsize=(6,3), dpi=150)

ax = sns.kdeplot(new_total_hotel["price_resort"],
                 color="green",
                 fill=True)  # fill replaces the deprecated shade argument

ax = sns.kdeplot(new_total_hotel["price_city"],
                 color="blue",
                 fill=True)

ax.set_xlabel("Average price per night")  # the x axis is price, the y axis is density
ax.set_ylabel("Density")
ax.legend(["Resort","City"])

The busiest months

In [43]:

resort_guests = data_resort['arrival_date_month'].value_counts().reset_index()
resort_guests.columns=['Month','No_Resort_Guests']

city_guests = data_city['arrival_date_month'].value_counts().reset_index()
city_guests.columns=['Month','No_City_Guests']

# Merge the two DataFrames on the month column
final_guests = pd.merge(resort_guests, city_guests, on="Month")

Sort the months in the same way:

In [45]:

new_final_guests = sort_month(final_guests, "Month")
new_final_guests

fig = px.line(new_final_guests,
        x = "Month",
        y = ["No_Resort_Guests", "No_City_Guests"],
        title = "Number of guests per month",
        template = "plotly_dark"
       )

fig.show()

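Conclusions:

City Hotel clearly has more guests than Resort Hotel and is the more popular of the two
City Hotel reaches its peak guest count in July and August even though prices are high then (see the price chart above)
Both hotels have few guests in winter

How long do guests stay?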

In [47]:

data["total_nights"] = data['stays_in_weekend_nights'] + data['stays_in_week_nights']
data.head()

Number of stays by length of stay for the two hotels:

In [48]:

stay_groupby = (data.groupby(['total_nights', 'hotel'])["is_canceled"]
                .agg("count")
                .reset_index()
                .rename(columns={"is_canceled":"Number of stays"}))

stay_groupby.head()

Out[48]:

	total_nights	hotel	Number of stays
0	0	City Hotel	251
1	0	Resort Hotel	366
2	1	City Hotel	9155
3	1	Resort Hotel	6368
4	2	City Hotel	10983

In [49]:

fig = px.bar(stay_groupby,
       x = "total_nights",
       y = "Number of stays",
       color = "hotel",
       barmode = "group"
      )

fig.show()

Data preprocessing

Dropping unused columns

In [52]:

no_use_col = ['arrival_date_year', 'assigned_room_type',
             'booking_changes','reservation_status',
             'country', 'days_in_waiting_list']

In [53]:

df.drop(no_use_col, axis=1, inplace=True)

Feature engineering

Categorical variables

In [54]:

df["hotel"].dtype  # dtype of a single Series

Out[54]:

dtype('O')

In [55]:

cat_cols = [col for col in df.columns if df[col].dtype == "O"]
cat_cols

Out[55]:

['hotel',
 'arrival_date_month',
 'meal',
 'market_segment',
 'distribution_channel',
 'reserved_room_type',
 'deposit_type',
 'customer_type',
 'reservation_status_date']

In [56]:

cat_df = df[cat_cols].copy()  # copy so the columns added below do not trigger SettingWithCopyWarning

In [57]:

cat_df.dtypes

Out[57]:

hotel                      object
arrival_date_month         object
meal                       object
market_segment             object
distribution_channel       object
reserved_room_type         object
deposit_type               object
customer_type              object
reservation_status_date    object
dtype: object

In [58]:

# 1. Convert to datetime
cat_df['reservation_status_date'] = pd.to_datetime(cat_df['reservation_status_date'])

In [59]:

# 2. Extract year, month, and day
cat_df["year"] = cat_df['reservation_status_date'].dt.year
cat_df['month'] = cat_df['reservation_status_date'].dt.month
cat_df['day'] = cat_df['reservation_status_date'].dt.day

In [60]:

# 3. Drop the columns that are no longer needed
cat_df.drop(['reservation_status_date','arrival_date_month'], axis=1, inplace=True)

In [61]:

# 4. Unique values of each column
for col in cat_df.columns:
    print(f"{col}: \n{cat_df[col].unique()}\n")

hotel:
['Resort Hotel' 'City Hotel']

meal:
['BB' 'FB' 'HB' 'SC' 'Undefined']

market_segment:
['Direct' 'Corporate' 'Online TA' 'Offline TA/TO' 'Complementary' 'Groups'
 'Aviation']

distribution_channel:
['Direct' 'Corporate' 'TA/TO' 'Undefined' 'GDS']

reserved_room_type:
['C' 'A' 'D' 'E' 'G' 'F' 'H' 'L' 'B']

deposit_type:
['No Deposit' 'Refundable' 'Non Refund']

customer_type:
['Transient' 'Contract' 'Transient-Party' 'Group']

year:
[2015 2014 2016 2017]

month:
[ 7  5  4  6  3  8  9  1 11 10 12  2]

day:
[ 1  2  3  6 22 23  5  7  8 11 16 29 19 18  9 13  4 12 26 17 15 10 20 14
 30 28 25 21 27 24 31]

Feature encoding

In [62]:

# Hotel
cat_df['hotel'] = cat_df['hotel'].map({'Resort Hotel' : 0,
                                       'City Hotel' : 1})
# Meal plan
cat_df['meal'] = cat_df['meal'].map({'BB' : 0,
                                     'FB': 1,
                                     'HB': 2,
                                     'SC': 3,
                                     'Undefined': 4})
# Market segment
cat_df['market_segment'] = (cat_df['market_segment']
                            .map({'Direct': 0,
                                 'Corporate':1,
                                 'Online TA':2,
                                 'Offline TA/TO': 3,
                                 'Complementary': 4,
                                 'Groups': 5,
                                 'Undefined': 6,
                                 'Aviation': 7}))
# Distribution channel
cat_df['distribution_channel'] = (cat_df['distribution_channel']
                                  .map({'Direct': 0,
                                        'Corporate': 1,
                                        'TA/TO': 2,
                                        'Undefined': 3,
                                        'GDS': 4}))
# Reserved room type
cat_df['reserved_room_type'] = (cat_df['reserved_room_type']
                                .map({'C': 0,
                                      'A': 1,
                                      'D': 2,
                                      'E': 3,
                                      'G': 4,
                                      'F': 5,
                                      'H': 6,
                                      'L': 7,
                                      'B': 8}))
# Deposit type
cat_df['deposit_type'] = (cat_df['deposit_type']
                          .map({'No Deposit': 0,
                                'Refundable': 1,
                                'Non Refund': 3}))
# Customer type
cat_df['customer_type'] = (cat_df['customer_type']
                           .map({'Transient': 0,
                                 'Contract': 1,
                                 'Transient-Party': 2,
                                 'Group': 3})
                          )
# Year
cat_df['year'] = cat_df['year'].map({2015: 0, 2014: 1, 2016: 2, 2017: 3})
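One caveat with hand-written mappings is that any label missing from the dictionary silently becomes NaN. The sketch below, on a toy Series, shows a quick coverage check and, as an alternative that is not used in this article, pd.factorize, which assigns integer codes automatically in order of first appearance:

```python
import pandas as pd

deposit = pd.Series(["No Deposit", "Non Refund", "Refundable", "No Deposit"])

# Explicit mapping, as in the cell above: unmapped labels would become NaN.
mapping = {"No Deposit": 0, "Refundable": 1, "Non Refund": 3}
encoded_map = deposit.map(mapping)
assert not encoded_map.isna().any(), "some category is missing from the mapping"

# Alternative: let pandas assign codes in order of first appearance.
codes, uniques = pd.factorize(deposit)
print(encoded_map.tolist(), codes.tolist(), list(uniques))
```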

Numerical variables

In [63]:

num_df = df.drop(columns=cat_cols)

num_df.drop("is_canceled", axis=1, inplace=True)

# Log-transform the columns with large variance
log_col = ["lead_time","arrival_date_week_number","arrival_date_day_of_month","agent","adr"]

for col in log_col:
    num_df[col] = np.log(num_df[col] + 1)

num_df.head()
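As a small illustration of why the log transform helps, the sketch below applies it to a few made-up lead_time values. np.log1p(x) computes log(1 + x), so it matches the expression used above and maps 0 to 0:

```python
import numpy as np

lead_time = np.array([0, 1, 7, 30, 365, 709], dtype=float)  # made-up values

logged = np.log1p(lead_time)  # same as np.log(lead_time + 1)

# The variance shrinks sharply once the long right tail is compressed.
print(lead_time.var(), logged.var())
```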

Modeling

Combining the two DataFrames

In [68]:

X = pd.concat([cat_df, num_df], axis=1)
y = df["is_canceled"]

In [69]:

print(X.shape)
print(y.shape)

(118726, 25)
(118726,)

Train/test split

In [70]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,random_state=412)
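The classes are not balanced (roughly 63% not canceled and 37% canceled, per the counts earlier), so one optional variation, which this article does not use, is a stratified split that keeps the same class ratio in both parts. A sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))       # synthetic features
y_toy = np.array([0] * 630 + [1] * 370)  # roughly the 63/37 balance seen earlier

# stratify=y_toy keeps the class proportions equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=412, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```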

Model 1: Logistic regression

In [71]:

# Instantiate and fit the model
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Predictions
y_pred_lr = lr.predict(X_test)

# Evaluation metrics for classification
acc_lr = accuracy_score(y_test, y_pred_lr)
conf = confusion_matrix(y_test, y_pred_lr)
clf_report = classification_report(y_test, y_pred_lr)

print(f"Accuracy Score of Logistic Regression is : {acc_lr}")
print(f"Confusion Matrix : \n{conf}")
print(f"Classification Report : \n{clf_report}")

Accuracy Score of Logistic Regression is : 0.8126842415564727
Confusion Matrix :
[[14133   836]
 [ 3612  5165]]
Classification Report :
              precision    recall  f1-score   support

           0       0.80      0.94      0.86     14969
           1       0.86      0.59      0.70      8777

    accuracy                           0.81     23746
   macro avg       0.83      0.77      0.78     23746
weighted avg       0.82      0.81      0.80     23746
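The numbers in the report can be reproduced by hand from the confusion matrix above. In scikit-learn's layout the rows are true labels and the columns are predictions, so a short worked check looks like this:

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
tn, fp = 14133, 836
fn, tp = 3612, 5165

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted cancellations, how many were real
recall = tp / (tp + fn)     # of the real cancellations, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
```

The recall for class 1 is the weak spot: the model misses a large share of the real cancellations.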

In [72]:

# Visualize the confusion matrix

classes = ["0","1"]

disp = ConfusionMatrixDisplay(confusion_matrix=conf, display_labels=classes)
disp.plot(
    include_values=True,            # show the count in each cell
    cmap="GnBu",                 # any matplotlib colormap
    ax=None,
    xticks_rotation="horizontal",
    values_format="d"
)

plt.show()

Model 2: KNN

In [73]:

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

y_pred= knn.predict(X_test)

acc_knn = accuracy_score(y_test, y_pred)
conf = confusion_matrix(y_test, y_pred)
clf_report = classification_report(y_test, y_pred)

Model 3: Decision tree

In [74]:

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

y_pred_dtc = dtc.predict(X_test)

acc_dtc = accuracy_score(y_test, y_pred_dtc)
conf = confusion_matrix(y_test, y_pred_dtc)
clf_report = classification_report(y_test, y_pred_dtc)

Model 4: Random forest

In [75]:

rd_clf = RandomForestClassifier()
rd_clf.fit(X_train, y_train)

y_pred_rd_clf = rd_clf.predict(X_test)

acc_rd_clf = accuracy_score(y_test, y_pred_rd_clf)
conf = confusion_matrix(y_test, y_pred_rd_clf)
clf_report = classification_report(y_test, y_pred_rd_clf)

Model 5: AdaBoost

In [76]:

ada = AdaBoostClassifier(estimator=dtc)  # scikit-learn >= 1.2; older versions use base_estimator
ada.fit(X_train, y_train)

y_pred_ada = ada.predict(X_test)

acc_ada = accuracy_score(y_test, y_pred_ada)
conf = confusion_matrix(y_test, y_pred_ada)
clf_report = classification_report(y_test, y_pred_ada)

Model 6: Gradient Boosting Classifier

In [77]:

gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

y_pred_gb = gb.predict(X_test)

acc_gb = accuracy_score(y_test, y_pred_gb)
conf = confusion_matrix(y_test, y_pred_gb)
clf_report = classification_report(y_test, y_pred_gb)

Model 7: XGBoost Classifier

In [78]:

xgb = XGBClassifier(booster='gbtree',
                    learning_rate=0.1,
                    max_depth=5,
                    n_estimators=180)

xgb.fit(X_train, y_train)

y_pred_xgb = xgb.predict(X_test)

acc_xgb = accuracy_score(y_test, y_pred_xgb)
conf = confusion_matrix(y_test, y_pred_xgb)
clf_report = classification_report(y_test, y_pred_xgb)

Model 8: CatBoost classifier

In [79]:

cat = CatBoostClassifier(iterations=100)
cat.fit(X_train, y_train)

y_pred_cat = cat.predict(X_test)

acc_cat = accuracy_score(y_test, y_pred_cat)
conf = confusion_matrix(y_test, y_pred_cat)
clf_report = classification_report(y_test, y_pred_cat)

Model 9: Extra Trees Classifier

In [80]:

etc = ExtraTreesClassifier()
etc.fit(X_train, y_train)

y_pred_etc = etc.predict(X_test)

acc_etc = accuracy_score(y_test, y_pred_etc)
conf = confusion_matrix(y_test, y_pred_etc)
clf_report = classification_report(y_test, y_pred_etc)

Model 10: LGBM

In [81]:

lgbm = LGBMClassifier(learning_rate = 1)
lgbm.fit(X_train, y_train)

y_pred_lgbm = lgbm.predict(X_test)

acc_lgbm = accuracy_score(y_test, y_pred_lgbm)
conf = confusion_matrix(y_test, y_pred_lgbm)
clf_report = classification_report(y_test, y_pred_lgbm)

Model 11: Voting Classifier

This is the key model: several classifiers vote on each prediction.

In [82]:

classifiers = [('Gradient Boosting Classifier', gb),
               ('Cat Boost Classifier', cat),
               ('XGboost', xgb),
               ('Decision Tree', dtc),
               ('Extra Tree', etc),
               ('Light Gradient', lgbm),
               ('Random Forest', rd_clf),
               ('Ada Boost', ada),
               ('Logistic', lr),
               ('Knn', knn)]

vc = VotingClassifier(estimators = classifiers)
vc.fit(X_train, y_train)

y_pred_vc = vc.predict(X_test)

acc_vtc = accuracy_score(y_test, y_pred_vc)
conf = confusion_matrix(y_test, y_pred_vc)
clf_report = classification_report(y_test, y_pred_vc)
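VotingClassifier uses hard (majority) voting by default. The sketch below shows the idea in plain NumPy on made-up predictions; hard_vote is a hypothetical helper written for illustration and is not part of scikit-learn. With an even number of models, as with the ten used above, ties can occur, and here they resolve to the lower class label:

```python
import numpy as np

# Rows: models, columns: samples; each entry is a predicted class (0 or 1).
preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])

def hard_vote(predictions, n_classes=2):
    """Return the most frequent class per sample (ties go to the lower label)."""
    # counts has shape (n_classes, n_samples): votes per class for each sample.
    counts = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    return counts.argmax(axis=0)

print(hard_vote(preds))
```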

Deep learning model with Keras

Preprocessing and splitting the data

In [84]:

from tensorflow.keras.utils import to_categorical

X = pd.concat([cat_df, num_df], axis = 1)
# One-hot encode the target
y = to_categorical(df['is_canceled'])
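For a binary target, the one-hot encoding turns each label into a two-element row. A NumPy-only sketch of the same idea on toy labels (an illustration of the encoding, not of the Keras internals):

```python
import numpy as np

labels = np.array([0, 1, 1, 0])  # toy is_canceled values

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(2, dtype="float32")[labels]
print(one_hot)

# Going back from one-hot rows to class labels:
recovered = one_hot.argmax(axis=1)
```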

In [85]:

# Split the data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [86]:

import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

Building the network

In [87]:

X.shape[1]

Out[87]:

25

In [88]:

model = Sequential()

model.add(Dense(100, activation="relu",input_shape=(X.shape[1], )))
model.add(Dense(100, activation="relu"))
model.add(Dense(2, activation="sigmoid"))

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model_history = model.fit(X_train,
                          y_train,
                          validation_data = (X_test, y_test),
                          epochs = 50)

Visualizing metrics: loss

In [89]:

train_loss = model_history.history["loss"]
val_loss = model_history.history["val_loss"]

epoch = range(1,51)

loss = pd.DataFrame({"train_loss": train_loss,
                     "val_loss":val_loss
                    })
loss.head()

Out[89]:

	train_loss	val_loss
0	0.320637	0.189505
1	0.152539	0.394518
2	0.112900	0.094703
3	0.091933	0.105522
4	0.078177	0.096059

In [90]:

fig = px.line(loss,
        x=epoch,
        y=['val_loss','train_loss'],
        title='Train and Val Loss')

fig.show()

Visualizing metrics: accuracy

In [91]:

train_acc = model_history.history["accuracy"]
val_acc = model_history.history["val_accuracy"]

epoch = range(1,51)

acc = pd.DataFrame({"train_acc": train_acc,
                     "val_acc":val_acc
                    })

px.line(acc,
        x=epoch,
        y=['val_acc','train_acc'],
        title = 'Train and Val Accuracy',
        template = 'plotly_dark')

# Final accuracy on the test set

acc_ann = model.evaluate(X_test, y_test)[1]
acc_ann

# Output
743/743 [==============================] - 2s 2ms/step - loss: 0.0504 - accuracy: 0.9867
0.986692488193512

Model comparison

Comparing the results of the different models

In [93]:

models = pd.DataFrame({
    'Model' : ['Logistic Regression', 'KNN',
               'Decision Tree Classifier',
               'Random Forest Classifier',
               'Ada Boost Classifier',
               'Gradient Boosting Classifier',
               'XgBoost', 'Cat Boost',
               'Extra Trees Classifier',
               'LGBM', 'Voting Classifier','ANN'],
    'Score' : [acc_lr, acc_knn, acc_dtc,
               acc_rd_clf, acc_ada, acc_gb,
               acc_xgb, acc_cat, acc_etc,
               acc_lgbm, acc_vtc, acc_ann]
})


models = models.sort_values(by = 'Score', ascending = True, ignore_index=True)

models["Score"] = models["Score"].apply(lambda x: round(x,4))
models

Bar chart comparing the model scores:

fig = px.bar(models,
       x="Score",
       y="Model",
       text="Score",
       color="Score",
       template="plotly_dark",
       title="Models Comparison"
      )

fig.show()

CatBoost reaches a striking 99.61% accuracy.

Another article packed with takeaways ✌🏻!