
Implementing a Decision Tree with sklearn

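This post walks through building a decision tree classifier with sklearn and the modeling process behind it, covering:

  • Preparing the data and splitting it with train_test_split
  • Modeling with two different criteria, the Gini index and information entropy, both trained on X_train and y_train
    • Instantiate the classifier
    • Fit it with fit
  • Prediction: use each of the two fitted classifiers to predict, e.g. y_pred = clf_gini.predict(X_test)
  • Evaluating the results
    • Confusion matrix
    • Accuracy
    • Classification report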
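To make the two splitting criteria concrete, here is a minimal sketch that computes Gini impurity and entropy for a list of class labels using only the standard library. The function names and the sample label lists are assumptions made for illustration; sklearn computes these quantities internally when `criterion="gini"` or `criterion="entropy"` is set.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)) over the class proportions."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(labels).values())


# Hypothetical label samples (not taken from the dataset)
pure = ["L", "L", "L", "L"]    # a single class: no impurity
mixed = ["L", "L", "R", "R"]   # two classes, evenly split

print(gini_impurity(pure), entropy(pure))    # 0.0 0.0
print(gini_impurity(mixed), entropy(mixed))  # 0.5 1.0
```

Both measures are zero for a pure node and grow as the classes mix, which is why a tree chooses the split that reduces them the most.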

Implementation wrapped in functions

import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix  # confusion matrix
from sklearn.model_selection import train_test_split  # train/test splitting
from sklearn.tree import DecisionTreeClassifier  # decision tree classifier
from sklearn.metrics import accuracy_score  # accuracy metric
from sklearn.metrics import classification_report  # per-class classification report

# Load the data
def load_data():
    # Read the dataset; the file has no header row, so header=None
    balance_data = pd.read_csv(
        'https://archive.ics.uci.edu/ml/machine-learning-'
        'databases/balance-scale/balance-scale.data',
        sep=',', header=None)
    print("Dataset Length", len(balance_data))

    print(balance_data.head())
    return balance_data

# Split the dataset into training and test sets
def split_dataset(balance_data):

    X = balance_data.values[:, 1:5]  # feature columns
    y = balance_data.values[:, 0]  # class label column

    # Hold out 30% of the rows as the test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=100)

    return X, y, X_train, X_test, y_train, y_test

# Train using the Gini index
def train_using_gini(X_train, y_train):

    # Instantiate the classifier first, then fit it
    clf_gini = DecisionTreeClassifier(criterion="gini",
                                      random_state=100,
                                      max_depth=3,
                                      min_samples_leaf=5)
    clf_gini.fit(X_train, y_train)  # fit on the training data
    return clf_gini

# Train using information entropy
def train_using_entropy(X_train, y_train):

    # Instantiate the classifier, then fit it
    clf_entropy = DecisionTreeClassifier(criterion="entropy",
                                         random_state=100,
                                         max_depth=3,
                                         min_samples_leaf=5)
    clf_entropy.fit(X_train, y_train)
    return clf_entropy

# Make predictions
def prediction(X_test, clf_object):

    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred

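# Evaluate the predictions: confusion matrix, accuracy, classification report
def cal_accuracy(y_test, y_pred):

    print("Confusion Matrix:", confusion_matrix(y_test, y_pred))

    # accuracy_score returns a fraction; multiply by 100 to show a percentage
    print("Accuracy:", accuracy_score(y_test, y_pred)*100)

    print("Report:", classification_report(y_test, y_pred))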

def main():
    data = load_data()
    X, y, X_train, X_test, y_train, y_test = split_dataset(data)
    clf_gini = train_using_gini(X_train, y_train)
    clf_entropy = train_using_entropy(X_train, y_train)

    print("result using gini Index:")
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)

    print("result using Entropy:")
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)

if __name__ == "__main__":
    main()
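To show what the evaluation step reports, here is a minimal hand-rolled sketch of a confusion matrix, accuracy, and per-class precision and recall in plain Python. It is not sklearn's implementation: the helper names and the label lists are hypothetical, chosen only to mirror the convention sklearn's `confusion_matrix` uses (rows are true classes, columns are predicted classes, labels in sorted order).

```python
def confusion_counts(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true classes, columns predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall(matrix, k):
    """Precision and recall for class k; 0.0 when the denominator is zero."""
    tp = matrix[k][k]
    predicted = sum(row[k] for row in matrix)  # column sum: predicted as k
    actual = sum(matrix[k])                    # row sum: truly k
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Hypothetical labels, not real model output
y_true = ["L", "L", "R", "R", "B", "R"]
y_pred = ["L", "R", "R", "R", "L", "R"]
labels = ["B", "L", "R"]

matrix = confusion_counts(y_true, y_pred, labels)
print(matrix)                       # [[0, 1, 0], [0, 1, 1], [0, 0, 3]]
print(accuracy(y_true, y_pred))     # 4 correct out of 6
print(precision_recall(matrix, 2))  # class "R": (0.75, 1.0)
```

The diagonal of the matrix holds the correct predictions, so accuracy is the diagonal sum divided by the total count; the classification report adds the per-class precision and recall derived from the columns and rows.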

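Post title: Implementing a Decision Tree with sklearn

Published: 2020-01-15, 16:01

Original link: http://www.renpeter.cn/2020/01/15/%E5%9F%BA%E4%BA%8Esklearn%E7%9A%84%E5%86%B3%E7%AD%96%E6%A0%91%E5%AE%9E%E7%8E%B0.html

License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please keep the original link and author when reposting.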
