The Titanic competition on Kaggle: data processing, feature engineering, and modeling

- fill missing values with the median
- feature engineering
- the modeling process

Import the libraries
```python
import numpy as np
import pandas as pd  # needed for read_csv below

train = pd.read_csv("/Users/peter/data-visualization/train.csv")
test = pd.read_csv("/Users/peter/data-visualization/test.csv")  # test set, loaded alongside (path assumed)
```
Inspect the data and its missing values
```python
train.head(3)

train.info()  # Age has only 714 non-null values, i.e. many are missing
# <class 'pandas.core.frame.DataFrame'>
# ... (output truncated)

test.head()

print(train.shape)
# (891, 12)

test.info()  # Age is missing here as well
# <class 'pandas.core.frame.DataFrame'>
# ... (output truncated)

train.isnull().sum()  # total number of missing values per column
# PassengerId    0
# ... (output truncated)
```
Bar Chart for Categorical Features
- Pclass
- Sex
- SibSp (number of siblings and spouses aboard)
- Parch (number of parents and children aboard)
- Embarked
- Cabin
```python
def bar_chart(feature):
    ...  # body truncated in the original

bar_chart("Sex")
bar_chart("Pclass")
bar_chart("SibSp")
bar_chart("Parch")
```
```python
# First select all survivors, then take their Pclass (1, 2, 3) values,
# and count how many survivors fall into each class
train[train['Survived']==1]['Pclass']
# 1    1
# ... (output truncated)

train[train['Survived']==1]['Pclass'].value_counts()  # (counting step implied by the output below)
# 1    136
# ... (output truncated)
```
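The `bar_chart` helper above is truncated in the original. Based on the comment about counting survivors versus non-survivors per category, a minimal sketch could look like this (the frame here is a toy stand-in, not the real Titanic data):

```python
import pandas as pd

# Toy stand-in for the Titanic training frame (values are made up).
train = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0],
    "Sex":      ["female", "female", "male", "male", "male", "female"],
})

def bar_chart(feature):
    # Count survivors and non-survivors for each category of `feature`.
    survived = train[train["Survived"] == 1][feature].value_counts()
    dead = train[train["Survived"] == 0][feature].value_counts()
    df = pd.DataFrame([survived, dead], index=["Survived", "Dead"])
    # In a notebook you would then plot it stacked:
    # df.plot(kind="bar", stacked=True, figsize=(10, 5))
    return df

chart = bar_chart("Sex")
```

The stacked bar chart makes it easy to eyeball which categories correlate with survival.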
Feature engineering
Feature engineering is the process of using domain knowledge of the data
to create features (feature vectors) that make machine learning algorithms work.
Here, feature engineering mainly means converting the string columns in the raw data into numeric types.
Name
```python
train_test_data = [train, test]  # handle train and test together (a list, not a merge)
```
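The `Title` column counted below has to be extracted from `Name` first. That step is not shown in the original, but it is typically done with a regex along these lines (the names and frames here are toy examples):

```python
import pandas as pd

# Hypothetical mini train/test frames with Titanic-style names.
train = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]})
test = pd.DataFrame({"Name": ["Kelly, Mr. James"]})
train_test_data = [train, test]

for dataset in train_test_data:
    # Grab the word that ends in '.', e.g. 'Mr', 'Miss', 'Master'.
    dataset["Title"] = dataset["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
```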
```python
train["Title"].value_counts()  # count each title in the training set
# Mr    517
# ... (output truncated)

test["Title"].value_counts()   # count each title in the test set
# Mr    240
# ... (output truncated)
```
Title map
- Mr : 0
- Miss : 1
- Mrs: 2
- Others: 3
```python
title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3, "Dr": 3}
# (dict truncated in the original; per the list above, the remaining rare titles also map to 3)

dataset['Title']  # the title extracted from each name
# 0    Mr
# ... (output truncated)

for dataset in train_test_data:
    dataset['Title'] = dataset['Title'].map(title_mapping)  # loop body reconstructed: encode the titles

dataset.head(3)  # a new Title column has been appended
```
```python
bar_chart("Title")

# drop a column that is no longer needed
train.head()
```
Sex
- male: 0
- female: 1

```python
sex_mapping = {"male": 0, "female": 1}
bar_chart('Sex')
```
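Applying the mapping is not shown above; it presumably uses `Series.map`, as sketched here on toy data:

```python
import pandas as pd

train = pd.DataFrame({"Sex": ["male", "female", "female"]})  # toy data

sex_mapping = {"male": 0, "female": 1}
train["Sex"] = train["Sex"].map(sex_mapping)  # strings become ordinal codes
```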
Age
The Age column has many missing values; fill them with the median via the `fillna` function.
```python
# fill a column's missing values with the median, using fillna
train.groupby("Title")["Age"].transform("median")  # the per-Title median ages used as fill values
# 0    30.0
# ... (output truncated)
```
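The actual fill step is not shown in the original; combining `fillna` with the per-Title medians above presumably looks like this (toy data):

```python
import pandas as pd
import numpy as np

# Hypothetical frame: Age has gaps, Title is already encoded.
train = pd.DataFrame({
    "Title": [0, 0, 1, 1],
    "Age":   [30.0, np.nan, 20.0, np.nan],
})

# Fill each missing Age with the median Age of passengers sharing the same Title.
train["Age"] = train["Age"].fillna(train.groupby("Title")["Age"].transform("median"))
```

Using group medians rather than one global median keeps the imputed ages consistent with each title group (e.g. "Master" implies a child).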
```python
import seaborn as sns  # needed for FacetGrid

facet = sns.FacetGrid(train, hue="Survived", aspect=4)
```

[figure: survival density over Age; x-axis limited to (0, 20)]
```python
train.info()  # Age has now been filled
# <class 'pandas.core.frame.DataFrame'>
# ... (output truncated)
```
How to turn a numeric column into a categorical variable

```python
# turn Age into a categorical variable with binned ordinal codes
train.head()
bar_chart("Age")
```
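The binning code itself is truncated in the original. One common scheme uses chained `loc` assignments; the cut points below are assumptions, not taken from the original:

```python
import pandas as pd

train = pd.DataFrame({"Age": [5, 22, 30, 40, 70]})  # toy ages

# Bucket ages into ordinal codes (cut points 16/26/36/62 are assumptions).
train.loc[train["Age"] <= 16, "Age"] = 0
train.loc[(train["Age"] > 16) & (train["Age"] <= 26), "Age"] = 1
train.loc[(train["Age"] > 26) & (train["Age"] <= 36), "Age"] = 2
train.loc[(train["Age"] > 36) & (train["Age"] <= 62), "Age"] = 3
train.loc[train["Age"] > 62, "Age"] = 4
```

The order of the assignments matters only in that each row matches exactly one bracket; once a row is rewritten to a small code, no later bracket can match it again.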
Embarked
Plot charts broken down by the different values of an attribute.
```python
train[train['Pclass']==1]['Embarked']  # the Embarked value of every Pclass==1 passenger
# 1    C
# ... (output truncated)

# the same per-class selection, then counted by value and collected into df
df

# fill the missing Embarked values with 'S' (the most common port)
```
How can the strings in a column be converted to numbers?
```python
embarked_mapping = {"S": 0, "C": 1, "Q": 2}
```
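Putting the fill and the mapping together, a sketch with toy data:

```python
import pandas as pd
import numpy as np

train = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q"]})  # toy data
test = pd.DataFrame({"Embarked": ["S", np.nan]})

embarked_mapping = {"S": 0, "C": 1, "Q": 2}
for dataset in [train, test]:
    # most passengers embarked at S, so use it for the missing values,
    # then map the port letters to ordinal codes
    dataset["Embarked"] = dataset["Embarked"].fillna("S").map(embarked_mapping)
```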
Fare
Fill missing values with the median.

```python
# fill missing Fare with the median fare of each Pclass
```
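The fill itself can be sketched with `groupby(...).transform`, exactly as for Age (toy data):

```python
import pandas as pd
import numpy as np

test = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Fare":   [80.0, np.nan, 8.0, np.nan],
})

# Fill each missing Fare with the median fare of the same Pclass.
test["Fare"] = test["Fare"].fillna(test.groupby("Pclass")["Fare"].transform("median"))
```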
Plot the distribution:

```python
facet = sns.FacetGrid(train, hue="Survived", aspect=4)
```
Bin the Fare column:

```python
for dataset in train_test_data:
    ...  # binning body truncated in the original
```
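The truncated binning loop presumably follows the same `loc` pattern as Age; the cut points below are assumptions:

```python
import pandas as pd

train = pd.DataFrame({"Fare": [7.25, 20.0, 50.0, 120.0]})  # toy fares

# Bucket fares into ordinal codes (cut points 17/30/100 are assumptions).
train.loc[train["Fare"] <= 17, "Fare"] = 0
train.loc[(train["Fare"] > 17) & (train["Fare"] <= 30), "Fare"] = 1
train.loc[(train["Fare"] > 30) & (train["Fare"] <= 100), "Fare"] = 2
train.loc[train["Fare"] > 100, "Fare"] = 3
```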
Cabin
```python
train.Cabin.value_counts()
# B96 B98    4
# ... (output truncated)

for dataset in train_test_data:
    dataset['Cabin'] = dataset['Cabin'].str[:1]  # loop body reconstructed: keep only the deck letter

train[train['Pclass']==1]['Cabin'].value_counts()  # how often each letter appears within a Pclass
# C    59
# ... (output truncated)

cabin_mapping = {"A": 0, "B": 0.4, "C": 0.8, "D": 1.2,
                 "E": 1.6, "F": 2.0, "G": 2.4, "T": 2.8}
# (dict truncated after "B" in the original; the 0.4 step is assumed, mirroring FamilySize below)

# fill missing Cabin with the median cabin value of each Pclass
# (the original comment said "Fare" here by mistake)
```
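The whole Cabin pipeline, pieced together on toy data (the deck letters beyond "B" in the mapping are assumed from the 0.4 step):

```python
import pandas as pd
import numpy as np

train = pd.DataFrame({
    "Pclass": [1, 1, 3],
    "Cabin":  ["C85", np.nan, "G6"],
})

train["Cabin"] = train["Cabin"].str[:1]  # keep only the deck letter
cabin_mapping = {"A": 0, "B": 0.4, "C": 0.8, "D": 1.2,
                 "E": 1.6, "F": 2.0, "G": 2.4, "T": 2.8}  # 0.4-step scale (assumed beyond "B")
train["Cabin"] = train["Cabin"].map(cabin_mapping)
# fill the gaps with the median cabin value of the same Pclass
train["Cabin"] = train["Cabin"].fillna(train.groupby("Pclass")["Cabin"].transform("median"))
```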
FamilySize
Add a FamilySize column.
```python
train["FamilySize"] = train["SibSp"] + train["Parch"] + 1

facet = sns.FacetGrid(train, hue="Survived", aspect=4)
```

[figure: survival density over FamilySize; x-axis limited to (0, 11.0)]

```python
train.head()  # a FamilySize column has been appended

family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2, 7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}
```
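Applying `family_mapping` (toy data):

```python
import pandas as pd

train = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})  # toy data

# passenger + siblings/spouses + parents/children
train["FamilySize"] = train["SibSp"] + train["Parch"] + 1
family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2,
                  7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}
train["FamilySize"] = train["FamilySize"].map(family_mapping)
```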
Drop some columns
```python
# use drop to remove the columns that are no longer needed
train_data = train.drop('Survived', axis=1)  # features only, after the earlier column drops
target = train['Survived']                   # the label (reconstructed; implied by the shapes below)
train_data.shape, target.shape
# ((891, 8), (891,))

train_data.head(10)
```
Modeling

Import the classifiers
```python
# Importing Classifier Modules
train.info()  # final check of the fully engineered frame
# <class 'pandas.core.frame.DataFrame'>
# ... (output truncated)
```
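The classifier imports referred to above are presumably the standard scikit-learn ones for the five models scored below:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
```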
Cross-validation
```python
from sklearn.model_selection import KFold

# KNN
# (the scoring call itself is truncated in the original)
# [0.82222222 0.76404494 0.80898876 0.83146067 0.87640449 0.82022472 ...]
round(np.mean(score)*100, 2)  # KNN score
```
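The truncated scoring call presumably uses `cross_val_score` with the `KFold` above. A runnable sketch on synthetic data (the 13-neighbor setting and 10-fold split are assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))    # stand-in for train_data
y = (X[:, 0] > 0).astype(int)    # stand-in for target

k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
clf = KNeighborsClassifier(n_neighbors=13)
score = cross_val_score(clf, X, y, cv=k_fold, scoring="accuracy")
print(round(np.mean(score) * 100, 2))  # mean accuracy across the 10 folds, as a percentage
```

The same `k_fold` and `cross_val_score` call is reused for each model below; only `clf` changes.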
Decision Tree

```python
# [0.76666667 0.82022472 0.76404494 0.7752809  0.88764045 0.76404494
#  0.83146067 0.82022472 0.74157303 0.79775281]
round(np.mean(score)*100, 2)  # Decision Tree score
```
Random Forest

```python
# [0.8        0.82022472 0.79775281 0.76404494 0.86516854 0.82022472
#  0.80898876 0.80898876 0.75280899 0.80898876]
round(np.mean(score)*100, 2)  # Random Forest score
```
Naive Bayes

```python
# [0.85555556 0.73033708 0.75280899 0.75280899 0.70786517 0.80898876
#  0.76404494 0.80898876 0.86516854 0.83146067]
round(np.mean(score)*100, 2)  # Naive Bayes score
```
SVM

```python
# [0.83333333 0.80898876 0.83146067 0.82022472 0.84269663 0.82022472
#  0.84269663 0.85393258 0.83146067 0.86516854]
round(np.mean(score)*100, 2)  # SVM score
```
Testing

From the scores above, the support vector machine performs best, so it is used for the final predictions.
```python
clf = SVC()

submission = pd.DataFrame({
    ...  # (contents truncated in the original)
})

submission = pd.read_csv('submission.csv')  # re-read the saved file as a sanity check
```
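Pieced together, the final step presumably looks like this. The fit/predict lines are reconstructed and the tiny frames below stand in for the engineered features; only the PassengerId/Survived column pair is certain, since Kaggle's Titanic submission format requires it:

```python
import pandas as pd
from sklearn.svm import SVC

# Tiny synthetic stand-ins for the engineered train/test feature frames.
train_data = pd.DataFrame({"Sex": [0, 1, 0, 1], "Pclass": [3, 1, 2, 1]})
target = pd.Series([0, 1, 0, 1])
test = pd.DataFrame({"PassengerId": [892, 893], "Sex": [0, 1], "Pclass": [3, 1]})

clf = SVC()
clf.fit(train_data, target)
prediction = clf.predict(test[["Sex", "Pclass"]])

# Kaggle expects exactly these two columns.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": prediction,
})
submission.to_csv("submission.csv", index=False)
```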