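Data Exploration with Pandas

Data Exploration

After obtaining a dataset, the first step is a preliminary exploration: its size, field types, row and column attributes, whether any values are missing, and so on. This gives an initial overall picture of the data.

Sample Data

For the exploration below, two datasets are prepared: a DataFrame and a Series.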

In [1]:

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

Load a local CSV file into a DataFrame:

In [2]:

df = pd.read_csv("Titanic.csv")
df.head()

Out[2]:

In addition, create a sample Series:

In [3]:

s = pd.Series(["math", "chinese", "math", "chinese", "english", "gym", "math", "chinese", "english"])
s

Out[3]:

0       math
1    chinese
2       math
3    chinese
4    english
5        gym
6       math
7    chinese
8    english
dtype: object

Data Shape (shape)

The shape attribute shows the size of the dataset: the number of rows (records) and the number of fields (attributes).

In [4]:

df.shape

Out[4]:

(891, 12)

The result is a tuple. Its first element is the number of rows, 891 records here, which is also the length of the DataFrame:

In [5]:

len(df)

Out[5]:

891  # the first element of the tuple

The second element is the number of fields; here there are 12:

In [6]:

df.columns

Out[6]:

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [7]:

len(df.columns)

Out[7]:

12  # the second element of the tuple

For a Series, shape contains a single element because a Series is one-dimensional:

In [8]:

s.shape

Out[8]:

(9,)

In [9]:

len(s)

Out[9]:

9
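The relationship between shape and len can be checked directly. A minimal sketch, assuming a small made-up frame named demo and a short Series named subjects (neither is the Titanic data):

```python
import pandas as pd

# Hypothetical stand-in data, not the Titanic file
demo = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = demo.shape
print(n_rows, n_cols)

# The same numbers are available through len()
print(n_rows == len(demo), n_cols == len(demo.columns))

# A Series has a one-element shape tuple
subjects = pd.Series(["math", "chinese", "gym"])
(n_items,) = subjects.shape
print(n_items)
```

Unpacking the tuple is often handier than indexing it when both counts are needed.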

Data Size (size)

The size of a DataFrame is the number of rows multiplied by the number of columns:

In [10]:

df.size  # 891 * 12

Out[10]:

10692

For a Series, size equals its length:

In [11]:

s.size  # 9 * 1

Out[11]:

9
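Note that size counts every cell, including missing ones. A small sketch, assuming a made-up frame named demo, contrasts it with the number of non-null cells:

```python
import pandas as pd

# Hypothetical data with two missing cells
demo = pd.DataFrame({"a": [1, 2, None], "b": ["x", None, "z"]})

rows, cols = demo.shape
total_cells = demo.size                    # every cell, missing or not
non_null_cells = int(demo.count().sum())   # count() skips missing values

print(total_cells, rows * cols, non_null_cells)
```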

Data Dimensions (ndim)

ndim gives the number of dimensions. In Pandas, a Series is one-dimensional and a DataFrame is two-dimensional:

In [12]:

df.ndim

Out[12]:

2

In [13]:

s.ndim

Out[13]:

1
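The dimension also depends on how a column is selected. A brief sketch, assuming a made-up frame named demo: single brackets return a Series, double brackets return a one-column DataFrame.

```python
import pandas as pd

# Hypothetical data for illustration
demo = pd.DataFrame({"Pclass": [3, 1, 2], "Fare": [7.25, 71.28, 13.0]})

one_col_series = demo["Fare"]     # single brackets -> Series
one_col_frame = demo[["Fare"]]    # double brackets -> DataFrame

print(one_col_series.ndim, one_col_frame.ndim)
print(demo.ndim == len(demo.shape))  # ndim matches the length of shape
```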

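Row and Column Indexes (index/columns)

The following shows different ways to retrieve the row and column indexes of a DataFrame.

1. Return the row index (index) and the column index (columns) separately: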

In [14]:

df.index  # row index

Out[14]:

RangeIndex(start=0, stop=891, step=1)

In [15]:

df.columns  # column index (one entry per field)

Out[15]:

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

2. Return both at once as a list; the first item is the row index and the second is the column index:

In [16]:

df.axes

Out[16]:

[RangeIndex(start=0, stop=891, step=1),
 Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
        'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
       dtype='object')]

For a Series, only the row index is returned:

In [17]:

s.axes

Out[17]:

[RangeIndex(start=0, stop=9, step=1)]
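Because axes is an ordinary list, it can be unpacked into its parts. A minimal sketch, assuming a made-up frame named demo with a custom row index:

```python
import pandas as pd

# Hypothetical data with string row labels
demo = pd.DataFrame(
    {"Pclass": [3, 1], "Fare": [7.25, 71.28]},
    index=["p1", "p2"],
)

row_index, col_index = demo.axes   # [row index, column index]
print(row_index.equals(demo.index))
print(list(col_index))

# A Series has a single axis
series_axes = pd.Series([1, 2]).axes
print(len(series_axes))
```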

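Viewing Samples (head/tail/sample)

Given a dataset, we often do not want to look at all of it. Three methods show a subset:

head(N): the first N rows; defaults to 5
tail(N): the last N rows; defaults to 5
sample(N): N randomly chosen rows; defaults to 1

1. View the first rows: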

In [18]:

df.head()  # 5 rows by default

2. View the last rows:

3. View random rows; the result may differ on each call:

View several random rows:
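The tail and sample calls can be sketched as follows, assuming a made-up ten-row frame named demo. The random_state and frac arguments are standard parameters of sample; fixing random_state makes the draw repeatable.

```python
import pandas as pd

# Hypothetical ten-row frame for illustration
demo = pd.DataFrame({"id": range(1, 11), "score": range(10, 110, 10)})

last_three = demo.tail(3)      # last 3 rows
one_random = demo.sample()     # one random row (the default)

# A fixed random_state returns the same rows on every call
pick_a = demo.sample(4, random_state=0)
pick_b = demo.sample(4, random_state=0)

# frac draws a fraction of the rows instead of a fixed number
fraction = demo.sample(frac=0.2, random_state=0)

print(last_three)
print(pick_a.equals(pick_b), len(fraction))
```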

For a Series:

In [26]:

s.head()  # first 5 rows by default

Out[26]:

0       math
1    chinese
2       math
3    chinese
4    english
dtype: object

In [27]:

s.tail()  # last 5 rows by default

Out[27]:

4    english
5        gym
6       math
7    chinese
8    english
dtype: object

In [28]:

s.sample()  # one random row

Out[28]:

2    math
dtype: object

In [29]:

s.sample(3)  # 3 random rows

Out[29]:

2       math
7    chinese
6       math
dtype: object

Keys and Values (keys/values)

For a DataFrame, keys() returns the column names (fields):

In [30]:

df.keys()

Out[30]:

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

The values attribute returns the underlying data as a NumPy array, with one inner array per row:

In [31]:

df.values

Out[31]:

array([[1, 0, 3, ..., 7.25, nan, 'S'],  # one entry per row
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)
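The pandas documentation recommends to_numpy() over values for new code. A short sketch, assuming a made-up frame named demo, also shows why the array above has dtype=object: mixing numbers and strings forces a common object dtype, while numeric columns alone keep a numeric dtype.

```python
import pandas as pd

# Hypothetical mixed-type frame
demo = pd.DataFrame({
    "Pclass": [3, 1],
    "Fare": [7.25, 71.2833],
    "Embarked": ["S", "C"],
})

print(demo.keys().equals(demo.columns))  # keys() is the column index

arr = demo.to_numpy()                    # numbers + strings -> object array
numeric_arr = demo[["Pclass", "Fare"]].to_numpy()  # int + float -> float64

print(arr.shape, arr.dtype)
print(numeric_arr.dtype)
```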

Field Types (dtypes)

The dtypes attribute returns a Series holding the data type of each field:

In [32]:

df.dtypes

Out[32]:

PassengerId      int64  # data type of each field
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object  # dtype of the Series returned by df.dtypes

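This dataset contains three data types:

int64
float64
object

The first two are numeric. The third, object, is the general-purpose dtype for arbitrary Python objects; in this dataset it holds strings.

For a Series, use the dtype attribute: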

In [33]:

s.dtype

Out[33]:

dtype('O')  # 'O' stands for object; here the values are strings
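Once the dtypes are known, select_dtypes can split the fields into numeric and non-numeric groups. A minimal sketch, assuming a made-up frame named demo:

```python
import pandas as pd

# Hypothetical frame with integer, float, and string columns
demo = pd.DataFrame({
    "Survived": [0, 1],
    "Age": [22.0, 38.0],
    "Name": ["Braund", "Cumings"],
})

# "number" matches all numeric dtypes (int64, float64, ...)
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
other_cols = demo.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```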

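Data Summary (info)

The info method reports everything in one place: the row index range, the column names, the non-null count and dtype of each field, and the memory usage.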

In [34]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890  # row index
Data columns (total 12 columns):  # 12 columns/attributes
 #   Column       Non-Null Count  Dtype   # column name, non-null count, dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
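dtypes: float64(2), int64(5), object(5)  # number of fields per dtype
memory usage: 83.7+ KB  # memory usage

In pandas versions before 1.4, info is available only for DataFrame objects; Series.info() was added in pandas 1.4.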
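The "+" in the memory line marks a lower bound, because the contents of object columns are not measured by default. A sketch, assuming a made-up frame named demo, captures the info report as text through the buf parameter and compares shallow and deep memory usage:

```python
import io

import pandas as pd

# Hypothetical frame with a numeric and a string column
demo = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Cabin": ["C85", None, None],
})

# info() prints to stdout by default; buf redirects the report
buffer = io.StringIO()
demo.info(buf=buffer)
report = buffer.getvalue()
print(report)

# deep=True also measures the Python objects held in object columns
shallow_bytes = demo.memory_usage().sum()
deep_bytes = demo.memory_usage(deep=True).sum()
print(shallow_bytes, deep_bytes)
```

Calling df.info(memory_usage="deep") prints the deep figure directly in the report.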

Descriptive Statistics (describe)

describe summarizes the numeric fields: non-null count, mean, standard deviation, minimum, quartiles, and maximum.

In [36]:

df.describe()

Out[36]:
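By default describe covers only numeric fields. A brief sketch, assuming a made-up frame named demo, shows the numeric summary, the summary of a text column, and include="all" for both at once:

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column
demo = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

numeric_summary = demo.describe()        # count, mean, std, min, quartiles, max
text_summary = demo[["Sex"]].describe()  # count, unique, top, freq
full_summary = demo.describe(include="all")

print(numeric_summary)
print(text_summary)
```

The count row reflects non-null values only, so it also hints at missing data.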

Missing Values (isnull)

With a new dataset, we usually want to know where values are missing:

In [37]:

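df.isnull()  # True where a value is missing

When sum is applied, True counts as 1 and False as 0, so each column sum is the number of missing values in that field.

The result below shows that three fields (Age, Cabin, and Embarked) contain missing values: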

In [38]:

df.isnull().sum()

Out[38]:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64  # dtype of the resulting Series
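The same boolean trick gives the share of missing values when mean is used instead of sum. A minimal sketch, assuming a made-up frame named demo:

```python
import pandas as pd

# Hypothetical frame with missing values in two columns
demo = pd.DataFrame({
    "Age": [22.0, None, 35.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 8.05, 13.0],
})

missing_count = demo.isnull().sum()    # number of missing values per field
missing_ratio = demo.isnull().mean()   # fraction of missing values per field

# Keep only the fields that actually have missing values
cols_with_missing = missing_count[missing_count > 0].index.tolist()

print(missing_ratio)
print(cols_with_missing)
```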

Value Distribution of a Field (unique)

List the distinct values of a field:

In [39]:

df["Pclass"].unique()

Out[39]:

array([3, 1, 2])

Count how often each value occurs; for example, 3 appears 491 times. By default the result is sorted by count in descending order:

In [40]:

df["Pclass"].value_counts()

Out[40]:

3    491
1    216
2    184
Name: Pclass, dtype: int64

Pass normalize=True to get proportions instead of counts:

In [41]:

df["Pclass"].value_counts(normalize=True)

Out[41]:

3    0.551066
1    0.242424
2    0.206510
Name: Pclass, dtype: float64
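For fields with missing values, note that value_counts and nunique skip NaN by default. A sketch, assuming a made-up Series named embarked, shows how dropna=False brings the missing entries back into the tally:

```python
import pandas as pd

# Hypothetical Series with one missing value
embarked = pd.Series(["S", "C", "S", None, "Q", "S"])

n_distinct = embarked.nunique()        # distinct values, NaN excluded
counts = embarked.value_counts()       # NaN dropped by default
counts_with_nan = embarked.value_counts(dropna=False)

print(n_distinct)
print(counts)
print(counts_with_nan)
```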

Next, the counts of each distinct Age value are drawn as a bar chart:

In [42]:

age = df["Age"].value_counts()
age

Out[42]:

24.00    30
22.00    27
18.00    26
28.00    25
19.00    25
         ..
55.50     1
74.00     1
0.92      1
70.50     1
12.00     1
Name: Age, Length: 88, dtype: int64

In [43]:

plt.bar(age.index, age.values)
plt.show()
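With 88 distinct ages, the counts may be easier to read when ordered by age or grouped into ranges before plotting. A sketch of that preparation, assuming a made-up Series named ages and hypothetical bin edges and labels chosen only for illustration:

```python
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, 2.0, 54.0, 22.0, None])

# Order the counts by age instead of by frequency
by_age = ages.value_counts().sort_index()

# Group ages into ranges: (0, 18], (18, 40], (40, 80]
bins = [0, 18, 40, 80]
labels = ["child", "adult", "senior"]
by_group = pd.cut(ages, bins=bins, labels=labels).value_counts(sort=False)

print(by_age)
print(by_group)
```

Either result can be passed to plt.bar in the same way as above, using its index and values.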