The Secrets Behind Python Jobs

This article walks through a data analysis case study built on job posting data for Python positions.

Importing libraries

import pandas as pd
import numpy as np
import re

import time
import datetime as dt

import jieba

import matplotlib.pyplot as plt
from pyecharts.globals import CurrentConfig, OnlineHostType
from pyecharts import options as opts  # configuration options
from pyecharts.charts import Bar, Pie, Line, HeatMap, Funnel, WordCloud, Grid, Page  # chart classes
from pyecharts.commons.utils import JsCode
from pyecharts.globals import ThemeType,SymbolType

import plotly.express as px
import plotly.graph_objects as go

import seaborn as sns
plt.rcParams["font.sans-serif"]=["SimHei"]  # font that can render Chinese labels
plt.rcParams["axes.unicode_minus"]=False  # keeps the minus sign "-" from rendering as a garbled character

%matplotlib inline

Data exploration

Loading the data
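The loading cell itself is not shown in this write-up. A minimal sketch of what it might look like, assuming the scraped postings were saved as CSV with the nine columns that appear in the outputs further down. The two sample rows are made up so that the snippet runs on its own; with a real file, its path would be passed to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Made-up sample rows standing in for the real scraped file (an assumption).
sample_csv = io.StringIO(
    "标题,地区,薪资,经验,学历,公司,公司类型,福利,详情链接\n"
    "Python开发,深圳·南山区·科技园,10-15K,1-3年,本科,公司A,互联网,五险一金,https://example.com/1\n"
    "Python实习生,北京,150-250元/天,在校/应届,本科,公司B,互联网,,https://example.com/2\n"
)

# With a real file this would be pd.read_csv("<path to the scraped file>").
df = pd.read_csv(sample_csv)

print(df.shape)           # (rows, columns)
print(df.isnull().sum())  # missing values per column
```

The empty 福利 field in the second sample row is read as a missing value, which is the kind of gap the checks below report.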

Inspecting the data

In [3]:

df.isnull().sum()  # missing values per column

Out[3]:

Only the 福利 (benefits) column has missing values:

标题       0
地区       0
薪资       0
经验       0
学历       0
公司       0
公司类型     0
福利      31
详情链接     0
dtype: int64

In [4]:

df.shape  # shape of the data

Out[4]:

(300, 9)

In [5]:

df.dtypes  # column types

Out[5]:

标题      object
地区      object
薪资      object
经验      object
学历      object
公司      object
公司类型    object
福利      object
详情链接    object
dtype: object

Every column in the data is stored as a string (object dtype).

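Region analysis

Some regions are given as city, district, and locality, while others give only a city, so the two cases need to be handled separately.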

In [6]:

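# .copy() gives independent frames, so the column assignments below do not raise SettingWithCopyWarning
df1 = df[df["地区"].str.contains("·")].copy()   # regions that contain "·"
df2 = df[~df["地区"].str.contains("·")].copy()  # regions that do not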

Handling regions that contain ·

In [7]:

df1.head()

df1["地区"] = df1["地区"].str.split("·")
df1.head()  # after splitting, each value is a list

Decide which parts are present from the length of each list:

df1["城市"] = df1["地区"].apply(lambda x: x[0])
df1["区"] = df1["地区"].apply(lambda x: x[1] if len(x) >= 2 else np.nan)
df1["地点"] = df1["地区"].apply(lambda x: x[2] if len(x) == 3 else np.nan)
df1.head()
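An alternative to the list-length checks is to let pandas do the padding. The sketch below uses made-up region strings and `str.split(..., expand=True)`; it is a different route to the same three columns rather than the code used in this post, and it handles the dotted and the city-only rows in one pass.

```python
import pandas as pd

# Made-up region strings covering the three shapes described above.
regions = pd.Series(["深圳·南山区·科技园", "北京·海淀区", "上海"], name="地区")

# expand=True returns one column per part; rows with fewer parts are padded with missing values.
parts = regions.str.split("·", expand=True)
parts.columns = ["城市", "区", "地点"]

print(parts)
```

Note that `expand=True` produces as many columns as the longest row has parts, so assigning exactly three column names assumes at least one row carries all three levels.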

Handling regions without ·

In [10]:

df2["城市"] = df2["地区"]   # only the city is available
df2["区"] = np.nan
df2["地点"] = np.nan

df2.drop(["地区"],axis=1,inplace=True)

Merging the data into df3

The data to be analyzed

In [12]:

df3 = pd.concat([df1,df2], axis=0)
df3.head()

Geographic analysis

By city

In [13]:

df4 = df3["城市"].value_counts().reset_index()
df4.columns = ["城市","数量"]
df4.head(10)

fig = px.bar(df4, x="城市",y="数量",text="数量")
fig.update_traces(textposition="outside")

fig.update_layout(xaxis_tickangle=45)   # tilt the x-axis labels
fig.show()

By district within a city (Shenzhen and Beijing as examples)

Below is the Beijing example:
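The code for this step does not appear in the write-up. A minimal sketch of one way to count postings per district for a single city, assuming the 城市 and 区 columns built earlier; `df3_demo` and its rows are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for df3; only the two columns used here.
df3_demo = pd.DataFrame({
    "城市": ["北京", "北京", "北京", "深圳", "北京"],
    "区": ["海淀区", "朝阳区", "海淀区", "南山区", None],
})

beijing = df3_demo[df3_demo["城市"] == "北京"]

# value_counts skips missing districts by default.
district_counts = (
    beijing["区"].value_counts().rename_axis("区").reset_index(name="数量")
)

print(district_counts)
```

Swapping "北京" for "深圳" in the filter gives the Shenzhen counts; the resulting frame can be passed to `px.bar` in the same way as the city-level chart above.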

Education

Education requirements across postings
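The frame `df7` used by the pie chart is not constructed anywhere in the text. Judging from the `names="学历"` and `values="数量"` arguments, it is presumably a count per education level; the sketch below shows that assumption on made-up values.

```python
import pandas as pd

# Made-up education values standing in for df3["学历"].
education = pd.Series(["本科", "大专", "本科", "学历不限", "本科"], name="学历")

# One row per education level with its number of postings.
df7_demo = education.value_counts().rename_axis("学历").reset_index(name="数量")

# Share of each level, which is what the pie chart displays as a percentage.
df7_demo["占比"] = df7_demo["数量"] / df7_demo["数量"].sum()

print(df7_demo)
```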

fig = px.pie(df7,
             names="学历",
             values="数量"
            )

fig.update_traces(
    textposition='inside',
    textinfo='percent+label'
)

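fig.update_layout(
    title={
        "text":"学历占比",
        "y":0.96,  # vertical position of the title (0 to 1)
        "x":0.5,  # horizontal position of the title (0 to 1)
        "xanchor":"center",  # which part of the title is pinned to x
        "yanchor":"top"
    }
)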

fig.show()

Salary distribution

In [21]:

df8 = df3["薪资"].value_counts()  # counts of each distinct salary string
df8

Out[21]:

10-15K        20
15-25K        18
15-30K·14薪    14
15-30K        11
8-13K          9
              ..
150-250元/天     1
14-26K·13薪     1
10-11K         1
5-8K·13薪       1
11-15K         1
Name: 薪资, Length: 144, dtype: int64

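Most salaries are monthly, but a few are quoted as daily wages. Every salary is given as a range with a minimum and a maximum, so we take the midpoint of each range; for example, 10-15K becomes 12.5K.

For daily wages, we take the midpoint first and then multiply by 30 to treat the result as a monthly salary.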
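The two rules can be condensed into one small function. This is a sketch, not the pandas-based code used in the cells that follow; the name `salary_to_monthly` is hypothetical, and like the original cells it ignores suffixes such as "·14薪".

```python
import re


def salary_to_monthly(text: str) -> float:
    """Estimate a monthly salary in yuan from a range string such as '10-15K'."""
    match = re.search(r"(\d+)-(\d+)", text)
    if match is None:
        raise ValueError(f"no salary range found in {text!r}")
    low, high = int(match.group(1)), int(match.group(2))
    midpoint = (low + high) / 2
    if "天" in text:
        # Daily wage in yuan: treat a month as 30 paid days, as described above.
        return midpoint * 30
    # Monthly salary quoted in K (thousands of yuan).
    return midpoint * 1000


for s in ["10-15K", "15-30K·14薪", "150-250元/天"]:
    print(s, salary_to_monthly(s))
```

For the three sample strings this gives 12500.0, 22500.0 and 6000.0 yuan per month respectively.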

In [22]:

# Postings paid by the day
df9 = df3[df3["薪资"].str.contains("天")]

# Postings paid by the month
df10 = df3[~df3["薪资"].str.contains("天")]

df9  # daily-wage postings

Out[22]:

# Extract the minimum and maximum of each range
df11 = df9["薪资"].str.extract(r"(?P<最低日薪>[\d]+)-(?P<最高日薪>[\d]+)")

df12 = df9.join(df11)
df12.drop("薪资", axis=1, inplace=True)

for i in ["最低日薪", "最高日薪"]:
    df12[i] = df12[i].astype(int)

df12["月薪"] = (df12["最高日薪"] + df12["最低日薪"]) / 2 * 30
df12.drop(["最高日薪","最低日薪"], axis=1, inplace=True)

Processing the postings that are already quoted as monthly salaries:

# Postings already quoted as monthly salaries (df10); the unit is K
df13 = df10["薪资"].str.extract(r"(?P<最低月薪>[\d]+)-(?P<最高月薪>[\d]+)")
df13.head()

df14 = df10.join(df13)
df14.drop("薪资", axis=1, inplace=True)
df14.head()

for i in ["最低月薪", "最高月薪"]:
    df14[i] = df14[i].astype(int)

df14["月薪"] = (df14["最高月薪"] + df14["最低月薪"]) / 2 * 1000  # convert from K to yuan
df14.drop(["最高月薪","最低月薪"], axis=1, inplace=True)

df14

Merging the daily-wage and monthly-salary data:
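The merge cell is missing from the text. A minimal sketch, assuming `df12` and `df14` as built above are stacked into the `df15` used later, and that the `df16` passed to the histogram is a count per monthly salary; both the demo frames and that reading of `df16` are assumptions.

```python
import pandas as pd

# Made-up stand-ins for df12 (daily wages converted to monthly) and df14 (monthly).
df12_demo = pd.DataFrame({"标题": ["Python实习生"], "城市": ["北京"], "月薪": [6000.0]})
df14_demo = pd.DataFrame(
    {"标题": ["Python开发", "数据分析"], "城市": ["深圳", "北京"], "月薪": [12500.0, 22500.0]}
)

# Stack the two frames and rebuild a clean 0..n-1 index.
df15_demo = pd.concat([df12_demo, df14_demo], axis=0, ignore_index=True)

# Number of postings at each monthly salary value.
df16_demo = (
    df15_demo["月薪"].value_counts().rename_axis("月薪").reset_index(name="数量")
)

print(df15_demo)
print(df16_demo)
```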

fig = px.histogram(df16,
                   x="月薪",
                   y="数量",
                   nbins=20
                  )

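fig.update_layout(
    title={
        "text":"薪酬区间分布",
        "y":0.96,  # vertical position of the title (0 to 1)
        "x":0.5,  # horizontal position of the title (0 to 1)
        "xanchor":"center",  # which part of the title is pinned to x
        "yanchor":"top"
    }
)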

# fig.update_layout(bargap=0.1)
fig.show()

Headcount (by city)

In [33]:

df17 = df3["城市"].value_counts().reset_index()
df17.columns = ["城市", "人数"]
df17.head()

fig = px.bar(df17, x="城市",y="人数",text="人数",color="人数")
fig.update_traces(textposition="outside")

fig.update_layout(xaxis_tickangle=45)   # tilt the x-axis labels
fig.show()

Salary distribution by city: box plot
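No code accompanies this heading. As a rough illustration of what a box plot summarizes, the sketch below computes the quartiles of monthly salary per city with pandas; `df15_demo` and its values are made up, and the column names follow the frames built earlier.

```python
import pandas as pd

# Made-up stand-in for the merged salary data.
df15_demo = pd.DataFrame({
    "城市": ["北京", "北京", "北京", "深圳", "深圳", "深圳"],
    "月薪": [10000.0, 20000.0, 30000.0, 8000.0, 12000.0, 16000.0],
})

# Lower quartile, median and upper quartile of monthly salary for each city.
box_stats = (
    df15_demo.groupby("城市")["月薪"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
box_stats.columns = ["Q1", "median", "Q3"]

print(box_stats)
```

To draw the chart itself, a call along the lines of `px.box(df15, x="城市", y="月薪")` should work, assuming `df15` holds the merged data with those two columns.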

Job titles

Bar chart

In [37]:

df18 = df15["标题"].value_counts().reset_index()
df18.columns = ["岗位","数量"]
df18

# Show the top 20

fig = px.bar(df18[:20], x="岗位",y="数量",text="数量",color="数量")
# fig.update_traces(textposition="side")
fig.update_layout(xaxis_tickangle=45)   # tilt the x-axis labels
fig.show()

Word cloud

In [39]:

title_list = df18["岗位"].tolist()

# Tokenize each title with jieba (precise mode) and collect all tokens in one list
title_jieba_list = []
for title in title_list:
    seg_list = jieba.cut(str(title).strip(), cut_all=False)
    title_jieba_list.extend(seg_list)

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/8d/zckh4twx1tgd4f9fvl0t_7y00000gn/T/jieba.cache
Loading model cost 1.371 seconds.
Prefix dict has been built successfully.

In [40]:

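# Build the stop-word list
def StopWords(filepath):
    # The with-block closes the file once the lines have been read
    with open(filepath, 'r', encoding='utf-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords


# Path to the stop-word file; change it to match your own machine
stopwords = StopWords("/Users/peter/Desktop/spider/nlp_stopwords.txt")

useful_result = []

# Keep only the tokens that are not stop words
for col in title_jieba_list:
    if col not in stopwords:
        useful_result.append(col)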
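The filtering step can be tried without jieba or the stop-word file. In this sketch the tokens and stop words are made up, and the stop words are held in a set, a small variation on the list used above that makes each membership test a constant-time lookup.

```python
# Made-up tokens and stop words; the real ones come from jieba and the stop-word file.
tokens = ["Python", "开发", "工程师", "的", "高级", "/", "Python", "工程师"]
stopwords = {"的", "/", " "}

# Keep tokens in their original order, dropping anything listed as a stop word.
useful = [tok for tok in tokens if tok not in stopwords]

print(useful)
```

Duplicates are deliberately kept, since the next step counts how often each word occurs.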

In [41]:

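# Count how often each word occurs (pd.value_counts is deprecated in recent pandas)
information = pd.Series(useful_result).value_counts().reset_index()

information.columns=["word","number"]

# Drop entries that contain digits
information_new = information[~information["word"].str.contains(r"\d")]
information_new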
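The same counting and digit filtering can be expressed with the standard library alone, which is handy for checking the logic on a small list. This is a sketch on made-up words, not a replacement for the pandas version above.

```python
import re
from collections import Counter

# Made-up tokens; "3" and "2年" contain digits and should be dropped.
words = ["Python", "工程师", "Python", "3", "开发", "2年", "工程师", "Python"]

counts = Counter(words)

# (word, count) pairs, most frequent first, skipping any word that contains a digit.
no_digits = [(w, n) for w, n in counts.most_common() if not re.search(r"\d", w)]

print(no_digits)
```

The result is a list of `(word, count)` tuples, the same shape the word cloud expects as input.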

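Drawing the word cloud:

# Pair each word with its count: [(word, number), ...]
information_zip = list(zip(information_new["word"].tolist(), information_new["number"].tolist()))

# Draw the chart from the 100 most frequent words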
c = (
    WordCloud()
    .add("", information_zip[:100], word_size_range=[20, 80], shape=SymbolType.DIAMOND)
    .set_global_opts(title_opts=opts.TitleOpts(title="Python岗位-词云图"))
)

c.render_notebook()