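Fish Book Notes 4: Comparing Count-Based and Inference-Based Methods

Comparing the count-based and inference-based approaches

These are study notes on Chapter 4 of the book 《深度学习进阶：自然语言处理》 (Advanced Deep Learning: Natural Language Processing).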

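Count-based methods

A count-based method represents a word by how often other words appear around it.

Build the co-occurrence matrix of the words
Reduce its dimensionality with SVD to obtain dense vectors

Problem: on a large corpus the vocabulary grows, so the co-occurrence matrix becomes huge and the SVD becomes very expensive to compute.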
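The two steps listed above can be sketched on the toy corpus used later in these notes. This is only a minimal illustration: the window size of 1 and the choice to keep two dimensions are assumptions, not values given in the notes.

```python
import numpy as np

# Toy corpus as word IDs: "you say goodbye and i say hello ."
corpus = np.array([0, 1, 2, 3, 4, 1, 5, 6])
vocab_size = 7
window_size = 1  # assumed window size

# Step 1: count how often each pair of words appears within the window
co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)
for idx, word_id in enumerate(corpus):
    for offset in range(1, window_size + 1):
        if idx - offset >= 0:
            co_matrix[word_id, corpus[idx - offset]] += 1
        if idx + offset < len(corpus):
            co_matrix[word_id, corpus[idx + offset]] += 1

# Step 2: SVD, then keep the first two columns of U as dense word vectors
U, S, Vt = np.linalg.svd(co_matrix.astype(np.float64))
word_vecs = U[:, :2]  # assumed target dimensionality of 2
print(word_vecs.shape)  # (7, 2)
```

The co-occurrence matrix has one row and one column per vocabulary word, which is why both memory and SVD cost grow quickly with the vocabulary size.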

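Inference-based methods

These methods use a neural network, which usually learns on mini-batches of data.

Each step only needs part of the data, and training can run in parallel across multiple machines and GPUs to speed it up.

Rough procedure:

An inference-based method introduces some model (for example, a neural network)
The model takes the context as input and outputs the probability of each word appearing
By-product of the model: distributed representations of the words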

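How a neural network handles words

A neural network cannot process words directly; each word has to be converted into a fixed-length vector, here with one-hot encoding:

The position that corresponds to the word is set to 1
All other positions are set to 0

Implementation with np.dot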

import numpy as np
import time
import matplotlib.pyplot as plt
%matplotlib inline

c = np.array([[1,0,0,0,0,0,0]])
W = np.random.randn(7,3)
h = np.dot(c,W)

h

array([[ 0.12477247, -0.25928347, -0.21568563]])
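Because the input is a one-hot vector, the product above does nothing more than pick out one row of W. A small check, with illustrative variable names that are not from the notes:

```python
import numpy as np

W = np.random.randn(7, 3)
word_id = 0

c = np.zeros((1, 7))
c[0, word_id] = 1  # one-hot vector for word ID 0

h = np.dot(c, W)
# The product extracts exactly the row of W that belongs to the word
print(np.allclose(h[0], W[word_id]))  # True
```

This is why the rows of the input-side weight matrix can later be read as word vectors.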

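Implementation with a MatMul layer

class MatMul:
    def __init__(self, W):
        self.params = [W]  # learnable parameters: the weight matrix
        self.grads = [np.zeros_like(W)]  # all-zero array with the same shape as W
        self.x = None

    # Forward pass
    def forward(self, x):
        W, = self.params    # unpack the weight matrix
        out = np.dot(x, W)  # matrix product of x and W
        self.x = x
        return out

    # Backward pass
    def backward(self, dout):
        W, = self.params
        dx = np.dot(dout, W.T)  # dout is the gradient passed in from upstream; W.T is W transposed
        dW = np.dot(self.x.T, dout)
        # grads[0][...] = dW overwrites the elements of the existing array in place,
        # so the array keeps its memory address
        self.grads[0][...] = dW  # store the weight gradient in the instance variable grads
        return dx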

import sys
sys.path.append("..")
import numpy as np

c = np.array([[1,0,0,0,0,0,0]])
W = np.random.randn(7,3)

layer = MatMul(W)  # instantiate the class
h = layer.forward(c)  # call its forward method

h

array([[-0.23997344, -0.90521716,  0.74001086]])

A simple Word2Vec

The neural network used here is the CBOW (continuous bag-of-words) model proposed in the original Word2Vec.

The two classic models used in Word2Vec:

CBOW model
skip-gram model

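Inference with the CBOW model

The CBOW model predicts the target word from its context.

Model input: the context, for example the words ['you','goodbye'], which must first be converted to one-hot vectors.

These notes use two context words, so the model has two input layers. With N context words there would be N input layers.

The transformation from each input layer to the hidden layer uses the same fully connected layer (all with the weight $W_{in}$)
The transformation from the hidden layer to the output layer is done by a second fully connected layer (with the weight $W_{out}$)

The hidden-layer neurons are the average of the values obtained by passing each input layer through the fully connected layer.

The output-layer neurons hold one score per word; the larger the score, the higher the probability that the corresponding word appears.

A score is the value before it is interpreted as a probability; applying the Softmax function to the scores yields the probabilities.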

Code implementation

import sys
sys.path.append('..')

import numpy as np

# One-hot representation of the context
c0 = np.array([[1,0,0,0,0,0,0]])
c1 = np.array([[0,0,1,0,0,0,0]])

# Initial weight values
W_in = np.random.randn(7,3)
W_out = np.random.randn(3,7)

# Create the layers
in_layer0 = MatMul(W_in)  # each layer computes a matrix product internally
in_layer1 = MatMul(W_in)
out_layer = MatMul(W_out)

# Forward pass
h0 = in_layer0.forward(c0)
h1 = in_layer1.forward(c1)
h = 0.5 * (h0 + h1)  # average of the hidden-layer values

s = out_layer.forward(h)  # scores for each word
s

array([[-0.03647001,  1.22730525, -1.35937841,  1.0817182 ,  1.64785619,
         1.49898799, -0.18553477]])
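To turn these scores into probabilities, Softmax can be applied to them. The sketch below reuses the scores printed above (they depend on the random weights, so a fresh run gives different numbers); the variable names are illustrative.

```python
import numpy as np

# Scores copied from the output above
s = np.array([[-0.03647001, 1.22730525, -1.35937841, 1.0817182,
               1.64785619, 1.49898799, -0.18553477]])

# Softmax: subtract the row maximum for numerical stability, then normalise
e = np.exp(s - s.max(axis=1, keepdims=True))
p = e / e.sum(axis=1, keepdims=True)

print(p.sum())           # 1, up to floating-point error
print(p.argmax(axis=1))  # [4], the word ID with the highest score
```

Since the weights are still random at this point, the most probable word carries no meaning yet; training is what makes these probabilities useful.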

Training the CBOW model

Training the CBOW model means adjusting the weights so that its predictions become accurate.

CBOW model + Softmax layer + Cross Entropy Error layer
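Written as formulas, the two added layers compute the following. The symbols are assumptions chosen for this sketch: $s_k$ is the score of word $k$, $y_k$ its predicted probability, and $t_k$ the one-hot label.

```latex
y_k = \frac{\exp(s_k)}{\sum_{i} \exp(s_i)}, \qquad
L = -\sum_{k} t_k \log y_k
```

Because $t_k$ is 1 only for the correct word, the loss reduces to the negative log of the probability assigned to the target word.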

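Word2Vec weights and distributed representations

The network used in Word2Vec has two weights: $W_{in}$ on the input side and $W_{out}$ on the output side.

Both hold vectors that encode word meaning, so which one should be used? The most popular choice is to use only the input-side weight.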

Preparing the data

Contexts and target words

The input to the Word2Vec network is the contexts;
the correct label is the word in the middle that those context words surround, the target.

Getting contexts and targets from the corpus

# The preprocess function from Chapter 2; used frequently

def preprocess(text):
    text = text.lower()  # convert to lowercase
    text = text.replace('.', ' .')  # put a space before each period
    words = text.split(' ')  # split on spaces

    # Mapping between words and word IDs
    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word
    # word list -> word ID list
    corpus = np.array([word_to_id[w] for w in words])

    return corpus, word_to_id, id_to_word

import sys
sys.path.append('..')

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

corpus

array([0, 1, 2, 3, 4, 1, 5, 6])

id_to_word

{0: 'you', 1: 'say', 2: 'goodbye', 3: 'and', 4: 'i', 5: 'hello', 6: '.'}

corpus[1:-1]

array([1, 2, 3, 4, 1, 5])

def create_contexts_target(corpus, window_size=1):
    target = corpus[window_size:-window_size]  # with window_size=1, targets start at the second element (index 1)

    contexts = []  # two-dimensional: contexts[0] holds the context of the first target

    for idx in range(window_size, len(corpus) - window_size):
        cs = []
        for t in range(-window_size, window_size + 1):  # with window_size=1: -1, 0, 1
            if t == 0:  # skip the target word itself; only offsets -1 and 1 are context
                continue
            cs.append(corpus[idx + t])
        contexts.append(cs)
    return np.array(contexts), np.array(target)

contexts, target = create_contexts_target(corpus, window_size=1)

contexts  # contexts

array([[0, 2],
       [1, 3],
       [2, 4],
       [3, 1],
       [4, 5],
       [1, 6]])

target  # targets

array([1, 2, 3, 4, 1, 5])
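The ID arrays are easier to check when mapped back to words. A small sketch that hard-codes the arrays and the dictionary shown above; the name `pairs` is only illustrative.

```python
import numpy as np

id_to_word = {0: 'you', 1: 'say', 2: 'goodbye', 3: 'and', 4: 'i', 5: 'hello', 6: '.'}
contexts = np.array([[0, 2], [1, 3], [2, 4], [3, 1], [4, 5], [1, 6]])
target = np.array([1, 2, 3, 4, 1, 5])

# Pair each context (as words) with the word it should predict
pairs = []
for cs, t in zip(contexts, target):
    pairs.append(([id_to_word[int(c)] for c in cs], id_to_word[int(t)]))

for context_words, target_word in pairs:
    print(context_words, '->', target_word)
# first line printed: ['you', 'goodbye'] -> say
```

Each row is one training sample: the two neighbours are the input and the word between them is the label.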

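convert_one_hot: conversion to one-hot encoding

def convert_one_hot(corpus, vocab_size):
    """
    corpus: list of word IDs, as a one- or two-dimensional NumPy array
    vocab_size: number of words in the vocabulary
    """
    N = corpus.shape[0]  # number of samples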

    if corpus.ndim == 1:
        one_hot = np.zeros((N, vocab_size), dtype=np.int32)

        for idx, word_id in enumerate(corpus):
            one_hot[idx, word_id] = 1

    elif corpus.ndim == 2:
        C = corpus.shape[1]
        one_hot = np.zeros((N, C, vocab_size), dtype=np.int32)

        for idx_0, word_ids in enumerate(corpus):
            for idx_1, word_id in enumerate(word_ids):
                one_hot[idx_0, idx_1, word_id] = 1

    return one_hot

#text = 'You say goodbye and I say hello.'
#corpus, word_to_id, id_to_word = preprocess(text)
#contexts, target = create_contexts_target(corpus, window_size=1)

vocab_size = len(word_to_id)
target = convert_one_hot(target, vocab_size)
contexts = convert_one_hot(contexts, vocab_size)

target  # after conversion to one-hot encoding

array([[0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0]])

contexts  # after conversion to one-hot encoding

array([[[1, 0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0, 0]],

       [[0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0]],

       [[0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 0]],

       [[0, 0, 0, 1, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0]],

       [[0, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 1, 0]],

       [[0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1]]])
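As an aside, the same conversion can be written without explicit loops by indexing an identity matrix. This is a hypothetical alternative (`convert_one_hot_eye` is not a function from the book), shown only to illustrate the idea.

```python
import numpy as np

def convert_one_hot_eye(ids, vocab_size):
    # Row i of the identity matrix is the one-hot vector for word ID i,
    # so indexing with an ID array appends one vocab_size axis
    return np.eye(vocab_size, dtype=np.int32)[ids]

target = np.array([1, 2, 3, 4, 1, 5])
contexts = np.array([[0, 2], [1, 3], [2, 4], [3, 1], [4, 5], [1, 6]])

target_oh = convert_one_hot_eye(target, 7)
contexts_oh = convert_one_hot_eye(contexts, 7)
print(target_oh.shape)    # (6, 7)
print(contexts_oh.shape)  # (6, 2, 7)
```

It works for one- and two-dimensional ID arrays alike, at the cost of materialising a vocab_size x vocab_size identity matrix first.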

A simple CBOW model implementation

Cross-entropy error

def cross_entropy_error(y, t):
    """
    Cross-entropy error
    """
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)

    # If the labels are one-hot vectors, convert them to the indices of the correct classes
    if t.size == y.size:
        t = t.argmax(axis=1)

    batch_size = y.shape[0]

    return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size
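A tiny worked example may make the indexing in the last line clearer. The probabilities below are made-up numbers, not model output.

```python
import numpy as np

# Predicted probabilities for two samples over three classes (made-up numbers)
y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8]])
t = np.array([0, 2])  # indices of the correct classes

# Pick the probability that each sample assigns to its correct class
picked = y[np.arange(len(t)), t]        # [0.7, 0.8]
loss = -np.mean(np.log(picked + 1e-7))  # average negative log-likelihood
print(loss)
```

The small constant 1e-7 only protects against log(0); it barely changes the result when the probabilities are not close to zero.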

SoftmaxWithLoss layer implementation

# Softmax function
def softmax(x):
    if x.ndim == 2:
        x = x - x.max(axis=1, keepdims=True)
        x = np.exp(x)
        x /= x.sum(axis=1, keepdims=True)
    elif x.ndim == 1:
        x = x - np.max(x)
        x = np.exp(x) / np.sum(np.exp(x))

    return x


# Softmax layer combined with the cross-entropy loss
class SoftmaxWithLoss:
    def __init__(self):
        self.params, self.grads = [], []
        self.y = None  # output of softmax
        self.t = None  # labels

    def forward(self, x, t):  # forward pass
        self.t = t
        self.y = softmax(x)

        # If the labels are one-hot vectors, convert them to the indices of the correct classes
        if self.t.size == self.y.size:
            self.t = self.t.argmax(axis=1)

        loss = cross_entropy_error(self.y, self.t)  # cross-entropy error of the softmax output
        return loss

    def backward(self, dout=1):
        batch_size = self.t.shape[0]

        dx = self.y.copy()
        dx[np.arange(batch_size), self.t] -= 1
        dx *= dout
        dx = dx / batch_size

        return dx
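The backward pass above amounts to (y - t) / batch_size, where t is the one-hot label. One way to convince yourself of this is a numerical gradient check. The following is a self-contained sketch with its own helper functions and random test data, not code from the book.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def loss_fn(x, t):
    # Mean cross-entropy of softmax(x) against class indices t
    y = softmax(x)
    return -np.mean(np.log(y[np.arange(len(t)), t]))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))  # scores for 3 samples and 4 classes
t = np.array([0, 2, 1])          # indices of the correct classes

# Analytic gradient, following the same steps as SoftmaxWithLoss.backward
analytic = softmax(x)
analytic[np.arange(len(t)), t] -= 1
analytic /= len(t)

# Numerical gradient by central differences
numeric = np.zeros_like(x)
eps = 1e-5
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[i, j] += eps
        x_minus[i, j] -= eps
        numeric[i, j] = (loss_fn(x_plus, t) - loss_fn(x_minus, t)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be tiny
```

If the two gradients disagree by more than rounding noise, the backward implementation is the first place to look.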

SimpleCBOW class implementation

import sys
sys.path.append('..')

import numpy as np

class SimpleCBOW:
    def __init__(self, vocab_size, hidden_size):
        """
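        vocab_size: number of words in the vocabulary
        hidden_size: number of neurons in the hidden layer
        """
        V,H = vocab_size, hidden_size

        # Weight parameters
        W_in = 0.01 * np.random.randn(V,H).astype('f')
        W_out = 0.01 * np.random.randn(H,V).astype('f')

        # Create the layers: two input-side MatMul layers, one output-side MatMul layer and a SoftmaxWithLoss layer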
        self.in_layer0 = MatMul(W_in)
        self.in_layer1 = MatMul(W_in)
        self.out_layer = MatMul(W_out)
        self.loss_layer = SoftmaxWithLoss()

        # Collect all weights and gradients into lists
        layers = [self.in_layer0, self.in_layer1, self.out_layer]

        self.params, self.grads = [], []

        for layer in layers:
            self.params += layer.params
            self.grads += layer.grads

        # Keep the distributed word representations as a member variable
        self.word_vecs = W_in


    # Forward pass over the contexts
    def forward(self, contexts, target):
        h0 = self.in_layer0.forward(contexts[:,0])
        h1 = self.in_layer1.forward(contexts[:,1])

        h = (h0 + h1) / 2

        score = self.out_layer.forward(h)
        loss = self.loss_layer.forward(score, target)
        return loss

    # Backward pass
    def backward(self, dout=1):
        ds = self.loss_layer.backward(dout)
        da = self.out_layer.backward(ds)
        da *= 0.5

        self.in_layer1.backward(da)
        self.in_layer0.backward(da)

        return None

Forward pass over the contexts

# def forward(self, contexts, target):
#     h0 = self.in_layer0.forward(contexts[:,0])
#     h1 = self.in_layer1.forward(contexts[:,1])

#     h = (h0 + h1) / 2

#     score = self.out_layer.forward(h)
#     loss = self.loss_layer.forward(score, target)
#     return loss

Backward pass of the CBOW model

# def backward(self, dout=1):
#     ds = self.loss_layer.backward(dout)
#     da = self.out_layer.backward(ds)
#     da *= 0.5

#     self.in_layer1.backward(da)
#     self.in_layer0.backward(da)

#     return None
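The factor 0.5 in the backward pass comes directly from the averaging in the forward pass. In the notation assumed here, $h_0$ and $h_1$ are the two input-layer outputs, $h$ is the hidden layer and $L$ is the loss:

```latex
h = \tfrac{1}{2}\,(h_0 + h_1)
\;\Longrightarrow\;
\frac{\partial L}{\partial h_0} = \frac{\partial L}{\partial h_1}
= \frac{1}{2}\,\frac{\partial L}{\partial h}
```

Both input layers therefore receive the same upstream gradient, which is why the same `da` is passed to `in_layer1` and `in_layer0`.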

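Trainer class implementation

# Parameter deduplication

def remove_duplicate(params, grads):
    '''
    Merge weights that appear more than once in the parameter list into one,
    and add up the gradients that belong to that weight
    '''
    params, grads = params[:], grads[:]  # copies of the lists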

    while True:
        find_flg = False
        L = len(params)

        for i in range(0, L - 1):
            for j in range(i + 1, L):
                # Case: the same weight is shared
                if params[i] is params[j]:
                    grads[i] += grads[j]  # accumulate the gradient
                    find_flg = True
                    params.pop(j)
                    grads.pop(j)
                # Case: the weight is shared as its transpose (weight tying)
                elif params[i].ndim == 2 and params[j].ndim == 2 and \
                     params[i].T.shape == params[j].shape and np.all(params[i].T == params[j]):
                    grads[i] += grads[j].T
                    find_flg = True
                    params.pop(j)
                    grads.pop(j)

                if find_flg:
                    break
            if find_flg:
                break

        if not find_flg:
            break

    return params, grads


class Trainer:
    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer
        self.loss_list = []
        self.eval_interval = None
        self.current_epoch = 0


    def fit(self, x, t, max_epoch=10, batch_size=32, max_grad=None, eval_interval=20):
        """
        x: input data
        t: labels
        max_epoch: number of epochs to train
        batch_size: mini-batch size
        max_grad: maximum norm of the gradients
        eval_interval: interval, in iterations, between evaluations
        """
        data_size = len(x)
        max_iters = data_size // batch_size
        self.eval_interval = eval_interval
        model, optimizer = self.model, self.optimizer
        total_loss = 0
        loss_count = 0

        start_time = time.time()
        for epoch in range(max_epoch):
            # Shuffle the data
            idx = np.random.permutation(np.arange(data_size))
            x = x[idx]
            t = t[idx]

            for iters in range(max_iters):
                batch_x = x[iters*batch_size:(iters+1)*batch_size]
                batch_t = t[iters*batch_size:(iters+1)*batch_size]

                # Compute the gradients and update the parameters
                loss = model.forward(batch_x, batch_t)
                model.backward()

                # Deduplicate parameters
                params, grads = remove_duplicate(model.params, model.grads)  # merge shared weights into one
                if max_grad is not None:
                    clip_grads(grads, max_grad)
                optimizer.update(params, grads)
                total_loss += loss
                loss_count += 1

                # Evaluation
                if (eval_interval is not None) and (iters % eval_interval) == 0:
                    avg_loss = total_loss / loss_count
                    elapsed_time = time.time() - start_time
                    self.loss_list.append(float(avg_loss))
                    total_loss, loss_count = 0, 0

            self.current_epoch += 1

    def plot(self, ylim=None):
        x = np.arange(len(self.loss_list))
        if ylim is not None:
            plt.ylim(*ylim)
        plt.plot(x, self.loss_list, label='train')
        plt.xlabel('iterations (x' + str(self.eval_interval) + ')')
        plt.ylabel('loss')
        plt.show()
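Trainer.fit calls clip_grads whenever max_grad is set, but these notes never define that function, so passing max_grad would raise a NameError. Below is a minimal sketch that assumes clipping by the combined L2 norm of all gradients, which matches the "maximum norm of the gradients" description of max_grad; treat it as one plausible implementation rather than the definitive one.

```python
import numpy as np

def clip_grads(grads, max_norm):
    # Assumed behaviour: rescale all gradients in place so that their
    # combined L2 norm does not exceed max_norm
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    rate = max_norm / (total_norm + 1e-6)
    if rate < 1:
        for g in grads:
            g *= rate

grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]  # combined norm is 5
clip_grads(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in grads))
print(norm_after)  # close to 1.0
```

The example in these notes leaves max_grad at None, so the training run further down does not depend on this function.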

SGD optimizer

class SGD:
    def __init__(self, lr=0.05):
        self.lr = lr  # learning rate

    def update(self, params, grads):
        for i in range(len(params)):
            params[i] -= self.lr * grads[i]  # parameter update
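To see the update rule in isolation, it can be applied to a function whose minimum is known. The sketch repeats the SGD class so that it runs on its own; the quadratic objective and the learning rate of 0.1 are choices made for this illustration.

```python
import numpy as np

class SGD:
    def __init__(self, lr=0.05):
        self.lr = lr

    def update(self, params, grads):
        for i in range(len(params)):
            params[i] -= self.lr * grads[i]  # in-place update of each array

# Minimise f(w) = sum(w ** 2), whose gradient is 2 * w
w = np.array([1.0, -2.0])
optimizer = SGD(lr=0.1)
for _ in range(100):
    optimizer.update([w], [2 * w])

print(w)  # both components end up very close to 0
```

Because the update is done in place, the array `w` itself changes, which is the same mechanism that lets the optimizer modify the model's weights through `model.params`.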

Worked example

window_size = 1
hidden_size = 5
batch_size = 3
max_epoch = 1000

# The actual data
text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)  # preprocessing
contexts, target = create_contexts_target(corpus, window_size)

vocab_size = len(word_to_id)
target = convert_one_hot(target, vocab_size)
contexts = convert_one_hot(contexts, vocab_size)

model = SimpleCBOW(vocab_size, hidden_size)
optimizer = SGD()
trainer = Trainer(model, optimizer)
trainer.fit(contexts, target, max_epoch, batch_size)
trainer.plot()

Word-vector weights and distributed representations by word ID

word_vecs = model.word_vecs  # the input-side weights W_in

for word_id, word in id_to_word.items():
    print(word, word_vecs[word_id])

you [ 0.56529504 -0.89804494  2.2253568  -0.06463418 -1.1290963 ]
say [-1.1766486   0.99819845 -1.1207973   0.6008937   1.1576382 ]
goodbye [ 1.5599341   0.2788804  -0.38262236 -0.35986307 -1.1309035 ]
and [-0.8373688  -0.9373508  -1.2509673   1.8487976   0.51270753]
i [ 1.593292   0.2917557 -0.3518384 -0.3507546 -1.129358 ]
hello [ 0.5445614  -0.89658004  2.2065814  -0.07632335 -1.112859  ]
. [-0.3468702   1.9544731   0.13033707 -1.2471999   0.658087  ]

Each word is now represented as a dense vector; this is the distributed representation of the word.
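One way to inspect such vectors is cosine similarity. The sketch below copies three of the vectors printed above (they change with every training run) and uses a helper, `cos_similarity`, written for this illustration.

```python
import numpy as np

# Vectors copied from the output above
vecs = {
    'you':   np.array([0.56529504, -0.89804494, 2.2253568, -0.06463418, -1.1290963]),
    'hello': np.array([0.5445614, -0.89658004, 2.2065814, -0.07632335, -1.112859]),
    'say':   np.array([-1.1766486, 0.99819845, -1.1207973, 0.6008937, 1.1576382]),
}

def cos_similarity(x, y, eps=1e-8):
    # eps guards against division by zero for all-zero vectors
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps)

sim_you_hello = cos_similarity(vecs['you'], vecs['hello'])
sim_you_say = cos_similarity(vecs['you'], vecs['say'])
print(sim_you_hello)  # close to 1
print(sim_you_say)    # negative
```

In this run 'you' and 'hello' end up almost identical, presumably because both only ever appear next to 'say' in the tiny corpus; with so little data the similarities should not be over-interpreted.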