Skip-gram model with TensorFlow

Now that we have the training and validation data prepared, let's create a skip-gram model in TensorFlow.

We start by defining the hyperparameters:

batch_size = 128
embedding_size = 128
skip_window = 2
n_negative_samples = 64
ptb.skip_window = 2
learning_rate = 1.0
  • batch_size is the number of (target, context) word pairs to be fed into the algorithm in a single batch
  • embedding_size is the dimension of the word vector, or embedding, for each word
  • ptb.skip_window is the number of words to consider in the context of the target word, in both directions
  • n_negative_samples is the number of negative samples to be generated by the NCE loss function, explained further in this chapter

In some tutorials, including the one in the TensorFlow documentation, one more parameter, num_skips, is used. In such tutorials, the authors pick num_skips out of the (target, context) pairs. For example, if skip_window is two, then the total number of pairs would be four, and if num_skips is set to two, then only two pairs would be randomly selected for training. However, we consider all the pairs here in order to keep the training exercise simple.
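As a rough, standalone illustration of how these pairs are formed (this helper is an assumption for illustration only, not part of the ptb dataset class used in this chapter), the following sketch generates all the (target, context) pairs for a toy list of word ids with skip_window set to 2, and shows how a num_skips setting would subsample them:

import random

def skip_gram_pairs(words, skip_window=2, num_skips=None):
    # generate (target, context) pairs for a list of word ids
    pairs = []
    for i, target in enumerate(words):
        # context = up to skip_window words on each side of the target
        window = words[max(0, i - skip_window):i] + \
                 words[i + 1:i + 1 + skip_window]
        if num_skips is not None:
            # some tutorials keep only num_skips random context words per target
            window = random.sample(window, min(num_skips, len(window)))
        pairs.extend((target, context) for context in window)
    return pairs

sentence = [0, 1, 2, 3, 4]                                    # toy word ids
print(skip_gram_pairs(sentence, skip_window=2))               # all pairs
print(skip_gram_pairs(sentence, skip_window=2, num_skips=2))  # at most 2 pairs per target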

Define the input and output placeholders for the training data and a tensor for the validation data:

inputs = tf.placeholder(dtype=tf.int32, shape=[batch_size])
outputs = tf.placeholder(dtype=tf.int32, shape=[batch_size, 1])
inputs_valid = tf.constant(x_valid, dtype=tf.int32)

Define an embedding matrix with the number of rows equal to the vocabulary length and the number of columns equal to the embedding dimension. Each row of this matrix represents the word vector for one word in the vocabulary. Populate this embedding matrix with values sampled uniformly between -1.0 and 1.0.

# define the embeddings matrix with vocab_len rows and embedding_size columns
# each row represents the vector representation or embedding of a word
# in the vocabulary

embed_dist = tf.random_uniform(shape=[ptb.vocab_len, embedding_size],
                               minval=-1.0, maxval=1.0)
embed_matrix = tf.Variable(embed_dist, name='embed_matrix')

Using this matrix, define an embedding lookup table, implemented with tf.nn.embedding_lookup(). tf.nn.embedding_lookup() takes two parameters: the embedding matrix and the inputs placeholder. The lookup function returns the word vectors for the words in the inputs placeholder.

# define the embedding lookup table
# provides the embeddings of the word ids in the input tensor
embed_ltable = tf.nn.embedding_lookup(embed_matrix, inputs)

embed_ltable can also be interpreted as an embedding layer on top of the input layer. Next, the output of the embedding layer is fed into a softmax or a noise-contrastive estimation (NCE) layer. NCE is based on a very simple idea: train a logistic regression-based binary classifier to learn the parameters from a mixture of true and noisy data.

The TensorFlow documentation describes NCE in further detail: https://www.tensorflow.org/tutorials/word2vec.

In summary, softmax loss-based models are computationally expensive because the probability distribution is computed and normalized over the entire vocabulary. NCE loss-based models reduce this to a binary classification problem, namely identifying the true samples from the noise samples.

The underlying mathematical details of NCE can be found in the following NIPS paper: Learning word embeddings efficiently with noise-contrastive estimation, by Andriy Mnih and Koray Kavukcuoglu. The paper is available at the following link: http://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf.
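Leaving out the correction terms derived in that paper, the per-example objective minimized by the NCE layer can be sketched as a sum of binary logistic losses over the true pair and the k sampled noise words, where $s(w, c)$ is the score (logit) the model assigns to a (target, context) pair, $\sigma$ is the logistic sigmoid, and $\tilde{c}_1, \ldots, \tilde{c}_k$ are the noise words drawn for that example:

$$ \mathcal{L} \approx -\log \sigma\big(s(w, c)\big) - \sum_{i=1}^{k} \log \sigma\big(-s(w, \tilde{c}_i)\big) $$

This simplified form is closest to negative sampling; full NCE additionally offsets each score by the log-probability of the word under the noise distribution, but the intuition of a binary classifier separating true pairs from sampled noise is the same.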

The tf.nn.nce_loss() function automatically generates negative samples while evaluating the loss: the num_sampled parameter is set equal to the number of negative samples (n_negative_samples), and it specifies how many negative samples to draw.

# define noise-contrastive estimation (NCE) loss layer
nce_dist = tf.truncated_normal(shape=[ptb.vocab_len, embedding_size],
                               stddev=1.0 /
                               tf.sqrt(embedding_size * 1.0)
                               )
nce_w = tf.Variable(nce_dist)
nce_b = tf.Variable(tf.zeros(shape=[ptb.vocab_len]))

loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_w,
                                     biases=nce_b,
                                     inputs=embed_ltable,
                                     labels=outputs,
                                     num_sampled=n_negative_samples,
                                     num_classes=ptb.vocab_len
                                     )
                      )

Next, compute the cosine similarity between the samples in the validation set and the embedding matrix:

  1. To compute the similarity score, first compute the L2 norm of each word vector in the embedding matrix:
# Compute the cosine similarity between validation set samples
# and all embeddings.
norm = tf.sqrt(tf.reduce_sum(tf.square(embed_matrix), 1, 
                             keep_dims=True))
normalized_embeddings = embed_matrix / norm
  2. Look up the embeddings, or word vectors, for the samples in the validation set:
embed_valid = tf.nn.embedding_lookup(normalized_embeddings, 
                                     inputs_valid)
  3. Compute the similarity score by multiplying the embeddings of the validation set with the embedding matrix:
similarity = tf.matmul(
    embed_valid, normalized_embeddings, transpose_b=True)

This gives a tensor of shape (valid_size, vocab_len). Each row in the tensor contains the similarity scores between a validation word and all the words in the vocabulary.

Next, define the SGD optimizer with a learning rate of 0.9; here we train for 10 epochs.

n_epochs = 10
learning_rate = 0.9
n_batches = ptb.n_batches_wv()
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

For each epoch:

  1. Run the optimizer over the whole dataset, batch by batch:
ptb.reset_index()
for step in range(n_batches):
    x_batch, y_batch = ptb.next_batch_sg()
    y_batch = nputil.to2d(y_batch, unit_axis=1)
    feed_dict = {inputs: x_batch, outputs: y_batch}
    _, batch_loss = tfs.run([optimizer, loss], feed_dict=feed_dict)
    epoch_loss += batch_loss
  2. Compute and print the average loss for the epoch:
epoch_loss = epoch_loss / n_batches
print('\nAverage loss after epoch ', epoch, ': ', epoch_loss)
  3. At the end of the epoch, compute the similarity scores:
similarity_scores = tfs.run(similarity)
  4. For each word in the validation set, print the five words having the highest similarity score:
top_k = 5
for i in range(valid_size):
    similar_words = (-similarity_scores[i, :]).argsort()[1:top_k + 1]
    similar_str = 'Similar to {0:}:'.format(ptb.id2word[x_valid[i]])
    for k in range(top_k):
        similar_str = '{0:} {1:},'.format(similar_str,
                                          ptb.id2word[similar_words[k]])
    print(similar_str)

Finally, after all the epochs are complete, compute the embedding vectors, which can be used further in the learning process:

final_embeddings = tfs.run(normalized_embeddings)
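As a quick sanity check of the learned vectors outside the TensorFlow graph, a minimal sketch like the following could be used; note that the ptb.word2id mapping and the chosen words are assumptions for illustration, not part of the code above:

import numpy as np

# rows of final_embeddings are already L2-normalized, so the dot product
# of two rows is their cosine similarity
v_week = final_embeddings[ptb.word2id['week']]    # ptb.word2id is assumed here
v_month = final_embeddings[ptb.word2id['month']]
print('cosine(week, month) =', np.dot(v_week, v_month))

# optionally persist the embeddings for reuse, e.g. to initialize another model
np.save('ptb_skipgram_embeddings.npy', final_embeddings)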

The complete code for training is as follows:

n_epochs = 10
learning_rate = 0.9
n_batches = ptb.n_batches_wv()
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

with tf.Session() as tfs:
    tf.global_variables_initializer().run()
    for epoch in range(n_epochs):
        epoch_loss = 0
        ptb.reset_index()
        for step in range(n_batches):
            x_batch, y_batch = ptb.next_batch_sg()
            y_batch = nputil.to2d(y_batch, unit_axis=1)
            feed_dict = {inputs: x_batch, outputs: y_batch}
            _, batch_loss = tfs.run([optimizer, loss], feed_dict=feed_dict)
            epoch_loss += batch_loss
        epoch_loss = epoch_loss / n_batches
        print('\nAverage loss after epoch ', epoch, ': ', epoch_loss)

        # print closest words to validation set at end of every epoch
        similarity_scores = tfs.run(similarity)
        top_k = 5
        for i in range(valid_size):
            similar_words = (-similarity_scores[i, :]
                             ).argsort()[1:top_k + 1]
            similar_str = 'Similar to {0:}:'.format(
                ptb.id2word[x_valid[i]])
            for k in range(top_k):
                similar_str = '{0:} {1:},'.format(
                    similar_str, ptb.id2word[similar_words[k]])
            print(similar_str)
    final_embeddings = tfs.run(normalized_embeddings)

Here is the output we got after the 1st and the 10th epoch, respectively:

Average loss after epoch  0 :  115.644006802
Similar to we: types, downturn, internal, by, introduce,
Similar to been: said, funds, mcgraw-hill, street, have,
Similar to also: will, she, next, computer, 's,
Similar to of: was, and, milk, dollars, $,
Similar to last: be, october, acknowledging, requested, computer,
Similar to u.s.: plant, increase, many, down, recent,
Similar to an: commerce, you, some, american, a,
Similar to trading: increased, describes, state, companies, in,

Average loss after epoch  9 :  5.56538496033
Similar to we: types, downturn, introduce, internal, claims,
Similar to been: exxon, said, problem, mcgraw-hill, street,
Similar to also: will, she, ssangyong, audit, screens,
Similar to of: seasonal, dollars, motor, none, deaths,
Similar to last: acknowledging, allow, incorporated, joint, requested,
Similar to u.s.: undersecretary, typically, maxwell, recent, increase,
Similar to an: banking, officials, imbalances, americans, manager,
Similar to trading: describes, increased, owners, committee, else,

Finally, we ran the model for 5,000 epochs and obtained the following results:

Average loss after epoch  4999 :  2.74216903135
Similar to we: matter, noted, here, classified, orders,
Similar to been: good, precedent, medium-sized, gradual, useful,
Similar to also: introduce, england, index, able, then,
Similar to of: indicator, cleveland, theory, the, load,
Similar to last: dec., office, chrysler, march, receiving,
Similar to u.s.: label, fannie, pressures, squeezed, reflection,
Similar to an: knowing, outlawed, milestones, doubled, base,
Similar to trading: associates, downturn, money, portfolios, go,

Try running it further, up to 50,000 epochs, to get even better results.

Similarly, we got the following results with the text8 model after 50 epochs:

Average loss after epoch  49 :  5.74381046423
Similar to four: five, three, six, seven, eight,
Similar to all: many, both, some, various, these,
Similar to between: with, through, thus, among, within,
Similar to a: another, the, any, each, tpvgames,
Similar to that: which, however, although, but, when,
Similar to zero: five, three, six, eight, four,
Similar to is: was, are, has, being, busan,
Similar to no: any, only, the, another, trinomial,
