Using doc2vec for Sentiment Analysis

Now that we know how to train word embeddings, we can also extend those methods to document embeddings. We will explore how to do this in the following section.

Getting ready

In the previous sections on word2vec methods, we managed to capture positional relationships between words. What we did not do is capture the relationship between words and the documents (or movie reviews) they come from. One extension of word2vec that captures document effects is called doc2vec.

The basic idea of doc2vec is to introduce a document embedding, alongside the word embeddings, that may help capture the tone of a document. For example, just knowing that the words movie and love occur near each other may not help us determine the sentiment of a review: the review could be about how much they love the movie or how much they do not love the movie. But if the review is long enough and more negative words are found in the document, we may pick up an overall tone that helps us predict the subsequent words.

Doc2vec simply adds an extra embedding matrix for the documents and uses a window of words plus the document index to predict the next word. All word windows within a document share the same document index. It is worth thinking carefully about how to combine the document embedding with the word embeddings. We combine the word embeddings within a word window by summing them. There are then two main ways to combine them with the document embedding: commonly, the document embedding is either added to the summed word embeddings or concatenated onto the end of them. If we add the two embeddings, we restrict the document embedding size to be the same as the word embedding size. If we concatenate, we lift that restriction but increase the number of variables that the logistic regression has to deal with. For illustrative purposes, we will show you how to handle concatenation in this recipe; in general, though, addition is the better choice for smaller datasets.
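To make the two combination strategies concrete, here is a minimal NumPy sketch of the shapes involved; the toy sizes are made up for illustration and do not match the parameters used later in the recipe:

import numpy as np

toy_embedding_size = 4       # toy word embedding size
toy_doc_embedding_size = 3   # toy document embedding size
window = [np.random.rand(toy_embedding_size) for _ in range(3)]  # 3 word vectors

# Word embeddings within the window are always combined by summing
word_sum = np.sum(window, axis=0)                     # shape: (4,)

# Option 1: concatenate the document embedding (used in this recipe);
# the document embedding may have a different size
doc_vec_concat = np.random.rand(toy_doc_embedding_size)
concatenated = np.concatenate([word_sum, doc_vec_concat])  # shape: (7,) = 4 + 3

# Option 2: add the document embedding; it must match the word embedding size
doc_vec_add = np.random.rand(toy_embedding_size)
added = word_sum + doc_vec_add                        # shape: (4,)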

The first step is to fit both the document and word embeddings on the whole corpus of movie reviews. Then we will perform a train-test split, train a logistic model, and see whether we can predict review sentiment more accurately.

How to do it...

We will proceed with the recipe as follows:

  1. We will start by loading the necessary libraries and starting a graph session, as follows:
import tensorflow as tf 
import matplotlib.pyplot as plt 
import numpy as np 
import random 
import os 
import pickle 
import string 
import requests 
import collections 
import io 
import tarfile 
import urllib.request 
import text_helpers 
from nltk.corpus import stopwords 
sess = tf.Session()
  1. We will load the movie review corpus, just as we did in the previous two recipes. Use the following code to do so:
texts, target = text_helpers.load_movie_data()
  1. We will declare the model parameters, as follows:
batch_size = 500 
vocabulary_size = 7500 
generations = 100000 
model_learning_rate = 0.001 
embedding_size = 200   # Word embedding size 
doc_embedding_size = 100   # Document embedding size 
concatenated_size = embedding_size + doc_embedding_size 
num_sampled = int(batch_size/2) 
window_size = 3       # How many words to consider to the left. 
# Add checkpoints to training 
save_embeddings_every = 5000 
print_valid_every = 5000 
print_loss_every = 100 
# Folder for the saved vocabulary and embeddings; assumed to be 'temp' to 
# match the save path shown in the output later in this recipe 
data_folder_name = 'temp' 
# Declare stop words 
stops = stopwords.words('english') 
# We pick a few test words. 
valid_words = ['love', 'hate', 'happy', 'sad', 'man', 'woman']
  1. We will normalize the movie reviews and make sure that every review is longer than the desired window size. Use the following code to do so:
texts = text_helpers.normalize_text(texts, stops)
# Texts must contain at least as much as the prior window size
target = [target[ix] for ix, x in enumerate(texts) if len(x.split()) > window_size]
texts = [x for x in texts if len(x.split()) > window_size]
assert(len(target)==len(texts))
  1. Now we will create our word dictionary. It is important to note that we do not have to create a document dictionary: the document indices are just the indices of the documents, and every document gets a unique index. A quick spot-check of the mapping follows the code:
word_dictionary = text_helpers.build_dictionary(texts, vocabulary_size) 
word_dictionary_rev = dict(zip(word_dictionary.values(), word_dictionary.keys())) 
text_data = text_helpers.text_to_numbers(texts, word_dictionary) 
# Get validation word keys 
valid_examples = [word_dictionary[x] for x in valid_words]
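Since the lookups above assume every validation word is in the vocabulary, we can verify that the dictionary round-trips correctly; the actual integer indices are corpus-specific:

# Look up a word's index and map it back (indices depend on word frequencies) 
example_ix = word_dictionary['love'] 
assert word_dictionary_rev[example_ix] == 'love' 
print('love ->', example_ix)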
  1. Next, we will define the word embeddings and the document embeddings. Then we will declare our noise-contrastive loss parameters. Use the following code to do so:
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0)) 
doc_embeddings = tf.Variable(tf.random_uniform([len(texts), doc_embedding_size], -1.0, 1.0)) 
# NCE loss parameters 
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, concatenated_size], 
                                               stddev=1.0 / np.sqrt(concatenated_size))) 
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
  1. We will now declare placeholders for the doc2vec indices and the target word index. Note that the size of the input indices is the window size plus 1, because every data window we generate has the document index appended to it (an illustrative input row is shown after the code):
x_inputs = tf.placeholder(tf.int32, shape=[None, window_size + 1]) 
y_target = tf.placeholder(tf.int32, shape=[None, 1]) 
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
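To make that layout concrete, a single row of x_inputs holds the word indices of the window followed by the review's document index. The numbers below are made up purely for illustration:

# Hypothetical row for window_size = 3: three word indices, then the doc index. 
# Each such row is used to predict the index of the word following the window. 
example_row = [102, 45, 1987, 12]   # 12 = index of the review this window came from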
  1. Now we must create our embedding function, which sums the word embeddings together and then concatenates the document embedding at the end. Use the following code to do so:
embed = tf.zeros([batch_size, embedding_size]) 
for element in range(window_size): 
    embed += tf.nn.embedding_lookup(embeddings, x_inputs[:, element]) 
# Grab the document index from the last column of x_inputs 
doc_indices = tf.slice(x_inputs, [0, window_size], [batch_size, 1]) 
# doc_embed has shape [batch_size, 1, doc_embedding_size]; squeeze out the middle dim 
doc_embed = tf.nn.embedding_lookup(doc_embeddings, doc_indices) 
# Concatenate the summed word embeddings with the document embedding 
final_embed = tf.concat(axis=1, values=[embed, tf.squeeze(doc_embed)])
  1. Now that we have our final embeddings, we declare the loss function and the optimizer. Use the following code to do so:
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights, 
                                     biases=nce_biases, 
                                     labels=y_target,
                                     inputs=final_embed,
                                     num_sampled=num_sampled, 
                                     num_classes=vocabulary_size))

# Create optimizer 
optimizer = tf.train.GradientDescentOptimizer(learning_rate=model_learning_rate) 
train_step = optimizer.minimize(loss)
  1. We also need to declare cosine distances for a set of validation words, which we can print out often to observe the progress of the doc2vec model. Use the following code to do so:
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True)) 
normalized_embeddings = embeddings / norm 
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset) 
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
  1. To save our embeddings later, we will create a model saver function. Then we can initialize the variables and start training the word embeddings, which is the final step of the first half:
saver = tf.train.Saver({"embeddings": embeddings, "doc_embeddings": doc_embeddings}) 
init = tf.global_variables_initializer() 
sess.run(init) 
loss_vec = [] 
loss_x_vec = [] 
for i in range(generations): 
    batch_inputs, batch_labels = text_helpers.generate_batch_data(text_data, batch_size, 
                                                                  window_size, method='doc2vec') 
    feed_dict = {x_inputs : batch_inputs, y_target : batch_labels} 

    # Run the train step 
    sess.run(train_step, feed_dict=feed_dict) 

    # Return the loss 
    if (i+1) % print_loss_every == 0: 
        loss_val = sess.run(loss, feed_dict=feed_dict) 
        loss_vec.append(loss_val) 
        loss_x_vec.append(i+1) 
        print('Loss at step {} : {}'.format(i+1, loss_val)) 

    # Validation: Print some random words and top 5 related words 
    if (i+1) % print_valid_every == 0: 
        sim = sess.run(similarity, feed_dict=feed_dict) 
        for j in range(len(valid_words)): 
            valid_word = word_dictionary_rev[valid_examples[j]] 
            top_k = 5 # number of nearest neighbors 
            nearest = (-sim[j, :]).argsort()[1:top_k+1] 
            log_str = "Nearest to {}:".format(valid_word) 
            for k in range(top_k): 
                close_word = word_dictionary_rev[nearest[k]] 
                log_str = '{} {},'.format(log_str, close_word) 
            print(log_str) 

    # Save dictionary + embeddings 
    if (i+1) % save_embeddings_every == 0: 
        # Save vocabulary dictionary 
        with open(os.path.join(data_folder_name,'movie_vocab.pkl'), 'wb') as f: 
            pickle.dump(word_dictionary, f) 

        # Save embeddings 
        model_checkpoint_path = os.path.join(os.getcwd(),data_folder_name,'doc2vec_movie_embeddings.ckpt') 
        save_path = saver.save(sess, model_checkpoint_path) 
        print('Model saved in file: {}'.format(save_path))
  1. This results in the following output:
Loss at step 100 : 126.176816940307617 
Loss at step 200 : 89.608322143554688
... 
Loss at step 99900 : 17.733346939086914 
Loss at step 100000 : 17.384489059448242 
Nearest to love: ride, with, by, its, start, 
Nearest to hate: redundant, snapshot, from, performances, extravagant, 
Nearest to happy: queen, chaos, them, succumb, elegance, 
Nearest to sad: terms, pity, chord, wallet, morality, 
Nearest to man: of, teen, an, our, physical, 
Nearest to woman: innocuous, scenes, prove, except, lady, 
Model saved in file: /.../temp/doc2vec_movie_embeddings.ckpt
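Because the vocabulary dictionary and both embedding matrices are checkpointed during training, they can be reloaded later without retraining. The following is a minimal sketch, assuming the same variable shapes, dictionary keys, and file names used above:

# Minimal restore sketch: rebuild variables with the shapes used when saving 
restore_embeddings = tf.Variable(tf.zeros([vocabulary_size, embedding_size])) 
restore_doc_embeddings = tf.Variable(tf.zeros([len(texts), doc_embedding_size])) 
restore_saver = tf.train.Saver({"embeddings": restore_embeddings, 
                                "doc_embeddings": restore_doc_embeddings}) 
# Reload the saved vocabulary dictionary 
with open(os.path.join(data_folder_name, 'movie_vocab.pkl'), 'rb') as f: 
    word_dictionary = pickle.load(f) 
# Restore the trained embedding values into the new variables 
restore_saver.restore(sess, os.path.join(os.getcwd(), data_folder_name, 
                                         'doc2vec_movie_embeddings.ckpt'))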
  1. Now that we have trained the doc2vec embeddings, we can use them in a logistic regression to predict review sentiment. First, we set a few parameters for the logistic regression. Use the following code to do so:
max_words = 20 # maximum review word length 
logistic_batch_size = 500 # training batch size
  1. We will now split the dataset into train and test sets (a quick sanity check follows the code):
train_indices = np.sort(np.random.choice(len(target), round(0.8*len(target)), replace=False)) 
test_indices = np.sort(np.array(list(set(range(len(target))) - set(train_indices)))) 
texts_train = [x for ix, x in enumerate(texts) if ix in train_indices] 
texts_test = [x for ix, x in enumerate(texts) if ix in test_indices] 
target_train = np.array([x for ix, x in enumerate(target) if ix in train_indices]) 
target_test = np.array([x for ix, x in enumerate(target) if ix in test_indices])
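A small sanity check confirms that the split is disjoint and covers all the reviews; the printed counts depend on the corpus:

# Quick sanity check on the 80/20 split 
print('Train/test reviews: {} / {}'.format(len(texts_train), len(texts_test))) 
assert len(texts_train) + len(texts_test) == len(texts)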
  1. Next, we will convert the reviews into numeric word indices and pad or crop each review to 20 words, as follows:
text_data_train = np.array(text_helpers.text_to_numbers(texts_train, word_dictionary)) 
text_data_test = np.array(text_helpers.text_to_numbers(texts_test, word_dictionary)) 
# Pad/crop movie reviews to specific length 
text_data_train = np.array([x[0:max_words] for x in [y+[0]*max_words for y in text_data_train]]) 
text_data_test = np.array([x[0:max_words] for x in [y+[0]*max_words for y in text_data_test]])
  1. Now we will declare the parts of the graph that pertain to the logistic regression model. We will add the data placeholders, the variables, the model operations, and the loss function, as follows:
# Define Logistic placeholders 
log_x_inputs = tf.placeholder(tf.int32, shape=[None, max_words + 1]) 
log_y_target = tf.placeholder(tf.int32, shape=[None, 1]) 
A = tf.Variable(tf.random_normal(shape=[concatenated_size,1])) 
b = tf.Variable(tf.random_normal(shape=[1,1])) 

# Declare logistic model (sigmoid in loss function) 
model_output = tf.add(tf.matmul(log_final_embed, A), b) 

# Declare loss function (Cross Entropy loss) 
logistic_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=model_output,  
labels=tf.cast(log_y_target, tf.float32)))
  1. We need to create another embedding function. The embedding function in the first half trained on smaller windows of three words (plus a document index) to predict the next word; here we will do the same thing, but with the 20-word reviews. Note that model_output in the previous step depends on log_final_embed, so when assembling the graph this block must be defined before the model operations. Use the following code to do so:
# Add together element embeddings in window: 
log_embed = tf.zeros([logistic_batch_size, embedding_size]) 
for element in range(max_words): 
    log_embed += tf.nn.embedding_lookup(embeddings, log_x_inputs[:, element]) 
log_doc_indices = tf.slice(log_x_inputs, [0,max_words],[logistic_batch_size,1]) 
log_doc_embed = tf.nn.embedding_lookup(doc_embeddings,log_doc_indices) 
# concatenate embeddings 
log_final_embed = tf.concat(axis=1, values=[log_embed, tf.squeeze(log_doc_embed)])
  1. Next, we will create the prediction and accuracy functions on the graph so that we can evaluate the model's performance during training. Then we will declare an optimization function and initialize all the variables:
prediction = tf.round(tf.sigmoid(model_output)) 
predictions_correct = tf.cast(tf.equal(prediction, tf.cast(log_y_target, tf.float32)), tf.float32) 
accuracy = tf.reduce_mean(predictions_correct) 
# Declare optimizer 
logistic_opt = tf.train.GradientDescentOptimizer(learning_rate=0.01) 
logistic_train_step = logistic_opt.minimize(logistic_loss, var_list=[A, b]) 
# Initialize Variables 
init = tf.global_variables_initializer() 
sess.run(init)
  1. Now we can start the logistic model training:
train_loss = [] 
test_loss = [] 
train_acc = [] 
test_acc = [] 
i_data = [] 
for i in range(10000): 
    rand_index = np.random.choice(text_data_train.shape[0], size=logistic_batch_size) 
    rand_x = text_data_train[rand_index] 
    # Append review index at the end of text data 
    rand_x_doc_indices = train_indices[rand_index] 
    rand_x = np.hstack((rand_x, np.transpose([rand_x_doc_indices]))) 
    rand_y = np.transpose([target_train[rand_index]]) 

    feed_dict = {log_x_inputs : rand_x, log_y_target : rand_y} 
    sess.run(logistic_train_step, feed_dict=feed_dict) 

    # Only record loss and accuracy every 100 generations 
    if (i+1)%100==0: 
        rand_index_test = np.random.choice(text_data_test.shape[0], size=logistic_batch_size) 
        rand_x_test = text_data_test[rand_index_test] 
        # Append review index at the end of text data 
        rand_x_doc_indices_test = test_indices[rand_index_test] 
        rand_x_test = np.hstack((rand_x_test, np.transpose([rand_x_doc_indices_test]))) 
        rand_y_test = np.transpose([target_test[rand_index_test]]) 

        test_feed_dict = {log_x_inputs: rand_x_test, log_y_target: rand_y_test} 

        i_data.append(i+1) 
        train_loss_temp = sess.run(logistic_loss, feed_dict=feed_dict) 
        train_loss.append(train_loss_temp) 

        test_loss_temp = sess.run(logistic_loss, feed_dict=test_feed_dict) 
        test_loss.append(test_loss_temp) 

        train_acc_temp = sess.run(accuracy, feed_dict=feed_dict) 
        train_acc.append(train_acc_temp) 

        test_acc_temp = sess.run(accuracy, feed_dict=test_feed_dict) 
        test_acc.append(test_acc_temp) 
    if (i+1)%500==0: 
        acc_and_loss = [i+1, train_loss_temp, test_loss_temp, train_acc_temp, test_acc_temp] 
        acc_and_loss = [np.round(x,2) for x in acc_and_loss] 
        print('Generation # {}. Train Loss (Test Loss): {:.2f} ({:.2f}). Train Acc (Test Acc): {:.2f} ({:.2f})'.format(*acc_and_loss))
  1. This results in the following output:
Generation # 500. Train Loss (Test Loss): 5.62 (7.45). Train Acc (Test Acc): 0.52 (0.48) 
Generation # 10000. Train Loss (Test Loss): 2.35 (2.51). Train Acc (Test Acc): 0.59 (0.58)
  1. We should also note that we created a separate data batch generation method, called doc2vec, in the text_helpers.generate_batch_data() function, which we used in the first part of this recipe to train the doc2vec embeddings. Here is an excerpt of that function pertaining to this method (an example call follows the excerpt):
def generate_batch_data(sentences, batch_size, window_size, method='skip_gram'): 
    # Fill up data batch 
    batch_data = [] 
    label_data = [] 
    while len(batch_data) < batch_size: 
        # select random sentence to start 
        rand_sentence_ix = int(np.random.choice(len(sentences), size=1)) 
        rand_sentence = sentences[rand_sentence_ix] 
        # Generate consecutive windows to look at 
        window_sequences = [rand_sentence[max((ix-window_size),0):(ix+window_size+1)] for ix, x in enumerate(rand_sentence)] 
        # Denote which element of each window is the center word of interest 
        label_indices = [ix if ix<window_size else window_size for ix,x in enumerate(window_sequences)] 

        # Pull out center word of interest for each window and create a tuple for each window 
        if method=='skip_gram': 
            ... 
        elif method=='cbow': 
            ... 
        elif method=='doc2vec': 
            # For doc2vec we keep LHS window only to predict target word 
            batch_and_labels = [(rand_sentence[i:i+window_size], rand_sentence[i+window_size]) for i in range(0, len(rand_sentence)-window_size)] 
            batch, labels = [list(x) for x in zip(*batch_and_labels)] 
            # Add document index to batch!! Remember that we must extract the last index in batch for the doc-index 
            batch = [x + [rand_sentence_ix] for x in batch] 
        else: 
            raise ValueError('Method {} not implemented yet.'.format(method)) 

        # extract batch and labels 
        batch_data.extend(batch[:batch_size]) 
        label_data.extend(labels[:batch_size]) 
    # Trim batch and label at the end 
    batch_data = batch_data[:batch_size] 
    label_data = label_data[:batch_size] 

    # Convert to numpy array 
    batch_data = np.array(batch_data) 
    label_data = np.transpose(np.array([label_data])) 

    return batch_data, label_data
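For reference, this is how the method is invoked (the same call used in the training loop above); the returned batch rows carry the document index in the last column:

batch, labels = text_helpers.generate_batch_data(text_data, batch_size, 
                                                 window_size, method='doc2vec') 
print(batch.shape)    # (batch_size, window_size + 1); last column is the doc index 
print(labels.shape)   # (batch_size, 1); the target word for each window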

How it works...

In this recipe, we ran two training loops. The first fit the doc2vec embeddings, and the second fit a logistic regression to predict movie review sentiment.

While we did not improve the sentiment prediction accuracy by much (it is still slightly below 60%), we successfully implemented the concatenated version of doc2vec on the movie corpus. To improve our accuracy, we should try different parameters for the doc2vec embeddings, and perhaps a more complex model, since logistic regression may not capture all of the non-linear behavior in natural language.
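As one example of the kind of more complex model this suggests, a single non-linear hidden layer could be placed between the concatenated embeddings and the sigmoid output. This is only a sketch of the idea, not code from the recipe; the hidden size and initializers are arbitrary choices:

hidden_size = 50   # arbitrary illustrative choice 

W1 = tf.Variable(tf.random_normal(shape=[concatenated_size, hidden_size])) 
b1 = tf.Variable(tf.random_normal(shape=[hidden_size])) 
W2 = tf.Variable(tf.random_normal(shape=[hidden_size, 1])) 
b2 = tf.Variable(tf.random_normal(shape=[1, 1])) 

# A non-linear hidden layer on top of the doc2vec features lets the 
# classifier capture interactions that a plain logistic regression cannot 
hidden = tf.nn.relu(tf.add(tf.matmul(log_final_embed, W1), b1)) 
nn_output = tf.add(tf.matmul(hidden, W2), b2) 
nn_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits( 
    logits=nn_output, labels=tf.cast(log_y_target, tf.float32)))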
