Loading and preparing the text8 dataset
We now apply the same loading and preprocessing steps to the text8 dataset:
from datasetslib.text8 import Text8
text8 = Text8()
text8.load_data()
# downloads data, converts words to ids, converts files to a list of ids
print('Train:', text8.part['train'][0:5])
print('Vocabulary Length =', text8.vocab_len)
We find that the vocabulary contains roughly 254,000 words:
Train: [5233, 3083, 11, 5, 194]
Vocabulary Length = 253854
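Some tutorials manipulate this data by keeping only the most common words or by truncating the vocabulary to 10,000 words. Here, however, we use the full data from the first file of the text8 dataset, together with its complete vocabulary.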
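For comparison, the vocabulary truncation that other tutorials apply can be sketched in a few lines of plain Python. This is only an illustration and is not part of `datasetslib`: the helper name `build_truncated_vocab`, the `UNK` token, and the toy sentence are assumptions made for this example.

```python
from collections import Counter


def build_truncated_vocab(words, max_size):
    """Keep the (max_size - 1) most frequent words; map all others to id 0 (UNK).

    Hypothetical helper for illustration only, not a datasetslib function.
    """
    counts = Counter(words)
    word2id = {'UNK': 0}
    # most_common returns (word, count) pairs sorted by descending count
    for word, _ in counts.most_common(max_size - 1):
        word2id[word] = len(word2id)
    # Any word that did not make the cut falls back to the UNK id
    ids = [word2id.get(w, 0) for w in words]
    return ids, word2id


words = 'the cat sat on the mat the cat ran'.split()
ids, word2id = build_truncated_vocab(words, max_size=3)
print(word2id)  # {'UNK': 0, 'the': 1, 'cat': 2}
print(ids)      # [1, 2, 0, 0, 1, 0, 1, 2, 0]
```

With `max_size=10000` the same idea would cap the vocabulary at 10,000 entries, at the cost of collapsing every rarer word into a single id.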
Prepare the CBOW pairs:
text8.skip_window=2
text8.reset_index_in_epoch()
# in CBOW input is the context word and output is the target word
y_batch, x_batch = text8.next_batch_cbow()
print('The CBOW pairs : context,target')
for i in range(5 * text8.skip_window):
    print('(', [text8.id2word[x_i] for x_i in x_batch[i]],
          ',', y_batch[i], text8.id2word[y_batch[i]], ')')
The output is:
The CBOW pairs : context,target
( ['anarchism', 'originated', 'a', 'term'] , 11 as )
( ['originated', 'as', 'term', 'of'] , 5 a )
( ['as', 'a', 'of', 'abuse'] , 194 term )
( ['a', 'term', 'abuse', 'first'] , 1 of )
( ['term', 'of', 'first', 'used'] , 3133 abuse )
( ['of', 'abuse', 'used', 'against'] , 45 first )
( ['abuse', 'first', 'against', 'early'] , 58 used )
( ['first', 'used', 'early', 'working'] , 155 against )
( ['used', 'against', 'working', 'class'] , 127 early )
( ['against', 'early', 'class', 'radicals'] , 741 working )
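To make the structure of these pairs concrete, the following is a minimal sketch of how CBOW pairs could be built from a token list with a window of `skip_window` words on each side. It does not reproduce the internals of `next_batch_cbow()`, which are not shown here; the function name `cbow_pairs` is an assumption made for this example.

```python
def cbow_pairs(tokens, skip_window):
    """Return (context_words, target_word) pairs for every full window.

    Hypothetical helper for illustration only, not a datasetslib function.
    """
    pairs = []
    # Only positions with skip_window words on both sides form a full window
    for i in range(skip_window, len(tokens) - skip_window):
        left = tokens[i - skip_window:i]
        right = tokens[i + 1:i + skip_window + 1]
        pairs.append((left + right, tokens[i]))
    return pairs


tokens = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse']
pairs = cbow_pairs(tokens, skip_window=2)
for context, target in pairs:
    print(context, '->', target)
```

On these seven words the sketch yields three pairs whose contexts and targets match the first three rows of the output above.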
Prepare the skip-gram pairs:
text8.skip_window=2
text8.reset_index_in_epoch()
# in skip-gram input is the target word and output is the context word
x_batch, y_batch = text8.next_batch()
print('The skip-gram pairs : target,context')
for i in range(5 * text8.skip_window):
    print('(', x_batch[i], text8.id2word[x_batch[i]],
          ',', y_batch[i], text8.id2word[y_batch[i]], ')')
The output is:
The skip-gram pairs : target,context
( 11 as , 5233 anarchism )
( 11 as , 3083 originated )
( 11 as , 5 a )
( 11 as , 194 term )
( 5 a , 3083 originated )
( 5 a , 11 as )
( 5 a , 194 term )
( 5 a , 1 of )
( 194 term , 11 as )
( 194 term , 5 a )
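The skip-gram pairs can be illustrated the same way. The sketch below assumes that every word inside the window is paired with the target in left-to-right order, which is consistent with the output shown above; the actual `next_batch()` implementation may differ, and the name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(tokens, skip_window):
    """Return (target_word, context_word) pairs for every full window.

    Hypothetical helper for illustration only, not a datasetslib function.
    """
    pairs = []
    for i in range(skip_window, len(tokens) - skip_window):
        # Pair the target with each neighbour in the window, skipping itself
        for j in range(i - skip_window, i + skip_window + 1):
            if j != i:
                pairs.append((tokens[i], tokens[j]))
    return pairs


tokens = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse']
sg_pairs = skipgram_pairs(tokens, skip_window=2)
for target, context in sg_pairs:
    print(target, '->', context)
```

Each target word produces `2 * skip_window` pairs, so skip-gram generates four training examples per position where CBOW generates one.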