
Intro

Use case: analysis of arbitrary-length sequence data; RNNs can anticipate what comes next.
RNNs look much like feed-forward NNs, but also have backward-facing connections.
At time step t each node sees input x(t) plus the layer's output from the previous step, y(t-1).
Below: "unrolling" a net across a time axis.
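
The recurrence can be written out in a few lines of NumPy. This is only a minimal sketch, assuming a tanh activation and randomly initialised weights; the names rnn_forward, Wx, Wy and b are illustrative choices, not part of any library.

```python
import numpy as np

def rnn_forward(X_steps, Wx, Wy, b):
    """Run one recurrent layer over a list of per-step input batches."""
    batch_size = X_steps[0].shape[0]
    y_prev = np.zeros((batch_size, Wy.shape[0]))  # y(-1) is taken to be all zeros
    outputs = []
    for x_t in X_steps:
        # each step sees the current input x(t) and the previous output y(t-1)
        y_prev = np.tanh(x_t @ Wx + y_prev @ Wy + b)
        outputs.append(y_prev)
    return outputs

rng = np.random.default_rng(0)
n_inputs, n_neurons = 3, 5
Wx = rng.normal(size=(n_inputs, n_neurons))   # input weights
Wy = rng.normal(size=(n_neurons, n_neurons))  # weights on the previous output
b = np.zeros((1, n_neurons))

X0 = np.array([[0., 1., 2.], [3., 4., 5.]])  # inputs at t = 0
X1 = np.array([[9., 8., 7.], [0., 0., 0.]])  # inputs at t = 1
Y0, Y1 = rnn_forward([X0, X1], Wx, Wy, b)
print(Y0.shape, Y1.shape)
```

The same weights are reused at every step; only y_prev changes from one step to the next.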

Memory Cells

A network node that preserves state across time is called a cell (memory cell).
h(t) is a cell's "hidden" state at time=t.

Input/Output Sequences

RNNs can be used to predict a time-shifted version of the input (sequence-to-sequence), a sentiment score (sequence-to-vector), or an image caption (vector-to-sequence).
Sequence-to-vector nets = encoders; vector-to-sequence nets = decoders. Chaining an encoder and a decoder gives a delayed sequence-to-sequence net; one use case: language translation.
Below:
- Top Left: Sequence-to-sequence
- Top Right: Sequence-to-vector
- Bottom Left: Vector-to-sequence
- Bottom Right: Delayed-sequence-to-sequence

Basic RNNs in TF

RNN design: layer of 5 recurrent cells with tanh activation; runs over 2 time steps, and uses vectors of size=3 at each step.

import tensorflow as tf

n_inputs = 3
n_neurons = 5

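# one recurrent layer unrolled over two time steps
# (looks like a two-layer net, but both steps share the same weights)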

X0 = tf.placeholder(tf.float32, [None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])

Wx = tf.Variable(tf.random_normal(shape=[n_inputs, n_neurons],dtype=tf.float32))
Wy = tf.Variable(tf.random_normal(shape=[n_neurons,n_neurons],dtype=tf.float32))

b = tf.Variable(tf.zeros([1, n_neurons], dtype=tf.float32))

Y0 = tf.tanh(tf.matmul(X0, Wx) + b)
Y1 = tf.tanh(tf.matmul(Y0, Wy) + tf.matmul(X1, Wx) + b)

init = tf.global_variables_initializer()

# feed inputs for both time steps

import numpy as np
# mini-batch rows: instance 0, instance 1, instance 2, instance 3

X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]]) # t = 0
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]]) # t = 1

# Y0, Y1 = network outputs at both time steps

with tf.Session() as sess:
    init.run()
    Y0_val, Y1_val = sess.run([Y0, Y1], feed_dict={X0: X0_batch, X1: X1_batch})

print("output at t=0:\n",Y0_val,"\n","output at t=1\n",Y1_val)

output at t=0:
 [[-0.77183092 -0.99924457  0.23752896 -0.63130957 -0.83723265]
 [-0.92028087 -1.          0.99004787 -0.87230623 -0.99995315]
 [-0.97358704 -1.          0.999919   -0.95966864 -1.        ]
 [ 0.99999094 -0.99890459  0.9991411   0.99996841 -0.99999803]] 
 output at t=1
 [[ 0.99512661 -1.          0.99997395 -0.99830353 -1.        ]
 [ 0.99977976  0.99013239 -0.96352106 -0.99476629  0.97579277]
 [ 0.99981618 -0.99989575  0.99114233 -0.99827981 -0.99984008]
 [ 0.54805535 -0.84061396 -0.99912792 -0.47432473 -0.99921536]]

Unrolling through Time (Static) using static_rnn()

tf.reset_default_graph()

n_inputs = 3
n_neurons = 5

X0 = tf.placeholder(tf.float32, [None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])

# BasicRNNCell() -- memcell "factory"

basic_cell = tf.contrib.rnn.BasicRNNCell(
    num_units=n_neurons)

# static_rnn() -- creates unrolled RNN net by chaining cells.
# returns 1) python list of output tensors for each time step
#         2) tensor of final network states

output_seqs, states = tf.contrib.rnn.static_rnn(
    basic_cell, 
    [X0, X1], 
    dtype=tf.float32)

Y0, Y1 = output_seqs

init = tf.global_variables_initializer()

# feed inputs for both time steps

import numpy as np
# mini-batch rows: instance 0, instance 1, instance 2, instance 3

X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]]) # t = 0
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]]) # t = 1

# Y0, Y1 = network outputs at both time steps

with tf.Session() as sess:
    init.run()
    Y0_val, Y1_val = sess.run([Y0, Y1], feed_dict={X0: X0_batch, X1: X1_batch})

print("output at t=0:\n",Y0_val,"\n","output at t=1\n",Y1_val)

output at t=0:
 [[ 0.42442048  0.92431569 -0.2353479  -0.90074939 -0.94408685]
 [ 0.73783255  0.98977458 -0.72123086 -0.99919385 -0.99999249]
 [ 0.89336294  0.99865782 -0.9186905  -0.99999398 -1.        ]
 [-0.99143326 -0.99993676 -0.37607926  0.88796568 -0.99899191]] 
 output at t=1
 [[ 0.81709599  0.48319042 -0.96708876 -0.9998284  -1.        ]
 [-0.18962485 -0.81231028 -0.21763545  0.88739753  0.57306314]
 [ 0.17130674 -0.6411857  -0.86380148 -0.95413983 -0.99999553]
 [-0.07749119 -0.86547101 -0.00461033 -0.91877526 -0.99582738]]

Simplification

tf.reset_default_graph()

n_steps = 2
n_inputs = 3
n_neurons = 5

# this time, use placeholder with add'l dimension for #timesteps
#X0 = tf.placeholder(tf.float32, [None, n_inputs])
#X1 = tf.placeholder(tf.float32, [None, n_inputs])
X =   tf.placeholder(tf.float32, [None, n_steps, n_inputs])

#print(X)

# transpose - make time steps = 1st dimension
# unstack - extract list of tensors

X_seqs = tf.unstack(
    tf.transpose(
        X, perm=[1, 0, 2]))

#print(X_seqs)

# BasicRNNCell() -- memcell "factory"

basic_cell = tf.contrib.rnn.BasicRNNCell(
    num_units=n_neurons)

# static_rnn() -- creates unrolled RNN net by chaining cells.
# returns 1) python list of output tensors for each time step
#         2) tensor of final network states

output_seqs, states = tf.contrib.rnn.static_rnn(
    basic_cell, 
    X_seqs, 
    dtype=tf.float32)

#Y0, Y1 = output_seqs

# stack - merge output tensors
# transpose - swap 1st two dimensions
# returns tensor of shape [None, n_steps, n_neurons]

outputs = tf.transpose(
    tf.stack(output_seqs), 
    perm=[1,0,2])

init = tf.global_variables_initializer()

X_batch = np.array([
        # t = 0      t = 1 
        [[0, 1, 2], [9, 8, 7]], # instance 1
        [[3, 4, 5], [0, 0, 0]], # instance 2
        [[6, 7, 8], [6, 5, 4]], # instance 3
        [[9, 0, 1], [3, 2, 1]], # instance 4
    ])

with tf.Session() as sess:
    init.run()
    outputs_val = outputs.eval(feed_dict={X: X_batch})

print(outputs_val)

[[[ 0.76157701  0.11581181  0.64773971 -0.79434019 -0.86054337]
  [ 0.99998951 -0.66595364  0.99812627 -1.          0.84574401]]

 [[ 0.99683905  0.29572889  0.98365188 -0.99992883 -0.88169324]
  [ 0.41841054 -0.92049074 -0.64612901 -0.73361856  0.29283327]]

 [[ 0.99996316  0.45685658  0.99936479 -1.         -0.89980829]
  [ 0.99907684 -0.87088716  0.94328976 -0.9999997   0.87934762]]

 [[ 0.12318966  0.02264917  0.99982244 -0.99998975  0.99996465]
  [ 0.9525854  -0.56515652  0.08665188 -0.99705428  0.87525886]]]

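The code above is still not ideal: it builds a graph containing one cell per time step. With many time steps the graph gets large and ugly, and backprop can cause out-of-memory errors because every tensor value from the forward pass must be kept to compute the gradients.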

Unrolling through Time using dynamic_rnn()

Uses while_loop() to iterate over the memcell the required number of times.
Set swap_memory=True to swap GPU memory to CPU memory during backprop if needed (avoids OOM errors).
Accepts a single tensor holding all inputs at every time step and outputs a single tensor; no stack/unstack/transpose ops required.

tf.reset_default_graph()

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

basic_cell = tf.contrib.rnn.BasicRNNCell(
    num_units=n_neurons)

outputs, states = tf.nn.dynamic_rnn(
    basic_cell, X, dtype=tf.float32)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()
    outputs_val = outputs.eval(feed_dict={X: X_batch})

print(outputs_val)

[[[ 0.01341763 -0.10483158 -0.94257653  0.83843452 -0.20272173]
  [ 0.99978089 -0.63150525 -0.99999148  0.99999386 -0.87993085]]

 [[ 0.94205797 -0.13386673 -0.9997741   0.99812031 -0.64444101]
  [-0.6134249  -0.55738503  0.39783546  0.89031053  0.04465704]]

 [[ 0.99817288 -0.16267382 -0.99999928  0.99997997 -0.86824256]
  [ 0.99097538 -0.61533296 -0.99695957  0.99986053 -0.64558744]]

 [[ 0.9963541   0.23641461  0.75174934  0.98267573 -0.97034496]
  [ 0.85169196 -0.07830215 -0.3604137   0.95550352  0.12307668]]]

Variable-Length Input Sequences

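Most problems have variable-length inputs (like sentences).
dynamic_rnn() handles this via its sequence_length param: a 1D tensor giving the length of each instance's input sequence. Shorter sequences must be zero padded to fit the input tensor.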

tf.reset_default_graph()

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

seq_length = tf.placeholder(tf.int32, [None])

basic_cell = tf.contrib.rnn.BasicRNNCell(
    num_units=n_neurons)

outputs, states = tf.nn.dynamic_rnn(
    basic_cell, X, dtype=tf.float32,
    sequence_length=seq_length)

X_batch = np.array([
        [[0, 1, 2], [9, 8, 7]], # instance 1
        [[3, 4, 5], [0, 0, 0]], # instance 2 -- zero padded
        [[6, 7, 8], [6, 5, 4]], # instance 3
        [[9, 0, 1], [3, 2, 1]], # instance 4
    ])

seq_length_batch = np.array([2,1,2,2])

init = tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()
    outputs_val, states_val = sess.run(
        [outputs, states], 
        feed_dict={X: X_batch, seq_length: seq_length_batch})

# RNN should output zero vectors for any time step 
# beyond input sequence length
print(outputs_val)

[[[ 0.28581977 -0.77421445 -0.34181327 -0.87767971 -0.91387445]
  [ 0.99970448 -1.          0.79238343 -1.         -0.9997654 ]]

 [[ 0.96786171 -0.99937457 -0.03243476 -0.99988878 -0.99875116]
  [ 0.          0.          0.          0.          0.        ]]

 [[ 0.99903995 -0.99999839  0.28328663 -0.99999982 -0.99998271]
  [ 0.96896154 -0.99999189  0.43341497 -0.99996883 -0.98279852]]

 [[ 0.9976812  -0.99999118  0.99979782 -0.99983948  0.84931362]
  [ 0.57188803 -0.99268627 -0.30526906 -0.99518502  0.109933  ]]]

# states tensor contains the final state of each cell (zero-padded steps excluded)
print(states_val)

[[ 0.99970448 -1.          0.79238343 -1.         -0.9997654 ]
 [ 0.96786171 -0.99937457 -0.03243476 -0.99988878 -0.99875116]
 [ 0.96896154 -0.99999189  0.43341497 -0.99996883 -0.98279852]
 [ 0.57188803 -0.99268627 -0.30526906 -0.99518502  0.109933  ]]
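
The visible effect of sequence_length can be imitated in NumPy. This is a sketch of the observable behaviour only (zeroed outputs past each sequence's end, final state taken from the last valid step), not of how dynamic_rnn() is implemented. It relies on the fact that a basic RNN cell's state equals its output; mask_by_length is a hypothetical helper name.

```python
import numpy as np

def mask_by_length(outputs, seq_length):
    """Zero out steps past each instance's length and pick its last valid output."""
    n_steps = outputs.shape[1]
    # valid[i, t] is True while t < seq_length[i]
    valid = np.arange(n_steps)[None, :] < seq_length[:, None]
    masked = outputs * valid[:, :, None]
    # last valid step of every instance (state == output for a basic RNN cell)
    last = masked[np.arange(len(seq_length)), seq_length - 1]
    return masked, last

outputs = np.ones((4, 2, 5))          # stand-in for raw RNN outputs
seq_length = np.array([2, 1, 2, 2])   # second instance is zero padded
masked, last = mask_by_length(outputs, seq_length)
print(masked[1])   # second time step is all zeros
```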

Variable-Length Output Sequences

Output sequence lengths are typically not equal to input lengths, and are usually not known in advance.
Most common solution: define an end-of-sequence (EOS) token; any output past the EOS is ignored.
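
A minimal sketch of the EOS idea, in plain Python. The token string "<eos>" and the helper truncate_at_eos are assumptions made for illustration; real vocabularies define their own marker.

```python
EOS = "<eos>"  # assumed marker token

def truncate_at_eos(tokens, eos=EOS):
    """Return the tokens before the first EOS; keep everything if there is none."""
    if eos in tokens:
        return tokens[:tokens.index(eos)]
    return list(tokens)

generated = ["je", "bois", "du", "lait", "<eos>", "lait", "du"]
print(truncate_at_eos(generated))
```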

RNN Training

Unroll through time (as shown above) then use backprop through time (BPTT).

RNN Training: Classifier

Example: use MNIST (a CNN would be better, but let's keep it simple).
Treat each image as a sequence of 28 rows of 28 pixels each.
Use 150 recurrent cells + a fully-connected layer of 10 neurons (1 per class) connected to the output of the last time step.
Followed by a softmax layer.

# similar to MNIST classifier
# unrolled RNN replaces hidden layers

tf.reset_default_graph()

from tensorflow.contrib.layers import fully_connected

n_steps = 28
n_inputs = 28
n_neurons = 150
n_outputs = 10
learning_rate = 0.001

# y = placeholder for target classes

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.contrib.rnn.BasicRNNCell(
    num_units=n_neurons)

outputs, states = tf.nn.dynamic_rnn(
    basic_cell, X, dtype=tf.float32)

logits = fully_connected(
    states, n_outputs, activation_fn=None)

xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=y, logits=logits)

loss = tf.reduce_mean(
    xentropy)

optimizer = tf.train.AdamOptimizer(
    learning_rate=learning_rate)

training_op = optimizer.minimize(
    loss)

correct = tf.nn.in_top_k(
    logits, y, 1)

accuracy = tf.reduce_mean(
    tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

# load MNIST data, reshape to [batch_size, n_steps, n_inputs]

from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/")

X_test = mnist.test.images.reshape((-1, n_steps, n_inputs))
y_test = mnist.test.labels

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz

# ready to run. reshape each training batch before feeding to net.

n_epochs = 10
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):

            X_batch, y_batch = mnist.train.next_batch(batch_size)
            X_batch = X_batch.reshape(
                (-1, n_steps, n_inputs))

            sess.run(
                training_op, 
                feed_dict={X: X_batch, y: y_batch})

        acc_train = accuracy.eval(
            feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(
            feed_dict={X: X_test, y: y_test})

        print(epoch, 
              "Train accuracy:", acc_train, 
              "Test accuracy:",  acc_test)

0 Train accuracy: 0.953333 Test accuracy: 0.8711
1 Train accuracy: 0.953333 Test accuracy: 0.9417
2 Train accuracy: 0.953333 Test accuracy: 0.9432
3 Train accuracy: 0.946667 Test accuracy: 0.9595
4 Train accuracy: 0.98 Test accuracy: 0.9627
5 Train accuracy: 0.966667 Test accuracy: 0.9666
6 Train accuracy: 0.96 Test accuracy: 0.961
7 Train accuracy: 0.973333 Test accuracy: 0.9729
8 Train accuracy: 0.986667 Test accuracy: 0.9702
9 Train accuracy: 0.986667 Test accuracy: 0.9732

RNN Training: Predicting Time Series

t_min, t_max = 0, 30
resolution = 0.1

def time_series(t):
    return t * np.sin(t) / 3 + 2 * np.sin(t*5)

def next_batch(batch_size, n_steps):
    t0 = np.random.rand(batch_size, 1) * (t_max - t_min - n_steps * resolution)
    Ts = t0 + np.arange(0., n_steps + 1) * resolution
    ys = time_series(Ts)
    return ys[:, :-1].reshape(-1, n_steps, 1), ys[:, 1:].reshape(-1, n_steps, 1)

t = np.linspace(t_min, t_max, int((t_max - t_min) // resolution))  # num must be an int

n_steps = 20
t_instance = np.linspace(
    12.2, 12.2 + resolution * (n_steps + 1), n_steps + 1)

# each training instance = sequence of 20 consecutive values
# target = same sequence shifted one time step into the future

tf.reset_default_graph()

n_steps = 20
n_inputs = 1
n_neurons = 100
n_outputs = 1

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])

cell = tf.contrib.rnn.BasicRNNCell(
    num_units=n_neurons, 
    activation=tf.nn.relu)

outputs, states = tf.nn.dynamic_rnn(
    cell, X, dtype=tf.float32)

print(outputs.shape)

(?, 20, 100)

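# output at each time step is now a vector of size 100,
# but we want a single output value at each step.

# use OutputProjectionWrapper()
# -- adds FC layer (no activation) on top of each output.
# The RNN must be rebuilt with the wrapped cell (in a fresh graph);
# otherwise outputs keeps shape (?, 20, 100), as in the predictions
# printed after the training run below.

tf.reset_default_graph()
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])

cell = tf.contrib.rnn.OutputProjectionWrapper(
    tf.contrib.rnn.BasicRNNCell(
        num_units=n_neurons, 
        activation=tf.nn.relu),
    output_size=n_outputs)

outputs, states = tf.nn.dynamic_rnn(
    cell, X, dtype=tf.float32)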

# define cost function using MSE
# use Adam optimizer

learning_rate = 0.001
loss = tf.reduce_mean(
    tf.square(outputs - y))

optimizer = tf.train.AdamOptimizer(
    learning_rate=learning_rate)

training_op = optimizer.minimize(loss)
init = tf.global_variables_initializer()

# initialize & run

n_iterations = 1000
batch_size = 50

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch, y_batch = next_batch(batch_size, n_steps)
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if iteration % 100 == 0:
            mse = loss.eval(feed_dict={X: X_batch, y: y_batch})
            print(iteration, "\tMSE:", mse)


    # use trained model to make some predictions
    X_new = time_series(np.array(t_instance[:-1].reshape(-1, n_steps, n_inputs)))
    y_pred = sess.run(outputs, feed_dict={X: X_new})
    print(y_pred)

0     MSE: 15.3099
100     MSE: 13.5276
200     MSE: 11.0956
300     MSE: 9.91156
400     MSE: 14.0311
500     MSE: 9.73811
600     MSE: 9.23351
700     MSE: 9.64445
800     MSE: 8.98904
900     MSE: 10.849
[[[ 0.          0.          0.         ...,  0.          0.          0.        ]
  [ 0.          0.04218276  0.         ...,  0.          0.          0.        ]
  [ 0.          0.14342034  0.         ...,  0.          0.          0.        ]
  ..., 
  [ 6.67315388  0.          6.39087296 ...,  6.9017005   6.30435514
    6.23329258]
  [ 6.61708975  0.          6.31429434 ...,  6.58116341  6.19745445
    6.11896658]
  [ 5.9406209   0.          5.73649979 ...,  5.63920403  5.5386672
    5.47510672]]]

import matplotlib.pyplot as plt

plt.title("Testing the model", fontsize=14)

plt.plot(
    t_instance[:-1], 
    time_series(t_instance[:-1]), 
    "bo", markersize=10, label="instance")

plt.plot(
    t_instance[1:], 
    time_series(t_instance[1:]), 
    "w*", markersize=10, label="target")

plt.plot(
    t_instance[1:], 
    y_pred[0,:,0], 
    "r.", markersize=10, label="prediction")

plt.legend(loc="upper left")
plt.xlabel("Time")
#save_fig("time_series_pred_plot")
plt.show()

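OutputProjectionWrapper() = simplest solution for reducing output sequences to one value/timestep, but not the most efficient.
More efficient solution shown below: reshape the RNN outputs to [batch_size * n_steps, n_neurons], apply a single FC layer, then reshape back to [batch_size, n_steps, n_outputs]. Significant speed boost.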
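
The reshape trick can be checked in NumPy before reading the TensorFlow version. This sketch assumes random stand-in values for the RNN outputs and for the projection weights W and b; it only shows that one big matrix product gives the same result as projecting each time step separately.

```python
import numpy as np

batch_size, n_steps, n_neurons, n_outputs = 4, 20, 100, 1
rng = np.random.default_rng(0)
rnn_outputs = rng.normal(size=(batch_size, n_steps, n_neurons))
W = rng.normal(size=(n_neurons, n_outputs))   # one shared projection
b = np.zeros(n_outputs)

stacked = rnn_outputs.reshape(-1, n_neurons)          # [batch * steps, neurons]
projected = stacked @ W + b                           # single matrix product
outputs = projected.reshape(-1, n_steps, n_outputs)   # [batch, steps, outputs]

# reference: project each time step on its own
per_step = np.stack(
    [rnn_outputs[:, t] @ W + b for t in range(n_steps)], axis=1)
print(outputs.shape, np.allclose(outputs, per_step))
```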

tf.reset_default_graph()

n_steps = 20
n_inputs = 1
n_neurons = 100
n_outputs = 1

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])

cell = tf.contrib.rnn.BasicRNNCell(
    num_units=n_neurons, 
    activation=tf.nn.relu)

rnn_outputs, states = tf.nn.dynamic_rnn(
    cell, X, dtype=tf.float32)

# stack outputs using reshape
stacked_rnn_outputs = tf.reshape(
    rnn_outputs, [-1, n_neurons])

print(stacked_rnn_outputs)

# add FC layer -- just a projection, so no activation fn needed
stacked_outputs = fully_connected(
    stacked_rnn_outputs, 
    n_outputs,
    activation_fn=None)

print(stacked_outputs)

# unstack outputs using reshape
outputs = tf.reshape(
    stacked_outputs, [-1, n_steps, n_outputs])

print(outputs)

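# reduce_sum = summed squared error over the batch, so the values
# printed as "MSE" below are sums rather than means
loss = tf.reduce_sum(tf.square(outputs - y))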
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

#initialize & run
init = tf.global_variables_initializer()

n_iterations = 1000
batch_size = 50

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch, y_batch = next_batch(batch_size, n_steps)
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if iteration % 100 == 0:
            mse = loss.eval(feed_dict={X: X_batch, y: y_batch})
            print(iteration, "\tMSE:", mse)


    # use trained model to make some predictions
    X_new = time_series(np.array(t_instance[:-1].reshape(-1, n_steps, n_inputs)))
    y_pred = sess.run(outputs, feed_dict={X: X_new})
    print(y_pred)

Tensor("Reshape:0", shape=(?, 100), dtype=float32)
Tensor("fully_connected/BiasAdd:0", shape=(?, 1), dtype=float32)
Tensor("Reshape_1:0", shape=(?, 20, 1), dtype=float32)
0     MSE: 22963.7
100     MSE: 743.444
200     MSE: 276.131
300     MSE: 117.955
400     MSE: 53.3529
500     MSE: 63.4189
600     MSE: 45.1415
700     MSE: 41.5129
800     MSE: 53.4219
900     MSE: 43.2203
[[[-3.46527553]
  [-2.46867704]
  [-1.10144436]
  [ 0.69717044]
  [ 2.08823276]
  [ 3.13628578]
  [ 3.55210543]
  [ 3.4186697 ]
  [ 2.85978389]
  [ 2.15520501]
  [ 1.67705297]
  [ 1.6919663 ]
  [ 1.93633199]
  [ 2.70151305]
  [ 3.87054777]
  [ 5.11770582]
  [ 6.15701818]
  [ 6.71814394]
  [ 6.69798708]
  [ 6.08309698]]]

plt.title("Testing the model", fontsize=14)
plt.plot(t_instance[:-1], time_series(t_instance[:-1]), "bo", markersize=10, label="instance")
plt.plot(t_instance[1:], time_series(t_instance[1:]), "w*", markersize=10, label="target")
plt.plot(t_instance[1:], y_pred[0,:,0], "r.", markersize=10, label="prediction")
plt.legend(loc="upper left")
plt.xlabel("Time")
plt.show()

Creative RNNs

Use the model to generate new ("creative") sequences.
Provide a seed sequence of length n_steps (e.g. zero-filled).
Use the model to predict the next value and append it to the sequence.
Feed the last n_steps values to the model to predict the next value, and so on.
Should get a new sequence resembling the original time series.
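
The generation loop itself is independent of TensorFlow. A small sketch in plain Python, where toy_model is a hypothetical stand-in for the trained RNN and generate is an illustrative helper name:

```python
def generate(predict_next, seed, n_new):
    """Extend seed by n_new values, always feeding the last len(seed) values back in."""
    n_steps = len(seed)
    sequence = list(seed)
    for _ in range(n_new):
        window = sequence[-n_steps:]           # last n_steps values
        sequence.append(predict_next(window))  # append the predicted next value
    return sequence

# hypothetical stand-in for the trained model: mean of the window plus one
def toy_model(window):
    return sum(window) / len(window) + 1.0

sequence = generate(toy_model, seed=[0.0] * 4, n_new=3)
print(sequence)   # [0.0, 0.0, 0.0, 0.0, 1.0, 1.25, 1.5625]
```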

n_iterations = 2000
batch_size = 50

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch, y_batch = next_batch(batch_size, n_steps)
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if iteration % 100 == 0:
            mse = loss.eval(feed_dict={X: X_batch, y: y_batch})
            print(iteration, "\tMSE:", mse)

    sequence1 = [0. for i in range(n_steps)]
    for iteration in range(len(t) - n_steps):
        X_batch = np.array(sequence1[-n_steps:]).reshape(1, n_steps, 1)
        y_pred = sess.run(outputs, feed_dict={X: X_batch})
        sequence1.append(y_pred[0, -1, 0])

    sequence2 = [time_series(i * resolution + t_min + (t_max-t_min/3)) for i in range(n_steps)]
    for iteration in range(len(t) - n_steps):
        X_batch = np.array(sequence2[-n_steps:]).reshape(1, n_steps, 1)
        y_pred = sess.run(outputs, feed_dict={X: X_batch})
        sequence2.append(y_pred[0, -1, 0])

plt.figure(figsize=(11,4))
plt.subplot(121)
plt.plot(t, sequence1, "b-")
plt.plot(t[:n_steps], sequence1[:n_steps], "b-", linewidth=3)
plt.xlabel("Time")
plt.ylabel("Value")

plt.subplot(122)
plt.plot(t, sequence2, "b-")
plt.plot(t[:n_steps], sequence2[:n_steps], "b-", linewidth=3)
plt.xlabel("Time")
#save_fig("creative_sequence_plot")
plt.show()

0     MSE: 14607.1
100     MSE: 505.605
200     MSE: 167.29
300     MSE: 83.1336
400     MSE: 58.9695
500     MSE: 61.0224
600     MSE: 55.8671
700     MSE: 43.7078
800     MSE: 57.2013
900     MSE: 55.3992
1000     MSE: 54.082
1100     MSE: 55.48
1200     MSE: 39.4618
1300     MSE: 40.7414
1400     MSE: 47.8548
1500     MSE: 43.9252
1600     MSE: 47.892
1700     MSE: 42.0762
1800     MSE: 48.2429
1900     MSE: 42.7509

Deep RNNs

Built by stacking cells into a MultiRNNCell().

tf.reset_default_graph()

n_inputs = 2
n_neurons = 100
n_layers = 3
n_steps = 5
keep_prob = 0.5

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

basic_cell = tf.contrib.rnn.BasicRNNCell(
    num_units=n_neurons)

print(basic_cell)

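# one cell object per layer: reusing a single instance across
# layers fails in TF >= 1.1
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(
    [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
     for _ in range(n_layers)])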

print(multi_layer_cell)

# states = tuple (one tensor/layer, = final state of layer's cell)

outputs, states = tf.nn.dynamic_rnn(
    multi_layer_cell, X, dtype=tf.float32)

init = tf.global_variables_initializer()

import numpy.random as rnd
X_batch = rnd.rand(2, n_steps, n_inputs)

with tf.Session() as sess:
    init.run()
    outputs_val, states_val = sess.run(
        [outputs, states], 
        feed_dict={X: X_batch})

print(outputs_val.shape)

<tensorflow.contrib.rnn.python.ops.core_rnn_cell_impl.BasicRNNCell object at 0x7fd1ff3dbb00>
<tensorflow.contrib.rnn.python.ops.core_rnn_cell_impl.MultiRNNCell object at 0x7fd1d9b7c9e8>
(2, 5, 100)

DRNNs: Multiple GPUs

TO DO

Dropout

Very deep RNNs = danger of overfitting. Use dropout to avoid the problem.
Can simply add a dropout layer before or after the RNN.
To apply dropout between RNN layers, need to use DropoutWrapper.

# apply 50% dropout to inputs of RNN layers
# can apply dropout to outputs via output_keep_prob

tf.reset_default_graph()
from tensorflow.contrib.layers import fully_connected

n_inputs = 1
n_neurons = 100
n_layers = 3
n_steps = 20
n_outputs = 1

keep_prob = 0.5
learning_rate = 0.001

def deep_rnn_with_dropout(X, y, is_training):

    # TF implementation of DropoutWrapper doesn't differentiate
    # between training & testing, so only wrap cells when training.

    def make_cell():
        cell = tf.contrib.rnn.BasicRNNCell(
            num_units=n_neurons)
        if is_training:
            cell = tf.contrib.rnn.DropoutWrapper(
                cell, input_keep_prob=keep_prob)
        return cell

    # one cell object per layer (a shared instance fails in TF >= 1.1)
    multi_layer_cell = tf.contrib.rnn.MultiRNNCell(
        [make_cell() for _ in range(n_layers)])

    rnn_outputs, states = tf.nn.dynamic_rnn(
        multi_layer_cell, X, dtype=tf.float32)

    stacked_rnn_outputs = tf.reshape(
        rnn_outputs, [-1, n_neurons])

    stacked_outputs = fully_connected(
        stacked_rnn_outputs, n_outputs, activation_fn=None)

    outputs = tf.reshape(
        stacked_outputs, [-1, n_steps, n_outputs])

    loss = tf.reduce_sum(
        tf.square(outputs - y))

    optimizer = tf.train.AdamOptimizer(
        learning_rate=learning_rate)

    training_op = optimizer.minimize(loss)

    return outputs, loss, training_op

is_training = True  # must be defined before the graph is built

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])
outputs, loss, training_op = deep_rnn_with_dropout(X, y, is_training)
init = tf.global_variables_initializer()
saver = tf.train.Saver()

Without the is_training switch, dropout would be applied during both training & testing (don't want).
DropoutWrapper doesn't know how to handle this, so you need one graph for training, another for testing.

n_iterations = 2000
batch_size = 50

is_training = True

with tf.Session() as sess:
    if is_training:
        init.run()
        for iteration in range(n_iterations):
            X_batch, y_batch = next_batch(batch_size, n_steps)
            sess.run(
                training_op, 
                feed_dict={X: X_batch, y: y_batch})

            if iteration % 100 == 0:
                mse = loss.eval(
                    feed_dict={X: X_batch, y: y_batch})

                print(iteration, "\tMSE:", mse)

        save_path = saver.save(sess, "/tmp/my_model.ckpt")

    else:
        saver.restore(sess, "/tmp/my_model.ckpt")

        X_new = time_series(
            np.array(t_instance[:-1].reshape(-1, n_steps, n_inputs)))
        y_pred = sess.run(
            outputs, feed_dict={X: X_new})

        plt.title("Testing the model", fontsize=14)
        plt.plot(t_instance[:-1], time_series(t_instance[:-1]), "bo", markersize=10, label="instance")
        plt.plot(t_instance[1:], time_series(t_instance[1:]), "w*", markersize=10, label="target")
        plt.plot(t_instance[1:], y_pred[0,:,0], "r.", markersize=10, label="prediction")
        plt.legend(loc="upper left")
        plt.xlabel("Time")
        plt.show()

0     MSE: 10428.8
100     MSE: 314.521
200     MSE: 152.328
300     MSE: 155.774
400     MSE: 100.226
500     MSE: 80.2064
600     MSE: 92.3898
700     MSE: 55.4301
800     MSE: 50.8537
900     MSE: 47.1413
1000     MSE: 57.1007
1100     MSE: 64.2314
1200     MSE: 51.3272
1300     MSE: 51.1612
1400     MSE: 41.0518
1500     MSE: 42.267
1600     MSE: 29.6838
1700     MSE: 48.4316
1800     MSE: 46.5584
1900     MSE: 40.6252

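# testing: rebuild the graph without dropout, then restore the trained weights

tf.reset_default_graph()
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])
outputs, loss, training_op = deep_rnn_with_dropout(X, y, is_training=False)
saver = tf.train.Saver()

with tf.Session() as sess:

    saver.restore(sess, "/tmp/my_model.ckpt")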

    X_new = time_series(
        np.array(t_instance[:-1].reshape(-1, n_steps, n_inputs)))

    y_pred = sess.run(
        outputs, feed_dict={X: X_new})

    plt.title("Testing the model", fontsize=14)
    plt.plot(t_instance[:-1], time_series(t_instance[:-1]), "bo", markersize=10, label="instance")
    plt.plot(t_instance[1:], time_series(t_instance[1:]), "w*", markersize=10, label="target")
    plt.plot(t_instance[1:], y_pred[0,:,0], "r.", markersize=10, label="prediction")
    plt.legend(loc="upper left")
    plt.xlabel("Time")
    plt.show()

Training across Many Time Steps

problem #1: RNNs are susceptible to vanishing/exploding gradients. Earlier tricks (good initialization, non-saturating activations, batch norm, gradient clipping, faster optimizers) still work, but training time is prohibitively long for even moderately long sequences.
solution #1: truncated backprop through time (unroll the RNN over a limited number of time steps during training, e.g. by truncating the input sequences). Works, but the model cannot learn long-term patterns.
problem #2: memory of early inputs fades away; some information is lost at each time step's transformation.
solution #2: use a long-term memory cell.
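
One simple way to get the truncation in solution #1 is to cut long training sequences into short windows before feeding them to the net, so the RNN is never unrolled over more than n_steps. A sketch in plain Python; split_into_windows is a hypothetical helper, not a TensorFlow function.

```python
def split_into_windows(sequence, n_steps):
    """Cut a long sequence into consecutive (inputs, targets) windows of n_steps each."""
    windows = []
    for start in range(0, len(sequence) - n_steps, n_steps):
        inputs = sequence[start:start + n_steps]
        targets = sequence[start + 1:start + n_steps + 1]  # inputs shifted one step ahead
        windows.append((inputs, targets))
    return windows

long_sequence = list(range(10))
for inputs, targets in split_into_windows(long_sequence, n_steps=3):
    print(inputs, targets)
```

Gradients never flow across window boundaries, which is why patterns longer than n_steps cannot be learned this way.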

Long Short-Term Memory (LSTM) Cell

In TF: use BasicLSTMCell() instead of BasicRNNCell().
State is split into two vectors: h(t) (short-term state) and c(t) (long-term state).
Key feature: net learns what to store in the long-term state, what to read from it, what to throw away.
Four FC layers, each with its own purpose:
- main layer: outputs g(t)
- forget gate: controlled by f(t) - decides which parts of long-term memory to erase
- input gate: controlled by i(t) - decides which parts of g(t) to add to long-term memory
- output gate: controlled by o(t) - decides which parts of long-term state should be read & outputted at this time step.
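
The four layers can be sketched as a single LSTM time step in NumPy. This is an illustration under assumptions (random weights, parameters stored in dicts keyed by layer name, sigmoid gates and a tanh main layer), not the BasicLSTMCell implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b hold the parameters of the four layers."""
    g = np.tanh(x @ W["g"] + h_prev @ U["g"] + b["g"])   # main layer
    f = sigmoid(x @ W["f"] + h_prev @ U["f"] + b["f"])   # forget gate
    i = sigmoid(x @ W["i"] + h_prev @ U["i"] + b["i"])   # input gate
    o = sigmoid(x @ W["o"] + h_prev @ U["o"] + b["o"])   # output gate
    c = f * c_prev + i * g   # erase part of the long-term state, then add new memories
    h = o * np.tanh(c)       # read part of the long-term state as output / short-term state
    return h, c

rng = np.random.default_rng(0)
n_inputs, n_neurons, batch = 3, 5, 2
W = {k: rng.normal(size=(n_inputs, n_neurons)) for k in "gfio"}
U = {k: rng.normal(size=(n_neurons, n_neurons)) for k in "gfio"}
b = {k: np.zeros(n_neurons) for k in "gfio"}

h = c = np.zeros((batch, n_neurons))
for x in rng.normal(size=(4, batch, n_inputs)):   # four time steps
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)
```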

tf.reset_default_graph()

from tensorflow.contrib.layers import fully_connected

n_steps = 28
n_inputs = 28
n_neurons = 150
n_outputs = 10

learning_rate = 0.001

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

# one BasicLSTMCell per layer (a shared instance fails in TF >= 1.1)
multi_cell = tf.contrib.rnn.MultiRNNCell(
    [tf.contrib.rnn.BasicLSTMCell(num_units=n_neurons)
     for _ in range(3)])

outputs, states = tf.nn.dynamic_rnn(
    multi_cell, X, dtype=tf.float32)

top_layer_h_state = states[-1][1]

logits = fully_connected(
    top_layer_h_state, 
    n_outputs, 
    activation_fn=None, scope="softmax")

xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=y, logits=logits)

loss = tf.reduce_mean(
    xentropy, name="loss")

optimizer = tf.train.AdamOptimizer(
    learning_rate=learning_rate)

training_op = optimizer.minimize(loss)

correct = tf.nn.in_top_k(
    logits, y, 1)

accuracy = tf.reduce_mean(
    tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

states

(LSTMStateTuple(c=<tf.Tensor 'rnn/while/Exit_2:0' shape=(?, 150) dtype=float32>, h=<tf.Tensor 'rnn/while/Exit_3:0' shape=(?, 150) dtype=float32>),
 LSTMStateTuple(c=<tf.Tensor 'rnn/while/Exit_4:0' shape=(?, 150) dtype=float32>, h=<tf.Tensor 'rnn/while/Exit_5:0' shape=(?, 150) dtype=float32>),
 LSTMStateTuple(c=<tf.Tensor 'rnn/while/Exit_6:0' shape=(?, 150) dtype=float32>, h=<tf.Tensor 'rnn/while/Exit_7:0' shape=(?, 150) dtype=float32>))

top_layer_h_state

<tf.Tensor 'rnn/while/Exit_7:0' shape=(?, 150) dtype=float32>

n_epochs = 10
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            X_batch = X_batch.reshape((batch_size, n_steps, n_inputs))
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        print("Epoch", epoch, "Train accuracy =", acc_train, "Test accuracy =", acc_test)

Epoch 0 Train accuracy = 0.966667 Test accuracy = 0.9403
Epoch 1 Train accuracy = 0.98 Test accuracy = 0.9742
Epoch 2 Train accuracy = 0.993333 Test accuracy = 0.979
Epoch 3 Train accuracy = 0.993333 Test accuracy = 0.9805
Epoch 4 Train accuracy = 1.0 Test accuracy = 0.9854
Epoch 5 Train accuracy = 0.98 Test accuracy = 0.9827
Epoch 6 Train accuracy = 0.993333 Test accuracy = 0.9851
Epoch 7 Train accuracy = 1.0 Test accuracy = 0.9865
Epoch 8 Train accuracy = 1.0 Test accuracy = 0.9887
Epoch 9 Train accuracy = 0.993333 Test accuracy = 0.9871

Peephole Connections

Basic LSTM cell: gate controllers only see input x(t) & previous short-term state h(t-1).
Improvement: let the gate controllers peek at the long-term state too. Previous long-term state c(t-1) is added as an input to the forget gate & input gate controllers; current long-term state c(t) is added as an input to the output gate controller.

# Peepholes in TF
lstm_cell = tf.contrib.rnn.LSTMCell(
    num_units=n_neurons, 
    use_peepholes=True)

Gated Recurrent Unit (GRU) Cell

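Simplified version of the LSTM cell.
State vectors merged into a single h(t).
Single gate controller manages both the forget gate & the input gate: when one opens, the other closes (if a memory is to be stored, its location is erased first).
No output gate - the full state vector is output at every time step. A new gate controller decides which part of the previous state is shown to the main layer.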
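
A NumPy sketch of one GRU time step, under assumptions: random weights, parameters stored in dicts, and the convention that z near 1 means "store the new value" (some references swap the roles of z and 1 - z). It is not the GRUCell implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W, U, b):
    """One GRU time step; the returned state is also the output."""
    z = sigmoid(x @ W["z"] + h_prev @ U["z"] + b["z"])   # shared forget/input controller
    r = sigmoid(x @ W["r"] + h_prev @ U["r"] + b["r"])   # how much of h(t-1) the main layer sees
    g = np.tanh(x @ W["g"] + (r * h_prev) @ U["g"] + b["g"])   # main layer
    # where z is near 1 the old value is erased and the new one stored
    return (1.0 - z) * h_prev + z * g

rng = np.random.default_rng(0)
n_inputs, n_neurons, batch = 3, 5, 2
W = {k: rng.normal(size=(n_inputs, n_neurons)) for k in "zrg"}
U = {k: rng.normal(size=(n_neurons, n_neurons)) for k in "zrg"}
b = {k: np.zeros(n_neurons) for k in "zrg"}

h = np.zeros((batch, n_neurons))
for x in rng.normal(size=(4, batch, n_inputs)):   # four time steps
    h = gru_step(x, h, W, U, b)
print(h.shape)
```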

# in TF
gru_cell = tf.contrib.rnn.GRUCell(num_units=n_neurons)

Natural Language Processing (NLP)

Mostly based on RNNs
See Word2Vec and Seq2Seq tutorials!
More: Chris Olah, Sebastian Ruder

Word Embeddings

First: need a word representation. Similar words should have similar representations.
Common sol'n: represent each word in vocab by a small, dense vector (an embedding), learned during training.

# create embeddings variable. init with random[-1,+1]

vocabulary_size = 50000
embedding_size = 150
embeddings = tf.Variable(
    tf.random_uniform(
        [vocabulary_size, embedding_size], 
        -1.0, 1.0))

Feeding new sentences to net: preprocess first (split into known words; replace unknown words, numbers, URLs, etc. with predefined tokens). Once you have a list of known words, look up each word's integer id in a dictionary, then its embedding.

train_inputs = tf.placeholder(
    tf.int32, shape=[None]) # from ids...

embed = tf.nn.embedding_lookup(
    embeddings, train_inputs) # ...to embeddings
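
What the lookup returns can be seen with plain NumPy indexing. A sketch with tiny, made-up sizes; id 0 is assumed to stand for the unknown-word token.

```python
import numpy as np

vocabulary_size, embedding_size = 10, 4   # tiny sizes for illustration
rng = np.random.default_rng(0)
embeddings = rng.uniform(-1.0, 1.0, size=(vocabulary_size, embedding_size))

word_ids = np.array([3, 0, 3])   # 0 plays the role of the unknown-word id
embed = embeddings[word_ids]     # row lookup: one embedding vector per id
print(embed.shape)
```

Identical ids always map to the same row, so repeated words share one embedding.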

English => French Encoder-Decoder Network

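English sentences are fed to the encoder; the decoder outputs the French translations.
The French translations are also fed to the decoder as inputs, pushed back by one step.
English sentences are reversed before they are fed to the encoder, so the beginning of the sentence is fed last (useful because it is generally the first thing the decoder has to translate).
At each step the decoder outputs a score for each word in the output vocabulary; a softmax turns the scores into probabilities, and the word with the highest probability is output.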
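
The data preparation and output step described above can be sketched in plain Python. The `<go>` and `<eos>` token names, the helper names and the example scores are hypothetical; this only shows the reversal, the one-step shift and the softmax, not a trained model.

```python
import math

GO, EOS = "<go>", "<eos>"  # hypothetical start / end-of-sequence tokens

def prepare_example(english_tokens, french_tokens):
    """Build encoder inputs, decoder inputs and decoder targets for one pair."""
    encoder_inputs = list(reversed(english_tokens))  # beginning is fed last
    decoder_inputs = [GO] + french_tokens            # targets pushed back one step
    decoder_targets = french_tokens + [EOS]
    return encoder_inputs, decoder_inputs, decoder_targets

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pick_word(scores, vocab):
    """Return the highest-probability word and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best], probs[best]

enc_in, dec_in, dec_target = prepare_example(
    ["I", "drink", "milk"], ["Je", "bois", "du", "lait"])
word, prob = pick_word([0.1, 2.0, -1.0], ["Je", "bois", "lait"])
```

At each step the decoder input is the previous target word, so `dec_in[t]` lines up with `dec_target[t - 1]`.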

from six.moves import urllib

import errno
import os
import zipfile

WORDS_PATH = "datasets/words"
WORDS_URL = 'http://mattmahoney.net/dc/text8.zip'

def mkdir_p(path):
    """Create directories, ok if they already exist.

    This is for python 2 support. In python >=3.2, simply use:
    >>> os.makedirs(path, exist_ok=True)
    """
    try:
        os.makedirs(path)
    except OSError as exc:
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else:
            raise

def fetch_words_data(words_url=WORDS_URL, words_path=WORDS_PATH):
    mkdir_p(words_path)  # use the helper defined above (also works on Python 2)
    zip_path = os.path.join(words_path, "words.zip")
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(words_url, zip_path)
    with zipfile.ZipFile(zip_path) as f:
        data = f.read(f.namelist()[0])
    return data.decode("ascii").split()

words = fetch_words_data()
words[:5]

['anarchism', 'originated', 'as', 'a', 'term']

Build dictionary

from collections import Counter

vocabulary_size = 50000

vocabulary = [("UNK", None)] + Counter(words).most_common(vocabulary_size - 1)
vocabulary = np.array([word for word, _ in vocabulary])

dictionary = {word: code for code, word in enumerate(vocabulary)}

data = np.array([dictionary.get(word, 0) for word in words])

" ".join(words[:9]), data[:9]

('anarchism originated as a term of abuse first used',
 array([5244, 3081,   12,    6,  195,    2, 3135,   46,   59]))

" ".join([vocabulary[word_index] for word_index in [5241, 3081, 12, 6, 195, 2, 3134, 46, 59]])

'anywhere originated as a term of presidency first used'

words[24], data[24]

('culottes', 0)

Generate batches

import random
from collections import deque

def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1 # [ skip_window target skip_window ]
    buffer = deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [ skip_window ]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

data_index=0
batch, labels = generate_batch(8, 2, 1)

batch, [vocabulary[word] for word in batch]

(array([3081, 3081,   12,   12,    6,    6,  195,  195], dtype=int32),
 ['originated', 'originated', 'as', 'as', 'a', 'a', 'term', 'term'])

labels, [vocabulary[word] for word in labels[:, 0]]

(array([[5244],
        [  12],
        [   6],
        [3081],
        [ 195],
        [  12],
        [   6],
        [   2]], dtype=int32),
 ['anarchism', 'as', 'a', 'originated', 'term', 'as', 'a', 'of'])
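
For comparison with the randomly sampled output above, the sketch below enumerates every (center word, context word) pair within the window. It is a simplified, deterministic variant with a hypothetical helper name, not a replacement for `generate_batch`: it does not sample `num_skips` contexts and it handles the sequence edges by truncating the window.

```python
def skipgram_pairs(token_ids, skip_window):
    """Return all (center, context) id pairs within skip_window of each position."""
    pairs = []
    for center in range(len(token_ids)):
        lo = max(0, center - skip_window)
        hi = min(len(token_ids), center + skip_window + 1)
        for ctx in range(lo, hi):
            if ctx != center:  # a word is never its own context
                pairs.append((token_ids[center], token_ids[ctx]))
    return pairs

# ids for "anarchism originated as a", taken from the output above
pairs = skipgram_pairs([5244, 3081, 12, 6], skip_window=1)
print(pairs)
# [(5244, 3081), (3081, 5244), (3081, 12), (12, 3081), (12, 6), (6, 12)]
```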

Build the Model

batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.

# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.

valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64    # Number of negative examples to sample.

learning_rate = 0.01

tf.reset_default_graph()

# Input data.
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

# Look up embeddings for inputs.
init_embeddings = tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0)
embeddings = tf.Variable(init_embeddings)
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# Construct the variables for the NCE loss
nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Compute the average NCE loss for the batch.
# tf.nn.nce_loss automatically draws a new sample of the negative labels each
# time we evaluate the loss.
loss = tf.reduce_mean(
    tf.nn.nce_loss(nce_weights, nce_biases, train_labels, embed,
                   num_sampled, vocabulary_size))

# Construct the Adam optimizer
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

# Compute the cosine similarity between minibatch examples and all embeddings.
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), axis=1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

# Add variable initializer.
init = tf.global_variables_initializer()

num_steps = 1000 # was 100000?

with tf.Session() as session:
    init.run()

    average_loss = 0
    for step in range(num_steps):
        print("\rIteration: {}".format(step), end="\t")
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels}

        # We perform one update step by evaluating the training op (including it
        # in the list of returned values for session.run()).
        _, loss_val = session.run([training_op, loss], feed_dict=feed_dict)
        average_loss += loss_val

        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print("Average loss at step ", step, ": ", average_loss)
            average_loss = 0

        # Note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = vocabulary[valid_examples[i]]
                top_k = 8 # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k+1]
                log_str = "Nearest to %s:" % valid_word
                for k in range(top_k):
                    close_word = vocabulary[nearest[k]]
                    log_str = "%s %s," % (log_str, close_word)
                print(log_str)

    final_embeddings = normalized_embeddings.eval()

Iteration: 0    Average loss at step  0 :  260.603485107
Nearest to and: marsh, sipe, vehement, exercises, einer, mrnas, dancer, grendel,
Nearest to called: innuendo, algerian, synthesizing, montgomery, unspoken, elevating, plankton, monochromatic,
Nearest to many: salinas, fuji, trochaic, rubinstein, eln, tintin, lloyd, carbides,
Nearest to about: moreover, congo, choctaws, accomplished, unwieldy, ks, halifax, pac,
Nearest to than: awake, exact, offutt, gloster, pronunciations, delight, tsarina, hopped,
Nearest to or: long, mage, warriors, adhering, sk, clitoridectomy, parenting, vanguard,
Nearest to of: shakespeare, kemp, relax, cul, breakaway, solemnly, mason, mng,
Nearest to when: tolstoy, courtesan, hashes, coursing, evi, ren, diurnal, stimson,
Nearest to four: supermassive, soviet, palatalization, acclaimed, aided, whitney, filtration, lesbians,
Nearest to most: din, hawaii, loch, necronomicon, sunnah, sh, onager, miracles,
Nearest to on: helpers, tangle, heretical, compulsion, unorganized, rump, intimidating, israeli,
Nearest to but: ohio, rican, politeness, watkins, ingesting, street, hatred, novices,
Nearest to that: xhosa, distressed, continually, fausto, iole, admitted, etsi, gross,
Nearest to all: orissa, persistent, moro, informative, reservation, ren, browne, frobenius,
Nearest to in: chanced, accelerator, sergio, demonstrating, inertia, jarrett, intricate, orange,
Nearest to had: irredentist, kbit, sarris, lactate, bettor, narratives, hui, transpired,
Iteration: 999
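
The nearest-neighbor lookup used in the training loop can be reproduced offline on a small matrix. This NumPy sketch uses a hypothetical helper and toy vectors; unlike the loop above, which skips index 0 of the sorted result on the assumption that the query word ranks first, it filters the query index out explicitly.

```python
import numpy as np

def nearest_neighbors(embeddings, query_index, top_k):
    """Indices of the top_k rows most cosine-similar to row query_index."""
    # Normalize each row to unit length so a dot product is a cosine similarity.
    norms = np.sqrt(np.sum(np.square(embeddings), axis=1, keepdims=True))
    normalized = embeddings / norms
    sim = normalized @ normalized[query_index]
    order = (-sim).argsort()  # most similar first
    return [int(i) for i in order if i != query_index][:top_k]

toy = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0],
                [-1.0, 0.0]])
neighbors = nearest_neighbors(toy, query_index=0, top_k=2)
print(neighbors)  # [1, 2]
```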

Save final embeddings

np.save("my_final_embeddings.npy", final_embeddings)

Plot embeddings

import matplotlib.pyplot as plt

def plot_with_labels(low_dim_embs, labels):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i,:]
        plt.scatter(x, y)
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

from sklearn.manifold import TSNE

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
plot_only = 500
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only,:])
labels = [vocabulary[i] for i in range(plot_only)]
plot_with_labels(low_dim_embs, labels)
