Recurrent neural networks (RNN) by nvmoyar

https://padlet.com/nvmoyar/72g3qqxc5xp6
Neural networks to analyse time series data (notes from DeepLearning book)
2018-02-07 10:12:33 UTC 2024-05-10 01:00:42 UTC

What are RNNs?

They are a family of neural networks for processing sequential data, where order and structure matter.

Going from multilayer networks to recurrent networks relies on an idea found in machine learning and statistical models of the 1980s: sharing parameters across different parts of a model.

Why is parameter sharing needed? It lets us apply the model to examples of different forms (different lengths in this case) and generalise across them.

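Parameters can also be shared across time with a 1D convolution over the sequence, as in text recognition. In practice RNNs operate on minibatches of sequences indexed by a time step t, and that time step doesn't have to correspond to the passage of time in the real world; it can simply be a position in the sequence.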

RNNs can also be applied to 2D spatial data like images, and to data where the time dimension must be considered as well, as in videos.

Recursivity

We could say that an RNN is composed of n copies of a computational graph, one per time step. A computational graph is a way to formalise a set of structured computations, here the mapping from inputs and parameters to outputs and loss.

We can view a computational graph in its folded or unfolded form: the unfolded view gives an explicit description of which computations to perform, and helps illustrate how information flows forward in time (computing outputs and losses) and backward in time (computing gradients), since the path the information follows is explicit.

Regarding this graph:

This RNN has no outputs.

This RNN processes information from x (the input node) and incorporates it into the state h, which is passed forward through time.

The black square in the folded view indicates a delay of a single time step.

Unfolded computational graph --> each node is associated with one particular time instance.

For a signal x(t), the state h(t) is a function of the previous state h(t-1), the current input x(t) and the shared parameters θ: h(t) = f(h(t-1), x(t); θ).
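The unfolding described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the state is a single number, f is taken to be a tanh of a linear combination, and the names `step`, `unfold`, `w`, `u` and `b` are made up for this example. The point it shows is that the same function with the same parameters is applied at every time step, so the model works on sequences of any length.

```python
import math

def step(h_prev, x_t, w, u, b):
    """One application of f: new state from the previous state and the current input."""
    return math.tanh(w * h_prev + u * x_t + b)

def unfold(xs, h0=0.0, w=0.5, u=1.0, b=0.0):
    """Apply the same f, with the same (shared) parameters, at every time step."""
    h = h0
    states = []
    for x_t in xs:  # one node of the unfolded graph per time step
        h = step(h, x_t, w, u, b)
        states.append(h)
    return states

# The same parameters handle sequences of different lengths.
short = unfold([1.0, 0.5])
long = unfold([1.0, 0.5, -0.3, 0.8, 0.1])
print(len(short), len(long))
```

Because the parameters are shared, the first two states of the longer sequence are identical to the states of the shorter one.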

Examples of Design patterns for RNN

Left: an RNN that produces an output at each time step and has recurrent connections between hidden units. The graph maps:

x, input sequence --> o, outputs
L, the loss, measures how far each o is from the training target y
when using softmax outputs, we assume o holds the unnormalized log probabilities.
The RNN has input-to-hidden connections parametrised by a weight matrix U, hidden-to-hidden recurrent connections parametrised by a weight matrix W, and hidden-to-output connections parametrised by a weight matrix V.
It's a powerful model: any information about the past can be stored in the hidden representation and transmitted to the future.
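A minimal sketch of the forward pass of this first pattern, in plain Python with lists as vectors. U, W, V, x, o, y and L come from the notes; the bias vectors `b` and `c`, the tanh hidden activation, the toy sizes and all the numeric values are assumptions made for the example, following the usual textbook formulation rather than anything stated here.

```python
import math

def matvec(M, v):
    """Matrix-vector product with nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(*vectors):
    """Element-wise sum of several vectors."""
    return [sum(items) for items in zip(*vectors)]

def softmax(o):
    """Turn unnormalized log probabilities into probabilities."""
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    z = sum(exps)
    return [e / z for e in exps]

def forward(xs, ys, U, W, V, b, c, h0):
    h, total_loss, outputs = h0, 0.0, []
    for x_t, y_t in zip(xs, ys):
        # input-to-hidden (U) plus hidden-to-hidden (W); tanh is an assumption
        h = [math.tanh(a) for a in add(matvec(U, x_t), matvec(W, h), b)]
        o = add(matvec(V, h), c)             # hidden-to-output (V)
        y_hat = softmax(o)
        total_loss += -math.log(y_hat[y_t])  # L: negative log-likelihood of target y
        outputs.append(y_hat)
    return outputs, total_loss

# Toy sizes (assumed): 2 inputs, 3 hidden units, 2 output classes.
U = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
W = [[0.0, 0.1, -0.1], [0.2, 0.0, 0.3], [-0.3, 0.1, 0.0]]
V = [[0.5, -0.4, 0.1], [-0.2, 0.3, 0.6]]
b, c = [0.0, 0.0, 0.0], [0.0, 0.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [0, 1, 1]  # target class index at each time step
outputs, loss = forward(xs, ys, U, W, V, b, c, h0=[0.0, 0.0, 0.0])
print(loss)
```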

Center: an RNN that produces an output at each time step and whose only recurrence is the feedback connection from the output at one time step to the hidden units at the next. This model is less powerful than the previous one: it is trained to put a specific value into the output o, and o is the only information it is allowed to send to the future. No direct connections from h go forward. The previous h is connected to the present only indirectly, via the predictions it was used to produce. Unless o is high-dimensional and rich, it will usually lack important information from the past. So the model is less powerful but easier to train, because each time step can be trained in isolation from the others, which allows greater parallelization. These models can be trained with teacher forcing.
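To make the teacher forcing idea concrete, here is a small scalar sketch. All names and parameter values (`hidden`, `output`, `run_free`, `run_teacher_forced`, `u`, `w`, `v`, `b`, `c`) are hypothetical, and the tanh activation is an assumption. During training the ground-truth previous target is fed back instead of the model's own previous output, so every time step depends only on known data and can be computed independently.

```python
import math

def hidden(x_t, prev_out, u=0.8, w=0.5, b=0.0):
    """Hidden unit fed by the current input and the previous output (not previous h)."""
    return math.tanh(u * x_t + w * prev_out + b)

def output(h_t, v=1.2, c=0.0):
    return v * h_t + c

def run_free(xs, o0=0.0):
    """Test time: the model's own previous output is fed back, step after step."""
    outs, prev = [], o0
    for x_t in xs:
        prev = output(hidden(x_t, prev))
        outs.append(prev)
    return outs

def run_teacher_forced(xs, ys, y0=0.0):
    """Training: the previous ground-truth target replaces the previous output."""
    prev_targets = [y0] + list(ys[:-1])
    # No step needs the result of another step, so these could run in parallel.
    return [output(hidden(x_t, y_prev)) for x_t, y_prev in zip(xs, prev_targets)]

xs = [0.5, -1.0, 0.25, 0.75]
free = run_free(xs)
# If the targets happen to equal the model's own outputs, both modes agree.
forced = run_teacher_forced(xs, free)
print(free)
print(forced)
```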

Right: an RNN with recurrent connections between hidden units that reads an entire sequence and then produces a single output. It can be used to summarise a sequence into a fixed-size representation, used as input for further processing (as in a seq-2-seq model). There might be a target right at the end, or the gradient on the output o(t) can be obtained by backpropagating from further downstream modules.

Backpropagation or BPTT (backpropagation through time)

No specialised algorithms are necessary: the generalised backpropagation algorithm is applied to the unrolled computational graph. The gradients obtained by backpropagation are then used to train the network.

In feedforward networks, backpropagation moves backward from the final error through the outputs, weights and inputs of each hidden layer, assigning those weights responsibility for a portion of the error by calculating their partial derivatives ∂E/∂w, that is, the relationship between their rates of change. Those derivatives are then used by our learning rule, gradient descent, to adjust the weights up or down, whichever direction decreases error. Recurrent networks rely on an extension of backpropagation called backpropagation through time, or BPTT. Time, in this case, is simply expressed by a well-defined, ordered series of calculations linking one time step to the next, which is all backpropagation needs to work.

Neural networks, whether they are recurrent or not, are simply nested composite functions like f(g(h(x))). Adding a time element only extends the series of functions for which we calculate derivatives with the chain rule.
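As a rough illustration of the chain rule being extended through time, the sketch below runs BPTT on a scalar RNN and compares the result with a finite-difference estimate. The model (h(t) = tanh(w·h(t-1) + u·x(t)) with a squared error on the last state only), the function names and the numbers are all assumptions chosen to keep the example small.

```python
import math

def forward(xs, w, u, h0=0.0):
    """Unrolled forward pass; returns every state, h0 included."""
    hs = [h0]
    for x_t in xs:
        hs.append(math.tanh(w * hs[-1] + u * x_t))
    return hs

def loss(xs, y, w, u):
    """Squared error on the final state only (an assumed, simple loss)."""
    return 0.5 * (forward(xs, w, u)[-1] - y) ** 2

def bptt(xs, y, w, u):
    """Backpropagate through the unrolled graph, accumulating shared-weight gradients."""
    hs = forward(xs, w, u)
    dh = hs[-1] - y                      # dL/dh at the last time step
    dw = du = 0.0
    for t in range(len(xs), 0, -1):      # walk backward in time
        da = dh * (1.0 - hs[t] ** 2)     # through tanh
        dw += da * hs[t - 1]             # w is reused at every step, so gradients add up
        du += da * xs[t - 1]
        dh = da * w                      # pass the gradient on to h(t-1)
    return dw, du

xs, y, w, u = [0.5, -0.2, 0.9, 0.1], 0.3, 0.7, -0.4
dw, du = bptt(xs, y, w, u)

# Finite-difference check of the analytic gradients.
eps = 1e-6
num_dw = (loss(xs, y, w + eps, u) - loss(xs, y, w - eps, u)) / (2 * eps)
num_du = (loss(xs, y, w, u + eps) - loss(xs, y, w, u - eps)) / (2 * eps)
print(dw, num_dw)
print(du, num_du)
```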

Bidirectional RNN

The RNNs described so far are causal: the state at time t only captures information from the past and the present input. In many applications, however, we want to output a prediction of y(t) that may depend on the whole input sequence x. Ex.: in speech recognition, if a sound can be interpreted in two ways, we may need information about the surrounding context, including what comes next, to figure out which sound is the likely one.

As the name suggests, bidirectional RNNs combine an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end of the sequence. The network learns to map input sequences x to target sequences y, with a loss L(t) at each step t. The h recurrence propagates information forward in time (to the right), while the g recurrence propagates information backward in time (to the left). At each point t, the output o(t) can benefit from the past through h(t) and from the future through g(t).
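A scalar sketch of the two recurrences, assuming tanh states and an output that is a simple weighted sum of h(t) and g(t). The parameter names (`wh`, `uh`, `wg`, `ug`, `vh`, `vg`) and values are illustrative only.

```python
import math

def run(xs, w, u):
    """A plain scalar recurrence over xs, returning the state at every step."""
    h, states = 0.0, []
    for x_t in xs:
        h = math.tanh(w * h + u * x_t)
        states.append(h)
    return states

def bidirectional(xs, wh=0.5, uh=1.0, wg=0.5, ug=1.0, vh=1.0, vg=1.0):
    hs = run(xs, wh, uh)              # h(t): summarises x(1..t), moving forward
    gs = run(xs[::-1], wg, ug)[::-1]  # g(t): summarises x(t..T), moving backward
    return [vh * h + vg * g for h, g in zip(hs, gs)]  # o(t) sees past and future

a = bidirectional([1.0, 0.0, 0.0])
b = bidirectional([1.0, 0.0, 5.0])
# Only the last input differs, yet the first output changes: o(1) sees the future via g.
print(a[0], b[0])
```

A forward-only recurrence could not do this: its first state depends on the first input alone.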

PROBLEM: Vanishing (and exploding) gradients

The challenge of learning long-term dependencies is that gradients propagated over many stages tend to either vanish (most of the time) or explode. Even if we assume that the parameters are such that the network is stable, long-term dependencies receive exponentially smaller weights (resulting from the multiplication of many Jacobians, i.e. matrices of derivatives) than short-term ones (Hochreiter, 1991; Doya, 1993; Bengio et al., 1994; Pascanu et al., 2013). If we can’t know the gradient, we can’t adjust the weights in a direction that will decrease error, and our network ceases to learn.

Exploding gradients treat every weight as though it were the proverbial butterfly whose flapping wings cause a distant hurricane. Those weights’ gradients become saturated on the high end; i.e. they are presumed to be too powerful. But exploding gradients can be solved relatively easily, because they can be truncated or squashed. Vanishing gradients can become too small for computers to work with or for networks to learn – a harder problem to solve.
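One common way to "truncate or squash" an exploding gradient is to rescale it when its norm exceeds a threshold (clipping by norm). The notes do not say which method is meant, so this is just one plausible choice; the function name and the threshold are assumptions.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm never exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)           # small enough: leave untouched
    scale = max_norm / norm
    return [g * scale for g in grads]  # same direction, shorter step

exploded = [300.0, -400.0]           # L2 norm is 500
clipped = clip_by_norm(exploded, max_norm=5.0)
print(clipped)
```

The direction of the update is preserved; only its size is bounded.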

Below you see the effects of applying a sigmoid function over and over again. The data is flattened until, for large stretches, it has no detectable slope. This is analogous to a gradient vanishing as it passes through many layers.
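The flattening can also be checked numerically. The sketch below composes the sigmoid with itself several times and tracks the slope of the result with the chain rule; the starting point 2.0 and the function names are arbitrary choices for the example. Since the sigmoid's derivative is at most 0.25, the slope shrinks by at least a factor of four with every application.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def compose(x, times):
    """Apply the sigmoid `times` times; return the value and d(output)/dx."""
    value, slope = x, 1.0
    for _ in range(times):
        value = sigmoid(value)
        slope *= value * (1.0 - value)  # chain rule: sigmoid'(z) = s(z) * (1 - s(z))
    return value, slope

slopes = [compose(2.0, n)[1] for n in (1, 2, 4, 8)]
for n, s in zip((1, 2, 4, 8), slopes):
    print(n, s)
```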

https://uwaterloo.ca/data-analytics/sites/ca.data-analytics/files/uploads/files/rnn1.pdf

LSTM

In theory, RNNs are absolutely capable of handling such long-term dependencies. In practice, however, they don’t seem to be able to learn them.

The basic idea behind an LSTM is to control how the information flow is formed: which information is going to be kept for a long time, which part of the recent information flow (or short-term memory) is going to become part of the history, etc. LSTMs have a repeating unit structure, usually called cells, same as RNNs, but instead of a single neural network layer the repeating module contains four layers that interact with each other. These four layers act as gates through which the information flows, forming a circuit. Each gate is composed of a sigmoid neural network layer and an element-wise multiplication operation.

GRU (Gated Recurrent Unit)

Anatomy of an LSTM Cell

Learn gate --> combines the short-term memory and the event, then ignores part of the result by element-wise multiplying it by a factor i(t). The combination is performed by a linear function: the two vectors are joined, multiplied by a weight matrix, a bias is added, and the result is squashed with a tanh activation function. This yields a candidate state, which can be called ~C(t), that is later used to update the cell state from C(t-1) to C(t).

To calculate i(t) we take the previous short-term memory and the event as inputs to a small neural network: they pass through a linear function with new weights and bias, and the result is squashed with a sigmoid function. The output is i(t), which is element-wise multiplied by ~C(t).

Forget gate --> makes it possible to get rid of long-term memory information (history) that is irrelevant for the outcome. After the cleanse, what remains is kept in the history. The information is dropped by element-wise multiplying the history (the cell state, C(t-1)) by a forget factor f(t).

To calculate f(t), the short-term memory and the event are put through a linear function (multiplying by a new weight matrix and adding a bias) and the result is squashed with a sigmoid. This way recent information decides how much of the history is kept.

Remember gate --> the output of this gate is the new, updated long-term memory, that is, what we will remember for the future. It adds, element-wise, the information that survived the forget gate and the newly learnt information from the learn gate. After this gate we have a brand new cell state C(t).

Use gate --> this gate decides which information to use, from what we previously knew plus what we just learned, to make a prediction. The prediction is based on a filtered version of the updated long-term memory C(t). The filter o(t) is an element-wise multiplication factor, built by feeding the short-term memory and the event into a small neural network: a linear function with a new weight matrix and bias, whose result is squashed with a sigmoid function.
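The four gates can be put together in a short sketch of one LSTM step, written in plain Python with lists as vectors. The symbols f, i, ~C, C and o follow the notes; the parameter names (`Wf`, `bf`, ...), the helper `make_params`, the constant weights and the toy sizes are hypothetical. Applying tanh to C(t) before the use-gate filter follows the standard formulation in the linked colah post and is an assumption beyond what the notes state.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(W, b, v):
    """Weight matrix times vector, plus bias."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    z = h_prev + x_t  # join short-term memory and event (list concatenation)
    f = [sigmoid(a) for a in linear(p["Wf"], p["bf"], z)]          # forget factor f(t)
    i = [sigmoid(a) for a in linear(p["Wi"], p["bi"], z)]          # ignore factor i(t)
    c_tilde = [math.tanh(a) for a in linear(p["Wc"], p["bc"], z)]  # candidate ~C(t)
    o = [sigmoid(a) for a in linear(p["Wo"], p["bo"], z)]          # use filter o(t)
    # Remember gate: kept history plus newly learnt information.
    c = [f_k * c_k + i_k * n_k for f_k, c_k, i_k, n_k in zip(f, c_prev, i, c_tilde)]
    # Use gate: filtered view of the new cell state (tanh is an assumption, see above).
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

def make_params(hidden, inputs, value=0.1):
    """Hypothetical helper: constant weights and zero biases for the four layers."""
    params = {}
    for gate in "fico":
        params["W" + gate] = [[value] * (hidden + inputs) for _ in range(hidden)]
        params["b" + gate] = [0.0] * hidden
    return params

params = make_params(hidden=2, inputs=1)
h, c = [0.0, 0.0], [0.0, 0.0]
for x_t in ([1.0], [0.5], [-1.0]):  # a short sequence of events
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```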

Although this is a common architecture that gives very good results, many other variants are possible.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
https://deeplearning4j.org/lstm.html

Peephole connections

One LSTM variant, introduced by Gers & Schmidhuber (2000), adds “peephole connections”. This variant also feeds the cell state as an input to the forget, learn and use gates, which means that history and recent information are combined in the neural networks that run at each gate.
http://colah.github.io/posts/2015-08-Understanding-LSTMs/#variants-on-long-short-term-memory
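
Written out, the peephole variant amounts to adding the cell state to the input of each gate. The notation below follows the linked post rather than these notes: σ is the sigmoid, square brackets denote vector concatenation, h is the short-term memory, x the event, and the W and b symbols are per-gate weights and biases. It is a sketch of the usual formulation, not a definitive statement of the original paper's equations.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i\right) \\
o_t &= \sigma\left(W_o \cdot [C_t, h_{t-1}, x_t] + b_o\right)
\end{aligned}
```

Note that the forget and learn gates look at the old cell state C(t-1), while the use gate looks at the freshly updated C(t).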