Best suited for sequential data
An RNN is best suited to sequential data. It can handle inputs and outputs of arbitrary length, using its internal memory to process a sequence one element at a time.
This makes RNNs a natural fit for predicting what comes next in a sequence of words. Much as a person following a conversation does, an RNN gives more weight to recent information when anticipating the next words.
An RNN that is trained to translate text might learn that “dog” should be translated differently if it is preceded by the word “hot”.
RNN has internal memory
An RNN has memory: it retains information about the inputs it has already seen. When making a decision, it considers the current input together with what it has learned from the inputs it received previously. The output of the previous step is fed back as an input to the current step, creating a feedback loop.
So the network calculates its current state from the current input and the previous state. In this way, information cycles through a loop.
In a nutshell, an RNN has two inputs: the present and the recent past. This is important because the order of the data carries crucial information about what is coming next, which is why an RNN can handle tasks that networks without memory cannot.
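The feedback loop described above can be sketched in a few lines of NumPy. This is only a minimal illustration: the function name rnn_step, the weight names W_xh, W_hh and b_h, the tanh activation and the layer sizes are all assumptions chosen for the example, not part of any particular library.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: combine the current input with the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5    # illustrative sizes (assumed)

W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))   # input -> state
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # state -> state
b_h = np.zeros(hidden_size)

inputs = rng.normal(size=(seq_len, input_size))  # a toy input sequence
h = np.zeros(hidden_size)                        # initial state: nothing seen yet
states = []
for x_t in inputs:
    # The new state depends on the present input and the previous state.
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    states.append(h)
```

The same weights are reused at every step; only the state h changes as the sequence is read.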
Types of RNN
1. One to One: It maps one input to one output. It is also known as a vanilla neural network and is used to solve regular machine learning problems.
2. One to Many: It maps one input to many outputs. Example: Image Captioning. An image is fed into the RNN system, which produces a caption by considering the various objects in the image.
Caption: “A dog catching a ball in mid-air”
3. Many to One: It maps a sequence of inputs to one output. Example: Sentiment Analysis. In sentiment analysis, a sequence of words is provided as input, and the RNN decides whether the sentiment is positive or negative.
4. Many to Many: It maps a sequence of inputs to a sequence of outputs. Example: Machine Translation. A sentence in one language is translated into another language.
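The difference between the many-to-one and many-to-many cases can be sketched as a choice of which outputs to keep. The helper run_rnn, the weight names and the sizes below are assumptions made for illustration. The many-to-many case shown here is the simple aligned variant with one output per input step, whereas real translation systems usually use a separate encoder and decoder.

```python
import numpy as np

def run_rnn(inputs, W_xh, W_hh, W_hy):
    """Run a simple RNN over a sequence and return one output per time step."""
    h = np.zeros(W_hh.shape[0])
    outputs = []
    for x_t in inputs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)  # update the state
        outputs.append(W_hy @ h)            # read an output from the state
    return np.stack(outputs)

rng = np.random.default_rng(1)
W_xh = rng.normal(size=(4, 3))   # input (3) -> state (4)
W_hh = rng.normal(size=(4, 4))   # state -> state
W_hy = rng.normal(size=(2, 4))   # state -> output (2)
sequence = rng.normal(size=(6, 3))  # six time steps

all_outputs = run_rnn(sequence, W_xh, W_hh, W_hy)
many_to_many = all_outputs       # keep every step: one output per input
many_to_one = all_outputs[-1]    # keep only the last step: one output in total
```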
Forward and Backward Propagation
Forward Propagation: We run forward propagation to get the output of the model, then compare that output with the expected result to measure the error.
Backward Propagation: Once forward propagation is complete, we calculate the error. This error is then propagated backward through the network to update the weights.
We go backward through the neural network to find the partial derivatives of the error (loss function) with respect to the weights. Each partial derivative is multiplied by the learning rate to give the step size. This step size is subtracted from the original weight to give the new weight, so that the weights move in the direction that reduces the error. That is how a neural network learns during the training process.
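The weight update can be sketched on a toy problem with a single weight. The quadratic loss, the starting value and the learning rate are assumptions chosen so the result is easy to check. A real network applies the same rule to every weight, using gradients obtained by back-propagation.

```python
def loss(w):
    """Toy error: smallest when w equals 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the toy error with respect to the weight."""
    return 2.0 * (w - 3.0)

learning_rate = 0.1   # assumed value for the example
w = 0.0               # initial weight
for _ in range(50):
    step = learning_rate * grad(w)  # step size = learning rate * gradient
    w = w - step                    # move against the gradient to reduce the error
```

After the loop, w is very close to 3, the value that minimizes the toy error.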
Vanishing and Exploding Gradients
Let us first understand what a gradient is.
Gradient: As discussed in the back-propagation section above, a gradient is the partial derivative of a function with respect to its inputs. It measures how much the output of the function changes if you change the inputs a little bit.
You can also think of a gradient as the slope of a function. The higher the gradient, the steeper the slope and the faster a model can learn. If the slope is almost zero, the model stops learning. During training, the gradient measures how much the error changes with respect to a change in each weight.
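The idea of a gradient as a slope can be checked numerically by nudging the input a little and watching the output. The function f and the helper numerical_gradient are hypothetical names used only for this sketch.

```python
def f(x):
    """A simple curve: steep far from zero, flat at zero."""
    return x ** 2

def numerical_gradient(func, x, eps=1e-6):
    """Estimate the slope of func at x with a central difference."""
    return (func(x + eps) - func(x - eps)) / (2 * eps)

steep = numerical_gradient(f, 5.0)  # large slope: a small change in x moves f a lot
flat = numerical_gradient(f, 0.0)   # slope near zero: f barely changes
```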
Gradient issues in RNN
While training an RNN, the gradients can sometimes become too small or too large, which makes training very difficult. This leads to the following issues:
1. Poor Performance
2. Low Accuracy
3. Long Training Period
Exploding Gradient: This issue occurs when the weights are large enough that the gradient grows at every step as it is propagated back through time. The gradient values then become too large and tend to grow exponentially, which makes the weight updates unstable. This can be addressed using the following methods:
1. Identity Initialization
2. Truncated Back-propagation
3. Gradient Clipping
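Of the methods above, gradient clipping is the simplest to sketch: if the gradient's norm exceeds a threshold, the gradient is rescaled so that its norm equals the threshold. The function name clip_by_norm and the threshold of 5.0 are assumptions for the example.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; the direction is kept."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

exploding = np.array([30.0, -40.0])               # L2 norm is 50
clipped = clip_by_norm(exploding, max_norm=5.0)   # rescaled to norm 5

small = np.array([0.3, 0.4])                      # L2 norm is 0.5
unchanged = clip_by_norm(small, max_norm=5.0)     # below the threshold, left as-is
```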
Vanishing Gradient: This issue occurs when the values of a gradient are too small, so the model stops learning or takes far too long to learn. This can be addressed using the following methods:
1. Weight Initialization
2. Choosing the right Activation Function
3. LSTM (Long Short-Term Memory)
The most widely used way to address the vanishing gradient issue is LSTM (Long Short-Term Memory).
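Both gradient problems come from multiplying many per-step factors together when the error is propagated back through a long sequence. The sketch below is a deliberate simplification: it assumes the same factor at every time step, and gradient_scale is a hypothetical helper, not part of any library.

```python
def gradient_scale(factor, steps):
    """Multiply the same per-step factor over many time steps."""
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale

# A factor below 1 shrinks the gradient toward zero over 50 steps.
vanishing = gradient_scale(0.5, 50)
# A factor above 1 makes the gradient grow very large over 50 steps.
exploding = gradient_scale(1.5, 50)
```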
LSTM
A standard RNN has only a short-term memory, so it is not able to handle long-term dependencies. With LSTM, it can also have a long-term memory. LSTM is an extension of the RNN that extends its memory. LSTMs enable RNNs to remember their inputs over a long period of time, so that they become capable of learning long-term dependencies.
In this way, LSTM mitigates the vanishing gradient issue in RNNs. It keeps the gradients steep enough, which keeps training relatively short and accuracy high.
Gated Cells in LSTM
An LSTM is composed of memory blocks called cells, and the contents of these cells are manipulated using gates. LSTMs store information in these gated cells. Data can be stored in, deleted from and read from these cells, much like data in a computer’s memory. The gates open and close based on the current input and the previous state.
These gates are analog rather than digital, and their outputs range from 0 to 1. Analog gates have the advantage over digital ones of being differentiable, and are therefore suitable for back-propagation.
An LSTM has the following types of gates:
1. Forget Gate: It decides what information to forget or throw away. It outputs a number between 0 and 1. A 1 represents “completely keep this” while a 0 represents “completely forget this.”
2. Input Gate: The input gate is responsible for adding information to the cell state. It ensures that only information that is important and not redundant is added to the cell state.
3. Output Gate: Its job is to select useful information from the current cell state and present it as the output.
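The three gates can be sketched as a single LSTM step in NumPy. This follows the commonly used formulation with one stacked weight matrix. The names lstm_step, W and b, the ordering of the gates inside W and the sizes are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to four stacked pre-activations."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * n:1 * n])   # forget gate: how much old cell state to keep
    i = sigmoid(z[1 * n:2 * n])   # input gate: how much new information to add
    o = sigmoid(z[2 * n:3 * n])   # output gate: how much of the cell to expose
    g = np.tanh(z[3 * n:4 * n])   # candidate values for the cell state
    c_t = f * c_prev + i * g      # forget part of the old state, add the new
    h_t = o * np.tanh(c_t)        # output is a gated view of the cell state
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4    # illustrative sizes (assumed)
W = rng.normal(scale=0.5, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)         # output state
c = np.zeros(hidden_size)         # cell state (the long-term memory)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, W, b)
```

Because the cell state is updated by gated addition rather than by repeated matrix multiplication alone, information can survive across many steps when the forget gate stays close to 1.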
Squashing / Activation Functions in LSTM
1. Logistic (sigmoid): Outputs range from 0 to 1.
2. Hyperbolic Tangent (tanh): Outputs range from -1 to 1.
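The output ranges of the two functions can be checked with the standard library alone. The sample values below are arbitrary choices for this sketch.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

values = [-5.0, 0.0, 5.0]                   # arbitrary sample inputs
sigmoid_out = [sigmoid(z) for z in values]  # all strictly between 0 and 1
tanh_out = [math.tanh(z) for z in values]   # all strictly between -1 and 1
```

The 0-to-1 range is what makes the sigmoid a natural choice for gates, while tanh keeps the values in the cell centered around zero.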
Bidirectional RNN
Bidirectional RNNs feed the same input sequence to two RNNs. One of them is trained on the sequence in its regular order, while the other is trained on the reversed sequence. The outputs from both RNNs are then concatenated or otherwise combined.
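A forward pass through a bidirectional RNN can be sketched as follows. The helper run_rnn, the weight names and the sizes are assumptions for illustration, and concatenation is only one of several ways to combine the two directions.

```python
import numpy as np

def run_rnn(inputs, W_xh, W_hh):
    """Run a simple RNN over a sequence and return the state at every step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, input_size, hidden_size = 6, 3, 4   # illustrative sizes (assumed)
sequence = rng.normal(size=(seq_len, input_size))

# Each direction has its own weights.
fwd_W_xh = rng.normal(size=(hidden_size, input_size))
fwd_W_hh = rng.normal(size=(hidden_size, hidden_size))
bwd_W_xh = rng.normal(size=(hidden_size, input_size))
bwd_W_hh = rng.normal(size=(hidden_size, hidden_size))

forward_states = run_rnn(sequence, fwd_W_xh, fwd_W_hh)
# Read the sequence in reverse, then flip the states back to the original order.
backward_states = run_rnn(sequence[::-1], bwd_W_xh, bwd_W_hh)[::-1]

# At each step, join what was seen before it with what comes after it.
combined = np.concatenate([forward_states, backward_states], axis=1)
```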
Applications of RNN
1. Natural Language Processing (Text mining, Sentiment analysis, Text and Speech analysis, Audio and Video analysis)
2. Machine Translation (Translating text from one language into another)
3. Time Series Prediction (Stock market prediction, Algorithmic trading, Weather prediction, Understanding DNA sequences, etc.)
4. Image Captioning