Why bidirectional RNN is better than unidirectional RNN: A theoretical proof


Have you ever wondered how Machine Learning (ML) research papers come up with their own mathematical proofs? In this article, we provide a step-by-step, intuitive, theoretical proof of why bidirectional Recurrent Neural Networks (RNNs) empirically perform better than their unidirectional counterparts.

Note: Our proof is an adapted and extended version of the dilated RNN proof$^{[2]}$. To our knowledge, this is the first attempt at theoretically proving that bidirectional RNNs are better than unidirectional RNNs.

Create a hypothesis

It is well known that bidirectionality considers information from both left and right contexts. Thus, bidirectional RNNs provide richer information in their outputs. But how do we formalize and quantify this hypothesis?

 

Fig. 1: As the number of timesteps increases, less information (depicted by lighter colour) from the initial timestep is retained in a vanilla RNN.


Fig. 2: This architecture is known as the skip RNN. By adding “skip” connections between timesteps, more information (depicted by darker colour) from the initial timesteps is retained as the RNN unfolds.

We first start off with the vanilla RNN. As shown in Fig. 1, information from the initial timesteps is barely carried forward as an RNN unfolds (depicted by the change in colour from dark to light). In order to allow initial information to flow across multiple timesteps, the skip RNN$^{[1]}$ was introduced. A skip RNN adds “skip” connections between timesteps as shown in Fig. 2. This allows the architecture to model longer sentences since the information from the initial words is not lost. Clearly, an RNN with a shorter path from the input node to the output node can model longer sentences and hence can provide better results. This looks like a good hypothesis to work with.

Formalize the hypothesis

Let $d_i(n)$ be the length of the shortest path from input node $i$ to the output node $i+n$. In Fig. 2, for the skip RNN architecture, if $i=1$ (the first input, “word 1”) and $n=4$ (the last output, “Output 5”), then $d_1(4)=2$ as two skip connections lead directly from the first input node to the last output node. For comparison, $d_1(4)=4$ for the vanilla RNN in Fig. 1. We formally define the quantity $d_i(n)$ as the recurrent length of a network. RNNs with better “memory” (such as the skip RNN, which can remember longer contexts) have smaller recurrent lengths. Hence, we define a new quantity called Memory Capacity such that $$\text{MC}(f) = \frac{1}{\max\limits_{i}\left(\text{avg}_{n}\left(d_i(n)\right)\right)}$$Here, $f$ is any RNN and MC$(f)$ is the Memory Capacity of $f$. The idea is that the worst-case (over inputs) average shortest distance between an input and the outputs of an RNN is inversely related to its Memory Capacity.
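
To make these definitions concrete, here is a minimal Python sketch (all function and variable names are our own, not from any library) that computes $d_i(n)$ by breadth-first search over a small graph of RNN connections and then evaluates MC$(f)$ for the 5-timestep vanilla and skip RNNs of Figs. 1 and 2. The exact placement of the skip connections is our assumption.

```python
from collections import deque

def shortest_path_len(edges, src, dst):
    """Breadth-first search over a directed graph given as an adjacency dict."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in edges.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return float("inf")

def memory_capacity(edges, t):
    """MC(f) = 1 / max_i avg_n d_i(n), with timesteps labelled 1..t."""
    worst_avg = max(
        sum(shortest_path_len(edges, i, i + n) for n in range(t - i + 1)) / (t - i + 1)
        for i in range(1, t + 1)
    )
    return 1.0 / worst_avg

t = 5
# Vanilla RNN (Fig. 1): information can only move one timestep at a time.
vanilla = {i: [i + 1] for i in range(1, t)}
# Skip RNN (Fig. 2): extra connections that jump two timesteps ahead (our assumption).
skip = {i: [j for j in (i + 1, i + 2) if j <= t] for i in range(1, t)}

print(memory_capacity(vanilla, t))  # worst-case average distance is 2.0 -> MC = 0.5
print(memory_capacity(skip, t))     # shorter paths -> larger MC (about 0.83 here)
```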

Note: It is clear from the above equation that an ideal RNN would have every output directly connected to all of its inputs, resulting in an MC value of $1$. But wait! Doesn’t such a network already exist? You bet it does! Remember transformers? The self-attention mechanism ensures that each output timestep is represented by all input timesteps. Thus, transformers have the ideal MC value, which helps explain why these networks have been shown to outperform RNNs on a variety of tasks.

Use the formalization for proof

Bidirectional RNNs are implemented by first passing the input sequence in left-to-right fashion as shown in Fig. 3 (a) and then in right-to-left fashion as shown in Fig. 3 (b). The separate vector outputs from each pass are then concatenated to form the final output for every timestep. Let $i=1$ (“word 1”) and $n=4$ (“output 5”); then the recurrent length is $d_1(4)=4$ for the left-to-right pass. But for the same values of $i$ and $n$, $d_1(4)=0$ for the right-to-left pass because the sequence is now reversed and the first word directly provides the final output. Clearly, the longest possible path in the left-to-right pass is the shortest possible path in the right-to-left pass. Since the outputs are concatenated, subsequent layers can choose to use either context or both. Thus, for bidirectional RNNs, $d_i(n)=\min\left(d_i^l(n), d_i^r(n)\right)$, where $d_i^l(n)$ is the recurrent length of the left-to-right pass and $d_i^r(n)$ is the recurrent length of the right-to-left pass.

 

Fig. 3: (a) The input sequence is processed left-to-right. (b) The input sequence is processed right-to-left. The outputs from both passes are concatenated to produce the final output of the bidirectional RNN.
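
As a quick sanity check of the $\min$ rule (continuing with our own hypothetical naming, and following the convention used above that the right-to-left pass needs $(t-i)-n$ hops to reach the same output), the following short sketch reproduces the Fig. 3 example:

```python
def d_bidirectional(i, n, t):
    """Recurrent length d_i^b(n) of a t-timestep bidirectional RNN:
    the left-to-right pass needs n hops, the right-to-left pass needs
    (t - i) - n hops, and later layers can use whichever context is closer."""
    d_left = n               # d_i^l(n)
    d_right = (t - i) - n    # d_i^r(n)
    return min(d_left, d_right)

t = 5
print([d_bidirectional(1, n, t) for n in range(t)])  # [0, 1, 2, 1, 0]
# In particular, d_1^b(4) = 0: "word 1" is adjacent to "output 5" in the reversed pass.
```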

Objective

Now that we have a Memory Capacity (MC) measure for both unidirectional and bidirectional RNNs, we need to compare them and find which one has the larger MC. Let $d_i^u(n)$ and $d_i^b(n)$ denote the recurrent lengths of unidirectional and bidirectional RNNs respectively, while MC$(f^u)$ and MC$(f^b)$ denote the Memory Capacities of unidirectional and bidirectional RNNs, respectively. We need to show that, averaged over all values of $n$, $d_i^b(n) < d_i^u(n)$, i.e., MC$(f^u) < $ MC$(f^b)$. To compare the two quantities over all values of $n$ at once, we can work with their Expectations$^{[3]}$.

Expectation of recurrent length for unidirectional RNNs

Let $t$ be the total number of timesteps. Then $n$ can take values in the range $[0, t-1]$. The maximum value is $t-1$ because at least one node has to be reserved for the input $i$. Now consider the case $i=1$. This is the worst-case scenario for a unidirectional RNN since all outputs are at the furthest possible distance from the input (try to work this out yourself). Hence, we shall only consider the worst case, $d_1(n)$. The expected value of $d_1^u(n)$, where $n \in \{0, 1, 2, \ldots, t-1\}$, can be written as follows. $$\mathbb{E}[d_1^u(n)] = \sum_jp(d_1^u(j))d_1^u(j)$$

$$\implies \mathbb{E}[d_1^u(n)] = \frac{1}{t}(0) + \frac{1}{t}(1) + \ldots + \frac{1}{t}(t-1) = \frac{1}{t}\cdot\frac{t(t-1)}{2} = \frac{t-1}{2} \approx \frac{t}{2} \text{ when } t \text{ is large}$$where $p(d_1^u(j))$ is the probability of the event $d_1^u(j)$, that is, the probability of considering the output that lies $j$ timesteps away from the first input. We treat all $t$ possible offsets as equally likely, so each gets probability $\frac{1}{t}$. The closed form then follows from the fact that $d_1^u(0)=0, d_1^u(1)=1, d_1^u(2)=2$, up to $d_1^u(t-1)=t-1$, so the expected value of $d_1^u(n)$ is simply $\frac{1}{t}$ times the sum of an arithmetic progression (AP) with initial value $0$ and final value $t-1$.
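
The expectation above can be checked numerically with a few lines of plain Python (variable names are our own):

```python
t = 1000
d_uni = list(range(t))                   # d_1^u(n) = n for n = 0, ..., t-1
expectation = sum(d_uni) / t             # each offset n is equally likely, probability 1/t
print(expectation, (t - 1) / 2, t / 2)   # 499.5  499.5  500.0 -- t/2 is a good approximation
```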

Expectation of recurrent length for bidirectional RNNs

In a bidirectional RNN, the longest possible path in the left-to-right pass becomes the shortest possible path in the right-to-left pass, as discussed previously. In fact, the largest possible value of $d_1^b(n)$ is $\left(\frac{t}{2}-1\right)$. This is because whenever the left-to-right distance exceeds $\left(\frac{t}{2}-1\right)$, there exists a shorter path in the right-to-left pass, and whenever the right-to-left distance exceeds $\left(\frac{t}{2}-1\right)$, there exists a shorter path in the left-to-right pass; the $\min$ always selects the shorter of the two.

Before we write down the expected value of $d_1^b(n)$, let us work through a simple example. Let $t=10$. All possible values of $d_1^b(n)$ are $d_1^b(0)=0$, $d_1^b(1)=1$, $d_1^b(2)=2$, $d_1^b(3)=3$, $d_1^b(4)=4$, $d_1^b(5)=4$, $d_1^b(6)=3$, $d_1^b(7)=2$, $d_1^b(8)=1$, $d_1^b(9)=0$. Thus, the entire series is $\{0, 1, 2, 3, 4, 4, 3, 2, 1, 0\}$, which can be reduced to $\{0, 2, 4, 6, 8\}$ by adding the symmetric terms together. As you can see, this is an AP with initial value $0$, a common difference of $2$ and a total of $\frac{t}{2}$ terms, the last term being $2\left(\frac{t}{2}-1\right)$. Note that for odd values of $t$ the series changes slightly, but this does not affect our final result.

Therefore, the expected value of $d_1^b(n)$ is written as, $$\mathbb{E}[d_1^b(n)] = \sum_jp(d_1^b(j))d_1^b(j)$$

$$\implies \mathbb{E}[d_1^b(n)] = \frac{1}{t}\left(\frac{t}{4}\left(0 + 2\left(\frac{t}{2}-1\right)\right)\right) = \frac{t-2}{4} \approx \frac{t}{4} \text{ when } t \text{ is large}$$
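
Again, a quick numerical check (our own minimal sketch) confirms the $\frac{t}{4}$ approximation:

```python
t = 1000
d_bi = [min(n, (t - 1) - n) for n in range(t)]  # d_1^b(n) = min of the two passes
expectation = sum(d_bi) / t
print(expectation, t / 4)  # 249.5  250.0 -- the t/4 approximation holds for large t
```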

Final Comparison

Now, if MC$(f^b) > $ MC$(f^u)$, then, $\frac{\text{MC}(f^b)}{\text{MC}(f^u)} > 1$. We can quickly check this based on the values that we found above. $$\frac{\text{MC}(f^b)}{\text{MC}(f^u)} = \frac{\max\limits_{i}\left(\text{avg}_{n}\left(d_i^u(n)\right)\right)}{\max\limits_{i}\left(\text{avg}_{n}\left(d_i^b(n)\right)\right)} = \frac{\text{avg}_{n}\left(d_1^u(n)\right)}{\text{avg}_{n}\left(d_1^b(n)\right)}$$

$$\implies \frac{\mathbb{E}[d_1^u(n)]}{\mathbb{E}[d_1^b(n)]} = \frac{t}{2}\times\frac{4}{t} = 2$$
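
The ratio can also be checked without the large-$t$ approximation; the following sketch (our own code, using the exact expectations rather than the $\frac{t}{2}$ and $\frac{t}{4}$ approximations) shows it approaching $2$ as $t$ grows:

```python
for t in (10, 100, 1000, 10000):
    e_uni = sum(range(t)) / t                              # exact E[d_1^u(n)]
    e_bi = sum(min(n, (t - 1) - n) for n in range(t)) / t  # exact E[d_1^b(n)]
    print(t, round(e_uni / e_bi, 4))  # ratio of the Memory Capacities -> 2 as t grows
```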

We can clearly see that the MC of a bidirectional RNN is approximately twice that of a unidirectional RNN (exactly twice in the limit of large $t$). This concludes our proof, which theoretically explains why bidirectional RNNs have historically tended to outperform their unidirectional counterparts.

References

[1] Campos, Víctor, Brendan Jou, Xavier Giró-i-Nieto, Jordi Torres, and Shih-Fu Chang. "Skip RNN: Learning to skip state updates in recurrent neural networks." arXiv preprint arXiv:1708.06834 (2017).

[2] Chang, Shiyu, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark A. Hasegawa-Johnson, and Thomas S. Huang. "Dilated recurrent neural networks." Advances in neural information processing systems 30 (2017).

[3] "Expected value." Wikipedia. https://en.wikipedia.org/wiki/Expected_value

Cite us

For attribution in academic contexts or open-source libraries, please cite this work as:

@misc{hassan2021birnnproof,
  title={Why bidirectional RNN is better than unidirectional RNN: A theoretical proof},
  author={Hassan, Atif},
  year={2021},
  howpublished={\url{https://www.deepwizai.com/simply-deep/a-mathematical-proof-of-why-bidirectional-rnns-are-better-than-unidirectional-rnns}}
}
