Deep Learning Language Models: The Beginning
Unstructured data such as text, images and videos contains a lot of information, but because it is complex to process and analyze, this area was not explored in depth until relatively recently. Natural Language Processing (NLP) has become increasingly important for understanding natural language data, as it provides the tools, techniques and algorithms to process it. With applications ranging from chatbots and virtual assistants to dialog agents, it is worth exploring the most effective and efficient NLP techniques. Training these models requires a lot of computation power, so it is essential to investigate tools and methods that make them more efficient.
In this blog, we will discuss and summarize two of the most important papers that fueled NLP research and pushed it to the next level.
Before we dig deep into these breakthrough papers, let’s take a step back and recall what “Attention” is. An interesting blog I found that summarizes “Attention” is [7].
Contents of the blog:
- Attention is all you need [1]
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [2]
Attention is all you need [1]
The authors show that the sequential nature of language can be captured using only the attention mechanism, without any Long Short-Term Memory (LSTM) networks or Recurrent Neural Networks (RNNs). RNN-based architectures are hard to parallelize and can have difficulty learning long-range dependencies within the input and output sequences; in the Transformer, all of these dependencies are modelled with attention. The Transformer uses multiple “heads”, i.e. multiple attention distributions and multiple outputs for a single input, instead of a single sweep of attention. It also uses layer normalization and residual connections to make optimization easier and more efficient. Additionally, attention on its own cannot make use of the positions of the inputs. To solve this, the Transformer adds explicit positional encodings to the input embeddings. We will dive into these details below.
“We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.”
-From the ‘Attention is all you need’ paper
Before we jump into the Transformer’s architecture, we need to recall what self-attention is. Self-attention is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. A self-attention layer encodes each word based on all the other words in the sequence: it compares the word’s encoding against every other word’s encoding and produces a new encoding from these comparisons.
Why is self-attention important? In a regular encoder-decoder architecture we can run into problems with long-term dependencies. To tackle this, for every input word’s representation we learn an attention distribution over all the other words. This way, each input representation carries global information about every other token in the sequence.
Transformer Architecture
From Figure 2, we observe that the Transformer uses the basic encoder-decoder design: the encoder is on the left and the decoder is on the right.
The initial inputs to the encoder are the embeddings of the input sequence, while the initial inputs to the decoder are the embeddings of the outputs generated up to that point. The design replaces the LSTMs with self-attention layers, and the sequential order is captured using positional encodings. Apart from attention, each layer only contains fully connected (FC) components, so the model is easy to parallelize and avoids the bottleneck of Recurrent Neural Networks.
The encoder and decoder are composed of a stack of identical layers, whose main components are:
1. Multi-Head Self-Attention
The paper builds multi-head self-attention on top of Scaled Dot-Product Attention. The queries (Q) are multiplied with the transposed keys (K^T) and the result is scaled by the square root of the key dimension; a softmax over these scores gives the attention weights, and the weighted sum of the values (V) is the output of the layer: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
Scaled dot-product attention is chosen because it is fast and space-efficient, since it can use highly optimized matrix multiplication code. On top of it, “Multi-Head” Attention is used: instead of a single attention-weighted sum, several attention layers run in parallel, each on a different linear transformation of the same input. Each “head” is therefore a separate linear projection of the input, which lets the model capture different aspects of the input and increases its expressive power.
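To make the mechanics concrete, here is a minimal NumPy sketch of scaled dot-product attention with a naive multi-head wrapper. It is only an illustration of the idea, not the authors’ implementation; the random projection matrices and dimensions are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # compatibility of every query with every key
    weights = softmax(scores)         # attention distribution over positions
    return weights @ V                # attention-weighted sum of the values

def multi_head_attention(X, num_heads=2, d_head=4):
    rng = np.random.default_rng(0)
    d_model = X.shape[-1]
    heads = []
    for _ in range(num_heads):
        # each head uses its own linear projections of the same input X
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        heads.append(scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv))
    Wo = rng.standard_normal((num_heads * d_head, d_model))  # output projection
    return np.concatenate(heads, axis=-1) @ Wo

X = np.random.default_rng(1).standard_normal((5, 8))  # 5 tokens, model dimension 8
print(multi_head_attention(X).shape)                  # -> (5, 8)
```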
2. Positional Encoding and Position-wise Feed-Forward Networks
The model also needs information about the relative or absolute position of the tokens in the sequence, so order information must be injected into the embeddings. The positional encodings have the same dimension as the embeddings and are built from two sinusoids of different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Here, pos is the position of the token and i is the dimension. The authors chose these functions because they would allow the model to easily learn to attend by relative positions.
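Below is a small NumPy sketch of these sinusoidal positional encodings, just to illustrate the formula; the sequence length and model dimension are arbitrary placeholders.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]              # token position
    i = np.arange(0, d_model, 2)[None, :]          # (even) dimension index, i.e. 2i
    angle = pos / np.power(10000, i / d_model)
    pe[:, 0::2] = np.sin(angle)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions: cosine
    return pe

print(positional_encoding(max_len=50, d_model=16).shape)  # -> (50, 16)
```

The resulting matrix is simply added to the input embeddings before the first encoder layer.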
Also important here are the residual connections around the layers: since the positional encodings are only added at the bottom of the stack, the residual connections help carry the position information of the word embeddings up through the layers, and this improved the model by a considerable amount.
Together, these two components make up the attention-only Transformer model.
The authors also compare the per-layer complexity of the different mechanisms in the paper; restricted self-attention has the lowest complexity per layer.
Some interesting results
In Table 2, the WMT 2014 English-to-German translation task was used. The big Transformer model outperforms the best previously reported models with a BLEU score of 28.4 (this model took 3.5 days to train on 8 P100 GPUs). On the WMT 2014 English-to-French task, the big Transformer model reached a BLEU score of 41, at less than a quarter of the training cost of the previous state-of-the-art models.
The authors also observe that reducing the attention key size hurts model quality, as shown in Table 3. This suggests that determining compatibility is not easy, and that a more sophisticated compatibility function than dot product may be worth investigating. The table also shows that bigger models are better and that dropout is beneficial for avoiding overfitting.
The authors also observe in Table 4 that, despite the lack of task-specific tuning, the model performs well and yields better results than all previously reported models apart from the Recurrent Neural Network Grammar. Even when training only on the WSJ training set, the Transformer outperforms the BerkeleyParser [29].
Conclusion and summary for Paper I
The authors built their architecture on the encoder-decoder approach, with multi-head self-attention layers and positional encodings. For translation tasks, the Transformer trains significantly faster than convolutional or recurrent models, and on both the WMT 2014 English-to-German and English-to-French translation tasks it outperformed the previous state of the art. The authors also suggest the approach could be extended to other modalities such as audio and video.
After this paper, one of the major breakthroughs in the world of NLP was BERT.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT (Bidirectional Encoder Representations from Transformers) is a technique for pre-training language representations. Its key innovation is applying the bidirectional training of the Transformer to language modelling. BERT became huge because it outperformed the previous state-of-the-art methods thanks to its unsupervised and bidirectional pre-training for Natural Language Processing.
“BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.”
-From the ‘BERT’ paper
For example, the word “lit” would have the same word2vec vector for every one of its occurrences in a corpus. With BERT, “She lit the room up as she entered” and “She lit the candles” produce contextualized embeddings for “lit” that differ according to the sentence and context in which the word is used.
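As a quick illustration, the sketch below (assuming the Hugging Face transformers library and the pretrained bert-base-uncased checkpoint are available) extracts the vector for “lit” from both sentences; the helper function embedding_of is my own, not part of any library.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embedding_of(sentence, word):
    # Run the sentence through BERT and return the hidden state of the
    # first token that matches `word`.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = embedding_of("She lit the room up as she entered", "lit")
v2 = embedding_of("She lit the candles", "lit")
# Similar but not identical: the two vectors differ with context,
# unlike a static word2vec embedding.
print(torch.cosine_similarity(v1, v2, dim=0))
```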
Fun fact: Google Search announced in October 2019 that they had started applying BERT to English-language search queries in the United States [3]. In December 2019, Google Search expanded it to over 70 languages. As of October 2020, almost every English query is processed by BERT [4].
BERT Architecture
The BERT framework is composed of two main steps:
- Pre-training BERT
- Fine-tuning BERT
1. Pre-training:
Pre-training utilizes two main tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
- MLM: The [MASK] token is used to replace 15% of the tokens. Since fine-tuning does not contain the [MASK] token, the authors want to avoid a mismatch between pre-training and fine-tuning. Thus, for the selected tokens they follow these rules (a short sketch of this rule follows the list):
- The token is replaced with [MASK] (80% of the time)
- The token is replaced with a random token (10% of the time)
- The token remains unchanged (10% of the time)
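Here is a tiny, self-contained sketch of this 80/10/10 rule. It is my own illustration with a toy vocabulary, not the authors’ code.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:      # select ~15% of the tokens
            labels[i] = tok                  # the model must predict the original token
            r = random.random()
            if r < 0.8:                      # 80% of the time: replace with [MASK]
                inputs[i] = "[MASK]"
            elif r < 0.9:                    # 10% of the time: replace with a random token
                inputs[i] = random.choice(VOCAB)
            # remaining 10% of the time: leave the token unchanged
    return inputs, labels

print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"]))
```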
- NSP: BERT separates sentences with a special token ([SEP]). Two input sentences are fed to the model during training such that:
- 50% of the time, the original sequence is copied as the next sequence
- For the remaining 50%, it is replaced with a random sentence from the corpus.
This is done so that the model learns the relationships between sentences, which is how Next Sentence Prediction (NSP) works: the model has to predict whether the 2nd sentence actually follows the 1st or is a random one. For this prediction, the complete sequence passes through the base model; a simple classification layer transforms the output of the [CLS] token into a 2x1 vector, and the “IsNext” / “NotNext” label is assigned using softmax.
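A minimal sketch of how such sentence pairs can be constructed is shown below; the make_nsp_pair helper and the toy corpus are purely illustrative, not BERT’s actual data pipeline.

```python
import random

def make_nsp_pair(sentences, idx):
    """Return ("[CLS] sent_a [SEP] sent_b [SEP]", label) for sentence idx."""
    sent_a = sentences[idx]
    if random.random() < 0.5 and idx + 1 < len(sentences):
        sent_b, label = sentences[idx + 1], "IsNext"       # true next sentence
    else:
        sent_b, label = random.choice(sentences), "NotNext"  # random sentence
    return f"[CLS] {sent_a} [SEP] {sent_b} [SEP]", label

corpus = [
    "She lit the candles.",
    "The room glowed softly.",
    "BERT uses two pre-training tasks.",
]
print(make_nsp_pair(corpus, 0))
```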
The above two tasks are trained together so as to minimize the combined loss function.
Some interesting results
- BERT-Base Model: 12 layers, 768 hidden units, 12 attention heads, approx. 110M parameters; trained on 4 Cloud TPUs for 4 days
- BERT-Large Model: 24 layers, 1024 hidden units, 16 attention heads, approx. 340M parameters; trained on 16 Cloud TPUs for 4 days
From these results we can conclude that model size matters: BERT-Large clearly outperforms BERT-Base. We can also conclude that BERT’s bidirectional approach (MLM) converges more slowly than a left-to-right approach, probably because only 15% of the words are predicted in each batch. Despite this slower convergence, the bidirectional approach outperforms the left-to-right one after only a small number of pre-training steps. Finally, as with most models, more training data and more training steps lead to higher accuracy. The authors also increased the hidden dimension from 200 to 600 and observed improved performance, but beyond 1,000 little further gain was observed.
These results are observed on multiple datasets, and the authors report them in Tables 6, 7 and 8.
Conclusion and summary of Paper II
BERT created a “boom” in the Natural Language Processing field and is extremely powerful. The combination of masked language modeling and next sentence prediction during pre-training makes the model stronger, and it has greatly increased our capacity for transfer learning in NLP: many variations of it can be created to improve the model further, and it solves a wide range of NLP tasks. To improve BERT’s performance, modifications were proposed in papers like RoBERTa [5] and DistilBERT [6]. BERT has opened up endless possibilities, and this is just the beginning.
References:
[1] Attention Is All You Need
[2] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[3] Nayak, Pandu (25 October 2019). “Understanding searches better than ever before”. Google Blog. Retrieved 10 December 2019.
[4] “Google: BERT now used on almost every English query”. Search Engine Land. 2020-10-15. Retrieved 2020-11-24.
[5] RoBERTa
[6] DistilBERT
[7] Attention