Deep Learning Language Models: The Beginning
Unstructured data such as text, images and videos contains a lot of information, but because it is complex to process and analyze, this area was not explored in depth until relatively recently. Natural Language Processing (NLP) has become increasingly important for understanding natural language data, as it provides the tools, techniques and algorithms to process it. With applications ranging from chatbots and virtual assistants to dialog agents, it is worth exploring the most effective and efficient NLP techniques. Training these models requires a lot of computation power, so it is essential to investigate tools and methods that make them more efficient.
In this blog, we will discuss and summarize two of the most important papers that fueled NLP research and pushed it to the next level.
Before we dig deep into these breakthrough papers, let’s take a step back and recall what “Attention” is. An interesting blog I found that summarizes “Attention” is [7].
Contents of the blog:
- Attention is all you need [1]
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [2]
Attention is all you need [1]
The authors show that the sequential nature of language can be captured using only the attention mechanism, without any Long Short-Term Memory (LSTM) networks or Recurrent Neural Networks (RNNs). RNN-based architectures are hard to parallelize and can have difficulty learning long-range dependencies within the input and output sequences; in the Transformer, all of these dependencies are modelled with attention. The Transformer uses multiple “heads”, i.e. multiple attention distributions and multiple outputs for a single input, instead of a single sweep of attention. It also uses layer normalization and residual connections to make optimization easier and more efficient. Additionally, attention on its own cannot make use of the positions of the inputs. To solve this, the Transformer adds explicit positional encodings to the input embeddings. We will dive into these details below.
“We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.”
-From the ‘Attention is all you need’ paper
Before we jump into the Transformer’s architecture, we need to recall what self-attention is. Self-attention is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. A self-attention layer encodes each word based on all the other words in the sequence: it compares the word’s encoding against every other word’s encoding and produces a new encoding from these comparisons.
Why is self-attention important? In a regular encoder-decoder architecture we can run into problems with long-term dependencies. To tackle this, for every input word’s representation we learn an attention distribution over all the other words. This way, each input representation carries global information about every other token in the sequence.
Transformer Architecture
From Figure 2, we observe that the Transformer uses the basic encoder-decoder design: the encoder is on the left and the decoder is on the right.
The initial inputs to the encoder are the embeddings of the input sequence, while the initial inputs to the decoder are the embeddings of the outputs generated up to that point. The design replaces the LSTMs with self-attention layers, and the sequential order is captured using positional encodings. Apart from attention, each layer only contains fully connected (FC) components, so the model is easy to parallelize and avoids the bottleneck of Recurrent Neural Networks.
The encoder and decoder are composed of a stack of identical layers, whose main components are:
1. Multi-Head Self-Attention
The paper builds multi-head self-attention on top of Scaled Dot-Product Attention. The queries (Q) are multiplied with the transposed keys (K^T) and the result is scaled by the square root of the key dimension; a softmax over these scores gives the attention weights, and the weighted sum of the values (V) is the output of the layer: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
Scaled dot-product attention is chosen because it is fast and space-efficient, since it can use highly optimized matrix multiplication code. On top of it, “Multi-Head” Attention is used: instead of a single attention-weighted sum, several attention layers run in parallel, each on a different linear transformation of the same input. Each “head” is therefore a separate linear projection of the input, which lets the model capture different aspects of the input and increases its expressive power.
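To make the mechanics concrete, here is a minimal NumPy sketch of scaled dot-product attention with a naive multi-head wrapper. It is only an illustration of the idea, not the authors’ implementation; the random projection matrices and dimensions are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # compatibility of every query with every key
    weights = softmax(scores)         # attention distribution over positions
    return weights @ V                # attention-weighted sum of the values

def multi_head_attention(X, num_heads=2, d_head=4):
    rng = np.random.default_rng(0)
    d_model = X.shape[-1]
    heads = []
    for _ in range(num_heads):
        # each head uses its own linear projections of the same input X
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        heads.append(scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv))
    Wo = rng.standard_normal((num_heads * d_head, d_model))  # output projection
    return np.concatenate(heads, axis=-1) @ Wo

X = np.random.default_rng(1).standard_normal((5, 8))  # 5 tokens, model dimension 8
print(multi_head_attention(X).shape)                  # -> (5, 8)
```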
2. Positional Encoding and Position-wise Feed-Forward Networks
The model also needs information about the relative or absolute position of the tokens in the sequence, so order information must be injected into the embeddings. The positional encodings have the same dimension as the embeddings and are built from two sinusoids of different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Here, pos is the position of the token and i is the dimension. The authors chose these functions because they would allow the model to easily learn to attend by relative positions.
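Below is a small NumPy sketch of these sinusoidal positional encodings, just to illustrate the formula; the sequence length and model dimension are arbitrary placeholders.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]              # token position
    i = np.arange(0, d_model, 2)[None, :]          # (even) dimension index, i.e. 2i
    angle = pos / np.power(10000, i / d_model)
    pe[:, 0::2] = np.sin(angle)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions: cosine
    return pe

print(positional_encoding(max_len=50, d_model=16).shape)  # -> (50, 16)
```

The resulting matrix is simply added to the input embeddings before the first encoder layer.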
Also important here are the residual connections around the layers: since the positional encodings are only added at the bottom of the stack, the residual connections help carry the position information of the word embeddings up through the layers, and this improved the model by a considerable amount.
Together, these two components make up the attention-only Transformer model.
The authors also compare the per-layer complexity of the different mechanisms in the paper; restricted self-attention has the lowest complexity per layer.
Some interesting results
In Table 2, the WMT 2014 English-to-German translation task was used. The big Transformer model outperforms the best previously reported models with a BLEU score of 28.4 (this model took 3.5 days to train on 8 P100 GPUs). On the WMT 2014 English-to-French task, the big Transformer model reached a BLEU score of 41, at less than a quarter of the training cost of the previous state-of-the-art models.
The authors also observe that reducing the attention key size hurts model quality, as shown in Table 3. This suggests that determining compatibility is not easy, and that a more sophisticated compatibility function than dot product may be worth investigating. The table also shows that bigger models are better and that dropout is beneficial for avoiding overfitting.
The authors also observe in Table 4 that, despite the lack of task-specific tuning, the model performs well and yields better results than all previously reported models apart from the Recurrent Neural Network Grammar. Even when training only on the WSJ training set, the Transformer outperforms the BerkeleyParser [29].
Conclusion and summary for Paper I
The authors built their architecture on the encoder-decoder approach, with multi-head self-attention layers and positional encodings. For translation tasks, the Transformer trains significantly faster than convolutional or recurrent models, and on both the WMT 2014 English-to-German and English-to-French translation tasks it outperformed the previous state of the art. The authors also suggest the approach could be extended to other modalities such as audio and video.
After this paper, one of the major breakthroughs in the world of NLP was BERT.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT (Bidirectional Encoder Representations from Transformers) is a technique for pre-training language representations. Its key innovation is applying the bidirectional training of the Transformer to language modelling. BERT became huge because it outperformed the previous state-of-the-art methods thanks to its unsupervised and bidirectional pre-training for Natural Language Processing.
“BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.”
-From the ‘BERT’ paper
For example, the word “lit” would have the same word2vec vector for every one of its occurrences in a corpus. With BERT, “She lit the room up as she entered” and “She lit the candles” produce contextualized embeddings for “lit” that differ according to the sentence and context in which the word is used.
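As a quick illustration, the sketch below (assuming the Hugging Face transformers library and the pretrained bert-base-uncased checkpoint are available) extracts the vector for “lit” from both sentences; the helper function embedding_of is my own, not part of any library.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embedding_of(sentence, word):
    # Run the sentence through BERT and return the hidden state of the
    # first token that matches `word`.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = embedding_of("She lit the room up as she entered", "lit")
v2 = embedding_of("She lit the candles", "lit")
# Similar but not identical: the two vectors differ with context,
# unlike a static word2vec embedding.
print(torch.cosine_similarity(v1, v2, dim=0))
```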
Fun fact: Google Search announced in October 2019 that they had started applying BERT to English-language search queries in the United States [3]. In December 2019, Google Search expanded it to over 70 languages. As of October 2020, almost every English query is processed by BERT [4].
BERT Architecture
The BERT framework is composed of two main steps:
- Pre-training BERT
- Fine-tuning BERT
1. Pre-training:
Pre-training utilizes two main tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
- MLM: The [MASK] token is used to replace 15% of the tokens. Since fine-tuning does not contain the [MASK] token, the authors want to avoid a mismatch between pre-training and fine-tuning. Thus, for the selected tokens they follow these rules (a short sketch of this rule follows the list):
- The token is replaced with [MASK] (80% of the time)
- The token is replaced with a random token (10% of the time)
- The token remains unchanged (10% of the time)
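Here is a tiny, self-contained sketch of this 80/10/10 rule. It is my own illustration with a toy vocabulary, not the authors’ code.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:      # select ~15% of the tokens
            labels[i] = tok                  # the model must predict the original token
            r = random.random()
            if r < 0.8:                      # 80% of the time: replace with [MASK]
                inputs[i] = "[MASK]"
            elif r < 0.9:                    # 10% of the time: replace with a random token
                inputs[i] = random.choice(VOCAB)
            # remaining 10% of the time: leave the token unchanged
    return inputs, labels

print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"]))
```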
- NSP: BERT separates sentences with a special token ([SEP]). Two input sentences are fed to the model during training such that:
- 50% of the time, the original sequence is copied as the next sequence
- For the remaining 50%, it is replaced with a random sentence from the corpus.
This is done so that the model learns the relationships between sentences, which is how Next Sentence Prediction (NSP) works: the model has to predict whether the 2nd sentence actually follows the 1st or is a random one. For this prediction, the complete sequence passes through the base model; a simple classification layer transforms the output of the [CLS] token into a 2x1 vector, and the “IsNext” / “NotNext” label is assigned using softmax.
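A minimal sketch of how such sentence pairs can be constructed is shown below; the make_nsp_pair helper and the toy corpus are purely illustrative, not BERT’s actual data pipeline.

```python
import random

def make_nsp_pair(sentences, idx):
    """Return ("[CLS] sent_a [SEP] sent_b [SEP]", label) for sentence idx."""
    sent_a = sentences[idx]
    if random.random() < 0.5 and idx + 1 < len(sentences):
        sent_b, label = sentences[idx + 1], "IsNext"       # true next sentence
    else:
        sent_b, label = random.choice(sentences), "NotNext"  # random sentence
    return f"[CLS] {sent_a} [SEP] {sent_b} [SEP]", label

corpus = [
    "She lit the candles.",
    "The room glowed softly.",
    "BERT uses two pre-training tasks.",
]
print(make_nsp_pair(corpus, 0))
```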
The above two tasks are trained together so as to minimize the combined loss function.
Some interesting results
- BERT-Base Model: 12 layers, 768 hidden units, 12 attention heads, approx. 110M parameters; trained on 4 Cloud TPUs for 4 days
- BERT-Large Model: 24 layers, 1024 hidden units, 16 attention heads, approx. 340M parameters; trained on 16 Cloud TPUs for 4 days
From these results we can conclude that model size matters: BERT-Large clearly outperforms BERT-Base. We can also conclude that BERT’s bidirectional approach (MLM) converges more slowly than a left-to-right approach, probably because only 15% of the words are predicted in each batch. Despite this slower convergence, the bidirectional approach outperforms the left-to-right one after only a small number of pre-training steps. Finally, as with most models, more training data and more training steps lead to higher accuracy. The authors also increased the hidden dimension from 200 to 600 and observed improved performance, but beyond 1,000 little further gain was observed.
These results are observed on multiple datasets, and the authors report them in Tables 6, 7 and 8.
Conclusion and summary of Paper II
BERT created a “boom” in the Natural Language Processing field and is extremely powerful. The combination of masked language modeling and next sentence prediction during pre-training makes the model stronger, and it has greatly increased our capacity for transfer learning in NLP: many variations of it can be created to improve the model further, and it solves a wide range of NLP tasks. To improve BERT’s performance, modifications were proposed in papers like RoBERTa [5] and DistilBERT [6]. BERT has opened up endless possibilities, and this is just the beginning.
References:
[1] Attention Is All You Need
[2] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[3] Nayak, Pandu (25 October 2019). “Understanding searches better than ever before”. Google Blog. Retrieved 10 December 2019.
[4] “Google: BERT now used on almost every English query”. Search Engine Land. 2020-10-15. Retrieved 2020-11-24.
[5] RoBERTa
[6] DistilBERT
[7] Attention