Do Androids Dream of Electric Sheep?
Intro
This article presents an implementation of the machine learning model GPT-2 [1] for automatic text generation. It is a deep neural network with a particular architecture known as the transformer. It was developed by the research company OpenAI [2] and is open source.
In this case, the model was trained with the aim of generating original literary passages in the style and prose of the Argentinian writer Jorge Luis Borges; that is, with the objective of writing like him but without paraphrasing or transcribing any text he actually wrote.
Neural Networks
Neural networks are a set of algorithms designed to work by mimicking the way the human brain processes information. The general structure consists of incoming information (input X) that enters a network of interconnected neurons where each one returns a value and the final output (output Y) is formed with those values, i.e. with the incoming information processed in some way according to the required task.
For example, when the image of an object (e.g., a bottle) enters through the eyes, the brain with its network of neurons processes it and manages to associate it with the general abstract concept of that object in a particular language (it can be the word “bottle”, the function of the bottles, the content of the bottle, etc.). Artificial neural networks are intended to work in the same way. To understand it better, let us look at the structure of a simple neuron: the perceptron [3]:
Perceptron
The perceptron is the simplest structure of a neural network; we can associate it with a single neuron in the network. To describe it, let’s use a bit of mathematics. As we said, the network receives a packet of incoming information (input), generally represented by elements of a vector space (vectors and matrices), say, for example, a vector x of n components. The neuron, abstractly speaking, must have a structure capable of receiving and operating on x. In general, the neuron returns a response to the input x defined as

y = f(w · x - u)
where w is a vector of parameters or weights with the same size as the input and u is a threshold representing the inhibition of the neuron (how much it contributes to the final output). The function f is generally called the activation function and is what distinguishes the neuron model from a linear classifier; non-linear functions are generally used as activations.
We can use this simple example of a single neuron to introduce the concept of training a network. Basically, this is a process in which the parameters w and u are iteratively updated in order to minimize the error made when predicting a known output Y. Many different functions are used to estimate this error; in general, this function is known as the cost function or loss function [4]. The weights are usually determined iteratively with a strategy known as gradient descent [5], which minimizes the cost function by following the direction given by its gradient with respect to the model parameters (the weights w and the threshold u).
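To make this concrete, here is a minimal sketch (not part of the original project) of a single perceptron trained by gradient descent on a toy dataset; the sigmoid activation, learning rate and number of steps are illustrative choices.

```python
import numpy as np

# Toy dataset: the logical AND of two binary inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0, 0, 0, 1], dtype=float)

def f(z):
    # sigmoid activation, a common non-linear choice
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w = rng.normal(size=2)   # weights
u = 0.0                  # threshold
lr = 1.0                 # learning rate (illustrative value)

for step in range(10000):
    y_pred = f(X @ w - u)                  # neuron output for every sample
    error = y_pred - Y                     # difference with the known output
    grad = error * y_pred * (1 - y_pred)   # squared-loss gradient through the sigmoid
    w -= lr * (X.T @ grad) / len(X)        # gradient descent step on the weights
    u -= lr * (-grad.mean())               # gradient descent step on the threshold (dz/du = -1)

print(f(X @ w - u))  # predictions approach [0, 0, 0, 1]
```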
A more general neural network does not differ much from this model: it is a multilayered structure in which this basic unit is stacked over and over, combined with different computational techniques that improve each step of the process, such as the operation of each neuron, the interconnection between neurons and, above all, the optimization of the loss function.
The network, finally, is composed of many layers, each containing a “strip” of neurons that operate exactly as described above. A matrix of weights W can then be associated with each layer, where row i holds the weights of neuron i of that layer; Wᴸᵢⱼ refers to the j-th component of the weight vector of neuron i in layer L. The most complex models can have millions of parameters, and computing this minimization can be expensive.
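As an illustration (a sketch, not the project’s code, with made-up layer sizes), a forward pass through a small network of stacked layers, each with its own weight matrix W and threshold vector:

```python
import numpy as np

def f(z):
    # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# an input of size 4, two hidden layers of 8 neurons each, an output of size 2
sizes = [4, 8, 8, 2]
# Ws[L][i, j] is the j-th weight of neuron i in layer L
Ws = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
us = [np.zeros(n_out) for n_out in sizes[1:]]

def forward(x):
    # each layer applies its weight matrix, subtracts its thresholds and activates
    for W, u in zip(Ws, us):
        x = f(W @ x - u)
    return x

x = rng.normal(size=4)  # an input X
print(forward(x))       # the output Y of the network
```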
In particular, a technique known as back-propagation [6] is used to update the weights. In this article we will not go into the details of this technique, but the spirit remains the same: finding the minimum of the error committed as a function of the model parameters.
Natural Language Processing
One of the most developed areas in the artificial intelligence universe is Natural Language Processing (NLP). This sub-discipline encompasses all the computational strategies aimed at optimizing the linguistic interaction between humans and computers, i.e., the interpretation by machines of spoken or written language, translation, automatic text generation, colloquial language interpretation and processing, etc. There are different developments of neural networks applied to this field, such as Recurrent Neural Networks (RNN) [7] and Transformers [8]. In this article we will focus on the latter structure, since it is the type of network used in GPT-2.
Transformers
The transformer, unlike recurrent networks (which receive the text sequence separated in time, word by word), takes the entire sequence as input, resulting in spectacular performance in terms of context and syntax; transformers are very useful in translation tasks and automatic text generation. The key concept needed to introduce them is the attention layer [9].
The input sequence is generally expressed in a dense vector space in the form of embeddings [10]. We will not go into the details of this technique, but it basically maps the vocabulary to a lower-dimensional space while preserving the relevant information of the words and the original corpus. Each word is then represented by a vector x of dimension m (different from the vocabulary size). In the case of the transformer, as we said, whole sequences (sentences) are entered, which are then represented by matrices X (the stacked embeddings of each word of the sequence).
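As a small illustration (the vocabulary, the dimension m and the values are made up; real embeddings are learned), each word is looked up in an embedding table and the sentence becomes a matrix X with one row per word:

```python
import numpy as np

# hypothetical vocabulary and embedding dimension m
vocab = {"I": 0, "am": 1, "an": 2, "intelligent": 3, "machine": 4}
m = 8
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), m))  # embedding table (normally learned during training)

sentence = ["I", "am", "an", "intelligent", "machine"]
X = np.stack([E[vocab[w]] for w in sentence])  # shape (5, m): one embedding per word
print(X.shape)
```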
Generally, transformers have two main components: encoders and decoders. They are composed of a system of layers where the main tasks of sequence processing and learning are carried out: normalization, resizing, feed-forward layers and, most importantly, the self-attention layers that we will explain below. Thus, a transformer is basically made up of towers of identical encoders and decoders through which the information passes and is processed successively, first through the tower of encoders and then through the tower of decoders, from which the output sequence results.
In this article we will not go into detail on the internal composition of all the encoder and decoder layers (feed-forward layers, renormalization layers, etc.) but we will go a little deeper into the self-attention layer.
As mentioned, each encoder is given a complete sequence or phrase, represented in the embedding space. Let’s say, for example, that the sentence “I am an intelligent machine” is represented by vectors x¹, x², x³, x⁴ and x⁵ of dimension m, which stacked one on top of the other form an input matrix X of size 5×m. Now let’s look at what happens inside an encoder.
The general idea of an attention layer is to estimate the relationship of the words in the sequence to each other in order to establish a criterion of attention on the rest of the sequence when looking at a particular word. For example, in the sentence “This animal is adorable and the zoo is closed”, if we are looking at the word “this”, the attention layer is intended to account for the fact that it should pay more attention to “animal” than to “closed”. That is, when we are encoding the word “this” we seek to find how much attention to pay to the rest of the words in the sentence, and so on with each of the words in the sequence. This task is accomplished through a purely algebraic approach.
The attention layer of the encoder is composed of three matrices of trainable parameters: W_Q, W_K and W_V, where Q, K and V refer to “query”, “key” and “value”. Basically, the operation of the layer consists of multiplying the input sequence (a k×m matrix, where k is the length of the sequence and m the dimension of the embeddings) by each of these matrices, obtaining three new matrices: Q, K and V. Then the matrix product of Q with the transpose of K is computed and normalized (through the softmax function), resulting in a series of relationship weights between each word of the sequence and the others. Next, a weighted sum (with the weights obtained from multiplying Q and K) of the vectors that compose V is performed, obtaining a final matrix Z, which is the output of the attention layer and the input of the following layers (normalization, feed-forward, etc.). It should be clarified that this last operation is not a usual operation between matrices but one defined for this task: for each query qᵢ (each input word), the vectors vⱼ are summed weighted by qᵢ · kⱼ, with j running over the sequence, and this is repeated for each word in the sequence (for each i). The whole operation can be written and implemented compactly using matrices.
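A minimal sketch of this computation, assuming a single attention head and including the usual scaling by the square root of the key dimension (a detail we omitted above for simplicity):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # X: (k, m) input sequence; W_Q, W_K, W_V: (m, d) trainable matrices
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # relationship weights between words
    A = softmax(scores, axis=-1)             # each row sums to 1
    return A @ V                             # weighted sum of the value vectors -> Z

k, m, d = 5, 8, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(k, m))
W_Q, W_K, W_V = (rng.normal(size=(m, d)) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)  # (5, 8): one contextualized vector per word of the sequence
```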
In this way, a new representation is obtained and fed to the next layers (and to the next encoders of the tower); it carries the information of how much attention each part of the sequence should pay to the rest.
This summary gathers the main and general ideas of the technique, and we will not go further into the encoder attention layers, but the subject can get more complex with the different techniques used to optimize the algorithm, such as multi-head attention layers and positional embeddings.
The decoder works in much the same way but with some fundamental differences. To begin with, it does not recalculate the K and V matrices: it takes them from the output of the last encoder and only calculates new parameters for the Q matrices in each decoder. In addition, decoders add a fundamental concept that will be key to the operation of GPT-2: the “masked self-attention layer”. It works almost identically to a self-attention layer, with the difference that when it is decoding a particular word, it only estimates the attention over the part of the sequence prior to the word in question. That is to say, it does not focus on the future but estimates the focus of attention on the previous parts of the sequence. This is achieved by masking the normalization (softmax) and giving zero weight to the words that come after the word being processed.
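In practice, the masking amounts to adding minus infinity to the scores of future positions before the softmax, so that they end up with zero weight. A sketch of this variant, usable with the same X and weight matrices as the previous example:

```python
import numpy as np

def masked_self_attention(X, W_Q, W_K, W_V):
    # same computation as before, but positions after the current one are masked out
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    k = X.shape[0]
    future = np.triu(np.ones((k, k), dtype=bool), k=1)  # True above the diagonal = future tokens
    scores = np.where(future, -np.inf, scores)          # -inf becomes a weight of 0 after softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)               # each row attends only to itself and the past
    return A @ V
```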
GPT-2
The GPT-2 (Generative Pre-trained Transformer 2) model is basically a tower of decoders. There are other well-known implementations (such as BERT, which keeps only the encoders) that use this strategy of splitting the encoder-decoder structure into a tower of only one of the two components; this is the case of GPT. Moreover, this architecture (a tower of decoders) is combined with a clever form of tokenization (Byte-Pair Encoding [11], as we will see later) and an idea that gives the architecture spectacular capabilities for text generation: the model outputs one token at a time. Given an input, what GPT does is generate a token and then add it to the input of the next step. For example, if the input of the first step is “The dog is” and the output of that step is “white”, the input of the next step will be “The dog is white”. This idea is known as “auto-regression”. However, auto-regression is not used to train the model; during training, both the inputs and the outputs are long sequences of text.
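Schematically, auto-regression at generation time is just a loop that feeds each new token back as input. In this sketch, model and tokenizer are hypothetical placeholders, not the project’s actual objects:

```python
def generate(model, tokenizer, prompt, n_tokens=20):
    # hypothetical model/tokenizer interfaces, for illustration only
    tokens = tokenizer.encode(prompt)            # e.g. "The dog is"
    for _ in range(n_tokens):
        next_token = model.predict_next(tokens)  # the model outputs one token at a time
        tokens.append(next_token)                # the new token is fed back as input
    return tokenizer.decode(tokens)              # e.g. "The dog is white ..."
```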
As we saw in the case of the full transformer, the decoder tower takes the matrices of keys and values from the final output of the encoder tower. But where do these matrices come from when there are only decoders? Simply from the previous iterations: when the network is focusing on a particular token at position i in the sequence, it does not recompute the Q, K and V vectors for tokens i-1, i-2, … but reuses the ones it already computed in the previous steps.
Moreover, as mentioned above, the GPT-2 decoder blocks use “masked self-attention layers”, i.e., at the time of the weighted summation of the V vectors, which gives each part of the sequence the information of how much attention it should pay to the rest, the weights of tokens that are in positions after the token in question are cancelled. In other words, future tokens are given a weight of 0. Then, each output of the attention layer is sent to a fully-connected neural layer where all the contextual information learned in the attention layer is processed.
In summary, GPT-2 is a decoder tower that comes in various sizes (various numbers of decoder blocks and various sizes of embeddings), uses certain training and generation techniques (such as masked self-attention and auto-regression, along with others that we did not go into in depth, such as multi-head attention layers and positional encodings), and is trained with a particular vocabulary tokenization system: Byte Pair Encoding.
Byte Pair Encoding
BPE belongs to a family of tokenization algorithms known as sub-word tokenization because they process the vocabulary by chunking words according to some criterion. Originally, it is a data compression technique introduced by Philip Gage in 1994 that replaces the most frequent pairs of bytes with new bytes that were not originally in the data.
In the version of BPE used by GPT, each word in the corpus is reconstructed from the union of the most frequent pairs of characters. The tokens of the model are therefore not whole words but sets of characters, including the most frequent pairs. The advantage of this type of tokenization is that words that appear infrequently in the corpus, for example “unconstitutionally”, will no longer be so strange to the model, because they are a union of frequent character sets such as [‘un’, ‘constitu’, ‘tion’, ‘ally’], since “un”, “tion” and “ally” are extremely frequent in the English language. This allows the algorithm not to treat infrequent words as strange and lose sight of them, but to put them in the context of a language that builds words as a union of prefixes, suffixes and word families with common roots such as “consti”.
Here is an iterative implementation to tokenize the corpus [12]:
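The referenced implementation follows the classic byte-pair-encoding merge loop. A self-contained sketch of that idea on a toy corpus (word frequencies and the number of merges are illustrative):

```python
import re, collections

def get_stats(vocab):
    # count the frequency of each adjacent pair of symbols across the corpus
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    # replace every occurrence of the chosen pair with a single new symbol
    v_out = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        v_out[pattern.sub(''.join(pair), word)] = v_in[word]
    return v_out

# toy corpus: each word split into characters, with its frequency
vocab = {'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6, 'w i d e s t': 3}

for _ in range(10):            # number of merges to learn (illustrative)
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)                # the pair merged at this step, e.g. ('e', 's')
```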
With this algorithm, in the first step the corpus is tokenized at the character level; for example, the word unconstitutionally is tokenized as [‘u’, ‘n’, ‘c’, ‘o’, ‘n’, ‘s’, ‘t’, ‘i’, ‘t’, ‘u’, ‘t’, ‘i’, ‘o’, ‘n’, ‘a’, ‘l’, ‘l’, ‘y’], whereas after a thousand merge steps the tokenization becomes [‘un’, ‘constitu’, ‘tion’, ‘ally’], making the word much easier for the algorithm to process, reconstruct and generate.
What we did
What we did was fine-tune the GPT-2 model (trained on 40GB of English text with a vocabulary of about 50,000 BPE tokens) available for Python in Max Woolf’s gpt-2-simple package: https://github.com/minimaxir/gpt-2-simple.
The training was performed using a GPU on Google Colab.
First, since the model was trained in English, we re-trained its parameters so that it could generate text in Spanish. For this, we fine-tuned the model on 700MB of Spanish text taken from Wikipedia (freely available at https://www.kaggle.com/rtatman/120-million-word-spanish-corpus) for 64,000 steps.
Once the model was able to generate decent Spanish text, we trained it on the complete works of Borges (~4MB) for 2,000 more steps. After this training, the model was able to write “Borgesian” sentences: using his words, punctuation and spacing, narrative structure, etc.
Finally, we trained the model on Borges’ complete poetry (~0.3MB) for 100 steps. The number of steps in this case is considerably smaller since the dataset is much smaller; with a larger number of steps, the model ended up generating sentences copied from the dataset (overfitting).
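Schematically, each of these stages is a call to the package’s finetune function. The sketch below is illustrative: the file names, the run name and the model size are assumptions, and in practice each stage may be run in its own session rather than back to back.

```python
import gpt_2_simple as gpt2

# download the pre-trained English GPT-2 (the model size used here is an assumption)
gpt2.download_gpt2(model_name='124M')

sess = gpt2.start_tf_sess()

# stage 1: teach the English model to write Spanish (file and run names are illustrative)
gpt2.finetune(sess, dataset='spanish_wikipedia.txt', model_name='124M',
              steps=64000, run_name='borges')

# stage 2: Borges' complete works, continuing from the previous checkpoint
gpt2.finetune(sess, dataset='borges_complete.txt', steps=2000,
              restore_from='latest', run_name='borges')

# stage 3: Borges' poetry, only a few steps to avoid overfitting
gpt2.finetune(sess, dataset='borges_poetry.txt', steps=100,
              restore_from='latest', run_name='borges')
```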
Once trained in the described way, the model was able to generate sentences with poetic structure without imitating or copying verbatim lines or verses from the author.
A remarkable feature of the process is that, despite the fact that the subword vocabulary was extracted from English (and therefore includes the most frequent BPE tokens of that language, such as “ing”, “ed”, “is”, “am”, “tion”, etc.), the model was able to generate Spanish text in considerably few steps. This demonstrates the enormous power of this type of architecture and the remarkable versatility of Byte Pair Encoding.
The project of building a similar architecture but using a BPE vocabulary extracted from Spanish is still pending. This is expected to yield even better results with less training time, since generating Spanish words from BPE fragments learned from English is necessarily more expensive.
The trained model is saved in compressed .tar files that include all the information needed to load it and generate text: the BPE vocabulary, the parameters and their values, and the network architecture. Max Woolf’s package allows these checkpoints to be loaded in a simple way with a few lines of code:
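A sketch of what those lines look like with gpt-2-simple; the run name here is illustrative:

```python
import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='borges')   # loads the checkpoint saved during training

text = gpt2.generate(sess, run_name='borges',
                     temperature=0.7,     # randomness of the next-token choice
                     top_k=40,            # sample only among the 40 most likely tokens
                     length=100,          # number of tokens to generate
                     return_as_list=True)[0]
print(text)
```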
Here the argument run_name takes the name of the .tar file. The temperature parameter determines the level of randomness with which the next predicted token is chosen: a temperature of 0 unambiguously chooses the most likely token, while higher values relax that criterion. The parameter top_k determines the number of candidate tokens, ordered by probability; top_k = 40 means that at each step the model will choose among the 40 most likely tokens.
The generated poems are stored in a SQL database and are automatically published on Twitter via its official API.
The final result of this project can be found at https://twitter.com/jlbotges 😄
We would be pleased to receive corrections, contributions or advances in this or other similar projects. For any suggestions, ideas or criticisms, please contact us at https://www.prexa-tec.com or leave a comment.
UPDATE 2021:
The cloud sources of Max Woolf’s package are no longer available, so it is no longer possible to use the package with those sources. However, the package can be adapted by downloading the GPT-2 model from the official OpenAI source.
Bibliography:
[1] https://openai.com/blog/tags/gpt-2/
[2] https://openai.com/
[3] https://en.wikipedia.org/wiki/Perceptron
[4] https://en.wikipedia.org/wiki/Loss_function
[5] https://en.wikipedia.org/wiki/Gradient_descent
[6] https://en.wikipedia.org/wiki/Backpropagation
[7] https://en.wikipedia.org/wiki/Recurrent_neural_network
[8] http://jalammar.github.io/illustrated-transformer/
[9] https://en.wikipedia.org/wiki/Attention_(machine_learning)
[10] https://en.wikipedia.org/wiki/Word_embedding
[11] https://en.wikipedia.org/wiki/Byte_pair_encoding
[12] https://towardsdatascience.com/byte-pair-encoding-the-dark-horse-of-modern-nlp-eb36c7df4f10