AI Blog Post Content

The article below was written by a Hugging Face summarization pipeline. As I stated in the blog post, I didn't train it, I barely pre- and post-processed the output, and I lightly edited it after the fact. I scraped text from top Google Search results and fed that text through the summarizer. I spent no time training the models to do this task reasonably well, and I used zero rows of data to train this model.

BERT Overview

[1] BERT stands for Bidirectional Encoder Representations from Transformers. Yes, there is a multilingual BERT model available as well. BERT-base was trained on 4 Cloud TPUs for 4 days. A recent paper talks about bringing down BERT pre-training time. Each of the fine-tuning tasks discussed in the paper takes at most about 1 hour on a single Cloud TPU.

[2] BERT is a "deeply bidirectional" model that can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks. ULMFiT established the golden formula for transfer learning in NLP: pre-training followed by fine-tuning. Most of the breakthroughs that followed ULMFiT tweaked this formula, and Transformer-based models eventually replaced LSTM-based architectures for language modeling. BERT's architecture builds on top of the Transformer. The authors of BERT added a specific set of rules to represent the input text for the model. Without making any major change to the model's architecture, we can easily train it on multiple kinds of NLP tasks.
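The "pre-train, then fine-tune with one output layer" formula can be sketched with a toy example: treat a frozen, randomly initialized encoder as a stand-in for a pre-trained feature extractor and train only a new output layer on top. Everything here (shapes, data, learning rate) is synthetic and illustrative; a real setup would load an actual checkpoint, e.g. via the Hugging Face transformers library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained encoder: a fixed (frozen) projection.
W_enc = rng.normal(size=(4, 8))

def encode(x):
    return np.tanh(x @ W_enc)  # frozen "contextual" features

# Tiny synthetic binary classification task.
X = rng.normal(size=(64, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# The one additional output layer we train on the downstream task.
w, b = np.zeros(8), 0.0
h = encode(X)  # encoder is frozen, so features are computed once

for _ in range(500):  # plain logistic-regression updates
    p = 1.0 / (1.0 + np.exp(-(h @ w + b)))
    grad = p - y
    w -= 0.1 * h.T @ grad / len(X)
    b -= 0.1 * grad.mean()

acc = (((1.0 / (1.0 + np.exp(-(h @ w + b)))) > 0.5) == y).mean()
```

In full BERT fine-tuning the encoder weights are usually updated too, not kept frozen; freezing here just keeps the sketch short.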


[3] Transformers are designed to handle sequential data, such as natural language, for tasks such as translation and text summarization. Unlike recurrent neural networks, Transformers do not require that the sequential data be processed in order. This feature has enabled training on larger datasets than was possible before it was introduced.

[3] The Transformer uses an attention mechanism without being an RNN: it processes all tokens at the same time and calculates attention weights between every pair of them simultaneously. Each encoder and decoder layer makes use of an attention mechanism, and the output of the attention unit for each token is a weighted sum of the value vectors. Transformers have been implemented in major deep learning frameworks such as TensorFlow and PyTorch.

Attention is All You Need

[4] The Transformer was proposed in the paper Attention is All You Need. It uses attention to boost the speed with which these models can be trained. The biggest benefit comes from how the Transformer lends itself to parallelization. If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words into the one it’s currently processing.

[5] BERT is a multi-layer bidirectional Transformer encoder that is adapted to downstream tasks by fine-tuning. Each encoder layer is divided into two sub-layers: a self-attention layer and a feed-forward neural network. Self-attention allows each position in the encoder to attend over all positions in the input sequence, whereas a recurrent model can easily lose information when the text gets longer. In the attention computation, each token is projected into three vectors, which the paper calls Query, Key, and Value.
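The Query/Key/Value computation described above is scaled dot-product attention, and it fits in a few lines of NumPy. This is a minimal single-head sketch with random vectors, not an excerpt from any library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V, weights  # output = weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 16))  # 5 tokens, d_k = 16
K = rng.normal(size=(5, 16))
V = rng.normal(size=(5, 16))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, so every token's output really is a convex combination of all the value vectors, computed for all tokens at once.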

[5] A padding mask is used in every scaled dot-product attention layer, while the sequence (look-ahead) mask is used only in the decoder's self-attention; in implementations the two are combined and passed as the attn_mask of the masked multi-head attention. BERT builds on the Transformer architecture and uses a novel technique called Masked Language Modeling (as we will see later), which is what allows bidirectional training.
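The two masks are easy to construct by hand. A minimal NumPy sketch, using the common convention that `True` marks positions attention must ignore (libraries differ on this convention, so check before reusing):

```python
import numpy as np

def padding_mask(token_ids, pad_id=0):
    # True where attention is NOT allowed (padding positions).
    return np.array([[t == pad_id for t in seq] for seq in token_ids])

def look_ahead_mask(size):
    # True above the diagonal: position i may not attend to positions j > i.
    return np.triu(np.ones((size, size), dtype=bool), k=1)

ids = [[5, 7, 9, 0, 0]]  # one sequence, padded with 0s
pad = padding_mask(ids)
causal = look_ahead_mask(5)
# In the decoder's self-attention the two masks are combined:
combined = pad[:, None, :] | causal
```

Wherever the combined mask is `True`, the attention score is set to a large negative number before the softmax, so the weight there becomes effectively zero.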

[5] Masked LM and Next Sentence Prediction are used to train a deep bidirectional representation. Some percentage of the input tokens (15% in the paper) are simply masked at random, and the model must predict them; it is this objective that makes true bi-directionality possible. A downside is that it creates a mismatch between pre-training and fine-tuning, since [MASK] never appears at fine-tuning time. To mitigate this, the authors do not always replace "masked" words with the actual [MASK] token.
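The mitigation is the 80/10/10 rule from the BERT paper: of the 15% of positions selected, 80% become [MASK], 10% become a random token, and 10% are left unchanged. A sketch in plain Python (the vocabulary ids here are invented for illustration):

```python
import random

MASK_ID, VOCAB_SIZE = 3, 100  # toy values; ids below 4 are "special" tokens

def mask_tokens(token_ids, seed=0, mask_prob=0.15):
    """BERT-style MLM corruption: pick ~15% of positions, then
    80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    rng = random.Random(seed)
    corrupted, labels = list(token_ids), [None] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must predict the original token here
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = MASK_ID
            elif roll < 0.9:
                corrupted[i] = rng.randrange(4, VOCAB_SIZE)
            # else: keep the original token (the 10% "unchanged" case)
    return corrupted, labels

ids = list(range(10, 40))
corrupted, labels = mask_tokens(ids)
```

The loss is computed only at positions where `labels` is not `None`; every other position is ignored by the MLM objective.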

BERT Tokenization Explained

[6] BERT has taken over a majority of tasks in NLP. The 2017 paper, “Attention Is All You Need”, proposed the Transformer architecture. A common architecture is trained for a relatively generic task, and then, it is fine-tuned on specific downstream tasks that are more or less similar to the pre-training task. This is one of the groundbreaking models that has achieved the state of the art in many downstream tasks.

[6] Tokens fed to the BERT model are tokenized using WordPiece embeddings. For masked language modeling, 15% of the tokens in each sequence are selected at random; 80% of the time a selected token is replaced with [MASK], and the model is trained to predict the selected tokens using all the other tokens of the sequence. For next-sentence prediction, half the time a random sentence is taken from the corpus as the second sentence. This ensures that the model adapts to training on pairs of sequences (for tasks like question answering and natural language inference).

[7] BERT is a model that knows how to represent text. It looks left and right several times and produces a vector representation for each word as its output. When fine-tuning, we train our additional layer(s) and also change (fine-tune) BERT's weights. If we are interested in classification, we need to use the output of the first token (the [CLS] token).
The BERT library provides a tokenizer for each of BERT's models. The maximum sequence length for BERT is 512: if a token is not present in the vocabulary, the tokenizer substitutes the special [UNK] token, and only the first 512 tokens are kept for both the train and test sets.
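WordPiece tokenization is a greedy longest-match-first split, with [UNK] as the fallback. A toy sketch of that scheme, with truncation to the 512-token limit (the tiny vocabulary here is made up; a real BERT tokenizer loads a ~30k-entry vocabulary from a file):

```python
VOCAB = {"[UNK]", "[CLS]", "[SEP]", "play", "##ing", "##ed", "token", "##izer", "the"}

def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first subword split; falls back to [UNK]."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1  # try a shorter match
        if cur is None:
            return ["[UNK]"]  # no known subword covers this span
        pieces.append(cur)
        start = end
    return pieces

def tokenize(text, max_len=512):
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word))
    return tokens[:max_len]  # BERT simply truncates to 512 tokens
```

Note how an unseen word like "playing" still gets a useful representation ("play" + "##ing") instead of collapsing straight to [UNK].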

[8] Subword-level tokenization schemes underlie both the Transformer (2017) architecture and the BERT (2018) language model. The tokenizer's output can be used as input for a number of tasks such as next-sentence prediction, question answering, and classification. The goal is to represent your entire text dataset with the smallest number of tokens. If you are training a POS tagger or classifier, however, subword units can be harder to work with than whole words.

[8] SentencePiece is able to provide fast and robust subword tokenization on any input language and can be used as part of an end-to-end solution. We can train a BPE model on the same dataset and then simply print the tokens. For the most efficient tokenization, the same library also offers a Unigram model.
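Under the hood, BPE training repeats one simple step: find the most frequent adjacent symbol pair in the corpus and merge it into a new symbol, until the vocabulary reaches its target size. A minimal sketch of a single round, on a made-up corpus (this illustrates the algorithm, not SentencePiece's actual implementation):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    a, b = pair
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Words as tuples of symbols, with their corpus frequencies.
words = {tuple("lower"): 5, tuple("lowest"): 3, tuple("newer"): 2}
pair = most_frequent_pair(words)  # ("w", "e"): it appears 10 times in total
words = merge_pair(words, pair)
```

Real training runs this loop thousands of times, and the resulting merge list is exactly what the trained BPE model stores.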

[9] SentencePiece trains tokenization and detokenization models from sentences, with the extension of direct training from raw sentences. There is no language-dependent logic, and the number of unique tokens is predetermined prior to the neural model training.
SentencePiece is fast enough to train the model from raw sentences. A standard English tokenizer would segment the text "Hello world." into three tokens, but then the original input and the tokenized sequence are NOT reversibly convertible: the tokenized sequence does not preserve the information necessary to restore the original sentence. SentencePiece, by contrast, treats whitespace as a basic symbol, which makes its tokenization lossless.
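The losslessness can be shown in a few lines: encode whitespace as a visible meta symbol ("▁", as SentencePiece does) so the token sequence can always be decoded back to the exact original string. The word-level splitter below is a toy for illustration, not SentencePiece's actual subword algorithm:

```python
def encode(text):
    """Prefix every word with the meta symbol '▁' (U+2581); spaces are never lost."""
    return ["▁" + w for w in text.split(" ")]

def decode(tokens):
    """Concatenate and turn '▁' back into spaces: the exact inverse of encode."""
    return "".join(tokens).replace("▁", " ")[1:]
```

Because even runs of consecutive spaces survive as their own "▁" tokens, `decode(encode(text))` reproduces `text` exactly, which a split-and-rejoin tokenizer cannot guarantee.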

References/Websites Summarized by BERT

Additional References