Chapter 1 Introduction

Author: Xiao-Yin To

Supervisor: Daniel Schalk, Matthias Aßenmacher

Over the course of the past decades the importance and utilization of artificial intelligence technology has continuously gained traction. In the present times, it is already inextricably linked with most of the surroundings that constitute the human shaped environment. Consequently, a myriad of sectors such as commerce, research and development, information services, engineering, social services, and medical science have already been irreversibly impacted by the capabilities of artificial intelligence. There are three major fields of artificial intelligence that comprise the technology: speech recognition, computer vision, and natural language processing (see Yeung (2020)). In this book we will take a closer look at the modern approaches in natural language processing (NLP).

1.1 History and development

The history of artificial intelligence and NLP dates back to the 18th century, where well-known philosophers such as Leibniz, Spinoza, Hobbes, Locke, Kant, and Hume as well as scientists such as La Mettrie and Hartley tried to formulate laws of thought (see Mccorduck and Cfe (2004)).

However, the first steps of development started in the 20th century. Alan Turing (1937) was the first to propose an abstract Universal Computing Machine and became one of the most defining scientists who shaped the path of the scientific development of artificial intelligence in the following years. He further developed these ideas in his work “Intelligent machinery” (see Turing (1948)) and “Computing Machinery And Intelligence” (see Turing (1950)). In 1949, Warren Weaver proposed that “given that all humans are the same (inspite of speaking a variety of languages), a document in one language could be viewed as having been written in code. Once this code was broken, it would be possible to output the document in another language. From this point of view, German was English in code” (see Weaver (1949)).

In the post World War II era, numerous German documents needed to be translated into the English language. Due to the sheer number of documents, an automatic decryption system was required in order to ensure time efficiency regarding the herculean task, which further accelerated the research concerning NLP. The first teams were comprised of numerous bilingual programmers. The idea behind this was that the knowledge of multiple languages might facilitate the process of creating programs, which could understand languages and their structures, and would subsequently be able to translate texts. While working on the first NLP programs the main difficulty that crystallized related to the complexity and irregularity of many languages (see Hancox (1996)). Beginning in the 1950s, linguists and machine learning teams congregated and introduced new ideas. During this time period, Georgetown University and International Business Machines Corporation (IBM) published the Georgetown Experiment, the first public demonstration of machine translation, which included the first fully automatic translation, being able to translate more than sixty Russian sentences into English (see Hutchins (2005)). Moreover, Noam Chomsky, one of the most important and influential scientists in linguistics, introduced the idea of Generative Grammar, which describes syntactic structures based on rules (see Chomsky (1957)). The most successful NLP systems that were developed back then were the first chatbot ELIZA (see Weizenbaum (1966)), STUDENT (see Bobrow (1964)), and SHRDLU, a language program that allowed user interaction with a block world (see Winograd (1972)). As the resources that were available for computing in the past were extremely undeveloped – access to computers was restricted, the machines were still really slow, storage was limited, and there were no suitable higher-level programming languages – the creation of such programs was considerably more difficult. The fact that any progress in this field was attained makes the achievements of those scientists all the more remarkable.

In addition to limited resources, researchers encountered the problem that research regarding the development of NLP software came with high costs: By the mid-1960s, machine translation research expenses amounted to 20 million USD, which were paid by U.S. government funding. Those two obstacles, resource limitations along with high costs, were the main reason for the slow advancement of research in this area. The history of NLP reached its lowest point, when in 1966 the Automatic Language Processing Advisory Committee evaluated the results that were attained through the funding and reported, that “there had been no machine translation of general scientific text, and none is in immediate prospect” ((“Language and Machines” 1966)). This report caused U.S. funding to be discontinued, which is the reason why in the following decade the quantity of NLP in scientific literature decreased enormously. Nevertheless, compelling developments such as Augmented Transition Networks, which aid in the analysis of sentence structures, Case Grammar, which facilitates comprehension of linguistic structures by using the link between different components of sentences, and Semantic Representation, which signifies an abstract language in which meanings can be represented, originated in that time (see Hancox (1996)).

In the 1980s, the so-called Statistical Revolution took place. Prior to that, NLP was a primarily “grammar-based approach”, which denotes that systems were created by hand-coding rules and parameters. Via the statistical revolution, the empirical “statistical approach” was introduced (see Johnson (2009)) and consequently “NLP was characterized by the exploitation of data corpora and of (shallow) machine learning, statistical or otherwise, to make use of such data” (see Deng and Liu (2018)). This approach has dominated NLP ever since, as the amount of machine-readable data and computational power has continuously expanded. Since simple Machine Learning techniques are often not sufficient for creating NLP applications that can fulfil the requirements of real-life tasks, nowadays most of the methods are based on Deep Learning designs (see Deng and Liu (2018)).

1.2 Statistical Background

Ever since the Statistical Revolution, many challenging aspects could be tackled using statistical approaches and artificial intelligence. Statistics shaped a substantial part of the path of NLP, conjointly with fundamental knowledge of linguistics. The statistical approaches used in this booklet presuppose mathematical foundations such as elementary probability theory and essential information theory. In order facilitate the comprehension of the approaches explained in the later chapters, now some of the basic schemes that lie the foundations to those modern approaches in NLP will be introduced.

Human language courses usually consist of two elementary parts: vocabulary and grammar. The language skills are often measured by the number of words a person knows, while grammar allows using the words and form sentences correctly. Further, NLP systems basically consist of learning and understanding words as well as recognizing the patterns in which they occur.

The first step is characterized by the recognition and comprehension of words. One difficulty arising when trying to understand words is that many words possess multiple meanings. It might be challenging to ascertain which of the meanings is implied. For word sense, disambiguation methods such as bag of words models or Bayesian classification can be used, which inspect the words around the ambiguous word. Bag of words models consider the dependence of words in a so-called bag, a vector of words that appear in a sentence, so co-occurrences can be learned without understanding grammar. By applying the Bayesian decision rule, the meaning of the word will be decided by choosing the meaning with the highest conditional probability while minimizing the probability of error. Another difficulty is that in many languages words exist, which do not (only) have a meaning themselves, but also possess combined meanings in a collocation, an expression consisting of two or more words. One way for NLP systems to implement this is by using basic statistics such as frequency, mean and variance, and hypothesis testing. If two or more words often occur together in a sentence, it may be concluded that these words together possess a special function in this sentence and cannot be explained by the combination of their respective meanings. Mean and variance can help finding the affiliation between words, which do not always appear in the same structure or with the same distance within a phrase, by calculating the mean distance between words in a sentence and the variance of this distance during training, in order to enable correct classification of these words into a collocation in later applications of the model.

The second step is understanding not only the words themselves, but also their meaning given their contexts. As an example, Markov Models can be used for the classification of texts depending on the surrounding context as well as grammatical structure finding. A Markov Model is a sequence classifier which assigns a sequence of classes to a sequence of observations, enabling the classification of texts depending on the class of the previous texts. Markov Models can be used for designing a part-of-speech tagger. Part-of-speech tagging allows assigning words to their part-of-speech in a sentence, allowing for the comprehension of a sentence without requiring complete understanding (see Manning and Schutze (2008)).

Combining the understanding of words and grammar, many NLP problems can be solved. In this booklet more advanced methods, of which some are based on the described basic methods, will be introduced.

1.3 Outline of the Booklet

This booklet circumstantiate modern approaches used for natural language processing, such as Deep Learning and Transfer Learning. Moreover, the resources that are available for the training of NLP tasks will be investigated and a use-case where NLP will be applied for generation of natural language will be shown.

For the analysis and comprehension of human language, NLP programs need to extract information from words and sentences. As neural networks and other machine learning algorithms require a numeric input for training, word embeddings, using dense vector representations for words, are applied. Those are usually learned by neural networks with multiple hidden layers, deep neural networks. In order to solve easy tasks, simple structured neural networks can be applied. In order to overcome the limitations of those simple structures, recurrent and convolutional neural networks are utilized. Thereby, recurrent neural networks are used for models that can learn sequences without pre-defined optimal fixed dimensions while convolutional neural networks are applied for sentence classification. Chapter 2 of the booklet gives a short introduction to Deep Learning in NLP. The Foundations and Applications of Modern NLP will be described in chapter 3. In chapter 4 and 5 recurrent neural networks and convolutional neural networks and their applications in NLP will be explained and discussed.

Transfer learning is an alternative to learning models for every task or domain. Here, existing labeled data of related tasks or domains can be used for training a model and applying it onto the task or domain of interest. The advantage of this approach is that there is no need for a long training in the target domain, and that time for training of the model can be saved, while still resulting in a (mostly) better performance. A concept used in transfer learning is Attention, which enables the decoder to attend to the entire input sequence, or Self-Attention, which allows a transformer model to process all input words at once and model the relationships between all words in a sentence, which renders fast modelling of long-range dependencies in a sentence possible. The concepts of transfer learning will be briefly introduced in chapter 6 of the booklet. Chapter 7 will describe transfer learning and LSTMs by presenting the models ELMo, ULMFiT, and GPT. Chapter 8 will illustrate the concepts of Attention and Self-Attention for NLP in detail. In chapter 9, transfer learning is combined with Self-Attention, introducing the models BERT, GTP2, and XLNet.

For NLP modelling, resources are needed. In order to find the best model for a task, benchmarks can be used. For comparing different models within a benchmark experiment, metrics such as exact match, Fscore, perplexity, or bilingual evaluation understudy, or accuracy, are required. Chapter 10 of the booklet provides a brief introduction to the resources for NLP and the manner in which they are used. Chapter 11 will explain the different metrics, give an insight into the benchmark datasets SQuAD, CoQa, GLUE and SuperGLUE, AQuA-Rat, SNLI, and LAMBADA as well as pre-trained models and databases where resources can be found, such as “Papers with Code” and “The Big Bad NLP Database”.

In the last chapter of the booklet, the generative NLP process Natural Language Generation, thus the generation of understandable text in a human language, is presented. Therefore, different algorithms will be described and chatbots as well as image captioning will be shown for illustrating the possibilities of application.

This introduction to the various methods in NLP functions as the foundation for the following deliberations. The individual chapters of the booklet will present modern methods in NLP and provide a more detailled discussion of the potential as well as the limitations along with various examples.


Bobrow, Daniel G. 1964. “Natural Language Input for a Computer Problem Solving System.”

Chomsky, Noam. 1957. Syntactic Structures. The Hague: Mouton.

Deng, Li, and Yang Liu. 2018. Deep Learning in Natural Language Processing. Springer Nature.

Hancox, Peter. 1996. Natural Language Processing.

Hutchins, John. 2005. “The First Public Demonstration of Machine Translation: The Georgetown-Ibm System, 7th January 1954.” Noviembre de.

Johnson, Mark. 2009. “How the Statistical Revolution Changes (Computational) Linguistics.” Proceedings of the EACL 2009 Workshop on the Interaction Between Linguistics and Computational Linguistics Virtuous, Vicious or Vacuous? - ILCL 09, March.

“Language and Machines.” 1966. National Academy of Sciences National Research Council.

Manning, Christopher D., and Hinrich Schutze. 2008. Foundations of Statistical Natural Language Processing. MIT.

Mccorduck, Pamela, and Cli Cfe. 2004. “Machines Who Think.” A K Peters/CRC Press, March.

Turing, Alan M. 1950. “Computing Machinery and Intelligence.” Mind LIX (236): 433–60.

Turing, Alan Mathison. 1948. “Intelligent Machinery.” NPL. Mathematics Division.

Turing, A. M. 1937. “On Computable Numbers, with an Application to the Entscheidungsproblem.” Proceedings of the London Mathematical Society s2-42 (1): 230–65.

Weaver, Warren. 1949. “The Mathematics of Communication.” Scientific American 181 (1): 11–15.

Weizenbaum, Joseph. 1966. “ELIZA-a Computer Program for the Study of Natural Language Communication Between Man and Machine.” Communications of the ACM 9 (1). ACM New York, NY, USA: 36–45.

Winograd, Terry. 1972. “SHRDLU: A System for Dialog.” CUMINCAD.

Yeung, Joshua. 2020. “Three Major Fields of Artificial Intelligence and Their Industrial Applications.” Medium. Towards Data Science.