Natural language processing, usually abbreviated as NLP, is concerned with the interaction between humans and computers via written or spoken language. It is therefore an interdisciplinary field, drawing ideas from both computer science and linguistics. Popular applications of NLP include summarizing documents, translating between languages, identifying emails as spam, recognizing named entities (such as persons or organizations), assessing whether a tweet or message carries a positive or negative sentiment, and even conducting a conversation with a customer (chatbots). While good progress has been made on some of these applications, others remain very challenging.
Historically, the first ideas can be traced back to the 1950s and researchers such as Alan Turing. The first wave of NLP systems was mostly based on hand-written rules, such as complex regular expressions or conceptual ontologies. For example, to identify an email as spam, you would write rules like "if the subject line contains both the words *cash* and *guaranteed*, in any order and at any position, the email is spam". Due to the complexity of natural languages, the success of this approach was rather limited.

In the 1980s, computing power had increased sufficiently to allow the development of the first machine learning algorithms. Bayesian classifiers are an example of such a classical machine learning algorithm: given a large number of emails pre-classified as spam or not spam, a Bayesian classifier can be trained to identify future spam emails (similar to ones it has seen before) quite accurately. On other potential applications, such as document summarization, chatbots for customer service, or high-quality document translation, some progress was made as well, but these problems continued to prove very difficult to solve.
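To make the Bayesian approach concrete, here is a minimal sketch of a naive Bayes spam classifier. The training emails, word counts, and smoothing details are purely illustrative, not any particular production system:

```python
import math
from collections import Counter

def train_naive_bayes(emails, labels):
    """Count word frequencies per class from pre-labelled emails."""
    counts = {"spam": Counter(), "ham": Counter()}
    class_totals = Counter(labels)
    for text, label in zip(emails, labels):
        counts[label].update(text.lower().split())
    return counts, class_totals

def classify(text, counts, class_totals):
    """Pick the class with the highest (log-)posterior probability."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    scores = {}
    for label in counts:
        # prior: fraction of training emails with this label
        score = math.log(class_totals[label] / sum(class_totals.values()))
        total = sum(counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the score
            score += math.log((counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# toy, made-up training data
emails = ["cash guaranteed win now", "guaranteed cash prize",
          "meeting agenda for monday", "lunch on monday"]
labels = ["spam", "spam", "ham", "ham"]
counts, class_totals = train_naive_bayes(emails, labels)
print(classify("guaranteed cash offer", counts, class_totals))  # spam
```

Unlike the hand-written rules of the first wave, the classifier derives its notion of "spammy" words entirely from the labelled training data.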
Another 30 years later, deep learning came along. While the basic idea of deep artificial neural networks had been around for many years, only now were sufficient computing power and sufficiently large amounts of data available to put the idea into practice effectively.
In addition, researchers contributed novel concepts to the field, especially new network architectures. This combination catapulted deep learning to the forefront of machine learning techniques.
Naturally, deep neural networks were applied to NLP as well. Since many machine learning algorithms need their input represented as a feature vector, words and documents had long been converted (embedded) into such representations, often using a one-hot encoding for words or tf-idf representations for terms and documents. While these approaches work reasonably well, they remove essentially all of the inherent meaning of a word.
One could almost say that this information is being actively hidden from the machine learning algorithm receiving this feature vector. In 2013, word2vec, a deep-learning based method, was introduced to create word embeddings that capture the semantic meaning of the word.
A famous example: if you start with the vector for *king*, subtract the vector for *man*, and add the vector for *woman*, you get a vector that is very close to the one for *queen*. With a one-hot encoding, the same operations would not lead to any meaningful result. The similarities between the vectors produced by word2vec have, for example, been used by search engines to extend a search to similar words.
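The analogy arithmetic can be sketched with hand-crafted toy vectors; real word2vec embeddings are learned from large corpora and have hundreds of dimensions, so the numbers below are purely illustrative:

```python
# Toy 2-dimensional "embeddings" (dimensions: royalty, femaleness) that
# mimic the famous word2vec analogy. Real word2vec vectors are learned
# from text, not hand-crafted like these.
vectors = {
    "king":  [0.9, 0.1],
    "queen": [0.9, 0.9],
    "man":   [0.1, 0.1],
    "woman": [0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / (norm(u) * norm(v))

# king - man + woman, computed element-wise
result = [k - m + w
          for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Which known word is closest to the result?
closest = max(vectors, key=lambda word: cosine(vectors[word], result))
print(closest)  # queen
```

The arithmetic works because the embedding dimensions encode semantic properties (here, royalty and gender) that the operations can add and remove.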
For instance, if a user searches for "Hotel Mallorca", the search engine could recognize that the vectors for *Hotel* and *Finca* are very similar and show fincas on Mallorca as well. Word2vec reinvigorated research into word embedding algorithms, leading, for example, to GloVe and fastText.
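Such a query expansion could be sketched as follows, assuming a set of pre-computed word vectors (the vectors and the similarity threshold here are invented; in practice they would come from a trained word2vec, GloVe, or fastText model):

```python
# Hypothetical pre-computed embeddings; in practice these would come
# from a trained word embedding model, not be hand-written.
vectors = {
    "hotel":       [0.8, 0.6, 0.1],
    "finca":       [0.7, 0.7, 0.2],
    "hostel":      [0.9, 0.5, 0.1],
    "spreadsheet": [0.1, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / (norm(u) * norm(v))

def expand_query(term, threshold=0.95):
    """Return the query term plus all words whose vectors are similar."""
    return [w for w in vectors
            if cosine(vectors[w], vectors[term]) >= threshold]

print(expand_query("hotel"))
```

With a one-hot encoding, every similarity would be zero and the expanded query would never contain anything but the original term.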
All embedding algorithms mentioned so far (and there are many more) compute exactly one feature vector per word. Many words, however, have different meanings depending on the context in which they are used. For example, the word "rose" refers to a pretty flower in "Sandra has her own rose garden" but has a different meaning in "Sales rose 50% last year". Word embeddings extracted from recent algorithms such as BERT or generated by ELMo (both published in 2018) are able to capture these differences in meaning of the same word (homonyms).
BERT, in fact, is much more than an algorithm to compute word embeddings: it generates a universal "language understanding" model that can be used for many NLP tasks. BERT, ELMo, and GPT-2 (published in early 2019) are based on a neural network architecture called the Transformer (an improvement on the well-known encoder-decoder architecture), which seems to provide a basis of considerable potential.
While NLP research had been relatively static for a couple of years, tremendous progress has been made since 2013.
Research results and practical applications have been accelerating further from 2017 to 2019.
Should this trend continue, great leaps forward can be expected in the near future, particularly on the harder NLP tasks.