1. Motivation behind Information Extraction
Newspapers, blogs, and web pages are a rich and diverse source of textual information. However, the information contained in these sources cannot be manually extracted, recorded, and indexed, mainly because of its sheer volume. Moreover, extracting some information sometimes requires specific knowledge or a technical background. This is the case in the financial domain, where we want to automatically extract key information from specific documents (e.g., PDFs, Word documents, etc.) published by banks, funds, and other financial institutions. In order to scale knowledge extraction to the large volume of available text, and to build extractors specific to a given field, new methods based on machine learning have been developed and deployed [1, 2].
2. From information extraction to relation extraction
Information Extraction (IE) is a broad subject covering multiple tasks, mainly Named Entity Recognition (NER), Co-reference Resolution, Relation Extraction (RE), and Event Extraction. RE uses NER and co-reference resolution to extract binary semantic relations, whereas event extraction (i.e., extracting who did what to whom, when, and where for an event) can be represented as a complex combination of relations, and thus may use RE techniques. Accordingly, RE is the pivot of IE.
What is relation extraction? A relation extractor automatically understands the semantics (i.e., the meaning) of a sentence, and then puts it into a structured, computer-readable format known as a Knowledge Base. For example, a relation extractor would map the sentence
John Lennon, the lead singer of the Beatles, was born in England (1)
to the structured fact has_nationality(John Lennon, England), and the sentence:
Google bought Youtube in 2006 for US$1.65 billion (2)
to the fact has_acquired(Google, Youtube).
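Such structured facts can be stored as (relation, subject, object) triples. The following is a minimal sketch of such a knowledge base; the KnowledgeBase class and relation names are illustrative, not taken from any specific library:

```python
# Toy sketch: a knowledge base as a set of (relation, subject, object) triples.
class KnowledgeBase:
    def __init__(self):
        self.facts = set()

    def add(self, relation, subject, obj):
        # Record one structured fact, e.g. has_acquired(Google, Youtube).
        self.facts.add((relation, subject, obj))

    def query(self, relation):
        # Return all (subject, object) pairs linked by the given relation.
        return {(s, o) for (r, s, o) in self.facts if r == relation}

kb = KnowledgeBase()
kb.add("has_nationality", "John Lennon", "England")
kb.add("has_acquired", "Google", "Youtube")
print(kb.query("has_acquired"))  # {('Google', 'Youtube')}
```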
Main difficulty of IE The main difficulty of IE is the fact that most text data is initially unstructured, i.e., its format is not indicative of its meaning. For instance, there are multiple different ways to convey the fact has_acquired(Google, Youtube). Sentence 2 conveys it, and so do the following sentences:
Youtube was purchased by Google. (3)
Google confirms YouTube acquisition. (4)
Google has acquired Youtube, an online video sharing service. (5)
Google’s Youtube takeover clears final hurdle. (6)
The versatility of natural, human-understandable language makes it difficult for a computing machine to interpret.
3. Methods in the state of the art
A relation extractor takes as input a raw sentence with two marked named entities. In sentence 2 for example, the pair of marked entities is Google and Youtube. The input sentence is then processed in two steps. First, a vector representation of the sentence is computed (cf. Section 3.1), and second, a semantic relation is assigned to the sentence based on this representation (cf. Section 3.2).
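The two-step process can be sketched as follows. Both functions are placeholders for the methods described in the next two subsections; the keyword rule in classify is purely illustrative, not a real classifier:

```python
# Sketch of the two-step RE pipeline: represent, then classify.
def represent(sentence, e1, e2):
    # Stand-in for a feature-based or dense representation (Section 3.1).
    return (sentence.lower().split(), e1, e2)

def classify(representation):
    # Stand-in for the relation assignment step (Section 3.2):
    # a single keyword rule, for illustration only.
    tokens, e1, e2 = representation
    return "has_acquired" if "bought" in tokens else "unknown"

def extract_relation(sentence, e1, e2):
    return classify(represent(sentence, e1, e2))

print(extract_relation("Google bought Youtube", "Google", "Youtube"))
# has_acquired
```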
3.1 Vector representation of a sentence
Feature-based representation This type of representation is binary, sparse, and computed using tools from computational linguistics (CL). These tools include, among others:
- Named Entity Recognizer: a tool that tags a token as an organization (ORG), a person (PER), a location (LOC), etc.
- Part-Of-Speech (POS) tagger: a tool that tags each word of a sentence with its part of speech (whether it is a verb, a noun, a proper noun, etc.)
- Dependency parser: a tool that represents a sentence as a tree whose nodes are the words of the sentence, and whose edges are the grammatical relations between these words (e.g., subject_of, preposition, etc.).
In the binary sparse representation, each bit indicates the presence or absence of a feature. These features are retrieved with the help of CL tools and are selected on a trial-and-error basis. For example, we can choose to represent our sentences with the following features:
- NER tags of the pair of entities
- POS sequence of the sentence
- sequence of words between the pair of entities
- the dependency path between the pair of entities
This binary representation is sparse and high-dimensional. It also requires heavy pre-processing, as it uses several CL tools, the dependency parser being the most computationally demanding.
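A minimal sketch of building such a binary vector for sentence (2), "Google bought Youtube in 2006 for US$1.65 billion", is shown below. The NER and dependency annotations are written by hand to stand in for the output of the CL tools, and the feature names and tiny feature index are illustrative (a real index would be built over the whole training corpus):

```python
# Build a binary sparse feature vector from hand-made CL annotations.
def extract_features(annotated):
    feats = set()
    feats.add("ner_pair=" + annotated["ner_e1"] + "_" + annotated["ner_e2"])
    feats.add("words_between=" + "_".join(annotated["words_between"]))
    feats.add("dep_path=" + annotated["dep_path"])
    return feats

def to_binary_vector(feats, feature_index):
    # One bit per known feature: 1 if present in the sentence, else 0.
    vec = [0] * len(feature_index)
    for f in feats:
        if f in feature_index:
            vec[feature_index[f]] = 1
    return vec

# Hand-made annotations standing in for NER / dependency-parser output.
annotated = {
    "ner_e1": "ORG", "ner_e2": "ORG",
    "words_between": ["bought"],
    "dep_path": "e1<-nsubj<-bought->dobj->e2",
}
feats = extract_features(annotated)

# Tiny hand-made feature index (normally built over the training corpus).
feature_index = {"ner_pair=ORG_ORG": 0,
                 "words_between=bought": 1,
                 "dep_path=e1<-nsubj<-bought->dobj->e2": 2,
                 "words_between=purchased": 3}
print(to_binary_vector(feats, feature_index))  # [1, 1, 1, 0]
```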
Dense representation Dense representations generally stem from neural networks (convolutional, recurrent, or recursive). Instead of relying on off-the-shelf tools trained and optimized separately from the RE task, as is the case for the feature-based representation, the dense representation is optimized (i.e., learned) during the training of the relation extractor. We refer the reader to [4, 5, 6, 7] for architectures that have been successfully used to perform relation extraction. Work is still ongoing to find better and easier-to-train architectures.
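As a minimal sketch of what a dense representation looks like, the snippet below averages per-word embedding vectors into one sentence vector. The tiny random embedding table is a stand-in: in a real extractor the table (and any convolutional or recurrent layers on top of it) would be trained jointly with the classifier.

```python
# Dense sentence representation sketch: mean of word embeddings.
import random

random.seed(0)          # deterministic toy embeddings
DIM = 4
vocab = ["google", "bought", "youtube"]
embeddings = {w: [random.uniform(-1, 1) for _ in range(DIM)] for w in vocab}

def sentence_vector(tokens):
    # Look up each known token and average the vectors component-wise.
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

v = sentence_vector(["google", "bought", "youtube"])
print(len(v))  # 4: a fixed-size dense vector, whatever the sentence length
```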
3.2 Paradigms for relation extraction
RE can be done under different paradigms depending on the data at hand: a raw corpus of text, a set of labeled and tagged sentences, a knowledge base, etc. The semantic relation assigned to an input sentence may either be a pre-defined relation type (e.g., has_nationality) or an integer id, depending on the paradigm.
Supervised RE (aka Relation Classification) In supervised relation extraction, the training dataset is composed of sentences tagged with a pair of named entities and labeled with the semantic relations they convey. The number of considered relations is finite, and the list R of relations is specified in advance by the user. The task consists of training a model that correctly maps an unseen sentence to one of the semantic relations in R: it is a classification problem whose labels are the semantic relations considered.
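To make the classification view concrete, here is a deliberately trivial scorer that assigns an unseen sentence the relation whose training sentences share the most words with it. The tiny dataset and the list R are made up; a real system would classify the vector representations of Section 3.1 with a trained model:

```python
# Toy illustration of supervised RE as classification over a fixed list R.
R = ["has_acquired", "has_nationality"]

train = [
    ("google bought youtube", "has_acquired"),
    ("youtube was purchased by google", "has_acquired"),
    ("john lennon was born in england", "has_nationality"),
]

def predict(sentence):
    # Score each relation by word overlap with its training sentences.
    words = set(sentence.split())
    scores = {r: 0 for r in R}
    for text, label in train:
        scores[label] += len(words & set(text.split()))
    return max(scores, key=scores.get)

print(predict("microsoft bought github"))  # has_acquired
```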
Distant supervision Distant supervision is an alternative to Relation Classification (RC) that does not require hand-labeled datasets. As in RC, it consists of mapping sentences to semantic relations belonging to a predefined list R of relation types. However, unlike in RC, labels are automatically assigned to the training sentences. Hence, the training datasets are larger than those used under the supervised setting, but noisier.
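The automatic labeling step can be sketched as follows: any sentence mentioning an entity pair that a knowledge base links with relation r is labeled r. The KB facts and the corpus are illustrative; the second sentence shows where the noise comes from, since it mentions the pair without conveying the relation:

```python
# Distant-supervision labeling sketch: project KB facts onto a raw corpus.
kb = {("Google", "Youtube"): "has_acquired",
      ("John Lennon", "England"): "has_nationality"}

corpus = [
    "Google bought Youtube in 2006",
    "Google and Youtube signed a partnership",   # noisy: gets a wrong label
    "John Lennon was born in England",
]

def distant_label(sentence):
    # Label the sentence with r if it mentions a KB pair linked by r.
    for (e1, e2), relation in kb.items():
        if e1 in sentence and e2 in sentence:
            return (sentence, e1, e2, relation)
    return None

labeled = [x for x in (distant_label(s) for s in corpus) if x]
print(len(labeled))  # 3: every sentence got a label, including the noisy one
```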
Bootstrapping Bootstrapping is a semi-supervised method that targets one relation type, r, at a time and outputs a large set of entity pairs related by r. It initially requires only a few entity pairs linked by r.
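One iteration of the idea can be sketched as follows, for r = has_acquired: start from a few seed pairs, induce textual patterns from the sentences containing them, then use the patterns to harvest new pairs. The corpus and seeds are toy data, and real systems also score and filter the patterns:

```python
# Bootstrapping sketch: seed pairs -> patterns -> new entity pairs.
import re

corpus = [
    "Google acquired Youtube",
    "Microsoft acquired Github",
    "Facebook acquired Instagram",
]
seeds = {("Google", "Youtube")}

def induce_patterns(pairs):
    # Turn each sentence containing a known pair into a template.
    patterns = set()
    for e1, e2 in pairs:
        for s in corpus:
            if e1 in s and e2 in s:
                patterns.add(s.replace(e1, "{e1}").replace(e2, "{e2}"))
    return patterns

def match_pairs(patterns):
    # Apply each template back to the corpus to harvest new pairs.
    pairs = set()
    for p in patterns:
        regex = (re.escape(p)
                 .replace(r"\{e1\}", r"(\w+)")
                 .replace(r"\{e2\}", r"(\w+)"))
        for s in corpus:
            m = re.fullmatch(regex, s)
            if m:
                pairs.add((m.group(1), m.group(2)))
    return pairs

patterns = induce_patterns(seeds)   # {'{e1} acquired {e2}'}
new_pairs = match_pairs(patterns)
print(sorted(new_pairs))
```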
Open IE An open information extractor processes raw corpora of text and outputs a diverse set of relational tuples, for example (Google, has acquired, Youtube). The main drawback of this paradigm is that it does not specify the relation type conveyed by each extracted triplet.
Unsupervised RE An unsupervised relation extractor clusters a set of relational triplets such that triplets conveying the same semantic relation are grouped in the same cluster.
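The clustering idea can be sketched by grouping open-IE style triplets by a normalized form of their relation phrase. The synonym table below is a hand-made stand-in for a learned similarity; real systems cluster the triplets in a vector space:

```python
# Unsupervised RE sketch: cluster triplets by normalized relation phrase.
from collections import defaultdict

# Hand-made normalization standing in for learned phrase similarity.
SYNONYMS = {"bought": "acquire", "acquired": "acquire",
            "purchased": "acquire", "born in": "birthplace"}

triplets = [
    ("Google", "bought", "Youtube"),
    ("Microsoft", "acquired", "Github"),
    ("John Lennon", "born in", "England"),
]

clusters = defaultdict(list)
for e1, rel, e2 in triplets:
    clusters[SYNONYMS.get(rel, rel)].append((e1, rel, e2))

print(sorted(clusters))  # ['acquire', 'birthplace']
```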
4. Practical uses of IE
Information extraction, and specifically relation extraction, enables the construction of a large knowledge base from scratch, or the population of existing knowledge bases (like Yago, DBPedia, and WikiData). It is crucial to have a comprehensive knowledge base at hand, as it is the backbone of many useful engines, such as Question-Answering (QA) systems. With a powerful QA system, one can further develop appealing tools such as:
- personalized bots
- bots specific to a certain domain (finance, risk assessment, …)
- quick access to information regarding competitors
- use of extracted information in decision making