
Rémi Juge
25 February 2020

Automatic Table-Of-Contents generation for efficient information access

Authors:

Najah-Imane Bentabet
Fortia Financial Solutions
Paris, France

Rémi Juge
Fortia Financial Solutions
Paris, France

Ismaïl EL MAAROUF
Fortia Financial Solutions
Paris, France

Dialekti Valsamou-Stanislawski
Fortia Financial Solutions
Paris, France

1 Introduction

As in many professional domains, Finance conveys most of its policy, regulation, and corporate information through electronic documents, first elaborated with office suites and then converted to PDF before publication. Documents obviously do not simply expose raw text: significant effort goes into organising their layout. Indeed, layout plays a key role in document understanding, by bringing objects into relation (e.g. referencing illustrations in the text), increasing readability (e.g. using spaced paragraphs), assisting navigation (e.g. cross-referencing between sections), and organising content (e.g. summarising with section titles). Document layout is also frequently codified, and mandatory templates are typically created to harmonise publications of the same type and to ensure compliance with regulations. For instance, the French financial authority (AMF) provides a template for financial prospectuses, organised in sections and listing all the information that such a document should contain. Such documents can be extremely long (hundreds of pages) and sometimes contain a Table of Contents (TOC) to facilitate navigation.
A TOC is, in principle, simply a restitution of the section headings, which are, in turn, used to signal the starts and ends of divisions in a document (or of parts, or chapters in a book). They are part of a document's metadata, rather than of the data itself. Because they provide the page number of each section start, TOCs are very useful for quickly navigating and accessing information in a document. They also help users understand the structure of a document at a glance. Given the growing number of documents available electronically, the automatic extraction of TOCs has been identified as a key feature of document reader applications and has attracted increasing interest.

2 Previous work

Document layout extraction has mostly been applied to images and to PDF documents.
The literature distinguishes two main kinds of document structure, the geometrical structure and the logical structure [11]:

  • the geometrical structure or physical structure indicates the organisation of various areas in the space of the page using presentational criteria
  • the logical structure signals the content structure, i.e. how the content is organised in the document

The Table of Contents (TOC) of a document clearly belongs to the logical structure. We should note that the notion of logical structure, which is sometimes coupled with semantic structure or semantic labelling, has received different definitions, which may lead to confusion [12, 13].
The purpose of the TOC is to provide immediate knowledge of a document's organisation and of the key parts that make up its content: it reveals the logical structure. This is achieved by presenting an (often nested) list of the titles of the document's parts, assuming each title encapsulates the main idea of the part it introduces. A TOC can be highly standardized and recommended in certain domains or publications. For instance, [14] describes the evolution of the famous IMRAD (Introduction, Methods, Results, And Discussion) structure used in medical journal articles, such as The Lancet. Obviously the words used in the title of each section may vary, and subdivisions of the sections are developed, but the main overall structure is always there.
While such template-based TOCs do not provide detailed information on the content of a document, they are convenient for the frequent reader. Specifying more details of the document structure has led to the development of document templates which codify both the content and the plan of the document. TOC generation has been explored from three main perspectives. It was initially approached from a layout-analysis perspective: the document image is split into a variety of spatial areas of a predefined type, i.e. text, figures, tables, or background [15]. Each area is then represented by predefined features [16, 17, 18] and assigned predefined semantic labels (titles, captions, author names, …). Semantic labels are applied using heuristic rules [16] or with classification techniques [19]. The resulting hierarchy grouping these areas is the TOC [12]. The scope of these approaches is strongly reduced, since they rely on the existence of a predefined set of semantic categories that must match a particular type of document and a TOC template, which is obviously not always available. Such methods have mostly been applied to scientific articles.
Recent algorithms have explored TOC extraction by parsing TOC pages and extracting the hierarchical structure of sections and subsections. Most methods in this area have been developed in the context of the INEX [20] and ICDAR [21, 22, 23] competitions which, as mentioned before, focus on long, digitized historical books, as opposed to the short scientific articles targeted by the previous methods. To the best of our knowledge, the only work on TOC page parsing conducted outside these competitions is [24, 25], who apply a rule-based approach to PDF document layout analysis.
Lastly, a number of methods have been proposed to detect titles using machine learning based on layout and text features. In such approaches, the list of titles is hierarchically ordered according to a predefined rule-based function [21, 26, 27].

3 Experimental datasets

In the following experiments, three datasets were considered: two small, domain-specific databases of investment documents built specifically for this work in (a) French and (b) English, and (c) a public dataset that has already been used to evaluate a TOC extractor on Arxiv scientific publications [28].

3.1 Investment documents datasets

For this work, we use investment documents known as prospectuses. Prospectuses are documents in which investment funds specify their investment policy and legal structure. These are domain-specific documents with a distinguishing vocabulary. Moreover, a non-negligible amount of information is assumed by the context and is thus missing from the textual content. As mentioned earlier, prospectuses are often the result of an export from an office suite; thus, the layout can greatly differ from one publisher to another. We witness strong differences in font style, size and color, and in the use of tables and graphical content.
The documents in these datasets are characterized by a deep TOC hierarchy (up to 6 levels). French documents never include an embedded TOC page, whereas English documents do; however, this parsable TOC page is usually highly incomplete, as only the higher-level titles are displayed. Due to the existence of a template for French prospectuses, we noticed key differences between French and English documents. In French documents, the titles usually contain the same words, but the phrasing can differ from one prospectus to another. A lot of similarities can also be found in their hierarchies (e.g. "Caractéristiques Générales" is found under "Prospectus" and followed by "Acteurs"). On the contrary, it is difficult to manually find such patterns in English documents, as no template exists and the organization of the document is unconstrained. However, for both languages, we observe that:

  • some documents contain specific headings that do not appear in any other document
  • the same title in two different documents can have a different position in the hierarchy
  • two titles that follow each other can have the same layout but a different position in the TOC
  • the font size of a higher-level heading can be smaller than the font size of a lower-level one
  • and a title can have the exact same layout as its associated paragraph.

Concerning dataset statistics, the average number of pages per document is 26 in the French collection and 87 in the English one. English documents accordingly contain many more titles: 230 on average, versus 118 for French. These characteristics are reflected in the ability of all the systems evaluated in this paper to correctly identify titles and organise them hierarchically into a TOC.

3.2 Arxiv dataset

In 2017, Rahman and Finin [28] released a dataset of scientific publications from Arxiv along with their labelled TOCs. It brings together the scientific publications submitted to Arxiv between January 2010 and December 2016. Contrary to the financial datasets, only a few headings are repeated across documents (e.g. "Introduction", "References", and "Conclusions"); the other titles are usually highly specific to each topic. The TOC trees of these documents are usually much shallower and shorter than in the previously described datasets. Finally, it is worth mentioning that this dataset was generated automatically by collecting metadata from Arxiv; in particular, PDF bookmarks were used to extract the ground-truth TOC.

4 TOC Pipeline and models

This section presents the TOC-generation pipeline, which uses neural networks first to classify text blocks as title or non-title, and then to order the titles hierarchically to produce the final TOC of the document.

4.1 Title detection step

Titles are separated from plain text by a binary classifier. We use a character-level convolutional neural network (char-CNN), which has proven effective in text classification tasks [30]. This choice is further motivated by the fact that headings, unlike plain text, generally begin with specific numbering schemes (e.g. "I." in "I. I am a title", and "II.1.b" in "II.1.b Now I am a sub-title"), which provide a significant hint for distinguishing titles from non-titles.
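As an illustration, the sketch below shows how such numbering prefixes could be matched. The exact patterns used in the system are not detailed in this article, so this regular expression is a hypothetical stand-in:

```python
import re

# Hypothetical patterns for common heading numbering schemes: Roman
# numerals ("I."), mixed forms ("II.1.b"), and decimal outlines ("2.3").
# This is an illustrative heuristic, not the feature actually used.
NUMBERING = re.compile(
    r"^\s*("
    r"[IVXLCDM]+\.(\d+\.?)*[a-z]?"   # "I.", "II.1.b"
    r"|\d+(\.\d+)*\.?"               # "1", "2.3", "4.1.2."
    r")\s+"
)

def starts_with_numbering(text_block: str) -> bool:
    """Return True if the block begins with a heading-like numbering scheme."""
    return NUMBERING.match(text_block) is not None

assert starts_with_numbering("I. I am a title")
assert starts_with_numbering("II.1.b Now I am a sub-title")
assert not starts_with_numbering("This is plain text.")
```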

Title encoder.

The text blocks are pre-processed so that punctuation is standardized. The characters are then encoded by an embedding layer into d-dimensional dense vectors, such that each text block is represented by a d×l matrix (l being the maximum number of characters in a text block).
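A minimal sketch of this encoding step follows; the character vocabulary, the embedding size d, and the maximum length l are illustrative choices, not the values used in the experiments:

```python
import torch
import torch.nn as nn

# Assumed character inventory; the real pre-processing standardizes
# punctuation, which is omitted here for brevity.
VOCAB = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz0123456789 .,;:-()")}
PAD, L_MAX, D = 0, 128, 16  # padding index, max length l, embedding size d

def encode_chars(text_block: str) -> torch.Tensor:
    """Map characters to integer indices, truncating/padding to length l."""
    ids = [VOCAB.get(c, PAD) for c in text_block.lower()[:L_MAX]]
    ids += [PAD] * (L_MAX - len(ids))
    return torch.tensor(ids)

char_embedding = nn.Embedding(len(VOCAB) + 1, D, padding_idx=PAD)
block = char_embedding(encode_chars("II.1.b Now I am a sub-title"))
# `block` has shape (l, d): one d-dimensional vector per character.
```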

Binary classifier.

The character-level CNN applies, in parallel, 1D convolutions each followed by a 1D max-pooling layer. The outputs of these parallel branches are concatenated and flattened before a fully-connected layer with ReLU activation is applied. The resulting vector is concatenated with a fixed-length vector of hand-crafted features encoding the layout of the input text block. Lastly, a second fully-connected layer with softmax activation classifies the input as heading or non-heading. Dropout [31] is added for regularization.
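The following PyTorch module sketches this architecture. Kernel sizes, channel counts, the hidden width, and the dropout rate are assumptions; only the overall structure (parallel convolutions, max-pooling, concatenation with the 28 hand-crafted features, softmax output) follows the description above:

```python
import torch
import torch.nn as nn

class CharCNNTitleClassifier(nn.Module):
    """Sketch of the char-CNN title detector described in the text."""
    def __init__(self, vocab_size, d=16, kernel_sizes=(2, 3, 4), channels=64,
                 n_handcrafted=28, hidden=128, p_drop=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d, padding_idx=0)
        # Parallel 1D convolutions over the character axis.
        self.convs = nn.ModuleList(nn.Conv1d(d, channels, k) for k in kernel_sizes)
        self.fc1 = nn.Linear(channels * len(kernel_sizes), hidden)
        self.dropout = nn.Dropout(p_drop)
        # Final layer sees [text representation ; hand-crafted layout features].
        self.fc2 = nn.Linear(hidden + n_handcrafted, 2)

    def forward(self, char_ids, handcrafted):
        x = self.embed(char_ids).transpose(1, 2)            # (batch, d, l)
        # Each branch: 1D convolution followed by global max-pooling.
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        h = torch.relu(self.fc1(torch.cat(pooled, dim=1)))
        h = self.dropout(torch.cat([h, handcrafted], dim=1))
        return self.fc2(h)   # heading / non-heading logits
```

In idiomatic PyTorch, the softmax is folded into the loss (e.g. `nn.CrossEntropyLoss` applied to the logits), which is equivalent to the softmax output layer described above.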

4.2 Title hierarchization step

The second unit of the pipeline takes the previously detected titles and hierarchically orders them (i.e. predicts a level in the TOC tree for each of them) to create the final TOC.

Title encoder.

Each heading is encoded using an architecture comparable to the one described in Section 4.1, the difference being that it works at the word level (word-level CNN) [32]; d then stands for the dimension of the word embeddings. This encoder induces a low-dimensional, dense vector representation of each detected title, to which a fixed-length vector of manually designed features is attached. The final vectors are stacked to create a matrix M representing the input document, where each row corresponds to the vector representing a title. As a result, two consecutive headings in the document appear as consecutive rows of M.
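A short sketch of how the document matrix M could be assembled, assuming `word_cnn` is the word-level title encoder (its definition mirrors the char-CNN above and is omitted):

```python
import torch

def build_document_matrix(titles, word_cnn, handcrafted_feats):
    """Stack one row per detected title, in document order, to form M.
    Each row is the encoder output concatenated with the title's
    hand-crafted feature vector."""
    rows = [torch.cat([word_cnn(t), f]) for t, f in zip(titles, handcrafted_feats)]
    return torch.stack(rows)   # M: (num_titles, title_dim + n_features)
```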

Sequence labeling approach.

Next, a BiLSTM [33] layer followed by a CRF [34] layer is applied row-wise to M so that the titles are labelled sequentially. The first layer takes advantage of both past and future input features in the document, while the second exploits the levels of the other headings in the document to predict the final TOC levels. To the best of our knowledge, no previous work has predicted title levels using a sequence labeling approach. We stress that, for this model, a single data point is not a text block but a complete document, represented by the series of its text blocks.
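A minimal sketch of this sequence labeller, using the third-party pytorch-crf package as one possible CRF implementation (the hidden size and the choice of CRF library are assumptions):

```python
import torch.nn as nn
from torchcrf import CRF  # third-party "pytorch-crf" package

class TitleHierarchizer(nn.Module):
    """BiLSTM-CRF over the rows of M: one document = one sequence,
    one TOC level predicted per title."""
    def __init__(self, input_dim, num_levels, hidden=128):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden, bidirectional=True,
                              batch_first=True)
        self.proj = nn.Linear(2 * hidden, num_levels)
        self.crf = CRF(num_levels, batch_first=True)

    def loss(self, M, levels):
        # M: (batch, num_titles, input_dim); levels: (batch, num_titles)
        emissions = self.proj(self.bilstm(M)[0])
        return -self.crf(emissions, levels)   # negative log-likelihood

    def predict(self, M):
        emissions = self.proj(self.bilstm(M)[0])
        return self.crf.decode(emissions)     # best level sequence per document
```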

4.3 Reorganization into a TOC structure

The sequence of titles and their depths does not necessarily form a tree, as the model is not trained with such constraints. Therefore, the last step of the TOC pipeline is a post-processing step that maps the sequence of headings onto a TOC that satisfies the constraints of a tree graph (e.g. a title with a level of 1 cannot be followed by a title of level 3).
This mapping is a rule-based function. The first detected heading is always a top-level heading, regardless of the BiLSTM-CRF prediction. Then, for each subsequent detected title s, we look for the closest previous title s' whose level is strictly lower than that of s, and we set s' to be the parent of s.
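A direct implementation of this rule is sketched below; the dictionary-based node representation is an assumption, only the attachment rule comes from the text:

```python
def build_toc_tree(titles, levels):
    """Turn (title, predicted level) pairs into a valid TOC tree."""
    root = {"title": None, "level": 0, "children": []}
    nodes = []
    for i, (title, level) in enumerate(zip(titles, levels)):
        if i == 0:
            level = 1  # the first heading is forced to the top level
        node = {"title": title, "level": level, "children": []}
        # Attach to the closest previous title with a strictly lower level,
        # falling back to the root when none exists.
        parent = root
        for prev in reversed(nodes):
            if prev["level"] < level:
                parent = prev
                break
        parent["children"].append(node)
        nodes.append(node)
    return root

toc = build_toc_tree(["Prospectus", "Caractéristiques Générales", "Acteurs"],
                     [1, 2, 2])
```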

5 Model implementation details

The textual content of the document is segmented into text blocks (elements that are visually similar to paragraphs) so that titles spreading over multiple lines can be properly classified.
Before being fed to the first classifier, each text block is transformed into a feature vector with two sets of constituents: the first set of values represents the semantic information extracted from the text with an encoder; the second set describes the text block through 28 hand-designed features, which can be divided into three categories:

  • layout features: they encode the layout's visual information (e.g. is bold, is italic, …)
  • semantic features: they enrich the semantic representation of the titles with sparse vectors indicating the presence of task-relevant words in the current, previous, and subsequent text blocks.
  • text-related features: they represent information which is neither visual nor semantic (e.g. text length).

These features can be unary, meaning they describe intrinsic characteristics of a text block, or relative, meaning they compare characteristics of a text block with those of its neighbours.
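An illustrative subset of such features is sketched below; the full list of 28 features is not reproduced in this article, so the names and choices here are assumptions:

```python
def handcrafted_features(block, prev_block, next_block):
    """Toy subset of the hand-designed features; in practice the values
    are packed into a fixed-length vector."""
    return {
        # layout features (unary)
        "is_bold": float(block["bold"]),
        "is_italic": float(block["italic"]),
        # text-related features (unary)
        "text_length": float(len(block["text"])),
        # relative features: compare the block with its neighbours
        "larger_font_than_prev": float(block["font_size"] > prev_block["font_size"]),
        "same_font_as_next": float(block["font_size"] == next_block["font_size"]),
    }
```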

6 Results

6.1 Baseline models

We compare the proposed title detector with:

  • a Gradient Boosted Tree (GBT) binary classifier trained on text characters
  • a GBT binary classifier trained on the hand-crafted features (hfs)
  • a GBT binary classifier trained on both text characters and hand-crafted features (combined)
  • Rahman and Finin’s title detector [28], a recurrent neural network applied to characters.

We use the code provided by the authors on GitHub. We also compare our title hierarchizer against GBT models and against Rahman and Finin’s model.
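As a sketch of such a baseline, scikit-learn's GradientBoostingClassifier can be trained on the 28 hand-crafted features; the actual GBT implementation and hyper-parameters used in the experiments are not specified in this article, and the data below is synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 28))             # stand-in for (n_text_blocks, 28 features)
y = rng.integers(0, 2, size=1000)      # 1 = title, 0 = non-title

gbt = GradientBoostingClassifier().fit(X[:800], y[:800])
print(gbt.score(X[800:], y[800:]))     # accuracy on held-out blocks
```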

6.2 Evaluation measures

Standalone evaluation

In order to evaluate each stage independently, we use the weighted F1 score [43], which measures the rate of correct classification: binary classification for title detection, and multi-class classification for title hierarchization.
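For reference, the weighted F1 score is the per-class F1 averaged with weights proportional to each class's support, as computed for instance by scikit-learn:

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 1, 0, 1]   # toy labels (title / non-title, or TOC levels)
y_pred = [0, 1, 0, 0, 1]
print(f1_score(y_true, y_pred, average="weighted"))
```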
Pipeline evaluation

In order to score the tree that represents the TOC, we use the Xerox measure [29, 21], which attempts to assess the quality of the whole pipeline, from the document up to the generated TOC. The TOC is considered as a sequence of TOC entries, each TOC entry being composed of a title and the number of the page where it appears.

6.3 Standalone evaluation results

We randomly split all datasets into train and test sets, as specified in table 6.2. The Arxiv dataset is obviously very different, as it is larger and automatically generated. By contrast, the financial datasets are much smaller, and their titles and TOCs have been manually annotated.

Table 6.2: train/test splits of the datasets

6.3.1 Title detection

As depicted in table 6.3, titles are reasonably straightforward to detect, especially for models which have access to both style and text information: Rahman and Finin’s model, as well as the GBT baselines trained on isolated feature groups, perform worse on all datasets. This can be explained by the fact that style clues (mainly boldness and font size) can help when title numbering is non-existent. The difference in performance between models is less stark on the domain-specific data, where results are generally better for all models and the scores are much higher than on the Arxiv dataset; this may be due to a more homogeneous divide between the training set and the test set. The fact that the Arxiv dataset was automatically generated, whereas the financial datasets were manually prepared, is also relevant. The model proposed in this paper outperforms all baselines on the Arxiv dataset by a large margin. Overall, the neural model is better, although it is slightly less effective on the financial datasets than the GBT baseline with the combined feature set. However, because they are generally faster at inference than neural models, GBT models remain a particularly interesting candidate.

Table 6.3: title detection results

6.3.2 Title hierarchization

For the task of title hierarchization, table 6.4 shows that the proposed algorithm performs better than all baselines on all datasets. The gap between our model and the baselines is larger than for title detection.

We also observe that, unlike the scores obtained for title detection, the F-scores on the English financial dataset are low compared to those obtained on the other two datasets. This is due to the fact that English investment documents are long, with a high average number of titles, whereas Arxiv documents are known to have a much smaller number of titles.
Another reason is that no standard TOC template exists for English investment documents, whereas (i) French investment documents must strictly follow the AMF template, and (ii) Arxiv documents tend to use predictable titles (keywords such as Abstract, Introduction, Results, Bibliography, etc.), even though they do not necessarily follow a rigid template.

Table 6.4: title hierarchization results

7 Conclusions

This article proposes a new, flexible neural approach to generate the TOC of a document without relying on TOC pages or on predefined document templates. We proposed a two-step approach to TOC generation that first detects the titles and then hierarchically orders them following a sequence labeling approach. The main advantage of this sequence labeling method is that the level given to a title is influenced by all the input feature vectors and by the levels of the other titles in the document. The proposed method was tested on two new datasets of financial documents that are particularly challenging due to their layout complexity and TOC structure diversity. The presented algorithm largely outperforms state-of-the-art methods on the public Arxiv dataset on the task of TOC generation.
As mentioned earlier, being a tree, the TOC structure is delicate to evaluate, and future work should focus on this aspect. As revealed by the title hierarchization evaluation, building up the TOC tree is challenging, particularly when documents do not strictly follow a layout template, which makes it more difficult to train models that generalize. The datasets we provide will make it possible to explore, and hopefully solve satisfactorily, this key problem.

