Data Science
Sira Ferradans
20 August 2019

Sentence Boundary Detection in PDF Noisy Text in the Financial Domain (FinSBD)

Fortia organized the shared task “Sentence Boundary Detection in PDF Noisy Text in the Financial Domain” (FinSBD) as part of FinNLP, a workshop on NLP for the financial domain organized in collaboration with National Taiwan University at IJCAI (International Joint Conference on Artificial Intelligence) in Macao, China. Fortia received 7 papers, 10 teams participated, and 60 people subscribed.


Sentences are the basic units of written language, and detecting where they begin and end, known as sentence boundary detection (SBD), is a foundational first step in many Natural Language Processing (NLP) applications, such as POS tagging; syntactic, semantic, and discourse parsing; information extraction; and machine translation.
Despite its important role in NLP, sentence boundary detection has so far received little attention. Previous research has been confined to formal texts (news, European Parliament proceedings, etc.), where existing rule-based and machine learning approaches are extremely accurate as long as the data is perfectly clean. No sentence boundary detection research to date has addressed noisy text extracted automatically from machine-readable files (generally PDF), such as financial documents.
In this shared task, we focus on extracting well-segmented sentences from financial prospectuses by detecting their beginning and ending boundaries. Prospectuses are official PDF documents in which investment funds precisely describe their characteristics and investment modalities. The most important step in extracting any information from these files is to parse them into noisy unstructured text, clean it, format the information (by adding several tags), and finally transform it into semi-structured text in which sentence boundaries are well marked.
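To make the clean-then-segment steps above more concrete, here is a minimal Python sketch. It is not the method used by the organizers or by any participating team: the function names, the tiny abbreviation list, and the choice to report each sentence as a pair of (begin, end) token indices are all assumptions made for illustration. It only shows why text extracted from PDF (hard line wraps, words hyphenated across lines, abbreviations) needs cleaning before even a simple rule can find boundaries.

```python
import re
from typing import List, Tuple

# Hypothetical abbreviation list; a real system would need a far larger one.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "no.", "art."}


def clean_pdf_text(raw: str) -> str:
    """Undo two common PDF extraction artifacts: hyphenated line breaks and hard wraps."""
    # Re-join words split across lines ("equi-\nties" -> "equities").
    # Caveat: this also merges genuine compounds such as "long-\nterm".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Turn the remaining line breaks into single spaces.
    text = re.sub(r"\s*\n\s*", " ", text)
    return re.sub(r" {2,}", " ", text).strip()


def sentence_boundaries(text: str) -> List[Tuple[int, int]]:
    """Return (begin, end) token indices for each detected sentence, end inclusive."""
    tokens = text.split()
    spans = []
    begin = 0
    for i, tok in enumerate(tokens):
        # A token may close a sentence if it ends in terminal punctuation
        # and is not a known abbreviation.
        ends_sentence = tok.endswith((".", "!", "?")) and tok.lower() not in ABBREVIATIONS
        # Require the next token to start with an uppercase letter (or the text to end).
        next_is_upper = i + 1 == len(tokens) or tokens[i + 1][0].isupper()
        if ends_sentence and next_is_upper:
            spans.append((begin, i))
            begin = i + 1
    # Keep a trailing fragment that has no closing punctuation.
    if begin < len(tokens):
        spans.append((begin, len(tokens) - 1))
    return spans


raw = "The Fund invests in equi-\nties, e.g. listed\nshares. Fees are 1.5% per year.\nSee the prospectus"
cleaned = clean_pdf_text(raw)
print(cleaned)
print(sentence_boundaries(cleaned))
```

Rules like these break quickly on real prospectuses (tables, headers, bullet lists, sentences that do not start with a capital letter), which is the gap the shared task targets.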

Participant results:

Participant results for FinSBD, French version

Participant results for FinSBD

Next Important Date

August 10–12, 2019: FinNLP 2019 Workshop in Macao

Sentence Boundary Detection in PDF Noisy Text in the Financial Domain in Macao, China


Fortia introduced a new data set for the SBD problem on text automatically extracted from PDF files, in French and English. This scenario is very realistic in everyday applications, which may explain the diversity of the institutions that participated, from public universities to for-profit organizations in the financial domain. In this sense, the shared task was a success, since it brought together researchers from different sectors.

Shared Task Co-organizers — Fortia Financial Solutions

· Sira Ferradans

· Abderrahim Ait-Azzi

· Guillaume Hubert

· Houda Bouamor

See the link below for more details:

FinNLP 2019 & FinSBD Shared Task 
