European Conference on Information Retrieval 2018 – Grenoble
ECIR is a conference focused on Information Retrieval (IR) methods and algorithms and the way they are implemented, monitored, and evaluated; this year it was held in Grenoble. Since most IR methods operate on natural language, there is also a strong focus on natural language processing and understanding.
This year’s program covered subjects such as retrieval, micro-blog (Twitter) analysis, recommendation, deep learning, and user behaviour. Through the different activities, attendees had the opportunity to learn about what is new in the field of Information Retrieval.
I selected two of the most interesting papers for this post:
Bringing Back Structure to Free Text Email Conversations with Recurrent Neural Networks
This paper aims at adding structure to email communication using supervised learning.
Email is one of the most important communication tools in business, so analysing it can reveal interesting insights about the inner workings of a company, which in turn can be used to improve communication processes or detect fraud.
In its raw form, an email is an unstructured blob of HTML and text in which the signature is mixed with the body, previous emails of the conversation, and whatever markup the email client adds.
The main difficulty is that each email client adds its own formatting and layout, so an email segmentation model needs to be generic enough to work for all of them. In this setting, rule-based systems fail to generalize: if they are built around a predefined set of email clients, their performance degrades significantly in the general case.
The authors of the paper “Bringing Back Structure to Free Text Email Conversations with Recurrent Neural Networks” propose a new method for email zoning using supervised learning and a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) applied to the email text.
The CNN is applied to each line of the input email: each line is one-hot encoded at the character level, and the resulting character sequence is fed to 1D convolution layers. This CNN is pre-trained to predict labels on a line-by-line basis; its weights are then plugged into the full CNN+RNN model, which is trained to predict the sequence of labels for the whole email. In this combined model, the CNN encodes each line into a fixed-length vector, which is fed into the RNN to capture the sequential structure of the data. The final output is produced by a CRF that finds the most likely sequence of labels.
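To make the first step concrete, here is a minimal sketch (not the authors’ Quagga code) of the character-level one-hot encoding that each email line receives before being passed to the 1D convolutions. The vocabulary and the fixed line length are illustrative assumptions:

```python
# Character-level one-hot encoding of a single email line.
# VOCAB and MAX_LEN are illustrative choices, not the paper's.
VOCAB = "abcdefghijklmnopqrstuvwxyz0123456789 .,:;@-"
CHAR_TO_IDX = {c: i for i, c in enumerate(VOCAB)}
MAX_LEN = 80  # assumed fixed line length; longer lines are truncated

def one_hot_line(line: str) -> list:
    """Encode one line as a MAX_LEN x len(VOCAB) one-hot matrix.
    Unknown characters map to an all-zero row; short lines are zero-padded."""
    matrix = [[0] * len(VOCAB) for _ in range(MAX_LEN)]
    for pos, char in enumerate(line.lower()[:MAX_LEN]):
        idx = CHAR_TO_IDX.get(char)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix
```

Each email then becomes a sequence of such matrices, one per line, which is exactly the shape a per-line 1D convolution expects.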
The model of the paper (Quagga) achieves 98% accuracy in separating the body of the email from its other parts (headers, disclaimers …), a very high score that suggests the system can be relied on when analysing large collections of emails automatically.
Web2Text: Deep Structured Boilerplate Removal
Cleaning data before applying an information retrieval model is crucial in order to avoid having noise interfere with the quality of the output. This cleaning process is the motivation behind the paper “Web2Text: Deep Structured Boilerplate Removal”.
The objective of the paper is to extract the main content of a webpage and discard all unnecessary information such as the layout HTML, sidebars, headers, footers …
The approach is based on a clever preprocessing step called DOM collapsing, followed by a convolutional neural network.
DOM collapsing flattens the HTML tree of the webpage; each resulting block is then encoded using 128 hand-crafted features such as “average word length” or “the parent node’s text contains an email address” …
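The idea can be sketched roughly as follows: flatten an HTML page into a sequence of text blocks and attach a few hand-crafted features per block. The traversal via the standard-library `HTMLParser` and the three example features are my assumptions, not the Web2Text implementation (which uses 128 features):

```python
from html.parser import HTMLParser
import re

class BlockCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML page, in document order."""
    def __init__(self):
        super().__init__()
        self.blocks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks.append(text)

def block_features(text: str) -> dict:
    """Three illustrative hand-crafted features for one text block."""
    words = text.split()
    return {
        "num_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "contains_email": bool(re.search(r"\S+@\S+\.\S+", text)),
    }

def collapse(html: str) -> list:
    """Flatten a page into a sequence of featurized text blocks."""
    parser = BlockCollector()
    parser.feed(html)
    return [dict(text=block, **block_features(block)) for block in parser.blocks]
```

The output, a sequence of feature vectors, is what the CNNs described next consume.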
One CNN is applied to the sequence of blocks and another to the sequence of edges between blocks. Together they produce unary and pairwise potentials, which are fed to a CRF to find the most likely sequence of labels.
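The final decoding step can be illustrated with a toy Viterbi search: given per-block unary scores and pairwise transition scores (here hand-written numbers, standing in for the two CNNs’ outputs), it returns the highest-scoring label sequence. This is a generic CRF decoding sketch, not Web2Text’s code:

```python
def viterbi(unary, transition):
    """Find the highest-scoring label sequence.
    unary: list of {label: score} dicts, one per block.
    transition: {(prev_label, label): score} pairwise scores."""
    labels = list(unary[0])
    # best[label] = (score, path) for the best sequence ending in `label`
    best = {lab: (unary[0][lab], [lab]) for lab in labels}
    for scores in unary[1:]:
        new_best = {}
        for lab in labels:
            # pick the best predecessor for this label
            prev_lab, (prev_score, path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + transition[(kv[0], lab)],
            )
            new_best[lab] = (
                prev_score + transition[(prev_lab, lab)] + scores[lab],
                path + [lab],
            )
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]
```

With transitions that reward keeping the same label, the CRF smooths over a block whose unary scores weakly favour the wrong class, which is exactly why the pairwise potentials help here.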
Figure: original page; after automatic HTML stripping; after manual cleaning/markup.
The result is a cleaned version of the webpage in which only the main content is kept. This clean text can then be used for page clustering or to build a text-based information retrieval model that is robust to the boilerplate elements of a webpage, which in turn improves clustering quality and retrieval accuracy.
Overall, the European Conference on Information Retrieval 2018 was an opportunity to learn about what is new in the information retrieval field, with many examples of new methods applied to real-world applications like data cleaning and email segmentation.