Natural language processing (NLP) is the science (and art!) of making machines understand speech and written text. What may look simple in principle has kept thousands of researchers busy for decades, as it combines very different areas of human knowledge: artificial intelligence, linguistics, and applied mathematics. As in every other area of artificial intelligence, the emergence of deep learning has completely changed the field. At Fortia, NLP methods are at the core of most of our applications, which is why we attended the Conference on Empirical Methods in Natural Language Processing (EMNLP) in Copenhagen this September. As its name suggests, the focus of EMNLP differs somewhat from that of other NLP conferences: there is more emphasis on applications of NLP and 'real-world' problems.
Some fun applications
• Advances in unsupervised learning from speech. Currently, the best translation systems, such as Google's, cover at most 103 languages, while around 7,000 are spoken in the world. The only way to cover all of these languages is to develop methods with minimal supervision needs, that is, unsupervised learning methods. Invited speaker Sharon Goldwater directly addressed one of the most challenging problems in speech: 'unsupervised term discovery'. This is a problem that every child solves during their first years of life, but it is far from solved for computers. The idea is that, without any supervision, a system should be able to work out from raw audio how to segment never-before-seen words.
• Politeness evaluation of US police. D. Jurafsky decided to use NLP to shed light on an issue that has deeply moved the US population in recent years: police racial bias toward people being pulled over. Every police officer in the US is required to carry a camera with a microphone. In collaboration with the police department, Jurafsky's team transcribed the conversations between each officer and the person being pulled over. They proposed a precise model for measuring politeness in these transcriptions, and the results confirmed the bias denounced by the population (Figure 1).
Figure 1: Conclusions of the politeness measurement algorithm on police conversation transcripts
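To make the 'unsupervised term discovery' task above concrete, here is a toy sketch. It works on text rather than raw audio, and the recurring-substring heuristic is our own simplification, not Goldwater's method: it treats utterances as unsegmented character streams and proposes substrings that recur across utterances as candidate 'terms'.

```python
from collections import Counter

def discover_terms(utterances, min_len=3, min_count=2):
    """Propose recurring substrings as candidate terms, with no supervision."""
    counts = Counter()
    for u in utterances:
        s = u.replace(" ", "")  # pretend we never saw word boundaries
        for n in range(min_len, len(s) + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    # keep substrings that recur, preferring longer ones over their fragments
    candidates = [g for g, c in counts.items() if c >= min_count]
    candidates.sort(key=len, reverse=True)
    terms = []
    for g in candidates:
        if not any(g in t for t in terms):
            terms.append(g)
    return terms

utts = ["the cat sat", "a cat ran", "the dog sat"]
print(discover_terms(utts))  # recovers 'the', 'cat', 'sat' without ever seeing spaces
```

Real systems face a far harder version of this: the 'characters' are noisy acoustic frames, and the same word never sounds exactly the same twice.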
The cost of tagging for supervised learning applications
A problem that is raising much concern in the community is the cost associated with supervised learning. Machine learning methods can be divided into supervised learning, unsupervised learning, and reinforcement learning. The main difference between them is how much information is available to the algorithm at training time (the offline period during which we teach the algorithm its task).
Supervised methods are those for which we know the correct answer for each element of the task (for each element of the training set). For instance, if our task is to identify the nationality of every person mentioned in a text, we would know pairs such as (François Hollande, France) and (Barack Obama, US) beforehand. It is well known that supervised methods are reaching accuracy levels in some applications that allow deployment in real-life scenarios, hence their industrial success. Since supervised methods work significantly better than unsupervised ones, the issue of tagging data (the process of assigning a tag to each element in the database) has become central. In fact, more and more companies (Amazon Mechanical Turk, CrowdFlower, etc.) offer crowdsourced tagging services: they find and pay people to carry out simple tasks that generate tags. This issue has not gone unnoticed in the research community, and especially in the EMNLP community, where many solutions were proposed to minimize the tagging effort without losing performance.
Active learning vs. heuristics for automatic tagging. An easy way to tag a given database is to use heuristics. For instance, given a gazetteer of the possible entities that can appear in the text, tag by checking, entity by entity, whether they appear verbatim in the text. Of course, this approach does not scale across different types of corpora, exhibits degenerate behaviors, and always needs seed data to start. Moreover, it generally requires a subsequent step of human verification of the tags. Instead, Fang et al. propose to use active learning: they interpret the tagging iterations as a decision process, and decision making can thus be cast as a reinforcement learning problem.
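Fang et al. learn the querying policy itself with deep reinforcement learning; the sketch below shows only the classic active-learning loop that such work builds on, with plain uncertainty sampling and an invented one-dimensional toy task (a threshold 'model' on numbers) standing in for a real tagger.

```python
def train(labeled):
    # toy "model": a threshold halfway between the largest known "small"
    # example and the smallest known "big" one
    smalls = [x for x, y in labeled if y == "small"]
    bigs = [x for x, y in labeled if y == "big"]
    return (max(smalls) + min(bigs)) / 2

def uncertainty(x, threshold):
    # distance to the decision boundary; smaller = more uncertain
    return abs(x - threshold)

def active_learn(pool, oracle, seed_labeled, budget):
    labeled = list(seed_labeled)
    pool = [x for x in pool if x not in {p for p, _ in labeled}]
    for _ in range(budget):
        t = train(labeled)
        x = min(pool, key=lambda x: uncertainty(x, t))  # query most uncertain point
        labeled.append((x, oracle(x)))                  # ask the human annotator
        pool.remove(x)
    return train(labeled)

oracle = lambda x: "big" if x >= 50 else "small"        # the "human" we pay per query
pool = list(range(0, 100, 5))
threshold = active_learn(pool, oracle, [(0, "small"), (95, "big")], budget=5)
```

With only 5 queries the learned threshold lands between 45 and 50, bracketing the true boundary; labeling the whole pool would have cost 20 annotations for the same result, which is exactly the saving active learning is after.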
Correctly sampling. In machine learning, the training set is meant to be representative of the future data the algorithm will see at test time. Of course, this assumption is violated very frequently, since creating an open-domain data set is very difficult. Chaganty et al. point out the importance of correct sampling for unbiased knowledge base population.
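The correction Chaganty et al. rely on is importance sampling, which can be sketched in a few lines. The distributions and the quantity being estimated below are invented for illustration: we want an expectation under a target distribution but only have samples drawn from a biased proposal.

```python
import random

def importance_estimate(samples, proposal, target, f):
    # reweight each sample by target(x)/proposal(x) to undo the sampling bias
    return sum(f(x) * target[x] / proposal[x] for x in samples) / len(samples)

random.seed(0)
items = [0, 1, 2, 3]
target = [0.25, 0.25, 0.25, 0.25]    # distribution we actually care about
proposal = [0.4, 0.3, 0.2, 0.1]      # biased distribution we sampled from
samples = random.choices(items, weights=proposal, k=20000)

naive = sum(samples) / len(samples)  # biased: ignores how the data was drawn
corrected = importance_estimate(samples, proposal, target, lambda x: x)
# the true mean under the target distribution is 1.5; the naive average
# sits near 1.0, while the reweighted estimate lands near 1.5
```

The same idea lets an evaluation spend its annotation budget on a convenient (cheap) sample while still reporting an unbiased number for the population of interest.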
Sentence simplification. Zhang and Lapata propose using simplification as a preprocessing step to ease downstream processes such as parsers, semantic role labelers, and summarizers. They found a very useful data set for this purpose: Simple English Wikipedia, a simplified version of Wikipedia for kids. They propose a sequence-to-sequence reinforcement learning algorithm trained on text aligned between Wikipedia and Simple English Wikipedia articles.
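Zhang and Lapata's simplification model is a full encoder-decoder, well beyond a blog snippet, but its reinforcement learning component rests on the REINFORCE policy-gradient update. A toy version of that update, on a three-armed bandit instead of sentence generation (the rewards below are made up), looks like this:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_bandit(rewards, steps=2000, lr=0.1, seed=0):
    # REINFORCE on a toy bandit: each "action" stands in for emitting one
    # candidate output, and its reward for the quality score of that output
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choices(range(len(rewards)), weights=probs)[0]  # sample an action
        r = rewards[a]                                          # observe its reward
        # gradient of log pi(a) w.r.t. logit i is (indicator(i == a) - probs[i])
        for i in range(len(logits)):
            logits[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])
    return softmax(logits)

probs = reinforce_bandit([0.1, 1.0, 0.2])  # arm 1 pays the highest reward
# after training, the policy concentrates its probability mass on arm 1
```

In the actual paper the reward is not a fixed number per arm but a score combining simplicity, fluency, and meaning preservation of a generated sentence; the update rule, however, has the same shape.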
References
M. Fang, Y. Li and T. Cohn. Learning how to Active Learn: A Deep Reinforcement Learning Approach. EMNLP 2017.
S. Goldwater. Towards more universal language technology: unsupervised learning from speech. Invited talk, EMNLP 2017.
X. Zhang and M. Lapata. Sentence Simplification with Deep Reinforcement Learning. EMNLP 2017.
A. Chaganty, A. Paranjape, P. Liang and C. D. Manning. Importance sampling for unbiased on-demand evaluation of knowledge base population. EMNLP 2017.
D. Jurafsky. "Does this vehicle belong to you?" Processing the language of policing for improving police-community relations. Invited talk, EMNLP 2017.
M. Johnson, M. Schuster, Q. V. Le et al. Google's multilingual neural machine translation system: enabling zero-shot translation. arXiv preprint arXiv:1611.04558, 2016.
https://simple.wikipedia.org/wiki/Main_Page
https://en.wikipedia.org/wiki/Language, 23 October 2017.