Natural language processing in focus at the Collège de France
Changed on 12/07/2024
Generative artificial intelligence, machine translation and chatbots are all examples of technology that employs natural language processing (NLP), something which Benoît Sagot, head of the Inria project team ALMAnaCH, has spent his whole career researching. Here’s a date for your diary: as of 30 November he will be introducing NLP to the general public as part of the chair at the Collège de France that he was recently appointed to.
Artificial intelligence: from fantasy to reality
Benoît Sagot, who was recently appointed to the prestigious “IT and Digital Science” chair at the Collège de France for the 2023-2024 academic year, has his sights set on helping people to understand artificial intelligence (AI) systems, which are used among other things to automatically generate text from written instructions. Sagot is head of ALMAnaCH, a project team at the Inria Paris Centre which specialises in natural language processing and digital humanities. He is also the holder of a chair at PRAIRIE, an interdisciplinary institute for research in artificial intelligence.
Benoît Sagot feels that the need to explain this technology to people is now more pressing than ever, at a time when there is “so much noise, much of it toxic” around generative AI agents, as well as “a lot of scaremongering”. As for the public release of ChatGPT, a solution developed by the US organisation OpenAI, this “doesn't really constitute a scientific or technological revolution, but it has given ordinary people an opportunity to play with it, and a glimpse of the major changes it could bring about in various different fields. ChatGPT shows the significant progress made in NLP over the past 20 years or so, including for mass-market applications such as spell checking and machine translation.”
The more languages, the harder it gets for researchers
Research in NLP continues to move forward, as illustrated by taking a look back at Benoît Sagot's career to date: “When I started out back in 2002 with the now defunct project team ATOLL, I did a lot of work on formalised lexicons and grammars, as well as syntax analysis, the analysis of the grammatical structure of sentences”, explains the researcher, who gravitated towards this field seeking to combine his two loves: languages and computing. “I continued my research in NLP with the Alpage project team, the forerunner to ALMAnaCH, while expanding my research to computational linguistics, which involves studying linguistics from a quantitative and computational perspective.”
Work was carried out on a number of different languages. “It was important to analyse a diversity of languages and how they function in order to understand why some things may be applicable in certain languages but not others”, says Benoît Sagot. In addition to his work on English and French, the director of research also worked to varying extents on a range of other languages. Around a decade or so back he co-supervised a PhD on “the segmentation of Mandarin”, a language which is difficult to process using computer tools: “There are no spaces between words, meaning you need to find another way of identifying them for the purposes of analysis and processing.” He also co-founded the startup Opensquare, where he developed systems for analysing surveys carried out among employees of major international companies whose staff speak dozens of different languages.
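To make the segmentation problem concrete, here is a minimal sketch of one classical baseline, greedy forward maximum matching against a word list. It is not presented as the method used in the PhD mentioned above; the function name `segment`, the `max_len` parameter and the toy lexicon are all assumptions introduced for illustration.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    found in the lexicon; fall back to a single character otherwise.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon, invented for this example only.
LEXICON = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}

print(segment("我喜欢自然语言处理", LEXICON))
# ['我', '喜欢', '自然语言', '处理']
```

A purely lexicon-driven approach like this fails on unknown words and on ambiguous character sequences, which is part of why segmentation is treated as a research problem rather than a lookup task.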
Machine language learning - a booming sector
In tackling these challenges, the researchers within ALMAnaCH are able to count on increasing processing capacities, drawing on machine learning technology while contributing to its development. “Major progress has been made in natural language processing (a sub-domain of artificial intelligence) in recent years thanks to the generalisation of neural networks”, says Benoît Sagot. The purpose of these networks is to teach computers how to analyse and process data in a way that is inspired - albeit remotely - by the workings of the human brain. Neural networks are among the methods used for both supervised learning (using annotated examples) and unsupervised learning (using raw data), thanks chiefly to deep learning, which relies on large neural networks.
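As a rough illustration of supervised learning from annotated examples, the sketch below trains a single artificial neuron (logistic regression) by gradient descent. It is a deliberately tiny stand-in for the large networks described above, and everything in it is an assumption made for the example: the two features (counts of positive and negative words in a text), the labels, and the names `train_neuron` and `predict`.

```python
import math


def train_neuron(examples, epochs=200, lr=0.5):
    """Fit one sigmoid neuron on (features, label) pairs with per-example updates."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            z = sum(w * x for w, x in zip(weights, features)) + bias
            prediction = 1.0 / (1.0 + math.exp(-z))
            error = prediction - label  # gradient of the log loss w.r.t. z
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, features):
    """Return 1 if the neuron's weighted sum is positive, else 0."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if z > 0 else 0


# Hypothetical annotated data: [positive-word count, negative-word count] -> label
# (1 = favourable text, 0 = unfavourable text).
EXAMPLES = [
    ([2, 0], 1), ([1, 0], 1), ([3, 1], 1),
    ([0, 1], 0), ([0, 2], 0), ([1, 3], 0),
]

weights, bias = train_neuron(EXAMPLES)
print(predict(weights, bias, [4, 1]), predict(weights, bias, [1, 4]))
```

The annotations are what make this supervised: the neuron is only ever corrected against labels a human supplied. Unsupervised learning, by contrast, would have to find structure in the raw texts without them.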
Educating the public and continuing to innovate
A renowned expert in the field, Benoît Sagot is delighted to now have the chance to present these breakthroughs at the Collège de France. “It is a real honour to have been given this opportunity. This is a social issue with significant implications. My goal is to give as many people as possible the keys to understanding it.”
The chair will run from 30 November 2023 to 9 February 2024, with a one-hour class each week. Catch-up videos will also be made available for each class on the Collège de France website. Each class will be followed by an hour-long lecture by a guest speaker.
The first class (on 30 November 2023 at 6pm), entitled “Teaching Languages to Machines”, will introduce natural language processing in its historical context while providing an overview of where the discipline currently stands. The programme for subsequent classes includes a look at textual data and how it can be represented, followed by introductions to symbolic and probabilistic approaches, language models, contemporary approaches to neural networks, machine translation systems, the challenges raised by chatbots and current research in multimodality (combining text and speech or text and images).
Looking further ahead: making models more frugal
This chair will also provide the wider public with an opportunity to learn about those research topics judged by ALMAnaCH to be a priority. “One of the biggest challenges facing us over the months and years to come is frugality”, says Benoît Sagot. “Language models and chat models are very expensive. Ideally we wouldn't need as many processing resources or as much training data to produce them, particularly for languages where there is not much textual data available.”
Other challenges include robustness, meaning the capacity of applications to function with texts that depart from the most common language registers, and “alignment”, a term which refers to the capacity of generative AI systems to respect specific principles and values. These are ambitious targets, and they provide Benoît Sagot and his team with plenty of motivation.
Verbatim
The aim of my classes at the Collège de France will be to introduce the wider public to the most important research currently being carried out in natural language processing. I believe that it's important to shine a spotlight on a subject that has got a lot of publicity over the past year thanks to the release of ChatGPT.
Author
Benoît Sagot
Position
Head of the ALMAnaCH project team, and visiting professor at the Collège de France
Benoît Sagot’s brief bio
2000: graduated from the École Polytechnique.
2002-2006: doctoral student with the ATOLL project team (Atelier d’outils logiciels pour le langage naturel - Natural Language Software Tools Studio) at Inria Rocquencourt.
2007-2016: Inria research fellow with the Alpage project team (Analyse linguistique profonde à grande échelle - Large-scale deep linguistic analysis), before being made head of this team.
2017 to present: head of the ALMAnaCH project team (Automatic Language Modelling and Analysis & Computational Humanities).
2019 to present: holder of a chair at PRAIRIE (Interdisciplinary Research and Education in AI), an interdisciplinary institute.
The ALMAnaCH project team
The ALMAnaCH (Automatic Language Modelling and Analysis & Computational Humanities) project team is dedicated to natural language processing (NLP), a key area of artificial intelligence and digital humanities, at the interface between theoretical computer science, machine learning and linguistics. Its research concerns the training, analysis and use of neural language models (the team produced the CamemBERT and CamemBERTa models, helped produce BLOOM and is working on the most recent models) as well as applications based on these models (including machine translation and conversational agents) and their interpretability, while continuing some earlier work based on symbolic and statistical approaches.
The team is also working on the development of linguistic resources (e.g. the OSCAR corpus, several treebanks and parallel corpora, lexicons, historical corpora built using OCR and HTR applied to archival and other historical documents) and on the extraction and retrieval of information, in particular from scientific, medical and legal corpora as well as historical documents. One of the team's cross-cutting issues is that of linguistic variation, both in a historical sense and between contemporary states of language (development of robust NLP systems for noisy web content and dialectal varieties of language, for example).
List of Benoît Sagot's lectures and seminars (in French)
“Large-scale Language Models & Their Training Corpora” (video in English), lecture by Benoît Sagot at the Czech-French AI workshop organised by the Czech Ministry of Foreign Affairs and the French Embassy in Prague on 12 and 13/9/2022.