Natural language processing in focus at the Collège de France

Artificial intelligence: from fantasy to reality

Benoît Sagot, who was recently appointed to the prestigious “IT and Digital Science” chair at the Collège de France for the 2023-2024 academic year, has his sights set on helping people to understand artificial intelligence (AI) systems, which are used among other things to automatically generate text from written instructions. Sagot is head of ALMAnaCH, a project team at the Inria Paris Centre which specialises in natural language processing and digital humanities. He is also the holder of a chair at PRAIRIE, an interdisciplinary institute for research in artificial intelligence.

Infographic explaining the capabilities of natural language processing (illustration: Gino Crescoli/Pixabay)

Benoît Sagot feels that the need to explain this technology to people is now more pressing than ever, at a time when there is “so much noise, much of it toxic” around generative AI, as well as “a lot of scaremongering”. The public release of ChatGPT, developed by the US organisation OpenAI, “doesn't really constitute a scientific or technological revolution, but it has given ordinary people an opportunity to play with it, and a glimpse of the major changes it could bring about in various different fields. ChatGPT shows the significant progress made in NLP over the past 20 years or so, including for mass-market applications such as spell checking and machine translation.”

The more languages, the harder it gets for researchers

Research in NLP continues to move forward, as a look back at Benoît Sagot's career to date illustrates: “When I started out back in 2002 with the now defunct project team ATOLL, I did a lot of work on formalised lexicons and grammars, as well as syntactic analysis (the analysis of the grammatical structure of sentences)”, explains the researcher, who gravitated towards this field seeking to combine his two loves: languages and computing. “I continued my research in NLP with the Alpage project team, the forerunner to ALMAnaCH, while expanding my research to computational linguistics, which involves studying linguistics from a quantitative and computational perspective.”

This work spanned a number of different languages. “It was important to analyse a diversity of languages and how they function in order to understand why some things may be applicable in certain languages but not others”, says Benoît Sagot. In addition to his work on English and French, the director of research also worked to varying extents on a range of other languages. About a decade ago he co-supervised a PhD on “the segmentation of Mandarin”, a language which is difficult to process using computer tools: “There are no spaces between words, meaning you need to find another way of identifying them for the purposes of analysis and processing.” He also co-founded the startup Opensquare, where he developed systems for analysing surveys carried out among employees of major international companies whose staff speak dozens of different languages.
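To make the segmentation problem mentioned above concrete, here is a minimal sketch of forward maximum matching, a classic dictionary-based baseline for splitting unspaced text into words. It is an illustration only, not the method used in the PhD or by the team: the function name `segment`, the `max_len` parameter and the toy lexicon are all assumptions made for this example.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching (illustrative baseline only).

    At each position, take the longest substring found in the lexicon;
    fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon, invented for this example.
lexicon = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}
print(segment("我喜欢自然语言处理", lexicon))
# ['我', '喜欢', '自然语言', '处理']
```

A greedy approach like this fails whenever the longest match is not the right one, which is one reason real systems tend to rely on statistical or neural models rather than a lexicon alone.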

Mandarin manuscript filled with Chinese characters (image by Markus from Pixabay)

Machine learning for language: a booming sector

In tackling these challenges, the researchers within ALMAnaCH can count on ever-increasing computing capacity, drawing on machine learning technology while contributing to its development. “Major progress has been made in natural language processing (a sub-domain of artificial intelligence) in recent years thanks to the widespread adoption of neural networks”, says Benoît Sagot. These networks are designed to let computers analyse and process data in a way that is inspired - albeit remotely - by the workings of the human brain. Neural networks are used for both supervised learning (from annotated examples) and unsupervised learning (from raw data), chiefly through deep learning, which relies on large neural networks.

Educating the public and continuing to innovate

A renowned expert in the field, Benoît Sagot is delighted to now have the chance to present these breakthroughs at the Collège de France. “It is a real honour to have been given this opportunity. This is a social issue with significant implications. My goal is to give as many people as possible the keys to understanding it.”

The chair will run from 30 November 2023 to 9 February 2024, with a one-hour class each week, followed by an hour-long lecture by a guest speaker. Catch-up videos of each class will also be made available on the Collège de France website.

François Champollion courtyard at the Collège de France (credits: Patrick Imbert/Collège de France)

The first class (on 30 November 2023 at 6pm), entitled “Teaching Languages to Machines”, will introduce natural language processing in its historical context while providing an overview of the current state of the discipline. Subsequent classes will look at textual data and how it can be represented, followed by introductions to symbolic and probabilistic approaches, language models, contemporary neural network approaches, machine translation systems, the challenges raised by chatbots and current research in multimodality (combining text and speech or text and images).

Looking further ahead: making models more frugal

This chair will also give the wider public an opportunity to learn about the research topics that ALMAnaCH considers a priority. “One of the biggest challenges facing us over the months and years to come is frugality”, says Benoît Sagot. “Language models and chat models are very expensive. Ideally we wouldn't need as many computing resources or as much training data to produce them, particularly for languages where there is not much textual data available.”

Other challenges include robustness, meaning the capacity of applications to handle texts that depart from the most common registers of language, and “alignment”, a term which refers to the capacity of generative AI systems to respect specific principles and values. These are ambitious targets that give Benoît Sagot and his team plenty of motivation.

Verbatim

The aim of my classes at the Collège de France will be to introduce the wider public to the most important research currently being carried out in natural language processing. I believe that it's important to shine a spotlight on a subject that has got a lot of publicity over the past year thanks to the release of ChatGPT.

Benoît Sagot

Head of the ALMAnaCH project team, and visiting professor at the Collège de France

Marguerite de Navarre Amphitheatre at the Collège de France (credits: Patrick Imbert/Collège de France)

Benoît Sagot’s brief bio

Portrait of Benoît Sagot (credits: Patrick Imbert/Collège de France)

2000: graduated from the École Polytechnique.

2002-2006: doctoral student with the ATOLL project team (Atelier d’outils logiciels pour le langage naturel - Natural Language Software Tools Studio) at Inria Rocquencourt.

2006: PhD on “Automated analysis of French: lexicons, formalisms and parsers”, Paris-Diderot University (Paris 7).

2007-2016: Inria research fellow with the Alpage project team (Analyse linguistique profonde à grande échelle - Large-scale deep linguistic analysis), later becoming head of the team.

2017 to present: head of the ALMAnaCH project team (Automatic Language Modelling and Analysis & Computational Humanities).

2019 to present: holder of a chair at PRAIRIE (Interdisciplinary Research and Education in AI), an interdisciplinary institute.

The ALMAnaCH project team

The ALMAnaCH (Automatic Language Modelling and Analysis & Computational Humanities) project team is dedicated to natural language processing (NLP), a key area of artificial intelligence and digital humanities, at the interface between theoretical computer science, machine learning and linguistics. Its research concerns the training, analysis and use of neural language models (the team produced the CamemBERT and CamemBERTa models, helped produce BLOOM and is working on the most recent models) as well as applications based on these models (including machine translation and conversational agents) and their interpretability, while continuing some earlier work based on symbolic and statistical approaches.

The team is also working on the development of linguistic resources (e.g. the OSCAR corpus, several treebanks and parallel corpora, lexicons, and historical corpora built using OCR and HTR applied to archival and other historical documents) and on information extraction and retrieval, in particular from scientific, medical and legal corpora as well as historical documents. One of the team's cross-cutting concerns is linguistic variation, both over time and between contemporary varieties of a language (for example, developing robust NLP systems for noisy web content and dialectal varieties).

List of Benoît Sagot's lectures and seminars (in French)

November 30, 2023
Benoît Sagot’s inaugural lecture: "Apprendre les langues aux machines"

December 8, 2023
Benoît Sagot’s first lecture: "Représenter les unités textuelles"
Seminar by Jean-Baptiste Camps: "Quelques exemples d'application du TAL aux humanités numériques"

December 15, 2023
Benoît Sagot’s lecture: "Approches symboliques et probabilistes"
Seminar by Guillaume Jacques: "Deux exemples d’usage des transducteurs en linguistique"

December 22, 2023
Benoît Sagot’s lecture: "Modèles de langue"
Seminar by Emmanuel Dupoux: "Apprendre un modèle de langue à partir de l’audio"

January 12, 2024
Benoît Sagot’s lecture: "Traduction automatique"
Seminar by François Yvon: "Traduction neuronale massivement multilingue"

January 19, 2024
Benoît Sagot’s lecture: "Approches neuronales pour quelques tâches applicatives"
Seminar by Claire Gardent: "Génération de texte à partir de connaissances"

January 26, 2024
Benoît Sagot’s lecture: "Linguistique computationnelle"
Seminar by Elena Cabrio: "Analyse automatique de l'argumentation dans les débats politiques"

February 2, 2024
Benoît Sagot’s lecture: "Converser avec la machine"
Seminar by Philippe Blache: "Prédire c'est comprendre : un modèle neuro-cognitif du langage fondé sur la prédiction"

February 9, 2024
Benoît Sagot’s lecture: "Multimodalités : TAL et images, TAL et parole"
Seminar by Yann LeCun: "L'IA axée sur les objectifs : vers des machines capables d'apprendre, de raisonner et de planifier"

Find out more about the annual chair in “IT and Digital Science” (all in French)

Collège de France press release: “Apprendre les langues aux machines - Leçon inaugurale” (PDF).
More about the Collège de France annual chair “Informatique et sciences numériques” (IT and Digital Science).
Interview with Benoît Sagot: “The boundary between engineering and research is shifting fast”.

Find out more about AI and automatic language processing

Benoît Sagot and Aaron Hertzmann talk about AI, conference at the Inria Paris Centre on 23/11/2023, Inria.
[AI and its challenges] “An introduction to deep learning, a crucial component of modern AI” (video) lecture given by Benoît Sagot at a conference organised by the Campus de l’Innovation pour les Lycées (part of the Collège de France) and by SciencesPo on 28/9/2023.
Ethics and chatbots (podcast), Interstices (in French), 4/9/2023.
New technology: Do accents need to be “erased” by artificial intelligence?, 20 Minutes (with The Conversation, in French), 18/1/2023.
“Large-scale Language Models & Their Training Corpora” (video in English), lecture by Benoît Sagot at the Czech-French AI workshop organised by the Czech Ministry of Foreign Affairs and the French Embassy in Prague on 12 and 13/9/2022.
BigScience has big ambitions for language models, CNRS Le Journal, 12/7/2022.
Limiting divergence in legal rulings thanks to artificial intelligence, Inria, 21/2/2022.