What is the best way to manage billions of data items stored in a variety of formats across thousands of servers? This might seem like an impossible challenge, but it is central to the work of Ioana Manolescu, director of the research team Cedar, a joint undertaking involving Inria and the École Polytechnique. Applications of this work could have a significant societal impact, for instance in journalistic fact-checking.
Beginning with her first Master’s internship at Inria in 1997, Ioana Manolescu set about tackling specific problems involving databases, auditing the performance of the software used to query them - an experience that was to leave a lasting impression: “What I found particularly interesting was working on real software programs specifically designed for subjects with an impact on real life”, she recalls. “Most importantly, I realised that the volume of data could see processing times skyrocket, enough in some cases to make them prohibitive. I said to myself ‘What good is it having all this data if we’re not able to make use of it?’”
Inria, a choice made with the heart
More than twenty years later, this question is still the guiding principle for Ioana Manolescu’s research. After a second internship at Inria, followed by a PhD, the researcher joined the institute in 2002 - a choice she says she made with her heart: “I was really made to feel at home within Inria’s international environment, where I found it very easy to settle in, quickly gaining the trust of my supervisors”.
While Ioana Manolescu was climbing the ladder, becoming a director of research in 2010, database volumes did indeed skyrocket, with databases now containing billions of data items. Any database can be distributed across thousands of servers through the magic of the Cloud. The data is recorded in multiple formats, including structured text, semi-structured documents, vectors, RDF graphs, numerical values, and so on.
“Managing databases has become increasingly complex”, explains the researcher. “However, unlike doctors, who take patients’ lives into their hands, in digital science, we’re fortunate enough to be able to carry out experiments without there being any drastic consequences in the event of an error. What this means is that I’m able to take risks, and to explore avenues I find interesting. And quite often, these risks pay off.”
Data processing and representation, algebraic techniques used for queries, knowledge management, machine learning, etc.: within her research team Cedar, a joint undertaking involving Inria and the École Polytechnique, Ioana Manolescu has brought together a wide range of different skillsets. This puts the team in a position to tackle precise applications, the aim being to devise effective solutions.
An automatic tool for fact-checking in journalism
If there is one project that is particularly close to Manolescu’s heart, it is fact-checking in journalism, or the art of verifying information systematically: “Having grown up in Romania under Ceausescu, I’m very aware of how fortunate we are to live in a country like France with a free press. Unfortunately, that doesn’t stop bias from entering democratic debate, with emotion and impulse triumphing too often over facts and reason. There are a number of high-quality sources, such as the Insee figures, for example, but these are difficult to use because they are insufficiently indexed. That’s where we come in - to provide journalists with really effective tools.”
The result? In 2013, Ioana Manolescu’s team published one of the very first scientific papers on fact-checking. Two years later, the team launched the French National Research Agency project ContentCheck, working in collaboration with the team from the Les décodeurs column in Le Monde.
The aim?
To help journalists check facts more quickly using data available online. Using a tool developed by PhD student Duc Cao in collaboration with Xavier Tannier from Paris-Sorbonne University, the team was able to change the format and layout of the Insee databases, making them easier to use, and to automate checks that would take hours to perform manually.
The result?
Once a text is submitted, ContentCheck identifies every mention of statistical data - e.g. “youth unemployment rose to 20% in 2017” - and then verifies each one. In less than a second, the tool provides the exact figure or, failing that, the table or the study in which the relevant information can be found.
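To give a feel for the first step, spotting statistical mentions in a text, here is a minimal Python sketch. It is not ContentCheck's actual method, which the article does not describe: the `StatMention` class, the `find_stat_mentions` function and the two regular expressions are assumptions made purely for illustration, and the verification against a statistical source is left out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class StatMention:
    """A sentence containing a percentage and a year (hypothetical structure)."""
    sentence: str
    value_percent: float
    year: int


# Illustrative patterns only: a percentage such as "20%" or "7.5 %",
# and a four-digit year starting with 19 or 20.
PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s?%")
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def find_stat_mentions(text: str) -> list[StatMention]:
    """Return the sentences that contain both a percentage and a year."""
    mentions = []
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        percent = PERCENT.search(sentence)
        year = YEAR.search(sentence)
        if percent and year:
            mentions.append(
                StatMention(sentence, float(percent.group(1)), int(year.group(0)))
            )
    return mentions


text = "The debate was heated. Youth unemployment rose to 20% in 2017. Nobody disagreed."
for mention in find_stat_mentions(text):
    print(mention.value_percent, mention.year, "-", mention.sentence)
```

A real system would also need to work out what the figure measures (here, youth unemployment) before it could look the value up in a source such as the Insee tables.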
Automatically cross-referencing sources of information
Still in the world of the press, Ioana Manolescu is also interested in “data journalism”, i.e. the process of exploring public databases and viewing the data contained within them. In the spring of 2019, the scientist gave a demonstration of ConnectionLens, a tool capable of clustering and context-matching several databases at once, to the Minister for Defence Florence Parly: “By incorporating the list of current deputies, extracts from the Official Records of students entering the École Polytechnique over the past ten or so years and Areva’s organisation chart, ConnectionLens was able to demonstrate that a deputy from the LREM party had graduated in the same class as the current CEO of Areva”, explains Ioana Manolescu. Journalists can now access this sort of useful information in just a few clicks, instead of having to carry out a long, painstaking search.
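The general idea of cross-referencing sources through the values they share can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the algorithm used by ConnectionLens: the `normalise` and `shared_values` functions are hypothetical, and the names and records below are invented.

```python
from collections import defaultdict


def normalise(value: str) -> str:
    """Lower-case a value and collapse whitespace so spelling variants match."""
    return " ".join(value.lower().split())


def shared_values(datasets: dict[str, list[dict[str, str]]]) -> dict[str, set[str]]:
    """Map each normalised value to the datasets it occurs in,
    keeping only values found in at least two different datasets."""
    seen: dict[str, set[str]] = defaultdict(set)
    for dataset_name, records in datasets.items():
        for record in records:
            for value in record.values():
                seen[normalise(value)].add(dataset_name)
    return {value: names for value, names in seen.items() if len(names) >= 2}


# Invented sample data, for illustration only.
datasets = {
    "deputies": [{"name": "Alice Martin", "party": "Party A"}],
    "graduates": [
        {"name": "alice  martin", "class": "2005"},
        {"name": "Bob Durand", "class": "2005"},
    ],
    "org_chart": [{"name": "Bob Durand", "role": "CEO"}],
}

for value, sources in shared_values(datasets).items():
    print(value, "appears in", sorted(sources))
```

In this toy data, the graduates list is the bridge: it shares one name with the list of deputies and another with the organisation chart, and both graduates belong to the same class.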
Another example of how Cedar’s research has been applied comes in the search for anomalies in “time series”, i.e. sequences of measurements recorded over time. If you record and store the time a server takes to complete successive tasks, for example, the system will be able to distinguish between “normal” fluctuations in duration, caused by a spike in the workload, say, and fluctuations caused by a fault or a breakdown.
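As a rough illustration of flagging unusual task durations, the sketch below uses a simple robust baseline: a score based on the median and the median absolute deviation (MAD). The article does not say which technique Cedar uses, so this is an assumed stand-in; the `find_anomalies` function, the threshold of 3.5 and the sample durations are all illustrative. On its own, such a baseline cannot tell a workload spike from a fault; it only marks values that are far from typical.

```python
from statistics import median


def find_anomalies(durations: list[float], threshold: float = 3.5) -> list[int]:
    """Return the indices of durations whose robust score exceeds the threshold."""
    med = median(durations)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(d - med) for d in durations)
    if mad == 0:
        # At least half the values equal the median: no usable spread estimate.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data is roughly normally distributed.
    return [
        i
        for i, d in enumerate(durations)
        if 0.6745 * abs(d - med) / mad > threshold
    ]


# Invented task durations in seconds; the sixth one is far from the rest.
durations = [1.0, 1.1, 0.9, 1.2, 1.0, 9.5, 1.1, 0.95]
print(find_anomalies(durations))  # → [5]
```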
Lastly, Ioana Manolescu’s work has also involved developing technology for interactive data exploration, designed to help users find what they’re looking for in enormous databases. Anyone looking for a flat, for example, will know how long they spend on websites scrolling through listings to find properties matching their criteria, whether essential (neighbourhood, surface area, price) or preferred (which floor, whether it has a balcony, etc.). Who knows - perhaps in the future technology will allow us to do this in just a few seconds, thanks to the work carried out within Cedar by the team headed up by Yanlei Diao, a professor at the École Polytechnique.
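The distinction between essential and preferred criteria can be made concrete with a small Python sketch: essential criteria filter listings out, while preferred criteria only influence the ranking. This is not the technology developed within Cedar, which the article does not detail; the `rank_listings` function and the listings below are invented for illustration.

```python
from typing import Any, Callable

Listing = dict[str, Any]
Criterion = Callable[[Listing], bool]


def rank_listings(
    listings: list[Listing],
    essential: list[Criterion],
    preferred: list[Criterion],
) -> list[Listing]:
    """Keep listings meeting every essential criterion, then rank them
    by how many preferred criteria they satisfy (most first)."""
    matches = [item for item in listings if all(check(item) for check in essential)]
    # sorted() is stable, so listings with equal scores keep their original order.
    return sorted(
        matches,
        key=lambda item: sum(check(item) for check in preferred),
        reverse=True,
    )


# Invented listings, for illustration only.
listings = [
    {"id": "A", "neighbourhood": "centre", "surface_m2": 45, "price": 900, "floor": 0, "balcony": False},
    {"id": "B", "neighbourhood": "centre", "surface_m2": 50, "price": 950, "floor": 3, "balcony": True},
    {"id": "C", "neighbourhood": "suburb", "surface_m2": 70, "price": 800, "floor": 2, "balcony": True},
    {"id": "D", "neighbourhood": "centre", "surface_m2": 42, "price": 1000, "floor": 2, "balcony": False},
]
essential = [
    lambda item: item["neighbourhood"] == "centre",
    lambda item: item["surface_m2"] >= 40,
    lambda item: item["price"] <= 1000,
]
preferred = [
    lambda item: item["floor"] >= 1,
    lambda item: item["balcony"],
]

print([item["id"] for item in rank_listings(listings, essential, preferred)])  # → ['B', 'D', 'A']
```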
Getting public databases ready for AI
In early 2019, Ioana Manolescu was also appointed scientific director of the “AI Lab”, a public initiative to prepare for the deployment of artificial intelligence within public authorities. In this role, the researcher will be responsible for selecting and overseeing projects, still from the perspective of database operations. This will include data homogenisation, improved indexing and optimised processing times, whether at the IGN (Institut Géographique National), the French Supreme Court, the Shom (Service hydrographique et océanographique de la Marine) or the DGCCRF (Direction générale de la concurrence, de la consommation et de la répression des fraudes - the Directorate General for Competition, Consumer Affairs and Fraud Prevention).