MissingBigData: getting mathematics and computer science to work together for a better solution to the missing data problem

Date:
Published on 22/01/2020
Julie Josse, professor of statistics at the École polytechnique and researcher at its Centre for Applied Mathematics (CMAP), and Gaël Varoquaux, researcher in the Parietal team at the Inria Saclay - Île-de-France centre, have decided to combine their expertise to tackle the problem of missing data and to propose new decision support methods. The MissingBigData project has been selected by the DATAIA Institute as part of its first call for research projects. How did this collaboration come about? What are the challenges of their interdisciplinary research? Julie and Gaël present MissingBigData.

Two subjects - but the same problem

Julie Josse works with the Traumabase group, which collects data on more than 15,000 patients admitted for severe trauma, from their admission to hospital until they leave intensive care. Severe trauma is the leading cause of death among young people and a significant cause of serious disability, with a major socio-economic impact. The treatment of these patients is therefore a real public health issue. The aim of Julie's research is to analyse the data collected by Traumabase in order to provide decision support tools to emergency doctors, for example to anticipate haemorrhagic shock as soon as the patient is treated by the paramedics, so that an appropriate medical team is waiting at the hospital. However, Julie is faced with a missing data problem: “using the data, I look to see if I can create models that correctly anticipate haemorrhagic shock. Except that my data come from numerous different sources, from several hospitals, which do not necessarily have the same practices.”

For his part, Gaël Varoquaux works on medical imaging and, in particular, its use in epidemiology. In this context, Gaël analyses large volumes of data of different types (medical imaging, state of health, quality of life, etc.) whose quality is not uniform. In particular, he uses the data collected by UK Biobank, which follows the health and well-being of 500,000 volunteers with the aim of improving the prevention, diagnosis and treatment of a wide range of serious, and potentially fatal, illnesses. Gaël focuses notably on neuropsychiatry and the risk factors for mental illness (schizophrenia, autism, depression, etc.). Here too, missing data slows down the development of reliable predictive models.

How can we address these causal problems when we are missing data?

Gaël explains: “If we compare people who die in hospital and those who do not, we could conclude that hospital is a very dangerous place, as many people die there. We are well aware that this is an error. This selection bias needs to be compensated for mathematically. The problem is that we are no longer able to do that when data are missing, in particular when the missing data are informative.” The aim of the MissingBigData project is to approach the problem from a different angle and to propose new, more powerful models that use bigger data samples to impute missing values. “In order to avoid skewing conclusions, we will study multiple imputation and conditions on data dependency. Our project aims to reduce risks in healthcare, in particular by predicting outcomes better and by identifying the risk factors for adverse outcomes. We are seeking an operational solution, from the methodology to the implementation, which integrates the diversity and volume of the data [...] by considering several types of missing data” (extract from the MissingBigData project).
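The effect of informative missing data can be illustrated with a minimal sketch on simulated values (not Traumabase or UK Biobank data). It assumes a hypothetical measurement whose high values are more likely to go unrecorded; the function name, thresholds and probabilities are illustrative assumptions, and the sketch only shows the bias, not the imputation methods studied in the project.

```python
import random
import statistics


def simulate(n=20000, seed=0):
    """Return the full sample and two incomplete versions of it (hypothetical setup)."""
    rng = random.Random(seed)
    full = [rng.gauss(100.0, 15.0) for _ in range(n)]

    # Missing completely at random: every value is dropped with probability 0.3,
    # whatever its magnitude.
    at_random = [x for x in full if rng.random() > 0.3]

    # Informative missingness (assumed rule): values above 110 are dropped with
    # probability 0.7, all other values with probability 0.1.
    informative = [x for x in full if rng.random() > (0.7 if x > 110.0 else 0.1)]
    return full, at_random, informative


full, at_random, informative = simulate()
true_mean = statistics.fmean(full)
random_mean = statistics.fmean(at_random)
informative_mean = statistics.fmean(informative)

print(f"full data mean:             {true_mean:.2f}")
print(f"observed, random missing:   {random_mean:.2f}")
print(f"observed, informative miss: {informative_mean:.2f}")
```

In this simulation, the mean of the observed values stays close to the full-data mean when values go missing at random, but is pulled downward when high values are preferentially missing, since they are under-represented among the observed cases.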

Applications not only in the field of healthcare

The objective of the two researchers is to produce generic models and methods that are applicable in fields other than healthcare. “In order to add value to our work, we will develop software and make it available to the community. Our research problem is motivated by an application that, for educational purposes, everyone will be able to replicate,” Gaël underlines.

Complementary expertise

The interdisciplinarity of this team will enable a PhD student funded by the DATAIA Institute to share two team cultures, to give presentations to different audiences and to communicate with people who speak different languages: the mathematicians from the École polytechnique and the machine learning computer scientists at Inria. “The communities have difficulties understanding each other, even though we have the same problems and complementary tools,” Julie remarks. This call for projects will enable these communities to move forward with a common goal: the reusability and transfer of good practices in order to carry out participative science. To assist Julie and Gaël, the MissingBigData team will consist of Nicolas Prost, PhD student; an engineer who is currently being recruited; Erwan Scornet, lecturer in the mathematics department at the École polytechnique and head of the AI master's programme; Alexandre Gramfort, researcher at the Inria Saclay - Île-de-France centre; and Balázs Kégl, researcher at the French National Centre for Scientific Research (CNRS) and head of the Paris-Saclay Centre for Data Science.