The boom in data science
Machine learning techniques have seen rapid development in recent years and are being put to use in all sectors of the economy. From banking and telecommunications to retail, healthcare, industry, leisure and defence, every sector is seeking ways to take advantage of available data to enrich its service offerings, gain competitiveness, improve production processes (for example by reducing its carbon footprint) or optimise the use of resources such as energy or materials in a context of rising costs.
Modal is one of INRIA’s project teams at the cutting edge of machine learning, which consists of using the abilities of digital tools (computers and algorithms) to analyse complex data and render the information they contain intelligible, a task that exceeds human capacity given the sheer amount and diversity of such data!
Graph-based data
Among other tasks, the team focuses on unsupervised learning, a technique which Christophe Biernacki outlines by contrast with the supervised setting. ‘Unsupervised learning (also known as “clustering”) uses algorithms that structure data automatically in order to discover new knowledge, providing a synthetic view in the form of categories (or clusters). Clustering is thus a natural extension of supervised learning, in which the categories are known a priori, based on pre-existing human expertise and therefore potentially more limited.’
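As a purely illustrative sketch (not the team’s own method, which relies on probabilistic models), the idea of discovering categories without any labels can be pictured with a minimal k-means clustering on toy two-dimensional points:

```python
# Minimal k-means clustering sketch on toy 2-D points (illustrative only;
# Modal's work uses probabilistic models, not this simple heuristic).

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(cluster):
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def kmeans(points, k=2, iters=20):
    # Deterministic initialisation: spread centroids across the sorted data.
    pts = sorted(points)
    centroids = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids

# Two visually obvious groups; no labels are ever given to the algorithm.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, centroids = kmeans(data)
```

The algorithm recovers the two groups from the geometry of the data alone, which is the essence of the ‘automatic structuring’ described above.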
Algorithms draw on data coming from sensor networks, connected objects, internet usage and so on. A specificity of this data is that it is commonly represented in the form of graphs, i.e. ‘nodes’ interconnected by ‘links’. Available in increasingly sizable amounts (such as data from social networks), these graphs become even more complex when we consider that the nodes or links can be described by additional and varied information: the content of a message can be attached to a link, for example. Processing such complex data with unsupervised learning is one of Modal’s specialities. The team’s researchers develop usable methods that are as ‘agnostic’ as possible (i.e. capable of operating with any type of machine or computer system), while guaranteeing their performance through probabilistic tools.
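One hedged way to picture such attributed graph data is a pair of dictionaries, one for nodes and one for links, each carrying extra information; the user names and attributes below are invented for illustration:

```python
# Sketch of an attributed graph: nodes and links each carry additional
# information (here, hypothetical user profiles and message contents).

nodes = {
    "alice": {"joined": 2019},
    "bob": {"joined": 2021},
    "carol": {"joined": 2020},
}

# Each directed link is (source, target) -> attributes describing it,
# e.g. the content of the messages exchanged along that link.
links = {
    ("alice", "bob"): {"messages": ["hello", "meeting at 10?"]},
    ("bob", "carol"): {"messages": ["draft attached"]},
}

def neighbours(node):
    """Nodes reachable from `node` via an outgoing link."""
    return [dst for (src, dst) in links if src == node]
```

This is the raw material a graph-learning algorithm would consume: structure (who is linked to whom) plus content (what the nodes and links carry).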
The advantages of graph learning
One of the most promising methods explored by the team is machine learning on graphs, a new theme introduced by researcher Hemant Tyagi on joining Modal. He explains the principle: ‘In numerous problems encountered in science or engineering, we have access to data in the form of relations between pairs of objects within a set. These relations are easily represented by a graph, in which the nodes correspond to objects and the links to the pairs of objects for which data is available. The aim of unsupervised graph learning is to discover the underlying categories contained in this structure; in other words, to discover which objects communicate preferentially and what type of information they exchange.’
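In the idealised case where the underlying groups share no links at all, ‘discovering the categories’ reduces to finding the connected components of the graph. Real data is far noisier, which is precisely why probabilistic models are needed; the following is therefore only a drastically simplified sketch of the goal:

```python
# Connected components as a drastically simplified stand-in for graph
# clustering: it only works when communities share no links at all.
from collections import deque

def connected_components(adjacency):
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        queue, component = deque([start]), []
        seen.add(start)
        while queue:  # breadth-first traversal of one component
            node = queue.popleft()
            component.append(node)
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(sorted(component))
    return components

# Two triangles with no link between them: two obvious "categories".
adjacency = {
    0: [1, 2], 1: [0, 2], 2: [0, 1],
    3: [4, 5], 4: [3, 5], 5: [3, 4],
}
components = connected_components(adjacency)
```

Real graph clustering must instead cope with spurious links between groups and missing links within them, which is where the probabilistic guarantees mentioned above come in.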
More specifically, Christophe Biernacki and Hemant Tyagi explore complex graphs (multi-dimensional and sequential), a typical example being social networks: users are connected to each other (a network can thus be represented by a graph), they can be members of several networks simultaneously (so the graphs have multiple dimensions), and they publish content continuously (these multi-dimensional graphs change over time). For the world of research, this additional complexity opens up a vast new playground!
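One hedged way to picture such multi-dimensional, sequential data: index each observed link by the network (layer) it belongs to and the time at which it was seen. The layer names, timestamps and users below are invented for illustration:

```python
# Sketch of a multi-layer, time-stamped graph: the same pair of users can
# be linked in several networks (layers) and at several points in time.
from collections import defaultdict

# (layer, timestamp) -> list of (source, target) links observed there.
snapshots = defaultdict(list)
snapshots[("network_A", "2024-01")].append(("alice", "bob"))
snapshots[("network_B", "2024-01")].append(("alice", "bob"))
snapshots[("network_A", "2024-02")].append(("bob", "carol"))

def layers_linking(u, v):
    """Layers in which u and v were ever linked (in either direction)."""
    return sorted({layer for (layer, _), edges in snapshots.items()
                   if (u, v) in edges or (v, u) in edges})
```

Each `(layer, timestamp)` key is one ordinary graph; the collection as a whole is the multi-dimensional, evolving object the two researchers study.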
Probabilistic models for computer networks
While this research draws primarily on mathematics and probability theory, which enable a rigorous study of the characteristics and effectiveness of learning, it transfers very naturally to practical applications. Both researchers are thus involved in collaborations with the industrial sector, in particular through CIFRE theses (Industrial Agreements for Training through Research). ‘For instance, we’re working with a software publisher on a project where we apply sequential graph learning to the detection of cyber-attacks’, Hemant Tyagi tells us. ‘We use probabilistic models to characterise “normal” operations in computer networks.’
‘By analysing computer data with the help of these models, our graph-learning algorithms can recognise an “abnormal” operating mode, potentially due to a cyber-threat that would otherwise have gone unnoticed.’
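The project’s actual models are not described in this article, but the general idea of scoring deviations from a probabilistic baseline of ‘normal’ behaviour can be sketched with a deliberately simple example (all figures invented):

```python
# Toy anomaly detection: model "normal" per-minute connection counts by
# their mean and standard deviation, then flag counts far outside that
# range. A real system would use far richer probabilistic graph models.
import statistics

normal_counts = [12, 14, 11, 13, 12, 15, 13, 12, 14, 13]  # baseline traffic
mu = statistics.mean(normal_counts)
sigma = statistics.stdev(normal_counts)

def is_anomalous(count, threshold=3.0):
    """Flag a count more than `threshold` standard deviations from the mean."""
    return abs(count - mu) > threshold * sigma

# A sudden burst of connections stands out against the learned baseline.
alerts = [c for c in [13, 12, 47, 14] if is_anomalous(c)]
```

The design choice here mirrors the quote: first characterise normal operations statistically, then let anything the model finds improbable raise an alert.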
Innovative applications in logistics
Another field of application for this research, logistics, involves major distribution and sales networks. How can supply be optimised to avoid stock shortages or surpluses while still offering customers a wide-ranging catalogue? And what could graph learning bring to sales forecasting, which is by nature a challenging task?
‘To optimise their supply strategy, retail brands could draw on the communicating vessels represented by potential shifts in sales from an unavailable product to a more-or-less-similar available one. However, the data of this “substitutability graph” is not exploited in practice, because brands focus only on end products and their potential stock shortages’, Christophe Biernacki points out.
The aim of the thesis is thus to evaluate these ‘substitutability probabilities’ between products, which amounts to estimating specific links between products. This information, valuable to the retailer, can only be revealed through graph learning!
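As a hypothetical sketch of what estimating such a link could look like, one might compare a product’s sales on days when a substitute is in stock versus out of stock; the sales records and field names below are entirely invented for illustration:

```python
# Hypothetical estimate of substitutability: how much do sales of product B
# rise on days when product A is out of stock? All data below is invented.

# Each record: (day, product_A_in_stock, units_of_B_sold)
sales = [
    ("mon", True, 4), ("tue", True, 5), ("wed", False, 9),
    ("thu", True, 4), ("fri", False, 10), ("sat", True, 5),
]

def mean(xs):
    return sum(xs) / len(xs)

b_when_a_available = mean([b for _, a_in_stock, b in sales if a_in_stock])
b_when_a_missing = mean([b for _, a_in_stock, b in sales if not a_in_stock])

# Extra sales of B per stock-out day: a crude proxy for the A -> B link
# weight in a substitutability graph.
substitution_lift = b_when_a_missing - b_when_a_available
```

A graph-learning approach would estimate such weights jointly over the whole catalogue rather than one pair at a time, which is what makes the problem interesting.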
Machine learning of the future will be frugal and democratic
The success of both projects attests to the potential of this research, and the two researchers have no intention of stopping there. Their next challenge is frugality: developing algorithms that reach the same level of performance using less data. Machine learning requires vast amounts of data to produce the most accurate predictions possible, and it also consumes considerable computing resources (memory, processing power) and energy. The aim is thus to develop algorithms offering the same guarantees with fewer resources. Beyond its scientific and environmental dimension, this is also a societal challenge.
‘Thanks to their substantial computing resources and extensive databases, the digital giants (or certain countries) are obviously the leaders in learning technologies, but everyone should be able to benefit from these innovations. By working on more frugal methods, we hope to give smaller players the same opportunity to develop their activities’, Christophe Biernacki concludes.