Mammals: Memory-Augmented Models for low-latency Machine Learning Services
Last changed on 15/04/2021
A machine learning (ML) model is often trained for inference purposes, i.e., to classify specific inputs (e.g., images) or to predict numerical values (e.g., the future position of a vehicle). The ubiquitous deployment of ML in time-critical applications and unpredictable environments poses fundamental challenges to ML inference. Big cloud providers, such as Amazon, Microsoft, and Google, offer “machine learning as a service” solutions, but running the models in the cloud may fail to meet the tight delay constraints (≤10 ms) of future 5G services, e.g., for connected and autonomous cars, industrial robotics, mobile gaming, and augmented and virtual reality. Such requirements can only be met by running ML inference at the edge of the network, directly on users’ devices or at nearby servers, without relying on the computing and storage capabilities of the cloud. Privacy and data ownership concerns also call for inference at the edge.
Mammals investigates new approaches to run inference under tight delay constraints and with limited resources. In particular, it aims to provide low-latency inferences by running, close to the end user, simple ML models that can also take advantage of a (small) local datastore of examples. The focus is on algorithms that learn online what to store locally to improve inference quality and adapt to the specific context; a sketch of one such storage policy follows below.
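Deciding what to keep in the datastore is itself an online learning problem. As a purely illustrative example of the kind of algorithm involved (the policy, identifiers, and capacity below are assumptions, not the project's actual design), one could memorize only the inputs the simple model mishandles and, once a capacity budget is reached, evict the stored exception that has been useful least recently:

    from collections import OrderedDict

    class BoundedDatastore:
        """Bounded store of 'exception' examples for instance-based inference."""

        def __init__(self, capacity=1024):
            self.capacity = capacity
            # Maps a feature tuple to its label, ordered from least to most
            # recently useful.
            self.store = OrderedDict()

        def maybe_insert(self, x, y, model_prediction):
            # Heuristic: memorize only inputs the simple model gets wrong;
            # correctly handled inputs are already covered by the general rule.
            if model_prediction == y:
                return
            key = tuple(x)
            if key not in self.store and len(self.store) >= self.capacity:
                self.store.popitem(last=False)  # evict least recently useful
            self.store[key] = y

        def mark_useful(self, x):
            # Call when a stored example contributed to an inference, so that
            # frequently useful exceptions survive eviction.
            key = tuple(x)
            if key in self.store:
                self.store.move_to_end(key)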
The current approach to running inference at the edge is to take large ML models (often neural networks) and derive smaller ones through compression or distillation. Mammals explores a different direction: take advantage of data availability at the edge (where data is usually generated) to compensate for tighter computing constraints. In particular, Mammals aims to combine the decisions of a small ML model, e.g., a compressed neural network, with those of an instance-based algorithm that retrieves from a local datastore examples similar to the current input.
In some sense, we can say that the simple ML model provides the general rule, while the instance-based algorithm retrieves the relevant exceptions from the datastore.
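A minimal sketch of this combination, assuming a classifier that exposes an sklearn-style predict_proba and a brute-force nearest-neighbour search; the interpolation weight and all identifiers are illustrative assumptions, not the project's actual method:

    import numpy as np

    class HybridPredictor:
        """Mix a small model's general rule with exceptions retrieved
        from a local datastore via k-nearest neighbours."""

        def __init__(self, small_model, n_classes, k=5, lam=0.5):
            self.model = small_model  # e.g., a compressed neural network
            self.n_classes = n_classes
            self.k = k                # neighbours to retrieve
            self.lam = lam            # weight of the instance-based vote
            self.keys = []            # stored feature vectors
            self.labels = []          # their labels

        def add_example(self, x, y):
            # Memorize a (features, label) pair in the local datastore.
            self.keys.append(np.asarray(x, dtype=float))
            self.labels.append(int(y))

        def predict_proba(self, x):
            x = np.asarray(x, dtype=float)
            p_model = self.model.predict_proba(x[None, :])[0]  # general rule
            if len(self.keys) < self.k:
                return p_model
            # Retrieve the k stored examples closest to the current input.
            dists = np.linalg.norm(np.stack(self.keys) - x, axis=1)
            nearest = np.argsort(dists)[: self.k]
            # Distance-weighted vote of the retrieved exceptions.
            w = np.exp(-(dists[nearest] - dists[nearest].min()))
            p_knn = np.zeros(self.n_classes)
            for i, wi in zip(nearest, w):
                p_knn[self.labels[i]] += wi
            p_knn /= w.sum()
            # Interpolate the two predictors.
            return self.lam * p_knn + (1 - self.lam) * p_model

Fixing the mixing weight in advance is only the simplest choice; it could itself be adapted online, e.g., giving the vote more weight when the retrieved neighbours are very close to the input.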
This approach appears very promising: data is usually generated at the edge, so even a small local datastore can help compensate for the limited computing capabilities available there.
Mammals starts from a real-world problem, but it focuses on developing general methodologies to solve it. Hopefully, it will also deepen our understanding of the relation between memorization (the local datastore memorizes previously observed patterns) and generalization (the capability to extract general inference rules), a relation that is still poorly understood.
At the moment, Mammals is based on a number of pairwise collaborations with both academic partners (Università degli Studi di Torino, Politecnico di Torino, University of Massachusetts Amherst, Northeastern University, Università degli Studi di Verona) and industrial ones (Nokia Bell Labs).