Every now and then in the life of a processor, things happen that can result in transient errors. Cosmic radiation is a classic example: a particle arriving from outer space can strike the electronic circuitry and occasionally provoke a so-called ‘bit flip’. In other words, a bit of binary code that is supposed to be a 0 suddenly, and very unduly, becomes a 1, potentially snowballing into a malfunction further down the line.
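To see why a single flipped bit can matter so much, here is a minimal sketch in Python (an illustration for this article, not code from the project). It flips one bit in the 32-bit floating-point encoding of a number, the format in which neural network weights are commonly stored; depending on which bit is hit, the damage ranges from negligible to enormous.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 32-bit float encoding flipped."""
    raw = struct.unpack("<I", struct.pack("<f", value))[0]  # float -> raw bits
    raw ^= 1 << bit                                         # flip the chosen bit
    return struct.unpack("<f", struct.pack("<I", raw))[0]   # raw bits -> float

weight = 0.5
print(flip_bit(weight, 3))   # low mantissa bit hit: 0.5000004768... (harmless)
print(flip_bit(weight, 30))  # high exponent bit hit: ~1.7e38 (catastrophic)
```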
“With Artificial Intelligence (AI) now performing safety-critical tasks in embedded systems, such a hardware error could have dire consequences. For instance, in an autonomous car, the driving AI could conceivably mistake a pedestrian crossing the street for a mere bird.”
Angeliki Kritikakou, Associate Professor (University of Rennes), TARAN team
To ensure hardware reliability despite bit flips and other transient errors, the classical method used in critical fields such as avionics or satellites relies on redundancy by triplication: a calculation is performed not once but three times, so that if one output value differs from the other two, it is simply voted out as an error. “Yet, in the case of AI, this method no longer applies due to the sheer number of calculations involved. It doesn’t scale. One cannot triplicate every single calculation. It is just too costly. In embedded systems, energy frugality is paramount. So we need a new approach. And that’s the purpose of Re-Trusting.”
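The voting step at the heart of triplication fits in a few lines. The sketch below, in Python and purely illustrative, compares three redundant copies of a result: a single corrupted copy is outvoted by the two that agree.

```python
def vote(a, b, c):
    """Majority vote over three redundant outputs of the same calculation."""
    if a == b or a == c:
        return a  # at least two copies agree on `a`
    if b == c:
        return b  # `a` was the corrupted copy
    raise RuntimeError("no two outputs agree: more than one fault occurred")

# One replica suffers a bit flip (bit 30 of the integer 42); the voter
# still returns the correct result.
assert vote(42, 42, 42 + (1 << 30)) == 42
```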
Funded by the French National Research Agency (ANR), the project is coordinated by the Institute of Nanotechnology of Lyon (INL, a joint research unit of CNRS, ECL, INSA Lyon, Université Lyon 1 and CPE Lyon). In addition to Inria, it also includes the Sorbonne University computer science laboratory (LIP6, CNRS) and Thales.
The Chip Industry Has Started Rolling Out a Range of Custom Hardware Accelerators
Re-Trusting comes in a context in which a growing number of AI applications that used to run on cloud servers are now migrating to edge devices, including mobile phones and the Internet of Things, in order to reduce communication latency, among other reasons. Echoing this trend, the chip industry has started rolling out a range of custom hardware accelerators meant to support the computational needs of these resource-hungry embedded deep learning algorithms.
The project focuses precisely on such accelerators. “We have two of them,” says Inria scientist Marcello Traiola. “One is provided by LIP6, the other by Thales. Our case study also comprises two different types of deep learning algorithms: a Deep Neural Network (DNN) on the one hand and a Spiking Neural Network (SNN) on the other. In essence, they do the same thing, but in a slightly different way.”
“Our primary goal is to come up with a methodology as generic as possible for addressing the problem of how to best assess the reliability of the hardware+AI system.”
Marcello Traiola, Inria research associate, TARAN team
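To give a rough idea of the DNN/SNN difference Traiola alludes to, the simplified sketch below (illustrative Python, not the project’s actual models) contrasts the two: a DNN neuron produces a continuous value in a single pass, while an SNN neuron integrates its input over time and communicates through discrete spikes.

```python
import numpy as np

def dnn_neuron(inputs, weights):
    """Classic artificial neuron: weighted sum followed by a ReLU activation."""
    return max(0.0, float(np.dot(inputs, weights)))

def snn_neuron(spike_trains, weights, threshold=1.0, leak=0.9):
    """Leaky integrate-and-fire neuron: the membrane potential accumulates
    weighted input spikes over time; a spike is emitted (and the potential
    reset) whenever the threshold is crossed."""
    potential, out = 0.0, []
    for spikes_t in spike_trains:  # one time step at a time
        potential = leak * potential + float(np.dot(spikes_t, weights))
        fired = potential >= threshold
        out.append(1 if fired else 0)
        if fired:
            potential = 0.0
    return out

weights = np.array([0.6, 0.4])
print(dnn_neuron([1.0, 1.0], weights))                # 1.0, in one shot
print(snn_neuron([[1, 0], [0, 1], [1, 1]], weights))  # [0, 0, 1], over time
```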
A key concept here is fault severity. In the context of huge neural networks, not every hardware fault will be malignant; many will actually prove benign. What the researchers have in mind is a model-based analysis of the hardware+software system that could grade fault severity. “Then usual fault-tolerance mechanisms such as triplication could be applied selectively, where they are truly needed, thus keeping the extra cost minimal.”
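In other words, something like the toy selection below, where every name and number is hypothetical: once each part of a network has been graded, only the parts whose faults proved severe would pay the triplication overhead.

```python
# Hypothetical severity grades: the fraction of injected faults in each
# layer that ended up changing the network's prediction (made-up numbers).
severity = {"conv1": 0.02, "conv2": 0.71, "fc": 0.45}

THRESHOLD = 0.30  # assumed design choice: protect layers graded above this

protected = [layer for layer, score in severity.items() if score >= THRESHOLD]
print(protected)  # ['conv2', 'fc'] -- only these pay the redundancy cost
```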
Analysis Is Challenging Due to the Huge Exploration Space
However, such an analysis is a challenge in and of itself. “We are faced with a huge exploration space. It would be hard to go through it in its totality. But as huge as it might be, the system comprises different parts that are not all equally important. So we must find smart and systematic approaches capable of identifying the most important areas of this space, and then explore those areas locally as much as possible.” In practical terms, within the scope of this project, fault-injection experiments are performed through software simulation.
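Such a simulated injection can be pictured as follows. This sketch assumes a model whose weights live in a NumPy array; it illustrates the principle, not the project’s actual tooling.

```python
import numpy as np

def inject_fault(weights: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a copy of `weights` with one randomly chosen bit flipped."""
    faulty = weights.astype(np.float32)   # work on a copy
    raw = faulty.view(np.uint32)          # reinterpret the floats as raw bits
    idx = rng.integers(raw.size)          # pick one weight at random
    bit = np.uint32(rng.integers(32))     # pick one of its 32 bits
    raw.flat[idx] ^= np.uint32(1) << bit  # flip it in place
    return faulty                         # re-run the model on this copy

rng = np.random.default_rng(0)
golden_weights = np.ones((4, 4), dtype=np.float32)
print(inject_fault(golden_weights, rng))  # exactly one entry is corrupted
```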
“Getting a full understanding of the fault propagation actually calls for a cross-layer analysis,” Kritikakou points out. “It’s a bottom-up approach. LIP6 will focus on the lowest level: transistors, logic gates and so on. Once all the components are characterized, Inria will pick up from there and study the impact of faults at the algorithmic level. In particular, we need to come up with a whole set of metrics that simply do not exist at the present time. Leveraging these research findings, INL will then work on devising a series of fault-tolerance mechanisms capable of protecting the hardware+AI system.”
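By way of illustration only (this is not one of the project’s metrics, which remain to be defined), an algorithm-level metric could, for instance, measure how often injected faults silently change a classifier’s decision:

```python
import numpy as np

def silent_misclassification_rate(golden_logits, faulty_runs):
    """Fraction of fault-injection runs that still produce an answer,
    but a different top-1 class than the fault-free ("golden") run."""
    golden_class = np.argmax(golden_logits, axis=-1)
    changed = [bool(np.any(np.argmax(run, axis=-1) != golden_class))
               for run in faulty_runs]
    return float(np.mean(changed))

golden = np.array([0.1, 0.7, 0.2])                  # fault-free output
runs = [np.array([0.1, 0.6, 0.3]),                  # fault, same decision
        np.array([0.5, 0.2, 0.3])]                  # fault flips the decision
print(silent_misclassification_rate(golden, runs))  # 0.5
```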
Ultimately, these fault-tolerance techniques will be integrated into both accelerators by Thales, the partner in charge of the benchmarking work package. By the end of the project, in the fall of 2025, the goal is to achieve 100% fault coverage with no more than a 10% increase in energy consumption and no more than 10% of hardware resources allocated to protection tasks.