Billions of images now flow out of the digital cornucopia. Facebook users alone account for 3,500 uploads every single second. Not so long ago, users who wanted to search, organize or compare pictures had only one option: resorting to keywords, an experience that could be a tad frustrating. But things have changed in the last decade with the advent of computer vision techniques that precisely describe visual content. A cornerstone of these approaches is to convert any given image into several thousand computed vectors, that is, mathematical descriptors.
The industry swiftly grasped the potential of this development. As early as 2009, TinEye.com offered a search engine that leverages image recognition technology. Once a picture is submitted, this bot ferrets out not only copies of the image but also any modified versions. A tool of choice indeed for photographers chasing copyright infringements throughout the vastness of the Internet. Google Goggles is another example of successful recognition technology, this time on smartphones.
Ploughing Through Giant Collections
A member of the Inria research center in Rennes, France, Hervé Jégou is credited with innovative methods that have dramatically increased the performance of such visual queries in large image banks. He designed an engine able to find relevant images in 200 ms in a collection of 110 million images. But all this belongs to the realm of query by sample. Looming ahead is another and far more daunting challenge: ploughing through giant collections with a view to automatically uncovering visual links between images, or between objects across images.
“There is currently no methodology for efficiently and accurately uncovering such visual links,” Jégou posits. Why? “Because we are simply faced with an awesome scalability problem.” While comparing a single image to a billion-sized collection is no longer a big deal, even with brute-force methods, complete cross-matching with a view to discovering all visual links between all images is another kettle of fish. “It is quadratic in the number of images and in the number of descriptors per image. Linking one million pictures currently requires some 7 hours. But translating this to one billion images means that existing approaches would take about... 7 million hours!” As an additional thorn, results prove satisfactory only for frequent visual patterns. State-of-the-art algorithms leave rare matches undetected.
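The arithmetic behind that figure can be checked with a short sketch. It assumes only what the quote states, namely that the cost grows with the square of the number of images; the function name `extrapolate_hours` and its parameters are illustrative and not part of any tool used by the team.

```python
def extrapolate_hours(base_images, base_hours, target_images, exponent=2):
    """Scale a measured running time to a larger collection.

    Assumes the cost grows as (number of images) ** exponent;
    exponent=2 is the quadratic behaviour described in the quote.
    """
    scale = target_images / base_images
    return base_hours * scale ** exponent


# One million images take about 7 hours; extrapolate to one billion.
hours = extrapolate_hours(1_000_000, 7, 1_000_000_000)
years = hours / (24 * 365)
print(f"{hours:,.0f} hours, roughly {years:,.0f} years")
# prints "7,000,000 hours, roughly 799 years"
```

A collection 1,000 times larger costs 1,000 × 1,000 = one million times more under this assumption, which is how 7 hours become 7 million hours, or about eight centuries of computation.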
Removing this bottleneck is precisely the point of a new five-year research effort for which the scientist has just been awarded €1.5M by the European Research Council. The scientific team soon to be assembled will tackle three specific issues. First of all, “radically new, different image representations are needed to address the visual recognition tasks that we aim at solving,” Jégou explains.
Finding Subsets of Vectors
The second issue is finding subsets of vectors that are likely to represent an identical object in different images. Current algorithmic solutions are not up to the task, as they remain insufficiently robust or poorly scalable.
A third issue calls for new coding methods to represent and compare sets of vectors in very large collections, a context where memory and efficiency are key criteria. “True, there are some algorithms such as AltaVista's MinHash that prove suitable for comparing sets of entities lying in a discrete space, words for instance. But they do not allow the similarity search inherent to real-valued sets of vectors,” as the latter lie in a continuous space. Consequently, a significant part of the information is lost due to quantization.
If things play out according to expectations, five years from now, scientists will be able to illustrate the merits of these new approaches through a couple of demonstrators featuring clickable visual links in images. “I am convinced that this research will also pave the way to new applications and better representations for the traditional query-by-sample setup,” thus impacting the whole processing chain of visual search.