“DNA sequencing machines have made great progress of late. But a problem remains: they can't write a whole sequence in one piece. The human genome, for instance, is a long sentence consisting of a whopping three billion ATGC characters. Sequencers, for their part, only churn out fragments of 150 to 200 characters at most. Piecing together the jigsaw puzzle therefore requires efficient DNA assembly software. The purpose of the GenScale research team is precisely to build such innovative tools,” sums up Inria scientist Pierre Peterlongo.
The task becomes all the more challenging when one moves to the realm of metagenomics, that is, when the sequenced sample is made up not of one individual but of many, and when, furthermore, those individuals belong to a variety of species. Such is the case of the 35,000 seawater samples collected from 210 stations in every major oceanic region by the French schooner Tara between 2009 and 2013. The expedition was part of a research undertaking meant to acquire a better knowledge of the planktonic biomass and fathom the impact that climate change could exert on the marine ecosystem.
Not an easy enterprise. “One liter of seawater can contain up to 1,000 sorts of animals, 100,000 sorts of protists, 10 million sorts of bacteria and 100 million sorts of viruses.” Not to mention the number of individuals in each class. “So you end up comparing bits of shrimps and pieces of medusas, hence running the risk of creating chimeras, for metagenomic assembly is still far from being a solved problem.”
Started in the wake of the Tara expedition and recently completed, the HydroGen project aimed to introduce novel computational approaches in the field. Funded by the French research agency, it gathered Inria's GenScale team, the French Institute for Agricultural Research (INRA) and the French Sequencing Center (CEA, Genoscope). “When we started working with the Genoscope, we found ourselves faced with data which we could hardly map to any previously referenced genome. In other words, 90% of the living beings in those samples were completely unknown. So our colleagues asked us: do you have any idea how we could make sense of the data?”
Computing Genomic Distance
“That's how we eventually came up with the idea of comparing various seawater samples with the aim of computing how different they were. Say sample A and sample B feature somewhat similar DNA, whereas sample C is markedly dissimilar. So we created a new metric for measuring such genomic distance which, by the way, doesn't claim to reflect any evolutionary distance whatsoever. Keep in mind that we don't even know what the individuals are at this juncture.”
The scientists first created a tool called Compareads. “Then it was completely redesigned from an algorithmic point of view and the second version came to be known as Simka.” In essence, it produces a matrix of distances between all samples. “And to do that efficiently, we reduce the DNA sequences to a set of words of about 30 characters. These words are called k-mers, k being the word's length. When some k-mers are found simultaneously in two distinct sequences, it reflects a similarity between those sequences.”
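The idea of reducing samples to k-mers and comparing them pairwise can be illustrated with a few lines of Python. This is only a toy sketch, not Simka's actual algorithm: the article does not say which formula the tool uses, so the Jaccard distance on k-mer sets is an assumption chosen for simplicity, and the function names, the three tiny samples and the value k=4 are all made up for the example. It also ignores how often each k-mer occurs and the fact that DNA can be read on either strand.

```python
from itertools import combinations


def kmers(sequence, k):
    """Return the set of all words of length k (k-mers) found in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def sample_kmers(reads, k):
    """Pool the k-mers of every read belonging to one sample."""
    pooled = set()
    for read in reads:
        pooled |= kmers(read, k)
    return pooled


def jaccard_distance(a, b):
    """1 minus (shared k-mers / all k-mers): 0.0 = identical sets, 1.0 = nothing shared.

    Assumption: Jaccard is used here for illustration only.
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(samples, k):
    """Build a symmetric matrix (dict of dicts) of distances between all samples."""
    names = sorted(samples)
    sets = {name: sample_kmers(samples[name], k) for name in names}
    matrix = {name: {name: 0.0} for name in names}
    for x, y in combinations(names, 2):
        d = jaccard_distance(sets[x], sets[y])
        matrix[x][y] = d
        matrix[y][x] = d
    return matrix


# Hypothetical toy samples: A and B share most of their DNA, C shares none.
samples = {
    "A": ["ATGCGTAC", "GTACCTGA"],
    "B": ["ATGCGTAC", "GTACCTGG"],
    "C": ["TTTTAAAA", "CCCCGGGG"],
}
matrix = distance_matrix(samples, k=4)
for name in sorted(matrix):
    print(name, [round(matrix[name][other], 2) for other in sorted(matrix)])
```

With these toy inputs, A and B come out close to each other while C is at the maximum distance from both, which mirrors the A, B, C situation described above. Real samples contain billions of reads, so a practical tool needs far more economical data structures than in-memory Python sets.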
Having said that, “this metric doesn't mean anything in and of itself. It must be coupled with physicochemical data such as water temperature, acidity, pressure, etc. We thought it would also be interesting to correlate our genomic distance matrix with the time it takes for the biomass to travel from one collection station to another. So we recreated genocenoses, which are genomic bioregions if you will. And interestingly enough, we were able to associate these genocenoses with specific environments such as the upwelling zones. Those are the spots on the planet where deeper water rises to the surface.” What biologists and oceanographers plan to do next is to cross-reference the genomic distribution of a region with the climate change anticipated in that particular place. “This will help predict how specific species will decline, migrate or mutate.”
Discovering SNPs
Mutation is the other focus of interest of the HydroGen project. “A mutation can show up as a small variation in a single base pair. Such variants are called single-nucleotide polymorphisms (SNPs). In the medical context, they are of great interest as some can be associated with genetic diseases. In order to find variants, sequences are compared to a previously mapped genome. But in metagenomics, we don't have this reference structure. Nonetheless, we would very much like to extract these variants. It could help, for instance, establish that some plankton featuring a given SNP is fit to survive in warmer waters.”
Hence the need for another tool capable of finding variants in metagenomic material for which no reference genome is available. “That's the purpose of DiscoSnp, the second piece of software involved in this research. It fits pretty well with the Tara project as it can simultaneously process the data from hundreds of seawater samples. It takes tens of billions of reads as input and extracts all the variants they contain. To my knowledge, there is no other such tool in the world at the present time.”
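To give a feel for how variants might be spotted without any reference genome, here is a deliberately naive Python sketch. It is not DiscoSnp's method, which the article does not describe: it simply assumes that if two stretches of DNA share the same left and right neighbourhood but differ in the single letter in between, that letter is a candidate SNP. The function name, the flank size derived from k and the two toy reads are invented for the illustration.

```python
from collections import defaultdict


def candidate_snps(reads, k):
    """Report contexts where one middle base varies between identical flanks.

    Each window of length 2k-1 is split into a left flank (k-1 letters),
    one middle letter, and a right flank (k-1 letters). Windows are grouped
    by their two flanks; a group with several middle letters is a candidate.
    Assumption: this toy ignores sequencing errors and reverse strands.
    """
    window = 2 * k - 1
    middles = defaultdict(set)
    for read in reads:
        for i in range(len(read) - window + 1):
            w = read[i:i + window]
            left, mid, right = w[:k - 1], w[k - 1], w[k:]
            middles[(left, right)].add(mid)
    return {
        context: sorted(bases)
        for context, bases in middles.items()
        if len(bases) > 1
    }


# Two hypothetical reads that are identical except for one letter (G versus T).
reads = ["ACGTACGGATTC", "ACGTACTGATTC"]
snps = candidate_snps(reads, k=4)
for (left, right), bases in snps.items():
    print(f"{left}[{'/'.join(bases)}]{right}")
```

On real data a sequencing error would look exactly like a variant to this sketch, so any serious approach would at least need to check that each version of the letter is seen in enough reads before reporting it.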