Wednesday, October 10, 2012

What could our genomes actually tell us about disease risk?
Despite recent advances in whole-genome sequencing, two recent studies suggest that we are far from uncovering the genetic basis of common disease risk. In fact, information relevant to complex diseases might hide within rare or even private genome variations, often too scarce to be studied statistically. We might thus have to radically change the way we think about gene-disease associations to move forward and make the DNA talk.

Whereas a few, usually rare and severe, “genetic disorders” can be traced to variations at one or two locations, or “loci”, in the DNA sequence, most common diseases result from complex interactions between protein-coding genes, non-coding DNA and environmental effects. These aptly named “complex diseases” include cardiovascular, metabolic, neurological and psychiatric conditions of great concern to health policy, such as early-onset stroke, myocardial infarction, diabetes, dyslipidemia, Alzheimer's disease, bipolar disorder and schizophrenia.

Some of these complex diseases have a high heritability, which means that a large part of the individual differences in the probability of developing the disease can be explained by differences in genomes. For example, the heritability of early-onset myocardial infarction is about 60% [1]: genomes are more important than environment in explaining the differences in early-onset infarction between individuals. A lot of work has therefore gone into identifying the changes in DNA sequence involved in complex disease heritability. In particular, the development of new sequencing technologies has made it possible to compare hundreds of individual sequences and map them to various symptoms, a method known as “genome-wide association studies”. Hundreds of disease-related genetic variations have been identified this way. However, they explain only a very small fraction of the heritability: in the case of early-onset myocardial infarction, only 2.8% of the heritability has so far been linked to particular genes [2].
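To put the two percentages side by side, here is a minimal arithmetic sketch in Python. It assumes, as the text states, that the 2.8% figure is a share of the heritable part rather than of the total variation in risk; the variable names are only illustrative.

```python
# Figures quoted in the text for early-onset myocardial infarction.
heritability = 0.60       # share of the variation in risk attributed to genomes
explained_share = 0.028   # share of that heritability linked to known variants

# Share of the total variation in risk accounted for by known variants.
explained_total = heritability * explained_share

# Heritable share that is still unexplained.
missing_total = heritability - explained_total

print(f"explained by known variants: {explained_total:.2%} of total variation")
print(f"heritable but unexplained:   {missing_total:.2%} of total variation")
```

Under that reading, known variants would account for less than 2% of the total variation in risk, while more than 58% would be heritable but still unexplained.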

To explain the low power of association studies to identify genetic variants contributing to complex diseases, it was hypothesized that most variation in disease predisposition was due to “high risk” variants, which have a strong negative impact on health but remain rare in a population because they are selected against [3]. Consequently, we would only need to increase sample size, and therefore our power to detect rare variants, to better explain the genetic basis of common diseases. With that aim, two studies published in the July issue of Science used large datasets (2,440 and 14,002 genomes, respectively) to investigate the potential role of rare variants, defined as variants whose less frequent version at a locus is present in less than 0.5% of the individuals sampled. The large sample sizes allowed the detection of many previously unknown variants, thus highlighting the limits of previous smaller-scale studies: 90% of rare variants, but only 5% of common variants, found in 202 drug-target genes were novel, and estimates of discovery rates showed that many new variants remain to be discovered (Fig. 1).

Nelson et al., An Abundance of Rare Functional Variants in 202 Drug Target Genes Sequenced in 14,002 People, Science 337, 2012. 

Fig. 1 Number of variants discovered per kilobase of sequence with sample sizes increasing to 5000 people for multiple populations.
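Why larger samples keep turning up new rare variants can be illustrated with a simple probability sketch. It is not the discovery-rate estimate used in the studies: it assumes that each person carries two copies of the locus and that all copies are drawn independently, and the function name `prob_observed` as well as the sample sizes of 100 and 1,000 are made up for illustration.

```python
def prob_observed(maf: float, n_individuals: int) -> float:
    """Probability that an allele of frequency `maf` is seen at least once
    among `n_individuals` people, i.e. 2 * n independently drawn chromosomes."""
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must be between 0 and 1")
    if n_individuals < 0:
        raise ValueError("n_individuals must be non-negative")
    n_chromosomes = 2 * n_individuals
    # Complement of the probability that every chromosome lacks the allele.
    return 1.0 - (1.0 - maf) ** n_chromosomes


# 14,002 is the larger sample size quoted in the text; 100 and 1,000 are
# arbitrary smaller samples chosen for comparison.
for n in (100, 1000, 14002):
    for maf in (0.005, 0.0001):
        print(f"n = {n:>6}, MAF = {maf:.4%}: P(observed) = {prob_observed(maf, n):.3f}")
```

Under these assumptions, a variant sitting right at the 0.5% threshold is more likely than not to show up in a sample of 100 people, whereas a variant fifty times rarer is likely to be missed until the sample reaches several thousand individuals.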

The studies also confirmed that variants with a potential impact on health remained rare: the proportion of non-synonymous variants, which alter the protein synthesized, was higher among rare than among common variants (Fig. 2).

Nelson et al., An Abundance of Rare Functional Variants in 202 Drug Target Genes Sequenced in 14,002 People, Science 337, 2012. 
Fig. 2 Expected ratios of non-synonymous to synonymous variants in the absence of selection and observed ratios for rare to common alleles, from left to right. MAF (Minor Allele Frequency) is the frequency of the rarest version of a variant.
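The two quantities behind Fig. 2, the minor allele frequency and the ratio of non-synonymous to synonymous variants, can be sketched in a few lines of Python. The 0.5% cut-off comes from the text; the function names, the two-class split into "rare" and "other", and the toy variant list are assumptions made up for illustration and do not come from the studies.

```python
from collections import Counter

RARE_THRESHOLD = 0.005  # MAF below 0.5%, the cut-off quoted in the text


def minor_allele_frequency(alt_count: int, total_alleles: int) -> float:
    """Frequency of the rarer of the two versions of a variant."""
    if total_alleles <= 0 or not 0 <= alt_count <= total_alleles:
        raise ValueError("need 0 <= alt_count <= total_alleles and total_alleles > 0")
    freq = alt_count / total_alleles
    return min(freq, 1.0 - freq)


def ns_to_s_ratio_by_class(variants):
    """Ratio of non-synonymous to synonymous variants, for rare and other variants.

    `variants` is an iterable of (alt_count, total_alleles, effect) tuples,
    where effect is "nonsynonymous" or "synonymous".
    """
    counts = Counter()
    for alt_count, total_alleles, effect in variants:
        maf = minor_allele_frequency(alt_count, total_alleles)
        freq_class = "rare" if maf < RARE_THRESHOLD else "other"
        counts[(freq_class, effect)] += 1
    ratios = {}
    for freq_class in ("rare", "other"):
        syn = counts[(freq_class, "synonymous")]
        nonsyn = counts[(freq_class, "nonsynonymous")]
        ratios[freq_class] = nonsyn / syn if syn else float("nan")
    return ratios


# Made-up toy data: (copies of the alternative allele, chromosomes sampled, effect).
toy_variants = [
    (1, 2000, "nonsynonymous"),
    (2, 2000, "nonsynonymous"),
    (3, 2000, "nonsynonymous"),
    (1, 2000, "synonymous"),
    (400, 2000, "nonsynonymous"),
    (900, 2000, "synonymous"),
    (1500, 2000, "synonymous"),
]

ratios = ns_to_s_ratio_by_class(toy_variants)
print(ratios)
```

In this toy set the ratio is higher among rare variants than among the others, which is the direction of the pattern described above; the numbers themselves carry no biological meaning.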

However, rare variants were found to be more numerous than previously thought: around 90% of variants were rare. Interestingly, individuals of African ancestry exhibited fewer rare variants, but more variants of intermediate frequency, than those of European ancestry. Moreover, most rare variants were population-specific (Fig. 3 and 4) and about 60% of all variants were present in only one individual.

Casals and Bertranpetit, Human Genetic Variation, Shared and Private, Science 337, 2012. Data from Tennessen et al., Science 2012.
Fig. 3 Proportion of shared and unshared (private) variants between the African-American and the European-American populations.

Nelson et al., An Abundance of Rare Functional Variants in 202 Drug Target Genes Sequenced in 14,002 People, Science 2012. 
Fig. 4 Allele sharing and variant abundance. (A-C) The average allele sharing between pairs of populations for rare (A), intermediate (B) and common (C) variants computed as the frequency in the pooled population pair. (D) The number of variants per kilobase found in population samples of 2,500 individuals.

Such figures contradict the estimates derived from current models of population growth. Human demography is currently described by the “Out-of-Africa” model, which posits that European and Asian populations emerged from a small population in Africa about 60,000 years ago [4]. As the individuals that migrated represented only a small fraction of the ancestral African population, some genetic variants, especially the rarest ones, were lost. African and European populations were then assumed to have grown steadily, all the while acquiring new population-specific variants by mutation; these would increase in frequency only if they were not deleterious, and would otherwise eventually disappear. Such a “bottleneck effect” can be observed in the Finns, who have fewer variants, but more population-specific variants, than other Europeans (Fig. 4). But the overall excess of rare variants in Europeans does not fit the model: most of these variants should have either disappeared or increased in frequency over such a time scale. This pattern can, however, be explained by accelerated population growth in the last few thousand years, during which many new mutations could occur in a short time (Fig. 5). Rare variants therefore provide precious insight into recent demography.

Tennessen et al., Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes, Science 2012.
Fig. 5 Schematic representation (not to scale) of the inferred demographic model. kya, thousand years ago.

These findings are not very good news for complex disease research. Admittedly, the rare variants discovered in protein-coding genes are numerous and often deleterious, and could therefore play an important role in disease risk. However, their rarity makes it difficult to actually test that role: across thousands of genomes, fewer than 5% of protein-coding genes afforded sufficient power to detect the effect of rare variants on disease risk, even when that effect is relatively strong. Consistent with this, no significant association was found for the 202 drug-target genes. Moreover, most variants are population- or even individual-specific. Association studies should thus at least be replicated across populations, with careful determination of ancestry, to be generalizable and to avoid false associations between population-specific traits and variants.

The high number of individual-specific variants and the low power of association for other rare variants highlight the importance of genome-wide functional studies to accurately estimate disease risk where association studies fail. Functional predictions might be the next prevailing tool in the study of gene-disease associations. Ideally, such studies would directly estimate the functional impact of a given variant, but the methods currently implemented are rather inconsistent and have a high false-positive rate, that is, they often detect a functional impact where there is none. Such caveats still make them unsuitable for applied use in medical diagnosis. A better knowledge of molecular biology and its link to physiology still seems necessary to assess the actual impact of rare variants on complex diseases.

Wednesday, October 3, 2012

Evolutionary consequences of sex: It's not about what you're doing, but who you're doing it with...
Bacteria are one of the most ubiquitous groups of living organisms and exhibit finely tuned adaptations to a wide range of habitats, even the most inhospitable ones. Their ability to evolve rapidly is at the root of many public health issues, such as the development of antibiotic resistance or the rapid evolution of seasonal diseases, but can also be of great help to humans by creating new metabolic pathways that transform human-made pollutants and harmful substances. In the early 20th century, new bacterial genomes were still thought to result from mutations only, and then to be transmitted vertically within a clonal strain. In the 1940s, the discovery of bacterial DNA recombination through transformation (Avery, MacLeod and McCarty experiment in 1944) or conjugation (Lederberg and Tatum experiment in 1946) shed light on the processes responsible for the rapid ecological differentiation of bacterial strains: through recombination, an individual can acquire new genes or alleles that allow it to withstand new ecological conditions.

In eukaryotes, genetic exchange and recombination through sexual reproduction are considered the basis of gene-specific transmission and selection within a population. However, the importance of genetic exchange between bacteria in uncoupling selection processes between different genes remains a controversial issue. In fact, contradictory observations have given rise to two models of selection:
  1. On the one hand, the ecological clustering of bacterial biodiversity into genetically consistent ecotypes supports the traditional view that adaptive mutations are selected through whole-genome clonal selection. Moreover, the low measured levels of recombination are insufficient to unlink a gene from the rest of the genome.
  2. On the other hand, the existence of environment-specific genes and alleles suggests that recombination can unlink parts of the genome. Moreover, some loci exhibit low nucleotide diversity compared to the rest of the genome, which suggests purifying selection on these regions. Adaptive mutations thus seem to be selected quite independently of the rest of the genome.
To disentangle these apparently incompatible observations and assess the degree of gene uncoupling in bacteria, researchers at MIT examined, in a recently published study[i], the genomes of 20 strains representing two ecotypes of the marine species Vibrio cyclitrophicus. As the genomes of these ecotypes are extremely similar, they can be considered the result of recent ecological differentiation, thus giving us a snapshot of this evolutionary process. Based on the comparison of these sequences, the authors claim that gene-specific sweeps do occur and can lead to environment-specific genes on a short time scale, but also to ecological clustering, through preferential within-habitat recombination, on a longer time scale.

In fact, different parts of the genome have different evolutionary histories. In particular, ecotype-specific SNPs are found at only a few locations in the genome, whereas the rest of the polymorphic genome supports genetic intermingling between the ecotypes. Moreover, the two chromosomes of V. cyclitrophicus support different phylogenies: chromosome 1 groups the ecotypes, whereas chromosome 2 splits one ecotype into two groups. The phylogeny within one of these two groups is strongly supported by chromosome 2 but not by chromosome 1. Thus, habitat-specific genes evolve quite independently and do not drive genome-wide selective sweeps, an observation consistent with the environment-specific genes and alleles that have already been documented[ii].

These results highlight the need for high-quality sequencing data and fine-grained analysis to understand the evolutionary histories of different parts of the genome. In fact, the authors show that a few loci with a consistent phylogeny, such as the ecotype-specific SNPs here, are sufficient to drive the whole-genome phylogeny if the signal of clonal ancestry in the rest of the genome has been blurred by homologous recombination (Fig. 1). Therefore, the ecotype theory might be based on phylogenies biased toward the history of a few loci under purifying habitat-driven selection rather than on the neutral loci with inconsistent histories that account for most of the genome.

Fig. 1: A. Maximum-likelihood phylogeny for the core genome (genes present in all strains) of chromosome I in V. cyclitrophicus. Scale is in substitutions per site. All nodes have 100% bootstrap support unless indicated. B. Genome regions with uninterrupted support for (black points) or against (grey points) the ecological split. ML trees for three major regions are shown. Adapted from Shapiro et al. 2012.

The most important point made by the authors, however, remains their evidence for preferential within-ecotype recombination. When recombination events affecting recently diverged pairs of strains were examined, recombination rates were found to be higher within than between habitats. The authors thus take an essential step toward a unified theory of bacterial genome evolution. Such preferential recombination indeed explains how ecotypes, usually considered evidence of genome-wide sweeps, can develop from gene-specific sweeps. Even if the mechanisms of gene transmission are quite different in Eubacteria and eukaryotes, they seem to converge in allowing gene-specific selective sweeps and in restricting genetic exchange between habitats. This study shows that the selective pressures acting on evolutionary mechanisms in eukaryotes and Eubacteria are more universal than previously thought (Fig. 2).

Fig. 2: Model of ecological differentiation between bacterial ecotypes (from Shapiro et al. 2012). Thin grey (resp. black) arrows represent recombination within (resp. between) ecologically associated populations. Thick coloured arrows represent acquisition of adaptive alleles for red or green habitat.
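The comparison of recombination within and between habitats can be made concrete with a small counting sketch. This is not the method used by Shapiro et al.: the function `recombination_rates`, the strain names, the "red" and "green" habitat labels borrowed from the figure, and the list of events are all hypothetical, and the only point is to show why event counts need to be normalized by the number of strain pairs of each kind.

```python
from itertools import combinations


def recombination_rates(strain_habitat, events):
    """Recombination events per strain pair, within and between habitats.

    `strain_habitat` maps each strain name to its habitat label.
    `events` is an iterable of (strain_a, strain_b) pairs, one per inferred event.
    """
    def kind(a, b):
        return "within" if strain_habitat[a] == strain_habitat[b] else "between"

    # Number of strain pairs of each kind, used as the denominator.
    pair_counts = {"within": 0, "between": 0}
    for a, b in combinations(strain_habitat, 2):
        pair_counts[kind(a, b)] += 1

    # Number of events of each kind, used as the numerator.
    event_counts = {"within": 0, "between": 0}
    for a, b in events:
        event_counts[kind(a, b)] += 1

    return {
        k: (event_counts[k] / pair_counts[k] if pair_counts[k] else float("nan"))
        for k in pair_counts
    }


# Hypothetical strains and events, for illustration only.
strain_habitat = {
    "r1": "red", "r2": "red", "r3": "red",
    "g1": "green", "g2": "green", "g3": "green",
}
events = [
    ("r1", "r2"), ("r1", "r3"), ("r2", "r3"),
    ("g1", "g2"), ("g1", "g3"), ("g2", "g3"),
    ("r1", "g1"), ("r2", "g2"), ("r3", "g3"),
]

rates = recombination_rates(strain_habitat, events)
print(rates)
```

With three strains per habitat there are more between-habitat pairs (nine) than within-habitat pairs (six), so raw event counts alone would understate any preference for within-habitat exchange; dividing by the number of pairs corrects for that.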

Finally, this new insight into bacterial evolution calls for a new way of structuring bacterial diversity. Whether and how a species level could be defined in Eubacteria has been a long-standing controversy. The current species criterion dates back to 1987: it defines a species as a group of clonal strains characterized by at least one phenotypic trait and 70% DNA–DNA hybridization[iii]. However, this definition has often been criticized for grouping extremely diverse phenotypes within one species[iv]. Moreover, the very idea of bacterial species has sometimes been rejected on the basis of common genetic exchanges between morphologically and ecologically distinct bacteria[v]. That last assertion is contradicted here by the demonstration of barriers to gene flow between habitats. If recombination, that is “bacterial sex”, is more frequent within than between habitats, ecotypes are quite close to the definition of a species in eukaryotes, an “ecotype hypothesis” already put forward about ten years ago[vi]. This study therefore provides a first step toward the “solid understanding of the genetic basis of the ecological distinctiveness of the ecotype” advocated by Konstantinidis et al. in 2006[iv], although it does not exclude the possibility that some ecotypes are defined by gene expression rather than gene content.

New paths for research are opened here, in particular to identify the barriers to gene flow that could explain habitat-specific recombination. The bacteria studied here are not known well enough to classify the observed ecological differentiation as sympatric or allopatric: if their habitats within sea water are distinct enough, a physical barrier could be considered. If the differentiation can be regarded as sympatric, the mechanisms that prevent recombination between ecotypes remain to be investigated.

