genomes ecology evolution etc: June 2012

The Article

The authors of this article wanted to find out what the mutational background of adaptation looks like. Specifically, they asked whether identical populations adapting to a fixed environment would do so via identical mutations or via various alternative pathways. To answer this question they experimentally evolved 115 populations of Escherichia coli at 42.2 °C for 2000 generations (6.64 generations of binary fission daily) and sequenced one clone from each population, which they call a “strain” or “line” throughout the paper. All populations originated from the same E. coli B REL1206 ancestral clone. Their system fulfills all of the requirements needed to answer the question: (i) a large number of replicates for statistical power, (ii) complete genome sequencing, so that mutations can be identified unambiguously, and (iii) a complex biological system, to ensure that the number of potential adaptive solutions is not trivial. As the experimental environmental change they chose temperature, a rather complex environmental variable since it affects many biological processes such as respiration, growth and reproduction.
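As a quick sanity check on the numbers above: 6.64 generations per day is what one would get from a daily 100-fold dilution followed by regrowth. The dilution factor is not stated in this summary, so treat it as an assumption. Under that assumption, the duration of the experiment works out roughly as follows.

```latex
\[
  \log_2 100 \approx 6.64 \ \text{generations per day},
  \qquad
  \frac{2000 \ \text{generations}}{6.64 \ \text{generations per day}} \approx 301 \ \text{days}
\]
```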

Performance of the different strains was measured as fitness and yield. Fitness was defined as the density of each evolved line after 1 day of competition against a newly derived Ara+ mutant of REL1206. Each competition trial was replicated 6 times. Yield was measured both as the number of colony-forming units and as the number of cells per volume, each also replicated 6 times.

All 115 strains were fully sequenced with paired-end reads on an Illumina HiSeq 2000. A computational pipeline was developed to identify all de novo mutations. In total, 1331 mutations were found. Across 114 genomes, the ratio of non-synonymous to synonymous mutations per site was 5.75, as was the ratio of intergenic to synonymous mutations. It was estimated that ~80% of intergenic and non-synonymous mutations were beneficial. Of 119 large (>30 bp) deletions, 82 were identical between at least two lines. While almost no point mutations or indels were shared between lines, several IS insertions, duplications and large deletions were shared among several lines.

On average, two strains shared only 2.6% of mutations (excluding synonymous mutations) but shared 20% of modified genes and 24.5% of affected operons. Focusing on genes with > 5 mutations, genes were clustered based on the literature into 10 functional units containing 37.5% of the mutations. At this level, two lines shared an average of 31.5% of affected units. All synonymous mutations were singletons.
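To make the effect of resolution concrete, here is a minimal Python sketch of how an average pairwise sharing value could be computed at two levels. The overlap measure (shared events divided by the mean number of events in the two lines) and the toy data are assumptions for illustration; the article may define sharing differently.

```python
from itertools import combinations

def pairwise_sharing(a, b):
    """Shared events relative to the mean number of events in the two lines (assumed measure)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def mean_pairwise_sharing(lines):
    """Average the sharing value over all pairs of lines."""
    pairs = list(combinations(lines, 2))
    return sum(pairwise_sharing(a, b) for a, b in pairs) / len(pairs)

# Hypothetical toy data: each line is a list of (gene, position) point mutations.
toy_lines = [
    [("rpoB", 100), ("rho", 20)],
    [("rpoB", 250), ("cls", 7)],
    [("rpoB", 100), ("iclR", 3)],
]

# Mutation level: two events are shared only if gene and position are identical.
mutation_level = mean_pairwise_sharing([set(line) for line in toy_lines])
# Gene level: two events are shared if they hit the same gene.
gene_level = mean_pairwise_sharing([{gene for gene, _ in line} for line in toy_lines])

print(f"mutation level: {mutation_level:.3f}, gene level: {gene_level:.3f}")
```

In this toy example all three lines hit rpoB, but only two of them at the same position, so sharing is much higher at the gene level than at the mutation level.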

At the level of individual mutations, two mutations had to be identical to be counted as convergent. At higher levels (gene, operon, functional unit), two mutations were assumed to be shared if at least 10% of the gene affected by the mutations was shared. The difference in convergence between point mutations (2.6%) and functional units (31%) suggests that the diversity of possible adaptive mutations was not fully explored. To estimate the number of sites that contribute to an adaptive response given their data, the authors developed what they call a “simple model”. The model is similar to the coupon collector’s problem [1]. This problem arises when you are collecting a predefined set of items, for example Panini stickers of all players of the European Championship in soccer. You buy the stickers in small packets. In the beginning every packet contains many new players and your collection grows quickly. But as the number of collected players grows, you need more and more packets to find the missing ones. The same goes for finding all possible beneficial mutations: you need to observe a great many mutated sites before you have tracked down every beneficial mutation that can contribute to an adaptive response.

Their model assumes L beneficial mutations (sites) and, in addition to the Panini sticker model, a variance parameter V. This parameter is needed because the sampling probability differs among sites; V captures the compound effect of differences in mutation rates and selection coefficients among sites. For each combination of L and V, and for each of 100 replicates, the sampling probability of each of the L sites was drawn from a shifted gamma distribution and stored in a table. Given this table, 20, 40, 60, 80, 100 and 114 strains were sampled (without replacement), and for each strain its exact number of mutations of interest was sampled (without replacement). For each sample size, the number of different sites was determined. The resulting curve was then averaged across the 100 replicates, and the squared difference between the averaged curve and the one based on the real data was used to identify the plausible parameter space of L and V. With this model it was estimated that, with no variance in sampling probability among sites, 850 possible sites of beneficial mutations are required to yield the 400 observed point mutations (including > 3 convergent mutations). When the sampling probability of beneficial mutations varied across sites, the estimate rose to ~4500 sites.
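The sampling scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the exact form of the shifted gamma distribution is not given here, so the sketch uses a weight of 1 plus a gamma deviate with variance V, and the per-strain mutation counts are hypothetical.

```python
import random

def site_weights(n_sites, variance, rng):
    """Relative sampling weight per site: 1 plus a gamma deviate (assumed 'shifted gamma')."""
    if variance == 0:
        return [1.0] * n_sites
    # A gamma with shape k and scale 1 has variance k, so shape = variance.
    return [1.0 + rng.gammavariate(variance, 1.0) for _ in range(n_sites)]

def sample_sites(weights, k, rng):
    """Draw k distinct sites with probability proportional to weight (without replacement)."""
    # Exponential clocks: the k smallest waiting times give a weighted sample.
    keys = [(rng.expovariate(w), i) for i, w in enumerate(weights)]
    return [i for _, i in sorted(keys)[:k]]

def distinct_sites_curve(n_sites, variance, muts_per_strain, sample_sizes, n_reps=100, seed=0):
    """Average number of distinct mutated sites seen for each number of sampled strains."""
    rng = random.Random(seed)
    totals = {n: 0 for n in sample_sizes}
    for _ in range(n_reps):
        weights = site_weights(n_sites, variance, rng)
        for n in sample_sizes:
            seen = set()
            # Strains are sampled without replacement; each keeps its own mutation count.
            for k in rng.sample(muts_per_strain, n):
                seen.update(sample_sites(weights, k, rng))
            totals[n] += len(seen)
    return {n: totals[n] / n_reps for n in sample_sizes}

def squared_distance(curve, observed):
    """Sum of squared differences between a simulated and an observed curve."""
    return sum((curve[n] - observed[n]) ** 2 for n in observed)

# Hypothetical input: 114 strains with 4 mutations of interest each.
toy_counts = [4] * 114
sizes = [20, 40, 60, 80, 100, 114]
curve = distinct_sites_curve(850, 0.0, toy_counts, sizes, n_reps=3)
curve_var = distinct_sites_curve(850, 2.0, toy_counts, sizes, n_reps=3)
print(curve)
print(squared_distance(curve, curve_var))
```

Fitting would then amount to scanning a grid of L and V values and keeping the combinations whose simulated curve has the smallest squared distance to the curve computed from the real data.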

A further task that the authors tried to tackle was to decipher the role of epistasis, that is, how beneficial mutations interact with each other. Epistasis was examined statistically using a resampling procedure. Assuming no epistasis, the presence of a mutation in a gene should not affect the probability of observing another mutation. The randomization procedure was applied to test this null hypothesis. From a database of all mutations observed in the dataset, 114 genomes were sampled without replacement. This randomized dataset conserved the total number of observed mutations in each genome as well as the relative frequency of every mutational type. For 1000 random samples, associations were recorded and compared to the values observed in the real data using a Z-score. Associations among mutations were also measured using D’ [2,3] and the correlation coefficient (r). Only operational units containing > 25 convergent mutations were included in this analysis. The data supported strong signals of association. Several cases of negative epistasis were found. For example, ICLR, CLS, rho, rpoD, KPS, YBAL, and GLP never had more than one mutation in a strain. The exception was RNApol, which accumulated more than one mutation within every single line. Another example of negative epistasis was rpoBC and rho, which were in complete repulsion from each other. There were also several cases of positive epistasis. Overall, two competing evolutionary trajectories leading to adaptation to warmer temperature could be found. In the first, mutations in rpoBC were in positive epistasis with changes in rpoD, ILV, KPS, RSS, and ROD. In the second, a mutation in rho precluded the acquisition of mutations in rpoBC and enhanced selection for mutations in cls and iclR. These two trajectories will require further physiological follow-up studies.
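The randomization test and the association measures can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it preserves only the number of mutations per genome (not the frequencies of mutational types), treats each genome as a list of affected units, and uses hypothetical toy data.

```python
import random
from statistics import mean, pstdev

def cooccurrence(genomes, unit_a, unit_b):
    """Number of genomes carrying at least one mutation in both units."""
    return sum(1 for g in genomes if unit_a in g and unit_b in g)

def randomize(genomes, rng):
    """Shuffle all mutations across genomes, keeping each genome's mutation count."""
    pool = [m for g in genomes for m in g]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for g in genomes:
        shuffled.append(pool[start:start + len(g)])
        start += len(g)
    return shuffled

def association_z(genomes, unit_a, unit_b, n_perm=1000, seed=0):
    """Z-score of the observed co-occurrence against the randomized datasets."""
    rng = random.Random(seed)
    observed = cooccurrence(genomes, unit_a, unit_b)
    null = [cooccurrence(randomize(genomes, rng), unit_a, unit_b) for _ in range(n_perm)]
    sd = pstdev(null)
    return (observed - mean(null)) / sd if sd > 0 else 0.0

def d_prime_and_r(genomes, unit_a, unit_b):
    """Lewontin's D' and the correlation coefficient r for presence/absence of two units."""
    n = len(genomes)
    p_a = sum(unit_a in g for g in genomes) / n
    p_b = sum(unit_b in g for g in genomes) / n
    d = cooccurrence(genomes, unit_a, unit_b) / n - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    denom = (p_a * (1 - p_a) * p_b * (1 - p_b)) ** 0.5
    return (d / d_max if d_max > 0 else 0.0), (d / denom if denom > 0 else 0.0)

# Hypothetical toy data: rpoBC and rho never occur in the same genome.
toy_genomes = [["rpoBC", "rpoD"]] * 10 + [["rho", "cls"]] * 10
z = association_z(toy_genomes, "rpoBC", "rho", n_perm=200)
d_prime, r = d_prime_and_r(toy_genomes, "rpoBC", "rho")
print(z, d_prime, r)
```

In the toy data the two units are in complete repulsion, so D’ and r both come out at their negative extreme and the Z-score is negative; a positive association would push all three in the opposite direction.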

Discussion during meeting

The first, striking impression that most students had of the paper was how strongly the level of resolution affects the apparent strength of convergence. While almost no beneficial mutations were shared at the level of point mutations, more changes were shared the higher one goes in the level of organization. At the functional unit level, two lines shared about 31% of affected units. Since we are used to thinking of convergence at a higher level, i.e. phenotypes, it was very impressive to think of convergence at a much higher resolution, the molecular level of mutations.

On page 457 the authors state that they estimated ~80% of intergenic and non-synonymous mutations to be beneficial. Some of us were puzzled by this exact number because the main article does not state how the percentage was calculated. The proportion of beneficial mutations was calculated using the ratio of non-synonymous to synonymous mutations (Ka/Ks), as well as the ratio of intergenic to synonymous mutations (Ki/Ks). Under strict neutrality this ratio is expected to be 1; how far it exceeds 1 reflects the strength of positive selection. If we assume that this ratio equals x, then the fraction of selected non-synonymous mutations is y=(x-1)/x. We could not resolve whether Ka/Ks was calculated genome-wide or only for windows of specific coding regions. Even more confusing was the ratio Ki/Ks, where we did not know whether Ks was taken from the intergenic region, from windows of specific coding regions, or genome-wide. This would be interesting to know since the authors did not specifically discuss regulatory versus protein-coding mutations. We would assume that intergenic regions in E. coli contain mostly regulatory elements. The authors state that both ratios, Ka/Ks as well as Ki/Ks, were ~5.75, but 90% of all convergent, beneficial mutations occurred in coding genes. An additional section in the paper discussing regulatory and protein-coding sequences separately would have been appreciated.
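Plugging the reported ratio into the formula y=(x-1)/x gives a value close to the stated ~80%. This is our own back-of-the-envelope check, assuming that 5.75 is indeed the value of x the authors used.

```latex
\[
  y = \frac{x - 1}{x} = 1 - \frac{1}{x},
  \qquad
  x = 5.75 \;\Rightarrow\; y = \frac{4.75}{5.75} \approx 0.83
\]
```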

A minor but important discussion point was the definition of “beneficial” in the paper. We concluded that Ka/Ks and Ki/Ks ratios significantly greater than 1 imply that the changes were beneficial during adaptation. The fact that many of these changes were shared among lines (convergence) supports this conclusion. The authors estimated the performance of the different lines in terms of fitness and yield. We would assume that a beneficial mutation increases these performance measures, but surprisingly no individual mutations were significantly associated with better performance. At the functional unit level this outcome looked different.

One of the main discussion points was that one dimension is lacking in the paper: time! Over the course of the experiment (2000 generations of binary fission in E. coli) adaptation was not complete, but the strongest effects of adaptation to the higher temperature had already passed, so the most extreme changes had already happened. Sweeps occur earlier than 2000 generations and are in any case faster in prokaryotes because there is no recombination (under these experimental conditions, only binary fission). Because of this we see only a rather late time point and miss many changes that happened right at the beginning of adaptation. The missing time dimension could give a wrong picture when associations among genes (epistasis) are estimated. While some mutations (mostly synonymous) might not have occurred yet because too little time has passed, other mutations that have already occurred might bias the sign of the interactions we find. Another important part of adaptation that we miss because too much time has elapsed is the effect of chaperones. Most mutations that were found were associated with the polymerase. Earlier in adaptation chaperones are also involved; these are proteins that assist the non-covalent folding of other proteins. Especially when a change of temperature is involved, these proteins are needed to maintain cell-biological functions in the organism. Usually they are involved in the first steps of adaptation, when stress is still very high, and they can contribute to phenotypic plasticity. Mutations and fixations come later.

We closed our discussion with an interesting viewpoint on the general findings of the paper: almost no single beneficial point mutations were shared between two lines, and the same applies to indels. Beneficial IS insertions, large deletions and duplications, on the other hand, were shared much more often among several lines. By their nature the latter are much bigger features. A simple explanation for their higher convergence is that they are limited in number: only a certain number of large deletions, IS insertion types and duplications can occur in a genome the size of that of a prokaryote such as E. coli. Point mutations and indels, on the other hand, are much smaller, and therefore many more of them are possible. Approaching the upper limit of possible point mutations and indels in such a system would take far more mutations, so the authors would have needed to sequence the genomes of many more populations of E. coli. Numerical estimates are given in the article's section on estimating the number of sites that contribute to an adaptive response.

Expression of my own opinion

At first glance the paper strongly reminded me of Richard Lenski's experiment with Escherichia coli at Michigan State University [4,5,6]. Lenski reports evidence of long-term experimental evolution from a study that has been propagating E. coli in the lab since 1988. In February 2010 the Gram-negative bacteria reached 50,000 generations of binary fission. Lenski mostly concentrated on morphological and physiological evolution and classical genetics, whereas the authors here base their experiment on genomics. In my own research I am more familiar with animal genomes, and I had to keep in mind that the E. coli in this study have a genome densely packed with functional genes (~4.6 million base pairs in length, organized in ~2590 operons [7]). The coding density is very high and we assume no recombination, so a sweep would happen very fast.

I found the article very interesting to read. The hypotheses were clearly stated in advance and the experiment was well designed to answer the questions. The only point I would criticize is the way the thresholds in the models were chosen. Functional units were defined using genes with > 5 convergent, beneficial mutations. To estimate the number of sites that contribute to an adaptive response, the model contained only genes with > 3 convergent, beneficial mutations. To estimate the degree of association among genes (epistasis), only operational units with > 25 convergent, beneficial mutational events were included. I did not find any explanation of why exactly these thresholds were used. Perhaps the cutoffs were varied during the analysis and this led to the same results; I do not doubt that. The authors state that their thresholds gave them more statistical power. An additional explanation in the methods of why they chose these thresholds would have made it clearer for me.

Tenaillon, O., Rodriguez-Verdugo, A., Gaut, R., McDonald, P., Bennett, A., Long, A., & Gaut, B. (2012). The Molecular Diversity of Adaptive Convergence. Science, 335 (6067), 457-461. DOI: 10.1126/science.1212986

Other references
1. Read KLQ (1998) A lognormal approximation for the collector's problem. American Statistician 52: 175-180.
2. Lewontin RC (1964) The Interaction of Selection and Linkage. II. Optimum Models. Genetics 50: 757-782.
3. Lewontin RC (1964) The Interaction of Selection and Linkage. I. General Considerations; Heterotic Models. Genetics 49: 49-67.
4. Cooper TF, Lenski RE (2010) Experimental evolution with E. coli in diverse resource environments. I. Fluctuating environments promote divergence of replicate populations. BMC Evolutionary Biology 10.
5. Lenski RE, Mongold JA, Sniegowski PD, Travisano M, Vasi F, et al. (1998) Evolution of competitive fitness in experimental populations of E. coli: What makes one genotype a better competitor than another? Antonie Van Leeuwenhoek International Journal of General and Molecular Microbiology 73: 35-47.
6. Novak M, Pfeiffer T, Lenski RE, Sauer U, Bonhoeffer S (2006) Experimental tests for an evolutionary trade-off between growth rate and yield in E. coli. American Naturalist 168: 242-251.
7. Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, et al. (1997) The complete genome sequence of Escherichia coli K-12. Science 277: 1453.

Posted by MRR for Laetitia G.E. Wilkins

Tuesday, June 5, 2012

The Molecular Diversity of Adaptive Convergence