The Article
The authors of this article wanted to find out what the mutational background of adaptation looks like. Specifically, they asked whether identical populations adapting to a fixed environment would do so via identical mutations or via various alternative pathways. To answer this question they experimentally evolved 115 populations of Escherichia coli at 42.2 °C for 2000 generations (6.64 generations of binary fission daily) and sequenced one clone from each population, which they call a “strain” or “line” throughout the paper. All populations originated from the same E. coli B REL1206 ancestral clone. Their system fulfills all of the requirements needed to answer the question: (i) a large number of replicates for statistical power, (ii) complete genome sequencing, so that mutations can be identified unambiguously, and (iii) a complex biological system, to ensure that the number of potential adaptive solutions is not trivial. As the environmental change they chose temperature, a rather complex environmental variable since it affects many biological processes such as respiration, growth and reproduction.
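The figure of 6.64 generations per day is what one would expect if the cultures were diluted 100-fold each day and regrew to the same density, since that takes log2(100) doublings. The dilution factor is an assumption here (this summary does not state it); the short sketch below only checks the arithmetic and the implied duration of the experiment.

```python
import math

# Assumption (not stated in this summary): a 100-fold dilution every day,
# followed by regrowth to the original density.
DILUTION_FACTOR = 100
TOTAL_GENERATIONS = 2000  # from the article

# Each doubling is one generation of binary fission.
generations_per_day = math.log2(DILUTION_FACTOR)
days_needed = TOTAL_GENERATIONS / generations_per_day

print(f"generations per day: {generations_per_day:.2f}")
print(f"days for {TOTAL_GENERATIONS} generations: {days_needed:.0f}")
```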
Performance of the different strains was measured as fitness and yield. Fitness was defined as the density of each evolved line after 1 day of competition against a newly derived Ara+ mutant of REL1206. Each competition trial was replicated six times. Yield was measured as the number of colony-forming units and also as the number of cells per volume, both also replicated six times.
All 115 strains were fully sequenced using paired-end sequencing on an Illumina HiSeq 2000. A computational pipeline was developed to identify all de novo mutations. A total of 1331 mutations were found, and across 114 genomes the ratio of non-synonymous to synonymous mutations per site was 5.75, as was the ratio of intergenic to synonymous mutations. It was estimated that ~80% of intergenic and non-synonymous mutations were beneficial. 82 of 119 large (>30 bp) deletions were identical between at least two lines. While almost no point mutations or indels were shared between two lines, several IS insertions, duplications and large deletions were shared among several lines.
On average, two strains shared only 2.6% of
mutations (excluding synonymous mutations) but shared 20% of modified genes and
24.5% of affected operons. Focusing on genes with > 5 mutations, genes were
clustered based on the literature into 10 functional units containing 37.5% of
the mutations. At this level, two lines shared an average of 31.5% of affected
units. All synonymous mutations were singletons.
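To make the effect of resolution concrete, here is a minimal sketch of how average pairwise sharing can be computed at two levels. The overlap statistic used here (Dice overlap, 2|A∩B| / (|A| + |B|)) is an assumption, since this summary does not say which statistic the authors used, and the toy lines, genes and positions are invented for illustration.

```python
from itertools import combinations

def dice_overlap(a, b):
    """Shared fraction between two sets: 2|A∩B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def mean_pairwise_overlap(lines):
    """Average overlap over all pairs of lines (dict: name -> set)."""
    pairs = list(combinations(lines.values(), 2))
    return sum(dice_overlap(a, b) for a, b in pairs) / len(pairs)

# Toy data, invented for illustration: each mutation is (gene, position).
toy_lines = {
    "line1": {("rpoB", 1510), ("rho", 45), ("iclR", 200)},
    "line2": {("rpoB", 2200), ("rho", 45), ("cls", 90)},
    "line3": {("rpoB", 1510), ("cls", 310), ("iclR", 77)},
}

# The same data collapsed to the gene level.
toy_genes = {name: {gene for gene, _ in muts} for name, muts in toy_lines.items()}

mutation_level = mean_pairwise_overlap(toy_lines)
gene_level = mean_pairwise_overlap(toy_genes)
print(f"mutation level: {mutation_level:.2f}, gene level: {gene_level:.2f}")
```

Even in this tiny example the lines look far more similar once mutations are grouped by gene, which is the same pattern as the 2.6% versus 20% reported above.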
At the level of individual mutations, two mutations had to be identical to be counted as convergent. At higher levels (gene, operon, functional unit), two mutations were assumed shared if at least 10% of the gene affected by the mutations was shared. The difference in convergence between point mutations (2.6%) and functional units (31%) suggests that the diversity of possible adaptive mutations was not fully explored. To estimate the number of sites that contribute to an adaptive response given their data, the authors developed what they call a “simple model”.
The model is similar to the coupon collector’s problem [1]. This problem arises when you are collecting a predefined set of items, for example Panini stickers of all players of the European Championship in soccer. You buy them in packets containing a small number of stickers. In the beginning every packet contains many new players and you advance quickly. But as the number of collected players grows, you need more and more packets to find the remaining ones. The same goes for finding all the possible beneficial mutations: you need to sample a lot of mutated sites to be able to track down all beneficial mutations that contribute to an adaptive response.
Their model assumes a number L of beneficial mutations (sites) and, in addition to the Panini sticker model, a variance parameter V. This parameter is needed because the sampling probability differs among sites; V captures the compound effect of mutation rates and selection coefficients among sites. For each combination of L and V, and for each of 100 replicates, the sampling probability of each of the L sites was drawn from a shifted gamma distribution and stored in a table. Given this table, 20, 40, 60, 80, 100 and 114 strains were sampled (without replacement), and for each strain its exact number of mutations of interest was sampled (without replacement). For each sample size, the number of different sites was recorded. The resulting curve was then averaged across the 100 replicates, and the squared difference between the averaged curve and the one based on the real data was used to identify the parameter space of L and V. With this model it was estimated that, with no variance in sampling probability among sites (V = 0), 850 possible sites of beneficial mutations are required to yield the 400 observed point mutations (including > 3 convergent mutations). When the sampling probability of beneficial mutations varied across sites, the estimated number rose to ~4500 sites.
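The sampling scheme described above can be sketched roughly as follows. This is not the authors' code: the function names are hypothetical, the per-strain mutation counts are invented (chosen only so that they sum to about 400), and the "shifted gamma" is approximated by a gamma distribution with mean 1 and variance V plus a small constant, which may differ from the parameterisation in the paper.

```python
import random
from itertools import accumulate

def site_weights(num_sites, variance, rng):
    """Relative sampling weight of each site (assumed form, see lead-in)."""
    if variance == 0:
        return [1.0] * num_sites
    shape, scale = 1.0 / variance, variance  # mean 1, variance `variance`
    return [rng.gammavariate(shape, scale) + 0.01 for _ in range(num_sites)]

def draw_distinct_sites(cum_weights, k, rng):
    """Weighted draw of k distinct sites; duplicates are simply redrawn."""
    sites = range(len(cum_weights))
    chosen = set()
    while len(chosen) < k:
        chosen.add(rng.choices(sites, cum_weights=cum_weights)[0])
    return chosen

def distinct_sites_curve(num_sites, variance, muts_per_strain, sample_sizes,
                         replicates=100, seed=1):
    """Mean number of distinct mutated sites observed for each sample size."""
    rng = random.Random(seed)
    totals = {n: 0 for n in sample_sizes}
    for _ in range(replicates):
        cum = list(accumulate(site_weights(num_sites, variance, rng)))
        for n in sample_sizes:
            strains = rng.sample(muts_per_strain, n)  # strains, no replacement
            seen = set()
            for k in strains:
                seen |= draw_distinct_sites(cum, k, rng)
            totals[n] += len(seen)
    return {n: totals[n] / replicates for n in sample_sizes}

def squared_error(simulated, observed):
    """Sum of squared differences between two curves keyed by sample size."""
    return sum((simulated[n] - observed[n]) ** 2 for n in observed)

# Invented mutation counts per strain: 114 strains, 399 mutations in total.
toy_counts = [3, 4] * 57
sizes = [20, 40, 60, 80, 100, 114]

uniform_curve = distinct_sites_curve(850, 0.0, toy_counts, sizes, replicates=20)
skewed_curve = distinct_sites_curve(850, 5.0, toy_counts, sizes, replicates=20)
print(uniform_curve)
print(skewed_curve)
```

In a fit, `squared_error` would be evaluated against the observed curve for a grid of L and V values. With unequal weights the same number of draws hits fewer distinct sites, which is why a larger L is needed to explain the same data once V is greater than zero.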
A further task that the authors tackled was to decipher the role of epistasis, that is, how beneficial mutations interact with each other. Epistasis was examined statistically using a resampling procedure. Assuming no epistasis, the presence of a mutation in one gene should not affect the probability of observing a mutation in another. A randomization procedure was applied to test this null hypothesis. From a database of all mutations observed in the dataset, 114 genomes were sampled without replacement. Each randomized dataset conserved the total number of observed mutations in each genome as well as the relative frequencies of every mutational type. For 1000 random samples, associations were recorded and compared to the values observed in the real data using a Z-score. Associations among mutations were also measured using D’ [2,3] and the correlation coefficient (r). Only operational units containing > 25 convergent mutations were included in this analysis. The data showed strong signals of association. Several cases of negative epistasis were found. For example, iclR, cls, rho, rpoD, KPS, ybaL, and GLP never had more than one mutation in a strain. The exception was RNApol, which accumulated more than 1 mutation within every single line. Another example of negative epistasis was rpoBC and rho, which were in complete repulsion from each other. There were also several cases of positive epistasis. Overall, two competing evolutionary trajectories leading to adaptation to warmer temperature could be found. In the first, mutations in rpoBC were in positive epistasis with changes in rpoD, ILV, KPS, RSS, and ROD. In the second, a mutation in rho precluded the acquisition of mutations in rpoBC and enhanced selection for mutations in cls and iclR. These two trajectories will require further physiological follow-up studies.
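The randomization test can be illustrated with a simplified sketch. It keeps the number of mutations per line fixed, as described above, but it ignores the constraint on mutational types, and the toy dataset is invented so that RNApol and rho never co-occur. All function names are hypothetical.

```python
import random
import statistics

def cooccurrence(lines, unit_a, unit_b):
    """Number of lines carrying mutations in both units."""
    return sum(1 for units in lines if unit_a in units and unit_b in units)

def shuffled_lines(lines, rng):
    """Pool all mutations and deal them back, keeping each line's count."""
    pool = [u for units in lines for u in units]
    rng.shuffle(pool)
    out, start = [], 0
    for units in lines:
        out.append(pool[start:start + len(units)])
        start += len(units)
    return out

def association_z(lines, unit_a, unit_b, n_random=1000, seed=1):
    """Z-score of the observed co-occurrence against the shuffled null."""
    rng = random.Random(seed)
    observed = cooccurrence(lines, unit_a, unit_b)
    null = [cooccurrence(shuffled_lines(lines, rng), unit_a, unit_b)
            for _ in range(n_random)]
    sd = statistics.pstdev(null)
    return (observed - statistics.mean(null)) / sd if sd > 0 else float("nan")

# Invented toy data: two groups of lines following different "trajectories".
toy = [["RNApol", "RNApol", "ILV"]] * 10 + [["rho", "cls", "iclR"]] * 10

z_repulsion = association_z(toy, "RNApol", "rho")
z_attraction = association_z(toy, "rho", "cls")
print(f"RNApol vs rho: {z_repulsion:.1f}, rho vs cls: {z_attraction:.1f}")
```

A strongly negative Z-score means two units co-occur less often than expected by chance (consistent with negative epistasis), and a strongly positive one means they co-occur more often.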
Discussion during meeting
The first, striking impression that most students had of the paper was how strongly the level of resolution affects the apparent strength of convergence. While almost no beneficial mutations were shared at the point mutation level, more mutational changes were shared the higher one goes in the level of organization. At the functional unit level, two lines shared on average 31% of affected units. Since we are used to thinking of convergence at a higher level, i.e. phenotypes, it was very impressive to think of convergence at a much higher resolution, the molecular level of mutations.
On page 457 the authors state that they estimated ~80% of intergenic and non-synonymous mutations to be beneficial. Some of us were puzzled by this exact number because the main article does not state how the percentage was calculated. The proportion of beneficial mutations was calculated using the ratio of non-synonymous to synonymous mutations (Ka/Ks), as well as the ratio of intergenic to synonymous mutations (Ki/Ks). Under strict neutrality this ratio is expected to be 1, and the extent to which it exceeds 1 reflects the strength of positive selection. If this ratio equals x, then the fraction of selected non-synonymous mutations is y = (x - 1)/x. We could not resolve whether Ka/Ks was calculated genome-wide or only for windows of specific coding regions. Even more confusing was the ratio Ki/Ks, where we did not know whether Ks was also measured in the intergenic region, in windows of specific coding regions or genome-wide. This would be interesting to know since the authors did not specifically discuss regulatory versus protein-coding mutations. We would assume that intergenic regions in E. coli contain mostly regulatory elements. The authors state that both ratios, Ka/Ks as well as Ki/Ks, were ~5.75, but 90% of all convergent, beneficial mutations occurred in coding genes. An additional section in the paper discussing regulatory and protein-coding sequences separately would have been appreciated.
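As a quick check of the formula y = (x - 1)/x with the reported ratio of 5.75, the following sketch reproduces the arithmetic; the function name is hypothetical. The result is roughly 0.83, in line with the ~80% quoted in the article.

```python
def fraction_selected(ratio):
    """Fraction of mutations attributed to selection, y = (x - 1) / x.

    If neutral mutations accumulate at the synonymous rate (ratio 1),
    then out of `ratio` observed units, 1 is the neutral expectation and
    the excess `ratio - 1` is attributed to positive selection.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return (ratio - 1) / ratio

ka_ks = 5.75  # value reported in the article for both Ka/Ks and Ki/Ks
y = fraction_selected(ka_ks)
print(f"fraction of selected mutations: {y:.3f}")
```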
A minor but important discussion point was the definition of “beneficial” in the paper. We concluded that Ka/Ks and Ki/Ks ratios significantly greater than 1 imply that the changes were beneficial during adaptation. The fact that many of these changes were shared among lines (convergence) adds to this conclusion. The authors estimated the performance of the different lines with fitness and yield. We would assume that a beneficial mutation would increase these performance measures, but surprisingly no individual mutations were found that were significantly associated with better performance. At the functional unit level this outcome looked different.
One of the main discussion points was that one dimension was lacking in the paper: time! Over the course of the experiment (2000 generations of binary fission in E. coli) adaptation was not complete, but the strongest effects of adaptation to higher temperature had already passed, so the most extreme changes had already happened. Sweeps happen earlier than 2000 generations and are in any case fast in prokaryotes because there is no recombination (under these experimental conditions, only binary fission). Because of this we see only a rather late time point and miss many changes that happened at the very beginning of adaptation. The lack of a time series could give a misleading picture when estimating associations among genes (epistasis). While some mutations (mostly synonymous) might not have happened yet because too little time had passed, other mutations that had already happened might bias the sign of the interactions we find. Another important part of adaptation that we miss because too much time had elapsed is the effect of chaperones. Most mutations that were found were associated with the polymerase. Earlier in adaptation chaperones, proteins that assist the non-covalent folding of other proteins, are also involved. Especially when a change of temperature is involved, these proteins are needed to maintain cell-biological functions in the organism. Usually they are involved in the first steps of adaptation, when stress is still very high, and they can contribute to phenotypic plasticity. Mutations and their fixation come later.
We closed our discussion with an interesting viewpoint on the general findings of the paper. Almost no single beneficial point mutations were shared between two lines, and the same applies to indels. Beneficial IS insertions, large deletions and duplications, on the other hand, were shared much more often among several lines. The latter are by nature much bigger features. A simple explanation for their higher convergence is that they are limited in number: only a limited number of large deletions, IS insertion types and duplications can happen in a genome the size of that of a prokaryote such as E. coli. Point mutations and indels, on the other hand, are much smaller, and therefore many more of them are possible. Reaching the upper limit of possible point mutations and indels in such a system would take many more mutations, so the authors would have needed to sequence the genomes of many more populations of E. coli. Numerical estimates are given in the article's section on estimating the number of sites that contribute to an adaptive response.
Expression of my own opinion
At first glance the paper strongly reminded me of Richard Lenski's experiment with Escherichia coli at Michigan State University [4,5,6]. Lenski documents long-term experimental evolution in a study that has kept E. coli proliferating in the lab since 1988. In February 2010 the gram-negative bacteria reached 50,000 generations of binary fission. Lenski mostly concentrated on morphological and physiological evolution and classical genetics, whereas the authors here base their experiment on genomics. From my own research I am more familiar with animal genomes, and I had to keep in mind that the E. coli in this study have a genome densely packed with functional genes (~4.6 million base pairs in length, organized in ~2590 operons [7]). The coding density is very high and we assume no recombination, so a sweep would happen very fast.
I found the article very interesting to read. The hypotheses were clearly stated in advance and the experiment was well designed to answer the questions. The only point where I would like to add some criticism is the way thresholds were chosen in the models. Functional units were defined using genes with > 5 convergent, beneficial mutations. To estimate the number of sites that contribute to an adaptive response, the model contained only genes with > 3 convergent, beneficial mutations. To estimate the degree of association among genes (epistasis), only operational units with > 25 convergent, beneficial mutational events were included. I did not find any explanation of why exactly these thresholds were used. Maybe the cutoffs were varied during the analysis and this led to the same results; I do not doubt that. The authors state that their thresholds gave them more statistical power. An additional explanation in the methods describing why they chose these thresholds would have made it clearer for me.
Tenaillon, O., Rodriguez-Verdugo, A., Gaut, R., McDonald, P., Bennett, A., Long, A., & Gaut, B. (2012). The Molecular Diversity of Adaptive Convergence. Science, 335 (6067), 457-461. DOI: 10.1126/science.1212986
Other references
1. Read KLQ (1998) A lognormal approximation for the collector's problem. American Statistician 52: 175-180.
2. Lewontin RC (1964) The Interaction of Selection and Linkage. II. Optimum Models. Genetics 50: 757-782.
3. Lewontin RC (1964) The Interaction of Selection and Linkage. I. General Considerations; Heterotic Models. Genetics 49: 49-67.
4. Cooper TF, Lenski RE (2010) Experimental evolution with E. coli in diverse resource environments. I. Fluctuating environments promote divergence of replicate populations. BMC Evolutionary Biology 10.
5. Lenski RE, Mongold JA, Sniegowski PD, Travisano M, Vasi F, et al. (1998) Evolution of competitive fitness in experimental populations of E. coli: What makes one genotype a better competitor than another? Antonie Van Leeuwenhoek International Journal of General and Molecular Microbiology 73: 35-47.
6. Novak M, Pfeiffer T, Lenski RE, Sauer U, Bonhoeffer S (2006) Experimental tests for an evolutionary trade-off between growth rate and yield in E. coli. American Naturalist 168: 242-251.
7. Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, et al. (1997) The complete genome sequence of Escherichia coli K-12. Science 277: 1453.
Posted by MRR for Laetitia G.E. Wilkins