Saturday, January 12, 2013

A Critical Assessment of Storytelling: Gene Ontology Categories and the Importance of Validating Genomic Scans.

ResearchBlogging.org "Where there is life there is wishful thinking" Gerald F. Lieberman

Finding genes that are under positive selection is an important part of any molecular evolutionary biologist's work, as these genes can be responsible for adaptations in the species under study. To find such genes, genomic scans are conducted; regions of the genome that show specific patterns, such as selective sweeps, are then studied further and given sensible biological interpretations. In this paper, Pavlidis et al. show that one has to be careful with such biological interpretations, because the patterns of positive selection can appear in a genome known a priori to be evolving neutrally, and because it may not be that difficult to come up with a satisfying story about such false positives.

Figure 1 | Flowchart representing the steps in the simulation. These steps were repeated for each of the 100 simulations.











To show the existence of false positives in the detection of positive selection patterns, Pavlidis et al. simulated 100 data sets of 40 D. melanogaster X chromosomes evolving under a neutral Wright-Fisher model. The D. melanogaster population, which was sampled in the Netherlands, is believed to have gone through a recent and deep bottleneck. The data sets were simulated with the Markovian Coalescent Simulator (MaCS) software under a demographic scenario inferred for that population. The group then used the SweepFinder program to find regions characteristic of selective sweeps in these artificial, neutrally evolving genomes and mapped them to the actual X chromosome using FlyBase, which allowed the genes in the identified regions to be named. Interesting genes were detected and biological meaning was assigned using the Gene Ontology Statistics (g:GOSt) module of g:Profiler. A "convincing" narrative was then given (Figure 1).

The results showed that, on average, 43 regions per simulation (min. 27, max. 60) were found where the site frequency spectrum (SFS) was shifted towards low- and high-frequency derived alleles. These patterns, which show a lack of intermediate allele frequencies, are characteristic of recent selective sweeps and are indistinguishable from those left by selective sweeps occurring in nature under selective pressures (Figure 2). These detected regions were then mapped to the real X chromosome using FlyBase, as mentioned earlier.
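To make the SFS idea concrete, here is a minimal Python sketch. It is my own illustration, not SweepFinder's algorithm: it simply counts how many chromosomes carry the derived allele at each site and measures how much of the spectrum sits at intermediate frequencies. The function names, the 0/1 encoding (1 = derived allele) and the 0.25 to 0.75 "intermediate" band are all assumptions chosen for the example.

```python
from collections import Counter


def site_frequency_spectrum(haplotypes):
    """Unfolded SFS: entry k-1 is the number of segregating sites at which
    exactly k of the n sampled chromosomes carry the derived allele (1)."""
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    counts = Counter()
    for site in range(n_sites):
        derived = sum(chrom[site] for chrom in haplotypes)
        if 0 < derived < n:  # skip sites that are not polymorphic in the sample
            counts[derived] += 1
    return [counts[k] for k in range(1, n)]


def intermediate_fraction(sfs, low=0.25, high=0.75):
    """Share of segregating sites whose derived-allele frequency lies
    strictly between `low` and `high` (an arbitrary band for illustration)."""
    n = len(sfs) + 1
    total = sum(sfs)
    if total == 0:
        return 0.0
    middle = sum(c for k, c in enumerate(sfs, start=1) if low < k / n < high)
    return middle / total


# Toy sample: 5 chromosomes (rows) x 4 sites (columns), invented data.
toy = [
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
sfs = site_frequency_spectrum(toy)   # [2, 0, 0, 2]
print(sfs, intermediate_fraction(sfs))
```

In the toy sample every segregating site is either rare or nearly fixed, which is the sweep-like shape described above; a spectrum such as [1, 1, 1, 1] would instead have half of its sites in the intermediate band.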

Figure 2 | SFS pattern for the highest SweepFinder peak in the first simulated data set. These patterns showing a lack of intermediate allelic frequencies are characteristic of recent selective sweeps.

For each of the 100 simulated data sets, a g:GOSt enrichment analysis was run on the detected regions. On average, 5.19 statistically significant categories were detected per data set, with 77 data sets yielding at least one significant category and 16 giving rise to more than 10 significant categories. To compare these results quantitatively with results from real data, an enrichment analysis was done on 37 inbred lines of D. melanogaster sampled in North Carolina, which are also accepted to have gone through a very recent and deep bottleneck. This real data enrichment analysis showed that 9 statistically significant terms were related to transcription factor binding sites. This important result shows that the number of biological terms obtained with a g:GOSt enrichment analysis is not higher in the real data than in the simulated data sets. A few issues in the model were also addressed.
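For readers who have never looked inside an enrichment analysis, the sketch below shows the kind of test that typically sits behind a "significant category": a one-sided hypergeometric test asking whether a gene list contains more genes with a given annotation than expected by chance. I am assuming this is close in spirit to what g:GOSt does; its actual implementation and its multiple-testing correction are not reproduced here, and the function name and numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the term of interest."""
    max_hits = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Invented numbers: 2,000 genes on the chromosome, 40 annotated with a term,
# 50 genes fall in candidate sweep regions, 5 of them carry the term.
p_value = hypergeom_enrichment_p(population=2000, annotated=40, sample=50, hits=5)
print(p_value)
```

Because thousands of terms are tested at once, a raw p-value like this one has to be corrected for multiple testing before being called significant. Even then, as the simulations above show, significant terms turn up in most neutral data sets.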

1. It is known that bottlenecks increase the proportion of false positives in neutrality tests, so the group ran another set of simulations with a milder bottleneck model. The g:GOSt analysis still yielded significant categories in 85% of the simulated data sets.

2. It is known that large recombination rates result in a different coalescent genealogy every few base pairs, thus hiding any selective sweep, and that small recombination rates tend to reduce the independence of genes from the whole genome, thus not allowing selective sweeps to happen. To address this issue, the group ran more simulations with 5 different combinations of recombination rates and bottleneck models. The g:GOSt analysis didn't show substantial differences between these simulations.

3. SweepFinder detects SFS outliers as signatures of recent selective sweeps, but there exist other statistics, such as the omega-statistic, which detect other signatures of recent selective sweeps such as linkage disequilibrium (LD). Two more sets of simulations were analysed, first with an LD detection method (OmegaPlus software) and second with a joint method combining SFS and LD detection. The g:GOSt enrichment for both yielded similar numbers of significant categories, even though the distributions of the detected regions along the genome are different (the distribution is more uniform with the omega-statistic than with SFS detection).

The group then tried to make up convincing narratives about the three highest SweepFinder scoring genes (CG15211, CG8188 & CG6788) in the first simulation. In my opinion, these narratives were not the most convincing from a biological point of view but that is not the point of the article.

Selective pressures experienced by organisms are complex, varied and changing with time. Even if we knew all the selective pressures imposed on a population at one point, the ways in which its organisms could respond are vast! Every gene, as obscure as it might be, is linked one way or another to an important biological process, so "meaningful" narratives, even about false positives, can be constructed relatively easily. The extensive use of Gene Ontology and the ever-increasing precision of databases put researchers at greater risk of seeing patterns of positive selection where there are none.
What the authors of this article have shown is not that computational or statistical approaches for detecting positive selection are wrong, but that one should be cautious about over-interpreting genomic scans and blindly trusting statistics, because no null hypothesis of what "makes sense" exists.


Pavlidis P, Jensen JD, Stephan W, & Stamatakis A (2012). A critical assessment of storytelling: gene ontology categories and the importance of validating genomic scans. Molecular biology and evolution, 29 (10), 3237-48 PMID: 22617950

Tuesday, January 8, 2013

The genomic basis of adaptive evolution in threespine sticklebacks

Sticklebacks are originally marine fish that colonized freshwater habitats after the last glaciation. Adaptation to the freshwater environment happened independently in various rivers and lakes around the globe, giving rise to similar phenotypes through natural selection. In a recent study, researchers aimed to identify loci repeatedly associated with the divergence between marine and freshwater sticklebacks. An underlying question was whether this adaptation is due to regulatory or protein-coding changes.
To ensure that the changes reflected parallel evolution, the authors sequenced a reference freshwater stickleback and 20 other freshwater and marine sticklebacks from both Pacific and Atlantic populations. They selected populations showing characteristic marine and freshwater morphologies (Figure 1a, b).
To find loci involved in repeated adaptation to freshwater habitats, the authors used two methods, both aiming to identify regions where sequences from freshwater sticklebacks were similar to each other but different from those of marine sticklebacks. The first method is a self-organizing map-based iterative Hidden Markov Model (SOM/HMM) (Figure 1c). With this method, they identified the 20 most common patterns of genetic relationships (trees) among the 21 individuals. The authors found that for most of the genome, the fish clustered by geography, with fish from Pacific regions being closer to each other than they were to fish from Atlantic regions. For 215 regions, however (0.46% of the genome), the fish clustered by marine / freshwater ecology.
The second method the authors used was based on genetic distance. The idea was to build distance matrices from the 21 x 21 pairwise nucleotide divergences. They then calculated a marine-freshwater cluster separation score (CSS) for each distance matrix, which quantifies the average distance between the marine and freshwater clusters (Figure 1c). With this method, 174 marine-freshwater divergent regions were found, covering 0.26% of the genome. The two methods are complementary: 242 regions were identified by either method (0.5% of the genome) and 147 regions by both (0.2% of the genome). Both methods confirmed that the previously known chromosome IV EDA locus plays an important role in the difference between marine and freshwater populations.
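The intuition behind a cluster separation score can be sketched in a few lines of Python. Note that this is a simplified stand-in and not the exact CSS formula from the paper: it just subtracts the average within-ecotype distance from the average between-ecotype distance for one genomic window. The function names and the toy distance matrix are assumptions made for illustration.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def cluster_separation(dist, marine, fresh):
    """Simplified separation score for one window: mean marine-freshwater
    distance minus the average of the two mean within-group distances.
    `dist` is a symmetric matrix; `marine` and `fresh` are lists of indices
    (each group needs at least two individuals)."""
    between = mean(dist[i][j] for i in marine for j in fresh)
    within_marine = mean(dist[i][j] for i in marine for j in marine if i < j)
    within_fresh = mean(dist[i][j] for i in fresh for j in fresh if i < j)
    return between - (within_marine + within_fresh) / 2


# Toy window with 4 fish: individuals 0-1 are close, 2-3 are close,
# and the two pairs are far apart from each other.
toy_dist = [
    [0.0, 0.1, 0.5, 0.5],
    [0.1, 0.0, 0.5, 0.5],
    [0.5, 0.5, 0.0, 0.1],
    [0.5, 0.5, 0.1, 0.0],
]
by_ecotype = cluster_separation(toy_dist, marine=[0, 1], fresh=[2, 3])
mixed_up = cluster_separation(toy_dist, marine=[0, 2], fresh=[1, 3])
print(by_ecotype, mixed_up)
```

When the grouping matches the structure in the window the score is clearly positive, and when the fish cluster some other way (by geography, say) the score drops to zero or below. Scanning such a score along the genome and testing it against permuted group labels is the general logic of a distance-based scan like this one.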

Figure 1: Genome scans for parallel marine-freshwater divergence. a. Marine (red) and freshwater (blue) stickleback populations were surveyed from diverse locations. b. Morphometric analysis was used to select individuals for re-sequencing. The 20 chosen individuals are from multiple geographically-proximate pairs of populations with typical marine and freshwater morphology (solid symbols). Points: population mean morphologies; ellipses: 95% confidence intervals for ecotypes. c. Genomes were analysed using SOM/HMM (upper) and CSS (lower) methods to identify parallel marine-freshwater divergent regions. Across most of the genome, the dominant patterns reflect neutral divergence or geographic structure. In contrast, <0.5% of the genome shows haplotype-ecotype association, a pattern characteristic of divergent marine and freshwater adaptation via parallel reuse of standing genetic variation.

The authors then aimed to determine to what extent the globally shared regions found with the previous methods contribute to divergence in a particular marine-freshwater species pair, compared to locally evolved genomic regions. To do this, they sequenced whole genomes of a single marine-freshwater pair found across a marine-freshwater hybrid zone in a river in Scotland. Looking at the 0.1% most divergent regions in this pair, they found that 35.3% of them also showed globally shared marine-freshwater divergence. This result means that only a part of the divergence is due to globally shared variants and that the major part may be due to locally evolved mutations (Figure 4).


Figure 4: How much of local marine-freshwater adaptation occurs by reuse of global variants? a. Classic marine and freshwater ecotypes are maintained in downstream and upstream locations of the River Tyne, despite extensive hybridization at intermediate sites. b. Pairwise sequence comparisons identify many genomic regions that show high divergence between upstream and downstream fish (X-axis). Many, but not all, of these regions also show high global marine-freshwater divergence (Y-axis; red points indicate significant CSS FDR<0.05), indicating that both global and local variants contribute to formation and reproductive isolation of a marine-freshwater species pair.

The team also observed extended regions of marine-freshwater divergence on chromosomes I, XI and XXI corresponding to chromosome inversions, which are a known genetic mechanism that can maintain diverging ecotypes in hybridizing populations, by preventing recombination between independent adaptive loci (Figure 3).


Figure 3: Genome-wide distribution of marine-freshwater divergence regions. Whole-genome profiles of SOM/HMM and CSS analyses reveal many loci distributed on multiple chromosomes (plus unlinked scaffolds, here grouped as "ChrUn"). Extended regions of marine-freshwater divergence on chrI, XI, and XXI correspond to inversions (red arrows). Marine-freshwater divergent regions detected by CSS are shown as grey peaks with grey points above chromosomes indicating regions of significant marine-freshwater divergence (FDR 0.05). Genomic regions with marine-freshwater-like tree topologies detected by SOM/HMM are shown as green points below chromosomes.

The authors were then interested in the relative contributions of regulatory and coding changes to the sticklebacks' adaptation to the freshwater environment. To estimate this, they analyzed the 64 divergent regions showing the strongest evidence of parallel evolution among those identified with the SOM/HMM and CSS methods. They found that even though both coding and regulatory changes are involved in stickleback adaptation to freshwater habitats, regulatory changes seem to play a much stronger role. Seventeen percent of these 64 regions consisted of coding regions with consistent non-synonymous substitutions between marine and freshwater fish. On the other hand, 41% consisted of non-coding regions of the genome that were most likely regulatory, while 42% were evaluated as probably regulatory, as they contained both coding and non-coding sequences but lacked ecotype-specific amino acid substitutions. Finally, the authors investigated whole-genome expression levels in freshwater and marine fish. Of the 12594 informative genes across the whole genome, 2817 showed significant differences in expression levels between freshwater and marine ecotypes. They also found that genes with a difference in expression between ecotypes were more likely to be situated in or near the adaptive regions previously discovered with the SOM/HMM or CSS methods (Figure 6).


Figure 6: Contributions of coding and regulatory changes to parallel marine-freshwater stickleback adaptation. a. A genome-wide set of marine-freshwater loci recovered by both SOM/HMM and CSS analyses includes regions with consistent amino acid substitutions between marine and freshwater ecotypes (yellow sector); regions with no predicted coding sequence (green sector); and regions with both coding and non-coding sequences, but no consistent marine-freshwater amino acid substitutions (grey). b. Genome-wide expression analysis shows that marine-freshwater regions identified by SOM/HMM or CSS analyses are enriched for genes showing significant expression differences in 6 out of 7 tissues between marine LITC and freshwater FTC fish (observed: grey bars; expected: white bars; *P<0.01, **P<0.001, ***P<0.0001, ****P<<0.00001), consistent with a role for regulatory changes in marine-freshwater evolution.

In conclusion, the fact that sticklebacks repeatedly evolved from marine to freshwater habitats, coupled with the power of whole-genome sequencing, has made it possible to uncover a great number of loci globally involved in marine-freshwater adaptation. The differentiation seems to be spread across the genome, on several different chromosomes. Globally shared mutations, however, only account for a fraction of the differences, as many locally evolved mutations also seem to play a significant role. Moreover, regulatory adaptations are particularly important in this case of repeated evolution, although protein-coding changes have also been found in the set of loci implicated in differences between ecotypes.
The authors finally suggest that although they focused on freshwater-marine differences, other ecological traits could be studied, like lake-stream or open-water and bottom dwelling habitats or gigantism in particular lakes, as sticklebacks have also repeatedly evolved these characteristics.


Jones FC, Grabherr MG, Chan YF, Russell P, Mauceli E, Johnson J, Swofford R, Pirun M, Zody MC, White S, Birney E, Searle S, Schmutz J, Grimwood J, Dickson MC, Myers RM, Miller CT, Summers BR, Knecht AK, Brady SD, Zhang H, Pollen AA, Howes T, Amemiya C, Broad Institute Genome Sequencing Platform & Whole Genome Assembly Team, Baldwin J, Bloom T, Jaffe DB, Nicol R, Wilkinson J, Lander ES, Di Palma F, Lindblad-Toh K, & Kingsley DM (2012). The genomic basis of adaptive evolution in threespine sticklebacks. Nature, 484 (7392), 55-61 PMID: 22481358

Tuesday, December 18, 2012

Genome-wide analysis of a long-term evolution experiment with Drosophila

For decades, most general insights into the nature of adaptation have come from asexually reproducing populations with small genomes, such as bacteria and yeast. Researchers assumed that sexual species evolve the same way these populations do, i.e. that their adaptation is driven by so-called selective sweeps, in which a newly arising beneficial mutation quickly becomes fixed in the population, along with the genome-wide haplotype associated with it. In obligately sexually reproducing systems, the picture is complicated by the fact that selection can act on standing variation, meaning that weak selection can act on many pre-existing genetic variants involved in fitness traits. The idea is that short-term evolution may have occurred through a so-called “soft sweep” model, which contrasts with the “hard sweep” hypothesis, where a strong selective sweep originates from a single mutation and variation at all linked neutral sites is eliminated. Burke et al. compared outbred, sexually reproducing, replicated populations of D. melanogaster selected for accelerated development with their matched control populations on a genome-wide basis, and this is the first time that such a study of a sexually reproducing species has been done.


They used the Illumina platform to obtain short-read sequences from three genomic DNA libraries. The libraries came from sets of replicated populations experiencing different selection treatments, maintained since 1980 under conditions of large population size (N > 1,000) and discrete generations:
1) a pooled sample of five replicate populations that have undergone sustained selection for accelerated development and early fertility for over 600 generations (ACO);
2) a pooled sample of five replicate ancestral control populations, which experience no direct selection on development time (CO);
3) a single ACO replicate population (ACO1).
Phenotypes were measured with longevity, starvation resistance, development time and dry weight assays (Figure 1).

Figure 1. Grey bars represent values measured in each of the five replicate populations in the ACO and CO treatments. Measures from the five baseline (B) replicate populations represent phenotypes typical of populations kept on two-week generation maintenance schedules. Only data for females are shown. Longevity and starvation resistance data were collected after at least 619 generations of ACO treatment, and both development time and dry weight data (dry weight values are mean masses of groups of ten females) were collected after 640 generations of ACO treatment. Error bars, s.e.m. for each replicate population. 

As shown in Figure 2, a genome-wide sliding-window analysis with 100-kb windows was carried out to identify regions that had diverged in allele frequency. A large number of genomic regions showed significant differences between the ACO populations and their matched controls, while no significant divergence was seen in the comparison of the single replicate population (ACO1) with the pooled sample consisting of all five ACO populations. The apparent excess of diverged regions on the X chromosome was explained as a result of selection on initially rare recessive or partially recessive alleles. Another important point is that the adaptive response was highly multigenic: not just one or a few regions were affected by selection on development time, but most likely a large portion of the genome.
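The mechanics of a sliding-window scan are simple enough to sketch in Python. The example below averages a per-SNP score over 100-kb windows that advance in 2-kb steps, the step size given in the Figure 2 legend. The score used in the paper (L10FET5%Q) is not reimplemented; any per-SNP value, for instance the absolute difference in allele frequency between ACO and CO, can be plugged in. The function name and the toy data are assumptions.

```python
def sliding_window_means(positions, values, window=100_000, step=2_000):
    """Return (window_start, mean_value) for every window that contains at
    least one SNP. `positions` are base-pair coordinates on one chromosome
    arm and `values` the per-SNP scores, in the same order."""
    results = []
    if not positions:
        return results
    last = max(positions)
    start = 0
    while start <= last:
        end = start + window
        inside = [v for p, v in zip(positions, values) if start <= p < end]
        if inside:
            results.append((start, sum(inside) / len(inside)))
        start += step
    return results


# Toy data: three SNPs, scored by a hypothetical |freq(ACO) - freq(CO)|.
snp_positions = [0, 1_000, 150_000]
snp_scores = [1.0, 3.0, 10.0]
coarse = sliding_window_means(snp_positions, snp_scores, step=50_000)
print(coarse)
```

This naive version rescans every SNP for every window, which is fine for a toy example but too slow for a real genome; a production version would keep the positions sorted and move two pointers along them.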


Figure 2. Sliding-window analysis (100 kb) of differentiation in allele frequency between the ACO and CO populations: the solid black line depicts L10FET5%Q scores at 2-kb steps (Methods). The dotted line is the threshold that any given window has a 0.1% chance of exceeding relative to the genome-wide level of noise. The grey line depicts L10FET5%Q scores for a difference in allele frequency between ACO1 and the ACO pooled sample. The five panels show the five major D. melanogaster chromosome arms (as indicated). 

Looking instead at heterozygosity throughout the genome, they found a clear and expected concordance with these results. Regions of reduced heterozygosity are in fact expected to be strongly associated with regions of differentiated allele frequency. Accordingly, if we compare Figure 2 with Figure 3, we can see that the regions identified as diverged in allele frequency were also the ones associated with reduced heterozygosity.
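As a reminder of what is being plotted, heterozygosity at a biallelic SNP can be estimated from the allele frequency p as 2p(1 - p). Here is a minimal Python sketch, assuming the allele frequency is estimated directly from read counts in a pooled library; the function names and the read counts are invented for illustration and do not come from the paper.

```python
def allele_frequency(major_reads, minor_reads):
    """Estimate the major-allele frequency from pooled read counts."""
    total = major_reads + minor_reads
    if total == 0:
        raise ValueError("no reads cover this SNP")
    return major_reads / total


def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) at a biallelic site."""
    return 2.0 * p * (1.0 - p)


# Invented counts at one SNP: near 50/50 in the control pool, strongly
# skewed (but not fixed) in the selected pool.
h_control = expected_heterozygosity(allele_frequency(52, 48))
h_selected = expected_heterozygosity(allele_frequency(90, 10))
print(h_control, h_selected)
```

A classic hard sweep would push p to 1 and heterozygosity to zero around the selected site. A shift like the one in this toy example lowers heterozygosity without eliminating it, which is the kind of partial reduction discussed in the post.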

Figure 3. Sliding-window analysis (100 kb) of heterozygosity in the CO pool (blue), the ACO pool (red) and ACO1 (grey), with a 2-kb step size. The panels show the five major chromosome arms of D. melanogaster.

They were also able to exclude the possibility that the observed similarity in allele frequencies and patterns of heterozygosity between the ACO1 and ACO libraries was an artifact of sample preparation or data analysis, by individually genotyping 35 females from each of the five replicate populations of each selection treatment at 30 loci identified as diverged in allele frequency (Fig. 4).

Figure 4. a, Allele frequency estimates of the most common allele at 30 SNPs genotyped in 35 females per replicate population. Red circles represent ACO estimates and grey squares represent CO estimates. Open symbols are allele frequencies for ACO1–ACO5 and CO1–CO5, and filled symbols represent treatment means. Alternating black and grey bars designate the X, 2L, 2R, 3L, and 3R arms, respectively, with grey lines indicating SNP location. b, Scatter plot comparing allele frequency estimates at the same 30 SNPs obtained from the Illumina resequencing versus individual genotyping. Red circles represent ACO, black squares represent CO and the straight line represents a slope of unity. 

Burke et al. found evidence of evolution in more than 500 genes that could be linked to a variety of traits, including size, sexual maturation and life span, indicating a gradual, widespread network of selective adaptation. There are two possible hypotheses to explain the results reported in the paper: either “parallel evolution”, with selection acting on the same intermediate-frequency variants in each population, or unwanted migration between replicate populations. In any case, these results clearly show that the signature of a “classic hard sweep” is absent in these populations, despite evidence of strong selection, while a “soft sweep” model is more consistent with the observations.

Burke, M., Dunham, J., Shahrestani, P., Thornton, K., Rose, M., & Long, A. (2010). Genome-wide analysis of a long-term evolution experiment with Drosophila. Nature, 467 (7315), 587-590 DOI: 10.1038/nature09352

Monday, December 17, 2012

Research blogging is 5 years old

And they have published an informative article in PLOS One about their first 4 years:
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0050109

More cited papers also have more views of the blog post
Biology rules

Friday, December 14, 2012

Genome-wide analysis of a long-term evolution experiment with Drosophila

In this paper, Molly K. Burke and colleagues used an experimental evolution system, which allows the genomic study of adaptation. They selected outbred, sexually reproducing, replicated populations of Drosophila melanogaster, which experienced over 600 generations of laboratory selection for accelerated development.

Short-read sequences were obtained from three genomic DNA libraries using the Illumina platform. The libraries are as follows:
a) A pooled sample of five replicate populations that have undergone sustained selection for accelerated development and early fertility for over 600 generations (ACO);
b) A pooled sample of five replicate ancestral control populations, which experience no direct selection on development time (CO);
c) A single ACO replicate population (ACO1).
Figure 1: Phenotypic divergence in the selection treatments    
In the above figure, the grey bars indicate values measured in each of the five replicate populations of the ACO and CO treatments. B indicates the baseline replicate populations, which represent phenotypes typical of populations kept on two-week generation maintenance schedules.

This figure shows a comparison between the ACO populations and the populations under the CO treatment. In every assay, the ACO populations showed significantly differentiated phenotypes, including shorter development time and reductions in pre-adult viability, longevity, adult body size and stress resistance. Furthermore, the CO treatment does not involve stringent selection, as it entails no more than moderate selection for postponed reproduction, resulting in moderately increased development time and longevity.

Figure 2: Differentiation throughout the genome    

A genome-wide sliding-window analysis with 100-kb windows was carried out to identify regions diverged in allele frequency between the ACO and CO libraries and between the ACO and ACO1 libraries. This window size was chosen because linkage disequilibrium in individual ACO and CO replicate populations may extend anywhere from 20 to 100 kb. The five panels correspond to the five major D. melanogaster chromosome arms.

The analysis identifies a large number of genomic regions showing significant divergence between the accelerated development populations and their matched controls, depicted by the black lines in Figure 2. Very little divergence is observed between a single replicate evolved population (ACO1) and the pooled sample consisting of all five ACO populations, as indicated by the grey lines. The dotted line is the threshold that any given window has a 0.1% chance of exceeding relative to the genome-wide level of noise. Interestingly, an excess of diverged regions on the X chromosome relative to the autosomes is very evident. This observation is only expected if adaptation were driven by selection on initially rare recessive or partially recessive alleles. Furthermore, the sharpness of the peaks suggests that the regions of the genome that have responded to experimental evolution are precisely identified. However, even the sharpest peaks tend to delineate 50–100-kb regions.

Recent research on evolutionary genetics has focused on classic selective sweeps. As Kaplan et al. stated, in a recombining region a selective sweep is expected to reduce heterozygosity at SNPs flanking the selected site.
Figure 3: Heterozygosity throughout the genome
Figure 3 shows a similar sliding-window analysis (100 kb) of heterozygosity in the ACO and CO lines, depicted by the red and blue lines, respectively, suggesting that there are indeed local losses of heterozygosity. Heterozygosity in ACO1, depicted by the grey line, shows remarkable concordance with the reductions in heterozygosity in the ACO pool. Regions of reduced heterozygosity are strongly associated with regions of differentiated allele frequency. Interestingly, no location in the genome is observed where heterozygosity is reduced to anywhere near zero, and this lack of evidence for a classic sweep is a feature of the data regardless of window size. Overall, the patterns in Figures 2 and 3 are quite similar.

Figure 4: Analysis of individual genotypes, measured by cleaved amplified polymorphic sequence (CAPS) techniques.

Figure 4a shows the allele frequency estimates of the most common allele at 30 SNPs genotyped in 35 females per replicate population. Red circles and grey squares represent ACO and CO estimates. Open symbols are allele frequencies for ACO1–ACO5 and CO1–CO5, and filled symbols represent treatment means. Alternating black and grey bars designate the X, 2L, 2R, 3L, and 3R arms, respectively. The grey lines indicate SNP location. We observe that replicate populations within a selection treatment have very similar allele frequencies.

In Figure 4b, we see a scatter plot comparing allele frequency estimates at the same 30 SNPs obtained from the Illumina resequencing versus individual genotyping. Red circles represent ACO, black squares represent CO and the straight line represents a slope of unity. Here we see that individual genotypes are consistent with allele frequency estimates from the resequenced pooled libraries.
Therefore, it can be concluded that the congruence in allele frequencies and patterns of heterozygosity between the ACO1 and ACO libraries is unlikely to be some sort of artefact of sample preparation or data analysis.
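To give an idea of how agreement with a "slope of unity" line can be quantified, here is a small Python sketch. It is not the analysis from the paper: it simply fits a least-squares line through the origin and reports the mean absolute difference between the two sets of estimates. The function names and the five allele frequencies are invented for illustration.

```python
def slope_through_origin(x, y):
    """Least-squares slope of y = b * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def mean_absolute_difference(x, y):
    """Average |x_i - y_i|, i.e. the typical distance from the unity line."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Invented allele-frequency estimates at 5 SNPs.
pooled = [0.55, 0.62, 0.70, 0.81, 0.93]      # from pooled resequencing
genotyped = [0.57, 0.60, 0.73, 0.80, 0.91]   # from individual genotyping

print(slope_through_origin(pooled, genotyped))
print(mean_absolute_difference(pooled, genotyped))
```

A slope close to 1 together with a small mean difference is what one would expect if the pooled sequencing estimates are not systematically biased relative to individual genotyping.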

The study shows a convergence of allele frequencies and heterozygosity levels between replicate populations. This convergence might be due to selection acting on the same intermediate-frequency variants in each population. Under this scenario, convergence in allele frequencies is due to parallel evolution. Another explanation could be unwanted migration between replicate populations, even at very low levels.

In conclusion, it was very interesting to see that, despite strong selection, Molly K. Burke and colleagues failed to observe the signature of a classic sweep in these populations.







Burke, M., Dunham, J., Shahrestani, P., Thornton, K., Rose, M., & Long, A. (2010). Genome-wide analysis of a long-term evolution experiment with Drosophila. Nature, 467 (7315), 587-590 DOI: 10.1038/nature09352