Wednesday, November 30, 2011

Insights into Human Variation
Higher throughput, better accuracy, and lower costs of DNA sequencing technology have revolutionized the field of genetics. Building upon these technological advances, the 1000 Genomes Project marked a new era of human genetics. The ambitious goal of this international project is to build a detailed map of human genetic variation by sequencing 2500 individuals from five major population groups. The first insights into the project results became available upon completion of the pilot phase, which covered several hundred individuals (The 1000 Genomes Project Consortium 2010).

While sequencing costs drop, data management costs are rising. The tremendous amount of sequencing data from thousands of genomes, each over 3 billion DNA base pairs long, raises important challenges for storage and analysis. To tackle this, the EBI developed a dedicated computing platform to manipulate and share large-scale data. Furthermore, although sequencing is becoming cheaper, obtaining the sequences of 2500 genomes remains a burden. The pilot project assessed two cost-containment strategies: low-coverage (4x) sequencing of the whole genome and high-coverage (50x) sequencing of exon-targeted regions (8140 exons were included).

According to the pilot study, the low-coverage whole-genome sequencing approach performs reasonably well. Sequencing many individuals increases the power to detect variants at different frequencies in the population. The number and accuracy of called genotypes are comparable to those obtained at 15x coverage of exon-enriched samples. Furthermore, the pilot study included whole-genome sequencing at 42x of two mother-father-child trios, which made it possible to estimate the accuracy and completeness of low-coverage samples: the analysis of trio data subsampled at 4x retrieved about 90% of SNP variants and genotypes. The main issue with the low-coverage approach is missing data. The pilot study overcomes this limitation with imputation methods that infer missing data from the known data of other individuals.
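To get a feel for why missing data is the main issue at 4x, here is a minimal sketch. It assumes that reads land on the genome as a Poisson process, which is an idealisation of mine and not something taken from the paper (real coverage is more uneven). The function names are purely illustrative.

```python
import math


def prob_depth_below(mean_coverage: float, min_reads: int) -> float:
    """P(a site is covered by fewer than `min_reads` reads).

    Assumption: read depth at a site is Poisson with mean `mean_coverage`.
    """
    return sum(
        math.exp(-mean_coverage) * mean_coverage ** k / math.factorial(k)
        for k in range(min_reads)
    )


def prob_het_both_alleles_seen(mean_coverage: float) -> float:
    """P(both alleles of a heterozygous site are sampled at least once).

    Under the Poisson assumption, reads from each of the two chromosomes
    are independent Poisson variables with mean `mean_coverage / 2`.
    """
    p_allele_seen = 1.0 - math.exp(-mean_coverage / 2.0)
    return p_allele_seen ** 2


if __name__ == "__main__":
    for coverage in (4.0, 15.0, 42.0):
        print(
            f"{coverage:>4.0f}x  "
            f"P(no read) = {prob_depth_below(coverage, 1):.2e}  "
            f"P(het fully sampled) = {prob_het_both_alleles_seen(coverage):.3f}"
        )
```

Under these assumptions a single 4x genome misses one of the two alleles at a sizeable fraction of heterozygous sites, which is why borrowing information across individuals (imputation) matters so much for the low-coverage design.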

The pilot studies alone reveal an incredible amount of variation in the human genome. An individual genome contains on average about 375 loss-of-function variants and tens of thousands of variants in coding regions, split about equally between those that change the encoded amino acid and those that do not. As expected, most high-frequency variants found in the pilot study were already present in public databases. In addition, the study reports about 8 million novel variants. The authors explain the excess of lower-frequency variants in the exon data by purifying selection, under a neutral coalescent model with constant population size. This interpretation is not optimal, as a similar signature is produced by population growth, which was not taken into account. Most of the novel variants were found in populations with African ancestry, which is not surprising as most human diversity lies in African populations. Better resolution for African populations would therefore be advantageous for analyses.

It is common to say of genome projects that they are never finished. This applies not only to bridging gaps in the sequence, but also to the difficulty of finding the right reference genome for many differing individual genomes. The 1000 Genomes Project Consortium reports a brand new piece of genome of 3.7 million DNA base pairs. This fragment was found in great ape and other human sequences available in public databases.

To conclude, I believe that the 1000 Genomes initiative is a major breakthrough in human medical genetics. Open access to a tremendous amount of variation data will foster genome-wide association studies. In addition, such data are an important contribution to the study of human evolution. I look forward to 2012, when full-scale results are expected.

Durbin, R., & al. (2010). A map of human genome variation from population-scale sequencing Nature, 467 (7319), 1061-1073 DOI: 10.1038/nature09534

Classic Selective Sweeps Were Rare in Recent Human Evolution

With the rise of genomics and the availability of whole genome sequences, geneticists hope to be able to understand the recent adaptations humans underwent. Classic selective sweeps, where a beneficial allele arises in a population and subsequently goes to fixation, leave a specific pattern. Indeed, all variation is erased as the selected allele invades the population, and the neighboring neutral variation is also partially swept, with an intensity depending on the linkage with the selected region.

An example of classic selective sweep pattern. As the distance from the selected nucleotide increases, diversity increases. Fig. 2 from Hernandez et al. 2011.

The selective sweep pattern was used to find evidence for recent adaptation in humans. Many candidate genes for recent adaptation in humans were found. Nevertheless, the preeminence of classic selective sweeps compared with other modes of adaptation (like background selection or recurrent a.k.a. "soft" sweeps) is still unknown.

In this paper, the authors claim that classic selective sweeps were in fact rare in recent human evolution. They argue that the overall pattern found in genome scan studies can be explained by nearly neutral mechanisms alone (neutral evolution plus some purifying selection), without any positive selection going on. This casts doubt on our ability to detect regions under selection from molecular data with currently available techniques.

Their evidence is based on polymorphism data from 179 human genomes from the 1000 Genomes Project (see Durbin et al. 2010). The authors identified single nucleotide polymorphisms and pooled together all exons in order to see the overall sweep pattern around each substitution. The first blow to the preeminence of classic selective sweeps comes from the fact that synonymous and non-synonymous sites show the exact same sweep pattern. We would expect non-synonymous sites, as they should be the targets of adaptation, to show a stronger sweep pattern. Another concern comes from the comparison of the genetic data with the expectation under neutral evolution. They show (see fig. 3) that if classic selective sweeps are frequent (more than 10% of human-specific substitutions), we have the statistical power to detect a difference from a purely neutral evolution scenario. Nevertheless, we do not observe any difference between the genomic data and the neutral simulations.
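To make the "pooled sweep pattern" concrete, here is a minimal sketch of how diversity can be averaged in bins of distance from a focal substitution. It is only an illustration under my own assumptions: sites are biallelic, distance is given in arbitrary units (the paper works with genetic distance), and the function names and toy data are made up.

```python
from collections import defaultdict


def site_diversity(allele_count: int, n_chromosomes: int) -> float:
    """Unbiased per-site heterozygosity for a biallelic site."""
    p = allele_count / n_chromosomes
    return 2.0 * p * (1.0 - p) * n_chromosomes / (n_chromosomes - 1)


def diversity_by_distance(sites, bin_width):
    """Average diversity in bins of distance from the focal substitution.

    `sites` is an iterable of (distance, allele_count, n_chromosomes),
    pooled over all focal substitutions. Returns {bin_start: mean diversity}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for distance, allele_count, n_chromosomes in sites:
        bin_start = (distance // bin_width) * bin_width
        sums[bin_start] += site_diversity(allele_count, n_chromosomes)
        counts[bin_start] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


if __name__ == "__main__":
    # Hypothetical toy data: low diversity near the substitution, more far away.
    toy_sites = [(10, 1, 10), (20, 0, 10), (110, 5, 10), (150, 4, 10)]
    for bin_start, pi in diversity_by_distance(toy_sites, 100).items():
        print(bin_start, round(pi, 3))
```

A trough that deepens towards distance zero is the signature the authors look for; their argument rests on that trough looking the same around synonymous and non-synonymous substitutions.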

Comparison of simulations under a neutral model with a model with selection, and the actual human genomes data. What is interesting in panel A is that the power is strong for all fractions of the genome under selection the authors tested (alpha parameter). Therefore the authors claim that if classic selective sweeps are frequent in the population, we should be able to detect a significant departure from neutrality. Panel B completes the argument as we can see that all curves (neutral model and human genome data) are merged. Considering that we should have the power to detect a departure from neutrality, the authors claim that the neutral scenario cannot be rejected. Fig. 3 from Hernandez et al. 2011.

They conclude that classic selective sweeps should not have been the major mode of adaptation in recent human evolution.

I personally was not convinced by the relevance of using a mean pattern, over all coding regions, to attest that classic sweeps were rare in human evolution. Indeed, most coding regions have not experienced a selective sweep in the past, and thus the mean pattern should not differ from a neutral or background selection model. Nevertheless, the authors anticipated this argument, as they ran simulations where only a fraction of the genome is under positive selection. And as I wrote above, they show that we should be able to discriminate between selection and background mutation, even if the proportion of loci under selection is as low as 10% of human-specific substitutions.
During our discussion we raised another concern, regarding the parameter range covered in their simulations. The authors tested the power to distinguish selection from neutrality with several fractions of the genome under positive selection, but did not test a wide range of selection coefficients. A selection coefficient of 0.01 already seems very large, and the question remains whether, with weaker selection, we would still expect to see a difference in the mean pattern of diversity over all exon SNPs.
In conclusion, I believe that the authors showed that so far we can only detect classic AND very strong selective sweeps from molecular data. In my opinion, this means that we can rarely detect classic selective sweeps. The question remains whether classic but weaker selective sweeps were rare in recent human evolution.

Hernandez, R., Kelley, J., Elyashiv, E., Melton, S., Auton, A., McVean, G., 1000 Genomes Project, Sella, G., & Przeworski, M. (2011). Classic Selective Sweeps Were Rare in Recent Human Evolution Science, 331 (6019), 920-924 DOI: 10.1126/science.1198878

Monday, November 28, 2011

Modes of Adaptation in Recent Human Evolution

Since their first appearance, humans have colonized most parts of the world. They have undergone multiple adaptations to a wide range of disparate habitats, which led to the appearance of different phenotypes. Dark skin and hair, for example, are an evolutionary adaptation that protects against high levels of solar radiation. An adaptive trait can be fixed in a population through natural selection acting on new point mutations or on standing genetic variation.

In their article “Classic Selective Sweeps Were Rare in Recent Human Evolution”, Hernandez et al. 2011 were interested in the modes of natural selection that shaped human adaptations. To date, most studies suggest that the principal mode of adaptation is positive selection on new mutations: a beneficial mutation appears in a population and is rapidly fixed. The resulting decrease in neutral diversity at linked sites is called a ‘classic selective sweep’. Hernandez et al. 2011 asked whether other types of selection, and not only classic selective sweeps, could have been involved in human adaptation.

Resequencing data for 179 human genomes from “three” populations (African, Chinese/Japanese and European) were investigated. The authors assessed average diversity levels as a function of genetic distance from the nearest exon and the nearest conserved non-coding region. If functional changes in amino acids resulted in classic selective sweeps, diversity would be more strongly reduced around non-synonymous substitutions than around synonymous substitutions. This pattern has already been confirmed in Drosophila simulans. Interestingly, the authors found a similar decrease around both synonymous and non-synonymous substitutions. Hence, they suggest that strong purifying selection on linked sites explains the pattern instead. So far it has been believed that synonymous sites evolve neutrally in mammals, but recent studies demonstrate that synonymous sites are important for mRNA stability and for correct splicing. So, could the decrease in diversity also be linked to positive selection?

Moreover, tests for classic sweeps were carried out by comparing the genetic differentiation of the three populations. An enrichment of highly differentiated single nucleotide polymorphisms (SNPs) between pairs of populations was found in genic regions. So, according to Hernandez et al. 2011, at least some SNPs might have evolved through the action of positive selection.
However, tests of highly differentiated alleles at non-synonymous sites, transcription start sites and 5’ or 3’ untranslated regions against the genomic background were only marginally significant or not significant at all. This suggests that the differentiated alleles were most probably selected from standing genetic variation. This is supported by the fact that alleles with very large differences in frequency often segregate in both compared populations and tend to lie on shorter haplotypes than expected from classic sweeps. But ‘neutral sweeps’ might also have occurred during evolution. The probability is quite low, but when populations expand, alleles can get fixed by chance, which is a genetic signature of the ‘founder effect’.
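Genetic differentiation between a pair of populations is commonly summarised per SNP by a statistic such as FST. The sketch below uses a basic Wright-style definition, FST = (H_T - H_S) / H_T, with equal weight for both populations and no sample-size correction. This is a simplification chosen for illustration and is not necessarily the estimator used by Hernandez et al.; the function name is hypothetical.

```python
def fst_two_populations(p1: float, p2: float) -> float:
    """Wright-style FST for one biallelic SNP and two equally weighted populations.

    `p1` and `p2` are the frequencies of the same allele in each population.
    """
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0.0:
        return 0.0  # site is monomorphic in both populations
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    print(fst_two_populations(0.5, 0.5))  # identical frequencies
    print(fst_two_populations(0.1, 0.9))  # strongly differentiated
    print(fst_two_populations(0.0, 1.0))  # fixed difference
```

A "highly differentiated" SNP in the sense used above would sit in the upper tail of the genome-wide distribution of such a statistic.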

All in all, many of the hypotheses that were suggested remain unanswered and are left to future research. The figures were hard to understand, especially where the legend is not comprehensive. It also took some time to go through the article, which refers many times to the supplementary material (54 pages!). But I really appreciate the effort to give a short and comprehensibly written overview of the huge amount of work that was carried out.

Hernandez R.D., Kelley J.L., Elyashiv E., Melton S.C., Auton A., McVean G., 1000 Genomes Project, Sella G., Przeworski M. (2011). Classic Selective Sweeps Were Rare in Recent Human Evolution Science, 331 (6019), 920-924 DOI: 10.1126/science.1198878

Saturday, November 26, 2011

Positive selection, recombination hot spots and resistance to antimalarial drugs in P. falciparum: the way to a treatment against malaria?

Plasmodium falciparum is a protozoan parasite that causes malaria in humans. An estimated 781,000 people died from malaria in 2009 according to the World Health Organization. Various treatments against malaria have existed since 1891, such as Atabrine, chloroquine (CQ) or artemisinin (ART), but no vaccine is available yet, and drug resistance is increasing in P. falciparum populations as the parasite evolves.

Genomic information is of great importance for understanding resistance to antimalarial drugs. To study possible treatments, a group of researchers worked on Plasmodium falciparum to detect variation in recombination rate, loci under recent positive selection and genes associated with drug responses. For this work, the researchers used genome-wide association studies (GWAS), a method that tests whether single-nucleotide polymorphisms (SNPs) are associated with a trait, here the response to antimalarial drugs.

The authors collected and adapted 189 independent P. falciparum isolates: 146 from Asia (specifically, Thailand and Cambodia), 26 from Africa, 14 from America and 3 from Papua New Guinea. Antimalarial drug resistance of P. falciparum differs according to geographic origin, so the authors' choice of sampling is sound but not well balanced. Using population genetics and stratification methods, the authors showed that the parasites could be clustered into continental populations. A principal component analysis (PCA) shows that SNPs can distinguish parasites with different phenotypes.

Population recombination maps were generated for all 14 chromosomes to detect variation in recombination rate. Recombination hot spots appeared to be conserved among populations. The authors detected several loci with extremely high levels of recombination activity, including a locus at the end of chromosome 1 and a segment on chromosome 7 containing pfcrt (the gene encoding the P. falciparum CQ resistance transporter).

Three different methods were used to define loci under significant positive selection: relative extended haplotype homozygosity (REHH), integrated haplotype scores (iHS) and cross-population extended haplotype homozygosity (XP-EHH). Using the REHH method, multiple loci under positive selection were detected, such as a locus on chromosome 7 containing pfcrt, a locus on chromosome 11 containing the gene encoding P. falciparum apical membrane antigen 1 (pfama-1) and a locus on chromosome 13 containing PF13_0271, which encodes an ATP-binding cassette (ABC) transporter. The pfama-1 and pfcrt loci, as well as new SNP loci, were detected using the iHS method. The XP-EHH method compares populations and allows the detection of selective sweeps that drove some alleles to fixation in one population while they remain polymorphic in others. A total of 11 genes under significant selection were detected by all three methods.
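All three statistics named above build on extended haplotype homozygosity (EHH): the probability that two randomly chosen chromosomes carrying a given core allele are identical over the whole stretch from the core to a given marker. The sketch below computes plain EHH only, on phased haplotypes encoded as strings. The encoding, the function name and the toy data are my own assumptions, and the normalisation steps that turn EHH into REHH, iHS or XP-EHH are not shown.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_index, core_allele, end_index):
    """EHH from the core SNP out to `end_index` (inclusive).

    `haplotypes` is a list of equal-length strings, one character per SNP.
    Only haplotypes carrying `core_allele` at `core_index` are considered.
    """
    lo, hi = sorted((core_index, end_index))
    carriers = [h[lo:hi + 1] for h in haplotypes if h[core_index] == core_allele]
    n = len(carriers)
    if n < 2:
        return 0.0  # no pair of chromosomes to compare
    identical_pairs = sum(comb(c, 2) for c in Counter(carriers).values())
    return identical_pairs / comb(n, 2)


if __name__ == "__main__":
    # Hypothetical phased haplotypes; the core SNP is at position 0.
    toy = ["1000", "1000", "1001", "1011", "0110"]
    for end in range(4):
        print(end, ehh(toy, core_index=0, core_allele="1", end_index=end))
```

EHH starts at 1 at the core and decays with distance as recombination and mutation break up the haplotypes. An allele that rose in frequency quickly under positive selection keeps a high EHH over an unusually long distance, which is what the three tests exploit.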

The half-maximal inhibitory concentration (IC50) measures the effectiveness of a compound in inhibiting a biological or biochemical function. In the study, the IC50 of the parasites was measured to detect genes associated with drug responses. Multivariate analyses showed a strong positive correlation between the IC50 values of mefloquine (MQ) and dihydroartemisinin (DHA), and a general sensitivity to piperaquine (PQ) and DHA in all the parasites. The authors detected higher drug resistance in the Cambodian population.

This publication is very interesting: the authors identified many genes under positive selection, some of which could be drug or immune targets. With further studies, we can hope to obtain an effective treatment against malaria. For a Nature Genetics publication, the authors could have been more careful with small points, such as not changing the color that represents each population from one graph to the next.

Mu, J., Myers, R., Jiang, H., Liu, S., Ricklefs, S., Waisberg, M., Chotivanich, K., Wilairatana, P., Krudsood, S., White, N., Udomsangpetch, R., Cui, L., Ho, M., Ou, F., Li, H., Song, J., Li, G., Wang, X., Seila, S., Sokunthea, S., Socheat, D., Sturdevant, D., Porcella, S., Fairhurst, R., Wellems, T., Awadalla, P., & Su, X. (2010). Plasmodium falciparum genome-wide scans for positive selection, recombination hot spots and resistance to antimalarial drugs Nature Genetics, 42 (3), 268-271 DOI: 10.1038/ng.528

Thursday, November 17, 2011

Parallel Evolution in Threespine Stickleback

The threespine stickleback (Gasterosteus aculeatus) is a fish species with coastal and freshwater forms that lives in marine, estuarine and freshwater habitats throughout the Northern Hemisphere. Previous studies suggested that the freshwater stickleback populations might have diverged independently from oceanic populations less than 10,000 years ago. Indeed, the search for new space might have caused migration to unexplored freshwater habitats. Among threespine stickleback populations there is huge phenotypic variation, mainly due to adaptation to differences in feeding behaviour and defence mechanisms. For example, the lateral plate armor is present in oceanic populations but has been lost in many derived freshwater populations. This is of particular importance because, despite little or no gene flow among freshwater populations, life history traits appear independently in populations of similar habitats.

The stickleback's evolutionary history and extraordinary phenotypic diversity make it well suited to studying the genetic changes that underlie adaptation to new environments. Moreover, recent advances in genome biology and next-generation sequencing techniques make it possible to address questions about evolutionary processes acting at a genomic scale in natural populations.

In this paper (“Population Genomics of Parallel Adaptation in Threespine Stickleback using Sequenced RAD Tags”), the main goal of Hohenlohe et al. 2010 was to assess whether the rapid adaptation of freshwater populations and their phenotypic similarities might be due to parallel genetic evolution. To this end, 100 individuals from two oceanic and three freshwater populations were analysed using Illumina-sequenced libraries of restriction-site associated DNA (RAD) tags.

Using RAD tags has many advantages because markers are discovered, verified and investigated simultaneously. By generating a large number of single nucleotide polymorphisms (SNPs), the method is also likely to cover a large proportion of the linkage disequilibrium (LD) blocks involved in stickleback adaptation, and thus to detect even private alleles in natural populations. Interestingly, Hohenlohe et al. 2010 did not find any private alleles in the freshwater populations. The authors therefore suggested that selection in freshwater populations has acted on haplotypes that were extremely rare in the oceanic populations. This is consistent with the hypothesis that genetic variability in freshwater populations is mainly the result of selection on standing genetic variation present in the oceanic stock.
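A private allele is simply an allele observed in one population and in none of the others. The following minimal sketch shows the bookkeeping, assuming each population has already been reduced to a set of observed (locus, allele) pairs. The data layout, population labels and function name are hypothetical and are not the pipeline used in the paper.

```python
def private_alleles(alleles_by_population):
    """Return, for each population, the alleles seen in no other population.

    `alleles_by_population` maps a population name to a set of
    (locus, allele) tuples observed in that population.
    """
    private = {}
    for pop, alleles in alleles_by_population.items():
        seen_elsewhere = set()
        for other_pop, other_alleles in alleles_by_population.items():
            if other_pop != pop:
                seen_elsewhere |= other_alleles
        private[pop] = alleles - seen_elsewhere
    return private


if __name__ == "__main__":
    # Hypothetical toy data: every freshwater allele also occurs in the ocean.
    toy = {
        "oceanic": {("snp1", "A"), ("snp1", "G"), ("snp2", "C"), ("snp2", "T")},
        "lake_1": {("snp1", "G"), ("snp2", "C")},
        "lake_2": {("snp1", "G"), ("snp2", "T")},
    }
    for pop, alleles in private_alleles(toy).items():
        print(pop, sorted(alleles))
```

In this toy example only the oceanic population has a private allele, which mirrors the situation described above: the freshwater populations carry a subset of the variation already present in the oceanic stock.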

Signatures of selection were found across six different linkage groups, and some, like the one for the lateral plate phenotype, agree with previous QTL mapping. Moreover, signs of balancing selection were also found in regions implicated in pathogen resistance and immune responses. Hohenlohe et al. 2010 argued that the loss of armor in all three independently derived populations confirms parallel genetic evolution. However, parallel evolution is the development of the same trait in two distinct species, whereas this article focused on populations of the same species. Therefore, it remains debatable whether one can speak of parallel evolution in threespine stickleback, even though it seems most likely.
Although this article was not the easiest one and was repetitive in places, I find the results striking. This study is one of the first to use RAD tags for genome-wide sequencing in natural populations, and it gives a lot of ideas for future research. RAD tags seem to deliver reliable results especially for researchers who do not work on model organisms, because large numbers of SNPs can be found even without a reference genome. These can then be used for genome-wide association studies and the search for candidate genes.

Hohenlohe, P., Bassham, S., Etter, P., Stiffler, N., Johnson, E., & Cresko, W. (2010). Population Genomics of Parallel Adaptation in Threespine Stickleback using Sequenced RAD Tags PLoS Genetics, 6 (2) DOI: 10.1371/journal.pgen.1000862

Wednesday, November 9, 2011

Tutorials on whole exome sequencing

Slightly off topic relative to the usual journal-club posts, but I think that this is relevant to understanding where we are going with genomics for studying variability:

Next-Gen 101: Video Tutorial on Conducting Whole-Exome Sequencing Research from the National Human Genome Research Institute

(from the Nielsen lab blog)

Monday, November 7, 2011

RAD tagging adaptation

The threespine stickleback, Gasterosteus aculeatus, is a small fish that inhabits marine, estuarine and freshwater habitats across the Holarctic. It has previously been inferred that in many regions, freshwater populations derived from oceanic ancestors. As long as the freshwater populations are in different drainage systems, they can be considered independent of each other. Those natural replicates are one of the reasons why sticklebacks are a model system to study adaptive evolution.

Sticklebacks adapt to freshwater habitats in a recurrent manner by modifying several key phenotypic traits. Many studies focused on identifying those traits and measuring their heritability or fitness properties. At the phenotypic level, there is a striking parallelism between derived freshwater populations, but what is unclear is how much this parallelism is underlain by genome-wide patterns of parallel evolution.

That is the main question that Hohenlohe et al. tackled in their 2010 paper entitled "Population Genomics of Parallel Adaptation in Threespine Stickleback using Sequenced RAD Tags". They compared the genomes of fish originating from three lakes and two coastal saltwater habitats located along Alaska's southern coast. The three lakes were chosen in different drainage systems to have three independent instances of adaptation to freshwater (and maybe to have an excuse to hike from one sampling point to the other?).

The approach they developed (RAD tags) makes it possible to detect single-nucleotide polymorphisms (SNPs) across the whole genome. The data processing analysis is nicely illustrated here. Such a method produces an enormous amount of results. There is so much data that any dubious point can be discarded prior to the final analysis, keeping only the SNPs that have the highest probability of representing actual polymorphism in the populations.

The results first confirm the classical hypothesis of a large oceanic population giving rise to divergent freshwater populations. They also found many genomic regions showing signatures of balancing and divergent selection across all three freshwater populations. This suggests that phenotypic evolution occurs through parallel genetic evolution at the genome scale. Interestingly, using the annotated stickleback genome, they could identify candidate genes that are linked with phenotypic changes.

While some parts of the methods lack transparency, the results they get are highly convincing. The fact that they were able to show parallelism at the genome level and then identify candidate loci that are important in the adaptive process is really interesting, because it may motivate many in-depth studies on specific genes or pathways that have been shown to be related to adaptation. Regarding the paper, it took some time and attention to clearly understand the figures (mostly 6, 7 and 8). They hold tons of results and are not straightforward to grasp quickly. In conclusion, the correlative patterns outlined by this research are striking, but call for experiments designed to test specific hypotheses on particular genomic regions.

Hohenlohe, P., Bassham, S., Etter, P., Stiffler, N., Johnson, E., & Cresko, W. (2010). Population Genomics of Parallel Adaptation in Threespine Stickleback using Sequenced RAD Tags PLoS Genetics, 6 (2) DOI: 10.1371/journal.pgen.1000862

Friday, November 4, 2011

Genome evolution and adaptation in a long-term experiment with Escherichia coli

According to Darwin, adaptation is a gradual process. The rate of adaptation varies, for reasons that are unknown. It is well known that genomic changes are linked with adaptation, but the exact relationship remains elusive. With imperfect knowledge of an organism's genetics and a complicated environment, it is difficult to draw clear conclusions. This paper therefore designed an experiment using a tractable model organism in a controlled laboratory environment, in order to minimize confounding factors and complexity. Moreover, the authors sequenced complete genomes to find the mutations responsible for particular adaptations. This also makes it possible to find out whether the dynamics of genomic and adaptive evolution are coupled tightly or only loosely.

In the first step, they sequenced the genomes of E. coli clones sampled at generations 2K, 5K, 10K, 20K and 40K. Through 20K generations, 45 mutations were identified; moreover, the number of mutational differences between the ancestral and evolved genomes accumulated in a near-linear fashion over this period. Neutral mutations, which are not beneficial, should accumulate by drift at a uniform rate. However, in this experiment the fitness trajectory shows profound adaptation that is not linear. In particular, the rate of fitness improvement decelerates over time, which would suggest that the rate of genomic evolution should decelerate too. They explore the relationship between the rates of adaptation and genomic evolution under three scenarios. The model predicts either declining rates of both adaptive and genomic evolution or, alternatively, no deceleration in either trajectory.

In the second step, they argued that the mutations are predominantly beneficial using four lines of evidence. 1) The results challenge the drift hypothesis: the probability of observing no synonymous substitutions is only 0.07%, so the mutations are unlikely to be neutral; 2) in most cases, the evolved alleles differed between the populations; 3) almost all mutations in the earlier clones were transmitted to subsequent generations, which is against the drift hypothesis; 4) the derived alleles perform better in competition, which contrasts with the neutral drift hypothesis. To sum up, the mutations offer an advantage in the same environment and beneficial substitutions are predominant. A preponderance of neutral substitutions cannot explain the rate disparity.

In the study, they observed that the rate of genomic evolution is elevated in later generations; typically, the frequency of the mutT gene mutation is much higher at 40K than in the earlier generations. They sequenced the site of the mutT frameshift in clones and found that the mutation appeared around generation 26,500 and soon became dominant. However, unlike before 20,000 generations, only a small fraction of the new mutations is beneficial. To verify this observation, they examined the proportion of synonymous mutations after the mutator phenotype evolved, to determine whether it is consistent with a random distribution across sites. They then found that in the 40K genome the frequency of the new base substitutions is lower than in the earlier genomes, indicating that a high proportion of late-arising changes are also neutral, or nearly so, under the conditions of the evolution experiment.

In the end, they conclude that mutations accumulated at a near-constant rate even as fitness gains decelerated over the first 20,000 generations. On the other hand, the rate of genomic evolution accelerated markedly when a mutator lineage became established later.

Overall, I think this paper provides a good model to explore the long-term dynamic coupling between genome evolution and adaptation, including the effects of clonal interference, compensatory adaptation, and changing mutation rates. But as far as I am concerned, the authors should have included more figures to support their points. I have the impression that the paper relies on too much text and too few clear figures.

Barrick, J., Yu, D., Yoon, S., Jeong, H., Oh, T., Schneider, D., Lenski, R., & Kim, J. (2009). Genome evolution and adaptation in a long-term experiment with Escherichia coli Nature, 461 (7268), 1243-1247 DOI: 10.1038/nature08480