pollard:research
Links:
Latest Publications:
Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm - PLoS Biology

Evolution of genes and genomes on the Drosophila phylogeny - Nature
Overview
Broadly, I am interested in the molecular genetic basis for morphological and behavioral diversity. My thesis work has concentrated on studying variation in cis-regulatory sequences that control gene expression in developing insects. While working on regulatory evolution I have explored many areas of comparative genomics and systematics.
Transcription Factor Binding Site Evolution Modeling

Alignment

Selection On cis-Regulatory Sequences

Phylogenetics

Genome Annotation

Motif Modeling
Transcription Factor Binding Site Evolution Modeling
A few very interesting examples of changes in cis-regulatory sequences causing changes in gene expression and ultimately morphology have been documented (e.g. Prud'homme et al 2006), but the extent to which cis-regulatory sequence variation results in functional and morphological variation is still quite unclear. To address this question, the classic phenotype-to-genotype approach can be turned on its head and phenotype can potentially be read from genotype. This is no small challenge. An initial step in this process was to develop a model of how the binding sites for transcription factors evolve. Transcription factor binding sites are degenerate 6-20 base pair sequences contained in cis-regulatory regions. Adapting a character frequency selection model originally proposed for studying codon evolution (Halpern & Bruno 1998), a fellow student in the Eisen lab, Alan Moses, showed that the rate of evolution at each position in a binding site could be accurately predicted knowing just the biochemical specificity of a transcription factor for the family of sequences it binds (Moses et al 2003). This led to the development of a method, called MONKEY, to scan multiple sequence alignments for sequences that both look like binding sites for a transcription factor and evolve as we would expect such sites to evolve (see publications).
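To make the idea of predicting per-position rates from binding specificity concrete, here is a minimal sketch of a Halpern & Bruno style rate calculation. It assumes a uniform (Jukes-Cantor-like) mutation process and a toy two-column frequency matrix with no zero entries; the function names, the matrix values and the mutation model are illustrative assumptions, not the MONKEY implementation.

```python
import math

BASES = "ACGT"

def hb_rate(f_a, f_b, q_ab, q_ba):
    """Substitution rate from base a to base b under a Halpern-Bruno style model.

    f_a, f_b are the equilibrium (binding-site) frequencies of the two bases,
    q_ab, q_ba are the underlying mutation rates between them.
    """
    x = (f_b * q_ba) / (f_a * q_ab)
    if abs(x - 1.0) < 1e-12:
        return q_ab  # neutral limit: no preference between a and b
    return q_ab * math.log(x) / (1.0 - 1.0 / x)

def position_rate(column, q=1.0 / 3.0):
    """Expected substitution rate at one site position, relative to neutral.

    Assumes a uniform mutation rate q between every pair of bases, so a
    column with no base preference evolves at rate 1.
    """
    total = 0.0
    for a in BASES:
        for b in BASES:
            if a != b:
                total += column[a] * hb_rate(column[a], column[b], q, q)
    return total

# Hypothetical frequency matrix: one highly specific column, one unconstrained.
pwm = [
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
rates = [position_rate(col) for col in pwm]
print(rates)
```

In this sketch the highly specific column evolves much more slowly than the unconstrained one, which is the qualitative pattern the model predicts. Real matrices would need pseudocounts so that no frequency is exactly zero.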

With such a model, we proposed that not only could functionally conserved transcription factor binding sites be predicted in regulatory regions, but binding sites known to be functional in one species that have lost function in other species could be identified based on their deviation from our model of a conserved site. To systematically discover such cases of binding site turnover (gain and loss), we developed a statistic to reject the hypothesis that a site is functionally conserved across an alignment of multiple species, implemented this statistic in the MONKEY platform and then applied this method to chromatin IP on genome-wide tiling array (ChIP-chip) data generated for the Zeste transcription factor in Drosophila embryos (see publications). Controlling for alignment errors and noise in our inference of functional binding sites within the bound regions, we found that more than 5% of functional binding sites in Drosophila melanogaster have been gained or lost within the Drosophila melanogaster species subgroup (~10 million years) and that most non-conserved sites were either gained along the branch to Drosophila melanogaster or lost along another branch in the tree. This approach is now being applied to a large panel of embryonic transcription factors, and functional studies of binding locations in other species are being implemented.
Alignment
The comparison of sequences across individuals and species requires a method for inferring the homology of chromosomes, loci and/or individual bases. Local alignment tools such as BLAST are well suited for matching up chromosomes and loci, but establishing homology at single-base resolution often requires more sensitive methods. Most such methods are based on the Smith-Waterman and Needleman-Wunsch dynamic programming algorithms. However, many methods have recently been developed to align large stretches of syntenic sequence that contain varying levels of divergence and constraint (e.g. Brudno et al 2003). To understand the potential of tools to make correct inferences about noncoding sequence homology and to aid in the future development of such methods, I compared the accuracy of a suite of alignment tools under various evolutionary scenarios generated with molecular evolution simulations (see publications). The newer methods that utilized local alignments of nearly perfect matches as anchors for more aggressive global alignment proved most sensitive, and the ability of such local searches to recover interspersed blocks of constraint was remarkably specific.
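For readers unfamiliar with the dynamic programming approach mentioned above, the following is a textbook sketch of Needleman-Wunsch global alignment with a linear gap penalty. The scoring values are arbitrary assumptions chosen for illustration, and this is not one of the anchored alignment tools compared in the study.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

print(needleman_wunsch("ACGT", "AGT"))
```

The quadratic time and memory cost of this table is one reason that anchoring strategies are used for long syntenic regions.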

Knowing the accuracy of the tools is useful, but for most applications what really matters is the effect of alignment accuracy on a specific inference about evolution. In my own research, two inferences were of central importance: the ability to properly align conserved transcription factor binding sites and the ability to accurately infer divergence distances from multiple alignments. To address these questions, I implemented a molecular evolution simulation platform, called CisEvolver, that can evolve both neutrally evolving noncoding sequences and regulatory sequences containing conserved transcription factor binding sites down a phylogenetic tree (see publications). Using simulated sequences produced by CisEvolver, I found that multiple alignment accuracy is primarily determined by the divergence distance separating the two most diverged species in an alignment, that binding sites are often misaligned (though often still overlapping) at relatively short divergence distances and that divergence distances are systematically underestimated beyond a tool-specific divergence distance. Further, I found that the accuracy of alignment, binding site alignment and divergence estimation varies across branches in a tree and is best for branches connecting sister taxa and worst for internal branches in a tree. These results helped focus our research on species with divergence distances within an acceptable amount of error in inferences and ought to be broadly informative for evolutionary studies of genomic sequences.
Selection On cis-Regulatory Sequences
Perhaps the major biological assumption of comparative genomics is that functional elements in the genome will be subject to purifying selection and will therefore accumulate fewer fixed mutations than non-functional sequences. After the completion of the human and mouse genomes, researchers began exploring the conservation landscape across the genome and found many interesting conserved non-coding sequences (e.g. Mayor et al 2000). Many of these conserved sequences turned out to be cis-regulatory elements (e.g. Loots et al 2000). The power of such an analysis is undeniable, yet such findings tell us little about the variation in selection acting on all cis-regulatory elements. To address this question, I have been studying variation in the complete set of known cis-regulatory elements in Drosophila melanogaster.

In collaboration with Dan Halligan in Peter Keightley's group, I've been looking at selective constraint (the fraction of mutations removed due to purifying selection) in cis-regulatory elements as well as a large panel of other genomic features. Our initial findings suggest that transcription factor binding sites are under nearly the same level of constraint as non-synonymous sites in protein coding sequences. This constraint is not uniform, though, with significant variation across loci and chromosomes. The overall levels of constraint found in the genome suggest that most noncoding DNA could be composed of cis-regulatory elements or ncRNAs.
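As a rough illustration of the definition of constraint given above, the sketch below estimates it as one minus the ratio of observed substitutions to the number expected if the feature evolved at a neutral rate. The function name and the numbers are hypothetical, and the actual analysis involves corrections (for example for base composition) that are not shown here.

```python
def selective_constraint(observed_subs, n_sites, neutral_rate):
    """Estimate constraint as 1 - observed / expected substitutions.

    neutral_rate is the per-site substitution rate in a putatively neutral
    reference class; it is an input assumption, not something computed here.
    """
    expected = neutral_rate * n_sites
    if expected <= 0:
        raise ValueError("expected number of substitutions must be positive")
    return 1.0 - observed_subs / expected

# Hypothetical example: 20 substitutions in 1000 sites, neutral rate 0.1 per site.
constraint = selective_constraint(20, 1000, 0.1)
print(round(constraint, 3))
```

A value near 0 indicates a feature evolving at roughly the neutral rate, while a value near 1 indicates that almost all mutations have been removed by selection.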

Together with a consortium of population genetics researchers, I have been using whole genome light shotgun sequences of ten Drosophila melanogaster strains to further examine selection acting on cis-regulatory elements.
Phylogenetics
The relationship of species is typically represented by a bifurcating graph referred to as a phylogenetic tree. The topology of the tree represents the order of speciation events and the lengths of the branches indicate the time between speciation events and terminal states (usually the present day). The correctness of a phylogenetic tree will determine the accuracy of any evolutionary inference conditioned on the tree. Therefore, it is very important to construct a good tree for commonly compared species.

Twelve species of the genus Drosophila were recently sequenced, providing the scientific community with an unprecedented data set. The relationship of the species D. yakuba and D. erecta relative to the D. melanogaster species complex was conflicted in the literature from analyses of relatively small data sets (at most six genes). Each of the three possible trees relating these three taxa had received some published support. With the whole-genome sequences available, I set out to evaluate the support for the different trees (see publications). My initial assumption was that sampling variance from small data sets was the cause of the incongruence, but whole-genome phylogenetic analysis revealed that the incongruence was well supported. The tree supporting D. yakuba and D. erecta as sister species did, however, have the most support. An examination of the branch lengths in this best supported tree revealed that the time between the split of D. melanogaster from the D. yakuba/D. erecta lineage and the split of D. yakuba from D. erecta was very short, perhaps much less than a million years. I then hypothesized that the incongruence was not due to methodological error but rather was real and the result of incomplete lineage sorting. This is a process by which polymorphisms are maintained through multiple rapid speciation events, leading to a tree that does not represent the species splits but rather the order in which the polymorphisms arose. Some fraction of the time, two more distantly related species will have inherited the same allele while a species more closely related to one of these species will have inherited a different allele. Spatial patterns of support for different trees varied with recombination across the genome, giving strong support to the lineage sorting hypothesis.
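The dependence of incomplete lineage sorting on the length of the internal branch can be sketched with the standard three-taxon result from neutral coalescent theory: with internal branch length T in coalescent units (generations divided by twice the effective population size, for diploids), each of the two discordant gene trees has probability exp(-T)/3. The function name and the input values below are hypothetical and are not estimates for these Drosophila species.

```python
import math

def gene_tree_probabilities(t_generations, n_e):
    """Three-taxon gene tree probabilities under the neutral coalescent.

    t_generations: length of the internal species-tree branch in generations.
    n_e: effective population size of the ancestral (diploid) population.
    """
    big_t = t_generations / (2.0 * n_e)      # branch length in coalescent units
    discordant_each = math.exp(-big_t) / 3.0  # each of the two discordant trees
    return {
        "concordant": 1.0 - 2.0 * discordant_each,
        "discordant_each": discordant_each,
    }

# Hypothetical values: a short internal branch relative to population size.
probs = gene_tree_probabilities(t_generations=1_000_000, n_e=1_000_000)
print(probs)
```

As the internal branch shrinks toward zero, all three gene trees approach a probability of one third, which is why a very short interval between speciation events can produce well-supported but conflicting trees.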

Recent research has led to phylogenetic methods that explicitly model lineage sorting while inferring species phylogenies. I am currently applying these methods to the Drosophila phylogeny.

(If you are looking for my old trees page go here)
Genome Annotation
While the efficiency of genome sequencing technology increases exponentially, allowing researchers to produce ever more sequence from the broad array of organisms on the planet, the technology needed to annotate genomes is advancing at a modest pace. Ab initio annotation of genes, RNAs and cis-regulatory sequences typically coats much of the genome with annotations that are neither specific nor structurally correct (e.g. incorrectly merged or split exons). Homology based annotations have the potential to be much more accurate, particularly if high quality annotations from another species are used as a starting point. While ab initio methodologies have been thoroughly tested and are beginning to mature, homology based methods are only just beginning to be developed.

As the initial assemblies for the Drosophila pseudoobscura genome and then later the other ten Drosophila genomes were being made available, as well as the honey bee and the mosquito genomes, I, together with Venky Iyer in the Eisen lab, began developing a 'pipeline' for annotating the coding genes in each species. Our initial pipeline consisted of three basic steps. The first was to use the already very well annotated Drosophila melanogaster coding genes to identify putative orthologous regions in the other species using TBLASTN. The second was to create putative gene models (exons and introns) in each of these putative orthologous regions using GENEWISE. The third was to evaluate each model using its similarity (BLASTP) to all the Drosophila melanogaster genes. If a model matched back to the gene we started with in Drosophila melanogaster and satisfied a few other criteria, then we kept that model as an orthologous gene. Our initial annotations were used extensively in our own research, enabling us to build genome-scale phylogenetic trees (see phylogenetics), examine protein evolutionary rates for transcription factors we were studying and identify orthologous noncoding DNA for analysis of cis-regulatory regions and other genome features. Because we also made the annotations freely available to the community, we quickly became involved in the community efforts to annotate and analyse these insect genomes (see publications). While these annotation efforts were done with a great deal of care and are of high quality relative to an ab initio standard, many aspects of gene annotation, particularly consensus gene annotation at the transcript level, are in need of further development. This is an area of research that must accelerate its development to keep up with the incredible quantity of genome sequence being generated today.
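The third step of the pipeline, keeping a gene model only if it matches back to the gene it was built from, can be sketched as a simple filter over similarity search results. The data layout, the function name and the gene identifiers below are hypothetical; parsing of the actual BLASTP output and the additional criteria mentioned above are not shown.

```python
def keep_reciprocal_models(models):
    """Keep models whose best D. melanogaster hit is their source gene.

    models maps a model id to (source_gene, hits), where hits is a list of
    (dmel_gene, bit_score) pairs from searching the model against all
    D. melanogaster genes. Returns a dict of model id -> orthologous gene.
    """
    kept = {}
    for model_id, (source_gene, hits) in models.items():
        if not hits:
            continue  # no similarity to any D. melanogaster gene
        best_gene, _best_score = max(hits, key=lambda hit: hit[1])
        if best_gene == source_gene:
            kept[model_id] = source_gene
    return kept

# Hypothetical search results for three gene models in another species.
models = {
    "model1": ("geneA", [("geneA", 310.0), ("geneB", 120.5)]),
    "model2": ("geneB", [("geneC", 250.0), ("geneB", 200.0)]),  # likely a paralog
    "model3": ("geneD", []),
}
orthologs = keep_reciprocal_models(models)
print(orthologs)
```

The match-back check guards against assigning a paralog as the ortholog when a closer relative of the model exists elsewhere in the reference gene set.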
Motif Modeling
While genes are encoded in DNA as stretches of codons and noncoding RNAs as stems and loops, cis-regulatory sequences are understood to be collections of short degenerate recognition sequences for transcription factors. The identification of these recognition sequences is in practice very difficult. Typically, in vivo experimental techniques lack high resolution but can be done on a genome scale, while in vitro techniques are either single-locus-scale or are generalized to provide estimated affinities for all sequences. Regardless of the experimental data available, prediction of functional transcription factor binding sites requires some modeling of sites. A dominant model used today is the position weight matrix (PWM), which treats each position in a binding site as an independent multinomial distribution over the four bases. Ab initio and evidence-based approaches can both easily utilize PWMs for modeling and prediction purposes. Much of my research has been based on PWMs (see transcription factor binding site evolution modeling, alignment and selection on cis-regulatory sequences). The limitations of PWMs, however, are only just beginning to be explored (e.g. Maerkl & Quake 2007). A growing number of alternative models to the PWM are being developed (e.g. Udalova et al 2002), but none have been thoroughly tested. Much work remains for exploring this area of research.
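As a small illustration of how a PWM is used for prediction, the sketch below scores every window of a sequence with a log-odds score against a uniform background. The three-column matrix, the background frequencies and the function names are assumptions made for the example and do not describe any real transcription factor.

```python
import math

BASES = "ACGT"

def log_odds_matrix(pwm, background):
    """Convert per-position base frequencies to log2 odds against background."""
    return [
        {base: math.log2(column[base] / background[base]) for base in BASES}
        for column in pwm
    ]

def scan(sequence, pwm, background):
    """Score every window of the sequence; returns a list of (start, score)."""
    lom = log_odds_matrix(pwm, background)
    width = len(pwm)
    return [
        (start, sum(lom[k][sequence[start + k]] for k in range(width)))
        for start in range(len(sequence) - width + 1)
    ]

# Hypothetical 3 bp motif that prefers "TGA"; frequencies include no zeros.
pwm = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

scores = scan("ACTGAC", pwm, background)
best_start, best_score = max(scores, key=lambda item: item[1])
print(best_start, round(best_score, 2))
```

Because the score is a sum over positions, this model cannot capture dependencies between positions, which is one of the limitations that the alternative models mentioned above try to address.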

(if you are looking for my old matrices page, go here)