The proper functioning of a living cell depends on constant regulation of its biomolecular content, with many regulatory mechanisms controlling the RNA stage of gene products. Recent advances in high-throughput experimental technologies offer us an unprecedented ability to monitor how the RNA world, or transcriptome, varies under different conditions and between individuals. However, a major challenge is to translate the resulting vast amounts of data into useful information that unravels the underlying regulatory mechanisms, how they go awry in human disease, and how we may be able to design therapeutics. In this talk I will concentrate on deriving a regulatory code for an RNA processing mechanism termed splicing. During splicing, segments of a transcribed pre-mRNA molecule are removed while other segments, termed exons, are retained to form the mature mRNA transcript. Removal and retention of different segments, termed alternative splicing, may lead to thousands of different messages produced from the same genetic code, greatly expanding the complexity of the RNA universe. Alternative splicing occurs in more than 90% of human genes, and many links between splicing defects and human disease have been documented. I will describe how we combine high-throughput experiments with genomic data to derive a splicing code. The computationally derived code enables us to predict splicing outcomes directly from primary DNA sequence, infer novel regulatory mechanisms, and identify active cis-regulatory elements that may be involved in several neurodegenerative diseases.
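As a toy illustration of what "predicting splicing outcomes directly from primary sequence" means in practice, the sketch below trains a logistic regression on k-mer counts from synthetic sequences. Everything here (the features, the planted motif, the data) is illustrative only; the actual splicing code uses a far richer feature set and learning method.

```python
# Toy sketch: predict an exon's inclusion label from flanking-sequence k-mers.
# All data are synthetic; a 3-mer count feature is planted to drive the label.
from itertools import product
import random

import numpy as np
from sklearn.linear_model import LogisticRegression

KMERS = ["".join(p) for p in product("ACGU", repeat=3)]  # 64 k-mer features

def kmer_features(seq):
    """Count 3-mer occurrences in an RNA sequence (one 'word' per feature)."""
    return np.array([seq.count(k) for k in KMERS], dtype=float)

random.seed(0)
seqs = ["".join(random.choice("ACGU") for _ in range(200)) for _ in range(200)]
# Pretend a high dose of one motif promotes inclusion (purely illustrative):
labels = np.array([1 if s.count("UGC") >= 4 else 0 for s in seqs])

X = np.vstack([kmer_features(s) for s in seqs])
model = LogisticRegression(max_iter=1000).fit(X, labels)
print("training accuracy:", model.score(X, labels))
```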
Although there is only one human genome sequence, different genes are expressed in many different cell types and tissues, as well as in different developmental stages or diseases. The structure of this ‘expression space’ is still largely unknown, as most transcriptomics experiments focus on sampling small regions. We have integrated data from thousands of high-throughput gene expression experiments to construct the global map of gene expression for human and mouse. In my talk I will discuss the challenges of building such a map, the main results of its analysis, and the comparison of gene expression in human and mouse. I will also discuss how advances in high-throughput sequencing are making it possible to address new questions related to gene expression.
Human genetics has been dominated by genome-wide association studies for the past several years, and while these have led to the discovery of thousands of common variants that contribute to common diseases as well as sub-clinical phenotypes, they leave the vast majority of phenotypic variance unexplained. GWAS applied to gene expression profiles is likewise an extremely powerful tool for dissecting the genetic regulation of individual genes, but it has not led to a jump in our understanding of the reasons for the pervasive co-regulation of modules of thousands of transcripts. In this talk, I will discuss the research we are doing to try to dissect the basis for such co-regulation in human peripheral blood. I will describe the major patterns of environmental influence inferred from studies on three continents, concluding that cultural and behavioral factors operate over various time scales and have coordinated influences on specific pathways of gene activity related to immune function. This leads to the concept of a molecular taxonomy of the human leukocyte transcriptome. GWAS data from two Atlanta cohorts will then be used to argue that the genetic influence on transcription may change with chronic disease status, with implications for assessing the contributions that rare and common variants make to disease risk. New approaches to identifying the roles that specific transcription factors and hormone signaling pathways play in orchestrating coordinated transcriptional responses will be described, including a discussion of functional experiments underway to link basic studies of differential gene expression to individual variability in inflammatory and immune responses.
Horizontal Gene Transfer (HGT) has played an important role in the spread of genetic and metabolic innovations between distantly related organisms. The further one moves back in time, the more likely it becomes that a gene transfer originated in a lineage that has gone extinct since the transfer occurred. Because of HGT, the most recent common ancestors of different molecules did not all coexist in the organismal most recent common ancestor, aka the cenancestor or the Last Universal Common Ancestor (LUCA). The molecular cenancestors existed in different lineages and at different times. Reconstructing the amino acid composition of ancestral sequences allows rooting of the ribosomal phylogeny (calculated from ribosomal proteins), and provides information on which amino acids were late additions to the genetic code. Using the rooted ribosomal phylogeny as a backbone, one can begin reconstruction of the reticulate history of genomes, which is greatly impacted by highways of gene sharing between divergent organisms. A line of descent can be defined via the ancestry of the majority of genes passed on from ancestor to descendants over short periods of time. However, over long time intervals most of the genes contained in a genome might have been acquired by gene transfer from organisms that are not in a direct ancestor-descendant relationship with the recipient. For some gene families, divergent copies are maintained in higher taxonomic units, although most individual genomes contain only one or the other version of the gene. In the case of aminoacyl-tRNA synthetases (aaRSs), the replacement of one version with the other takes place mainly through transfer between close relatives; however, the transfers remain detectable because the two versions, which we named homeoalleles, are very divergent, some of them apparently predating the organismal cenancestor. In this case most gene transfers appear to be biased towards overall similarity between donor and recipient. Consequently, even though genes are frequently transferred, the large-scale molecular phylogeny for each version remains similar to the organismal phylogeny [4, 5]. In contrast, highways of gene sharing that are created through a gene transfer bias due to a shared ecological niche, such as between Thermotogales and Clostridia, can overwhelm the majority phylogenomic signal. Sequence reconstruction for aaRSs that diverged before the organismal LUCA confirms tryptophan as a late addition to the genetic code, but also suggests that for other amino acids tRNA-charging mechanisms existed that preceded the currently known aminoacyl-tRNA synthetases.
We have demonstrated that High Density Tiling Arrays (HDTAs) provide important advantages over standard expression arrays: errors occurring during preparation and handling of the RNA samples can be detected and accurately quantified, and probes in the intergenic space significantly help with the identification of transcription unit architecture. Essentially, a single array can be interpreted as an assembly of multiple technical replicates collected from the same sample. These virtual "replicates" can be exploited to assign reliable error bars to measurements of differential expression. The developed algorithms are applicable to both HDTA and RNA-seq data and provide quality guidance for evaluating experimental protocols.
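A minimal sketch of the virtual-replicate idea follows (data layout and numbers are assumed, not the speakers' implementation): the many probes tiling one transcription unit are treated as technical replicates, yielding a per-unit error bar on the log fold change.

```python
# Sketch: per-unit error bars from "virtual replicates" on a tiling array.
import numpy as np

def unit_log_ratio(probes_a, probes_b):
    """Mean log2 ratio and its standard error for one transcription unit.

    probes_a, probes_b: intensities of the same tiling probes in two conditions.
    """
    ratios = np.log2(np.asarray(probes_a) / np.asarray(probes_b))
    return ratios.mean(), ratios.std(ddof=1) / np.sqrt(len(ratios))

rng = np.random.default_rng(1)
a = rng.lognormal(mean=8.0, sigma=0.3, size=40)  # 40 probes, condition A
b = rng.lognormal(mean=7.0, sigma=0.3, size=40)  # same probes, condition B
m, se = unit_log_ratio(a, b)
print(f"log2 fold change = {m:.2f} +/- {se:.2f}")
```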
In Clostridium thermocellum we found over 1200 distinct transcription units, ranging from 1 to 16 genes, that by our estimates closely correspond to the underlying operon structure. Remarkably, a large majority of the units could be reliably determined from any pairwise comparison of the investigated conditions. We found that a very similar "palette of transcription" is utilized under very different scenarios: the same transcription units are regulated in response to a dramatic change of the growth medium and to a mild environmental stress.
On another track of the same project we have mapped the exact boundaries of transcription starts in the C. thermocellum genome. Direct experimental identification of these positions yields valuable insights into consensus promoter sequences, helps to refine gene annotations, and more. We found that the best HDTAs do contain information that can be used to map start positions. In more than 800 cases we were able to locate such starts with an estimated precision of 5-10 nucleotides. A customized version of the Welch t-test was used to find exact locations and provide an analytical estimate of the detection reliability.
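The abstract names a customized Welch t-test; the sketch below shows the generic version of such a scan on synthetic probe intensities (the customizations and the analytical reliability estimate are not reproduced here).

```python
# Sketch: locate a transcription start as the position where upstream
# "background" probes differ most sharply from downstream "transcribed" probes.
import numpy as np
from scipy.stats import ttest_ind

def locate_start(intensities, flank=10):
    """Slide a boundary along the probe series; return the position with the
    strongest downstream-vs-upstream Welch t statistic."""
    best_pos, best_t = None, 0.0
    for i in range(flank, len(intensities) - flank):
        t, _ = ttest_ind(intensities[i:i + flank], intensities[i - flank:i],
                         equal_var=False)  # Welch: unequal variances allowed
        if t > best_t:
            best_pos, best_t = i, t
    return best_pos, best_t

rng = np.random.default_rng(2)
signal = np.concatenate([rng.normal(5, 1, 60),    # background probes
                         rng.normal(9, 1, 60)])   # transcribed probes
print(locate_start(signal))  # should report a boundary near probe 60
```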
Programmable stem cell differentiation holds great promise for regenerative medicine. Replacing lost or damaged tissues in the adult body, where early developmental programs have been lost, might be accomplished by rewiring epigenetic cues that control cell fate-determining genes. We have applied synthetic biology to regulate developmental genes based on their chromatin signatures. Bioinformatics informs synthetic biology by identifying conserved protein motifs that can potentially be used as modules for constructing novel synthetic devices, such as artificial transcription factors (TFs). We used the conserved chromodomain from the human Polycomb protein, and homologues from other species, to construct "Pc-TFs" that recognize the repressive tri-methyl histone H3 lysine 27 signal and switch silenced genes to an active state. Pc-TF expression in U2OS osteosarcoma cells leads to increased transcription of the senescence locus CDKN2A (p16) and MMP12 in a chromodomain and transcription activation module-dependent manner. These results indicate that silenced developmental regulators can be re-activated by a synthetic transcription factor that interacts with chromatin rather than DNA. Our group currently uses bioinformatics to predict and determine the impact of Pc-TFs on cell phenotypes. In stem cells, repressive histone methylation marks are associated with genes that activate differentiation. The genomic distribution of histone methylation helps us to identify potential Pc-TF targets. The accuracy of these predictions is tested by ChIP mapping and expression profiling, which enable a deeper understanding of how Pc-TF engages with chromatin. We aim to identify cohorts of Pc-TF target genes that may be regulated in order to control stem cell fates for future applications in tissue regeneration.
The primary goal of the human genome project was accomplished a decade ago with determination of a draft human genome sequence, that is, a genetic map at single base-pair resolution. Accordingly, a key goal for the post-genomic era has been to map epigenomes at single base-pair resolution. This goal has already been achieved for DNA methylomes and for RNA transcriptomes, driven in large part by rapid progress in read-out technologies, including microarrays and massively parallel DNA sequencing. However, the all-important protein complement of the epigenome is complex and dynamic, and current chromatin maps are difficult to interpret and are relatively low in resolution. I suggest that one basic problem is that most efforts are aimed at mapping regulatory components of chromatin, such as histone ‘marks’, while the basic machinery that is responsible for generating these complex patterns remains poorly understood. Another problem is that the most widely used methodologies do not accurately represent the in vivo situation and are not designed to take full advantage of increasingly powerful DNA sequencing technologies. I will describe efforts to address these limitations, and show how single base-pair resolution mapping of the basic chromatin machinery is a potentially achievable goal. I will also describe how chromatin dynamics can be mapped genome-wide. Mapping the full protein complement of chromatin (‘the beef’) in space and time can lead to insights into chromosome biology, developmental processes and disease. Such efforts are also likely to create many opportunities for data analysis and computational modeling to generate ideas and suggest new experiments.
How stimulus-responsive gene expression is specified remains an important unresolved question. Studies of model gene promoters support the hypothesis of a Combinatorial Code of transcription factors (TFs), whereas recent studies of stimulus-specific TF activation dynamics suggest the hypothesis of a Temporal Code. Here, using an iterative approach of mathematical modeling and quantitative experimentation, we examined each hypothesis within the signal-responsive gene regulatory networks (GRNs) that control the endotoxin-inducibility of 715 mammalian genes. Surprisingly, we found gene clusters controlled by a single signal-responsive TF, or by TFs functioning independently in OR gates, but not cooperatively in AND gates. However, evidence for a Temporal Code was pervasive: considering signaling dynamics and RNA half-life control uncovered unexpected mechanisms of gene control and allowed us to recapitulate 83% of observed expression profiles in 27 conditions. Our results demonstrate that predictive GRN models cannot be based on TF-DNA binding events alone but must also consider dynamic cytoplasmic signaling.
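To make the OR-gate-plus-half-life idea concrete, here is a minimal sketch with synthetic parameters (not the authors' fitted model): synthesis follows the maximum of two TF activities, decay is set by the mRNA half-life, and the same transient input yields very different expression profiles for stable versus unstable transcripts.

```python
# Sketch: OR-gate synthesis with first-order mRNA decay (Euler integration).
import numpy as np

def simulate(tf1, tf2, half_life, dt=0.1):
    """OR-gate synthesis (max of the two TF activities) with decay k = ln2/t1/2."""
    k_deg = np.log(2) / half_life
    m = np.zeros(len(tf1))
    for t in range(1, len(tf1)):
        synthesis = max(tf1[t], tf2[t])  # OR gate; an AND gate would use min()
        m[t] = m[t - 1] + dt * (synthesis - k_deg * m[t - 1])
    return m

time = np.arange(0, 60, 0.1)
pulse = np.where(time < 10, 1.0, 0.0)           # transient TF activity
off = np.zeros_like(pulse)                      # second TF silent
short = simulate(pulse, off, half_life=1.0)     # unstable mRNA tracks the pulse
stable = simulate(pulse, off, half_life=20.0)   # stable mRNA integrates it
print(f"peak, t1/2=1: {short.max():.2f}; peak, t1/2=20: {stable.max():.2f}")
```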
Biological networks are at the core of complex cellular phenotypes, and the networks formed by interacting proteins are crucial scaffolds for modeling, data reduction and annotation. In my presentation I will focus on how integrative analysis of different types of data – in particular information obtained through automatic mining of the scientific literature – can help identify interactions among proteins and small molecules. I will also discuss how network-based data and text mining can be used to gain insights into complex regulatory processes and to link drugs, targets, diseases, and side effects.
While Bayesian approaches continue to make inroads in phylogenetic analysis, their widespread deployment has been hindered by serious computational challenges. I discuss two recent lines of work that address the computational challenges head-on: (1) an alternative to Markov chain Monte Carlo (MCMC) that is based on an adaptation of sequential Monte Carlo (SMC) to the space of phylogenies. I present experiments showing that the new framework ("Poset-SMC") can converge two orders of magnitude faster than MCMC. (2) A novel stochastic process for modeling insertions and deletions on trees. The model is closely related to the classical Thorne-Kishino-Felsenstein (TKF) model, but where the TKF model incurs exponential complexity in the number of taxa for computing the joint probability of a tree and an alignment, under the new model this complexity drops to linear. (Joint work with Alexandre Bouchard-Cote and Sriram Sankararaman)
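For readers unfamiliar with SMC, the skeleton below shows the generic propagate-reweight-resample cycle on a toy scalar state. It is emphatically not Poset-SMC itself, where the extend step would join subtrees into progressively larger phylogenies.

```python
# Generic SMC skeleton (toy state): propose, reweight, resample.
import numpy as np

def smc(n_particles, n_steps, extend, log_weight, rng):
    particles = [None] * n_particles
    logw = np.zeros(n_particles)
    for _ in range(n_steps):
        particles = [extend(p, rng) for p in particles]        # proposal step
        logw += np.array([log_weight(p) for p in particles])   # reweight
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(n_particles, n_particles, p=w)        # resample
        particles = [particles[i] for i in idx]
        logw = np.zeros(n_particles)
    return particles

rng = np.random.default_rng(3)
out = smc(100, 5,
          extend=lambda p, r: (0.0 if p is None else p) + r.normal(),
          log_weight=lambda p: -0.5 * p * p,  # toy target: standard normal
          rng=rng)
print(np.mean(out), np.std(out))
```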
The human olfactory receptor (OR) subgenome and its genomic variability are relatively well studied. We have previously reported several dozen cases of deleterious variations in olfactory receptor genes that attain considerable population prevalence, and suggested their role in odorant-specific sensitivity phenotypes (specific anosmia). We have recently augmented 10-fold the list of OR loci that segregate between intact and inactive alleles (segregating pseudogenes) by in-silico data mining, combined with Next Generation Sequencing (NGS) of the OR sub-genome and transcriptome sequences. We identified 285 deleterious SNP and indel variations in 210 ORs. Applying the CopySeq algorithm (PMID: 21085617) to 1000 Genomes Project data, we additionally obtained 72 OR genes with a deletion CNV allele. Altogether, we find that ~60% of OR genes segregate in the human population between an intact and a disrupted allele, indicating huge functional variability and suggesting that each human on the planet has a different functional nose. Much less is known about variability in olfactory non-receptor (accessory) genes: those that mediate odorant signal transduction as well as those that underlie olfactory sensory neuron (OSN) development and integrity. The latter genes likely underlie sensitivity phenotypes pertaining to most or all odorants. We focus on two relevant phenotypes, which we suspect arise from genetic variations in olfactory accessory genes. The first is the “general olfactory factor”: inter-individual variation in average olfactory thresholds. Using two cohorts, each about 350 strong, we are currently performing association studies with respect to database-documented variations in 160 candidate accessory genes. To produce this list we scrutinized the literature on olfactory transduction and OSN development, including mouse gene knockouts. In parallel, we have conducted next-generation RNA sequencing of autopsy and biopsy specimens from normosmic individuals, in comparison to transcriptomes from 8 standard tissues, seeking olfactory tissue-enriched transcripts. A second, more extreme general olfactory phenotype is congenital general anosmia (CGA), an inborn complete absence of smell sensations. We seek the genetic basis for CGA by whole-exome DNA sequencing of affected individuals in families, as well as of 60 isolated CGA cases. The list of candidate accessory genes provides a potential focus for identifying relevant mutations. In sum, we strive to elucidate, by genome variation analyses, both odorant-specific and odorant-general phenotypes, en route to obtaining a complete understanding of the genetic basis of human olfaction. A similar approach is being used in parallel to decipher monogenic neurological diseases, two of which have yielded likely mutated gene candidates. In all the aforementioned projects we extensively use GeneCards, our automatically mined digital compendium of human genes, recently upgraded to accommodate the challenges of high-throughput genomics and the scrutiny of gene-based maladies.
The ability to successfully predict the immunogenicity and effectiveness of vaccines would facilitate the rapid evaluation of new and emerging vaccines, and the identification of individuals who are unlikely to be protected by a vaccine. In this talk, we will describe a multidisciplinary approach involving immunology, genomics and bioinformatics to predict the immune response to a vaccine without exposing individuals to infection.
This approach addresses a long-standing challenge in the development of vaccines: that immunity or effectiveness can be determined only long after vaccination and, often, only after exposure to infection. The first study used YF-17D to predict, shortly after immunization, the body's ability to mount a strong and enduring immunity against yellow fever. Healthy individuals were vaccinated with YF-17D and the T cell and antibody responses in their blood were studied. There was striking variation in these responses between individuals. Analysis of gene expression patterns in white blood cells revealed that in the majority of individuals the vaccine induced a network of genes involved in the early innate immune response against viruses.
A discrete support vector machine classification model and feature selection algorithm were applied to the gene signatures to identify discriminatory gene sets and to establish a classification rule for the T cell and antibody responses induced by the vaccine. To validate the predictive accuracy of these gene signatures, a second group of individuals was vaccinated, and we were able to predict with up to 90 percent accuracy which of the vaccinated individuals would develop strong T or B cell immunity to yellow fever.
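As a rough stand-in for this step (the talk's discrete SVM is a different formulation), a standard linear SVM with recursive feature elimination on synthetic "gene signature" data illustrates the select-then-classify workflow:

```python
# Sketch: select a small discriminatory gene set, then classify responders.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# 40 "vaccinees", 500 "genes", only 10 of which actually carry signal.
X, y = make_classification(n_samples=40, n_features=500, n_informative=10,
                           random_state=0)

clf = make_pipeline(
    RFE(LinearSVC(dual=False), n_features_to_select=20),  # feature selection
    LinearSVC(dual=False),                                # final classifier
)
scores = cross_val_score(clf, X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```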
To determine whether this approach can be used to predict the effectiveness of other vaccines, including flu vaccines, a second study was based on a series of clinical studies during the annual flu seasons in 2007, 2008 and 2009. Healthy young adults were vaccinated with a standard flu shot (trivalent inactivated vaccine); others were given a live attenuated vaccine nasally. A comprehensive survey of the activity levels of all human genes in blood samples from the volunteers revealed that the activity of many genes involved in innate immunity, interferon signaling and reactive oxygen species signaling changed after flu vaccination. Biological analysis also identified genes in the "unfolded protein response," necessary for cells to adapt to the stress of producing high levels of antibodies. Using various trials for establishing the classification rule, predictive accuracy as high as 90% was achieved. Encouragingly, some of the genes identified in the seasonal flu study were also predictors of the antibody response to vaccination against yellow fever. Further, this systems approach facilitates the discovery of new functions for genes, even where their involvement in antibody responses was previously unsuspected.
The talk will conclude with a summary of the theoretical findings, solution characteristics, and computational challenges of our classification model.
This work was performed in collaboration with lead investigator Bali Pulendran from the Emory Vaccine Center, with the speaker’s main contribution relating to the classification and predictive modeling and algorithms.
The first chromosome of a eukaryote was sequenced almost 20 years ago (Oliver et al., Nature, 1992). Since then, genome- and sequence-based methods have given access to a new world of molecular information; the life sciences are now technology and data driven. The traditional split of biology into discrete research subjects following species, method, or other classifications is on its way to being transformed into subject-specific research areas employing a whole toolbox of techniques generating large, heterogeneous data sets, hardly ever structured into comprehensible information. The traditional scientific concept in biology is descriptive, the exchange of information narrative, and the interpretation of the phenomena observed intuitive. Despite the undeniable enormous progress in generating information, living organisms still resist the three basic challenges of science: explanation, prediction, control. About a million papers are published every year, an enormous wealth of information, presented in the traditional narrative, often anecdotal way. Simple combinatorics on the size of the experimental space as well as the observable data space tells us that both are practically unbounded. In my talk, I will discuss the need for a metaphysics of the life sciences. This need is justified by a number of reasons: (1) the current practice of scientific publication has hit a wall - nobody is able any more to cope with the published papers in their own field; (2) biological processes are highly complex and involve a larger number of objects than we can handle with our cognitive capacities, which are necessarily limited and sequential; (3) the traditional way of performing science, based on small research groups, is limited to very valuable but necessarily small-scale results; (4) the systematic exploration of data spaces (e.g., siRNA screens) becomes impossible, since the hypothesis space needed to ask the right questions of the experimental data is mostly undefined, resulting in intuitive but unsystematic interpretation of the data. The way we generate and explore data is driven by unreflected rules based on our socio-cultural background. Although just this approach has been highly successful in molecular and cellular biology, it is limited and requires epistemological reflection. There exists a rich literature on the theory of science, but it needs to be applied to daily practice in life science research and education.
A minimal gene set for a prokaryotic lifestyle is the list of genes needed for the life of an engineered M. genitalium if nutrients are provided. In-silico ("digital") and gene-knockout ("gene-bashing") experiments agree that it must comprise 300-350 proteins. This study asks whether these minimal proteins may be built from a limited monomer repertoire. A genome-gazing method is employed to address this question. Every residue type in the proteins of the minimal genome, and in invariant positions of multiple alignments of these proteins to their orthologs from other lineages, is examined. The established or inferred operational roles of the preserved residues are then studied. Generally, the residues that are the rarest overall are also the rarest ones to be preserved. The order in which to engineer residues out of the minimal genome may be: that one residue that has a thiol group; Trp; non-initiatory Met; His. The problems inherent in substituting the thiol-bearing residue are examined in more detail, and it is argued that this engineering might be plausible and that a minimal organism depleted of it may survive.
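The counting behind "the residues that are the rarest overall are also the rarest to be preserved" can be sketched as follows (toy alignment; the real analysis spans the full minimal proteome and its orthologs):

```python
# Sketch: tally residue types overall and in invariant alignment columns.
from collections import Counter

alignment = [  # toy multiple alignment, one string per ortholog
    "MKCW-HDE",
    "MKCW-HDQ",
    "MRCW-HSE",
]

overall = Counter(res for seq in alignment for res in seq if res != "-")
invariant = Counter(col[0] for col in zip(*alignment)
                    if "-" not in col and len(set(col)) == 1)

print("overall:  ", overall.most_common())    # rank of each residue type
print("invariant:", invariant.most_common())  # rank among preserved positions
```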
Chromatin is composed of DNA and a variety of modified histones and non-histone proteins. It has become increasingly clear in recent years that a better characterization of the chromatin states is essential for understanding gene regulation and key cellular processes. I will describe our analysis of the chromatin landscape in Drosophila, summarized by prevalent combinatorial patterns of histone modifications. We find that integrative analysis with other data (non-histone chromatin proteins, DNase I hypersensitivity, GRO-Seq reads produced by engaged polymerase, short/long RNA products) reveals discrete characteristics of chromosomes, genes, regulatory elements and other functional domains. I will also describe some analytical challenges that arise in analysis of chromatin data.
Protein secretion is a key virulence mechanism of pathogenic and symbiotic bacteria, whereby proteins can be transported from the bacterial cytosol directly into the eukaryotic host cell. This makes the investigation of secreted proteins ("effectors") crucial for understanding molecular bacterium-host interactions. Facilitated uptake of the pathogen, manipulation of the immune response, and prevention of apoptosis of the infected host cell are examples of the complex effects that are triggered by secreted bacterial proteins. Our group is developing novel computational methods addressing three major research questions of effector-based host-pathogen interactions. 1) Prediction of bacterial secreted proteins: to date, 7 different bacterial secretion systems and the Sec pathway have been described as molecular routes of transport, each specific in terms of molecular structure and mechanism of translocation. We have developed two complementary prediction strategies for protein secretion: EffectiveT3, a software tool predicting targets of the Type III secretion system based on the associated signal peptide, and Effective (http://effectors.org), a database identifying eukaryotic-like proteins in bacterial genomes. 2) Prediction of host-pathogen protein-protein interaction networks: based on experimentally characterized interactions between host and effector proteins, and using protein domain-based PPI prediction methods, we predict the protein interaction networks of the host around bacterial effectors. These networks indicate that effectors target many different functional modules in the host and prefer specific host interactors. 3) Quantitative modeling of effectors in the host: the effect of bacterial proteins in host cells can only be understood quantitatively, considering the concentration of effectors and the strengths of their interactions with host molecules. In a pilot study, we analyzed the possible effects of bacterial effectors on the NF-kB signalling pathway, which represents a central target for bacterial pathogens. Significant effects on NF-kB expression could be achieved in this model by simultaneously targeting multiple proteins in the NF-kB pathway.
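Point 2 rests on domain-based PPI inference. The sketch below shows the core idea with entirely made-up identifiers (domains, proteins, and domain pairs are illustrative, not the group's actual data): an effector and a host protein are linked whenever they carry a domain pair known to interact in characterized PPIs.

```python
# Sketch: domain-domain-interaction-based host-pathogen PPI prediction.
ddi = {("LRR", "TIR"), ("ankyrin", "IKK_kinase")}  # known interacting domain pairs
effector_domains = {"effA": {"LRR"}, "effB": {"ankyrin"}}
host_domains = {"TLR4": {"TIR"}, "IKBKB": {"IKK_kinase"}, "ACTB": {"actin"}}

def predict_ppis(effectors, hosts, ddi_pairs):
    """Predict (effector, host) pairs sharing at least one known domain pair."""
    pairs = set()
    for eff, edoms in effectors.items():
        for host, hdoms in hosts.items():
            if any((d1, d2) in ddi_pairs or (d2, d1) in ddi_pairs
                   for d1 in edoms for d2 in hdoms):
                pairs.add((eff, host))
    return pairs

print(predict_ppis(effector_domains, host_domains, ddi))
```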
Through earlier computational work we discovered the "pyknon" class of DNA sequence motifs that exhibited a number of intriguing properties. These motifs led us to postulate the existence of specific, previously unseen categories of short RNAs and of an associated framework of putative interactions in which these RNAs participated. Experimental work and additional computational analyses by us and others have begun to provide support for the validity of the pyknon framework and suggest the possibility that a potentially significant portion of cellular process regulation may be mediated by genomic sequences that need not be conserved across organisms.
Complex human phenotypes, such as autism, schizophrenia, and anxiety, undoubtedly partially overlap in genomic variations that predispose to or protect against these maladies. (Genetic overlap of complex phenotypes has gained increasing experimental support and is no longer just an ungrounded scientific hypothesis.) Furthermore, as yet largely unknown shared environmental factors likely tend to trigger the manifestation of more than one phenotype. Although it may seem overly ambitious to target multiple phenotypes jointly, we believe we can obtain much more information from existing data and gain new insights into individual phenotypes by modeling phenotypes jointly. My talk sketches two distinct computational approaches to this problem.
Perhaps the oldest joke in molecular biology is that its Central Dogma is rewritten yearly. Encompassing all the molecular steps to convert information to activity, the Central Dogma has been expressed in many ways — as a concept, a rule, or an anecdote. However, a more precise and predictive quantitative formula is needed for the burgeoning field of Synthetic Biology, where we combine molecular biology, physical chemistry, and engineering mathematics to re-design life. Using the language of physics and chemistry, the Salis lab develops predictive biophysical models of gene expression that are capable of designing completely de novo DNA sequences to rationally control gene expression and regulation. In the process, we systematically explore, quantify, and debunk common anecdotal rules that have often yielded conflicting results, to the dismay of many bench scientists. These biophysical models have been encapsulated in a user-friendly DNA compiler that enables engineers and scientists to reliably and predictably design synthetic DNA for diverse biotechnology applications. We will present examples from metabolic engineering that demonstrate how large and complex genetic systems are designed to introduce new and valuable behaviors into micro-organisms.
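One way such a biophysical model can be cast is sketched here: a Boltzmann relationship between the total ribosome-mRNA binding free energy and the translation initiation rate. The functional form is a common modeling choice and the slope parameter is an assumption for illustration, not a value quoted from the talk.

```python
# Sketch: initiation rate as a Boltzmann function of binding free energy,
# r proportional to exp(-beta * dG_total).
import math

BETA = 0.45  # 1/(kcal/mol); illustrative slope, treat as an assumption

def relative_rate(dG_total, dG_ref=0.0):
    """Initiation rate relative to a reference sequence with energy dG_ref."""
    return math.exp(-BETA * (dG_total - dG_ref))

# Stronger (more negative) binding -> faster predicted initiation:
for dG in (-5.0, 0.0, 5.0):
    print(f"dG_total = {dG:+.1f} kcal/mol -> relative rate {relative_rate(dG):.2f}")
```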
The family of Chromosome Conformation Capture experimental techniques allows highly parallel measurement of 3D chromatin interactions. The primer-based 5C variant in particular allows assaying all interactions among a set of selected regions. However, achieving unbiased high-resolution measurements with these techniques requires decreasing both the size of and the distance between the regions interrogated, at which point numerous systematic issues make identifying interactions extremely difficult. We will discuss these challenges and our work on developing new approaches to model these effects and extract interaction patterns from these types of data. We will demonstrate how these techniques allow high-resolution mapping of changes in chromatin interactions across different cellular conditions.
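One of the simplest corrections applied to such data, shown here as a sketch on synthetic counts, is observed-over-expected normalization: dividing each contact count by the expected count for its genomic distance removes the dominant distance decay so that specific interactions stand out. Real 5C modeling must additionally handle primer- and fragment-specific biases not covered here.

```python
# Sketch: observed/expected normalization of a symmetric contact matrix.
import numpy as np

def observed_over_expected(matrix):
    """Divide each count by the mean count at its genomic distance (diagonal)."""
    n = matrix.shape[0]
    norm = np.zeros_like(matrix, dtype=float)
    for d in range(n):
        diag = np.diagonal(matrix, offset=d)
        expected = diag.mean() or 1.0  # guard against empty far diagonals
        for i in range(n - d):
            norm[i, i + d] = norm[i + d, i] = matrix[i, i + d] / expected
    return norm

rng = np.random.default_rng(4)
n = 50
dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) + 1
counts = rng.poisson(1000 / dist)          # counts decay with distance
counts[10, 30] = counts[30, 10] = 120      # one specific "loop" contact
print(observed_over_expected(counts)[10, 30])  # stands out after normalization
```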