Whole genome phylogeny places the emerging mammalian pathogen Pythium insidiosum in the oomycete tree of life
Presenting Author: Edward L. Braun
Affiliation: Department of Biology and Genetics Institute, University of Florida, Gainesville, FL 32611
Complete Authors List:
Marina S. Ascunce (1,2); Jose C. Huguet-Tapia (2); Almudena Ortiz-Urquiza (3); Nemat O. Keyhani (3); Erica Goss (1,2); Edward L. Braun (4,5)
Complete Author List Affiliations:
(1) Emerging Pathogens Institute, University of Florida, Gainesville, FL 32611;
(2) Department of Plant Pathology, University of Florida, Gainesville, FL 32611;
(3) Department of Microbiology and Cell Science, University of Florida, Gainesville, FL 32611;
(4) Department of Biology, University of Florida, Gainesville, FL 32611;
(5) Genetics Institute, University of Florida, Gainesville, FL 32611
Abstract:
The oomycete genus Pythium comprises more than 250 described species, most of which are saprobes or facultative plant pathogens. Pythium insidiosum is the only Pythium species that infects mammals; it is the causal agent of pythiosis, a deadly disease that affects horses, dogs, cattle, and other mammals in tropical and subtropical regions, including the southeastern United States. Human cases of pythiosis have also been reported, with an initial report from Thailand in 1985. P. insidiosum is thought to propagate on aquatic plants in the environment and has been shown to sporulate on plant material in the laboratory, suggesting that it is both a plant and animal pathogen. We generated 14 million PE250 Illumina MiSeq reads and a total of 356,001 PacBio reads from a 10-kb insert library for strain CDC-B5653 of P. insidiosum (ATCC 200269), originally isolated from necrotizing lesions on the mouth and eye of a 2-year-old boy in Memphis, Tennessee, USA. These data were used for a de novo assembly with SPAdes (v. 3.1.0), generating a final 45.6 Mb assembly in 8,992 contigs with an average coverage of 28x and an N50 of 13 kb. These values are comparable to those obtained for genome assemblies of other plant pathogenic oomycetes, and the GC content (57%) is similar to that of other Pythium species. We used Augustus (v. 3.0.1) for ab initio gene prediction, using a previously described gene model for Pythium species. The P. insidiosum genome contains 225 tRNA genes and 18,045 putative protein-coding genes. Phylogenetic analyses of up to 874 orthologs placed P. insidiosum in a clade with two other Pythium species (P. aphanidermatum and P. arrhenomanes) with high confidence (100% bootstrap support). In sharp contrast to the analyses of aligned orthologs, analyses of genome content typically united two obligate parasites of plants (Albugo laibachii and Hyaloperonospora parasitica) with strong support, probably reflecting convergent gene loss.
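The N50 quoted for the assembly is a length-weighted median of contig sizes. As a minimal sketch (not the authors' pipeline; the function name is illustrative), it could be computed from a list of contig lengths like this:

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of the assembled bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(contig_lengths) / 2
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
```

Applied to the lengths of all contigs in an assembly, this returns the kind of summary value reported above.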
Email: ebraun68@ufl.edu
Herpesvirus DNA Replication, Recombination and Repair
Presenting Author: Jay C. Brown
Affiliation: University of Virginia School of Medicine
Complete Authors List: Jay C. Brown
Complete Author List Affiliations: University of Virginia School of Medicine
Abstract:
In cells latently infected with a herpesvirus, the virus DNA is present in the cell nucleus, but it is not extensively replicated or transcribed. In this inactive state the virus DNA is vulnerable to host cell DNA rearrangements that have the potential to destroy the virus' genetic integrity. Such DNA changes occur prominently in neurons and in B cells, both cell types able to host latent herpesvirus infections. I have used methods of DNA sequence analysis to test the idea that host-encoded DNA repair is involved in correcting damage to latent herpesvirus DNA. Beginning with a sample of 39 herpes family viruses (Table 2), the genome sequences were examined for the presence of features associated with initiation of homologous recombination-dependent repair (HRR). These included inverted and tandem repeats, a chi-like recombination hotspot (TGGTGG), a classical meiotic recombination sequence (CCTCCCCT) and four other sequences with the potential to initiate HRR (Table 1). The results showed that such features are present above a randomized background in all 39 herpesviruses tested (Table 3). The highest counts were found in alpha- and gamma-herpesviruses, which are at greatest risk of host cell genetic rearrangements. Counts were lower in beta-herpesviruses, where the need for genome repair is less apparent. The results are interpreted to support the view that host cell-encoded HRR is involved in repair of latent herpesvirus genomes. Distinctive subsets of initiators were observed in alpha- compared to gamma-herpesviruses. For instance, among the genomes examined, the five with the highest number of inverted repeat initiators were all neurotropic alpha-herpesviruses with high GC contents (Table 3). In contrast, the top five counts were all in gamma-herpesviruses when measurements were made with the specific human initiation sequences TGGAG, CCCAG and GGGCT. The location of HRR-initiating features was also found to be distinctive. 
Among the high abundance alpha-herpesviruses, inverted and tandem repeats were found to be concentrated in the S genome segment and depleted in L (Figs 2 and 3). By contrast, in the gamma-herpesviruses, HRR-promoting sequences were more uniformly distributed (Fig. 4). The distinctive location of HRR-promoting elements suggests DNA repair is initiated differently in alpha- compared to gamma-herpesviruses.
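Comparing motif counts with a randomized background, as described above, can be illustrated with a short sketch. The abstract does not say how the background was randomized; the mononucleotide shuffle used here, and the function names, are assumptions made for illustration only.

```python
import random

def count_motif(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    count, start = 0, seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count

def shuffled_background(seq, motif, n_shuffles=100, seed=0):
    """Mean motif count over shuffles that preserve base composition
    (assumed randomization scheme, not necessarily the author's)."""
    rng = random.Random(seed)
    bases = list(seq)
    total = 0
    for _ in range(n_shuffles):
        rng.shuffle(bases)
        total += count_motif("".join(bases), motif)
    return total / n_shuffles
```

A motif such as TGGTGG would count as enriched when `count_motif` on the real genome clearly exceeds `shuffled_background`. This sketch scans one strand only; a fuller analysis would also scan the reverse complement.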
http://www.sciencedirect.com/science/article/pii/S0888754314001475
Email: jcb2g@virginia.edu
Development of a metagenomics-based method for detection of foodborne pathogens on fresh produce
Presenting Author: Juan C. Castro
Affiliation: School of Biology, Georgia Institute of Technology
Complete Authors List:
Juan C. Castro (1,3); Luis M. Rodriguez (1,3); Janet K. Hatt (2); Michelle Carter (4); Konstantinos T. Konstantinidis (1,2,3)
Complete Author List Affiliations:
1. School of Biology, Georgia Institute of Technology
2. School of Civil and Environmental Engineering, Georgia Institute of Technology
3. Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology
4. Produce Safety and Microbiology, USDA/ARS/WRRC. 800 Buchanan Street, Albany, CA 94710
Abstract:
Monitoring of fresh produce is key to reducing the risk of outbreaks caused by microbes in our food supply. Escherichia coli strain O157:H7 is an important foodborne pathogen that causes food poisoning. O157:H7 outbreaks have been frequently linked to fresh produce such as alfalfa, clover, lettuce and spinach. To build a null model for the detection of this pathogen in complex metagenomes, we spiked in-silico metagenomes with different concentrations of reads originating from the O157:H7 genome. By aligning O157:H7 vs. non-O157:H7 reads against the genome sequence of E. coli O157:H7, “noisy” regions of the genome, which provide false positive signals, were identified and flagged. Based on this information, a model to estimate the probability of (true) O157:H7 presence in a given sample was generated. To test our model, we inoculated spinach leaves with different cell concentrations of this strain (80 to 8 × 10^5 cells), and sequenced the resulting samples using the Illumina MiSeq platform. The predictions of the model based on the Illumina datasets showed tight correlation with the known (spiked-in) cell concentrations, e.g., r-squared >0.9. Thus, we developed a simple yet effective bioinformatics approach and the associated wet-lab protocol to detect foodborne pathogens based on metagenomics. Our methodology can be easily extended to other pathogens that may be present in fresh produce or in clinical samples.
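The in-silico spiking step can be pictured with a toy sketch. This is not the authors' procedure: it samples error-free, forward-strand reads uniformly from a target genome string and mixes them into existing reads, and every name and default below is hypothetical.

```python
import random

def spike_reads(background_reads, target_genome, n_target_reads,
                read_len=150, seed=0):
    """Return a shuffled mix of background reads and error-free reads
    sampled uniformly from target_genome (forward strand only)."""
    if read_len > len(target_genome):
        raise ValueError("read_len exceeds genome length")
    rng = random.Random(seed)
    last_start = len(target_genome) - read_len
    spiked = []
    for _ in range(n_target_reads):
        start = rng.randint(0, last_start)  # inclusive on both ends
        spiked.append(target_genome[start:start + read_len])
    mixed = list(background_reads) + spiked
    rng.shuffle(mixed)
    return mixed
```

Varying `n_target_reads` over several orders of magnitude would give a series of mock metagenomes with known pathogen content against which a detection model can be checked.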
Email: jccastrog@gatech.edu
Genomic analysis of Type VI secretion systems of Vibrio cholerae
Presenting Author: Aroon T. Chande
Affiliation: School of Biology, Georgia Institute of Technology, Atlanta GA 30332
Complete Authors List: Aroon T. Chande, Samit S. Watve, Lavanya Rishishwar, Eryn E. Bernardy, I. King Jordan, Brian K. Hammer
Complete Author List Affiliations:
School of Biology, Georgia Institute of Technology, Atlanta GA 30332
Abstract:
The fatal disease cholera is caused by the waterborne bacterium Vibrio cholerae, which is commonly found in biofilm communities with other organisms on chitinous material like crab shells in marine and freshwater environments. V. cholerae encodes a Type VI Secretion System (T6SS) used for contact-dependent killing of neighboring prey cells by translocation of toxic effector proteins that result in cell lysis. Each V. cholerae isolate typically possesses three T6SS loci that encode the syringe-like delivery apparatus, multiple toxic effectors, and cognate immunity proteins that protect against self-intoxication. T6SS expression is tightly controlled in patient isolates of V. cholerae and remains repressed during human infection; however, repression is relieved during growth on carbon sources like chitin in the environment. In contrast, environmental isolates are constitutive for T6SS killing. Since T6SS activity is co-regulated with natural competence for DNA uptake in several strains, it has been proposed that T6SS-mediated killing may also be a mechanism supporting acquisition of new genes via horizontal gene transfer (HGT). Here we present a bioinformatics study of a diverse set of 26 sequenced clinical and environmental isolates that were collected over the last 40 years and recently characterized for T6SS-mediated killing. T6SS effectors were differentiated into distinct classes based on sequence homology and conserved protein families. We are currently validating effector activity by genetic and biochemical approaches to understand how T6SS-mediated interactions determine the structure and diversity of microbial communities.
Email: arch@gatech.edu
Homology modeling and simulation of selection in the mitochondrial genome during tumorigenesis
Presenting Author: Estella B.C. Chen-Quin
Affiliation: Department of Molecular and Cellular Biology (formerly the Department of Biology and Physics), Kennesaw State University
Complete Authors List:
Estella Chen-Quin (1); Richard Uberto (1); Mark Fowler (1)
Complete Author List Affiliations:
(1) Department of Molecular and Cellular Biology, Kennesaw State University
Abstract:
Mutations in the mitochondrial genome play a role in tumor growth or metastasis. Over 65% of tumors have mutations in the mitochondrial chromosome (mtDNA), which codes for components of the electron transport chain (ETC); inherited ETC mutations in the nuclear genome also cause hereditary cancers of the head and neck. Deep sequencing of existing tumor data and our previous work show that somatic mtDNA mutations in tumors are biased towards deleterious lesions, although the opposite is true at the organismal level. “Transgenic” mitochondria (cybrids) with deleterious mutations increase the metastatic potential of cancer, suggesting that mitochondrial dysfunction promotes cancer. Thus, mitochondrial dysfunction may play a role in cancer progression, but this has been difficult to prove. Because mtDNA mutates about 10x faster than the nuclear genome, association studies have too little power to link any one lesion to cancer. Additionally, each cell contains hundreds of copies of the mitochondrial chromosome, resulting in deep heterogeneity (heteroplasmy) of the mtDNA. Empirical cancer data represent only a biopsy of cancer tissue and do not inform about mutation processes throughout the lifetime of the tumor. Here we present a computer simulation of mitochondrial growth and mutation during tumor formation. We hope that this simulation will (a) allow comparison of mtDNA tumor mutations generated under neutral conditions to real cancer data; (b) test various hypotheses of mtDNA selection during tumor growth; and (c) model the dynamics of mtDNA lineages during tumor growth. We also present structural analysis of known mtDNA cancer mutations, via homology modeling of the ETC proteins. Homology modeling uses the limited empirical data and “hangs” known mtDNA cancer mutations on the protein structure. 
This identifies areas of the mitochondrial proteins that may promote cancer when altered, and informs on the physiological mechanisms that may be involved.
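To make the idea of a neutral mtDNA simulation concrete, here is a deliberately small sketch. It is not the authors' simulator: it follows a single cell lineage with a fixed copy number, one-way mutation at replication, and random segregation at division. All names and parameter values are assumptions chosen for illustration.

```python
import random

def simulate_heteroplasmy(n_copies=100, n_divisions=50, mutation_rate=1e-3,
                          initial_mutants=0, seed=0):
    """Follow one cell lineage; return the mutant mtDNA fraction after
    each division under neutral drift with one-way mutation."""
    rng = random.Random(seed)
    mutants = initial_mutants
    history = []
    for _ in range(n_divisions):
        # Replicate every copy; each new copy of a wild-type may mutate.
        new_mutants = mutants
        new_wild = 0
        for _ in range(n_copies - mutants):
            if rng.random() < mutation_rate:
                new_mutants += 1
            else:
                new_wild += 1
        pool = ([1] * (mutants + new_mutants)
                + [0] * ((n_copies - mutants) + new_wild))
        # The followed daughter inherits n_copies drawn without replacement.
        mutants = sum(rng.sample(pool, n_copies))
        history.append(mutants / n_copies)
    return history
```

Running many lineages with different seeds gives a neutral expectation for heteroplasmy levels; adding a fitness term for mutant copies or cells would be one way to test selection hypotheses.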
Email: echen1@kennesaw.edu
Ab initio Gene Prediction in Metagenomes of Fungal Species with Frequent Introns
Presenting Author: Liexiao Ding
Affiliation: School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA
Complete Authors List: Liexiao Ding (1), Alexandre Lomsadze (2) and Mark Borodovsky (2,3,4)
Complete Author List Affiliations:
1. School of Industrial and Systems Engineering,
2. Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering,
3. School of Computational Science and Engineering,
4. Center for Bioinformatics and Computational Genomics,
Georgia Institute of Technology, Atlanta, GA, USA
Abstract:
Microbial and viral metagenomics generates large volumes of sequence data for studies of environmental and clinical microbial communities. However, gene prediction in short metagenomic sequences faces serious challenges. Tools for gene prediction in metagenomes (e.g. MetaGeneMark, Zhu et al., 2010) were primarily developed for prokaryotic or prokaryote-like sequences; they can work with genome fragments of bacteria and archaea as well as phages, cytoplasmic eukaryotic viruses or yeast-like eukaryotes. Existing tools are not suitable for prediction of eukaryotic genes with frequent introns. Many more parameters need to be derived from short eukaryotic sequences of anonymous origin. This task is of practical interest for analysis of metagenomes of eukaryotic microorganisms.
We present here a new method for gene identification in metagenomes of fungal communities. The algorithm computes the G+C content of an input sequence and uses this information to select a model from a set of pre-built, fungus-specific heuristic models that cover a wide range of possible G+C contents. We used known genomes of more than 200 fungal species to estimate parameters, including emission and transition probabilities for the generalized hidden Markov model and Markov chain models of coding and non-coding regions. The models also include G+C-dependent parameters for models of acceptor and donor sites as well as the branch point. Subsets of parameters, such as the exon length distribution, were modeled by dividing the set of ~200 genomes into several bins by intron density: low, medium and high.
We used 18 fungal species to create a simulated fungal metagenome and to test the new algorithm. We observed that average exon prediction accuracy was at about 70% and nucleotide prediction accuracy at about 95%.
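The model-selection step, choosing a pre-built model from the G+C content of the input fragment, might look like the following sketch. The bin edges and model names are hypothetical placeholders, not the values used in the actual tool.

```python
from bisect import bisect_right

# Hypothetical upper bin edges (G+C fraction) and model names.
GC_BIN_EDGES = [0.40, 0.50, 0.60]
MODEL_NAMES = ["gc_low", "gc_mid_low", "gc_mid_high", "gc_high"]

def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("no unambiguous bases in sequence")
    return gc / (gc + at)

def select_model(seq):
    """Pick the name of the pre-built model whose G+C bin contains seq."""
    return MODEL_NAMES[bisect_right(GC_BIN_EDGES, gc_content(seq))]
```

In a real predictor each name would map to a full parameter set (emission and transition probabilities, splice-site models), possibly further split by intron-density bin.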
Email: borodovsky@gatech.edu
Cryptic Genetic Relatedness Among 1000 Human Genomes
Presenting Author: Alexei Fedorov
Affiliation: Department of Medicine, University of Toledo, Health Science Campus, OH 43614, USA
Complete Authors List: Larisa Fedorova (1), Shuhao Qiu (2,3), Rajib Dutta (4), Ahmed Al-Khudhair (2), Alexei Fedorov (2,3)
Complete Author List Affiliations:
(1) GEMA-biomics, Ottawa Hills, OH 43606, USA.
(2) Program in Bioinformatics and Proteomics/Genomics, University of Toledo, Health Science Campus, OH 43614, USA.
(3) Department of Medicine, University of Toledo, Health Science Campus, OH 43614.
(4) Program in Biomedical Sciences, University of Toledo, HSC, OH 43614.
Abstract:
Nucleotide sequence differences on the whole-genome scale have been computed for 1092 people from 14 populations made publicly available by the 1000 Genomes Project. The total number of differences in genetic variants between 96,464 human pairs has been calculated. We also analyzed the distribution patterns of very rare genetic variants (vrGVs), which have minor allele frequencies less than 0.2%, and used these patterns to reveal cryptic genetic relatedness. In contrast to existing probabilistic approaches, our method is rather deterministic, because it considers a group of very rare events that cannot occur together by chance alone. This method has been applied to an exhaustive computational search for shared IBD segments among 1092 sequenced individuals from 14 populations. It demonstrated that clusters of vrGVs are unique and powerful markers of genetic relatedness that uncover IBD chromosomal segments between and within populations, irrespective of whether divergence was recent or occurred hundreds to thousands of years ago. We found that several IBD segments are shared by practically any possible pair of individuals belonging to the same population. Moreover, shared short IBD segments (median size 183 Kb) were found in 10% of inter-continental human pairs, each comprising a person from Sub-Saharan Africa and a person from Southern Europe. The shortest shared IBD segments (median size 54 Kb) were found in 0.42% of inter-continental pairs composed of individuals from Chinese/Japanese populations and Africans from Kenya and Nigeria. Knowledge of inheritance of IBD segments is important in clinical case-control and cohort studies, since unknown distant familial relationships could compromise interpretation of collected data. Clusters of vrGVs should be useful markers for familial relationship and common multifactorial disorders.
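The core idea, that a tight cluster of shared very rare variants marks a candidate IBD segment, can be sketched as follows. This is an illustration under assumed thresholds (`max_gap`, `min_shared`), not the authors' algorithm; inputs are the chromosomal positions of vrGVs carried by each of two individuals.

```python
def shared_rare_clusters(variants_a, variants_b, max_gap=50_000, min_shared=3):
    """Return (start, end, n_shared) for runs of rare variants carried by
    both individuals on one chromosome, where consecutive shared
    positions are at most max_gap bases apart."""
    shared = sorted(set(variants_a) & set(variants_b))
    clusters, run = [], []
    for pos in shared:
        # A large gap closes the current run of shared variants.
        if run and pos - run[-1] > max_gap:
            if len(run) >= min_shared:
                clusters.append((run[0], run[-1], len(run)))
            run = []
        run.append(pos)
    if len(run) >= min_shared:
        clusters.append((run[0], run[-1], len(run)))
    return clusters
```

Requiring several shared vrGVs per cluster is what makes a chance match implausible: each variant alone is rare, so several together in a short interval point to a common ancestral segment.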
Email: alexei.fedorov@utoledo.edu
Prokaryotic Gene Prediction with Help of MFinder, a Gibbs Sampling-Based Motif Finder
Presenting Author: Karl Gemayel
Affiliation: School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
Complete Authors List: Karl Gemayel (1), Alexandre Lomsadze (2) and Mark Borodovsky (2,3,4)
Complete Author List Affiliations:
1. School of Computational Science and Engineering,
2. Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering,
3. School of Computational Science and Engineering,
4. Center for Bioinformatics and Computational Genomics,
Georgia Institute of Technology, Atlanta, GA, USA
Abstract:
The most frequent errors in prokaryotic gene finding are those made in gene start prediction. One of the sequence determinants that helps pinpoint the true location of the start of a gene is the ribosome binding site (RBS), which is likely to be situated upstream of the gene start.
The unsupervised motif finder Gibbs3 attempts to learn parameters of the RBS model from a probabilistic multiple alignment of a set of sequences. The RBS model can be used to predict new motif positions in conjunction with prediction of gene starts. The current version of Gibbs3 was shown to work well for RBS prediction. However, (i) the algorithm’s performance depends on the length of the input sequences (selected upstream of putative gene starts), and (ii) the error rate related to the upstream sequence length increases significantly in high G+C genomes.
In this work, we introduce features which significantly reduce the impact of the sequence length on the Gibbs sampler performance. The new algorithm, MFinder, learns a distribution over motif positions (in addition to the derivation of the frequency model), and works to localize that distribution. As a result, MFinder is able to disregard erroneous motif locations irrespective of the length of provided sequences. Incorporating MFinder into the ab initio gene-finder, GeneMarkS, shows the effectiveness of this approach. In particular, we observe significant improvement in several test genomes. For instance, for M. tuberculosis the Gibbs3 algorithm failed to detect RBS signals in upstream sequences, whereas MFinder extracted a relevant signal, which, in turn, led to better gene prediction accuracy.
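One way to picture how a learned distribution over motif positions localizes predictions is to add a log position prior to the usual motif log-probability when scoring candidate sites. The sketch below illustrates that general idea only; it is not MFinder's implementation, and the function name and data layout are assumptions.

```python
import math

def best_site(seq, pwm, position_prior):
    """Return (start, score) of the motif placement maximizing the log
    motif probability plus the log prior on the start position.

    pwm: list of dicts, one per motif column, mapping base -> probability
         (all probabilities must be > 0, e.g. after adding pseudocounts).
    position_prior: probability of each allowed start position.
    """
    width = len(pwm)
    n_starts = len(seq) - width + 1
    if len(position_prior) != n_starts:
        raise ValueError("position_prior needs one entry per start position")
    best = None
    for start in range(n_starts):
        if position_prior[start] <= 0:
            continue  # position ruled out by the prior
        score = math.log(position_prior[start])
        for offset, column in enumerate(pwm):
            score += math.log(column[seq[start + offset]])
        if best is None or score > best[1]:
            best = (start, score)
    return best
```

With a flat prior, equally good matches anywhere in a long upstream sequence compete on equal terms; a prior peaked near the gene start lets the sampler discount distant spurious matches regardless of sequence length.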
Email: borodovsky@gatech.edu
Annotation of Cryptosporidium baileyi
Presenting Author: Shelton Griffith
Affiliation: University of Georgia
Complete Authors List: Shelton Griffith (1); Jessica Kissinger (1)
Complete Author List Affiliations: (1) Center for Tropical and Emerging Global Diseases, University of Georgia
Abstract:
Diarrhea is one of the leading causes of death among children under five globally. More than one in ten child deaths (about 800,000 each year) is due to diarrhea. Today, only 44% of children with diarrhea in low-income countries receive the recommended treatment, and limited trend data suggest that there has been little progress since 2000. In 2013, a massive clinical and epidemiological study involving 22,500 children from Africa and Asia revealed, unexpectedly, that the protozoan parasite Cryptosporidium is one of four pathogens responsible for the lion's share of severe diarrhea in infants and toddlers. Vaccines and treatments are already available or fast being developed for three of the four pathogens identified: rotavirus, Shigella bacteria and enterotoxigenic Escherichia coli. But for 'crypto', there is no fully effective drug treatment or vaccine, and the basic research tools and infrastructure needed to discover, evaluate and develop such interventions are mostly lacking. Cryptosporidium is a zoonotic apicomplexan protist parasite that causes gastrointestinal illness with diarrhea in humans and animals. The disease is most serious in children and immunocompromised individuals, especially malnourished children, in whom the infection can aggravate poor nutritional conditions and lead to impaired immune response, chronic infection and long-term negative impact on growth and development. Cryptosporidium parasites are found globally. In the US, an estimated 748,000 cases of cryptosporidiosis occur each year. My long-term goal is to help with the treatment and prevention of Cryptosporidium infections. To achieve this goal I will use bioinformatics to help test the hypothesis that C. baileyi may be a suitable animal model for the human-infecting C. parvum and C. hominis. A comparative genomic analysis of C. baileyi as well as comparative transcriptional profiles will be performed.
A C. baileyi model is important because this species can complete its lifecycle in experimentally tractable chicken eggs, unlike other species such as C. hominis and C. parvum, which require cattle or gnotobiotic pigs. In vitro culture is not yet available for any species. To test my hypothesis about the model, I will conduct the first ever annotation of C. baileyi, compare its genome annotation to human-infecting species of Cryptosporidium and use RNA-Seq data generated from C. baileyi to compare its gene expression to that of C. parvum (the only other species for which gene expression data are available). My results will contribute to a new animal model for Cryptosporidium.
Email: sheltong@uga.edu
Comparative genomics and phylogenetics on diverse tcdAB-positive Clostridium difficile isolates collected throughout the United States
Presenting Author: Christopher A. Gulvik
Affiliation: Centers for Disease Control and Prevention
Complete Authors List: Christopher A. Gulvik (1); Efe Alyanak (1); Johannetsy J. Avillan (1); Maria Sjölund-Karlsson (1); Brandi M. Limbago (1)
Complete Author List Affiliations: (1) Centers for Disease Control and Prevention
Abstract:
Clostridium difficile isolates were collected during 2010-2011 as part of the Emerging Infections Program C. difficile Infection surveillance. Isolates (n = 53) were selected to represent the diversity of strain types as determined by geographic location of isolation, pulsed-field gel electrophoresis (PFGE) type, and PCR ribotype; other molecular data included PCR detection of toxin genes tcdA, tcdB, cdtA, and cdtB, and size of tcdC deletions. Epidemiological metadata associated with isolates include patient age, U.S. state of residence, isolation year, and epidemiologic classification as healthcare- or community-associated. Paired-end Illumina sequencing was performed on isolate genomes to assess congruence of molecular typing methods commonly used for surveillance. Systematic comparison of nine genome assemblers widely used for C. difficile and other bacterial genomes revealed that iterative or multi-k de Bruijn assembly with IDBA and SPAdes provided the fewest contigs, largest N50, most predicted genes, and largest contig. Maximum likelihood phylogeny inference was performed using aligned whole genomes, which averaged 60X coverage. Phylogenetic trees enabled us to classify five isolates with unclassifiable PFGE patterns. The overall concordance of genome-extracted multi-locus sequence types (STs), PCR ribotypes (RTs), and PFGE groups was very good when compared to whole genome phylogeny, and provides an illustration of how each group of U.S. C. difficile is related to others. This is useful because the nomenclature of various C. difficile typing methods (e.g., NAP01, ST-3, RT 027) lacks evolutionary context, whereas whole genome phylogeny provides a single illustrative comparator for contextualizing isolates regardless of the molecular typing method used. Two isolates occurred in clades with bootstraps of 65 and 100%, which otherwise contained single PCR RTs. 
Repeat sequencing and PCR ribotyping of these two isolates confirmed the unusual placement of a single RT 014 isolate within the RT 020 clade, and vice versa. Fluoroquinolone resistance determinants were common (82%) among the hypervirulent epidemic RT 027 genomes; only one non-027 fluoroquinolone-resistant isolate (RT 017) harbored a Thr82Ile mutation in GyrA, which is known to confer fluoroquinolone resistance in other species. These molecular and epidemiological data will be publicly available through NCBI, and isolates have been deposited for distribution with BEI Resources.
Email: cgulvik@gmail.com
Horizontal Gene Transfer of Terpene Synthase Genes from Bacteria to Fungi
Presenting Author: Qidong Jia
Affiliation: Graduate School of Genome Science and Technology, The University of Tennessee, Knoxville, TN 37996
Complete Authors List: Qidong Jia (1); Xinlu Chen (2); Tobias G. Köllner (3); Feng Chen (1,2)
Complete Author List Affiliations:
(1) Graduate School of Genome Science and Technology, The University of Tennessee, Knoxville, TN 37996
(2) Department of Plant Sciences, The University of Tennessee, Knoxville, TN 37996
(3) Max Planck Institute for Chemical Ecology, Hans-Knoell-Strasse 8, D-07745 Jena, Germany
Abstract:
Terpenoids constitute the largest class of secondary metabolites. The vast diversity of terpenoids is partly achieved through the continued creation of novel terpene synthase (TPS) genes, which encode key enzymes for terpenoid biosynthesis, through gene duplication followed by functional divergence. In contrast, little is known about the contribution of horizontal gene transfer (HGT) of TPS genes to the diversity of terpenoids. The goal of this study was to investigate HGT of TPS genes from bacteria to fungi. By phylogenetic analysis of TPSs from bacteria and fungi, several fungal TPSs were found to be nested within bacterial TPSs, implying HGT from bacteria to fungi. These TPSs were renamed BTPSL (bacterial TPS-like). We then focused our study on a group of entomopathogenic fungi with sequenced genomes. Of the eleven species of fungi analyzed, eight BTPSL genes were found in seven species, the majority of which are Metarhizium species. In most fungal species containing BTPSL, collinearity could be identified for BTPSL and neighboring genes. In addition to BTPSL genes, each of the fungal species was found to contain typical fungal TPS genes, suggesting that terpenoids produced in each fungus are determined by both BTPSL and typical fungal TPSs. We also performed biochemical studies on one of the identified BTPSLs (MAA_08668) and showed that it has sesquiterpene synthase activity. Molecular evolutionary analysis of BTPSL genes implied purifying selection, suggesting that the novel chemistry brought about by the acquisition of BTPSL genes via HGT may have important and conserved functions for the recipient fungi.
Email: qjia2@vols.utk.edu
Locating Divergent and Conserved Loci on the SIV Genome Associated with Cross-Species Transmission
Presenting Author: Sivan Leviyang
Affiliation: Georgetown University
Complete Authors List:
Alison Hill (1,2); Sivan Leviyang (3); Welkin E. Johnson (1)
(1) Biology Department, Boston College, USA
(2) Division of Medical Sciences, Harvard University, USA
(3) Mathematics and Statistics Department, Georgetown University, USA
Abstract:
Many examples exist of viral transmission between species, with HIV and Ebola virus being two examples with significant human health implications. Despite the importance of such transmissions, we lack good models through which to explore zoonosis. In this study, we consider Simian Immunodeficiency Virus (SIV) as a model for cross-species transmission. SIVsm (adapted to sooty mangabeys) and SIVmac (adapted to rhesus macaques) represent a known cross-species transmission pair, with SIVmac arising from evolution of SIVsm within macaques. We considered data generated by infecting four cohorts of rhesus macaques with SIVsmE543, SIVsmE660, SIVmac239, and SIVmac251, respectively. For each animal, full-genome deep sequencing was available at multiple sample times over the first year of infection. We were then able to compare cross-species infection by SIVsm to adapted infection by SIVmac. We developed two computational approaches to identify locations within the SIV genome that were mutation hot spots and cold spots on SIVsm and SIVmac. Hot/cold spots were identified through (1) a hypothesis test based on a null model of mutations equally distributed across a gene and (2) an HMM that allowed for parameter estimation. By identifying differences in hot/cold spots on the SIVsm and SIVmac genomes, we locate potential divergent and conserved loci associated with SIV cross-species transmission.
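A hypothesis test against a null of mutations spread evenly across a gene can be sketched with a binomial tail probability. The abstract does not give the exact test statistic, so the windowed formulation and all names below are assumptions for illustration.

```python
from math import comb

def hotspot_p_value(k_window, n_total, window_len, gene_len):
    """P(X >= k_window) for X ~ Binomial(n_total, window_len / gene_len),
    i.e. the chance that a window collects at least k_window of the
    gene's n_total mutations if mutations fall uniformly along the gene."""
    p = window_len / gene_len
    return sum(comb(n_total, k) * p**k * (1 - p)**(n_total - k)
               for k in range(k_window, n_total + 1))
```

A small upper-tail probability flags a candidate hot spot; the corresponding lower tail, P(X <= k), would flag cold spots. Scanning many windows would also call for a multiple-testing correction.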
Email: sr286@georgetown.edu
GeneMark Line of Self-Training Gene Prediction Tools for Eukaryotic Genomes
Presenting Author: Alexandre Lomsadze
Affiliation: Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA
Complete Authors List: Alexandre Lomsadze (1) and Mark Borodovsky (1,2,3)
Complete Author List Affiliations:
1. Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering,
2. School of Computational Science and Engineering,
3. Center for Bioinformatics and Computational Genomics,
Georgia Institute of Technology, Atlanta, GA, USA
Abstract:
Gene prediction plays a fundamental role in genomics. Still, developing automated machine learning algorithms with high precision is an active area of research, as the accuracy of gene prediction can be improved, especially for genomes with a high density of introns, e.g. in fungal and protozoan pathogens.
Our original automatic self-training algorithm GeneMark-ES (Lomsadze et al., 2005) is constantly updated; it runs without any external evidence, such as transcript or protein data, and can generate accurate ab initio gene predictions for eukaryotic genomes, including those that belong to species from deeply rooted lineages. The line of automated eukaryotic gene finders also includes the fully automated GeneMark-ET (Lomsadze et al., 2014), which integrates mapped RNA-Seq reads into the training process. Another tool, GeneMark-EP (Lomsadze et al., in preparation), integrates information on mapped homologous proteins into training. The self-training mode makes the eukaryotic gene finders of the GeneMark line an important part of whole genome annotation pipelines both in large centers, e.g. the Broad Institute, and in individual labs.
Email: borodovsky@gatech.edu
Evolutionary Analyses of Gene Gain and Loss Patterns Reveal Differences in the Types of Dynein Microtubule Motors Within Apicomplexan Parasites
Presenting Author: Ousman Mahmud
Affiliation: Center for Tropical and Emerging Global Diseases, Department of Genetics, University of Georgia, Athens, GA, 30602
Complete Authors List:
Ousman Mahmud (1,2); Jessica C. Kissinger (1,2,3)
Complete Author List Affiliations:
Department of Genetics, University of Georgia, Athens, GA, 30602 (1);
Center for Tropical and Emerging Global Diseases, University of Georgia, Athens, GA, 30602 (2);
Institute of Bioinformatics, University of Georgia, Athens, GA, 30602 (3)
Abstract:
The phylum Apicomplexa contains mostly obligate intracellular parasites. Apicomplexans have reductive, streamlined genomes of variable size and protein-coding content. Analyses of gene gain and loss patterns will be highly informative with respect to our understanding of the biology and evolution of parasitism. Identification of gene gain and loss patterns by an orthology clustering approach led to the discovery of copy number variation in apicomplexan dynein genes. Dyneins are microtubule motors that mediate force and movement. We have performed a phylum-wide characterization of copy number patterns and phylogenetic relationships of the dynein heavy chain (DHC) gene family to identify trends in the evolution of this family. Phylogenetic analyses revealed that the last common apicomplexan ancestor had at least ten different DHC genes. Coccidians have retained all ten ancestral DHC genes. Five of the ancestral DHC genes are retained only in coccidians, suggesting their loss in other apicomplexans. Haemosporidians have five types of DHC genes. Cryptosporidians, piroplasms and Gregarina niphandrodes have only one DHC gene. Piroplasm DHC genes cluster with DHCs of red algae and diplomonad excavates. This suggests that the piroplasm DHC gene may have been acquired via a gene transfer event, or that piroplasms have retained a copy that has been lost in all other examined apicomplexan species. Another possibility is convergent evolution. Findings from analyses of gene gain and loss patterns have shed light on changes in DHC gene repertoire within apicomplexans. The differences in the types of DHC genes within apicomplexans may reflect differences in microtubule-associated transport and movement functions among the parasites. The single-copy DHC genes, especially the ones in piroplasms, may have novel or expanded roles. We want to understand how apicomplexans, especially piroplasm species, use their dynein genes.
Experiments to localize the protein products of DHC genes in Babesia bovis and Toxoplasma gondii are underway. In silico experiments to identify potential proteins that interact with the dynein complex are also underway. These ongoing experiments will further provide insights into dynein movement and transport functions within apicomplexans.
Email: omahmud@uga.edu
Metagenomic Sequencing and Analysis for Viral Diseases
Presenting Author: Terry Fei Fan Ng
Affiliation: Division of Viral Diseases, Centers for Disease Control and Prevention
Complete Authors List:
Terry Fei Fan Ng(1), Laura Magaña(1), Anna Montmayeur (2,3), William A. Nix(1), Shannon Rogers(1), Kaija Maher(1), Cara C. Burns(1), Jane Iber(1), Qi Chen(1), Bettina Bankamp(1), Joseph P. Icenogle(1), Min-hsin Chen(1), Dean Erdman(1), Xiaoyan Lu(1), Suxiang Tong(1), Clinton R. Paden(1), Jan Vinjé (1), Nicole A. Gregoricus(1), Nikail Collins(1), Kshama Aswath(1), Marta Diez-Valcarce(1), Michael Bowen(1), Mathew D. Esona(1), Baoming Jiang(1), Jennifer Hull(1), Edward Ramos(2,3), Yang Xu(2,3), Roman L. Tatusov(2,3), Christina Castro(1), Gregory H. Doho(2,3), Paul Rota(1), Steve Oberste(1)
Complete Author List Affiliations:
(1) Division of Viral Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, Georgia
(2) SRA International, Inc,
(3) NCIRD Core Bioinformatics Support, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, Georgia
Abstract:
Next generation sequencing (NGS) has revolutionized the field of genomics and is now on the cusp of becoming a critical tool in clinical virology. We used NGS to investigate viral evolution in recent outbreaks of EV-D68, measles virus, calicivirus and rotavirus in 2015, and in molecular surveillance for MERS-CoV, rubella virus and poliovirus. These viruses have a wide range of genetic characteristics (single- or double-stranded, RNA or DNA, 5 – 125 kb). In 2015, we collectively generated more than 500 million NGS raw reads from 500 samples, ranging from nasal swabs and fecal material to viral isolates. Metagenomic protocols were evaluated and optimized for each viral pathogen. We also developed a bioinformatics pipeline that automates NGS data analysis and reporting for viral identification and surveillance. After quality and adaptor trimming, the reads are assembled de novo to produce sequence contigs. Species characterization of these contigs is accomplished using a local BLAST search of a custom viral and GenBank database. These results are then transferred to a web server where the data can be viewed. For most of the viral diseases, the sensitivity of NGS detection was proportional to the virus load measured by quantitative PCR. When compared with the Sanger sequencing results, the NGS-derived genomes were validated, with the additional benefit of resolving single nucleotide polymorphisms. Multiplexing greatly increases output and decreases cost, but the optimal number of samples per run is best determined empirically, because it differs depending on virus and sample types. We used the percentage of viral reads as a quantitative proxy to evaluate the efficiency of the protocols and found that the number of raw reads does not always reflect sequencing success, since the ratio of target virus reads to total raw reads ranged from 90% to 0.01%.
In general, the ratio varied by the sample type - the highest ratios were obtained for viral isolates, followed by clinical samples with high viral loads. In some cases, the use of SISPA (sequence-independent, single-primer amplification) allowed the generation of enough amplicon for sequencing. By establishing a knowledge base of how to sequence each viral pathogen, we obtained a highly effective NGS protocol and analysis pipeline. NGS facilitates rapid generation of sequence information for outbreaks of viral diseases, and is applicable to the investigation of other emerging pathogens.
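The viral-read percentage used above as an efficiency proxy can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline described in the abstract: the function name, the sample names, and the read counts are all hypothetical and chosen only to span the reported 90% to 0.01% range.

```python
def viral_read_fraction(viral_reads: int, total_reads: int) -> float:
    """Return the fraction of raw reads assigned to the target virus."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= viral_reads <= total_reads:
        raise ValueError("viral_reads must be between 0 and total_reads")
    return viral_reads / total_reads


# Hypothetical samples: (target virus reads, total raw reads).
samples = {
    "viral_isolate": (900_000, 1_000_000),
    "nasal_swab": (100, 1_000_000),
}

fractions = {
    name: viral_read_fraction(viral, total)
    for name, (viral, total) in samples.items()
}

# Rank samples from most to least efficient by viral-read fraction.
for name, fraction in sorted(fractions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {fraction:.2%} viral reads")
```

Both hypothetical samples have the same number of raw reads, yet their viral-read fractions differ by several orders of magnitude, which is why raw read count alone is a poor indicator of sequencing success.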
Email: ylz9@cdc.gov
The Genomic Basis of Capsule Switching in the Hajj clone of Neisseria meningitidis
Presenting Author: Emily T. Norris
Affiliation: School of Biology, Georgia Institute of Technology, Atlanta, Georgia, USA; PanAmerican Bioinformatics Institute, Santa Marta, Colombia; Applied Bioinformatics Laboratory, Atlanta, Georgia, USA
Complete Authors List: Emily T. Norris (1,2,3); Lavanya Rishishwar (1,2,3); Jennifer T Pentz (1); I. King Jordan (1,2,3)
Complete Author List Affiliations:
1 School of Biology, Georgia Institute of Technology, Atlanta, Georgia, USA;
2 PanAmerican Bioinformatics Institute, Santa Marta, Colombia;
3 Applied Bioinformatics Laboratory, Atlanta, Georgia, USA
Abstract:
The 2000 Hajj pilgrimage to Mecca in Saudi Arabia led to the first large outbreak of bacterial meningitis caused by Neisseria meningitidis serogroup W. Treatment and prophylactic vaccination for meningococcal disease require subtyping the bacteria on the basis of the capsule type. The disease-causing strain from Saudi Arabia, known as the Hajj clone, was from Clonal Complex and Sequence Type 11 (CC-11 and ST-11, respectively), both of which are more commonly associated with serogroup C isolates. This suggested the possibility of capsule switching in the Hajj clone, whereby the genome region encoding the capsule is exchanged between different serogroup lineages. Due to the global transmission of this clone and the resulting meningitis outbreaks, we sought to determine the genetic basis of meningitis outbreaks due to N. meningitidis strains that were either related to the Hajj clone or due to other serogroup W N. meningitidis strains. Isolates of N. meningitidis, including the Hajj clone, were sequenced, serotyped and sequence typed by the Centers for Disease Control and Prevention (CDC). All of the isolates were serogroup W, with the sequence types being mainly ST-11, but also ST-22 and ST-2881. Whole genome based comparisons showed that all of the ST-11 isolates were most similar to the serogroup C ST-11 N. meningitidis reference genome, rather than the serogroup W ST-22 N. meningitidis reference genome, indicative of a capsule switch between serogroups C and W. Phylogenetic analysis of these isolates showed that the capsule switch proceeded from a serogroup C strain to a serogroup W strain, and that it was a single event that was then propagated to multiple strains through genetic recombination. Chromosomal painting followed by recombination break-point analysis of the isolates revealed that a complex pattern of recombination along the capsule locus was responsible for the observed capsule switching event in the Hajj clone.
The Hajj clone chimeric capsule locus evolved via the independent acquisition of sequences from W and Y serogroups, with the capsule polymerase genes gained from serogroup W encoding the characteristic W serotype. These results illustrate the complex evolutionary dynamics of a highly virulent strain of N. meningitidis and underscore the insufficiency of serotyping and/or sequence typing alone for the characterization of disease-causing lineages. Whole genome sequence analysis can help provide more targeted treatment and vaccination strategies.
Email: enorris@ihrc.com
metaSPAdes: a new versatile de novo metagenomics assembler
Presenting Author: Sergey Nurk
Affiliation: Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia
Complete Authors List:
Sergey Nurk (1); Dmitry Meleshko (1); Pavel Pevzner (1,2)
Complete Author List Affiliations:
1. Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia
2. Department of Computer Science and Engineering, Univ. of California San Diego, San Diego, USA
Abstract:
Metagenome sequencing has emerged as a technology of choice for analyzing bacterial populations and discovering novel organisms and genes. Since high quality assemblies are crucial for many metagenomics studies, different groups have developed specialized metagenomics assemblers. However, the problem of accurate de novo assembly of complex metagenomics datasets is far from resolved, which stifles biological discovery. The key computational challenges that make metagenomic assembly difficult are: non-uniform read coverage of various species within a metagenome; differences between closely related strains of the same bacterial species; similarities between different bacterial species; and dataset sizes that dwarf all other DNA sequencing projects. Some of these challenges have already been addressed in the course of development of the SPAdes assembly toolkit, albeit in an application domain outside the field of metagenomics. Indeed, SPAdes (Bankevich et al., 2012) was developed to assemble datasets with non-uniform read coverage (one of the key challenges of single-cell assembly), while dipSPAdes (Safonova et al., 2015) was developed to address the challenge of assembling highly polymorphic diploid genomes (which is not unlike the challenge of assembling a mixture of multiple related bacterial strains). Although SPAdes was not originally designed for metagenomics applications, various groups decided to use it in their metagenomics studies. However, while SPAdes indeed worked well for assembling low complexity mini-metagenomes like cyanobacterial filaments, its performance deteriorates in the case of complex metagenomics datasets. To address this limitation, we developed the metaSPAdes software, which brings together new algorithmic ideas and proven solutions from the SPAdes toolkit to address the metagenomic assembly challenges. We show that metaSPAdes improves assemblies compared to the state-of-the-art metagenomics assemblers MEGAHIT, IDBA-UD, and Ray-Meta.
Email: sergeynurk@gmail.com
A bioinformatics pipeline for the comparative analysis of 100s of Bacillus anthracis genome sequences
Presenting Author: Angela V. Pena-Gonzalez
Affiliation: School of Biology, Georgia Institute of Technology, 310 Ferst Dr. NW, Atlanta, GA 30332, U.S.A.
Complete Authors List:
Luis M. Rodriguez-R (1); Angela V. Pena-Gonzalez (1); Chung K. Marston (2); Jay E. Gee (2); Cari A. Beesley (2); Elke Saile (2); Mike Frace (2); Michael R. Weigand (1); Konstantinos T. Konstantinidis (1,3); Alex Hoffmaster (2)
1-School of Biology, Georgia Institute of Technology, Atlanta, GA, U.S.A.
2-Bacterial Special Pathogens Branch, Division of High-Consequence Pathogens and Pathology, National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta, GA, U.S.A.
3-School of Civil & Environmental Engineering, Georgia Institute of Technology, Atlanta, GA U.S.A
Abstract:
Bacillus anthracis, the etiological agent of anthrax, is one of the most genetically homogeneous pathogens described so far. Its clonal nature makes discrimination between strains and assessment of their evolutionary relationships challenging. Current genotyping approaches include multiple locus variable-number tandem repeat analysis (MLVA) and the analysis of a small number of single nucleotide polymorphisms known as canonical SNPs. However, these methods do not provide genome-level resolution, and data comparison between laboratories is often problematic, resulting in significant delays during public health investigations. Here, we developed a bioinformatics pipeline to assemble, annotate and phylogenetically discriminate the genome sequences of 270 Bacillus anthracis strains that represent the global diversity of the pathogen well. Phylogenetic analysis is primarily based on the genome-aggregate average nucleotide identity (ANI) values between pairs of strains. Our results show that the ANI-based tree captures all major B. anthracis lineages and is, in general, consistent with the canonical SNP cladogram. However, ANI provides much higher resolution than canonical SNPs or MLVA, since several genomes that show identical SNP patterns are clearly separated into distinct sub-clades in the ANI tree. Our genomic analyses also showed that there are, on average, 4 pXO1 and 2 pXO2 plasmid copies per cell, and that the plasmid copy numbers do not show strong biogeographic patterns among North American strains. The pipeline is fully automated and takes as input raw sequence data of isolate genome projects as produced by next generation sequencers such as the Illumina MiSeq and HiSeq platforms. The pipeline is expected to facilitate molecular epidemiology studies of the pathogen and can be easily adapted for additional bacterial pathogens.
Email: avpg3@gatech.edu
Staphopia: a web application for "S. aureus" whole genome shotgun sequencing data
Presenting Author: Robert A. Petit III
Affiliation: Division of Infectious Diseases, Department of Medicine, Emory University School of Medicine, Atlanta, Georgia, USA
Complete Authors List:
Robert A. Petit III (1); Timothy D. Read (1,2)
Complete Author List Affiliations:
(1) Division of Infectious Diseases, Department of Medicine, Emory University School of Medicine, Atlanta, Georgia, USA
(2) Department of Human Genetics, Emory University School of Medicine, Atlanta, Georgia, USA
Abstract:
Whole-genome sequencing of bacterial strains directly from patients is becoming common practice as the cost of sequencing continues to fall. There has been a tremendous increase in the number of sequenced genomes in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database within the last few years. However, analyzing these genomes and extracting relevant information to better understand the bacterial pathogens and the associated clinical phenotypes remains a challenge, as it requires substantial computational resources and bioinformatics skills. We have developed a proof-of-concept web-based platform called Staphopia for rapid analysis of large numbers of "Staphylococcus aureus" genomes. Staphopia is available at www.staphopia.com, which is hosted on Amazon Web Services (AWS). Staphopia makes use of publicly available sequencing projects from the SRA. From these projects, the raw FASTQ sequences are processed through Staphopia's analysis pipeline. The analysis pipeline can be broken down into a number of sub-modules. Input FASTQ reads first undergo quality control filtration, normalization to a maximum of 50x coverage and then de novo assembly. From the assembly, genes are predicted and annotated. Using sequence mapping, multi-locus sequence type (MLST) and variants (SNPs and InDels) are determined. The 31-mers are also counted for each input project. Depending on the size of the input file, analysis is completed in 20 to 60 minutes. The results of the analysis pipeline are stored in a PostgreSQL database, which is hosted on an AWS Relational Database Service (RDS) instance. A front-end for querying the database has been developed using the Django framework and is hosted on AWS Elastic Compute Cloud (EC2). An API has been developed to allow programmatic querying of the database. To date, we have loaded > 24,000 genomes from the SRA.
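The 31-mer counting step mentioned in the abstract can be illustrated with a short Python sketch. This is not Staphopia's implementation, which the abstract does not describe; the function names are hypothetical, and the choice to count canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement) and to skip k-mers containing ambiguous bases are assumptions made for the illustration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID_BASES = frozenset("ACGT")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k: int = 31) -> Counter:
    """Count canonical k-mers across reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        # range() is empty when the read is shorter than k.
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if _VALID_BASES.issuperset(kmer):
                counts[canonical(kmer)] += 1
    return counts


# Small k used here only so the toy example is readable.
toy_counts = count_kmers(["ACGTA"], k=3)
print(toy_counts)
```

At the scale of tens of thousands of genomes, a dedicated k-mer counter would be far faster than pure Python; the sketch only shows what is being counted.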
Email: robert.petit@emory.edu
SPAdes family of assembly tools
Presenting Author: Pavel Pevzner
Affiliation: Department of Computer Science and Engineering, Univ. of California San Diego, San Diego, USA
Complete Authors List:
Dmitry Antipov (1); Anton Bankevich (1); Elena Bushmanova (1); Aleksey Gurevich (1); Anton Korobeynikov (1); Alla Mikheenko (1); Dmitry Meleshko (1); Sergey Nurk (1); Andrei Prjibelski (1); Yana Safonova (1); Alla Lapidus (1); Pavel Pevzner (1,2)
Complete Author List Affiliations:
1. Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia
2. Department of Computer Science and Engineering, Univ. of California San Diego, San Diego, USA
Abstract:
Despite its central role in genomics, accurate de novo genome assembly remains challenging. Moreover, the proliferation of new sequencing and sample-preparation technologies introduces additional complications. In 2015, the SPAdes genome assembler (Bankevich et al., 2012), which was originally conceived as a scalable and easy-to-modify platform, was gradually extended into a family of SPAdes tools aimed at various sequencing technologies and applications. In addition to the constantly updated SPAdes assembler itself, it now includes: the dipSPAdes tool for assembly of highly polymorphic genomes (Safonova et al., 2015); the exSPAnder module for repeat resolution, which enables efficient utilization of mate-pair libraries and even mate-pair-only assemblies with NexteraMP libraries (Prjibelsky et al., 2014, Vasilinetc et al., 2015); the hybridSPAdes tool for hybrid assembly of accurate short reads with long error-prone reads, such as Pacific Biosciences and Oxford Nanopore reads (Antipov et al., 2015); the truSPAdes tool for assembling Illumina’s barcoded True Synthetic Long Reads (Bankevich and Pevzner, 2015); the metaSPAdes assembler for metagenomics data (Nurk et al., 2015); rnaSPAdes, a de novo RNA-Seq data assembler (Prjibelsky et al., in preparation); and plasmidSPAdes, for assembly of plasmids from whole genome sequencing data (Antipov et al., in preparation). We will provide an overview of the SPAdes family of tools and describe their benchmarking against state-of-the-art assembly tools using the QUAST family of assembly evaluation tools: the QUAST tool for the quality assessment of genomics assemblers (Gurevich et al., 2013); the metaQUAST tool for the quality assessment of metagenomics assemblers (Mikheenko et al., 2015); and the rnaQUAST tool for the quality assessment of RNA-Seq assemblers (Bushmanova et al., 2015).
Email: ppevzner@ucsd.edu
Population genomics of reduced vancomycin susceptibility in Staphylococcus aureus
Presenting Author: Lavanya Rishishwar
Affiliation: School of Biology, Georgia Institute of Technology, Atlanta, Georgia, USA
Complete Authors List: Lavanya Rishishwar (1,2,3); Colleen S. Kraft (2,4); I. King Jordan (1,2,3)
Complete Author List Affiliations:
1 School of Biology, Georgia Institute of Technology, Atlanta, Georgia, USA
2 PanAmerican Bioinformatics Institute, Santa Marta, Magdalena, Colombia
3 Applied Bioinformatics Laboratory, Atlanta, Georgia, USA
4 Division of Infectious Diseases, Emory University, Atlanta, Georgia, USA
Abstract:
Vancomycin-intermediate Staphylococcus aureus (VISA) is an emerging healthcare threat given its increased prevalence and the fact that clinical laboratories do not have sensitive methods to detect reduced vancomycin susceptibility. Genome-based comparative methods hold great promise to uncover the genetic basis of the VISA phenotype, which remains obscure. S. aureus isolates were collected from a single individual who presented with recurrent staphylococcal bacteremia at three time points, each of which showed successively reduced levels of vancomycin susceptibility. A population genomic approach was taken to compare patient S. aureus isolates with decreasing vancomycin susceptibility across the three time points. To do this, patient isolates were sequenced to high coverage (~500x) and sequence reads were used to model site-specific allelic variation within and between isolate populations. Population genetic methods were then applied to evaluate the overall levels of variation across the three time points and to identify individual variants that show anomalous levels of allelic change between populations. A successive reduction in the overall levels of population genomic variation was observed across the three time points, consistent with a population bottleneck resulting from antibiotic treatment. Despite this overall reduction in variation, a number of individual mutations were swept to high frequency in the VISA population. These mutations were implicated as potentially involved in the VISA phenotype and interrogated with respect to their functional roles. This approach allowed us to identify a number of mutations previously implicated in VISA, along with allelic changes within a novel class of genes encoding LPXTG-motif-containing cell wall anchoring proteins, which shed light on a novel mechanistic aspect of vancomycin resistance.
The population genomic approach to genome sequence comparison taken here, whereby high coverage sequencing allows for read-based evaluation of allelic variation within and between patient isolate populations, is far more sensitive than the more widely employed consensus-sequence based approach previously used to compare isolates with distinct phenotypes. The analytical approaches laid out in this study can be broadly applied to compare patient isolates with distinct antibiotic resistance profiles, or distinct phenotypes of any kind that may have evolved under selective pressure.
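The read-based comparison described above can be sketched as a per-site allele frequency calculation. The following Python example is only an illustration under stated assumptions: the function names, the minimum depth of 50 reads, the fixed frequency-change threshold of 0.5, and the read counts are all hypothetical, and the simple threshold stands in for the population genetic tests that the abstract refers to but does not specify.

```python
def frequency_changes(early, late, min_depth=50):
    """Return the per-site change in alternate-allele frequency between two time points.

    `early` and `late` map a site position to (alt_reads, total_reads).
    Sites missing from either time point or below `min_depth` are skipped.
    """
    changes = {}
    for site in early.keys() & late.keys():
        alt_early, depth_early = early[site]
        alt_late, depth_late = late[site]
        if depth_early < min_depth or depth_late < min_depth:
            continue
        changes[site] = alt_late / depth_late - alt_early / depth_early
    return changes


def candidate_sweeps(changes, threshold=0.5):
    """Return sites whose frequency rose by at least `threshold`, largest first."""
    rising = [site for site, delta in changes.items() if delta >= threshold]
    return sorted(rising, key=lambda site: changes[site], reverse=True)


# Hypothetical read counts at three sites, sequenced to roughly 500x.
early = {100: (5, 500), 200: (250, 500), 300: (1, 10)}
late = {100: (480, 500), 200: (260, 500), 300: (9, 10)}

changes = frequency_changes(early, late)
print(candidate_sweeps(changes))
```

A consensus-sequence comparison would miss a variant present in, say, 30% of reads at both time points, whereas the read-based frequencies retain that information.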
Email: lavanya.rishishwar@gatech.edu
A novel method to measure genetic diversity within natural bacterial populations
Presenting Author: Luis M. Rodriguez-R
Affiliation: Georgia Institute of Technology
Complete Authors List:
Luis M Rodriguez-R (1); Despina Tsementzi (2); Konstantinos T. Konstantinidis (1,2)
(1) School of Biology, and
(2) School of Civil and Environmental Engineering,
Georgia Institute of Technology, Atlanta, GA, USA.
Abstract:
The rapid decrease in the cost of sequencing technologies has recently allowed an unprecedented wealth of genomic data, most evident in the bacterial domain. Several model species have been subjected to extensive genome sequencing, with hundreds to thousands of sequenced representatives (strains) currently available, allowing a detailed characterization of the sequence and gene-content diversity accumulated within a single species through its evolutionary history: the species pangenome. However, many of the strains of a species originate from different environments (e.g., E. coli strains from human vs. environmental samples) and are likely under varying selective pressures. Hence, it remains unclear to what extent the level of genetic diversity observed within a species based on sequenced strains is also found in situ within natural populations of the species. Yet, this issue is important for better modeling and understanding bacterial diversity and genome plasticity, and for the (problematic) bacterial species concept. Here, we present a novel methodology for the quantification of intra-population gene-content and sequence diversity in metagenomes. By modeling the observed sequencing depths across a genome as the combination of skewed-log-normal distributions representing different levels of gene-presence conservation across genotypes, we can quantify the population core-genome and estimate properties of its pangenome. We derived the parameters of the estimation using over two thousand simulations on a set of 2,838 genomes from ten bacterial species and validated our method using simulations on a wider taxonomic range, including all bacterial species in NCBI GenBank with more than five complete genomes available.
We applied this methodology to previously determined metagenomes from a variety of environments including human gut and posterior fornix, marine environments, and acid mine drainage, and demonstrate the extent of the pangenome overestimation based on genome collections relative to naturally-occurring populations.
Email: lmrodriguezr@gmail.com
Complicated Malaria: Gene Expression Profiling in a Colombian Case Study
Presenting Author: Monica L. Rojas-Peña
Affiliation: Center for Integrative Genomics, School of Biology, Georgia Institute of Technology, Atlanta, GA, USA
Complete Authors List: Monica L. Rojas-Peña (1); Myriam Arévalo-Herrera (2,3); Sócrates Herrera (3,4); Greg Gibson (1)
Complete Author List Affiliations:
1. Center for Integrative Genomics, School of Biology, Georgia Institute of Technology, Atlanta, GA, USA
2. Faculty of Health, Universidad del Valle, Cali, Colombia
3. Malaria Vaccine and Drug Development Center, Cali, Colombia
4. Caucaseco Scientific Research Center, Cali, Colombia
Abstract:
Each year approximately one third of all human deaths are caused by infectious and parasitic diseases. One of the most serious infectious diseases around the world is malaria. The causative agent of malaria is a parasite of the genus Plasmodium. There are an estimated 124 to 283 million infections annually, most of which occur in tropical areas where more than 3 billion people are at risk. Among the Plasmodium species, P. falciparum causes the most debilitating form of the disease and is the most common causative agent of complicated malaria, although P. vivax may also cause severe infections. The risk may be increased if treatment of an uncomplicated attack of malaria caused by these parasites is delayed. Infections with this parasite can be severe and even fatal in the absence of prompt treatment of the disease. Here we present a pilot case study designed to test the hypothesis that the duration and nature of human immune responses during complicated malaria (acute illness) can be associated with altered peripheral blood gene expression. Peripheral blood RNA from one individual with complicated symptoms of P. vivax malaria infection from Quibdo, on the Pacific coast of Colombia, was sequenced (RNA-Seq) on 5 different days, from the day of hospitalization to the day that chloroquine and primaquine treatment was judged to have cleared the parasite. Our analyses show that Day 1, at the peak of parasitemia, is the most perturbed of the different days, along with Day 3. Axis of variance and blood transcript module analyses indicate that inflammation and interferon activation are the most perturbed. Similar studies of a dozen cases are under way to assess to what extent transcriptome analysis of peripheral blood samples can be used to identify molecular signatures of the immune response and to identify pathways that contribute to clearance of the parasite.
Email: monica.rojas@gatech.edu
Comparative genomic analysis of recombination rates among bacteria species
Presenting Author: Maria J. Soto-Giron
Affiliation: School of Biology and Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA, USA.
Complete Authors List: Maria J. Soto-Giron (1,3); Luis M. Rodriguez-R (1,3); Konstantinos Konstantinidis (1,2,3)
Complete Author List Affiliations:
1. School of Biology, Georgia Institute of Technology, Atlanta, GA, USA.
2. School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, USA.
3. Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA, USA.
Abstract:
The frequency and magnitude of recombination (i.e., number and length of recombined fragments) vary among bacterial species. Some bacteria are thought to follow a clonal structure model while others present a recombining mixture of diverse genotypes. The dramatic increase in sequenced bacterial genomes now makes a robust investigation of population diversification possible. In this study, we propose an alternative method to estimate the rates of genetic exchange using the number of observed shared genes relative to the number of identical genes expected by chance based on the average amino acid identity (AAI) between genome pairs. We applied this method to several bacterial species with distinct ecologies and found that the expected level of recombination was robust and agreed with previous reports detecting close to zero recombination rates for obligate intracellular bacteria such as Buchnera aphidicola. However, the contribution of recombination to bacterial diversification was more similar than previously thought in generalist organisms such as Neisseria meningitidis and Campylobacter jejuni. C. jejuni, a pathogen associated with foodborne infections, was analyzed in more detail using isolates from distinct hosts and regions of the world (n > 200). Our results suggest a cosmopolitan population clustered mainly in four clades with pervasive recombination among the clades, e.g., the recombination to mutation ratio was 3.6 (3.6 times more nucleotide changes are affected by recombination than by mutation). Additionally, genes related to resistance to the antibiotics bacitracin, erythromycin, and amikacin are frequently exchanged among the clades. The transmission of antibiotic resistance genes requires attention to control the emergence of multi-resistant genotypes.
Email: juliana.soto@gatech.edu
Impacts of Chromatin States and Long-Range Genomic Segments on Aging and DNA Methylation
Presenting Author: Dan Sun
Affiliation: Georgia Institute of Technology
Complete Authors List: Dan Sun; Soojin Yi
Complete Author List Affiliations:
School of Biology, Georgia Institute of Technology
Abstract:
Understanding the fundamental dynamics of epigenome variation during normal aging is critical for elucidating key epigenetic alterations that affect development, cell differentiation and diseases. Advances in the field of aging and DNA methylation strongly support the aging epigenetic drift model. Although this model aligns with previous studies, the role of other epigenetic marks, such as histone modification, as well as the impact of sampling specific CpGs, must be evaluated. Ultimately, it is crucial to investigate how all CpGs in the human genome change their methylation with aging in their specific genomic and epigenomic contexts. Here, we analyze whole genome bisulfite sequencing DNA methylation maps of brain frontal cortex from individuals of diverse ages. Comparisons with blood data reveal tissue-specific patterns of epigenetic drift. By integrating chromatin state information, we reveal divergent degrees and directions of aging-associated methylation in different genomic regions. Whole genome bisulfite sequencing data also open a new door to investigate whether adjacent CpG sites exhibit coordinated DNA methylation changes with aging. We identified significant ‘aging-segments’, which are clusters of nearby CpGs that respond to aging with similar DNA methylation changes. These segments not only capture previously identified aging-CpGs but also include specific functional categories of genes with implications for epigenetic regulation of aging. For example, genes associated with development are highly enriched in positive aging segments, which are gradually hyper-methylated with aging. On the other hand, regions that are gradually hypo-methylated with aging (‘negative aging segments’) in the brain harbor genes involved in metabolism and protein ubiquitination.
Given the importance of protein ubiquitination in proteome homeostasis of aging brains and neurodegenerative disorders, our finding suggests the significance of epigenetic regulation of this posttranslational modification pathway in the aging brain. Utilizing aging segments rather than individual CpGs will provide more comprehensive genomic and epigenomic contexts to understand the intricate associations between genomic neighborhoods and developmental and aging processes. These results complement the aging epigenetic drift model and provide new insights.
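The idea of an aging segment, a run of nearby CpGs whose methylation shifts in the same direction with age, can be illustrated with a minimal Python sketch. This is not the authors' method: the function names, the 500 bp gap limit, the slope threshold and the three-CpG minimum are all hypothetical, and a plain slope cutoff stands in for the statistical significance testing a real analysis would need.

```python
from statistics import mean

def slope(ages, levels):
    """Ordinary least-squares slope of methylation level against age."""
    a_bar, l_bar = mean(ages), mean(levels)
    sxx = sum((a - a_bar) ** 2 for a in ages)
    sxy = sum((a - a_bar) * (l - l_bar) for a, l in zip(ages, levels))
    return sxy / sxx

def aging_segments(positions, methylation, ages,
                   max_gap=500, min_slope=0.001, min_cpgs=3):
    """Group adjacent CpGs that trend with age in the same direction.

    positions   -- sorted CpG coordinates on one chromosome
    methylation -- one list per CpG of methylation fractions, one per individual
    ages        -- age of each individual, in the same order
    All thresholds are illustrative assumptions, not published values.
    """
    segments, run = [], []  # run holds (position, sign) for the open segment

    def close_run():
        # Keep the open run only if it contains enough CpGs.
        if len(run) >= min_cpgs:
            label = "positive" if run[0][1] > 0 else "negative"
            segments.append((run[0][0], run[-1][0], label))
        run.clear()

    for pos, levels in zip(positions, methylation):
        s = slope(ages, levels)
        sign = (s > min_slope) - (s < -min_slope)  # +1, -1, or 0 (no trend)
        if sign == 0:
            close_run()
            continue
        if run and (sign != run[-1][1] or pos - run[-1][0] > max_gap):
            close_run()
        run.append((pos, sign))
    close_run()
    return segments

ages = [20, 40, 60, 80]
positions = [100, 150, 220, 5000, 5050, 5100]
gaining = [0.2, 0.3, 0.4, 0.5]   # methylation rises with age
losing = [0.8, 0.7, 0.6, 0.5]    # methylation falls with age
methylation = [gaining, gaining, gaining, losing, losing, losing]
result = aging_segments(positions, methylation, ages)
print(result)
```

In this toy input the first three CpGs form a positive (hyper-methylated) segment and the last three form a negative (hypo-methylated) segment.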
Email: dsun33@gatech.edu
New Approach to ab initio Gene Prediction in Prokaryotic Genomes: Introduction of Locally Optimal Models and Adaptive Training
Presenting Author: Shiyuyun Tang
Affiliation: School of Biology, Georgia Institute of Technology
Complete Authors List: Shiyuyun Tang (1), Alex Lomsadze (2), Mark Borodovsky (2,3,4)
Complete Author List Affiliations:
1. School of Biology,
2. Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering,
3. School of Computational Science and Engineering,
4. Center for Bioinformatics and Computational Genomics,
Georgia Institute of Technology, Atlanta, GA, USA
Abstract:
Prokaryotic genes can be predicted with much higher average accuracy than eukaryotic ones. However, the error rate is not negligible and is largely species-specific. Most gene prediction errors occur in genomic regions with atypical G+C composition, e.g., genes in novel pathogenicity islands. We have accumulated significant data from testing the previously developed GeneMarkS (Besemer et al., 2001), a self-training software tool that has been used in many genome projects. GeneMarkS is constantly updated. One version of GeneMarkS has served as a core element of the NCBI prokaryotic genome annotation pipeline (PGAP); in August 2015 PGAP annotated and re-annotated more than 48,000 prokaryotic genomes (ncbi.nlm.nih.gov/genome/annotation_prok/process/).
Here we present a new algorithm and software tool, GeneMarkS-2. In the first step of the analysis, the algorithm uses local G+C-specific heuristic models for scoring individual ORFs. Predicted atypical genes are retained and serve as ‘external’ evidence in subsequent runs of self-training. GeneMarkS-2 controls the quality of the training process by computing a measure of relative entropy between the protein-coding and non-coding sequence models determined by self-training. With respect to this measure, the algorithm selects optimal orders of the Markov chain models as well as duration parameters in the generalized HMM. The accuracy of GeneMarkS-2 has been tested on a large number of prokaryotic genomes and compared with other state-of-the-art gene prediction tools.
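As a rough illustration of a relative entropy measure between coding and non-coding sequence models, the Python sketch below estimates two Markov chains of a given order from example sequences and computes the conditional Kullback-Leibler divergence between them. The abstract does not give the exact formula used by GeneMarkS-2, so the function names, the pseudocount smoothing and the choice of conditional relative entropy are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product
from math import log2

ALPHABET = "ACGT"

def markov_counts(seqs, order):
    """Count (context, next base) pairs for a Markov chain of the given order."""
    counts = Counter()
    for s in seqs:
        for i in range(order, len(s)):
            counts[(s[i - order:i], s[i])] += 1
    return counts

def relative_entropy(coding, noncoding, order, pseudo=1.0):
    """Conditional relative entropy (bits per base) of the coding model
    with respect to the non-coding model. Pseudocounts avoid log(0)."""
    c = markov_counts(coding, order)
    n = markov_counts(noncoding, order)
    contexts = ["".join(t) for t in product(ALPHABET, repeat=order)]
    # Smoothed total over all (context, base) cells of the coding model.
    c_total = sum(c[(ctx, b)] for ctx in contexts for b in ALPHABET)
    c_total += pseudo * len(contexts) * len(ALPHABET)
    total = 0.0
    for ctx in contexts:
        c_ctx = sum(c[(ctx, b)] for b in ALPHABET) + pseudo * len(ALPHABET)
        n_ctx = sum(n[(ctx, b)] for b in ALPHABET) + pseudo * len(ALPHABET)
        for b in ALPHABET:
            p_joint = (c[(ctx, b)] + pseudo) / c_total  # P_cod(context, base)
            p_cond = (c[(ctx, b)] + pseudo) / c_ctx     # P_cod(base | context)
            q_cond = (n[(ctx, b)] + pseudo) / n_ctx     # P_non(base | context)
            total += p_joint * log2(p_cond / q_cond)
    return total

coding = ["ATGGCGGCGCTGGCCGCGGGC"]      # toy G+C-rich sequence
noncoding = ["ATATTTAAATATTTTAAATTA"]   # toy A+T-rich sequence
for k in (0, 1, 2):
    print(k, round(relative_entropy(coding, noncoding, k), 3))
```

Comparing the value across orders gives one view of how well each model order separates the two sequence classes. How GeneMarkS-2 turns such a measure into a choice of model order is not described in the abstract and is not reproduced here.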
Human population-specific expression and transcriptional network rewiring with polymorphic transposable element insertions
Presenting Author: Lu Wang
Affiliation: School of Biology, Georgia Institute of Technology, Atlanta, Georgia, USA
Complete Authors List: Lu Wang (1,2); Lavanya Rishishwar (1,2); Leonardo Mariño-Ramírez (2,3); I. King Jordan (1,2)
Complete Author List Affiliations:
1 School of Biology, Georgia Institute of Technology, Atlanta, Georgia, USA
2 PanAmerican Bioinformatics Institute, Santa Marta, Colombia
3 National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland, USA
Abstract:
Transposable elements (TEs) can be considered genomic parasites that engage in co-evolutionary arms races with their host genomes. TEs continually evolve mechanisms to ensure their spread and proliferation, whereas host genomes evolve repression mechanisms to silence TEs and thereby mitigate their deleterious effects. Over time, this co-evolutionary dynamic can result in TEs having profound effects on the structure, function and regulation of their host genomes. The human genome contains three families of TEs that remain active: Alu, L1, and SVA. Germline transposition of these elements creates insertional polymorphisms within and between human populations. These polymorphic TE insertions represent an important source of human genetic variation that may, in some cases, have demonstrable phenotypic consequences. We evaluated the effect of human polymorphic TE insertions on gene regulation using an expression quantitative trait loci (eQTL) approach, which identifies polymorphic TE loci with presence/absence insertion genotypes that are associated with differential gene expression among individuals. To do this, we analyzed data on polymorphic TE genotypes and gene expression levels (RNA-seq) characterized for 87 African individuals and 358 Europeans. Genotypes for 2,617 polymorphic TE loci from these individuals were compared to the expression levels of 22,102 genes in those same individuals using linear regression. We found several hundred polymorphic TE loci that are associated with inter-individual gene expression differences within and between human populations (i.e., polymorphic TEs that are eQTLs). The set of genes associated with population-specific TE-eQTLs is enriched for immune-related functions and pathways, consistent with the fact that the expression data we analyzed are from B-cells.
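The core of the eQTL comparison described above, a linear regression of expression level on TE insertion genotype, can be sketched in Python for a single locus-gene pair. This is a simplified illustration, not the authors' pipeline: the function name is hypothetical, genotypes are assumed to be coded as 0, 1 or 2 copies of the insertion, and the sketch omits covariates, p-values and multiple-testing correction.

```python
from statistics import mean

def te_eqtl_regression(genotypes, expression):
    """Regress one gene's expression on one TE locus genotype.

    genotypes  -- per-individual insertion copy number (assumed coding: 0, 1, 2)
    expression -- per-individual expression level of one gene
    Returns (slope, intercept, r_squared).
    """
    g_bar, e_bar = mean(genotypes), mean(expression)
    sxx = sum((g - g_bar) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("TE locus is not polymorphic in this sample")
    sxy = sum((g - g_bar) * (e - e_bar) for g, e in zip(genotypes, expression))
    syy = sum((e - e_bar) ** 2 for e in expression)
    slope = sxy / sxx                    # expression change per insertion copy
    intercept = e_bar - slope * g_bar
    r_squared = sxy ** 2 / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared

# Toy data: expression rises by about one unit per insertion copy.
genotypes = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
slope, intercept, r_squared = te_eqtl_regression(genotypes, expression)
print(slope, intercept, r_squared)
```

A genome-wide scan would repeat this fit for every tested locus-gene pair and then correct for the number of tests performed.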
We show several examples of population-specific TE-eQTLs that are associated with differential expression of human leukocyte antigen (HLA)-encoding genes. We also show evidence that polymorphic TEs can play a role in transcriptional network rewiring. In this case, a single cis TE-eQTL that changes the expression of a transcription factor-encoding gene (PAX5) can affect an entire regulatory network via concomitant trans changes in the expression of downstream target genes of that factor. These preliminary results suggest that human polymorphic TE insertions can have a substantial effect on differential gene expression between individuals.
Email: lu.wang@gatech.edu