In partial fulfillment of the requirements for the degree of Doctor of Philosophy in Bioinformatics in the College of Computing

Hamid Reza Hassanzadeh

Defends his thesis:
Advanced Machine Learning Approaches for Characterization of Transcriptional Regulatory Elements and Genome-Wide Associations

Wednesday, Nov 20th, 2019
3:30 PM Eastern Time
Howey Physics Classroom S204

Thesis Advisors:
Dr. Charles Isbell, School of Interactive Computing, Georgia Institute of Technology
Dr. Gregory Gibson, School of Biological Sciences, Georgia Institute of Technology

Committee Members:
Dr. Constantine Dovrolis, School of Computer Science, Georgia Institute of Technology
Dr. Denis Tsygankov, Department of Biomedical Engineering, Georgia Institute of Technology
Dr. Peng Qiu, Department of Biomedical Engineering, Georgia Institute of Technology and Emory University

Abstract

The deep learning revolution has initiated a surge of remarkable achievements in diverse research areas where large volumes of data that underlie complex processes exist. Despite the successful application of deep models to certain problems in the biomedical and bioinformatics domains, these models have so far shown little promise in solving many other challenging problems that involve genomic complexity. The goal of my Ph.D. research has been to develop advanced machine learning techniques to address two such challenging problems in bioinformatics: the characterization of transcriptional regulatory elements, and the modeling of genome-wide associations and linkage disequilibrium using genomic and evolutionary annotations of variants.

The genome codes for almost all biological phenomena that take place inside living cells. One key interaction is the association between transcription factors and a number of degenerate binding sites on DNA, which facilitates the initiation of gene transcription.
While each protein can potentially bind to any site on the DNA, it is the strength of this binding that plays the key role in the initiation process. Predicting these binding sites and their binding affinities are two interesting yet challenging problems that remain largely unsolved, even though we know that the cellular machinery constantly identifies such sites on DNA with near-perfect accuracy. The last two decades have witnessed the development of multiple in-vivo and in-vitro high-throughput technologies for elucidating these interactions. Protein Binding Microarrays (PBMs) have been among the most effective in-vitro technologies developed so far. The results of PBM experiments, however, are not easily interpretable and require advanced downstream analysis tools to discover binding patterns. In the first half of my thesis, I develop a series of computational methods that learn such patterns from data generated by this technology, using tools and techniques from the natural language and image processing domains. I also show the superiority of my proposed pipelines in predicting binding patterns and affinity.

The second part of my thesis is devoted to developing methods for modeling genome-wide associations and linkage disequilibrium. Both tasks pose similar challenges that restrict our ability to utilize recent advances in deep learning research. Specifically, in genome-wide association (GWA) studies, we are often limited by the high dimensionality of variant data, a significant degree of missing information (i.e., missing heritability), complex but weak patterns to learn, and relatively small datasets. As a consequence, the state-of-the-art approaches to GWAS used in practice are variations of linear models. In my thesis, I show that part of the failure to learn higher-capacity models can be attributed to how such models are trained.
Specifically, I show that, using Siamese networks and tools from graph theory, we can achieve performance on par with or higher than that of the state-of-the-art Bayesian non-parametric approaches. Having succeeded in learning weak relationships with the proposed model, I then extend my approach to show that there is a previously unknown relationship between variant annotations and their underlying haplotype structure. The existence of such a relationship can increase the power of GWA models and, if confirmed biologically, will have important implications for population genetics.