Genetics, Techniques

Controlling gene expression with deep generative design of regulatory DNA

Abstract

Design of de novo synthetic regulatory DNA is a promising avenue to control gene expression in biotechnology and medicine. Using mutagenesis typically requires screening sizable random DNA libraries, which limits the designs to span merely a short section of the promoter and restricts their control of gene expression. Here, we prototype a deep learning strategy based on generative adversarial networks (GANs) by learning directly from genomic and transcriptomic data. Our ExpressionGAN can traverse the entire regulatory sequence-expression landscape in a gene-specific manner, generating regulatory DNA with prespecified target mRNA levels spanning the whole gene regulatory structure including coding and adjacent non-coding regions. Despite high sequence divergence from natural DNA, in vivo measurements show that 57% of the highly-expressed synthetic sequences surpass the expression levels of highly-expressed natural controls. This demonstrates the applicability and relevance of deep generative design to expand our knowledge and control of gene expression regulation in any desired organism, condition or tissue.

Introduction

Gene expression is a fundamental process underlying the cellular functionality of all living organisms. Researchers have been trying to control it for decades, since it can help us design efficient gene therapies¹ and microbial cell factories², hopefully curing cancer among other diseases and aiding the transformation to a sustainable biobased society. Our ability to control gene expression derives from understanding the cell’s intrinsic regulatory code³, which can be used to design new regulatory sequences leading to desired expression levels^4,5,6. State-of-the-art machine learning approaches have recently proven highly useful in this endeavor, expanding our knowledge of the DNA regulatory grammar underlying gene expression^7,8,9,10, helping us to design promoter and gene sequences^11,12 and accurately predict gene expression across multiple model organisms^7,13. The striking capacity of random DNA to evolve into functioning regulatory sequences by introducing only a small number of base pair mutations^14,15 suggests that the richness and plasticity of cis-regulatory grammar result in a vast functional regulatory sequence space, far larger than the one currently observed in natural systems⁸. By learning this regulatory sequence space using advanced deep learning approaches^11,16,17, we can in principle design systems that precisely traverse it to generate regulatory sequence variants with targeted expression levels.

Popular strategies to design synthetic regulatory DNA of varying expression levels include stacking multiple known functional sequence motifs^4,6,18,19 and applying random mutagenesis to a specific region, most commonly the promoter^8,20,21,22,23, though UTRs^24,25,26 and terminators^27,28 have also been targeted, typically in the form of short sequence segments of <100 bp. Using in silico screening approaches^7,8, which evaluate the fitness of candidate sequences by predicting their expression levels, more intricate solutions based on evolutionary computation²⁹ have been implemented, including genetic algorithms^15,24,29,30,31. However, these algorithms still employ random mutagenesis in every round of sequence evolution, relying solely on the sequence-function mapping of the predictive models^29,32. Rather than generating valid sequences predicted to improve the target objective, they produce selection candidates via arbitrary sequence changes, many of which are not feasible regulatory DNA. This can potentially lead to highly untrustworthy predictions (predictor pathologies)^33,34,35 and local minima problems³², exacerbating the difficulty of finding the small subset of sequences that satisfy the target objective in the enormous search space. Therefore, the search for functional sequence variants frequently requires experimental screening with multiple rounds of trial and error or experimentally testing enormous sequence batches^5,8. The inherent inability to relate sequence to expression and the high resource intensiveness of the mutagenesis-based approaches are also the major factors constraining the explored DNA to only short segments of single regulatory regions and specific reporter genes^15,24. This ultimately limits gene expression control, thus not fulfilling the key design objective.
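The mutagenesis-based in silico evolution described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_predictor` is a hypothetical stand-in for a trained expression model (it merely scores GC fraction), and the function names, pool size and number of rounds are not taken from any of the cited methods. The point is only that each round proposes candidates by arbitrary point mutations and relies entirely on the predictor to rank them.

```python
import random

BASES = "ACGT"

def toy_predictor(seq):
    """Hypothetical stand-in for a trained expression predictor (scores GC fraction)."""
    return sum(base in "GC" for base in seq) / len(seq)

def mutate(seq, rng, n_mutations=1):
    """Apply random point mutations; nothing checks that the result is feasible regulatory DNA."""
    seq = list(seq)
    for _ in range(n_mutations):
        pos = rng.randrange(len(seq))
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)

def evolve(start, predictor, rounds=50, pool_size=20, seed=0):
    """Greedy evolution: mutate the current best, keep the top candidate if it scores higher."""
    rng = random.Random(seed)
    best, best_score = start, predictor(start)
    history = [best_score]
    for _ in range(rounds):
        candidates = [mutate(best, rng) for _ in range(pool_size)]
        top = max(candidates, key=predictor)
        if predictor(top) > best_score:
            best, best_score = top, predictor(top)
        history.append(best_score)
    return best, history

start = "AT" * 40  # 80 bp toy segment, within the <100 bp range mentioned above
best, history = evolve(start, toy_predictor)
print(best, history[0], history[-1])
```

Because the loop accepts whatever the predictor scores highest, it can drift into sequences on which the predictor is unreliable, which is the predictor-pathology risk noted in the text.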

Alternatively, the idea of novel solutions for regulatory DNA design, facilitated by deep neural networks, is to directly generate valid sequences by learning functional and biologically feasible sequence spaces^11,12,33,36. This can resolve many mutagenesis-related problems and helps to optimize resources after the generative step, in the case of both experimental^11,37 and in silico screening^33,36, as sequence validity enables testing fewer candidates and alleviates predictor pathologies^33,34,35. However, despite not being restricted by sequence length, as no brute force testing of mutations spanning large sequence spaces is required, current generative approaches also focus merely on single regulatory regions¹¹ or shorter segments¹⁷ and are rarely tested experimentally³³. As evidenced by the strong agreement between protein and mRNA levels^38,39,40,41, mRNA transcription, a major determinant of protein abundance, is controlled by the interaction of cis-regulatory patterns across the whole regulatory structure of the gene. This comprises coding and regulatory regions that include promoters, untranslated regions (UTRs) and terminators, each encoding a significant amount of information related to mRNA levels^7,8,24,27. Ultimately, to accurately control gene expression, the entire gene regulatory structure must be fine-tuned^3,7,42,43. Therefore, based on the recent achievements in modeling DNA and protein spaces^11,12,44, we hypothesize that state-of-the-art generative deep neural networks are capable of learning the entire DNA regulatory landscape directly from natural genomic sequences and transcriptomic data. By leveraging information from the whole gene regulatory structure including the coding region^3,7, de novo regulatory DNA with highly accurate target expression levels can be generated, helping to overcome the limitations of existing approaches and enabling precise and gene-specific navigation of the regulatory sequence space in potentially any organism and tissue.

In the present study, we use deep learning frameworks to demonstrate that a generative modeling approach can successfully design de novo functional regulatory DNA in Saccharomyces cerevisiae. First, we train a deep generative adversarial network (GAN) only on natural genomic sequences spanning the whole gene regulatory structure and find that the generated regulatory sequences exhibit properties highly similar to those of natural regulatory DNA. Next, using an optimization procedure that couples the generative network with a highly accurate deep predictive model^7,17 (ExpressionGAN), we add coding sequence information to the generative approach and learn to precisely navigate the regulatory sequence-expression landscape of a specific gene across almost 6 orders of magnitude of expression levels, accurately controlling the sampling of sequences with targeted expression levels. Similarly, we then train and optimize additional generators based on commonly used single regulatory region parts^15,24, demonstrating how the use of the whole gene regulatory structure can outperform single-region solutions by expanding the achievable dynamic range of expression levels. By sampling 20,000 generated regulatory sequences with high and low predicted expression levels from ExpressionGAN and measuring their sequence properties, including cis-regulatory grammar and core promoter features, we observe that the generated DNA carries known sequence determinants of gene expression control. Finally, we experimentally verify a selection of the generated sequences that retain a natural or even higher level of dissimilarity (>33%) to any currently known regulatory sequence. We find that experimentally measured mRNA expression levels recapitulate predicted ones across 3 orders of magnitude. In fact, 57% of the constructs designed to be highly expressed surpass the gene expression level of natural high-expression control sequences, demonstrating the effectiveness of the generative approach for designing functional regulatory DNA in practice.
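The coupling of a generator with a predictor can be pictured with a toy sketch. It is not the ExpressionGAN procedure: the text above only states that an optimization procedure couples the two networks, so the sketch substitutes a plain random hill-climb over latent vectors. The generator (a fixed random linear map), the predictor (a GC-fraction score on a 0 to 5 scale) and all names and parameters are hypothetical stand-ins for trained models.

```python
import random

BASES = "ACGT"
SEQ_LEN = 60
LATENT_DIM = 8

def make_toy_generator(seed=0):
    """Hypothetical stand-in for a trained GAN generator: latent vector -> DNA string."""
    rng = random.Random(seed)
    # One weight vector per (position, base); fixed after construction.
    weights = [[[rng.gauss(0, 1) for _ in range(LATENT_DIM)] for _ in BASES]
               for _ in range(SEQ_LEN)]

    def generate(z):
        seq = []
        for pos_weights in weights:
            scores = [sum(w * zi for w, zi in zip(row, z)) for row in pos_weights]
            seq.append(BASES[scores.index(max(scores))])
        return "".join(seq)

    return generate

def toy_predictor(seq):
    """Hypothetical predictor: toy 'log expression' proportional to GC fraction."""
    return 5.0 * sum(b in "GC" for b in seq) / len(seq)

def optimize_latent(generate, predictor, target, steps=300, step_size=0.3, seed=1):
    """Random hill-climb in latent space toward a target predicted expression level."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(LATENT_DIM)]
    start_err = best_err = abs(predictor(generate(z)) - target)
    for _ in range(steps):
        proposal = [zi + rng.gauss(0, step_size) for zi in z]
        err = abs(predictor(generate(proposal)) - target)
        if err <= best_err:  # keep proposals that move no further from the target
            z, best_err = proposal, err
    return generate(z), start_err, best_err

generate = make_toy_generator()
seq, start_err, final_err = optimize_latent(generate, toy_predictor, target=4.0)
print(seq, start_err, final_err)
```

The design choice worth noting is that the search moves in the generator's latent space rather than in sequence space, so every candidate is an output of the generator. With a trained generator this is what keeps candidates within the learned sequence space, in contrast to arbitrary point mutations.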

Results

Implementing a generative strategy to design regulatory DNA

Based on the knowledge that the whole gene regulatory structure is involved in controlling gene expression³, we previously demonstrated that the combined DNA sequence across all regulatory regions (Fig. 1a: promoter, 5′ UTR, 3′ UTR and terminator) is highly predictive of gene expression⁷. We also observed that gene expression of individual genes varies across the majority of biological conditions within a mere 1-fold range for >80% of yeast protein-coding genes⁷. The dynamic range of gene expression (Fig. 1b: spanning nearly 5 orders of magnitude of median TPM values across the whole range of biological conditions) is thus predictable directly from the DNA sequence, irrespective of the biological conditions (Fig. 1c: R²_test = 0.8, models tuned and tested on independent held-out datasets, see the “Methods” section). Apart from the properties of the coding region, the most relevant parts of DNA for these predictions were the respective sequences of the 4 regulatory regions totaling 1000 bp (as illustrated in Fig. 1a). Moreover, training and testing multiple deep neural networks using sequence data from different region combinations as input and median TPM values as the target (Fig. 1b) demonstrated that only the whole gene regulatory structure spans the key regulatory features important for predicting the full dynamic range of mRNA expression levels⁷….
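To make the quantities in this paragraph concrete, the sketch below shows how a per-gene target could be derived as the median TPM across conditions and how a test-set R² is computed from observed and predicted values. The log10 transform, the pseudocount and the toy numbers are assumptions for illustration and are not taken from the study's Methods.

```python
import math
from statistics import median

def median_log_tpm(tpm_by_condition, pseudocount=1.0):
    """Per-gene target: log10 of the median TPM across conditions.
    The pseudocount (an assumption here) avoids log10(0) for silent genes."""
    return math.log10(median(tpm_by_condition) + pseudocount)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Toy data: TPM values of three hypothetical genes across four conditions.
tpm = {
    "geneA": [0.5, 1.0, 0.8, 1.2],
    "geneB": [95.0, 110.0, 102.0, 99.0],
    "geneC": [9000.0, 12000.0, 10500.0, 9800.0],
}
observed = [median_log_tpm(values) for values in tpm.values()]
predicted = [0.4, 1.9, 3.8]  # made-up model outputs on the same log10 scale
print(observed, r_squared(observed, predicted))
```

An R² of 1 means the predictions match the observed values exactly, while a model that always predicts the mean of the observed values scores 0.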

Original Source Article

Jan Zrimec, Xiaozhi Fu, Azam Sheikh Muhammad, et al.

08/30/2022

Molecular Medicine Israel
